Thursday, October 3, 2013

How to delete _$folder$ files from AWS S3 directories?

The _$folder$ files get created in S3 directory structures by tools that interact with the S3 file system (such as S3Fox).
The files are visible only in the AWS S3 console or with s3cmd from the CLI.
They cause no harm to the file system, but if you want to delete them you can choose either of the following ways.
1. Delete it from the AWS S3 console.
2. From the CLI with s3cmd (the backslashes keep the shell from expanding the $ characters):
 s3cmd del s3://<s3_bucket_name>/<dir_name>/_\$folder\$
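
To confirm the marker objects are there before deleting anything, a minimal check like the one below works (assuming s3cmd is already configured):

 s3cmd ls s3://<s3_bucket_name>/<dir_name>/ | grep '_\$folder\$'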

I deleted the marker files recursively from S3 directories with the following script:

 # List the first-level directories (the cut field number depends on the
 # hadoop fs -ls output format, so adjust -f17 if your version differs)
 dir_list=`hadoop fs -ls s3://<s3_bucket_name>/<dir_name>/*/ | cut -d' ' -f17`
 for dir in $dir_list
 do
    # Pick the _$folder$ marker files out of each directory listing
    file_list=`s3cmd ls s3://<s3_bucket_name>${dir}/* | grep folder | cut -d' ' -f14`
    for file in $file_list
    do
       # Quote the path so the shell does not expand the $ characters
       s3cmd del "${file}"
    done
 done
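
If your s3cmd supports recursive listing, the same cleanup can be done in a single pass. This is a minimal sketch, assuming every marker key ends in _$folder$ and that s3cmd ls prints the key URI in the fourth column:

 s3cmd ls --recursive s3://<s3_bucket_name>/<dir_name>/ \
   | awk '{print $4}' \
   | grep '_\$folder\$$' \
   | while read -r key
     do
        s3cmd del "$key"
     done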

Tuesday, October 1, 2013

How to create an AWS job flow for HBase from the CLI?

1. Download the Amazon Elastic MapReduce CLI from the location below
wget http://elasticmapreduce.s3.amazonaws.com/elastic-mapreduce-ruby.zip
2. Unzip it
unzip elastic-mapreduce-ruby.zip
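Before writing the job flow script, it is worth checking that the client runs at all. A quick sanity check, assuming Ruby is on your PATH and you are in the directory where the archive was unzipped:
 ruby elastic-mapreduce --version
(--help lists all supported options.)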
3. Create a shell script (create_jf.sh) with the following code
 ruby elastic-mapreduce \
 -v \
 --create \
 --alive \
 --region "us-east-1" \
 --access-id <your_access_id> \
 --private-key <your_private_key> \
 --key-pair <your_key_pair> \
 --ami-version latest \
 --visible-to-all-users \
 --hbase \
 --name "HBASE from CLI" \
 --instance-group MASTER \
 --instance-count 1 \
 --instance-type m1.large \
 --instance-group CORE \
 --instance-count 1 \
 --instance-type m1.large \
 --pig-interactive \
 --pig-versions latest \
 --hive-interactive \
 --hive-versions latest \
 --bootstrap-action "s3://elasticmapreduce/bootstrap-actions/configure-hadoop" \
 --args "-m,mapred.tasktracker.map.tasks.maximum=6,-m,mapred.tasktracker.reduce.tasks.maximum=2"
4. Create the job flow
bash create_jf.sh

5. Monitor the job flow from the AWS console
Copy the job flow Id printed after a successful run of create_jf.sh and search for it in the AWS EMR console.
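
The same CLI can also report on the job flow. A hedged sketch, where <job_flow_id> stands for the Id printed by create_jf.sh and the credential flags are the ones used above:

 # Show job flows and their current state
 ruby elastic-mapreduce --list --access-id <your_access_id> --private-key <your_private_key>
 # --alive keeps the cluster running, so terminate it when you are done
 ruby elastic-mapreduce --jobflow <job_flow_id> --terminate --access-id <your_access_id> --private-key <your_private_key>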