Creating an EMR cluster using CLI

I normally start a cluster from the UI, so I decided to post how to create one from the CLI. This assumes the AWS CLI is installed and configured on your machine.

The command you want is aws emr create-cluster. You will have to decide which EMR release you want, which applications to include, the instance type, the number of instances, whether you are using instance fleets (which I am here), and so on. Once you have chosen all that, open a text editor and build the command. Mine looks like this:

aws emr create-cluster \
  --applications Name=Hadoop Name=Hive Name=Tez \
  --tags 'Project=Covid-19 Analysis' 'region=us' 'Contact=Raju Pillai' 'Name=Covid-19 Analysis' \
  --ec2-attributes '{
    "KeyName":"xxxx",
    "InstanceProfile":"EMR_EC2_DefaultRole",
    "SubnetId":"subnet-XXX",
    "EmrManagedSlaveSecurityGroup":"sg-xxx",
    "EmrManagedMasterSecurityGroup":"sg-yyyy"
  }' \
  --release-label emr-6.1.0 \
  --log-uri 's3n://raju-datalake-emr/logs/' \
  --configurations '[
    {
      "Classification":"emrfs-site",
      "Properties":{
        "fs.s3.consistent.retryPeriodSeconds":"10",
        "fs.s3.consistent":"true",
        "fs.s3.consistent.retryCount":"5",
        "fs.s3.consistent.metadata.tableName":"EmrFSMetadata"
      }
    },
    {
      "Classification":"hive-site",
      "Properties":{
        "hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }
  ]' \
  --instance-fleets '[
    {
      "InstanceFleetType":"MASTER",
      "TargetOnDemandCapacity":1,
      "TargetSpotCapacity":0,
      "InstanceTypeConfigs":[
        {
          "WeightedCapacity":1,
          "BidPriceAsPercentageOfOnDemandPrice":100,
          "InstanceType":"m5d.xlarge"
        }
      ],
      "Name":"Master - 1"
    },
    {
      "InstanceFleetType":"CORE",
      "TargetOnDemandCapacity":2,
      "TargetSpotCapacity":2,
      "LaunchSpecifications":{
        "SpotSpecification":{
          "TimeoutDurationMinutes":10,
          "TimeoutAction":"SWITCH_TO_ON_DEMAND"
        }
      },
      "InstanceTypeConfigs":[
        {
          "WeightedCapacity":4,
          "BidPriceAsPercentageOfOnDemandPrice":100,
          "InstanceType":"m5d.xlarge"
        }
      ],
      "Name":"Core - 2"
    },
    {
      "InstanceFleetType":"TASK",
      "TargetOnDemandCapacity":1,
      "TargetSpotCapacity":3,
      "LaunchSpecifications":{
        "SpotSpecification":{
          "TimeoutDurationMinutes":30,
          "TimeoutAction":"TERMINATE_CLUSTER"
        }
      },
      "InstanceTypeConfigs":[
        {
          "WeightedCapacity":4,
          "BidPriceAsPercentageOfOnDemandPrice":100,
          "InstanceType":"m5d.xlarge"
        },
        {
          "WeightedCapacity":4,
          "EbsConfiguration":{
            "EbsBlockDeviceConfigs":[
              {
                "VolumeSpecification":{
                  "SizeInGB":32,
                  "VolumeType":"gp2"
                },
                "VolumesPerInstance":2
              }
            ]
          },
          "BidPriceAsPercentageOfOnDemandPrice":100,
          "InstanceType":"m5.xlarge"
        }
      ],
      "Name":"Task - 3"
    }
  ]' \
  --bootstrap-actions '[
    {
      "Path":"s3://raju-datalake-emr/scripts/dev/bootstrap_scripts/Covid_Sync_emrfs.sh",
      "Name":"Copy Scripts"
    }
  ]' \
  --ebs-root-volume-size 50 \
  --service-role EMR_DefaultRole \
  --enable-debugging \
  --name 'Covid-19-EMR6.1-AutoScale' \
  --scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
  --region us-east-1
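
When create-cluster succeeds it prints the ID of the new cluster. Here is a minimal sketch for checking on the cluster afterwards, assuming the AWS CLI is configured as above. The function names and the j-XXXXXXXX ID are placeholders of mine, not part of the command above.

```shell
# Print the current state of an EMR cluster (for example STARTING or WAITING).
# $1 = cluster ID printed by create-cluster, $2 = region (defaults to us-east-1).
emr_cluster_state() {
  aws emr describe-cluster \
    --cluster-id "$1" \
    --region "${2:-us-east-1}" \
    --query 'Cluster.Status.State' \
    --output text
}

# Block until the cluster is running, then print its state.
emr_wait_until_running() {
  aws emr wait cluster-running --cluster-id "$1" --region "${2:-us-east-1}" &&
    emr_cluster_state "$1" "$2"
}
```

Usage would be something like emr_wait_until_running j-XXXXXXXX, with the real cluster ID substituted.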

Copy S3 objects across AWS Accounts

This post shows how to copy objects between S3 buckets that belong to different AWS accounts. It's not an easy drag and drop, and I'm not sure why Amazon doesn't provide an easy "SFTP"-like feature. Here are the steps:

Prerequisites

  1. You need access to both AWS accounts.
  2. You need an IAM user in the destination account.
  3. You need the AWS account number of the destination account.
  4. You need the AWS CLI configured on your machine with the credentials of the IAM user from step 2.

Get AWS Account number

  1. Log in to the destination AWS account
  2. Go to the My Account page and copy the Account ID

Set S3 policy on source account

  1. Log in to the source AWS account
  2. Go to the S3 bucket
  3. Add the following policy to the bucket

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::DESTINATION_BUCKET_ACCOUNT_NUMBER:root"
      },
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::SOURCE_BUCKET_NAME/*",
        "arn:aws:s3:::SOURCE_BUCKET_NAME"
      ]
    }
  ]
}

Replace DESTINATION_BUCKET_ACCOUNT_NUMBER with the account ID that you copied earlier. Replace the SOURCE_BUCKET_NAME with the actual bucket name.
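
If you would rather not edit the JSON by hand, the substitution can be scripted. This is only a sketch: the function name, the example account ID, and the example bucket name are made up for illustration.

```shell
# Print the source-bucket policy with both placeholders filled in.
# $1 = destination account ID, $2 = source bucket name.
render_source_bucket_policy() {
  account_id="$1"
  bucket="$2"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DelegateS3Access",
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::${account_id}:root" },
      "Action": ["s3:ListBucket", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::${bucket}/*",
        "arn:aws:s3:::${bucket}"
      ]
    }
  ]
}
EOF
}
```

For example, render_source_bucket_policy 123456789012 my-source-bucket > policy.json writes the filled-in policy to a file. With source-account credentials, that file could then be applied with aws s3api put-bucket-policy --bucket my-source-bucket --policy file://policy.json instead of pasting it into the console.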

Attach policy on the destination account

  1. Log in to the destination AWS account
  2. Go to My Security Credentials
  3. Select Policies
  4. Add the following as a new policy for the IAM user

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::SOURCE_BUCKET_NAME",
        "arn:aws:s3:::SOURCE_BUCKET_NAME/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:PutObject",
        "s3:PutObjectAcl"
      ],
      "Resource": [
        "arn:aws:s3:::DESTINATION_BUCKET_NAME",
        "arn:aws:s3:::DESTINATION_BUCKET_NAME/*"
      ]
    }
  ]
}

Replace DESTINATION_BUCKET_NAME with the actual bucket name of the destination. Replace the SOURCE_BUCKET_NAME with the actual source bucket name.

Sync the S3 from AWS CLI

Using the AWS CLI on your computer, issue the following command after replacing the bucket and region placeholders with the actual names.
It's important to use the credentials of the IAM user in the destination AWS account.

aws s3 sync s3://SOURCE-BUCKET-NAME s3://DESTINATION-BUCKET-NAME --source-region SOURCE-REGION-NAME --region DESTINATION-REGION-NAME

This syncs the objects from the source bucket to the destination bucket. As usual, use due diligence before running this against your production system.
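
One low-risk way to exercise that diligence is the --dryrun flag of aws s3 sync, which lists the operations it would perform without copying anything. A small sketch, with a wrapper name I made up:

```shell
# Preview the cross-account sync without copying any objects.
# Arguments: source bucket, destination bucket, source region, destination region.
preview_bucket_sync() {
  aws s3 sync "s3://$1" "s3://$2" \
    --source-region "$3" \
    --region "$4" \
    --dryrun
}
```

Calling preview_bucket_sync SOURCE-BUCKET-NAME DESTINATION-BUCKET-NAME SOURCE-REGION-NAME DESTINATION-REGION-NAME with real values prints the planned copies. If the list looks right, run the sync command again without --dryrun.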

Setting up JAVA_HOME on a mac

Installing Java and setting up JAVA_HOME took me a bit of research, so I thought I would post it here for the next time I, or anyone else, wonder how to do it.

Run /usr/libexec/java_home -V to list the installed JDKs. The output depends on which JDKs are available on your computer. On my Mac it looks like this:

/usr/libexec/java_home -V
Matching Java Virtual Machines (1):
    1.8.0_152, x86_64: "Java SE 8" /Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home
/Library/Java/JavaVirtualMachines/jdk1.8.0_152.jdk/Contents/Home

If you have multiple JDKs, it lists all of them. From that list, pick the version you want as the default JDK. For example, I will choose 1.8.0_152. To set it, run:

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8.0_152)

If the major version is unique among the installed JDKs, you can use just the major version:

export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

After setting JAVA_HOME, run java -version and you will see that JDK 1.8 is now the default JDK on your computer:

java version "1.8.0_152"
Java(TM) SE Runtime Environment (build 1.8.0_152-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.152-b16, mixed mode)

This change is only active in the current shell. If you close the shell, you will need to set it again the next time you open one. To make the change permanent, set it in your shell init file. For example, if you are using bash, add the following lines at the end of ~/.bash_profile:

# Setting default JDK to version 1.8.
export JAVA_HOME=$(/usr/libexec/java_home -v 1.8)

To activate this configuration right away, run source ~/.bash_profile. This command reads and executes .bash_profile in the current shell.
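
If the requested JDK is not installed, a plain export in the init file can leave JAVA_HOME empty or pointing somewhere unexpected. A guarded variant is sketched below. The function name and the JAVA_HOME_TOOL override are my own additions, and the override exists only so the function can be tried on a machine without /usr/libexec/java_home.

```shell
# Set JAVA_HOME only if the java_home tool prints a directory that exists.
# $1 = JDK version to ask for, e.g. 1.8.
set_java_home() {
  tool="${JAVA_HOME_TOOL:-/usr/libexec/java_home}"
  jdk_path=""
  if [ -x "$tool" ]; then
    # Ignore the tool's error text; an empty result is handled below.
    jdk_path="$("$tool" -v "$1" 2>/dev/null)" || jdk_path=""
  fi
  if [ -n "$jdk_path" ] && [ -d "$jdk_path" ]; then
    export JAVA_HOME="$jdk_path"
  else
    echo "Could not resolve a JDK for version $1" >&2
    return 1
  fi
}
```

In .bash_profile you would define the function and then call set_java_home 1.8. If no usable path comes back, JAVA_HOME is left as it was and a message goes to stderr.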

Creating a new file system in Linux

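Here is how you would create a new file system:

  1. First create a new partition using fdisk, e.g.:
    fdisk /dev/sdb     --> Options (m, n, p, 1, Enter to accept the defaults, t, 8e, w)
    Here m shows the help, n creates a new partition, p makes it primary, 1 is the partition number, t changes the partition type, 8e is the Linux LVM type, and w writes the changes.
  2. Create a volume group
    vgcreate myvg /dev/sdb1
  3. Create a logical volume
    lvcreate -L 512G -n my_lv myvg
  4. Create an ext4 file system on the logical volume
    mkfs.ext4 /dev/myvg/my_lv
  5. Optionally, add the file system to /etc/fstab so it is mounted automatically
  6. Mount the file system
    mount /myfilesystem
    The mount point must already exist, and this short form relies on the /etc/fstab entry from step 5. Without that entry, use mount /dev/myvg/my_lv /myfilesystem.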
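
Step 5 does not show what the /etc/fstab entry looks like. A minimal sketch that prints one for review is below. The function name is made up, and the defaults, 0 and 2 fields are common choices for a non-root ext4 file system rather than anything required by the steps above.

```shell
# Print an /etc/fstab line for a device and mount point.
# Fields: device, mount point, file system type, mount options, dump flag, fsck pass.
# $1 = device, $2 = mount point, $3 = file system type (defaults to ext4).
fstab_entry() {
  printf '%s  %s  %s  defaults  0  2\n' "$1" "$2" "${3:-ext4}"
}
```

Running fstab_entry /dev/myvg/my_lv /myfilesystem prints /dev/myvg/my_lv  /myfilesystem  ext4  defaults  0  2. After checking the line, you can append it to /etc/fstab as root and create the mount point with mkdir -p /myfilesystem before mounting.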