Creating an EMR cluster using the AWS CLI

I normally start a cluster from the UI, so I decided to post how to create one from the CLI instead. This assumes the AWS CLI is installed and configured on your machine.
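
If you are not sure whether the CLI is ready, a quick check along these lines can help. This is only a sketch: check_aws_cli is a hypothetical helper name of my own, and it assumes that a successful aws sts get-caller-identity call is a good enough sign that credentials are configured.

```shell
# Sketch: confirm the AWS CLI is on PATH and has working credentials.
# check_aws_cli is a hypothetical helper name, not part of the AWS CLI.
check_aws_cli() {
  if ! command -v aws >/dev/null 2>&1; then
    echo "aws CLI not found on PATH" >&2
    return 1
  fi
  # sts get-caller-identity fails when no valid credentials are available
  if ! aws sts get-caller-identity --query Account --output text >/dev/null 2>&1; then
    echo "aws CLI is installed but credentials are missing or not valid" >&2
    return 2
  fi
  echo "aws CLI is installed and configured"
}
```

Running check_aws_cli before building the cluster command gives an early failure instead of an error halfway through a long create-cluster call.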

The command to use is aws emr create-cluster. Before writing it, you need to decide which EMR release you want, which applications to include, the instance types and the number of instances, whether to use instance fleets (which I do here), and so on. Once you have chosen all that, open a text editor and build the command. Mine looks like this:

aws emr create-cluster \
--applications Name=Hadoop Name=Hive Name=Tez \
--tags 'Project=Covid-19 Analysis' 'region=us' 'Contact=Raju Pillai' 'Name=Covid-19 Analysis' \
--ec2-attributes '{
  "KeyName":"xxxx",
  "InstanceProfile":"EMR_EC2_DefaultRole",
  "SubnetId":"subnet-XXX",
  "EmrManagedSlaveSecurityGroup":"sg-xxx",
  "EmrManagedMasterSecurityGroup":"sg-yyyy"
}' \
--release-label emr-6.1.0 \
--log-uri 's3n://raju-datalake-emr/logs/' \
--configurations '[
  {
    "Classification":"emrfs-site",
    "Properties":{
      "fs.s3.consistent.retryPeriodSeconds":"10",
      "fs.s3.consistent":"true",
      "fs.s3.consistent.retryCount":"5",
      "fs.s3.consistent.metadata.tableName":"EmrFSMetadata"
    }
  },
  {
    "Classification":"hive-site",
    "Properties":{
      "hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]' \
--instance-fleets '[
  {
    "InstanceFleetType":"MASTER",
    "TargetOnDemandCapacity":1,
    "TargetSpotCapacity":0,
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":1,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      }
    ],
    "Name":"Master - 1"
  },
  {
    "InstanceFleetType":"CORE",
    "TargetOnDemandCapacity":2,
    "TargetSpotCapacity":2,
    "LaunchSpecifications":{
      "SpotSpecification":{
        "TimeoutDurationMinutes":10,
        "TimeoutAction":"SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":4,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      }
    ],
    "Name":"Core - 2"
  },
  {
    "InstanceFleetType":"TASK",
    "TargetOnDemandCapacity":1,
    "TargetSpotCapacity":3,
    "LaunchSpecifications":{
      "SpotSpecification":{
        "TimeoutDurationMinutes":30,
        "TimeoutAction":"TERMINATE_CLUSTER"
      }
    },
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":4,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      },
      {
        "WeightedCapacity":4,
        "EbsConfiguration":{
          "EbsBlockDeviceConfigs":[
            {
              "VolumeSpecification":{
                "SizeInGB":32,
                "VolumeType":"gp2"
              },
              "VolumesPerInstance":2
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5.xlarge"
      }
    ],
    "Name":"Task - 3"
  }
]' \
--bootstrap-actions '[
  {
    "Path":"s3://raju-datalake-emr/scripts/dev/bootstrap_scripts/Covid_Sync_emrfs.sh",
    "Name":"Copy Scripts"
  }
]' \
--ebs-root-volume-size 50 \
--service-role EMR_DefaultRole \
--enable-debugging \
--name 'Covid-19-EMR6.1-AutoScale' \
--region us-east-1 \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION
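
When the command succeeds it prints a ClusterId, and the cluster then takes a few minutes to come up. One way to follow that from the CLI is sketched below. wait_for_cluster is a hypothetical helper name, the cluster ID in the usage line is a placeholder, and the region default simply mirrors the command above.

```shell
# Sketch: wait for a cluster to come up, then print its current state.
# wait_for_cluster is a hypothetical helper, not part of the AWS CLI.
wait_for_cluster() {
  cluster_id="$1"
  region="${2:-us-east-1}"   # assumed default: same region as create-cluster
  # Polls describe-cluster until the cluster is up; non-zero on failure or timeout
  aws emr wait cluster-running --cluster-id "$cluster_id" --region "$region" || return 1
  # Print just the state field of the cluster description
  aws emr describe-cluster --cluster-id "$cluster_id" --region "$region" \
    --query 'Cluster.Status.State' --output text
}

# Usage (placeholder cluster ID):
#   wait_for_cluster j-XXXXXXXXXXXXX
```

The helper returns 1 if the wait fails, which makes it easy to chain with && before submitting any work to the cluster.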

About rpillai

I am a technology enthusiast and love working with databases and other technology. I learn new things every day and don't think the path ever ends ...