Creating an EMR cluster using the CLI
I normally start a cluster from the UI, so I decided to post how to create one from the CLI. This assumes the AWS CLI is installed and configured on your machine.
The command you want to use is aws emr create-cluster. You will have to decide which EMR release you want, which applications to include, the instance type, the number of instances, whether to use instance fleets (which I do here), and so on. Once you have chosen all that, open a text editor and put the command together. Mine looks like this:
aws emr create-cluster \
--applications Name=Hadoop Name=Hive Name=Tez \
--tags 'Project=Covid-19 Analysis' 'region=us' 'Contact=Raju Pillai' 'Name=Covid-19 Analysis' \
--ec2-attributes '{
  "KeyName":"xxxx",
  "InstanceProfile":"EMR_EC2_DefaultRole",
  "SubnetId":"subnet-XXX",
  "EmrManagedSlaveSecurityGroup":"sg-xxx",
  "EmrManagedMasterSecurityGroup":"sg-yyyy"
}' \
--release-label emr-6.1.0 \
--log-uri 's3n://raju-datalake-emr/logs/' \
--configurations '[
  {
    "Classification":"emrfs-site",
    "Properties":{
      "fs.s3.consistent.retryPeriodSeconds":"10",
      "fs.s3.consistent":"true",
      "fs.s3.consistent.retryCount":"5",
      "fs.s3.consistent.metadata.tableName":"EmrFSMetadata"
    }
  },
  {
    "Classification":"hive-site",
    "Properties":{
      "hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]' \
--instance-fleets '[
  {
    "InstanceFleetType":"MASTER",
    "TargetOnDemandCapacity":1,
    "TargetSpotCapacity":0,
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":1,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      }
    ],
    "Name":"Master - 1"
  },
  {
    "InstanceFleetType":"CORE",
    "TargetOnDemandCapacity":2,
    "TargetSpotCapacity":2,
    "LaunchSpecifications":{
      "SpotSpecification":{
        "TimeoutDurationMinutes":10,
        "TimeoutAction":"SWITCH_TO_ON_DEMAND"
      }
    },
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":4,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      }
    ],
    "Name":"Core - 2"
  },
  {
    "InstanceFleetType":"TASK",
    "TargetOnDemandCapacity":1,
    "TargetSpotCapacity":3,
    "LaunchSpecifications":{
      "SpotSpecification":{
        "TimeoutDurationMinutes":30,
        "TimeoutAction":"TERMINATE_CLUSTER"
      }
    },
    "InstanceTypeConfigs":[
      {
        "WeightedCapacity":4,
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5d.xlarge"
      },
      {
        "WeightedCapacity":4,
        "EbsConfiguration":{
          "EbsBlockDeviceConfigs":[
            {
              "VolumeSpecification":{
                "SizeInGB":32,
                "VolumeType":"gp2"
              },
              "VolumesPerInstance":2
            }
          ]
        },
        "BidPriceAsPercentageOfOnDemandPrice":100,
        "InstanceType":"m5.xlarge"
      }
    ],
    "Name":"Task - 3"
  }
]' \
--bootstrap-actions '[
  {
    "Path":"s3://raju-datalake-emr/scripts/dev/bootstrap_scripts/Covid_Sync_emrfs.sh",
    "Name":"Copy Scripts"
  }
]' \
--ebs-root-volume-size 50 \
--service-role EMR_DefaultRole \
--enable-debugging \
--name 'Covid-19-EMR6.1-AutoScale' \
--scale-down-behavior TERMINATE_AT_TASK_COMPLETION \
--region us-east-1
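
A command this long is easy to break when it is copied around: typographic quotes or dashes picked up from a web page or a word processor will make the JSON arguments unparseable. One way to guard against that is to keep each JSON argument in its own file, check that it parses, and then hand it to the CLI with the file:// prefix. The sketch below is only an illustration: the validate_json helper, the temporary directory and the file name are my own placeholders, and it assumes python3 is available on the machine.

```shell
#!/usr/bin/env bash
# Sketch: keep a JSON argument in a file and sanity-check it before
# calling the AWS CLI. validate_json, workdir and the file name are
# hypothetical helpers, not part of the AWS CLI.

# Return 0 if the file parses as JSON, non-zero otherwise.
# Assumes python3 is installed.
validate_json() {
    python3 -m json.tool "$1" > /dev/null 2>&1
}

workdir="$(mktemp -d)"

# The same hive-site classification used in the command above.
cat > "$workdir/configurations.json" <<'EOF'
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
EOF

if validate_json "$workdir/configurations.json"; then
    echo "configurations.json: OK"
    # Once validated, the file could be passed to the CLI like this:
    #   aws emr create-cluster ... --configurations "file://$workdir/configurations.json"
else
    echo "configurations.json: invalid JSON (check for curly quotes)" >&2
fi
```

The same pattern should work for the --instance-fleets argument, which also accepts a file:// path, and it keeps the command itself down to a handful of short lines.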