Prerequisites for setting up an EMR cluster and setting up datasets

Prerequisites before setting up an EMR cluster

To set up an EMR cluster, the prerequisites are:

  • Sign up for AWS
  • Create an Amazon S3 bucket
  • Create an Amazon EC2 key pair

In an enterprise, we will typically be given a root account, which should not be used to set up the cluster. Instead, we will create an IAM user and assign it the policies required to set up the cluster.

Steps to create EMR cluster

  • Create an IAM group for EMR
    • To create a group, go to IAM –> Groups –> Create New Group –> enter a group name (e.g., itversityemr).
    • Attach the AmazonElasticMapReduceFullAccess policy to the group.
    • Validate the group permissions and add a user to that group.
  • Log in to the console as the IAM user
    • We have to create the Amazon S3 buckets and the Amazon EC2 key pair.
    • Go to S3; the S3 Management Console opens.
    • As the IAM user, create two buckets named itversityemrsource and itversityemrtarget.
    • To create the Amazon EC2 key pair, click EC2 –> Key Pairs –> Create Key Pair.

Note: Unless the AmazonEC2FullAccess policy is attached to the group, the IAM user won’t be able to create the key pair. (E.g., attach the AmazonEC2FullAccess policy to the itversityemr group.)
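
The group setup above can also be scripted instead of clicked through in the console. The following is a minimal sketch, assuming the AWS CLI is installed and configured with credentials that are allowed to manage IAM. The user name emruser is hypothetical, and the policy ARNs are assumed from the policy names mentioned above (AWS may offer newer replacements for them). By default the script only prints the commands; nothing is sent to AWS unless dry_run is turned off.

```python
import subprocess

# Managed policy ARNs, assumed from the policy names used in this walkthrough.
POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess",
    "arn:aws:iam::aws:policy/AmazonEC2FullAccess",
]


def iam_group_commands(group, user):
    """Build the AWS CLI calls: create the group, attach policies, add the user."""
    cmds = [["aws", "iam", "create-group", "--group-name", group]]
    for arn in POLICY_ARNS:
        cmds.append(["aws", "iam", "attach-group-policy",
                     "--group-name", group, "--policy-arn", arn])
    # The IAM user must already exist before it can be added to the group.
    cmds.append(["aws", "iam", "add-user-to-group",
                 "--group-name", group, "--user-name", user])
    return cmds


def run_all(cmds, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in cmds:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "emruser" is a hypothetical IAM user name.
    run_all(iam_group_commands("itversityemr", "emruser"))
```

Attaching AmazonEC2FullAccess in the same pass avoids the key pair error described in the note.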

  • Place the downloaded key pair file in the .aws directory under the Cygwin home directory (Cygwin –> Home –> Username –> .aws).
  • Verify that the key pair is in place using the command ‘ls -ltr .aws’, since we will be using this key pair to create the EMR cluster.
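
The same check can be done from a script, for example before launching the cluster automatically. This is a small sketch, assuming the key pair was saved with a .pem extension (the key file name and extension are not stated above) and that the .aws directory sits under the home directory. It returns the key files oldest first, which is the order ‘ls -ltr’ lists them in.

```python
from pathlib import Path


def list_key_pairs(aws_dir=None):
    """Return the .pem files in aws_dir, oldest first (same order as `ls -ltr`)."""
    aws_dir = Path(aws_dir) if aws_dir else Path.home() / ".aws"
    if not aws_dir.is_dir():
        return []
    pems = [p for p in aws_dir.iterdir() if p.suffix == ".pem"]
    return sorted(pems, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    keys = list_key_pairs()
    if not keys:
        print("No .pem key pair found in ~/.aws")
    for key in keys:
        print(key.name)
```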

Setup Data Sets

In this video, we will:

  • Download the datasets from the GitHub account, typically under the name data_master.
  • The download contains multiple datasets, but we will be uploading only “retail_db” as part of this topic.
  • Go to the S3 bucket and upload the retail_db directory into the itversityemrsource bucket.
  • This directory contains folders such as categories, customers, departments, etc.
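
For repeated uploads, the console steps above can be replaced by the AWS CLI. The sketch below assumes the AWS CLI is installed and configured for the IAM user, and that the download was extracted to data_master/retail_db (a hypothetical local path). It only prints the commands unless dry_run is turned off.

```python
import subprocess
from pathlib import Path

SOURCE_BUCKET = "itversityemrsource"


def make_bucket_command(bucket):
    """AWS CLI call that creates a bucket (skip if it was created in the console)."""
    return ["aws", "s3", "mb", f"s3://{bucket}"]


def upload_dir_command(local_dir, bucket):
    """AWS CLI call that copies a local directory into the bucket under its own name."""
    name = Path(local_dir).name
    return ["aws", "s3", "cp", str(local_dir),
            f"s3://{bucket}/{name}/", "--recursive"]


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # "data_master/retail_db" is an assumed location of the extracted dataset.
    run(upload_dir_command("data_master/retail_db", SOURCE_BUCKET))
```

After the upload, folders such as categories, customers and departments should appear under retail_db/ in the itversityemrsource bucket.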