Introduction to EMR and concepts

Introduction to EMR

What is AWS EMR?
Amazon EMR is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).

In this lesson, we will cover:

  • The prerequisites for setting up an EMR cluster.
  • How to set up a cluster, use it to process data, and terminate it.
  • How to manage the lifecycle of an EMR cluster: creating the cluster, processing the data, and terminating the cluster.
  • The operations of an EMR cluster.
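The create, process, and terminate lifecycle listed above can be sketched as a single request to the EMR API. This is a minimal sketch, assuming the boto3 SDK and its `run_job_flow` call. The cluster name, release label, instance type, S3 paths, region, and the helper names `build_cluster_request` and `launch_cluster` are hypothetical, and the default EMR roles are assumed to exist in the account already.

```python
def build_cluster_request(name, script_uri, log_uri, node_count=3):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",  # assumption: pick a release available to you
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumption: any supported type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": node_count,
            # False means the cluster shuts down once all steps have finished,
            # which covers the "terminate" part of the lifecycle.
            "KeepJobFlowAliveWhenNoSteps": False,
            "TerminationProtected": False,
        },
        "Steps": [
            {
                "Name": "process-data",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", script_uri],
                },
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default roles
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # third-party, imported lazily so the sketch loads without it

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


# Hypothetical bucket and script names, for illustration only.
request = build_cluster_request(
    name="nightly-batch",
    script_uri="s3://example-bucket/jobs/process.py",
    log_uri="s3://example-bucket/logs/",
)
```

Calling `launch_cluster(request)` would start a billable cluster, so the sketch only builds the request. A cluster that is kept alive instead can be stopped later with the client's `terminate_job_flows(JobFlowIds=[cluster_id])` call.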

EMR concepts


EMR stands for Elastic MapReduce. MapReduce refers to processing data at scale, and Elastic refers to keeping costs down by leveraging the cloud's pay-as-you-go model.

In an enterprise, data often has to be processed only once or twice a day rather than continuously. In those scenarios, keeping a cluster running permanently is not cost-effective. EMR adds value here by letting you run a cluster only while it is needed, so you pay only for the time it is used. In the AWS console, EMR is listed under the Analytics category.
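The cost argument can be made concrete with some rough arithmetic. This is a minimal sketch under stated assumptions: the node count, the hourly rate, and the one-hour run length are hypothetical figures chosen for illustration, not AWS prices, and the calculation ignores cluster startup time and storage costs.

```python
def daily_cost(nodes, hourly_rate_per_node, hours_per_day):
    """Compute the daily cost of a cluster billed per node-hour."""
    return nodes * hourly_rate_per_node * hours_per_day


NODES = 10            # hypothetical cluster size
RATE = 0.50           # hypothetical price in USD per node-hour
RUNS_PER_DAY = 2      # "once or twice a day" from the text
HOURS_PER_RUN = 1     # hypothetical job duration

always_on = daily_cost(NODES, RATE, 24)
transient = daily_cost(NODES, RATE, RUNS_PER_DAY * HOURS_PER_RUN)
savings = always_on - transient

print(f"always-on: ${always_on:.2f}/day")   # always-on: $120.00/day
print(f"transient: ${transient:.2f}/day")   # transient: $10.00/day
print(f"saved:     ${savings:.2f}/day")     # saved:     $110.00/day
```

With these made-up numbers, the cluster that exists only during the two daily runs costs a twelfth of the permanent one. The real ratio depends on how long the jobs actually take.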

  • To get started with EMR, click EMR in the AWS console to open the EMR dashboard.
  • From the dashboard, you can access the documentation and the management guide for creating clusters.
  • To see the technologies available in EMR, go to Amazon EMR -> Create Cluster -> Advanced Options. The list includes Hadoop, Tez, Sqoop, Pig, Hive, ZooKeeper, Spark, and others.
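The application list shown in the console can also be read programmatically. This is a minimal sketch, assuming boto3's EMR `describe_release_label` call, which returns an `Applications` list of name and version pairs. The helper names, the region, and the sample response (including its placeholder release label and version strings) are hypothetical and are not real EMR data.

```python
def format_applications(response):
    """Return sorted 'Name Version' strings from a DescribeReleaseLabel-style response."""
    apps = response.get("Applications", [])
    return sorted(f"{app['Name']} {app['Version']}" for app in apps)


def fetch_release_applications(release_label, region="us-east-1"):
    """Call the EMR API; needs boto3 and AWS credentials (not run here)."""
    import boto3  # third-party, imported lazily so the sketch loads without it

    emr = boto3.client("emr", region_name=region)
    return emr.describe_release_label(ReleaseLabel=release_label)


# Placeholder response shaped like the API output; the label and versions are made up.
sample_response = {
    "ReleaseLabel": "emr-x.y.z",
    "Applications": [
        {"Name": "Spark", "Version": "0.0.0"},
        {"Name": "Hadoop", "Version": "0.0.0"},
        {"Name": "Hive", "Version": "0.0.0"},
    ],
}

for line in format_applications(sample_response):
    print(line)
```

Against a real account, `format_applications(fetch_release_applications(label))` would list what a given release ships. The set of applications varies between releases, and long lists may be split across several responses.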