Run a Simple MapReduce Job

Let us run a simple MapReduce job and see what happens. We will use the Hadoop examples that come as part of the setup process itself.

  • We can use hadoop jar or yarn jar to submit a MapReduce job as a YARN application. Let us run an application called randomtextwriter, which generates 10 GB of data per node by default.
  • This job will take some time to run.

https://gist.github.com/dgadiraju/f2852840916b1e79f4fb6830d93c8b22
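
Before the job finishes, we can estimate how much data to expect. The sketch below is only back-of-the-envelope arithmetic in Python, not part of Hadoop. It assumes the 10 GB per node comes from 10 map tasks per node writing 1 GB each, and it assumes a three-node cluster; the function name and parameters are made up for illustration, so adjust them to match your cluster.

```python
GIB = 1024 ** 3  # bytes in one gigabyte (binary)


def expected_output_bytes(num_nodes, maps_per_node=10, bytes_per_map=GIB):
    """Estimate the total bytes written by a map-only generator job.

    maps_per_node and bytes_per_map are assumed defaults, chosen so that
    each node produces 10 GB as described above.
    """
    return num_nodes * maps_per_node * bytes_per_map


# Assumption: a cluster with 3 worker nodes.
nodes = 3
total = expected_output_bytes(nodes)
print(total // GIB, "GB across", nodes * 10, "map tasks")
```

If the size reported by HDFS is far from this estimate, it is worth checking how many nodes actually ran map tasks.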

  • Typically data will be processed using map tasks and reduce tasks.
    • Map tasks read the data and perform row-level transformations.
    • Reduce tasks read the output of map tasks and perform transformations such as joins and aggregations.
    • The shuffle process between map tasks and reduce tasks takes care of grouping and partitioning the data based on keys.
    • As administrators, we do not have to get into too many details at this time.
    • This particular application, randomtextwriter, is a map-only job that tries to create 10 GB of data per data node. In our case, we will see 30 GB of data.
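
To make the map, shuffle and reduce steps above a little more concrete, here is a minimal in-memory sketch in plain Python. It does not use Hadoop at all; the function names (map_task, shuffle, reduce_task, run_job) are made up for illustration, and word count is chosen only because it is the usual toy example.

```python
from collections import defaultdict


def map_task(line):
    # Row-level transformation: emit a (word, 1) pair for each word in a line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, which is what shuffling does between
    # the map tasks and the reduce tasks.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_task(key, values):
    # Aggregation: sum the counts collected for one key.
    return key, sum(values)


def run_job(lines):
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    # Process keys in sorted order so the result is deterministic.
    return dict(reduce_task(k, v) for k, v in sorted(grouped.items()))


lines = ["hadoop yarn hdfs", "yarn hadoop", "hadoop"]
print(run_job(lines))  # {'hadoop': 3, 'hdfs': 1, 'yarn': 2}
```

A map-only job such as randomtextwriter would stop after the map_task step and write its output directly, with no shuffle or reduce.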

Exercise: Run relevant Hadoop fs commands to get the size of data that is created by randomtextwriter.

  • Map tasks and reduce tasks run in YARN containers on the NodeManagers.
  • The life cycle of the job is managed by a per-job ApplicationMaster.
  • Typically, MapReduce jobs read data from HDFS, process it, and save it back to HDFS.
  • This example job does not read any data from HDFS; it just generates random text and writes it to HDFS.
  • We can keep track of running jobs, as well as troubleshoot completed jobs, using the ResourceManager UI.
