Configuration files and important properties – Running Jobs

Now that we have changed the properties related to NodeManager capacity, let us run randomtextwriter again and see how long it takes.

We can override individual properties at run time using -D, and several properties at once using -conf with an XML file structured like yarn-site.xml or mapred-site.xml.

https://gist.github.com/dgadiraju/f2852840916b1e79f4fb6830d93c8b22
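As a rough sketch of the two override styles (the gist above holds the commands actually used), the snippet below wraps each one in a small shell function. The jar location, the temporary file name, the function names, and the property values are assumptions chosen for illustration, not the values from the original run.

```shell
# Hypothetical jar path; adjust it to match your installation.
EXAMPLES_JAR="${HADOOP_HOME:-/opt/hadoop}/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"

# Style 1: override a single property with -D.
# Generic options such as -D must come before the program's own arguments.
run_with_d() {
  hadoop jar "$EXAMPLES_JAR" randomtextwriter \
    -Dmapreduce.randomtextwriter.totalbytes=1073741824 \
    "$1"
}

# Style 2: put several overrides in an XML file shaped like mapred-site.xml
# and pass it with -conf. The values below are illustrative only.
CONF_FILE="${TMPDIR:-/tmp}/job-overrides.xml"
cat > "$CONF_FILE" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.randomtextwriter.totalbytes</name>
    <value>1073741824</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>2048</value>
  </property>
</configuration>
EOF

run_with_conf() {
  hadoop jar "$EXAMPLES_JAR" randomtextwriter -conf "$CONF_FILE" "$1"
}
```

On a cluster, something like `run_with_d /user/training/randomtext_d` (a hypothetical HDFS output directory) would submit the job with the single override, and `run_with_conf` would do the same with every property in the XML file.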

Now let us run the word count program from the Hadoop examples and observe the change in the number of map tasks as well as reduce tasks.

  • Let us run the word count program with a different number of mappers and reducers by overriding mapreduce.input.fileinputformat.split.minsize and mapreduce.job.reduces. We are performing word count on 30 GB of data (30 files of 1 GB each).

https://gist.github.com/dgadiraju/0d3df07693e78d07164af0c14493707d
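For readers who cannot open the gist, here is a minimal sketch of the two runs being compared. The jar path and the function names are assumptions; the two property names and the values 256 MB and 8 come from the discussion in this post.

```shell
# Hypothetical jar path; adjust it to match your installation.
EXAMPLES_JAR="${HADOOP_HOME:-/opt/hadoop}/share/hadoop/mapreduce/hadoop-mapreduce-examples.jar"

# 256 MB expressed in bytes, since the split size property expects bytes.
SPLIT_MINSIZE=$((256 * 1024 * 1024))

# Baseline: no overrides, so the cluster-wide settings apply.
# $1 is the input directory, $2 the output directory.
run_wordcount_default() {
  hadoop jar "$EXAMPLES_JAR" wordcount "$1" "$2"
}

# Overridden: raise the minimum split size and fix the reducer count.
run_wordcount_overridden() {
  hadoop jar "$EXAMPLES_JAR" wordcount \
    -Dmapreduce.input.fileinputformat.split.minsize="$SPLIT_MINSIZE" \
    -Dmapreduce.job.reduces=8 \
    "$1" "$2"
}
```

Both functions take an input and an output directory on HDFS. The output directory must not already exist, so each run needs its own.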

  • Without overriding the properties, the job uses a 128 MB split size (inherited from dfs.blocksize) and creates 270 map tasks (9 per file) to read the data, followed by 12 reducers to aggregate and write the results.
  • After overriding, the split size is 256 MB, so the number of mappers drops to 150 (5 per file), and the number of reducers is 8, as set explicitly in the command.
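The mapper counts above can be reproduced with a little arithmetic. The sketch below mimics how FileInputFormat sizes splits: the split size is the larger of the minimum split size and the block size (the maximum split size is ignored here because it is effectively unlimited by default), and the last chunk is folded into the previous split when it is within 10% of a full split. The per-file size of 1100 MB is an assumption: 9 and 5 splits per file imply each file is somewhat larger than 1 GB, but the exact size is not given in this post.

```shell
# Number of input splits for one file; all sizes are in MB.
# $1 = file size, $2 = block size, $3 = minimum split size (0 if not overridden)
splits_per_file() {
  remaining=$1
  split_mb=$2
  if [ "$3" -gt "$split_mb" ]; then
    split_mb=$3            # split size = max(min split size, block size)
  fi
  count=0
  # Keep carving full splits while remaining/split_mb > 1.1 (the split slop).
  while [ $((remaining * 10)) -gt $((split_mb * 11)) ]; do
    remaining=$((remaining - split_mb))
    count=$((count + 1))
  done
  if [ "$remaining" -gt 0 ]; then
    count=$((count + 1))   # whatever is left becomes the final split
  fi
  echo "$count"
}

FILES=30
FILE_MB=1100               # assumed file size, slightly over 1 GB

default_maps=$((FILES * $(splits_per_file "$FILE_MB" 128 0)))
override_maps=$((FILES * $(splits_per_file "$FILE_MB" 128 256)))

echo "$default_maps"       # prints 270
echo "$override_maps"      # prints 150
```

With files of exactly 1024 MB the same function would give 8 and 4 splits per file, which is why the observed counts point to files a little over 1 GB.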

The idea here is only to show how to override the properties, not how to determine the split size and the number of reducers. That will be covered as part of the Performance Tuning course.
