Unlike the plain vanilla Hadoop distribution and other vendor distributions, Cloudera manages configuration files a bit differently. Typically, configuration files are in /etc/hadoop/conf. With Cloudera, however, /etc/hadoop/conf only has templates. The actual properties files are managed under /var/run/cloudera-scm-agent/process on each node.
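To make the location concrete, here is a minimal Python sketch that picks the most recent process directory for a given role under /var/run/cloudera-scm-agent/process. It assumes the directories are named `<number>-<service>-<ROLE>` (for example `57-hdfs-NAMENODE`) and that a higher number means a newer process. That naming is an assumption for illustration, the function name `latest_process_dir` is hypothetical, and reading the directory may need elevated privileges.

```python
from pathlib import Path
from typing import Optional

# Base directory named in the text above.
PROCESS_ROOT = Path("/var/run/cloudera-scm-agent/process")


def latest_process_dir(role: str, root: Path = PROCESS_ROOT) -> Optional[Path]:
    """Return the directory with the highest numeric prefix for the given role.

    Assumption (not from the text): directory names look like
    "<number>-<service>-<ROLE>", e.g. "57-hdfs-NAMENODE", and `role`
    is the part after the number, e.g. "hdfs-NAMENODE".
    """
    if not root.is_dir():
        return None
    candidates = []
    for entry in root.iterdir():
        # Split "57-hdfs-NAMENODE" into "57" and "hdfs-NAMENODE".
        prefix, _, rest = entry.name.partition("-")
        if entry.is_dir() and prefix.isdigit() and rest == role:
            candidates.append((int(prefix), entry))
    if not candidates:
        return None
    # Compare on the numeric prefix only.
    return max(candidates, key=lambda pair: pair[0])[1]
```

Calling `latest_process_dir("hdfs-NAMENODE")` on a node would then return the directory whose files, such as hdfs-site.xml, are the ones that role is actually using, or `None` if nothing matches.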
- hadoop-env.sh is for environment variables
  - HADOOP_HOME
  - JAVA_HOME
  - HADOOP_HEAPSIZE – default heap size for Hadoop components such as Namenode, Secondary Namenode, Datanode etc.
  - HADOOP_DATANODE_OPTS – JVM settings for the Datanode. We can override the memory settings for the Datanode here.
  - HADOOP_NAMENODE_INIT_HEAPSIZE – overrides the Namenode heap size. We can also use HADOOP_NAMENODE_OPTS for the Namenode (the same applies to the Secondary Namenode).
    - As highlighted earlier, every file and every block has metadata associated with it, and each entry takes about 150 bytes. Replication was not included in the earlier count. With replication, each block takes about 150 bytes times the replication factor.
    - More details are available in the Cloudera article about Namenode heap sizing.
- core-site.xml has parameters that are used by both HDFS and MapReduce
  - fs.defaultFS
  - fs.trash.interval
  - proxy user configuration (hadoop.proxyuser.* properties)
  - io.compression.codecs
  - net.topology.script.file.name
- hdfs-site.xml has parameters that are used by HDFS
  - dfs.blocksize
  - dfs.replication
  - dfs.client.read.shortcircuit
  - dfs.namenode.http-address
  - dfs.datanode.http.address
  - dfs.datanode.data.dir
We can bind the HTTP address to 0.0.0.0 if we want the URL to be reachable through any of the IP addresses assigned to the server.
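As a rough illustration of how these properties can be inspected, the sketch below parses the `<property>` entries of a Hadoop site file and checks whether an HTTP address uses the 0.0.0.0 wildcard. The helper names `parse_site_xml` and `binds_all_interfaces` are hypothetical, and the sample XML with its ports and values is made up for the example rather than taken from a real cluster.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def parse_site_xml(xml_text: str) -> Dict[str, str]:
    """Parse the content of a Hadoop *-site.xml file into {name: value}."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            props[name.strip()] = value.strip()
    return props


def binds_all_interfaces(address: str) -> bool:
    """Return True when a host:port address uses the 0.0.0.0 wildcard host."""
    host, _, _port = address.rpartition(":")
    return host == "0.0.0.0"


# Illustrative content only; the values are assumptions, not cluster output.
SAMPLE_HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>0.0.0.0:50070</value>
  </property>
  <property>
    <name>dfs.datanode.http.address</name>
    <value>dn01.example.com:50075</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""

props = parse_site_xml(SAMPLE_HDFS_SITE)
for key in ("dfs.namenode.http-address", "dfs.datanode.http.address"):
    print(key, props[key], "wildcard" if binds_all_interfaces(props[key]) else "single host")
```

On a real node the same function could be fed the text of the hdfs-site.xml found in the relevant process directory instead of the sample string.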
https://gist.github.com/dgadiraju/0bc896a2aa7fc7b99a1b6fc52c4dcc4a
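The Namenode heap rule of thumb from the hadoop-env.sh notes (about 150 bytes per file entry, and about 150 bytes times the replication factor per block) can be turned into quick arithmetic. The sketch below only encodes that rule as stated in the text. The function name is hypothetical, the file and block counts are made-up inputs, and the result is a rough estimate of metadata size rather than a heap sizing recommendation.

```python
# Per-entry metadata size quoted in the text (approximate).
BYTES_PER_ENTRY = 150


def estimate_namenode_metadata_bytes(num_files: int, num_blocks: int, replication: int = 3) -> int:
    """Estimate Namenode metadata size using the rule of thumb from the text.

    Files count once; each block counts once per replica, as the text states.
    """
    file_entries = num_files
    block_entries = num_blocks * replication
    return (file_entries + block_entries) * BYTES_PER_ENTRY


# Hypothetical cluster: 1 million files, 1 million blocks, replication factor 3.
estimate = estimate_namenode_metadata_bytes(1_000_000, 1_000_000, replication=3)
print(f"{estimate} bytes, about {estimate / 1_000_000:.0f} MB")
```

For these inputs the estimate is (1,000,000 + 3,000,000) × 150 = 600,000,000 bytes, or about 600 MB, which gives a sense of the floor below which a Namenode heap setting would clearly be too small.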