Overview of Ambari and HDFS

As part of this session, we have covered:

  • Logging into Ambari
  • Ambari Overview
  • Ambari Architecture
  • Categories of Hosts
  • HDFS Overview
  • Copying Files to HDFS
  • Customizing Properties
  • Getting Metadata Information

Logging into Ambari and Overview

  • Sign in to Ambari, a cluster management tool provided by Hortonworks. To get details at the hardware level, go to the Hosts tab; to get details at the software level, go to the Services tab. Services are software components deployed on the cluster. Ambari can also show trends in real time.

Ambari Architecture

  • It follows an agent and server architecture. There is a centralized Ambari server, and it uses a centralized database. Some process on each node in the cluster has to report information to the server.
  • Check for Ambari processes running on the host-

ps -ef|grep -i ambari

  • The Ambari agent and server can be seen running on the host. The agent captures metrics for every service deployed (CPU, memory, HDFS usage, etc.) and passes the information to the server, which stores it in the database and presents it in the form of graphs and charts.
  • In an Ambari cluster of, say, 1000 nodes, most of the nodes run agents and only a few run the server. So, in a typical cluster we have master nodes (as many as needed for the master services we run), worker nodes, and gateway nodes, which help us connect to the cluster. Jobs are deployed from gateway nodes in a production environment.
  • One of the services managed through Ambari is HDFS.
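  • The Ambari processes can also be checked with their own status commands (assuming the standard Hortonworks packages are installed on the node)-
    • ambari-server status   (on the node hosting the server)
    • ambari-agent status    (on any node running an agent)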

HDFS – Hadoop Distributed File System

  • A file system determines how files are stored on and retrieved from disk.
  • The hard disk is a collection of bits. Every operating system divides the hard disk into blocks and has some default block size. When we store a file on the hard disk, the number of blocks used depends upon the block size. For example, if the default block size is 4 KB, a 100 KB file is stored sequentially in 25 blocks.
  • Due to fragmentation, a local file system sometimes cannot store a file even though enough total space is available.
  • In the case of distributed file systems, we can store the blocks of a file non-sequentially, and even on different nodes.
  • Now, go to the gateway node and list the files present in the local file system using the ls command. Go to the /data/crime directory and check the file sizes. On the local file system, a file needs to be stored sequentially.
  • To interact with HDFS, we can use the hadoop fs command.
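  • For example, to list the contents of the HDFS home directory (here /user/training, the target directory used in the copy commands below)-
    • hadoop fs -ls /user/training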

Copying Files To HDFS

  • Copy a file from the local file system to HDFS (use the put command and give the source file path and the target file path) – hadoop fs -put /data/crime/csv/rows.csv /user/training/crime_data.csv

The file will be copied to the worker nodes as blocks. HDFS is a logical file system, and you cannot find files with these names on the worker nodes' local disks. The file is divided into blocks based on the block size.
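The copy can be verified with hadoop fs -ls /user/training/crime_data.csv. As an illustrative calculation (the file size used here is hypothetical), a 300 MB file copied with the default 128 MB block size would be split into three blocks of 128 MB, 128 MB and 44 MB, each stored on worker nodes.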

Customizing Properties

Let us see how we can override properties while copying data into HDFS.

  • Under the /etc directory, there is one directory for each service. Hadoop is a combination of two components, HDFS and YARN, along with MapReduce.
  • In the /etc/hadoop/conf directory we have XML files like core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml.
  • core-site.xml has properties common to both components of Hadoop.
  • hdfs-site.xml has properties like block size. By default, the block size is 128 MB.
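  • The default block size configured for the cluster can also be checked from the command line (assuming a standard HDFS client installation) using the getconf utility-
    • hdfs getconf -confKey dfs.blocksize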
  • To override the block size from 128 MB to 64 MB while copying a file from the local file system to HDFS (67108864 bytes = 64 MB)-
    • hadoop fs -Ddfs.blocksize=67108864 -put /data/crime/csv/rows.csv  /user/training/crime_data.csv
  • We can also override the replication factor.
    • hadoop fs -Ddfs.replication=2 -put /data/crime/csv/rows.csv  /user/training/crime_data.csv
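  • To verify that such overrides took effect, hadoop fs -stat can print the block size, replication factor, and name of the copied file (the format string below is one illustrative choice)-
    • hadoop fs -stat "%o %r %n" /user/training/crime_data.csv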

Getting Metadata Information

  • A file is divided into blocks, and each block is stored at one or more locations (worker nodes). So, using the fsck command we can get the metadata information of a particular file, for example the file copied above-
    • hdfs fsck /user/training/crime_data.csv -files -blocks -locations