Let us understand HDFS Features and Architecture, as well as some important commands to copy data back and forth between the Local File System and HDFS.
Understanding Storage Servers
Understanding Hadoop Storage (HDFS)
HDFS Architecture
HDFS Commands – Overview
Understanding Storage Servers
Let us understand details about Storage Servers.
A computer is nothing but CPU, Memory and Storage. Even though CPU and Memory are important for making applications usable, they are transient, so we cannot rely on them for data that needs to be stored permanently. Hard Drives (or SSDs) store data permanently, and hence we use them for data that must persist.
Earlier we used to have CPU, Memory and Storage on the same server and took periodic backups so that we could restore data in case of hardware failures. But restoring Hard Drives requires downtime of the applications.
Hence, Storage was decoupled from the servers, and Storage Racks or Storage Servers evolved (EMC is one of the pioneers in this space).
Storage Servers or Racks are connected to the actual servers via high-speed fiber-optic networking.
The storage rack is like a container with its own CPU, Memory etc., into which we can plug many hard drives. Software installed on the server that comes as part of the rack takes care of the distribution of files across hard drives, load balancing, fault tolerance etc.
Let us take an example here:
Each drive is 2 TB in size
Let us say we have 8 such hard drives, then we have 16 TB of total storage.
With RAID 0, we achieve distribution of files across all the hard drives. However, if we lose even one hard drive, there will be an outage for all the applications using storage on that storage rack.
Distribution of files across multiple hard drives is called striping.
To make the storage fault-tolerant, we use Mirroring or configure an appropriate RAID level. For example, with mirroring (RAID 1) the 8 drives above provide only 8 TB of usable storage, but losing a single drive does not cause an outage.
Understanding Hadoop Storage (HDFS)
HDFS stands for Hadoop Distributed File System.
Distributed
Fault-Tolerant
Highly Reliable
On a regular local file system, such as the one on your PC, a file occupies contiguous blocks of storage. However, in a network file system, the blocks of a file need not be contiguous; they might be spread across multiple Hard Drives that are part of the Storage Server.
Let us first copy a data set and understand what is going on under the hood (see the sketch after this list).
Files will be divided into blocks
Blocks will be stored in multiple nodes
There will be multiple copies of each block
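For example, once the cards data set used later in this section has been copied to /user/training/cards, we can inspect how a file is split into blocks and where the replicas live. The path below is an assumption based on the examples that follow; adjust it to whatever you actually copied.

hdfs fsck /user/training/cards/largedeck.txt -files -blocks -locations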
Instead of a separate storage server, HDFS is designed to use the local file system of each node as part of the Distributed File System. We will have multiple servers as part of HDFS.
In our cluster, we have worker nodes on which Datanodes are running. These Datanodes are managed by Namenode(s).
Data is typically stored in the form of blocks on the servers where the Datanodes are running. The Block Size is 128 MB by default.
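If you want to confirm the block size configured on your cluster, the following command reads the dfs.blocksize property (it prints the value in bytes, e.g. 134217728 for 128 MB):

hdfs getconf -confKey dfs.blocksize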
Compared to RAID in legacy systems, in Hadoop replication takes care of data reliability.
By default, Hadoop creates 3 replicas, i.e., it maintains 3 copies of each block.
Rack awareness – In Hadoop, data is stored in a rack-aware fashion. It means that one replica of a block is placed on one rack and the other two replicas on another rack. This makes the cluster more reliable.
Replication factor and rack awareness give fault-tolerance to HDFS.
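To see the default replication factor on your cluster, and to change the replication of a particular file, commands like the following can be used (the file path is just an illustration based on the earlier examples):

hdfs getconf -confKey dfs.replication
hadoop fs -setrep 2 /user/training/cards/smalldeck.txt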
HDFS Architecture
There are three HDFS daemons – the Namenode (which is the master daemon), the Datanodes (which are slave daemons) and the Secondary Namenode.
Namenode and all the Datanodes are connected to the network switch.
The hdfs fsck command gives metadata information about files and blocks (health of the file system, block locations, replication, etc.)
Namenode stores the metadata information.
Datanodes store the actual data.
The client interacts with Namenode and finds out where the file blocks are stored.
Metadata can be recovered using the Edit Log and the FSImage. The Edit Log is a transaction log of all changes to the file system metadata, while the FSImage is a snapshot of the metadata at a particular point in time.
The Secondary Namenode keeps merging the Edit Log and the FSImage into a new FSImage. This process of merging the Edit Log and FSImage is known as Checkpointing.
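How often checkpointing happens is controlled by configuration. The property below is the standard one in recent Hadoop versions (typically 3600 seconds by default); treat the exact value on your cluster as something to verify:

hdfs getconf -confKey dfs.namenode.checkpoint.period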
By default, every 3 seconds each Datanode sends a heartbeat to the Namenode. If the Namenode does not receive any heartbeat from a Datanode for a pre-configured amount of time, that Datanode is marked as dead and no more blocks will be copied to it.
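To see which Datanodes the Namenode currently considers live, including when each one last sent a heartbeat, a cluster report can be pulled with the command below; note that on some clusters this requires administrator privileges:

hdfs dfsadmin -report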
HDFS Commands – Overview
Now let us walk through some of the important commands we use as developers on a regular basis.
We can get usage of all commands using hadoop fs or hadoop fs -usage
We can get usage of a single command using hadoop fs -usage COMMAND
Usage for ls – hadoop fs -usage ls
We can get the help of all commands using hadoop fs -help
We can get help for a single command using hadoop fs -help COMMAND
Help for ls – hadoop fs -help ls
List the files in an HDFS directory – hadoop fs -ls /user/training
hdfs dfs command is an alias for hadoop fs.
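For instance, both of the following list the same directory; which one you use is a matter of preference:

hadoop fs -ls /user/training
hdfs dfs -ls /user/training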
We can copy data from local file system to HDFS using hadoop fs -put or hadoop fs -copyFromLocal
Source is from local file system /data/cards
Target is in HDFS /user/training
Example – hadoop fs -put /data/cards /user/training
To see the contents of a smaller file – hadoop fs -cat /user/training/cards/smalldeck.txt
To tail larger files to preview the data – hadoop fs -tail /user/training/cards/smalldeck.txt
Updating a file in place is not possible in HDFS.
To append contents of local file /data/cards/smalldeck.txt to /user/training/cards/largedeck.txt – hadoop fs -appendToFile /data/cards/smalldeck.txt /user/training/cards/largedeck.txt
copyFromLocal is similar to the put command. copyToLocal and the get command are similar and are used to copy files from HDFS to the local file system.
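As an illustration (the local target directory /tmp is just an assumption), the file we copied earlier can be brought back to the local file system like this:

hadoop fs -get /user/training/cards/smalldeck.txt /tmp/smalldeck.txt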
These few commands are good enough for the rest of the course. However, if you are interested in learning all the commands, go to the next section where several other important HDFS Commands are covered in detail. It is self-paced.
Exercise
Here is the exercise on some important HDFS Commands.
Copy data from /data/retail_db to your userspace in HDFS /user/YOUR_USERNAME/retail_db
/data/retail_db has 6 subdirectories and each subdirectory has exactly one file.
List the files recursively to validate that /user/YOUR_USERNAME/retail_db has exactly the same directory structure as /data/retail_db.
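One possible way to solve the exercise is sketched below; it assumes your HDFS home directory /user/YOUR_USERNAME already exists, and YOUR_USERNAME is a placeholder for your actual user name:

hadoop fs -put /data/retail_db /user/YOUR_USERNAME/retail_db
hadoop fs -ls -R /user/YOUR_USERNAME/retail_db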