Section 2.10: Overview of HDFS and Setting Up Datasets

You need a good understanding of HDFS commands. Here is a quick overview of HDFS.

  • hadoop fs is the main command to interact with files in HDFS
  • You should have a good understanding of the commands for the following tasks
    • Listing files and directories           hadoop fs -ls
    • Creating directories                    hadoop fs -mkdir
    • Copying data to HDFS                    hadoop fs -copyFromLocal or hadoop fs -put
    • Copying data from HDFS                  hadoop fs -copyToLocal or hadoop fs -get
    • Deleting directories                    hadoop fs -rm -R
    • Changing permissions                    hadoop fs -chmod
    • Checking the size of data sets          hadoop fs -du -s -h
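The commands above can be strung together into a short practice session. The sketch below assumes a running Hadoop cluster; the /user/YOUR_USER directory and the sample file name are placeholders for illustration, not paths from this course.

```shell
# List files and directories in HDFS
hadoop fs -ls /user/YOUR_USER

# Create a practice directory
hadoop fs -mkdir /user/YOUR_USER/practice

# Copy a local file into HDFS (-put and -copyFromLocal are equivalent here)
hadoop fs -put /tmp/sample.txt /user/YOUR_USER/practice

# Copy it back from HDFS (-get and -copyToLocal are equivalent here)
hadoop fs -get /user/YOUR_USER/practice/sample.txt /tmp/sample_copy.txt

# Change permissions recursively on the directory
hadoop fs -chmod -R 755 /user/YOUR_USER/practice

# Check the total size of the directory in human-readable form
hadoop fs -du -s -h /user/YOUR_USER/practice

# Delete the practice directory recursively
hadoop fs -rm -R /user/YOUR_USER/practice
```

Running `hadoop fs` with no arguments prints the full list of supported subcommands, which is a handy way to explore beyond the tasks listed above.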

You need the right datasets for practice. More important than volume is functionally correct data, so that you can come up with realistic use cases.

The following data sets are available in our GitHub account:

  • retail_db
  • hr
  • lca
  • nyse
  • and more

If you are using your own environment, it is recommended to set up the data sets by copying them into your environment. In the lab, we have made the data available both on the local file system and in HDFS:

  • Local Path: /data
  • HDFS: /public
