You need a good understanding of HDFS commands. Here is a quick overview of HDFS.
- `hadoop fs` is the main command to interact with files in HDFS
- You should have a good understanding of the commands for the following tasks
  - Listing files and directories: `hadoop fs -ls`
  - Creating directories: `hadoop fs -mkdir`
  - Copying data to HDFS: `hadoop fs -copyFromLocal` or `hadoop fs -put`
  - Copying data from HDFS: `hadoop fs -copyToLocal` or `hadoop fs -get`
  - Deleting directories: `hadoop fs -rm -R`
  - Changing permissions: `hadoop fs -chmod`
  - Checking the size of data sets: `hadoop fs -du -s -h`
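The commands above can be combined into a typical workflow. Here is a sketch that creates a directory, uploads a file, inspects it, and cleans up; the paths (`/user/yourname/practice`, `sample.txt`) are hypothetical placeholders, and the commands require access to a running Hadoop cluster.

```shell
# Create a practice directory in HDFS (-p creates parent directories as needed)
hadoop fs -mkdir -p /user/yourname/practice

# Copy a local file into HDFS (hypothetical file name)
hadoop fs -put sample.txt /user/yourname/practice/

# List the contents of the directory
hadoop fs -ls /user/yourname/practice

# Check the size of the directory in a human-readable summary
hadoop fs -du -s -h /user/yourname/practice

# Restrict permissions on the uploaded file (owner read/write only)
hadoop fs -chmod 600 /user/yourname/practice/sample.txt

# Copy the file back to the local file system
hadoop fs -get /user/yourname/practice/sample.txt sample_copy.txt

# Remove the practice directory recursively when done
hadoop fs -rm -R /user/yourname/practice
```

Note that `hadoop fs -put` and `hadoop fs -copyFromLocal` behave the same for this use case, as do `-get` and `-copyToLocal`.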
You need the right data sets for practice. More important than volume is functionally correct data, so that you can come up with realistic use cases.
The following data sets are available in our GitHub account:
- retail_db
- hr
- lca
- nyse
- and more
If you are using your own environment, we recommend setting up these data sets and copying them to your environment. In the lab, the data is available both on the local file system and in HDFS:
- Local Path: /data
- HDFS: /public
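If you are setting up your own environment, the copy from the local path to HDFS can be sketched as follows, using the `retail_db` data set from the list above as an example; the exact directory layout under `/data` is an assumption, and the commands require a running Hadoop cluster with write access to `/public`.

```shell
# Create the target directory in HDFS (-p creates parents as needed)
hadoop fs -mkdir -p /public/retail_db

# Copy the data set from the local file system into HDFS
hadoop fs -put /data/retail_db/* /public/retail_db/

# Validate the copy: list the files and check the total size
hadoop fs -ls /public/retail_db
hadoop fs -du -s -h /public/retail_db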