As part of this section, we will get into all the important HDFS commands.
- Getting list of commands and help
- Creating directories and changing ownership
- Managing files and file permissions
- Controlling access using ACLs
- Overriding Properties
- HDFS usage and Metadata Commands
- Creating Snapshots
- Using CLI for administration
Getting list of commands and help
Let us explore how to list the available commands and get the help or usage for a given command.
https://www.youtube.com/watch?v=Gn8ngAnqjUs
- Even though we can run commands from almost all the nodes in the cluster, we should only use the Gateway to run HDFS commands.
- First we need to make sure the designated Gateway server is a Gateway for the HDFS service so that we can run commands from the Gateway node. In our case we have designated bigdataserver-1 as the Gateway.
- Let us make sure that bigdataserver-1 is added as an HDFS Gateway so that we can run our commands successfully.
- We can also run commands by connecting to multiple clusters. However, we cannot configure one server as the Gateway for multiple clusters, and hence we have to specify the Namenode URI using -fs. We can get the Namenode URI from core-site.xml or Cloudera Manager.
- Typically the Namenode process runs on port 8020.
- hadoop fs – lists all the available commands
- hadoop fs -usage – gives us basic usage for a given command
- hadoop fs -help – gives us additional information for all the commands
- We can run help on individual commands as well.
- Let us also review the very important command hadoop fs -ls to list files and directories under a given path (see the sketch below).
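As a quick sketch, assuming a /user directory already exists on the cluster, these commands can be tried as follows:
hadoop fs -usage ls        # one-line usage for the ls subcommand
hadoop fs -help ls         # detailed help for the ls subcommand
hadoop fs -ls /user        # list files and directories under /user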
Creating Directories and Changing Ownership
Now let us have a look at how to create directories and manage ownership.
https://www.youtube.com/watch?v=09TAlPPxyYA
- By default, hdfs is the superuser of HDFS
- hadoop fs -mkdir – to create directories
- hadoop fs -chown – to change ownership of files
- chown can also be used to change the group. We can change the group using the -chgrp command as well. Make sure to run -help on chgrp and check the details.
- Creating user space:
- Create a directory with the user id cloudera under /user
- Change the ownership to the same name as the directory created earlier (/user/cloudera)
- You can validate the permissions by running the hadoop fs -ls command on /user
- Let us create OS users on bigdataserver-1 and then user spaces for cloudera, itversity and demo (a sketch follows below).
- We will be using these to demonstrate ACLs.
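A minimal sketch of creating a user space, assuming we are on the Gateway node and can run commands as the hdfs superuser (the user cloudera is just an example):
sudo -u hdfs hadoop fs -mkdir /user/cloudera                    # create the user space under /user
sudo -u hdfs hadoop fs -chown -R cloudera:cloudera /user/cloudera   # hand over ownership to the user
hadoop fs -ls /user                                             # validate ownership and permissions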
Managing Files and File Permissions
Now let us get into commands related to managing files in HDFS. This includes deleting files, copying files, as well as HDFS file permissions.
Deleting Files from HDFS
Let us see how we can delete files from HDFS.
https://www.youtube.com/watch?v=4bJmfz9c1YI
- As we have already copied data into HDFS, let us start with deleting files using the hadoop fs -rm command (see the sketch below).
- When we use the rm command, files are moved to the .Trash directory by default. It acts as a recycle bin to protect against deleting files accidentally.
- We can use -skipTrash to bypass the recycle bin and delete data permanently. However, this cannot be undone.
- .Trash can be cleaned up manually by users belonging to the superuser group (such as hdfs) or automatically based on trash-related properties defined in core-site.xml.
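A minimal sketch, assuming the files and directory below already exist (the paths are illustrative):
hadoop fs -rm /user/cloudera/data/sample.txt              # moves the file to .Trash
hadoop fs -rm -skipTrash /user/cloudera/data/other.txt    # deletes the file permanently, bypassing .Trash
hadoop fs -rm -r -skipTrash /user/cloudera/data           # recursively and permanently delete a directory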
Copying Files between local file system and HDFS
We can copy files from the local file system to HDFS and vice versa. We can also append data to existing files in HDFS.
https://www.youtube.com/watch?v=YA79aa3lNQo
- hadoop fs -copyFromLocal or hadoop fs -put – to copy files from the local file system to HDFS. The process of copying data is already covered: the file is divided into blocks and stored on Datanodes in a distributed fashion based on the block size and replication factor.
- hadoop fs -copyToLocal or hadoop fs -get – to copy files from HDFS to the local file system. It reads all the blocks using the index in sequence and constructs the file on the local file system.
- We can also use hadoop fs -appendToFile to append data to an existing file.
- However, we will not be able to update or fix data in files once they are in HDFS. If we have to fix any data, we have to copy the file to the local file system, fix the data, and then copy it back to HDFS.
- We can move files from the local file system to HDFS using hadoop fs -moveFromLocal (see the sketch below). Even though there is a moveToLocal command, its functionality is not implemented yet.
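Here is a rough sketch of these commands; the local files and the HDFS directory /user/cloudera/data are just example names:
hadoop fs -put data.txt /user/cloudera/data/                         # local file system to HDFS
hadoop fs -get /user/cloudera/data/data.txt copy_of_data.txt         # HDFS to local file system
hadoop fs -appendToFile more_data.txt /user/cloudera/data/data.txt   # append a local file to an HDFS file
hadoop fs -moveFromLocal old_data.txt /user/cloudera/data/           # copies the file, then removes the local copy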
Copying or Moving Files within HDFS
We can also copy or move files within HDFS using commands like cp and mv.
https://www.youtube.com/watch?v=kBsa8JbMXwE
- hadoop fs -cp – to copy files from one HDFS location to another HDFS location
- hadoop fs -mv – to move files from one HDFS location to another HDFS location
- mv is faster than cp, as mv deals only with metadata whereas cp has to copy all the blocks.
- If you have to rename or move files, make sure to use hadoop fs -mv (see the sketch below).
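For example, assuming the paths below exist:
hadoop fs -cp /user/cloudera/data/data.txt /user/itversity/data/                    # copies all the blocks
hadoop fs -mv /user/cloudera/data/data.txt /user/cloudera/archive/                  # metadata-only operation
hadoop fs -mv /user/cloudera/archive/data.txt /user/cloudera/archive/data_old.txt   # rename a file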
Previewing Data
Let us see how we can preview the data in HDFS.
https://www.youtube.com/watch?v=LRuAMMsUg84
- If we are dealing with files containing text data (files in text file format), we can preview the contents of the files using commands such as -tail, -cat, etc.
- -tail can be used to preview the last 1 KB of a file
- -cat can be used to print the whole contents of a file on the screen. Be careful while using -cat, as it will take a while even for medium-sized files.
- If you want to get the first few lines of a file, you can redirect the output of hadoop fs -cat to the Linux more command (see the sketch below).
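A quick sketch, using an illustrative path:
hadoop fs -tail /user/cloudera/data/data.txt            # last 1 KB of the file
hadoop fs -cat /user/cloudera/data/data.txt | more      # page through the file from the beginning
hadoop fs -cat /user/cloudera/data/data.txt | head -5   # first five lines only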
HDFS File Permissions
Let us go through file permissions in HDFS.
https://www.youtube.com/watch?v=FzThXajwzjk
- As we create the files, we can check the permissions on them using the -ls command.
- Typically the owner of the user space will have rwx, while members of the specified group as well as others have rx.
- We can change the permissions using hadoop fs -chmod
- We can specify the permissions in symbolic mode (e.g., +x to grant execute access to the owner, group as well as others) or in octal mode (e.g., 755 to grant rwx to the owner and rx to the group and others).
Let us copy data into all 3 user spaces for the users cloudera, itversity and demo (a sketch of these commands follows).
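A minimal sketch of copying data and adjusting permissions; the local file data.txt and the paths are just examples:
hadoop fs -put data.txt /user/cloudera/            # copy data into the user space
hadoop fs -chmod +x /user/cloudera/data.txt        # symbolic mode: add execute for owner, group and others
hadoop fs -chmod 755 /user/cloudera/data.txt       # octal mode: rwx for owner, rx for group and others
hadoop fs -ls /user/cloudera                       # verify ownership and permissions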
Controlling Access using ACLs
ACL stands for Access Control List, and it gives finer-grained access control over files. Without ACLs, permissions are controlled only at the owner, group and others levels.
https://www.youtube.com/watch?v=06zIoY8ALE4
- To use ACLs in HDFS, we need to set dfs.namenode.acls.enabled to true as part of hdfs-site.xml.
https://www.youtube.com/watch?v=ddDw3mvxIbs
- We can use hadoop fs -setfacl to set ACLs at the file or directory level.
https://www.youtube.com/watch?v=LvH8-KyCwR8
- hadoop fs -getfacl can be used to get details about the ACLs on a file or a directory.
https://www.youtube.com/watch?v=25hc9GAK6m0
- First, let us see examples at the file level, then at the directory level, and then deleting ACLs (a sketch follows below).
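A rough sketch of these commands, assuming the file and directory below exist and ACLs are enabled on the cluster:
hadoop fs -setfacl -m user:itversity:rw- /user/cloudera/data.txt    # grant itversity read and write on a file
hadoop fs -setfacl -m default:user:demo:r-x /user/cloudera          # default ACL applied to new children of a directory
hadoop fs -getfacl /user/cloudera/data.txt                          # inspect the ACL entries
hadoop fs -setfacl -b /user/cloudera/data.txt                       # remove all ACL entries except the base permissions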
Overriding Properties
Let us see how we can override properties while running commands such as hadoop fs. Let us first review properties files such as core-site.xml and hdfs-site.xml.
https://www.youtube.com/watch?v=na8CYvX4wa0
- We can override any non-final property using -Dproperty_name=property_value as part of the hadoop fs command.
- We can also use options such as -fs to override the Namenode URI.
- We can also change the replication factor using the -setrep subcommand (see the sketch below).
- Some of the properties might have been defined as final as part of the properties files such as core-site.xml or hdfs-site.xml. -D will not have any impact in that case.
- When it comes to Cloudera Manager, we are not supposed to override properties by updating files; instead, we need to use the Safety Valve that comes as part of Cloudera Manager.
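A sketch of overriding properties at run time; the block size, Namenode host and paths below are just examples:
hadoop fs -Ddfs.blocksize=67108864 -put data.txt /user/cloudera/     # override the block size to 64 MB for this copy
hadoop fs -fs hdfs://bigdataserver-2:8020 -ls /user                  # point the command at a different Namenode URI
hadoop fs -setrep 2 /user/cloudera/data.txt                          # change the replication factor of a file to 2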
HDFS usage commands and getting metadata
Now let us have a look at HDFS usage commands and also commands used to get the metadata.
https://www.youtube.com/watch?v=oUcUdq3te6I
- hadoop fs -df – to get details about the amount of storage used by HDFS. Use -h to get the information in a readable format.
- hadoop fs -du – to get the size of the data that is copied. Use -s to get summarized information and -h to get it in a readable format.
- hdfs fsck – to get metadata for a given directory or files (see the sketch below).
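For example:
hadoop fs -df -h /                          # overall HDFS capacity and usage in readable units
hadoop fs -du -s -h /user/cloudera          # total size of a user space
hdfs fsck /user/cloudera -files -blocks     # file and block level metadata for a directory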
Creating Snapshots
HDFS Snapshots are read-only point-in-time copies of the file system. Snapshots can be taken on a subtree of the file system or the entire file system. Some common use cases of snapshots are data backup, protection against user errors and disaster recovery.
https://www.youtube.com/watch?v=i9uoLMOBKUk
- It does not copy the actual data; it keeps track of changes to the metadata.
- First, we need to make the directory snapshottable using hdfs dfsadmin -allowSnapshot. Only users in the supergroup can allow snapshots on a directory.
- Once snapshots are allowed, we can create a snapshot using hadoop fs -createSnapshot
- We can also delete or rename a snapshot using -deleteSnapshot or -renameSnapshot
- Users in the supergroup can also disallow snapshots (using hdfs dfsadmin -disallowSnapshot); a sketch of these commands follows.
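A rough sketch, assuming /user/cloudera/data exists and we can run commands as the hdfs superuser:
sudo -u hdfs hdfs dfsadmin -allowSnapshot /user/cloudera/data      # make the directory snapshottable
hadoop fs -createSnapshot /user/cloudera/data snap1                # create a named snapshot
hadoop fs -renameSnapshot /user/cloudera/data snap1 snap1_old      # rename the snapshot
hadoop fs -deleteSnapshot /user/cloudera/data snap1_old            # delete the snapshot
sudo -u hdfs hdfs dfsadmin -disallowSnapshot /user/cloudera/data   # disallow snapshots once none remain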
Using CLI for administration
There are several commands to perform administration using the CLI. We need to use the HDFS superuser to manage the HDFS cluster using commands. In our case it is the hdfs user itself.
https://www.youtube.com/watch?v=o71l6S6CFHM
- Formatting Namenode
- Rolling Edits
- Save Namespace (create fsimage)
- Enter or Leave Safemode
- Running balancer
- Running file system utility (fsck)
- and many more (a few of these are sketched below)
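A sketch of a few of these, run as the hdfs superuser (saveNamespace is shown inside safemode because it requires the Namenode to be in safemode):
sudo -u hdfs hdfs dfsadmin -safemode get        # check whether the Namenode is in safemode
sudo -u hdfs hdfs dfsadmin -safemode enter      # enter safemode
sudo -u hdfs hdfs dfsadmin -saveNamespace       # save the namespace (create a new fsimage)
sudo -u hdfs hdfs dfsadmin -safemode leave      # leave safemode
sudo -u hdfs hdfs dfsadmin -rollEdits           # roll the edit log
sudo -u hdfs hdfs balancer -threshold 10        # rebalance blocks across Datanodes
sudo -u hdfs hdfs fsck /                        # run the file system utility on the whole namespace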
Make sure to stop the services in Cloudera Manager and also shut down the servers provisioned from GCP or AWS to leverage credits or control costs.