Let us see how we copy data within a cluster or between clusters.
- We can use hadoop fs -cp to copy and hadoop fs -mv to move data within a cluster. mv can also be used to rename files. We have seen these examples as part of Copying or Moving files within HDFS in important HDFS commands.
- We can use hadoop distcp to copy data between clusters.
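- We can get the list of control arguments by running hadoop distcp without any arguments. Here are some important control arguments.
  - -filters – local path to a file containing a list of patterns (regular expressions, one per line); paths matching a pattern are excluded from the copy.
  - -append – if the file names match and the underlying file system supports it, new data will be appended to the existing target files. It is used along with -update.
  - -overwrite – if the target files exist, they will be overwritten.
  - -delete – delete files from the target if they are missing in the source. It is used along with -update or -overwrite.
  - -bandwidth – maximum bandwidth to be used by each map task, in MB per second.
  - -p – preserve file properties such as replication, block size, user, group and permissions.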
- We have to use a fully qualified HDFS URI (of the form hdfs://<namenode host>:<port>/<path>) for both the source and the target while running the hadoop distcp command.
https://gist.github.com/dgadiraju/085ee66867a891108e91cd092dc0f90b
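As a complement to the gist above, here is a minimal sketch of how a hadoop distcp command line can be put together from a source URI, a target URI and control arguments. The host names (nn1.example.com, nn2.example.com), the port 8020 and the paths are hypothetical placeholders, not values from this course, and build_distcp_cmd is a helper introduced only for illustration. The script only prints the command, so it is safe to run on a machine without Hadoop.

```shell
#!/bin/sh
# Hypothetical source and target URIs; replace hosts, port and paths
# with the values of your own clusters.
SRC_URI="hdfs://nn1.example.com:8020/user/training/retail_db"
TGT_URI="hdfs://nn2.example.com:8020/user/training/retail_db"

# Illustrative helper: prints a distcp command line.
# Usage: build_distcp_cmd <source uri> <target uri> [control arguments...]
build_distcp_cmd() {
  src="$1"
  tgt="$2"
  shift 2
  # "$@" now holds only the control arguments; distcp expects them
  # before the source and target URIs.
  set -- hadoop distcp "$@" "$src" "$tgt"
  printf '%s\n' "$*"
}

# Copy only what is missing or changed (-update), preserve file
# properties (-p) and limit each map task to 10 MB per second.
build_distcp_cmd "$SRC_URI" "$TGT_URI" -update -p -bandwidth 10
```

Once the printed command looks right, it can be copied and run on a gateway node that can reach both clusters. Printing first is a deliberate choice here, because options such as -overwrite and -delete change data on the target.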