Efficiently copy data within a cluster/between clusters

Let us see how we copy data within a cluster or between clusters.

We can use hadoop fs -cp to copy and hadoop fs -mv to move data within a cluster. hadoop fs -mv can also be used to rename files. We have seen examples of both under Copying or Moving files within HDFS in the important HDFS commands section.
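As a quick refresher, here is a minimal sketch of copying and renaming within one cluster. The paths under /user/demo are hypothetical and only for illustration. The run helper is also an assumption of this sketch, not part of Hadoop: it prints each command by default so that the script can be reviewed before anything is changed.

```shell
# Hypothetical HDFS paths; replace them with directories from your own cluster.
SRC=/user/demo/retail_db/orders
DST=/user/demo/retail_db_backup

# Helper for this sketch only (not a Hadoop command):
# prints the command when DRY_RUN is 1 (the default), runs it when DRY_RUN is 0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Create the target directory if it does not exist yet.
run hadoop fs -mkdir -p "$DST"

# Copy: the source directory stays where it is.
run hadoop fs -cp "$SRC" "$DST"

# Move within HDFS: here it simply renames the copied directory.
run hadoop fs -mv "$DST/orders" "$DST/orders_old"
```

Running the script with DRY_RUN=0 executes the commands against the cluster instead of printing them.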
We can use hadoop distcp to copy data between clusters.
We can get the list of control arguments by running hadoop distcp. Here are some important control arguments.
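- -filters – local path to a file containing a list of patterns, one per line. Paths matching any of the patterns are excluded from the copy.
- -append – reuse the existing data in target files and append the new data when possible. It is used together with -update.
- -overwrite – if the target files exist, they will be overwritten.
- -delete – delete files that exist in the target but are missing in the source. It is used together with -update or -overwrite.
- -bandwidth – bandwidth limit for each map task, in MB/second.
- -p – preserve file properties such as replication, block size, user, group and permissions.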
We have to use the fully qualified HDFS URI (including the NameNode host and port) for both source and target while running the hadoop distcp command.
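The following sketch shows how the control arguments above might be combined. The NameNode addresses nn1:8020 and nn2:8020, the /user/demo paths and the filter file location are all placeholders, so substitute the values of your own clusters. The run helper is an assumption of this sketch and not part of Hadoop: it only prints the commands unless DRY_RUN is set to 0.

```shell
# Placeholder URIs; use the NameNode host and port of your own clusters.
SRC=hdfs://nn1:8020/user/demo/retail_db
DST=hdfs://nn2:8020/user/demo/retail_db

# Hypothetical local file with one exclusion pattern per line, for example: .*\.tmp
FILTERS=/home/demo/distcp_filters.txt

# Helper for this sketch only (not a Hadoop command):
# prints the command when DRY_RUN is 1 (the default), runs it when DRY_RUN is 0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Plain copy from one cluster to the other.
run hadoop distcp "$SRC" "$DST"

# Incremental sync: copy what changed, remove target files that are gone
# from the source, preserve file properties, and limit each map to 10 MB/second.
run hadoop distcp -update -delete -p -bandwidth 10 "$SRC" "$DST"

# Full refresh that overwrites existing target files and skips paths
# matching the patterns in the filter file.
run hadoop distcp -overwrite -filters "$FILTERS" "$SRC" "$DST"
```

Reviewing the printed commands first is a cheap safeguard, because -delete and -overwrite change data on the target cluster.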
