Efficiently copy data within a cluster/between clusters

Let us see how we copy data within a cluster or between clusters.

  • We can use hadoop fs -cp to copy and hadoop fs -mv to move data within a cluster. hadoop fs -mv can also be used to rename files. We have seen examples of both as part of Copying or Moving files within HDFS in important HDFS commands.
  • We can use hadoop distcp to copy data between clusters.
  • We can get the list of control arguments by running hadoop distcp without any arguments. Here are some important control arguments.
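    • -filters – path to a file containing a list of patterns (regular expressions), one per line. Paths matching any pattern are excluded from the copy.
    • -append – if a file already exists in the target, reuse its data and append the new data from the source where possible. It has to be used along with -update.
    • -overwrite – if the target files exist, they will be overwritten.
    • -delete – delete files from the target if they are missing in the source. It has to be used along with -update or -overwrite.
    • -bandwidth – maximum bandwidth to be used by each map task, in MB/second.
    • -p – preserve file properties such as replication, block size, user, group and permission. The properties are selected by letters after -p (for example, -pugp preserves user, group and permission).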
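  • We have to use fully qualified HDFS URIs of the form hdfs://<namenode-host>:<port>/<path> for both source and target while running the hadoop distcp command, so that each path resolves to the right cluster.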
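
Here is a minimal sketch of how the control arguments above could be combined with hadoop distcp. The NameNode hosts, port, paths and the filters file are hypothetical, so replace them with values from your clusters. The run helper and the DRY_RUN variable are not part of Hadoop. They are only there so that the script prints the commands by default instead of launching the copy.

```shell
#!/bin/sh
# DRY_RUN=1 (default in this sketch) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical source and target URIs, one per cluster.
SRC_URI=hdfs://nn1.example.com:8020/user/training/retail_db
TGT_URI=hdfs://nn2.example.com:8020/user/training/retail_db

# Incremental sync: -update copies only files that differ,
# -delete removes target files that are missing in the source,
# -pugp preserves user, group and permission,
# -bandwidth 10 limits each map task to 10 MB/second.
run hadoop distcp -update -delete -pugp -bandwidth 10 "$SRC_URI" "$TGT_URI"

# Full refresh: overwrite whatever exists in the target, skipping paths
# that match a pattern in the (hypothetical) filters file.
run hadoop distcp -overwrite -filters /home/training/distcp_filters.txt "$SRC_URI" "$TGT_URI"
```

Review the printed commands first, then run the script with DRY_RUN=0 on a gateway node that can reach both clusters.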

https://gist.github.com/dgadiraju/085ee66867a891108e91cd092dc0f90b
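
For the within-cluster case from the first bullet, a short sketch of hadoop fs -cp and hadoop fs -mv could look like this. The directory names are hypothetical. As an illustration aid only, the run helper prints the commands unless DRY_RUN is set to 0.

```shell
#!/bin/sh
# DRY_RUN=1 (default in this sketch) prints each command; DRY_RUN=0 executes it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Hypothetical HDFS paths within the same cluster.
SRC=/user/training/retail_db/orders
BACKUP=/user/training/retail_db_backup

# Make sure the parent directory of the copy exists.
run hadoop fs -mkdir -p "$BACKUP"

# Copy the directory to another location in the same cluster.
run hadoop fs -cp "$SRC" "$BACKUP/orders"

# Rename the copy; -mv moves within HDFS, so it also serves as rename.
run hadoop fs -mv "$BACKUP/orders" "$BACKUP/orders_renamed"
```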
