When you see this error, the default thing to do is to set the HADOOP_CLIENT_OPTS config in your environment. In my case I made a typo, HADDOP_CLIENT_OPTS, which caused us to spend time chasing other unwanted options. :(
This is not an informative post but more of a note to myself.
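Since the root cause here was nothing more than a misspelled variable name, a quick look at the environment before launching the job would have saved that time. Nothing official, just a habit:
$ env | grep CLIENT_OPTS          # a typo like HADDOP_CLIENT_OPTS shows up immediately here
$ echo "$HADOOP_CLIENT_OPTS"      # empty output means the correct name was never exported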
Why does DistCp run out of memory?
If the number of individual files/directories being copied from the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp
might run out of memory while determining the list of paths for copy. This is not unique to the new DistCp
implementation. To get around this, consider changing the -Xmx
JVM heap-size parameters, as follows:
$ export HADOOP_CLIENT_OPTS="-Xms4096m -Xmx4096m"
$ hadoop distcp /source_directory /destination_directory
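If you prefer not to change the environment for the whole shell session, the same setting can be scoped to a single run by prefixing the command; this is plain shell behaviour rather than anything DistCp-specific:
$ HADOOP_CLIENT_OPTS="-Xms4096m -Xmx4096m" hadoop distcp /source_directory /destination_directory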
Copying a dataset while preserving all permissions / ACLs
We use the below command to copy data from HDFS to encryption zones.
$ hadoop distcp -skipcrccheck -update -prbugpcaxt /source_directory /destination_directory
Here are the options used above:
-p[rbugpcaxt] : Preserve the selected attributes: r: replication number, b: block size, u: user, g: group, p: permission, c: checksum-type, a: ACL, x: XAttr, t: timestamp
-skipcrccheck : Whether to skip CRC checks between source and target paths. We use this because the target is an encryption zone, where the `crc` will differ from the source even when the copy is correct.
-update : Overwrite if source and destination differ in size, blocksize, or checksum (the checksum comparison is skipped if `skipcrccheck` is used).
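This is not part of the copy itself, but a quick manual spot check I like to do afterwards to confirm the attributes actually made it across (same placeholder paths as above):
$ hdfs dfs -ls /source_directory /destination_directory    # compare owner, group and permission bits
$ hdfs dfs -getfacl /source_directory                      # compare the ACL entries on both sides
$ hdfs dfs -getfacl /destination_directory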