When you see this error, the default thing to do is to set the HADOOP_CLIENT_OPTS config in your environment. In my case I made a typo, HADDOP_CLIENT_OPTS, which caused us to spend time chasing other unwanted options. :( This is not an informative post, but more of a note to myself.
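
A quick way to rule out that kind of typo is to confirm the variable is actually set in your current shell before launching the copy (a trivial check; the value shown is just the one used later in this post):

$ echo $HADOOP_CLIENT_OPTS
-Xms4096m -Xmx4096m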

Why does DistCp run out of memory?

If the number of individual files/directories being copied from the source path(s) is extremely large (e.g. 1,000,000 paths), DistCp might run out of memory while determining the list of paths for copy. This is not unique to the new DistCp implementation. To get around this, consider changing the -Xmx JVM heap-size parameters, as follows:

$ export HADOOP_CLIENT_OPTS="-Xms4096m -Xmx4096m"
$ hadoop distcp /source_directory /destination_directory

Copying a dataset while preserving all permissions / ACLs

We use the command below to copy data from HDFS into encryption zones.

$ hadoop distcp -skipcrccheck -update -prbugpcaxt /source_directory /destination_directory

Here are the options used above.

-p[rbugpcaxt] : Preserve r: replication number, b: block size, u: user, g: group, p: permission, c: checksum-type, a: ACL, x: XAttr, t: timestamp.
-skipcrccheck : Skip CRC checks between source and target paths. We skip them when copying into encryption zones because the target `crc` will be different.
-update       : Overwrite if source and destination differ in size, block size, or checksum (the checksum comparison is skipped if `-skipcrccheck` is used).
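
If you want to spot-check that ACLs and extended attributes actually survived the copy, you can compare them on a sample path on both sides after the job finishes. This is just a sanity check (the file name below is hypothetical), not part of the DistCp run itself:

$ hdfs dfs -getfacl /source_directory/some_file
$ hdfs dfs -getfacl /destination_directory/some_file
$ hdfs dfs -getfattr -d /source_directory/some_file
$ hdfs dfs -getfattr -d /destination_directory/some_file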