Copy data into Azure Data Lake Storage using DistCp

You can use DistCp to copy data between a general-purpose v2 storage account and a general-purpose v2 storage account that has hierarchical namespace enabled. This article provides instructions on how to use the DistCp tool.

DistCp provides various command-line parameters, and we strongly encourage you to read the DistCp documentation to optimize your usage of them. This article shows basic functionality while focusing on using DistCp to copy data to an account that has a hierarchical namespace enabled.

Prerequisites

To follow the steps in this article, you need an HDInsight Linux cluster, a general-purpose v2 storage account (without hierarchical namespace enabled) to use as the source, and a general-purpose v2 storage account with hierarchical namespace enabled to use as the destination.

Use DistCp from an HDInsight Linux cluster

An HDInsight cluster comes with the DistCp utility, which you can use to copy data from different sources into an HDInsight cluster. If you've configured the HDInsight cluster to use Azure Blob Storage and Azure Data Lake Storage together, you can use the DistCp utility out of the box to copy data between them as well. In this section, we look at how to use the DistCp utility.

  1. Create an SSH session to your HDInsight cluster. For more information, see Connect to a Linux-based HDInsight cluster.

  2. Verify whether you can access your existing general-purpose v2 account (without hierarchical namespace enabled). Run the following command:

hdfs dfs -ls wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/

The output should provide a list of contents in the container.

  3. Similarly, verify whether you can access the storage account with hierarchical namespace enabled from the cluster. Run the following command:

hdfs dfs -ls abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/

The output should provide a list of files and folders in the Data Lake Storage account.

  4. Use DistCp to copy data from Windows Azure Storage Blob (WASB) to a Data Lake Storage account:

hadoop distcp wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder

This command copies the contents of the /example/data/gutenberg/ folder in Blob storage to /myfolder in the Data Lake Storage account.

  5. Similarly, use DistCp to copy data from the Data Lake Storage account back to Blob storage (WASB):

hadoop distcp abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg

This command copies the contents of /myfolder in the Data Lake Storage account to the /example/data/gutenberg/ folder in Blob storage.
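DistCp also accepts flags that control how files already present at the destination are handled. As a minimal sketch (the -update and -overwrite options are standard DistCp flags rather than anything specific to this article; verify their exact behavior in the DistCp documentation for your Hadoop version), the following command copies only the files that are missing or that differ at the destination:

# -update skips files that already match at the destination
hadoop distcp -update wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder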

Performance considerations while using DistCp

Because DistCp's lowest granularity is a single file, setting the maximum number of simultaneous copies is the most important parameter for optimizing it against Data Lake Storage. The number of simultaneous copies is controlled by the number-of-mappers (-m) parameter on the command line, which specifies the maximum number of mappers that are used to copy data. The default value is 20.

Example

hadoop distcp -m 100 wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/example/data/gutenberg abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/myfolder

How do I determine the number of mappers to use?

Here's some guidance that you can use. As a rule of thumb, each mapper consumes one YARN container, so the maximum useful number of mappers is roughly the total YARN memory available to the job divided by the YARN container size.

Example

Let's assume that you have a 4x D14v2s cluster and you're trying to transfer 10 TB of data from 10 different folders. Each of the folders contains varying amounts of data and the file sizes within each folder are different.
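A hedged sketch of the arithmetic, assuming each D14 v2 node provides about 96 GB of YARN memory and the YARN container size is 3,072 MB (check your cluster's actual YARN configuration before relying on these numbers; <source-folder> and <dest-folder> are placeholders):

# Total YARN memory = 4 nodes * 96 GB = 384 GB
# m = total YARN memory in MB / YARN container size in MB
#   = (384 * 1024) / 3072 = 128 mappers
hadoop distcp -m 128 wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<source-folder> abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<dest-folder>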

If other applications are using memory, then you can choose to only use a portion of your cluster's YARN memory for DistCp.
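For example, continuing the hypothetical numbers above, if only half of the cluster's 384 GB of YARN memory is available to DistCp, the mapper count halves as well:

# m = (192 * 1024) / 3072 = 64 mappers
hadoop distcp -m 64 wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/<source-folder> abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<dest-folder>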

Copying large datasets

When the size of the dataset to be moved is large (for example, more than 1 TB) or if you have many different folders, consider using multiple DistCp jobs. There's likely no performance gain, but splitting the work means that if a job fails, you only need to restart that specific job rather than the entire transfer.
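A minimal sketch of this approach, assuming the data is organized into hypothetical top-level folders named folder1 through folder10. The jobs run one at a time, so each can use the full mapper budget, and a failed iteration can be rerun on its own:

# Launch one DistCp job per top-level folder (folder1..folder10 are hypothetical names).
for i in $(seq 1 10); do
  hadoop distcp -m 128 \
    wasbs://<container-name>@<storage-account-name>.blob.core.windows.net/folder$i \
    abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/folder$i
done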

Limitations