Apache Hadoop 3.4.1 – HDFS Erasure Coding

Purpose

Replication is expensive – the default 3x replication scheme in HDFS has 200% overhead in storage space and other resources (e.g., network bandwidth). However, for warm and cold datasets with relatively low I/O activities, additional block replicas are rarely accessed during normal operations, but still consume the same amount of resources as the first replica.

Therefore, a natural improvement is to use Erasure Coding (EC) in place of replication, which provides the same level of fault-tolerance with much less storage space. In typical Erasure Coding (EC) setups, the storage overhead is no more than 50%. The replication factor of an EC file is meaningless; it is always 1 and cannot be changed via the -setrep command.

Background

In storage systems, the most notable usage of EC is Redundant Array of Inexpensive Disks (RAID). RAID implements EC through striping, which divides logically sequential data (such as a file) into smaller units (such as bit, byte, or block) and stores consecutive units on different disks. In the rest of this guide this unit of striping distribution is termed a striping cell (or cell). For each stripe of original data cells, a certain number of parity cells are calculated and stored – the process of which is called encoding. The error on any striping cell can be recovered through decoding calculation based on surviving data and parity cells.

Integrating EC with HDFS can improve storage efficiency while still providing similar data durability as traditional replication-based HDFS deployments. As an example, a 3x replicated file with 6 blocks will consume 6*3 = 18 blocks of disk space. But with EC (6 data, 3 parity) deployment, it will only consume 9 blocks of disk space.

Architecture

In the context of EC, striping has several critical advantages. First, it enables online EC (writing data immediately in EC format), avoiding a conversion phase and immediately saving storage space. Online EC also enhances sequential I/O performance by leveraging multiple disk spindles in parallel; this is especially desirable in clusters with high-end networking. Second, it naturally distributes a small file to multiple DataNodes and eliminates the need to bundle multiple files into a single coding group. This greatly simplifies file operations such as deletion, quota reporting, and migration between federated namespaces.

In typical HDFS clusters, small files can account for over 3/4 of total storage consumption. To better support small files, in this first phase of work HDFS supports EC with striping. In the future, HDFS will also support a contiguous EC layout. See the design doc and discussion on HDFS-7285 for more information.

Deployment

Cluster and hardware configuration

Erasure coding places additional demands on the cluster in terms of CPU and network.

Encoding and decoding work consumes additional CPU on both HDFS clients and DataNodes.

Erasure coding requires at least as many DataNodes in the cluster as the configured EC stripe width (data blocks plus parity blocks). For EC policy RS (6,3), this means a minimum of 9 DataNodes.

Erasure coded files are also spread across racks for rack fault-tolerance. This means that when reading and writing striped files, most operations are off-rack. Network bisection bandwidth is thus very important.

For rack fault-tolerance, it is also important to have a sufficient number of racks, so that on average each rack holds no more blocks than the number of EC parity blocks. A formula to calculate this is (data blocks + parity blocks) / parity blocks, rounded up. For EC policy RS (6,3), this means a minimum of 3 racks (calculated by (6 + 3) / 3 = 3), and ideally 9 or more to handle planned and unplanned outages. For clusters with fewer racks than the number of parity cells, HDFS cannot maintain rack fault-tolerance, but will still attempt to spread a striped file across multiple nodes to preserve node-level fault-tolerance. For this reason, it is recommended to set up racks with similar numbers of DataNodes.

Configuration keys

By default, all built-in erasure coding policies are disabled, except for the one defined in dfs.namenode.ec.system.default.policy, which is enabled by default. The cluster administrator can enable a set of policies through the hdfs ec [-enablePolicy -policy <policyName>] command, based on the size of the cluster and the desired fault-tolerance properties. For instance, for a cluster with 9 racks, a policy like RS-10-4-1024k will not preserve rack-level fault-tolerance, and RS-6-3-1024k or RS-3-2-1024k might be more appropriate. If the administrator only cares about node-level fault-tolerance, RS-10-4-1024k is still appropriate as long as there are at least 14 DataNodes in the cluster.
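For example, to enable a policy and check whether the cluster topology can support it (RS-6-3-1024k is shown purely as an illustration):

    hdfs ec -enablePolicy -policy RS-6-3-1024k
    hdfs ec -verifyClusterSetup -policy RS-6-3-1024k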

A system default EC policy can be configured via the ‘dfs.namenode.ec.system.default.policy’ property; by default, it is set to “RS-6-3-1024k”. This default policy is used when no policy name is passed as an argument to the ‘-setPolicy’ command.
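As a minimal sketch, the default policy can be set in hdfs-site.xml as follows (the snippet restates the documented default; substitute a policy suited to your cluster):

    <property>
      <name>dfs.namenode.ec.system.default.policy</name>
      <value>RS-6-3-1024k</value>
    </property>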

The codec implementations for Reed-Solomon and XOR can be configured with the following client and DataNode configuration keys: io.erasurecode.codec.rs.rawcoders for the default RS codec, io.erasurecode.codec.rs-legacy.rawcoders for the legacy RS codec, and io.erasurecode.codec.xor.rawcoders for the XOR codec. A user can also configure a self-defined codec with a key of the form io.erasurecode.codec.self-defined-codec.rawcoders. The value of each key is a list of coder names with a fall-back mechanism: the codec factories are loaded in the order specified by the configuration value, until a codec is loaded successfully. All of these codecs have pure Java implementations. The default RS and XOR codecs also have native implementations that leverage the Intel ISA-L library to improve performance, and their default configurations prefer the native implementation over the pure Java one; there is no native RS-LEGACY implementation, so its default is pure Java only. Please refer to the section “Enable Intel ISA-L” for more detail.
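As an illustration, a fall-back list for the default RS codec might look like the following (this assumes the coder names rs_native and rs_java registered by the built-in factories, and the usual core-site.xml placement for io.* keys):

    <property>
      <name>io.erasurecode.codec.rs.rawcoders</name>
      <value>rs_native,rs_java</value>
    </property>

With this value, the native ISA-L-backed coder is tried first, and the pure Java coder is used if the native one fails to load.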

Erasure coding background recovery work on the DataNodes can also be tuned via the following configuration parameters (a sample configuration sketch follows the list):

  1. dfs.datanode.ec.reconstruction.stripedread.timeout.millis - Timeout for striped reads. Default value is 5000 ms.
  2. dfs.datanode.ec.reconstruction.stripedread.buffer.size - Buffer size for reader service. Default value is 64KB.
  3. dfs.datanode.ec.reconstruction.threads - Number of threads used by the DataNode for background reconstruction work. Default value is 8 threads.
  4. dfs.datanode.ec.reconstruction.xmits.weight - Relative weight of xmits used by the EC background recovery task compared to replicated block recovery. Default value is 0.5. Setting it to 0 disables weight calculation for EC recovery tasks; that is, an EC task always has 1 xmit. The xmits of an erasure coding recovery task are calculated as the maximum of the number of read streams and the number of write streams. For example, if an EC recovery task needs to read from 6 nodes and write to 2 nodes, it has xmits of max(6, 2) * 0.5 = 3. A recovery task for a replicated file always counts as 1 xmit. When scheduling recovery tasks to a DataNode, the NameNode subtracts the DataNode's total xmitsInProgress, which combines the xmits of replicated-file and EC recovery tasks, from dfs.namenode.replication.max-streams.
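A minimal hdfs-site.xml sketch restating the defaults listed above; on DataNodes with spare CPU, dfs.datanode.ec.reconstruction.threads is a common knob to raise:

    <property>
      <name>dfs.datanode.ec.reconstruction.stripedread.timeout.millis</name>
      <value>5000</value>
    </property>
    <property>
      <name>dfs.datanode.ec.reconstruction.threads</name>
      <value>8</value>
    </property>
    <property>
      <name>dfs.datanode.ec.reconstruction.xmits.weight</name>
      <value>0.5</value>
    </property>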

Enable Intel ISA-L

The HDFS native implementation of the default RS codec leverages the Intel ISA-L library to improve the encoding and decoding calculations. To enable and use Intel ISA-L, there are three steps (a sample build command follows the list).

  1. Build the ISA-L library. Please refer to the official site “https://github.com/01org/isa-l/” for detailed information.
  2. Build Hadoop with ISA-L support. Please refer to the “Intel ISA-L build options” section of “Build instructions for Hadoop” in BUILDING.txt in the source code.
  3. Use -Dbundle.isal to copy the contents of the isal.lib directory into the final tar file. Deploy Hadoop with the tar file. Make sure ISA-L is available on HDFS clients and DataNodes.
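As a sketch, a build with ISA-L support bundled might look like the following (the isal.prefix and isal.lib paths are placeholders for wherever the library is installed on the build machine):

    mvn clean package -Pdist,native -DskipTests -Dtar \
      -Drequire.isal -Disal.prefix=/usr -Disal.lib=/usr/lib64 -Dbundle.isal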

To verify that ISA-L is correctly detected by Hadoop, run the hadoop checknative command.
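For example (illustrative output; library paths vary by system), a successful detection shows an ISA-L line similar to:

    $ hadoop checknative
    Native library checking:
    hadoop:  true /opt/hadoop/lib/native/libhadoop.so.1.0.0
    ...
    ISA-L:   true /lib64/libisal.so.2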

Administrative commands

HDFS provides an ec subcommand to perform administrative commands related to erasure coding.

hdfs ec [generic options]
     [-setPolicy -path <path> [-policy <policyName>] [-replicate]]
     [-getPolicy -path <path>]
     [-unsetPolicy -path <path>]
     [-listPolicies]
     [-addPolicies -policyFile <file>]
     [-listCodecs]
     [-enablePolicy -policy <policyName>]
     [-disablePolicy -policy <policyName>]
     [-removePolicy -policy <policyName>]
     [-verifyClusterSetup [-policy <policy>...<policy>]]
     [-help [cmd ...]]

Below are the details about each command.
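For example, a typical sequence for applying a policy to a directory might be (the path /data/warm is illustrative):

    hdfs ec -setPolicy -path /data/warm -policy RS-6-3-1024k
    hdfs ec -getPolicy -path /data/warm
    hdfs ec -listPolicies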

Limitations

Certain HDFS operations, namely hflush, hsync, concat, setReplication, truncate, and append, are not supported on erasure coded files due to substantial technical challenges.

A client can use the StreamCapabilities API to query whether an OutputStream supports hflush() and hsync(). If the client desires data persistence via hflush() and hsync(), the current remedy is to create such files as regular 3x replication files in a non-erasure-coded directory, or to use the FSDataOutputStreamBuilder#replicate() API to create 3x replication files in an erasure-coded directory.
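A minimal Java sketch of this pattern (the path /ecdir/app.log and the class name are illustrative, and a standard HDFS client classpath is assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.StreamCapabilities;

    public class EcDurabilityCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/ecdir/app.log"); // hypothetical path inside an EC directory

        // Request a regular replicated file even though the parent directory
        // has an erasure coding policy, so that hflush()/hsync() work.
        FSDataOutputStream out = fs.createFile(path)
            .replicate()
            .build();

        // Check the stream's capabilities before relying on durability calls.
        if (out.hasCapability(StreamCapabilities.HFLUSH)) {
          out.writeBytes("checkpoint\n");
          out.hflush(); // flushes data out to the DataNodes
        }
        out.close();
        fs.close();
      }
    }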