Apache Hadoop 3.4.1 – Introduction (original) (raw)

This document defines the required behaviors of a Hadoop-compatible filesystem for implementors and maintainers of the Hadoop filesystem, and for users of the Hadoop FileSystem APIs

Most of the Hadoop operations are tested against HDFS in the Hadoop test suites, initially through MiniDFSCluster, before release by vendor-specific ‘production’ tests, and implicitly by the Hadoop stack above it.

HDFS’s actions have been modeled on POSIX filesystem behavior, using the actions and return codes of Unix filesystem actions as a reference. Even so, there are places where HDFS diverges from the expected behaviour of a POSIX filesystem.

The bundled S3A FileSystem clients make Amazon’s S3 Object Store (“blobstore”) accessible through the FileSystem API. The Azure ABFS, WASB and ADL object storage FileSystems talks to Microsoft’s Azure storage. All of these bind to object stores, which do have different behaviors, especially regarding consistency guarantees, and atomicity of operations.

The “Local” FileSystem provides access to the underlying filesystem of the platform. Its behavior is defined by the operating system and can behave differently from HDFS. Examples of local filesystem quirks include case-sensitivity, action when attempting to rename a file atop another file, and whether it is possible to seek() past the end of the file.

There are also filesystems implemented by third parties that assert compatibility with Apache Hadoop. There is no formal compatibility suite, and hence no way for anyone to declare compatibility except in the form of their own compatibility tests.

These documents do not attempt to provide a normative definition of compatibility. Passing the associated test suites does not guarantee correct behavior of applications.

What the test suites do define is the expected set of actions—failing these tests will highlight potential issues.

By making each aspect of the contract tests configurable, it is possible to declare how a filesystem diverges from parts of the standard contract. This is information which can be conveyed to users of the filesystem.

Naming

This document follows RFC 2119 rules regarding the use of MUST, MUST NOT, MAY, and SHALL. MUST NOT is treated as normative.

Implicit assumptions of the Hadoop FileSystem APIs

The original FileSystem class and its usages are based on an implicit set of assumptions. Chiefly, that HDFS is the underlying FileSystem, and that it offers a subset of the behavior of a POSIX filesystem (or at least the implementation of the POSIX filesystem APIs and model provided by Linux filesystems).

Irrespective of the API, it’s expected that all Hadoop-compatible filesystems present the model of a filesystem implemented in Unix:

Path Names

Security Assumptions

Except in the special section on security, this document assumes the client has full access to the FileSystem. Accordingly, the majority of items in the list do not add the qualification “assuming the user has the rights to perform the operation with the supplied parameters and paths”.

The failure modes when a user lacks security permissions are not specified.

Networking Assumptions

This document assumes that all network operations succeed. All statements can be assumed to be qualified as “assuming the operation does not fail due to a network availability problem”

Core Expectations of a Hadoop Compatible FileSystem

Here are the core expectations of a Hadoop-compatible FileSystem. Some FileSystems do not meet all these expectations; as a result, some programs may not work as expected.

Atomicity

There are some operations that MUST be atomic. This is because they are often used to implement locking/exclusive access between processes in a cluster.

  1. Creating a file. If the overwrite parameter is false, the check and creation MUST be atomic.
  2. Deleting a file.
  3. Renaming a file.
  4. Renaming a directory.
  5. Creating a single directory with mkdir().

Most other operations come with no requirements or guarantees of atomicity.

Consistency

The consistency model of a Hadoop FileSystem is one-copy-update-semantics; that of a traditional local POSIX filesystem. Note that even NFS relaxes some constraints about how fast changes propagate.

Concurrency

There are no guarantees of isolated access to data: if one client is interacting with a remote file and another client changes that file, the changes may or may not be visible.

Operations and failures

Undefined capacity limits

Here are some limits to FileSystem capacity that have never been explicitly defined.

  1. The maximum number of files in a directory.
  2. Max number of directories in a directory
  3. Maximum total number of entries (files and directories) in a filesystem.
  4. The maximum length of a filename under a directory (HDFS: 8000).
  5. MAX_PATH - the total length of the entire directory tree referencing a file. Blobstores tend to stop at ~1024 characters.
  6. The maximum depth of a path (HDFS: 1000 directories).
  7. The maximum size of a single file.

Undefined timeouts

Timeouts for operations are not defined at all, including:

The blocking-operation timeout is in fact variable in HDFS, as sites and clients may tune the retry parameters so as to convert filesystem failures and failovers into pauses in operation. Instead there is a general assumption that FS operations are “fast but not as fast as local FS operations”, and that the latency of data reads and writes scale with the volume of data. This assumption by client applications reveals a more fundamental one: that the filesystem is “close” as far as network latency and bandwidth is concerned.

There are also some implicit assumptions about the overhead of some operations.

  1. seek() operations are fast and incur little or no network delays. [This does not hold on blob stores]
  2. Directory list operations are fast for directories with few entries.
  3. Directory list operations are fast for directories with few entries, but may incur a cost that is O(entries). Hadoop 2 added iterative listing to handle the challenge of listing directories with millions of entries without buffering at the cost of consistency.
  4. A close() of an OutputStream is fast, irrespective of whether or not the file operation has succeeded or not.
  5. The time to delete a directory is independent of the size of the number of child entries

Object Stores vs. Filesystems

This specification refers to Object Stores in places, often using the term Blobstore. Hadoop does provide FileSystem client classes for some of these even though they violate many of the requirements.

Consult the documentation for a specific store to determine its compatibility with specific applications and services.

What is an Object Store?

An object store is a data storage service, usually accessed over HTTP/HTTPS. A PUT request uploads an object/“Blob”; a GET request retrieves it; ranged GET operations permit portions of a blob to retrieved. To delete the object, the HTTP DELETE operation is invoked.

Objects are stored by name: a string, possibly with “/” symbols in them. There is no notion of a directory; arbitrary names can be assigned to objects — within the limitations of the naming scheme imposed by the service’s provider.

The object stores invariably provide an operation to retrieve objects with a given prefix; a GET operation on the root of the service with the appropriate query parameters.

Object stores usually prioritize availability —there is no single point of failure equivalent to the HDFS NameNode(s). They also strive for simple non-POSIX APIs: the HTTP verbs are the operations allowed.

Hadoop FileSystem clients for object stores attempt to make the stores pretend that they are a FileSystem, a FileSystem with the same features and operations as HDFS. This is —ultimately—a pretence: they have different characteristics and occasionally the illusion fails.

  1. Consistency. Object may be Eventually Consistent: it can take time for changes to objects —creation, deletion and updates— to become visible to all callers. Indeed, there is no guarantee a change is immediately visible to the client which just made the change. As an example, an object test/data1.csv may be overwritten with a new set of data, but when a GET test/data1.csv call is made shortly after the update, the original data returned. Hadoop assumes that filesystems are consistent; that creation, updates and deletions are immediately visible, and that the results of listing a directory are current with respect to the files within that directory.
  2. Atomicity. Hadoop assumes that directory rename() operations are atomic, as are delete() operations. Object store FileSystem clients implement these as operations on the individual objects whose names match the directory prefix. As a result, the changes take place a file at a time, and are not atomic. If an operation fails part way through the process, then the state of the object store reflects the partially completed operation. Note also that client code assumes that these operations are O(1) —in an object store they are more likely to be O(child-entries).
  3. Durability. Hadoop assumes that OutputStream implementations write data to their (persistent) storage on a flush() operation. Object store implementations save all their written data to a local file, a file that is then only PUT to the object store in the final close() operation. As a result, there is never any partial data from incomplete or failed operations. Furthermore, as the write process only starts in close() operation, that operation may take a time proportional to the quantity of data to upload, and inversely proportional to the network bandwidth. It may also fail —a failure that is better escalated than ignored.
  4. Authorization. Hadoop uses the FileStatus class to represent core metadata of files and directories, including the owner, group and permissions. Object stores might not have a viable way to persist this metadata, so they might need to populate FileStatus with stub values. Even if the object store persists this metadata, it still might not be feasible for the object store to enforce file authorization in the same way as a traditional file system. If the object store cannot persist this metadata, then the recommended convention is:
    • File owner is reported as the current user.
    • File group also is reported as the current user.
    • Directory permissions are reported as 777.
    • File permissions are reported as 666.
    • File system APIs that set ownership and permissions execute successfully without error, but they are no-ops.

Object stores with these characteristics, can not be used as a direct replacement for HDFS. In terms of this specification, their implementations of the specified operations do not match those required. They are considered supported by the Hadoop development community, but not to the same extent as HDFS.

Timestamps

FileStatus entries have a modification time and an access time.

  1. The exact behavior as to when these timestamps are set and whether or not they are valid varies between filesystems, and potentially between individual installations of a filesystem.
  2. The granularity of the timestamps is again, specific to both a filesystem and potentially individual installations.

The HDFS filesystem does not update the modification time while it is being written to.

Specifically

Other filesystems may have different behaviors. In particular,

Object stores have an even vaguer view of time, which can be summarized as “it varies”.

Finally, note that the Apache Hadoop project cannot make any guarantees about whether the timestamp behavior of a remote object store will remain consistent over time: they are third-party services, usually accessed via third-party libraries.

The best strategy here is “experiment with the exact endpoint you intend to work with”. Furthermore, if you intend to use any caching/consistency layer, test with that feature enabled. Retest after updates to Hadoop releases, and endpoint object store updates.