Git - partial-clone Documentation

The "Partial Clone" feature is a performance optimization for Git that allows Git to function without having a complete copy of the repository. The goal of this work is to allow Git to better handle extremely large repositories.

During clone and fetch operations, Git downloads the complete contents and history of the repository. This includes all commits, trees, and blobs for the complete life of the repository. For extremely large repositories, clones can take hours (or days) and consume 100+GiB of disk space.

Often in these repositories there are many blobs and trees that the user does not need such as:

  1. files outside of the user’s work area in the tree. For example, in a repository with 500K directories and 3.5M files in every commit, we can avoid downloading many objects if the user only needs a narrow "cone" of the source tree.
  2. large binary assets. For example, in a repository where large build artifacts are checked into the tree, we can avoid downloading all previous versions of these non-mergeable binary assets and only download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects in advance during clone and fetch operations and thereby reduce download times and disk usage. Missing objects can later be "demand fetched" if/when needed.
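
For example, a "blob-less" partial clone and a later demand fetch might look like this (the repository URL is a placeholder):

    # Clone with a filter that omits all blobs; commits and trees are
    # still downloaded up front.
    $ git clone --filter=blob:none https://example.com/big-repo.git
    $ cd big-repo

    # Commands that need old blob contents, such as a patch-mode log,
    # trigger on-demand fetches from the promisor remote.
    $ git log -p -- README.md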

A remote that can later provide the missing objects is called a promisor remote, as it promises to send the objects when requested. Initially Git supported only one promisor remote: the origin remote from which the user cloned, which was configured via the "extensions.partialClone" config option. Support for more than one promisor remote was implemented later.

Use of partial clone requires that the user be online and the origin remote or other promisor remotes be available for on-demand fetching of missing objects. This may or may not be problematic for the user. For example, if the user can stay within the pre-selected subset of the source tree, they may not encounter any missing objects. Alternatively, the user could try to pre-fetch various objects if they know that they are going offline.
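
A rough sketch of such pre-fetching, assuming the promisor remote permits fetching individual object IDs (whether it does depends on the server's configuration):

    # Print objects reachable from HEAD whose contents are missing
    # locally; each one appears as '?<oid>'.
    $ git rev-list --objects --missing=print HEAD |
          sed -n 's/^?//p' |
          xargs -r git fetch origin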

Non-Goals

Partial clone is a mechanism to limit the number of blobs and trees downloaded within a given range of commits, and is therefore independent of and not intended to conflict with existing DAG-level mechanisms to limit the set of requested commits (i.e. shallow clone, single branch, or fetch <refspec>).
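
To make the distinction concrete (URL is a placeholder):

    # DAG-level limit: truncate history to a single commit.
    $ git clone --depth=1 https://example.com/repo.git

    # Partial clone: keep the full commit history, but omit blobs
    # until they are actually needed.
    $ git clone --filter=blob:none https://example.com/repo.git

The two mechanisms are orthogonal and can be combined.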

Design Overview

Partial clone logically consists of the following parts:

  1. A mechanism for the client to describe unneeded or unwanted objects to the server.
  2. A mechanism for the server to omit such unwanted objects from packfiles sent to the client.
  3. A mechanism for the client to gracefully handle missing objects (that may be fetched later).
  4. A mechanism for the client to backfill missing objects as needed.
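
The client's description of unwanted objects takes the form of a filter-spec passed to commands such as git clone and git fetch. A few commonly used forms are sketched below (the syntax is real; see the git-rev-list documentation for the complete list):

    --filter=blob:none          # omit all blobs
    --filter=blob:limit=1m      # omit blobs larger than 1 MiB
    --filter=tree:0             # omit all trees and blobs; download
                                #   commits only
    --filter=sparse:oid=<oid>   # omit blobs not matched by the
                                #   sparse-checkout spec stored in
                                #   object <oid>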

Design Details

Handling Missing Objects

Fetching Missing Objects

Using many promisor remotes

Many promisor remotes can be configured and used.

This allows, for example, a user to have multiple geographically close cache servers for fetching missing blobs, while continuing to do filtered git-fetch commands from the central server.

When fetching objects, promisor remotes are tried one after the other until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by the following configuration variables:

  1. extensions.partialClone = <name>
  2. remote.<name>.promisor = true
  3. remote.<name>.partialCloneFilter = <filter-spec>

Only one promisor remote can be configured using the extensions.partialClone config variable. This promisor remote will be the last one tried when fetching objects.

We decided to make it the last one tried because it is likely that someone using many promisor remotes is doing so because the other promisor remotes are better for some reason (perhaps they are closer or faster for some kinds of objects) than the origin, and the origin is likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made; in any case, the long-term plan should be to make the order fully configurable.

For now, though, the other promisor remotes will be tried in the order they appear in the config file.
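
A hypothetical configuration with two promisor remotes (names and URLs invented for illustration) could therefore look like this:

    [core]
        # Extensions require repository format version 1.
        repositoryformatversion = 1
    [extensions]
        # This remote is tried last when fetching missing objects.
        partialClone = origin
    [remote "cache"]
        url = https://cache.example.com/repo.git
        # Tried first, because promisor remotes other than the one in
        # extensions.partialClone are tried in config-file order.
        promisor = true
    [remote "origin"]
        url = https://git.example.com/repo.git
        promisor = true
        partialclonefilter = blob:none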

Current Limitations

Future Work

Non-Tasks

[a] expensive-to-modify list of missing objects: Earlier in the design of partial clone we discussed the need for a single list of missing objects. This would essentially be a sorted linear list of OIDs that were omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup. It would need to be read, updated, and re-written (like the .git/index) on every explicit "git fetch" command and on any dynamic object fetch.

The cost to read, update, and write this file could add significant overhead to every command if there are many missing objects. For example, if there are 100M missing blobs, this file would be at least 2GiB on disk (100M OIDs at 20 bytes each for SHA-1 is already 2GB, before any formatting or sorting overhead).

With the "promisor" concept, we infer a missing object based upon the type of packfile that references it.