Git - partial-clone Documentation

The "Partial Clone" feature is a performance optimization for Git that allows Git to function without having a complete copy of the repository. The goal of this work is to allow Git to better handle extremely large repositories.

During clone and fetch operations, Git downloads the complete contents and history of the repository. This includes all commits, trees, and blobs for the complete life of the repository. For extremely large repositories, clones can take hours (or days) and consume 100+GiB of disk space.

Often in these repositories there are many blobs and trees that the user does not need such as:

  1. files outside of the user’s work area in the tree. For example, in a repository with 500K directories and 3.5M files in every commit, we can avoid downloading many objects if the user only needs a narrow "cone" of the source tree.
  2. large binary assets. For example, in a repository where large build artifacts are checked into the tree, we can avoid downloading all previous versions of these non-mergeable binary assets and only download versions that are actually referenced.

Partial clone allows us to avoid downloading such unneeded objects in advance during clone and fetch operations and thereby reduce download times and disk usage. Missing objects can later be "demand fetched" if/when needed.
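
For example, a "blob-less" partial clone and a later demand fetch might look like this (the repository URL is a placeholder):

    # Clone with a filter that omits all blobs; commits and trees are
    # still downloaded up front.
    $ git clone --filter=blob:none https://example.com/big-repo.git
    $ cd big-repo

    # Commands that need old blob contents, such as a patch-mode log,
    # trigger on-demand fetches from the promisor remote.
    $ git log -p -- README.md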

A remote that can later provide the missing objects is called a promisor remote, as it promises to send the objects when requested. Initially Git supported only one promisor remote: the origin remote from which the user cloned, which was configured via the "extensions.partialClone" config option. Support for more than one promisor remote was implemented later.

Use of partial clone requires that the user be online and the origin remote or other promisor remotes be available for on-demand fetching of missing objects. This may or may not be problematic for the user. For example, if the user can stay within the pre-selected subset of the source tree, they may not encounter any missing objects. Alternatively, the user could try to pre-fetch various objects if they know that they are going offline.
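
A rough sketch of such pre-fetching, assuming the promisor remote permits fetching individual object IDs (whether it does depends on the server's configuration):

    # Print objects reachable from HEAD whose contents are missing
    # locally; each one appears as '?<oid>'.
    $ git rev-list --objects --missing=print HEAD |
          sed -n 's/^?//p' |
          xargs -r git fetch origin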

Non-Goals

Partial clone is a mechanism to limit the number of blobs and trees downloaded within a given range of commits, and is therefore independent of and not intended to conflict with existing DAG-level mechanisms to limit the set of requested commits (i.e. shallow clone, single branch, or fetch <refspec>).
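
To make the distinction concrete (URL is a placeholder):

    # DAG-level limit: truncate history to a single commit.
    $ git clone --depth=1 https://example.com/repo.git

    # Partial clone: keep the full commit history, but omit blobs
    # until they are actually needed.
    $ git clone --filter=blob:none https://example.com/repo.git

The two mechanisms are orthogonal and can be combined.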

Design Overview

Partial clone logically consists of the following parts:

  1. A mechanism for the client to describe unneeded or unwanted objects to the server.
  2. A mechanism for the server to omit such unwanted objects from packfiles sent to the client.
  3. A mechanism for the client to gracefully handle missing objects (that may be fetched later).
  4. A mechanism for the client to backfill missing objects as needed.
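
The client's description of unwanted objects takes the form of a filter-spec passed to commands such as git clone and git fetch. A few commonly used forms are sketched below (the syntax is real; see the git-rev-list documentation for the complete list):

    --filter=blob:none          # omit all blobs
    --filter=blob:limit=1m      # omit blobs larger than 1 MiB
    --filter=tree:0             # omit all trees and blobs; download
                                #   commits only
    --filter=sparse:oid=<oid>   # omit blobs not matched by the
                                #   sparse-checkout spec stored in
                                #   object <oid>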

Design Details

Handling Missing Objects

Fetching Missing Objects

Using many promisor remotes

Many promisor remotes can be configured and used.

This allows, for example, a user to have multiple geographically close cache servers for fetching missing blobs, while continuing to do filtered git-fetch commands from the central server.

When fetching objects, promisor remotes are tried one after the other until all the objects have been fetched.

Remotes that are considered "promisor" remotes are those specified by the following configuration variables:

  1. extensions.partialClone = <name>
  2. remote.<name>.promisor = true
  3. remote.<name>.partialCloneFilter = <filter-spec>

Only one promisor remote can be configured using the extensions.partialClone config variable. This promisor remote will be the last one tried when fetching objects.

We decided to make it the last one tried because it is likely that someone using many promisor remotes is doing so because the other promisor remotes are better for some reason (perhaps they are closer or faster for some kinds of objects) than the origin, and the origin is likely to be the remote specified by extensions.partialClone.

This justification is not very strong, but one choice had to be made; in any case, the long-term plan should be to make the order fully configurable.

For now, though, the other promisor remotes will be tried in the order they appear in the config file.
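
A hypothetical configuration with two promisor remotes (names and URLs invented for illustration) could therefore look like this:

    [core]
        # Extensions require repository format version 1.
        repositoryformatversion = 1
    [extensions]
        # This remote is tried last when fetching missing objects.
        partialClone = origin
    [remote "cache"]
        url = https://cache.example.com/repo.git
        # Tried first, because promisor remotes other than the one in
        # extensions.partialClone are tried in config-file order.
        promisor = true
    [remote "origin"]
        url = https://git.example.com/repo.git
        promisor = true
        partialclonefilter = blob:none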

Current Limitations

Future Work

Non-Tasks

[a] expensive-to-modify list of missing objects: Earlier in the design of partial clone we discussed the need for a single list of missing objects. This would essentially be a sorted linear list of OIDs that were omitted by the server during a clone or subsequent fetches.

This file would need to be loaded into memory on every object lookup. It would need to be read, updated, and re-written (like the .git/index) on every explicit "git fetch" command and on any dynamic object fetch.

The cost to read, update, and write this file could add significant overhead to every command if there are many missing objects. For example, if there are 100M missing blobs, this file would be at least 2GiB on disk (100M OIDs at 20 bytes each for SHA-1 is already 2GB, before any formatting or sorting overhead).

With the "promisor" concept, we infer a missing object based upon the type of packfile that references it.