Comparing 9b6606f43d...1b4c57fa87 · git/git (original) (raw)

Commits on Apr 16, 2020

  1. revision: complicated pathspecs disable filters
    The changed-path Bloom filters work only when we can compute an
    explicit Bloom filter key in advance. When a pathspec is given
    that allows case-insensitive checks or wildcard matching, we
    must disable the Bloom filter performance checks.
    By checking the pathspec in prepare_to_use_bloom_filters(), we
    avoid setting up the Bloom filter data and thus revert to the
    usual logic.
    Before this change, the following tests would fail*:
    t6004-rev-list-path-optim.sh (Tests 6-7)
    t6130-pathspec-noglob.sh (Tests 3-6)
    t6131-pathspec-icase.sh (Tests 3-5)
    *These tests would fail when using GIT_TEST_COMMIT_GRAPH and
    GIT_TEST_COMMIT_GRAPH_BLOOM_FILTERS except that the latter
    environment variable was not set up correctly to write the changed-
    path Bloom filters in the test suite. That will be fixed in the
    next change.
    Helped-by: Taylor Blau me@ttaylorr.com
    Signed-off-by: Derrick Stolee dstolee@microsoft.com
    Signed-off-by: Junio C Hamano gitster@pobox.com
    @derrickstolee @gitster
    authored andgitster committed
    Apr 16, 2020
    Configuration menu
    Browse the repository at this point in the history
  2. tests: write commit-graph with Bloom filters
    The GIT_TEST_COMMIT_GRAPH environment variable updates the commit-
    graph file whenever "git commit" is run, ensuring that we always
    have an updated commit-graph throughout the test suite. The
    GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS environment variable was
    introduced to write the changed-path Bloom filters whenever "git
    commit-graph write" is run. However, the GIT_TEST_COMMIT_GRAPH
    trick doesn't launch a separate process and instead writes it
    directly.
    To expand the number of tests that have commits in the commit-graph
    file, add a helper method that computes the commit-graph and place
    that helper inside "git commit" and "git merge".
    In the helper method, check GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS
    to ensure we are writing changed-path Bloom filters whenever
    possible.
    Signed-off-by: Derrick Stolee dstolee@microsoft.com
    Signed-off-by: Junio C Hamano gitster@pobox.com
    @derrickstolee @gitster
    Configuration menu
    Browse the repository at this point in the history
  3. blame: use changed-path Bloom filters
    The changed-path Bloom filters help reduce the amount of tree
    parsing required during history queries. Before calculating a
    diff, we can ask the filter if a path changed between a commit
    and its first parent. If the filter says "no" then we can move
    on without parsing trees. If the filter says "maybe" then we
    parse trees to discover if the answer is actually "yes" or "no".
    When computing a blame, there is a section in find_origin() that
    computes a diff between a commit and one of its parents. When this
    is the first parent, we can check the Bloom filters before calling
    diff_tree_oid().
    In order to make this work with the blame machinery, we need to
    initialize a struct bloom_key with the initial path. But also, we
    need to add more keys to a list if a rename is detected. We then
    check to see if any of these keys answer "maybe" in the diff.
    During development, I purposefully left out this "add a new key
    when a rename is detected" to see if the test suite would catch
    my error. That is how I discovered the issues with
    GIT_TEST_COMMIT_GRAPH_CHANGED_PATHS from the previous change.
    With that change, we can feel some confidence in the coverage of
    this change.
    If a user requests copy detection using "git blame -C", then there
    are more places where the set of "important" files can expand. I
    do not know enough about how this happens in the blame machinery.
    Thus, the Bloom filter integration is explicitly disabled in this
    mode. A later change could expand the bloom_key data with an
    appropriate call (or calls) to add_bloom_key().
    If we did not disable this mode, then the following tests would
    fail:
    t8003-blame-corner-cases.sh
    t8011-blame-split-file.sh
    Generally, this is a performance enhancement and should not
    change the behavior of 'git blame' in any way. If a repo has a
    commit-graph file with computed changed-path Bloom filters, then
    they should notice improved performance for their 'git blame'
    commands.
    Here are some example timings that I found by blaming some paths
    in the Linux kernel repository:
    git blame arch/x86/kernel/topology.c >/dev/null
    Before: 0.83s
    After: 0.24s
    git blame kernel/time/time.c >/dev/null
    Before: 0.72s
    After: 0.24s
    git blame tools/perf/ui/stdio/hist.c >/dev/null
    Before: 0.27s
    After: 0.11s
    I specifically looked for "deep" paths that were also edited many
    times. As a counterpoint, the MAINTAINERS file was edited many
    times but is located in the root tree. This means that the cost of
    computing a diff relative to the pathspec is very small. Here are
    the timings for that command:
    git blame MAINTAINERS >/dev/null
    Before: 20.1s
    After: 18.0s
    These timings are the best of five. The worst-case runs were on the
    order of 2.5 minutes for both cases. Note that the MAINTAINERS file
    has 18,740 lines across 17,000+ commits. This happens to be one of
    the cases where this change provides the least improvement.
    The lack of improvement for the MAINTAINERS file and the relatively
    modest improvement for the other examples can be easily explained.
    The blame machinery needs to compute line-level diffs to determine
    which lines were changed by each commit. That makes up a large
    proportion of the computation time, and this change does not
    attempt to improve on that section of the algorithm. The
    MAINTAINERS file is large and changed often, so it takes time to
    determine which lines were updated by which commit. In contrast,
    the code files are much smaller, and it takes longer to comute
    the line-by-line diff for a single patch on the Linux mailing
    lists.
    Outside of the "-C" integration, I believe there is little more to
    gain from the changed-path Bloom filters for 'git blame' after this
    patch.
    Signed-off-by: Derrick Stolee dstolee@microsoft.com
    Signed-off-by: Junio C Hamano gitster@pobox.com
    @derrickstolee @gitster
    Configuration menu
    Browse the repository at this point in the history