Welcome to the 52nd edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the month of May 2019.
I made a flame graph renderer for git’s trace2 output
Ævar Arnfjörð Bjarmason sent an email saying he developed a script that uses the FlameGraph tool to generate a picture showing where Git’s test suite spends its time.
His script also uses the new Git Trace2 API, which was developed mostly by Jeff Hostetler, starting nearly one year ago, through several versions (RFC, V1, V2, V3, V4, V5), and which was eventually released in Git v2.22.0 at the beginning of June 2019.
Ævar added that he plans to improve his script over time and maybe submit it in a pull request to the FlameGraph tool, or perhaps integrate it in the Git test suite.
Derrick Stolee, who prefers to be called just Stolee, replied that he liked the idea and sent the commands using Ævar’s script that he used to create another picture from a much smaller test.
Gábor Szeder commented on Stolee’s commands to ask why GIT_TR2 instead of GIT_TRACE2 was used in the environment variables related to the Trace2 API. Gábor referred to Ken Thompson “who (allegedly?) later regretted spelling creat()/O_CREAT without the ‘e’…”.
Jeff King, alias Peff, replied to Ævar’s initial email asking “doesn’t perf record -g make test already give us that granular data?”, referring to the Linux perf tool which is already supported by the FlameGraph tool. Peff also wondered about the usefulness of such a graph:
But having generated such a flamegraph, it’s not all that helpful. It mainly tells us that we spend a lot of time on fork/exec. Which is no surprise, since the test suite is geared not towards heavy workloads, but lots of tiny functionality tests.
Though he agreed that it could help in some cases:
I think the trace2 flamegraph would be most useful if you were collecting across a broad spectrum of workloads done by a user. You can do that with perf or similar tools, but it can be a bit awkward.
Ævar replied that his “actual use-case for this is to see what production nodes are spending their time on, similar to what Microsoft is doing with their use of this facility”, and that he used the test suite because it is a good way to test his script and the Trace2 API as “we’re pretty much guaranteed to run all our commands, and cover a lot of unusual cases”.
Ævar pointed out that his work “shows that we’ve got a long way to go in improving the trace2 facility, i.e. adding region enter/leave for some of the things we spend the most time on.”
Jeff Hostetler, who authored the Trace2 API and works for Microsoft along with Stolee, then replied “Very Nice!” to Ævar and agreed with him about the work still needed “to get more granular data for interesting/problematic things”.
Ævar and Jeff then discussed this future work further in a few emails.
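To make the region enter/leave idea concrete, here is a minimal sketch (it is not Ævar’s script) of how Trace2 region events could be folded into the “folded stack” lines that the FlameGraph tool’s flamegraph.pl consumes. It assumes the JSON event target (GIT_TRACE2_EVENT) and its region_enter/region_leave events with sid, thread, category, label and t_rel fields; the function name fold_regions is hypothetical.

```python
import json
from collections import defaultdict


def fold_regions(lines):
    """Fold Trace2 region events into FlameGraph folded-stack lines.

    Each output line is "outer;inner <microseconds>", where the number is
    the region's self time (its elapsed time minus that of nested regions).
    """
    # (sid, thread) -> list of [label, microseconds spent in nested regions]
    open_regions = defaultdict(list)
    totals = defaultdict(int)  # "outer;inner" -> self time in microseconds

    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        stack = open_regions[(event.get("sid"), event.get("thread"))]
        kind = event.get("event")

        if kind == "region_enter":
            label = f"{event.get('category')}:{event.get('label')}"
            stack.append([label, 0])
        elif kind == "region_leave" and stack:
            path = ";".join(label for label, _ in stack)
            _, nested_usec = stack.pop()
            # Assumption: t_rel is the region's elapsed time in seconds.
            usec = round(float(event.get("t_rel", 0.0)) * 1_000_000)
            totals[path] += max(usec - nested_usec, 0)
            if stack:
                # Charge this region's time to its parent as nested time.
                stack[-1][1] += usec

    return [f"{path} {usec}" for path, usec in sorted(totals.items())]
```

The returned lines can be written to a file and passed to flamegraph.pl. Regions that are still open when the input ends are ignored in this sketch.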
Jeff also replied to Gábor that he was fine with changing TR2 to TRACE2, and Gábor sent two patches to get that change done. These patches were agreed on and merged before Git v2.22.0 was released on June 7th, 2019.
[PATCH 00/17] [RFC] Commit-graph: Write incremental files
The road to an incremental serialized commit-graph started with an attempt to create commit-graph file format v2:
The commit-graph file format has some shortcomings that were discussed on-list:
1. It doesn’t use the 4-byte format ID from the_hash_algo.
2. There is no way to change the reachability index from generation numbers to corrected commit date.
3. The unused byte in the format could be used to signal the file is incremental, but current clients ignore the value even if it is non-zero.
This series adds a new version (2) to the commit-graph file. The fifth byte already specified the file format, so existing clients will gracefully respond to files with a different version number. The only real change now is that the header takes 12 bytes instead of 8, due to using the 4-byte format ID for the hash algorithm.
(Note that switching to corrected commit date as generation number v2 was covered in Git Rev News edition 45, November 2018).
However, the statement “existing clients will gracefully respond to files with a different version number” turned out not to be true. Ævar Arnfjörð Bjarmason noticed that older Git versions respond to a v2 commit-graph with a hard error, instead of simply turning the serialized commit-graph feature off in that case.
[…] writing a v2 file would make most things (e.g. “status”) hard error on e.g. v2.21.0:
$ git status
error: graph version 2 does not match version 1
$
Now as noted in my series we now on ‘master’ downgrade that to a warning (along with the rest of the errors):
$ ~/g/git/git --exec-path=$PWD status
error: commit-graph version 2 does not match version 1
On branch master
[...]
…and this series sets the default version for all new graphs to v2.
I think this is way too aggressive of an upgrade path. If these patches go into v2.22.0 then git clients on all older versions that grok the commit graph (IIRC v2.18 and above) will have their git completely broken if they’re in a mixed-git-version environment.
The workaround is easy: removing .git/objects/info/commit-graph, or using “git -c core.commitGraph=false ...”. However, it is not possible to e.g. add advice describing the workaround to past Git versions (and new versions would simply not fail hard on v2).
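The behaviour asked for above (ignore an unknown commit-graph version instead of failing hard) can be sketched as follows. This is an illustration, not Git’s actual code: the function name is hypothetical, and the 8-byte v1 header layout (4-byte “CGPH” signature, then one byte each for the file version, the hash version, the number of chunks, and a currently unused byte) is an assumption based on the commit-graph format documentation.

```python
import struct

GRAPH_SIGNATURE = b"CGPH"   # assumed 4-byte signature of commit-graph files
SUPPORTED_VERSION = 1       # the only version this sketch understands


def read_graph_header(data):
    """Parse a v1 commit-graph header.

    Returns a dict of header fields, or None when the file should simply
    be ignored (too short, wrong signature, or unknown version), so that
    the caller can fall back to walking commits without the commit-graph.
    """
    if len(data) < 8:
        return None
    signature, version, hash_version, num_chunks, unused = struct.unpack(
        ">4sBBBB", data[:8]
    )
    if signature != GRAPH_SIGNATURE:
        return None
    if version != SUPPORTED_VERSION:
        # Graceful degradation: do not raise, just disable the feature.
        return None
    return {
        "version": version,
        "hash_version": hash_version,
        "num_chunks": num_chunks,
        "unused": unused,
    }
```

Returning None rather than raising keeps commands such as “git status” working in a mixed-version environment, at the cost of losing the speed-up until the file is rewritten.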
It turned out that there is no need to introduce a new commit-graph file format to achieve [almost] all of the stated goals. Goal 1.) turned out not to be important:
Let’s just live with “1” as the marker for SHA-1.
Yeah it would be cute to use 0x73686131 instead like “struct git_hash_algo”, but we can live with a 1=0x73686131 (“sha1”), 2=0x73323536 (“s256”) mapping somewhere. It’s not like we’re going to be running into the 255 limit of hash algorithms Git will support any time soon.
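The two 4-byte values quoted above are simply ASCII strings read as big-endian integers, which a short sketch can confirm. The mapping from the one-byte marker to the format ID follows the quote; the names used here are hypothetical.

```python
# One-byte hash marker in the commit-graph header -> 4-byte format ID,
# following the mapping suggested in the quote above.
HASH_VERSION_TO_FORMAT_ID = {
    1: 0x73686131,  # "sha1"
    2: 0x73323536,  # "s256"
}


def format_id_name(format_id):
    """Decode a 4-byte format ID into its ASCII spelling."""
    return format_id.to_bytes(4, "big").decode("ascii")


def format_id_from_name(name):
    """Encode a 4-character ASCII name as a big-endian 32-bit integer."""
    return int.from_bytes(name.encode("ascii"), "big")
```

For example, format_id_name(HASH_VERSION_TO_FORMAT_ID[1]) gives "sha1".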
For 2.), Stolee noticed that generation number v2 (corrected commit date) can be made backward compatible:
Since we can make the “corrected commit date” offset for a commit be strictly larger than the offset of a parent, we can make it so an old client will not give incorrect values when we use the new values. The only downside would be that we would fail on ‘git commit-graph verify’ since the offsets are not actually generation numbers in all cases.
This is discussed in a bit more detail in the “Re: Revision walking, commit dates, slop” thread.
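One way to read the backward-compatible scheme is sketched below. It is an interpretation, not the proposed implementation: each commit stores an offset such that commit date plus offset (the corrected commit date) is strictly larger than every parent’s corrected date, and the offset itself is strictly larger than every parent’s offset, so an old client that treats the stored value as a generation number still sees it grow from parent to child. Starting root commits at offset 1 and the function name are assumptions made for this illustration.

```python
def compatible_offsets(commits):
    """Compute backward-compatible corrected-commit-date offsets.

    `commits` maps a commit name to (commit_date, [parent names]) and must
    list parents before their children (dicts keep insertion order).
    Returns (offsets, corrected_dates) as two dicts keyed by commit name.
    """
    offsets = {}
    corrected = {}
    for name, (date, parents) in commits.items():
        offset = 1  # assumption: roots get 1, like generation numbers
        for parent in parents:
            offset = max(
                offset,
                offsets[parent] + 1,           # old clients: value grows
                corrected[parent] + 1 - date,  # new clients: date grows
            )
        offsets[name] = offset
        corrected[name] = date + offset
    return offsets, corrected
```

With a history where commit B has an earlier commit date than its parent A (clock skew), B’s offset is pushed up until its corrected date passes A’s, and every descendant of B then carries an offset at least one larger.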
The issue of an incremental commit-graph file, i.e. 3.), turned out to be better solved by keeping the base as a backward-compatible commit-graph, with deltas as separate files. Thus “Create commit-graph file format v2” turned into “Commit-graph write refactor”, and Stolee started a separate “[RFC] Commit-graph: Write incremental files” thread.
The original idea was to store subsequent deltas (incremental additions to serialized commit-graph data) in files named commit-graph-2, commit-graph-3, etc. After an involved discussion that considered concurrency problems, the design changed in V6 to use commit-graphs/graph-{hash}.graph files together with a commit-graph-chain index file.
The commit-graph is a valuable performance feature for repos with large commit histories, but suffers from the same problem as git repack: it rewrites the entire file every time. This can be slow when there are millions of commits, especially after we stopped reading from the commit-graph file during a write in 43d3561 (commit-graph write: don’t die if the existing graph is corrupt).
Instead, create a “chain” of commit-graphs in the .git/objects/info/commit-graphs folder with name graph-{hash}.graph. The list of hashes is given by the commit-graph-chain file, and also in a “base graph chunk” in the commit-graph format. As we read a chain, we can verify that the hashes match the trailing hash of each commit-graph we read along the way and each hash below a level is expected by that graph file.
When writing, we don’t always want to add a new level to the stack. This would eventually result in performance degradation, especially when searching for a commit (before we know its graph position). We decide to merge levels of the stack when the new commits we will write is less than half of the commits in the level above. This can be tweaked by the --size-multiple and --max-commits options.
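The chain check described in the quote can be sketched roughly as follows. This is a simplified illustration, not Git’s reader: it assumes that the commit-graph-chain file sits in the same folder as the graph files and lists one hash per line, that each graph file ends with a 20-byte SHA-1 of the preceding contents, and that this trailing hash is the {hash} in the file name. It does not inspect the “base graph chunk”, and the function name is hypothetical.

```python
import hashlib
from pathlib import Path


def verify_chain(graph_dir):
    """Check every graph file listed in commit-graph-chain.

    Returns a list of problem descriptions; an empty list means each
    listed file exists and its trailing hash matches the chain entry.
    """
    graph_dir = Path(graph_dir)
    chain = (graph_dir / "commit-graph-chain").read_text().split()
    problems = []
    for graph_hash in chain:
        path = graph_dir / f"graph-{graph_hash}.graph"
        if not path.is_file():
            problems.append(f"{graph_hash}: missing file")
            continue
        data = path.read_bytes()
        body, trailer = data[:-20], data[-20:]
        if trailer.hex() != graph_hash:
            problems.append(f"{graph_hash}: trailing hash differs")
        elif hashlib.sha1(body).hexdigest() != graph_hash:
            problems.append(f"{graph_hash}: checksum mismatch")
    return problems
```

A reader built along these lines could stop using the chain at the first reported problem and fall back to the levels that did verify.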
This series, as the ‘ds/commit-graph-incremental’ branch, is currently marked as ready to be merged into ‘next’.
Who are you and what do you do?
My name is Jeff Hostetler and I work for Microsoft. I’ve been working on Git and on Git-related technologies for the last 5 years. Primarily focusing on performance and scale.
Prior to joining Microsoft, I worked for SourceGear and built the Veracity DVCS and the DiffMerge visual diff and merge tool.
A long, long time ago I was Architect for Spyglass Mosaic.
What would you name your most important contribution to Git?
I’d have to say the Trace2 facility that is now in v2.22. With this in place, it will be much easier to understand performance bottlenecks at scale.
Second to that would be the beginnings of the Partial Clone feature. There’s still a lot of work to do in this area, but I think long term, it will be central to solving certain enterprise-level scale problems.
What are you doing on the Git project these days, and why?
I’m currently working on a series of blog posts explaining Trace2 and how it can be used to measure and track Git performance.
Within Microsoft we continue to study Trace2 data generated by our Windows and Office developers and look for opportunities to improve the developer experience, such as making status and checkout faster. And we are using the data to guide how/where we should invest our engineering time for future performance gains.
Hopefully, I can encourage others to start using Trace2, gather their own data and look for opportunities where they can help improve Git.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?
Bring together the Partial Clone and Sparse Checkout features to scale to large repos. This includes completing the end-user experience, so that it just works and doesn’t require any wizardry.
There are several dimensions that have similar, but not identical needs.
For example, a moderately-sized work tree with a few large blobs might use Partial Clone with the blob-size filter and only demand-fetch large blobs when actually needed. This could be seen as an easier to use solution than LFS.
Alternatively, a repo with a gigantic tree might use Partial Clone with the sparse filter (to get “cones” of the work tree). That could be coordinated with the sparse-checkout file to populate just the desired parts of the work tree. For some users this would be simpler than using GVFS.
Let’s add new porcelain commands to create, grow, and shrink the sparse-checkout file and automatically update the index, so that the user doesn’t have to manually manipulate it.
Investigate a bulk pre-fetch command or hook, such as before a checkout, to reduce the overhead of individually demand-loading missing objects.
Finally, update Protocol V2 to include whatever verbs we need to make all of this work efficiently.
With this we could probably retire most if not all of GVFS and hopefully let our Windows and Office developers use core Git and not need a private fork.
If you could remove something from Git without worrying about backwards compatibility, what would it be?
I’d like to revisit the design of the index. Switch to a sparse and hierarchical format, for example. This is a large task and touches everything from the on-disk format to every index-related loop in the program.
Ben Peart and Derrick Stolee both touched upon this in earlier issues.
What is your favorite Git-related tool/library, outside of Git itself?
I’m mostly a terminal user, so I don’t use very many third-party tools. I do highly recommend GitGitGadget. I use it to run CI builds on all Windows, Mac, and Linux and to send patches to the mailing list.
Various
GitHub Maintainer Security Advisories are a way to privately report vulnerabilities to OSS projects on GitHub, currently in a public beta. Those are a part of a larger effort by GitHub to support security (especially targeting OSS projects). For more information about this effort see the following GitHub Help pages:
Light reading
Git tools and sites
This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Gabriel Alcaras <gabriel.alcaras@telecom-paristech.fr> with help from Jeff Hostetler, David Pursehouse and Johannes Schindelin.