Git Rev News Edition 52 (June 28th, 2019)

Git Rev News: Edition 52 (June 28th, 2019)

Welcome to the 52nd edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of May 2019.

Discussions

General

I made a flame graph renderer for git’s trace2 output

Ævar Arnfjörð Bjarmason sent an email saying he developed a script that uses the FlameGraph tool to generate a picture showing where Git’s test suite spends its time.

His script also uses the new Git Trace2 API which has been developed mostly by Jeff Hostetler starting nearly one year ago and then through different versions: RFC, V1, V2, V3, V4, V5, and which has eventually been released in Git v2.22.0 at the beginning of June 2019.

Ævar added that he plans to improve his script over time and maybe submit it in a pull request to the FlameGraph tool, or perhaps integrate it in the Git test suite.

Derrick Stolee, who prefers to be called just Stolee, replied that he liked the idea and sent the commands using Ævar’s script that he used to create another picture from a much smaller test.

Gábor Szeder commented on Stolee’s commands to ask why GIT_TR2 instead of GIT_TRACE2 was used in the environment variables related to the Trace2 API. Gábor referred to Ken Thompson “who (allegedly?) later regretted spelling creat()/O_CREAT without the ‘e’…”.

Jeff King, alias Peff, replied to Ævar’s initial email asking “doesn’t perf record -g make test already give us that granular data?” referring to the Linux perf tool which is already supported by the FlameGraph tool. Peff also wondered about the usefulness of such a graph:

But having generated such a flamegraph, it’s not all that helpful. It mainly tells us that we spend a lot of time on fork/exec. Which is no surprise, since the test suite is geared not towards heavy workloads, but lots of tiny functionality tests.

Though he agreed that it could help in some cases:

I think the trace2 flamegraph would be most useful if you were collecting across a broad spectrum of workloads done by a user. You can do that with perf or similar tools, but it can be a bit awkward.

Ævar replied that his “actual use-case for this is to see what production nodes are spending their time on, similar to what Microsoft is doing with their use of this facility”, and that he used the test suite because it is a good way to test his script and the Trace2 API as “we’re pretty much guaranteed to run all our commands, and cover a lot of unusual cases”.

Ævar pointed that his work “shows that we’ve got a long way to go in improving the trace2 facility, i.e. adding region enter/leave for some of the things we spend the most time on.”

Jeff Hostetler, who authored the Trace2 API and works for Microsoft along with Stolee, then replied “Very Nice!” to Ævar and agreed with him about the work still needed “to get more granular data for interesting/problematic things”.

Ævar and Jeff then discussed this future work further in a few emails.

Jeff also replied to Gábor that he was ok to change TR2 to TRACE2, and Gábor sent two patches to get that change done. These patches were agreed on and merged before Git v2.22.0 was released on June 7th, 2019.

Reviews

[PATCH 00/17] [RFC] Commit-graph: Write incremental files

The road to incremental serialized commit-graph started with an attempt to create commit-graph file format v2
The commit-graph file format has some shortcomings that were discussed on-list:
1. It doesn’t use the 4-byte format ID from the_hash_algo.
2. There is no way to change the reachability index from generation numbers to corrected commit date.
3. The unused byte in the format could be used to signal the file is incremental, but current clients ignore the value even if it is non-zero.
This series adds a new version (2) to the commit-graph file. The fifth byte already specified the file format, so existing clients will gracefully respond to files with a different version number. The only real change now is that the header takes 12 bytes instead of 8, due to using the 4-byte format ID for the hash algorithm.
(Note that switching to corrected commit date as generation number v2 was covered in Git Rev News edition 45, November 2018).

It turned out however that the statement “existing clients will gracefully respond to files with a different version number” unfortunately turned out to be not true. Ævar Arnfjörð Bjarmason noticed that older Git responds with a hard error to commit-graph v2, instead of simply turning the serialized-graph feature off in such case.
[…] writing a v2 file would make most things (e.g. “status”) hard error on e.g. v2.21.0:
```
$ git status
error: graph version 2 does not match version 1
$
```
Now as noted in my series we now on ‘master’ downgrade that to a warning (along with the rest of the errors):
```
$ ~/g/git/git --exec-path=$PWD status
error: commit-graph version 2 does not match version 1
On branch master
[...]
```
…and this series sets the default version for all new graphs to v2.

I think this is way too aggressive of an upgrade path. If these patches go into v2.22.0 then git clients on all older versions that grok the commit graph (IIRC v2.18 and above) will have their git completely broken if they’re in a mixed-git-version environment.
The workaround is easy: removing .git/info/commit-graph, or using “git -c core.commitGraph=false ...”. However it is not possible to e.g. add advice describing the workaround to past Git versions (and new versions would simply not fail hard on v2).

It turned out that there is no need to introduce new commit-graph file format to achieve [almost] all stated goals. The goal 1.) turned out to be not important.
- Let’s just live with “1” as the marker for SHA-1.
  
  Yeah it would be cute to use 0x73686131 instead like “struct git_hash_algo”, but we can live with a 1=0x73686131 (“sha1”), 2=0x73323536 (“s256”) mapping somewhere. It’s not like we’re going to be running into the 255 limit of hash algorithms Git will support any time soon.
For 2.), Stolee noticed that generation number v2 (corrected commit date) can be made backward compatible

Since we can make the “corrected commit date” offset for a commit be strictly larger than the offset of a parent, we can make it so an old client will not give incorrect values when we use the new values. The only downside would be that we would fail on ‘git commit-graph verify’ since the offsets are not actually generation numbers in all cases.

This is discussed in a bit more detail in Re: Revision walking, commit dates, slop thread.

The issue of incremental commit-graph file, i.e. 3.), turned out to be better solved by keeping the base as backward-compatibile commit-graph, and deltas as separate files. Thus “Create commit-graph file format v2” turned into Commit-graph write refactor, and Stolee started a separate [RFC] Commit-graph: Write incremental files thread.

The original idea was to store subsequent deltas (incremental additions to serialized commit-graph data) in files named commit-graph-2, commit-graph-3, etc. After involved discussion, considering problems of concurrency, at V6 the design has changed, to having commit-graphs/graph-{hash}.graph together with commit-graph-chain index file.

The commit-graph is a valuable performance feature for repos with large commit histories, but suffers from the same problem as git repack: it rewrites the entire file every time. This can be slow when there are millions of commits, especially after we stopped reading from the commit-graph file during a write in 43d3561 (commit-graph write: don’t die if the existing graph is corrupt).

Instead, create a “chain” of commit-graphs in the .git/objects/info/commit-graphs folder with name graph-{hash}.graph. The list of hashes is given by the commit-graph-chain file, and also in a “base graph chunk” in the commit-graph format. As we read a chain, we can verify that the hashes match the trailing hash of each commit-graph we read along the way and each hash below a level is expected by that graph file.

When writing, we don’t always want to add a new level to the stack. This would eventually result in performance degradation, especially when searching for a commit (before we know its graph position). We decide to merge levels of the stack when the new commits we will write is less than half of the commits in the level above. This can be tweaked by the --size-multiple and --max-commits options.

This series, as ‘ds/commit-graph-incremental’ branch, is currently marked as ready to be merged into ‘next’.

Developer Spotlight: Jeff Hostetler

Who are you and what do you do?

My name is Jeff Hostetler and I work for Microsoft. I’ve been working on Git and on Git-related technologies for the last 5 years. Primarily focusing on performance and scale.

Prior to joining Microsoft, I worked for SourceGear and built the Veracity DVCS and the DiffMerge visual diff and merge tool.

A long, long time ago I was Architect for Spyglass Mosaic.
What would you name your most important contribution to Git?

I’d have to say the Trace2 facility that is now in v2.22. With this in place, it will be much easier to understand performance bottlenecks at scale.

Second to that would be the beginnings of the Partial Clone feature. There’s still a lot of work to do in this area, but I think long term, it will be central to solving certain enterprise-level scale problems.
What are you doing on the Git project these days, and why?

I’m currently working on a series of blog posts explaining Trace2 and how it can be used to measure and track Git performance.

Within Microsoft we continue to study Trace2 data generated by our Windows and Office developers and look for opportunities to improve the developer experience, such as making status and checkout faster. And we are using the data to guide how/where we should invest our engineering time for future performance gains.

Hopefully, I can encourage others to start using Trace2, gather their own data and look for opportunities where they can help improve Git.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?

Bring together the Partial Clone and Sparse Checkout features to scale to large repos. This includes completing the end-user experience, so that it just works and doesn’t require any wizardry.

There are several dimensions that have similar, but not identical needs.

For example, a moderately-sized work tree with a few large blobs might use Partial Clone with the blob-size filter and only demand-fetch large blobs when actually needed. This could be seen as an easier to use solution than LFS.

Alternatively, a repo with a gigantic tree might use Partial Clone with the sparse filter (to get “cones” of the work tree). That could be coordinated with the sparse-checkout file to populate just the desired parts of the work tree. For some users this would be simpler than using GVFS.

Let’s add new porcelain commands to create, grow, and shrink the sparse-checkout file and automatically update the index, so that the user doesn’t have to manually manipulate it.

Investigate a bulk pre-fetch command or hook, such as before a checkout, to reduce the overhead of individually demand-loading missing objects.

Finally, update Protocol V2 to include whatever verbs we need to make all of this work efficiently.

With this we could probably retire most if not all of GVFS and hopefully let our Windows and Office developers use core Git and not need a private fork.
If you could remove something from Git without worrying about backwards compatibility, what would it be?

I’d like to revisit the design of the index. Switch to a sparse and hierarchical format, for example. This is a large task and touches everything from the on-disk format to every index-related loop in the program.

Ben Peart and Derrick Stolee both touched upon this in earlier issues.
What is your favorite Git-related tool/library, outside of Git itself?

I’m mostly a terminal user, so I don’t use very many third-party tools. I do highly recommend GitGitGadget. I use it to run CI builds on all Windows, Mac, and Linux and to send patches to the mailing list.

Releases

Git 2.22.0, 2.22.0-rc3, 2.22.0-rc2
Git for Windows 2.22.0
libgit2 0.28.2
GitHub Enterprise 2.17.2, 2.16.11, 2.15.16, 2.14.23, 2.17.1, 2.16.10, 2.15.15, 2.14.22, 2.17.0, 2.16.9, 2.15.14, 2.14.21
GitLab 12.0.1, 12.0, 11.11.3, 11.11.2, 11.10.6, 11.11.1, 11.10.5, and 11.9.12, 11.11
Bitbucket Server 6.4
Gerrit Code Review 2.15.14, 2.16.9
GitKraken 6.0.0, 5.0.4, 5.0.3, 5.0.2, 5.0.1, 5.0.0
GitHub Desktop 2.0.4, 2.0.3, 2.0.2, 2.0.1, 2.0.0
Sourcetree 3.2.1, 3.2

Other News

Various

GitHub Maintainer Security Advisories are a way to privately report vulnerabilities to OSS projects on GitHub, currently in a public beta. Those are a part of a larger effort by GitHub to support security (especially targeting OSS projects). For more information about this effort see the following GitHub Help pages:
- About maintainer security advisories
- About security alerts for vulnerable dependencies
Introducing new ways to keep your code secure by Justin Hutchings on GitHub Blog.
Announcing GitHub Sponsors: a new way to contribute to open source by Devon Zuegel on GitHub Blog; see also TechCrunch article by Frederic Lardinois on the same subject.
Introducing GitHub Package Registry by Simina Pasat on GitHub Blog; GitHub Package Registry feature is currently in limited beta (you need to sign up for the beta).

Light reading

FreeBSD and Git (PDF) – slides from Ed Maste’s presentation at the FreeBSD Vendor Summit 2018.
Things I Learnt The Hard Way (in 30 Years of Software Development) by Julio Biason, including few thoughts about the use of version control and Git.
Rebasing and merging in kernel repositories [LWN.net] by Jonathan Corbet.
Enforcing Conventional Commits using Git hooks by Enda Phelan. Conventional Commits is a specification for adding human and machine readable meaning to commit messages. The presented solution was turned into the Sailr project.
Learn Git concepts, not commands by Nico Riedmann: An interactive Git tutorial for beginners meant to teach you how Git works, not just which commands to execute.
A Git Origin Story by Zack Brown (2018).

Git tools and sites

Sailr: a global Git hook for adhering to the Conventional Commit specification

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Gabriel Alcaras <gabriel.alcaras@telecom-paristech.fr> with help from Jeff Hostetler, David Pursehouse and Johannes Schindelin.