Git Rev News: Edition 20 (October 19th, 2016)

Welcome to the 20th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of September 2016.

Discussions

Reviews

Linus Torvalds asked to increase the default number of characters in SHA-1 abbreviations. The default of 7 (default_abbrev = 7) was reasonable in the early days of Git, but a project the size of the Linux kernel needs git config --global core.abbrev 12. While Git will extend the seven hex digits until the object name is unique, that only reflects the current situation in the repository. It gets annoying when a commit message contains a short Git ID that is no longer unique a few months later, when one needs to go back and figure out what went wrong in that commit.

Jeff King, alias Peff, addressed the “it gets annoying” part in the [PATCH 0/10] helping people resolve ambiguous sha1s patch series (merged in 66c22ba6). This patch series taught Git to help in the situation where only an ambiguous shortened identifier is available, by listing the SHA-1s of the objects it found, along with a few bits of information that may help the user decide which one they meant.

  $ git rev-parse b2e1
  error: short SHA1 b2e1 is ambiguous
  hint: The candidates are:
  hint:   b2e1196 tag v2.8.0-rc1
  hint:   b2e11d1 tree
  hint:   b2e1632 commit 2007-11-14 - Merge branch 'bs/maint-commit-options'
  hint:   b2e1759 blob
  hint:   b2e18954 blob
  hint:   b2e1895c blob
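
The lookup behind this output can be pictured with a small Python sketch. It only illustrates the idea and is not Git's implementation: the `resolve_prefix` function and the `AmbiguousPrefix` exception are hypothetical names, and the object table reuses the abbreviated IDs from the output above as stand-ins for full object names.

```python
class AmbiguousPrefix(Exception):
    """Raised when a short ID matches more than one object (hypothetical)."""

    def __init__(self, prefix, candidates):
        super().__init__(f"short SHA1 {prefix} is ambiguous")
        self.candidates = candidates


def resolve_prefix(prefix, objects):
    """Return the single object ID that starts with `prefix`.

    `objects` maps object IDs to their type ("commit", "tree", ...).
    """
    matches = sorted(oid for oid in objects if oid.startswith(prefix))
    if not matches:
        raise KeyError(prefix)
    if len(matches) > 1:
        # Keep the type next to each candidate so the user can pick one.
        raise AmbiguousPrefix(prefix, [(oid, objects[oid]) for oid in matches])
    return matches[0]


# Abbreviated IDs taken from the example output, used here as stand-ins.
OBJECTS = {
    "b2e1196": "tag",
    "b2e11d1": "tree",
    "b2e1632": "commit",
    "b2e1759": "blob",
    "b2e18954": "blob",
    "b2e1895c": "blob",
}
```

With this table, `resolve_prefix("b2e16", OBJECTS)` returns the commit, while `resolve_prefix("b2e1", OBJECTS)` raises `AmbiguousPrefix` carrying all six candidates and their types.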

The problem of Git providing SHA-1 abbreviations that would soon become ambiguous was solved in a different way than Linus proposed. Instead of increasing the default abbreviation length for all projects, which would make abbreviations longer and more unwieldy even for small projects that don’t need it, Peff proposed making the default abbreviation length scale dynamically with the number of objects in the repository. Linus sent a rough implementation of this idea, which after a few iterations (and cleanups of related code) got merged into ‘next’ as bb188d00f7.
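
As a rough illustration of how such scaling could work, the Python sketch below derives a length from the object count with a birthday-bound argument: with roughly 2^n objects, a prefix needs about 2n bits before accidental collisions become unlikely, and at 4 bits per hex digit that is n/2 digits. The function name and the exact rounding are assumptions made for this example; the merged code may differ in detail.

```python
def abbrev_length(object_count, minimum=7):
    """Suggest a hex abbreviation length for a repository of this size."""
    # bit_length() gives the smallest n with object_count < 2**n.
    bits = object_count.bit_length()
    # 2n bits of prefix at 4 bits per hex digit is n / 2 digits, rounded up.
    digits = (bits + 1) // 2
    # Small repositories keep the traditional minimum.
    return max(minimum, digits)
```

Under these assumptions a repository with about five million objects gets 12 digits, which matches the value Linus suggested for core.abbrev, while a project with a thousand objects stays at 7.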

Johannes Schindelin, alias Dscho, is the maintainer of Git for Windows. He works for Microsoft and, on top of his maintainer role, he has been working since February this year, whenever time allowed, to speed up interactive rebase (git rebase -i).

Since it was created in 2005, the git rebase command has been implemented as shell scripts that call other git commands.

The interactive rebase that Dscho implemented in 2007 calls different commands than the regular, non-interactive rebase. The regular rebase uses git format-patch to create a patch series from some commits, and then git am to apply this patch series on top of a different commit, while the interactive rebase calls git cherry-pick repeatedly for the same purpose.

Neither of these approaches has been very efficient though, mainly because repeatedly calling a git command has a significant overhead. Even the regular git rebase would do that, as git am had been implemented by launching git apply on each of the patches.

The overhead is especially big on Windows where creating a new process is quite slow, but even on other OSes it requires setting up everything from scratch, then reading the index from disk, and then, after performing some changes, writing the index back to the disk.
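
The cost of spawning a process per step can be made visible with a small, unscientific measurement. The Python sketch below is only an analogy: it starts a fresh Python interpreter in place of a git command, and the helper names are made up for this example.

```python
import subprocess
import sys
import time


def do_nothing():
    """Stand-in for work done by calling an internal function."""


def time_spawned(runs):
    """Seconds spent starting `runs` child processes that do nothing."""
    start = time.perf_counter()
    for _ in range(runs):
        # Each iteration pays for process creation and interpreter startup.
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start


def time_in_process(runs):
    """Seconds spent calling a no-op function `runs` times."""
    start = time.perf_counter()
    for _ in range(runs):
        do_nothing()
    return time.perf_counter() - start
```

Comparing `time_spawned(20)` with `time_in_process(20)` on a given machine gives a feel for the fixed cost paid on every spawn, before any real work such as reading the index even starts.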

In the case of the regular rebase, a patch series was recently merged to the ‘master’ branch that makes git am call git apply’s internal functions without spawning the latter as a separate process. So in the next Git feature release the regular rebase will be significantly faster, especially on Windows and for big repositories.

Dscho’s work achieves the same kind of results for the interactive rebase. The work, which has been distilled into a series of patch series sent to the mailing list, greatly improves and then uses a mechanism called the sequencer.

From its beginning in 2008 as a GSoC (Google Summer of Code) project, the sequencer had been envisioned as a low-level patch-application engine written in C that would “take the ‘todo’ file format used by git-rebase -i and extend it to also support applying patches split out of mbox files”, so that “frontends like git-am, git-rebase, etc. can then setup the ‘todo’ script and pass it to git-sequencer, which does the actual patch application, editing, etc.”

Of course this was far too ambitious for a single GSoC project, so the work that Stephan Beyer, the GSoC student at that time, did to implement it was not merged. A lot of great related work by Stephan was merged though, and the sequencer idea as well as Stephan’s code were still considered valuable, so in 2011 another GSoC project was attempted to further the idea and Stephan’s code. This time the goal was to first use the sequencer to improve cherry-picking and reverting many commits, and Ramkumar Ramachandra, alias Ram, succeeded. The sequencer code got merged and it became possible to “continue”, “abort” or “skip” when cherry-picking or reverting many commits.

Despite this success, Dscho has had to improve a lot of things to make it possible to reuse the sequencer in the interactive rebase. For example, he had to create a git-rebase--helper in C that ported a lot of the functionality from the git-rebase--interactive.sh shell script.
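
For readers who have not looked inside an interactive rebase, the ‘todo’ file is a plain list of commands, one per line. The Python sketch below parses such a list; it is an illustration only, not the sequencer's C code. The `parse_todo` name, the sample commit IDs and the sample subjects are made up, and the comment character is assumed to be the default `#`.

```python
# Single-letter abbreviations accepted in a rebase -i todo list.
ABBREVIATIONS = {
    "p": "pick", "r": "reword", "e": "edit",
    "s": "squash", "f": "fixup", "x": "exec", "d": "drop",
}
COMMANDS = set(ABBREVIATIONS.values())


def parse_todo(text):
    """Parse todo text into (command, argument, rest) tuples."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        word, _, remainder = line.partition(" ")
        command = ABBREVIATIONS.get(word, word)
        if command not in COMMANDS:
            raise ValueError(f"unknown command: {word}")
        if command == "exec":
            # The whole remainder is a shell command line.
            items.append((command, remainder, ""))
        else:
            commit, _, subject = remainder.partition(" ")
            items.append((command, commit, subject))
    return items


# Hypothetical todo list; IDs and subjects are invented for the example.
SAMPLE = """\
pick 1a2b3c4 Add parser
f 5d6e7f8 fixup! Add parser
exec make test
# comments like this one are skipped
"""
```

Calling `parse_todo(SAMPLE)` yields three entries: a pick, a fixup (expanded from `f`) and an exec whose argument is the full command line. An engine in the spirit of the sequencer would then walk such a list and apply each step in turn.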

As Dscho explains in an answer to a question by Jakub Narębski, who asked about the status of the patch series, 10 of his patch series had already been accepted, 5 were in flight and 1 had not yet been submitted at the beginning of September.

These patch series will speed up the interactive rebase, but they are not enough to fully replace the shell implementation of rebase with one in C. According to Dscho such a result is “far, far, far in the future”:

…my hope is that the rebase--helper work is only an initial step, opening the door for other contributors to tackle independent parts of making git-rebase a builtin

Though the patch series have been reviewed by a large number of experienced Git developers like Junio Hamano, Johannes Sixt, Torsten Bögershausen, Jeff King, Jakub Narębski, Dennis Kaarsemaker, Eric Sunshine, Kevin Daudt and Stefan Beller, they are not fully merged into Git yet. But Dscho already “integrated the whole shebang into Git for Windows 2.10.0 and 2.10.1” that were released recently, and “it has been running without complaints (and some quite positive feedback)”.

About the performance improvements, Dscho wrote:

The end game of this patch series is a git-rebase--helper that makes rebase -i 5x faster on Windows (according to t/perf/p3404). Travis says that even MacOS X and Linux benefit (4x and 3x, respectively).

Such performance improvements as well as the code consolidations around the sequencer are of course very nice. It is interesting and satisfying to see that they are the result of building on top of previous work over the years by GSoC students, mentors and reviewers.

Dscho wrote about making interactive rebase much faster in a recent blog post (linked to in the previous Git Rev News), repeating and extending information from his answer mentioned in the article above. Among other things, he explained how he can be sure that the code is ready:

The answer: I verified it. Inspired by GitHub’s blog post on their Scientist library, I taught my personal Git version to cross-validate each and every interactive rebase that I performed since the middle of May. That is, each and every interactive rebase I ran was first performed using the original shell script, then using the git rebase--helper, and then the results were confirmed to be identical (modulo time stamps).

And further:

Full disclosure: the cross-validation did find three regressions that were not caught by the regression test suite (which I have subsequently adjusted to test for those issues, of course). So it was worth the effort.

The regressions in question are described in the follow-up on the Git mailing list. It is interesting to see the idea behind the Scientist library applied to ensure the quality of a Git code refactoring.
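
The cross-validation idea can be sketched in a few lines. The Python below is a generic illustration of running an old and a new implementation side by side; it is neither the Scientist library's API nor Dscho's actual setup, and all names in it are hypothetical.

```python
def cross_validate(reference, candidate, mismatches, *args):
    """Run both implementations and return the reference result.

    Disagreements are appended to `mismatches` so they can be
    investigated later without disturbing the caller.
    """
    expected = reference(*args)
    try:
        actual = candidate(*args)
    except Exception as error:  # a crash in the new code is a mismatch too
        actual = error
    if actual != expected:
        mismatches.append((args, expected, actual))
    return expected


def old_squash(subjects):
    """Stand-in for the trusted implementation."""
    return " / ".join(subjects)


def new_squash(subjects):
    """Stand-in for the rewrite, with a deliberate bug on empty input."""
    return subjects[0] + "".join(" / " + s for s in subjects[1:])
```

The caller always gets the trusted result, so everyday work is unaffected, while each recorded mismatch points at a case the regression test suite did not cover.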

Kyle J. McKay had been wanting a compact one-line output format that includes dates, times and initials, and is compatible with --graph.

  === 2015-09-17 ===
* ee6ad5f4 12:16 jch (tag: v2.5.3) Git 2.5.3
  === 2015-09-09 ===
* b9d66899 14:22 js  am --skip/--abort: merge HEAD/ORIG_HEAD tree into index
|   === 2015-09-04 ===
| * 27ea6f85 10:46 jch (tag: v2.5.2) Git 2.5.2
* 74b67638 10:36 jch (tag: v2.4.9) Git 2.4.9
                     ..........
* ecad27cf 10:32 jch (tag: v2.3.9) Git 2.3.9


To provide all this, Kyle proposed a git-log-times script for contrib/.

Jeff King was surprised to see this as a separate script, and proposed a patch series adding features to git log, such as a --commit-header option, that make it possible to come close to what git-log-times provides.

Junio Hamano reminded everyone that the contrib/ area is not the place for random Git-related things:

Unlike the earlier days of Git, if a custom command that uses Git is very useful, it can live its own life and flourish within the much larger Git userbase we have these days.

The proposed script was therefore published as the git-log-compact project.

Support

Rich Felker complained that compiling Git with musl libc no longer works out of the box (that is, without setting the NO_REGEX build configuration variable) after commit 2f895225. The proposed workaround unfortunately didn’t work on Windows, as pointed out by Jeff King and Johannes Schindelin.

There was a bit of a digression about which platforms are Git’s main ones, and whether Git code should be able to rely on POSIX features. Jakub Narębski reminded readers that CodingGuidelines specifically state that:

  • Most importantly, we never say “It’s in POSIX; we’ll happily ignore your needs should your system not conform to it.” We live in the real world.

  • However, we often say “Let’s stay away from that construct, it’s not even in POSIX”.

  • In spite of the above two rules, we sometimes say “Although this is not in POSIX, it (is so convenient | makes the code much more readable | has other good characteristics) and practically all the platforms we care about support it, so let’s use it”.

The commit in question makes Git require a regexp engine with REG_STARTEND support, while providing a fallback implementation (turned on with NO_REGEX), and so matches the third point in the list above. This extension to regexec(), introduced by the NetBSD project, is present in all major regex implementations… though not in musl.

There was yet another proposed fix for the problem, namely adding padding so that the end of an mmap-ed file doesn’t fall on a page boundary if the regex implementation doesn’t support REG_STARTEND. On one hand, this workaround relied on undocumented (but sane) assumptions about operating system behavior; on the other hand, it was faster than the workaround in the original patch, which copied the contents to a NUL-terminated buffer. Nevertheless, any workaround would mean additional code that needs to be maintained, and it was not accepted.

Also, it turned out that the configure script already detects whether the regex engine supports REG_STARTEND and sets NO_REGEX if necessary; this was just badly described, which has since been corrected.

Git doesn’t yet set NO_REGEX automatically based on information from uname, though.
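
What REG_STARTEND buys can be illustrated by analogy in Python, whose `re` module lets a compiled pattern search a byte range of a larger buffer. This is only an analogy for the C-level regexec() extension, not Git's code, and the buffer contents are made up. Matching in place avoids the copy that the slicing variant makes, which parallels the NUL-terminated copy discussed above.

```python
import re

# A buffer standing in for an mmap-ed file; the region of interest
# is the middle line, which is not terminated in any special way.
BUFFER = b"header\nfoo bar\ntrailing"
START, END = 7, 14  # BUFFER[START:END] == b"foo bar"

PATTERN = re.compile(rb"bar")


def search_in_place(buffer, start, end):
    """Search only buffer[start:end] without copying it."""
    return PATTERN.search(buffer, start, end)


def search_with_copy(buffer, start, end):
    """Copy the region first, as a fallback would have to."""
    return PATTERN.search(buffer[start:end])
```

Both variants find the same text. The in-place search reports offsets relative to the whole buffer, while the copying one reports offsets relative to the copied region.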

Andrew Johnson asked on the mailing list:

While reading Pro Git 2nd Ed. I came across these three methods:

$ git help <verb>
$ git <verb> --help
$ man git-<verb>

I tested all three to confirm they were equivalent.

What was the motivation behind the complication, if any? I presume most developers would not provide multiple commands that do the same thing for absolutely no reason, so I led myself to ask this question.

Fredrik Gustafsson was the first to answer. He first said that the three commands are not actually equivalent on Windows as:

$ man git-<verb>

does not work and

$ git help <verb>

opens a web browser instead of a man page.

Philip Oakley then answered that the three different methods were added at different times for different reasons. The man method was added first because “historically git was a set of shell scripts named git-*, so each stood alone”.

The --help option resulted from “the modern git <cmd> approach, with every command normally having -h and --help options for short form usage and long form man pages”. Meanwhile “a git help <cmd> command was created” which “allowed selection of display type, so that on Unix/Linux man was the norm, while an --html (or --web) option is available for those who like the pretty browser view”.

Your own Christian Couder chimed in saying that git help makes it possible to teach people one command that will do something sensible on every system, and that it also “provides more configurability and more features like its -a and -g options”.

Jakub Narębski added that there are also help pages that are about “concepts (gitcli, gitrevisions, githooks, gitrepository-layout, gitglossary), or about files (gitignore, gitattributes, to some extent githooks)” and they are “only accessible with git help <concept> or, on OS with installed ‘man’, also man <gitconcept>”.

Philip replied to the above saying that “git revisions --help does work”, but Junio Hamano clarified things by saying that this was a bug that had been recently fixed.

It would indeed seem wrong to have git <concept> --help working, as concepts are not the same things as commands.

Anyway, this shows that it is not so simple to design a good help system, especially one that is both full-featured across different platforms and simple-looking to users.

Developer Spotlight: Dennis Kaarsemaker

I’m Dennis Kaarsemaker, I do scalability and security things for Booking.com, part of which includes hacking on our git infrastructure together with Ævar Arnfjörð Bjarmason. I also maintain perl5.git.perl.org and do a lot of user support.

Spending a lot of time in #git and #github on freenode solving people’s git problems. Occasionally this leads to bug reports or even patches, but mostly I’m trying to make users understand git and make them smile.

Besides user support, I do read the mailing-list and try to review patches or pick up smaller bugs as time permits. Time however is scarce with a fearless 14 month old girl crawling around the house trying to get into trouble :)

If I had a team of developers, their core focus would be scalability for very big repositories. Things like a protocol that is efficient with hundreds of thousands of refs and can be load-balanced properly, or more efficient storage for refs, external files and other data. Or a peer to peer continuous sync protocol for the object store.

Oh, if only I could remove submodules. They’re almost universally used for the wrong reason, are easy to get confused about and use wrong, and they complicate many parts of git.

Definitely GitHub. I appreciate that Git is made for distributed version control, and regularly use it in that way; but the social benefits of having a single place to discover, maintain and collaborate on projects that GitHub offers really help in getting the most out of my open source experience. I even made a command line API client for GitHub, GitLab and BitBucket :)

Releases

Other News

Events

Various

Light reading

Git tools and sites

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com> and Thomas Ferris Nicolaisen <tfnico@gmail.com>, with help from Jakub Narębski, Dennis Kaarsemaker, Johannes Schindelin, Lars Schneider and Jeff King.