Git Rev News: Edition 45 (November 21st, 2018)

Welcome to the 45th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of October 2018.

Discussions

Support

  • commit-graph is cool

    Elijah Newren sent an email to the mailing list that started with:

    Just wanted to give a shout-out for the commit-graph work and how impressive it is.

    He then describes a user with a repo where pushing a commit takes more than one minute. The repo was quite “unusual” as it had a lot of tags and the push.followTags config option was set to true. Elijah found that most of the time was spent in an add_missing_tags() function which called in_merge_bases_many() once per tag, which “seemed rather suboptimal”, as in_merge_bases_many() does a commit traversal which is not cheap.

    Instead of optimizing this Elijah tried a development version of the commit-graph feature. The commit-graph feature itself is a quite recent feature in Git that was developed by Derrick Stolee, alias Stolee, who blogged about it:

    (These links were already provided in Git Rev News edition 41 last July. Stolee has been interviewed in Git Rev News edition 42 last August.)

    Elijah found that the commit-graph feature reduced the time of a git push --dry-run by a factor of over 100, from over a minute to sub-second. This speedup, however, came from making each in_merge_bases_many() call much faster, not from reducing the number of calls to this function.

    Stolee replied that the generation numbers feature in the commit-graph file is likely what makes the calls much faster, as it can often avoid commit traversals altogether.
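    To illustrate why generation numbers can avoid a walk, here is a minimal Python sketch, not Git's C implementation. It assumes the usual definition (a commit without parents has generation 1, any other commit has one more than the largest generation among its parents); the toy history and the helper names generation_numbers() and can_reach() are hypothetical.

```python
from collections import deque

def generation_numbers(parents):
    """Compute gen(c) = 1 + max(gen(p) for p in parents[c]); roots get 1."""
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            missing = [p for p in parents[c] if p not in gen]
            if missing:
                # Parents must be numbered before their child.
                stack.extend(missing)
            else:
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
                stack.pop()
    return gen

def can_reach(parents, gen, src, target):
    """Return True if target is src itself or one of its ancestors."""
    if gen[src] < gen[target]:
        # An ancestor always has a strictly smaller generation number,
        # so the answer is known without walking any commit.
        return False
    seen = {src}
    queue = deque([src])
    while queue:
        c = queue.popleft()
        if c == target:
            return True
        for p in parents[c]:
            # Commits below the target's generation cannot lead to it.
            if p not in seen and gen[p] >= gen[target]:
                seen.add(p)
                queue.append(p)
    return False

# Hypothetical toy history: child -> list of parents.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["C", "D"],
    "T": ["A"],  # a side commit, e.g. one pointed at by a tag
}
gen = generation_numbers(parents)
print(can_reach(parents, gen, "E", "A"))  # walks the history
print(can_reach(parents, gen, "T", "E"))  # answered by the cutoff alone
```

    In the second call the source commit has a smaller generation number than the target, so the function returns without visiting any commit, which is the kind of shortcut Stolee refers to.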

    Jeff King, alias Peff, also replied to Elijah suggesting implementing an “all-points” traversal instead of many commit traversals. Peff also noticed that generation numbers might give a better answer in some cases as commit traversals are “susceptible to wrong answers due to clock skew”.

    Stolee a few weeks later sent a small patch series to fix the behavior of the add_missing_tags() function by implementing a new get_reachable_subset() function which does “a many-to-many reachability test” and performs only one commit traversal.
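    The idea of a many-to-many reachability test that needs only one traversal can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code of get_reachable_subset(): the function name reachable_subset(), the toy history, and the precomputed generation numbers are all hypothetical.

```python
def reachable_subset(parents, gen, from_commits, to_commits):
    """Return the commits in to_commits reachable from any of from_commits.

    A single walk serves every target, instead of one walk per target.
    """
    targets = set(to_commits)
    if not targets:
        return set()
    # No target can be found below the smallest target generation.
    min_gen = min(gen[t] for t in targets)
    found = set()
    seen = set(from_commits)
    stack = list(seen)
    while stack and len(found) < len(targets):
        c = stack.pop()
        if c in targets:
            found.add(c)
        for p in parents[c]:
            if p not in seen and gen[p] >= min_gen:
                seen.add(p)
                stack.append(p)
    return found

# Hypothetical toy history (child -> parents) with generation numbers
# assumed to be already available, e.g. from a commit-graph file.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["C", "D"],
    "T": ["A"],
}
gen = {"A": 1, "B": 2, "C": 3, "D": 3, "E": 4, "T": 2}

# "E" stands for the commit being pushed; "B", "D" and "T" are tagged.
result = reachable_subset(parents, gen, ["E"], ["B", "D", "T"])
print(sorted(result))
```

    With many tags, the cost is bounded by one walk over the relevant part of the history rather than by the number of tags times the cost of a walk.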

    Junio Hamano, the Git maintainer, and Elijah then reviewed the patch series and discussed the implementation with Stolee.

    Elijah reported that the patch series indeed improved the time of a dry-run push from around 1 minute and 20 seconds to around 3 seconds, but that the push now seemed to be a little bit faster without the commit-graph feature. After discussing this with Stolee and running additional tests, though, Elijah reported that he had made a mistake in testing Stolee’s patch series and that using the commit-graph feature was still faster even with the patch series.

    Ævar Arnfjörð Bjarmason also replied to Elijah’s initial email to say that users can set the fetch.pruneTags config option to true to avoid accumulating local-only tags. Elijah then thanked Ævar for the suggestion.

    A few days later Stolee sent a slightly improved version of his small patch series. This version has recently been merged into the master branch, so it should be in the upcoming v2.20.0 Git release scheduled for the beginning of December.

General

  • [RFC] Generation Number v2

    The commit-graph file mechanism (see the description above) accelerates commit graph walks in the two following ways:

    1. Getting basic commit data (satisfying requirements of parse_commit_gently()) without decompressing and parsing.
    2. Reducing the time it takes to walk commits and determine topological relationships by providing a “generation number” (as a negative-cut reachability index), which makes it possible to skip walking whole parts of the commit graph.

    The current version of the generation number has the advantage over a heuristic based on the commit date that it is always correct. It turned out, however, that in some cases it can give worse performance than the date heuristic; that is why its use got limited in [PATCH 1/1] commit: don’t use generation numbers if not needed.

    For the same reason, [PATCH 0/6] Use generation numbers for --topo-order, and its subsequent revisions, also limited its use:

    One notable case that is not included in this series is the case of a history comparison such as git rev-list --topo-order A..B.

    Removing this limitation yields correct results, but the performance is worse.

    That is why Derrick Stolee sent this RFC:

    We’ve discussed in several places how to improve upon generation numbers. This RFC is a report based on my investigation into a few new options, and how they compare for Git’s purposes on several existing open-source repos.

    You can find this report and the associated test scripts at https://github.com/derrickstolee/gen-test.

    Please also let me know about any additional tests that I could run. Now that I’ve got a lot of test scripts built up, I can re-run the test suite pretty quickly.

    He then explains why Generation Number v2 is needed:

    Specifically, some algorithms in Git already use commit date as a heuristic reachability index. This has some problems, though, since commit date can be incorrect for several reasons (clock skew between machines, purposefully setting GIT_COMMIT_DATE to the author date, etc.). However, the speed boost by using commit date as a cutoff was so important in these cases, that the potential for incorrect answers was considered acceptable.

    When these algorithms were converted to use generation numbers, we added the extra constraint that the algorithms are never incorrect. Unfortunately, this led to some cases where performance was worse than before. There are known cases where git merge-base A B or git log --topo-order A..B are worse when using generation numbers than when using commit dates.

    This report investigates four replacements for generation numbers, and compares the number of walked commits to the existing algorithms (both using generation numbers and not using them at all). We can use this data to make decisions for the future of the feature.

    The very rough implementation of those four proposed generation numbers can be found in the reach-perf branch in https://github.com/derrickstolee/git.

    Based on performed benchmarks (by comparing the number of commits walked with the help of trace2 facility), Stolee proposed to pursue one of the following options, though he was undecided about which one to choose:

    1. Maximum Generation Number.
    2. Corrected Commit Date.

    Maximum generation number has the advantage that it is backwards-compatible, that is, it can be used (but not updated) with the current code; however, it is not locally computable or immutable. Corrected commit date would require changes to the commit-graph format, but it can be updated incrementally.
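    To see why a corrected commit date can be updated incrementally, consider the following Python sketch. It assumes the definition later documented for Git's commit-graph (a commit's corrected date is the larger of its own commit date and one more than the largest corrected date among its parents); the helper name corrected_commit_dates(), the toy history, and the timestamps are hypothetical.

```python
def corrected_commit_dates(parents, commit_date):
    """Compute the assumed corrected date for every commit in the history."""
    corrected = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in corrected:
                stack.pop()
                continue
            missing = [p for p in parents[c] if p not in corrected]
            if missing:
                # Parents must be processed before their child.
                stack.extend(missing)
            else:
                floor = max((corrected[p] + 1 for p in parents[c]),
                            default=commit_date[c])
                corrected[c] = max(commit_date[c], floor)
                stack.pop()
    return corrected

# Hypothetical history with clock skew: "B" claims to be older than
# its parent "A".
parents = {"A": [], "B": ["A"], "C": ["B"]}
commit_date = {"A": 100, "B": 90, "C": 120}

corrected = corrected_commit_dates(parents, commit_date)
print(corrected)

# Adding a new commit only needs the values of its parents, so existing
# entries never change.
parents["D"] = ["C"]
commit_date["D"] = 110
corrected["D"] = max(commit_date["D"], corrected["C"] + 1)
print(corrected["D"])
```

    Under this assumption a child always ends up with a strictly larger value than each of its parents, even with skewed clocks, so the value can serve as a reachability cutoff in the same way a generation number does.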

    Junio C Hamano replied that:

    […] I personally do not think being compatible with currently deployed clients is important at all (primarily because I still consider the whole thing experimental), and there is a clear way forward once we correct the mistake of not having a version number in the file format that tells the updated clients to ignore the generation numbers. For longer term viability, we should pick something that is immutable, reproducible, computable with minimum input—all of which would lead to being incrementally computable, I would think.

    It looks like the Corrected Commit Date is the way forward, unless the variant of Maximum Generation Number proposed by Jakub Narębski, which looks like it could be updated almost incrementally, turns out to be better. The change to use Corrected Commit Date would require a new revision of the commit-graph format (which includes a version number, fortunately). Derrick Stolee writes:

    Here is my list for what needs to be in the next version of the commit-graph file format:

    1. A four-byte hash version.

    2. File incrementality (split commit-graph).

    3. Reachability Index versioning

    Most of these changes will happen in the file header. The chunks themselves don’t need to change, but some chunks may be added that only make sense in v2 commit-graphs.

Developer Spotlight: Elijah Newren

  • Who are you and what do you do?

    Big question; I’ll answer in three parts, and see if I can use a little humor to offset the lengthy answer.

    Personally, I’m a husband to the most amazing woman in the world, and a father to one son and six daughters. My wife is expecting again, so next spring my son will get something he’s never had before: a seventh sister! I’m a devout member of The Church of Jesus Christ of Latter-day Saints. I received a PhD in mathematics from the University of Utah, which aside from meaning I’ve forgotten more math than most people will ever know, comes with one primary benefit: when my kids need a “doctor’s note”, it’s often the case that someone has overlooked specifying that the note needs to come from a medical doctor. Sadly, my wife has vetoed me writing these notes myself, which just goes to show that a doctorate isn’t all it’s cracked up to be.

    In the open source world, in addition to my contributions to Git in more recent years, I was once upon a time heavily involved in the Gnome community; a behind the scenes interview I did with them may still be interesting, particularly the travelling tips.

    Professionally, I worked at Sandia National Labs for about six years, transitioning during that time from working on fluid dynamics codes to working on tools to make other developers more productive. Palantir lured me away in early 2013 with a cool mission (especially intriguing to me at the time was the results they were getting in fighting child exploitation and recovering missing children), and an understanding that I would get to work on open source stuff like Gerrit and Git. The underlying mission has remained cool (despite some contrary claims in the media these days), but between managerial turnover and the short-term focus of a startup, it took a long time before I actually had the opportunity to work on Git even part time.

  • What would you name your most important contribution to Git?

    I’ve contributed to a few different places in Git, but most of my contributions have been around merging. I’ve put a lot of work into addressing edge and corner cases; possibly too much: Junio has named some of my patch series things like en/t6042-insane-merge-rename-testcases. Part of the reason for addressing edge and corner cases, though, dovetails with my other work towards fixing, documenting, testing, and restructuring the recursive merge machinery with an eye towards changing out the basic implementation strategy.

    A while ago I found a bug in merge-recursive.c and traced it back to code introduced years ago by myself, but then found that the original bug was only an issue because of some other problem created years ago…that also traced back to me. Sometimes merge-recursive.c feels like it’s all my fault other than the original implementation design. So, not only have I mostly worked on stuff that few people will ever notice, but once I change the implementation underpinnings, merge problems can be entirely my fault too. :-)

    The most notable thing I’ve contributed that users are likely to notice is directory rename detection in the merge machinery. An amusing bit of trivia about that feature is that GitHub highlighted this feature in their Highlights from Git 2.19, even though this was a feature added in Git 2.18. (I’m not complaining since this meant more exposure to my pet feature, I just found it humorous.)

  • What are you doing on the Git project these days, and why?

    I’m currently creating a replacement for git filter-branch that I’m provisionally naming git repo-filter. My goal is to address what I perceive to be a few glaring deficiencies of the otherwise versatile and cool filter-branch tool. It’s not ready for external consumption at all yet (one problem of many is that it depends on Git patches which I just recently posted to the list). I’ll submit repo-filter to the list when it’s closer to ready.

    I’ve done some work to document inconsistencies and incompatible flag pairs in rebase, due to its multiple different backends. I’m slowly doing some ongoing work to make that behavior more uniform. One particular difference that ties into my other work concerns directory rename detection: I want that capability for rebases as well as merges and cherry-picks. However, directory rename detection in rebase is backend dependent, and the default backend lacks this ability. Dscho has some performance concerns with switching the default backend (fewer than he used to now that the various rebase-in-C rewrites have merged), so fixing that issue might depend on some more merge work first.

    I will also soon get back to my rewrite of the implementation strategy from merge-recursive. While that may not sound too exciting to most users, I think it could net some nice maintainability wins, improve performance (thus perhaps allowing the rebase switchover), fix a variety of edge/corner cases we currently fail, and make some new features easier to implement (e.g. merges in bare repos, cherry-picking to an un-checked-out branch, remerge-diff capability, and tree-based trivial merges).

  • If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?

    I’d be happy if I could be allowed to work full time on Git myself. Getting a full time team? Well…

    • Work on all the stuff I mentioned above (including the features I think my current work would enable)

    • Upstream or at least release and open source our snowflake report tool, to help other groups (if there are any) that also weirdly support way too many customer-specific branches and want a better way to determine what changes have already been ported back to the main development branch.

    • Improve performance on large repositories (in particular, storing and using a partial index that includes some tree entries and omits files underneath, used together with partial clones and sparse checkouts).

    • Add a couple alternative forms of binary storage.

    • Create a better webby merge review tool. One which treats commits as the unit of review and branches as the unit of merging, possibly based on or taking advantage of range-diff. One which encourages writing clean history that is easy for future readers to follow. (This includes making commit messages a fundamental part of what is reviewed, expecting and working with multiple commits as separate small atomic steps, avoiding fixup commits in the final while also not doing user-hostile history-destroying squash merges, and if it wasn’t clear already from the previous requirements it needs to work reasonably with and not be hostile to rebases). Also, it shouldn’t botch commit order (I understand that merges may be difficult and some form of linearization may be in order, but messing up the topology of a linear history is unforgivable; doubly so when you document it as intended), and it shouldn’t use magic refnames. There are probably other issues from the various systems I have used that I could add into the above requirements, but the list already rules out all existing tools that I know of. Git’s (and Linux’s) email based workflow is the only one I know of to get all these things right; however, the problems with getting an email workflow running make it a non-starter for many groups. I wish there were something better than the current offerings to point people to, or that one of the existing offerings would transform into this tool.

  • If you could remove something from Git without worrying about backwards compatibility, what would it be?

    Perhaps just make checkout and reset do just one thing each?

  • What is your favorite Git-related tool/library, outside of Git itself?

    I would have said tbdiff, but now range-diff is built in. I could mention various repository management and code review tools (particularly a few that bundle these capabilities together), but it’s hard to pick a “favorite” as the ones I know all tend to be strong in some area(s) and extremely weak in others.

    I’m not sure if public-inbox.org/git qualifies as a “Git-related tool or library”, but it’s been very helpful. I also use Dscho’s apply-from-public-inbox.sh script to apply submitted patch series locally.

Releases

Other News

Various

Light reading

Git tools and sites

  • Gitote (https://gitote.in) is a new Git repository hosting site, an alternative to GitHub, GitLab, Bitbucket, etc. It is developed from a fork of Gogs, and powered by DigitalOcean. It was announced in the “Gitote is out now” blog post.
  • Bit (on GitHub) helps you organize your components, share them as a team and keep them synced in all your projects; it helps to seamlessly isolate components (sets of files) from any existing repository and share them.
  • Git::IssueManager is a Perl module for managing issues in a git branch within your repository, creating a distributed issue management system; see also Git Rev News Edition #43 for more bug trackers embedded in Git.
  • The pre-commit Python module is a framework for managing and maintaining multiple multi-language pre-commit hooks.
  • lint-staged runs linters on Git staged files.
  • git-fire is a Git plugin to save (add, commit and push) your code in case of an emergency.

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Gabriel Alcaras <gabriel.alcaras@telecom-paristech.fr> with help from Luca Milanesio, Elijah Newren, Derrick Stolee and Johannes Schindelin.