Welcome to the 54th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the month of July 2019.
There is a new tool available for surgery on git repositories: git-filter-repo. It claims to offer many unique features, good performance, and the ability to scale – from making simple history rewrites trivial, to facilitating the creation of entirely new tools which leverage its existing capabilities to handle more complex cases.
You can read more about common use cases and base capabilities of filter-repo, but in this article, I’d like to focus on two things: providing a simple example to give a very brief flavor for git-filter-repo usage, and answering a few likely questions about its purpose and rationale (including a short comparison to other tools). I will provide several links along the way for curious folks to learn more.
Let’s start with a simple example that has come up a lot for me: extracting a piece of an existing repository and preparing it to be merged into some larger monorepository. So, we want to:

- extract the history of a single directory, src/
- have the extracted files live in a my-module/ subdirectory instead
- rename the tags to have a my-module- prefix (to avoid conflicts with tags from other modules in the monorepository)
Doing this with filter-repo is as simple as the following command:
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
(The single quotes are unnecessary, but make it clearer to a human that we are replacing the empty string as a prefix with my-module-.)
By contrast, filter-branch comes with a pile of caveats even once you figure out the necessary (OS-dependent) invocation(s):
git filter-branch \
    --index-filter 'git ls-files \
                        | grep -v ^src/ \
                        | xargs git rm -q --cached; \
                    git ls-files -s \
                        | sed "s%$(printf \\t)%&my-module/%" \
                        | git update-index --index-info; \
                    git ls-files \
                        | grep -v ^my-module/ \
                        | xargs git rm -q --cached' \
    --tag-name-filter 'echo "my-module-$(cat)"' \
    --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ \
    | grep -v refs/tags/my-module- \
    | git update-ref --stdin
git gc --prune=now
BFG is not capable of this type of rewrite, and this type of rewrite is difficult to perform safely using fast-export and fast-import directly.
You can find a lot more examples in filter-repo’s manpage. (If you are curious about the “pile of caveats” mentioned above or the reasons for the extra steps for filter-branch, you can read more details about this example).
There are two well known tools in the repository rewriting space:
and two lesser-known tools:
(While fast-export and fast-import themselves are well known, they are usually thought of as export-to-another-VCS or import-from-another-VCS tools, though they also work for git->git transitions.)
I will briefly discuss each.
It’s natural to ask why, if these well-known tools lacked features I wanted, they could not have been extended instead of creating a new tool. In short, they were philosophically the wrong starting point for extension and they also had the wrong architecture or design to support such an effort.
From the philosophical angle:
I wanted something that made the easy cases simple like BFG, but which would scale up to more difficult cases and have versatility beyond that which filter-branch provides.
From the technical architecture/design angle:
BFG: works on packfiles and packed-refs, directly rewriting tree and blob objects. Roberto proved you can get a lot done with this design through his work on the BFG (as many people who have used his tool can attest), but it does not permit things like differentiating between paths in different directories that share the same basename, nor does it allow renaming paths (except within the same directory). Sadly, this design also runs into a number of roadblocks and limitations even within its intended use case of removing big or sensitive content.
filter-branch: performance really shouldn’t matter for a one-shot tool, but filter-branch can turn a few-hour rewrite (allowing an overnight downtime) into an intractable three-month wait. Further, its slow design leaks through every level of the interface, making it nearly impossible to change anything about it without backward compatibility issues. These issues are well known, but what is less well known is that, even ignoring performance, filter-branch’s usability choices rapidly become conflicting and problematic for users with larger repos and more involved rewrites; those difficulties, again, cannot be ameliorated without breaking backward compatibility.
Some brief impressions about reposurgeon:
I have read the reposurgeon documentation multiple times over the years, and am almost at a point where I feel like I know how to get started with it. I haven’t had a need to convert a CVS or SVN repo in over a decade; if I had such a need, perhaps I’d persevere and learn more about it. I suspect it has some good ideas I could apply to filter-repo. But I haven’t managed to get started with reposurgeon, so clearly my impressions of it should be taken with a grain of salt.
Finally, fast-export and fast-import can be used with a little editing of the fast-export output to handle a number of history rewriting cases. I have done this many times, but it has some drawbacks:
It is easy to write perl one-liners to e.g. try to modify filenames, but you risk accidentally also munging unrelated data such as commit messages, file contents, and branch and tag names.
However, fast-export and fast-import are the right architecture for building a repository filtering tool on top of; they are fast, provide access to almost all aspects of a repository in a very machine-parseable format, and will continue to gain features and capabilities over time (e.g. when replace refs were added, fast-export and fast-import immediately gained support). To create a full repository surgery tool, you “just” need to combine fast-export and fast-import together with a whole lot of parsing and glue, which, in a nutshell, is what filter-repo is.
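As an illustration of what that “parsing and glue” looks like, here is a minimal sketch (this is not filter-repo’s actual code, and the stream handling is simplified; the function name and the src/ → my-module/ prefixes are just taken from the earlier example): it rewrites a path prefix only on file-change lines, and passes `data` payloads through untouched, so a commit message or file content that merely looks like a filemodify line is never munged.

```python
# Minimal sketch of filtering a `git fast-export` stream (assumes an
# ASCII, line-split stream for simplicity; real tools work on bytes).
def filter_stream(lines):
    out = []
    remaining = 0  # bytes of a "data" payload still to pass through verbatim
    for line in lines:
        if remaining > 0:
            out.append(line)           # blob/message payload: never rewrite
            remaining -= len(line) + 1 # +1 for the newline removed by split
            continue
        if line.startswith("data "):
            remaining = int(line.split()[1])
            out.append(line)
        elif line.startswith(("M ", "D ")):
            # filemodify: "M <mode> <dataref> <path>"; filedelete: "D <path>"
            parts = line.split(" ", 3) if line[0] == "M" else line.split(" ", 1)
            path = parts[-1]
            if path.startswith("src/"):
                parts[-1] = "my-module/" + path[len("src/"):]
            out.append(" ".join(parts))
        else:
            out.append(line)           # commit, mark, author, refs, etc.
    return out

# e.g. filter_stream(["M 100644 :1 src/foo.txt"])
# → ["M 100644 :1 my-module/foo.txt"]
```

Tracking the `data` byte counts addresses exactly the pitfall mentioned above: a naive line-oriented sed or perl filter applied to the same stream would also have rewritten commit messages that happen to start with “M src/”.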
But to circle back to the question of improving existing tools, during the development of filter-repo and its predecessor, lots of improvements to both fast-export and fast-import were submitted and included in git.git.
(Also, filter-repo started in early 2009 as git_fast_filter.py and therefore technically predates both BFG and reposurgeon.)
One could ask why this new command is not written in C like most of Git. While that would have several advantages, it doesn’t meet the necessary design requirements. See the “VERSATILITY” section of the manpage or see the “Versatility” section under the Design Rationale of the README.
Technically, we could perhaps provide a mechanism for people to write and compile plugins that a builtin command could load, but having users write filtering functions in C sounds suboptimal, and requiring gcc for filter-repo sounds more onerous than using Python.
This was just a quick intro to filter-repo, and I’ve provided a lot of links above if you want to learn more. Just a few more that might be of interest:
Carlo Arenas recently commented on a patch by Emily Shaffer that moving a declaration out of a “for” loop would allow building on a CentOS 6 box.
Junio Hamano, the Git maintainer, replied to Carlo that we indeed “still reject variable definition in for loop control” even if “for past several years we’ve been experimenting with a bit more modern features”.
Junio then sent a patch to update the Documentation/CodingGuidelines file. This file describes which coding conventions are, and should be, used by developers working on the Git codebase.
One very important part of these conventions is the set of C language features that developers are allowed, or not allowed, to use.
For a very long time, to be compatible with as many systems as possible, only features that are part of the C89 standard were allowed. Since 2012, though, features from the C99 standard have very slowly been introduced.
When these new features were introduced, they came in “weather balloon” patches: very limited changes that are easy to revert in case someone complains.
Fortunately, in most cases (though not in the “for” loop case), no one has complained about being unable to compile Git’s code since these patches were merged, which means that code using these new features can now be more widely accepted.
The goal of Junio’s patch was to document that fact and these new features at the same time.
One of the new features is allowing an enum definition whose last element is followed by a comma. Jonathan Nieder replied to Junio that someone had complained about that in 2010, but, as no complaint has been made since the feature was reintroduced into the code base in 2012, it is now considered acceptable.
Jonathan even suggested that we “say that the last element should always be followed by a comma, for ease of later patching”, and Junio found this idea interesting.
A few more comments were made by Jonathan and Bryan Turner about small possible improvements to Junio’s patch. Junio then sent an updated version of the patch which has since been merged to the master branch.
Who are you and what do you do?
My name is Jean-Noël Avila, father of three daughters and husband of an incredibly understanding wife. I graduated a long time ago from a French engineering school, with a specialty in signal processing, not really in computer science.
Professionally, I work in the R&D team of a small company that makes industrial online measurement systems, and guess what, we’re using Git as a ubiquitous revision control system (software, documentation); my colleagues know who to call for any issue. Sadly though, I don’t work on Git.
What would you name your most important contribution to Git?
For the Git project itself, my most important and only contribution has been to deliver the French localization of the software since… 2013 (gasp!) and occasionally to propose some patches to fix internationalization issues.
At the beginning, I proposed some patches to fix glob-pattern matching in the .gitignore file, but even though they fixed the issue, it turned out later that they had introduced a performance regression. So I chose to stick to a less harmful activity in the community (although a bad translation can be quite harmful).
In the Git ecosystem more generally, I’ve been working on translating the Pro Git book to French and managing with Peff and Pedro (@pedrorijo91) the publishing of the translations of the book on http://git-scm.com. So, to sum it up, not working on the core, but on the public interfaces of the project.
What are you doing on the Git project these days, and why?
Following the path of localization: what is a localized application worth if the documentation is still an impediment? With this in mind, at the beginning of this year, I started an effort to translate the manual pages into French and to offer the translation framework put in place for this purpose to other languages.
So far, only two languages have translated content, but I expect to have some more soon. The pages are already available at http://git-scm.com/docs/. What is still missing is packaging for other distributions of Git; maybe that will come when we have more content to provide.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?
Functionally, we’ve been bitten by some wrong merges of concurrent branches, and I wish Git had some knowledge of patch algebra to better handle these corner cases. I know that this would be quite orthogonal to the present design, but even just detecting and warning that something nasty is happening would prevent surprises (to say the least) for users with complex histories.
From a translator’s standpoint, a project that would not require big expertise while still being useful would be to introduce rules and factorize internationalization strings. This part of Git is still the wild west in some respects, with a lot of freedom left to developers to choose their own formatting. A sizeable part of these strings are almost identical: with or without an ending period, with a quoted or unquoted %s, with uppercase or not. Some strings are very similar: “foo and bar are mutually exclusive”, and so on. In the end, the number of segments to translate in Git amounts to 4674 for v2.23.0, which basically bars the entry of new translations.

As an aside, providing one po file for core strings and another one for less-used strings would also help kickstart translations by letting translators of new languages focus on more productive work. I understand this kind of task would be a Sisyphean one, but it would really help the community grow by giving access to Git to users less educated in computer science.
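To make the idea of factoring near-identical strings concrete, here is a toy sketch (the message strings below are invented for illustration and are not actual Git messages): it normalizes away the cosmetic differences mentioned above – trailing period, quoting around %s, capitalization – and groups messages that could collapse into a single translatable segment.

```python
# Toy audit of how many translatable segments near-identical messages
# could be factored into. The normalization rules mirror the cosmetic
# differences described in the text; the sample messages are made up.
import re
from collections import defaultdict

def normalize(msg):
    msg = msg.rstrip(".")                          # ignore trailing period
    msg = re.sub(r"['\"`]*%s['\"`]*", "%s", msg)   # ignore quoting around %s
    return msg.lower()                             # ignore capitalization

def factor(messages):
    groups = defaultdict(list)
    for m in messages:
        groups[normalize(m)].append(m)
    return groups

msgs = [
    "cannot open '%s'",    # these four would collapse into one segment
    "Cannot open %s",
    "cannot open `%s`.",
    "cannot open %s.",
    "unable to write %s",  # genuinely distinct wording stays separate
]
groups = factor(msgs)
```

With rules like these in place, translators would see two segments instead of five; scaled up to thousands of messages, that is the kind of reduction that could lower the barrier for new languages.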
If you could remove something from Git without worrying about backwards compatibility, what would it be?
Translating Git and the manpages gives a good overview of what’s available and what is being introduced. So far, I haven’t experienced anything strikingly bad about a particular feature. If anything, I would make rebasing require a more advanced knowledge of Git’s internals by not providing such an easy way to shoot oneself in the foot.
What is your favorite Git-related tool/library, outside of Git itself?
In fact, in my daily work with Git, I don’t use the command line that much. I’m an Emacs fan, and Magit is really a miraculous tool when it comes to interacting with a Git repository from my favorite editor.
Git tools and sites
This edition of Git Rev News was curated by Christian Couder <firstname.lastname@example.org>, Jakub Narębski <email@example.com>, Markus Jansen <firstname.lastname@example.org> and Gabriel Alcaras <email@example.com> with help from Elijah Newren, Jeff Hostetler, Andrew Ardill and Jean-Noël Avila.