Welcome to the 54th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the month of July 2019.
There is a new tool available for surgery on git repositories: git-filter-repo. It claims to offer many unique features, good performance, and the ability to scale – from making simple history rewrites trivial, to facilitating the creation of entirely new tools which leverage its existing capabilities to handle more complex cases.
You can read more about common use cases and base capabilities of filter-repo, but in this article, I’d like to focus on two things: providing a simple example to give a very brief flavor for git-filter-repo usage, and answering a few likely questions about its purpose and rationale (including a short comparison to other tools). I will provide several links along the way for curious folks to learn more.
Let’s start with a simple example that has come up a lot for me: extracting a piece of an existing repository and preparing it to be merged into some larger monorepository. So, we want to:

- extract the history of a single directory, src/
- have the extracted files live in a my-module/ subdirectory instead
- rename the tags to have a my-module- prefix (to avoid conflicts with tags from other modules in the monorepository)
Doing this with filter-repo is as simple as the following command:
git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'
(The single quotes are unnecessary, but make it clearer to a human that we are replacing the empty string as a prefix with my-module-.)
By contrast, filter-branch comes with a pile of caveats even once you figure out the necessary (OS-dependent) invocation(s):
git filter-branch \
    --index-filter 'git ls-files \
                        | grep -v ^src/ \
                        | xargs git rm -q --cached; \
                    git ls-files -s \
                        | sed "s%$(printf \\t)%&my-module/%" \
                        | git update-index --index-info; \
                    git ls-files \
                        | grep -v ^my-module/ \
                        | xargs git rm -q --cached' \
    --tag-name-filter 'echo "my-module-$(cat)"' \
    --prune-empty -- --all
git clone file://$(pwd) newcopy
cd newcopy
git for-each-ref --format="delete %(refname)" refs/tags/ \
    | grep -v refs/tags/my-module- \
    | git update-ref --stdin
git gc --prune=now
BFG is not capable of this type of rewrite, and this type of rewrite is difficult to perform safely using fast-export and fast-import directly.
You can find a lot more examples in filter-repo’s manpage. (If you are curious about the “pile of caveats” mentioned above or the reasons for the extra steps for filter-branch, you can read more details about this example).
There are two well known tools in the repository rewriting space:
and two lesser-known tools:
(While fast-export and fast-import themselves are well known, they are usually thought of as export-to-another-VCS or import-from-another-VCS tools, though they also work for git->git transitions.)
I will briefly discuss each.
It’s natural to ask why, if these well-known tools lacked features I wanted, they could not have been extended instead of creating a new tool. In short, they were philosophically the wrong starting point for extension and they also had the wrong architecture or design to support such an effort.
From the philosophical angle:
I wanted something that made the easy cases simple like BFG, but which would scale up to more difficult cases and have versatility beyond that which filter-branch provides.
From the technical architecture/design angle:
BFG: works on packfiles and packed-refs, directly rewriting tree and blob objects. Roberto proved you can get a lot done with this design through his work on the BFG (as many people who have used his tool can attest), but it does not permit things like differentiating between paths in different directories that share the same basename, nor does it allow renaming paths (except within the same directory). Sadly, this design also runs into a number of roadblocks and limitations even within its intended use case of removing big or sensitive content.
filter-branch: performance really shouldn’t matter for a one-shot tool, but filter-branch can turn a few-hour rewrite (allowing an overnight downtime) into an intractable three-month wait. Further, its slow design leaks through every level of the interface, making it nearly impossible to change anything about it without backward compatibility issues. These issues are well known, but what is less well known is that, even ignoring performance, filter-branch’s usability choices rapidly become conflicting and problematic for users with larger repos and more involved rewrites; those difficulties, again, cannot be ameliorated without breaking backward compatibility.
Some brief impressions about reposurgeon:
I have read the reposurgeon documentation multiple times over the years, and am almost at a point where I feel like I know how to get started with it. I haven’t had a need to convert a CVS or SVN repo in over a decade; if I had such a need, perhaps I’d persevere and learn more about it. I suspect it has some good ideas I could apply to filter-repo. But I haven’t managed to get started with reposurgeon, so clearly my impressions of it should be taken with a grain of salt.
Finally, fast-export and fast-import can be used with a little editing of the fast-export output to handle a number of history rewriting cases. I have done this many times, but it has some drawbacks:
It is easy to write perl one-liners to e.g. try to modify filenames, but you risk accidentally also munging unrelated data such as commit messages, file contents, and branch and tag names.
However, fast-export and fast-import are the right architecture for building a repository filtering tool on top of; they are fast, provide access to almost all aspects of a repository in a very machine-parseable format, and will continue to gain features and capabilities over time (e.g. when replace refs were added, fast-export and fast-import immediately gained support). To create a full repository surgery tool, you “just” need to combine fast-export and fast-import together with a whole lot of parsing and glue, which, in a nutshell, is what filter-repo is.
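As an illustration of what that “parsing and glue” looks like, here is a minimal sketch (this is not filter-repo’s actual code, and the stream handling is simplified; the function name and the src/ → my-module/ prefixes are just taken from the earlier example): it rewrites a path prefix only on file-change lines, and passes `data` payloads through untouched, so a commit message or file content that merely looks like a filemodify line is never munged.

```python
# Minimal sketch of filtering a `git fast-export` stream (assumes an
# ASCII, line-split stream for simplicity; real tools work on bytes).
def filter_stream(lines):
    out = []
    remaining = 0  # bytes of a "data" payload still to pass through verbatim
    for line in lines:
        if remaining > 0:
            out.append(line)           # blob/message payload: never rewrite
            remaining -= len(line) + 1 # +1 for the newline removed by split
            continue
        if line.startswith("data "):
            remaining = int(line.split()[1])
            out.append(line)
        elif line.startswith(("M ", "D ")):
            # filemodify: "M <mode> <dataref> <path>"; filedelete: "D <path>"
            parts = line.split(" ", 3) if line[0] == "M" else line.split(" ", 1)
            path = parts[-1]
            if path.startswith("src/"):
                parts[-1] = "my-module/" + path[len("src/"):]
            out.append(" ".join(parts))
        else:
            out.append(line)           # commit, mark, author, refs, etc.
    return out

# e.g. filter_stream(["M 100644 :1 src/foo.txt"])
# → ["M 100644 :1 my-module/foo.txt"]
```

Tracking the `data` byte counts addresses exactly the pitfall mentioned above: a naive line-oriented sed or perl filter applied to the same stream would also have rewritten commit messages that happen to start with “M src/”.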
But to circle back to the question of improving existing tools, during the development of filter-repo and its predecessor, lots of improvements to both fast-export and fast-import were submitted and included in git.git.
(Also, filter-repo started in early 2009 as git_fast_filter.py and therefore technically predates both BFG and reposurgeon.)
One could ask why this new command is not written in C like most of Git. While that would have several advantages, it doesn’t meet the necessary design requirements. See the “VERSATILITY” section of the manpage or see the “Versatility” section under the Design Rationale of the README.
Technically, we could perhaps provide a mechanism for people to write and compile plugins that a builtin command could load, but having users write filtering functions in C sounds suboptimal, and requiring gcc for filter-repo sounds more onerous than using Python.
This was just a quick intro to filter-repo, and I’ve provided a lot of links above if you want to learn more. Just a few more that might be of interest:
Carlo Arenas recently commented on a patch by Emily Shaffer that moving a declaration out of a “for” loop would allow building on a CentOS 6 box.
Junio Hamano, the Git maintainer, replied to Carlo that we indeed “still reject variable definition in for loop control” even if “for past several years we’ve been experimenting with a bit more modern features”.
Junio then sent a patch to update the Documentation/CodingGuidelines file. This file describes which coding conventions are, and should be, used by developers working on the Git codebase.
One very important part of these conventions is the set of C language features that developers are allowed, or not allowed, to use.
For a very long time, to be compatible with as many systems as possible, only features that are part of the C89 standard were allowed. Since 2012, though, features from the C99 standard have very slowly been introduced.
When these new features were introduced, they came in “weather balloon” patches: very limited changes that are easy to revert in case someone complains.
Fortunately, in most cases (though not in the “for” loop case), no one has complained about being unable to compile Git’s code since these patches were merged, which means that code using these new features can now be more widely accepted.
The goal of Junio’s patch was to document that fact and these new features at the same time.
One of the new features is allowing an enum definition whose last element is followed by a comma. Jonathan Nieder replied to Junio that someone had complained about that in 2010, but, as no complaint has been made since the feature was reintroduced into the code base in 2012, it is now considered acceptable.
Jonathan even suggested that we “say that the last element should always be followed by a comma, for ease of later patching”, and Junio found this idea interesting.
A few more comments were made by Jonathan and Bryan Turner about small possible improvements to Junio’s patch. Junio then sent an updated version of the patch which has since been merged to the master branch.
Who are you and what do you do?
My name is Jean-Noël Avila, father of three daughters and husband of an incredibly understanding wife. I graduated a long time ago from a French engineering school, with a specialty in signal processing, not really in computer science.
Professionally, I work in the R&D team of a small company that makes industrial online measurement systems, and guess what, we’re using Git as a ubiquitous revision control system (software, documentation); my colleagues know who to call for any issue. Sadly though, I don’t work on Git.
What would you name your most important contribution to Git?
For the Git project itself, my most important and only contribution has been to deliver the French localization of the software since… 2013 (gasp!) and occasionally to propose some patches to fix internationalization issues.
At the beginning, I proposed some patches to fix glob-pattern matching in the .gitignore file, but even though they fixed the issue, it turned out later that they had introduced a performance regression. So I chose to stick to a less harmful activity in the community (although a bad translation can be quite harmful).
In the Git ecosystem more generally, I’ve been working on translating the Pro Git book to French and managing with Peff and Pedro (@pedrorijo91) the publishing of the translations of the book on http://git-scm.com. So, to sum it up, not working on the core, but on the public interfaces of the project.
What are you doing on the Git project these days, and why?
Following the path of localization: what is a localized application worth if the documentation is still an impediment? With this in mind, at the beginning of this year, I started an effort to translate the manual pages into French and to offer the translation framework put in place for this purpose to other languages.
So far, only two languages have translated content, but I expect to have some more soon. The pages are already available at http://git-scm.com/docs/. What is still missing is packaging for other distributions of Git; maybe that will come when we have more content to provide.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?
Functionally, we’ve been bitten by some wrong merges of concurrent branches, and I wish Git had some knowledge of patch algebra to better handle these corner cases. I know that this would be quite orthogonal to the present design, but even just detecting and warning that something nasty is happening would prevent surprises (to say the least) for users with complex histories.
From a translator’s standpoint, a project that would not require big expertise while still being useful would be to introduce rules and factorize internationalization strings. This part of Git is still the wild west in some respects, with a lot of freedom left to developers to choose their own formatting. A sizeable part of these strings are almost identical: with or without an ending period, with a quoted or unquoted %s, with uppercase or not. Some strings are very similar: “foo and bar are mutually exclusive”, and so on. In the end, the number of segments to translate in Git amounts to 4674 for v2.23.0, which basically bars the entry of new translations.

As an aside, providing one po file for core strings and another one for less-used strings would also help kickstart translations by letting translators of new languages focus on more productive work. I understand this kind of task would be a Sisyphean one, but it would really help the community grow by giving access to Git to users less educated in computer science.
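To make the idea of factoring near-identical strings concrete, here is a toy sketch (the message strings below are invented for illustration and are not actual Git messages): it normalizes away the cosmetic differences mentioned above – trailing period, quoting around %s, capitalization – and groups messages that could collapse into a single translatable segment.

```python
# Toy audit of how many translatable segments near-identical messages
# could be factored into. The normalization rules mirror the cosmetic
# differences described in the text; the sample messages are made up.
import re
from collections import defaultdict

def normalize(msg):
    msg = msg.rstrip(".")                          # ignore trailing period
    msg = re.sub(r"['\"`]*%s['\"`]*", "%s", msg)   # ignore quoting around %s
    return msg.lower()                             # ignore capitalization

def factor(messages):
    groups = defaultdict(list)
    for m in messages:
        groups[normalize(m)].append(m)
    return groups

msgs = [
    "cannot open '%s'",    # these four would collapse into one segment
    "Cannot open %s",
    "cannot open `%s`.",
    "cannot open %s.",
    "unable to write %s",  # genuinely distinct wording stays separate
]
groups = factor(msgs)
```

With rules like these in place, translators would see two segments instead of five; scaled up to thousands of messages, that is the kind of reduction that could lower the barrier for new languages.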
If you could remove something from Git without worrying about backwards compatibility, what would it be?
Translating Git and the manpages gives a good overview of what’s available and what is being introduced. So far, I haven’t experienced anything strikingly bad about a particular feature. If anything, I would make rebasing require a more advanced knowledge of Git’s internals by not providing such an easy way to shoot oneself in the foot.
What is your favorite Git-related tool/library, outside of Git itself?
In fact, in my daily work with Git, I don’t use the command line that much. I’m an Emacs fan, and Magit is really a miraculous tool when it comes to interacting with a Git repository from my favorite editor.
Git tools and sites
This edition of Git Rev News was curated by Christian Couder <firstname.lastname@example.org>, Jakub Narębski <email@example.com>, Markus Jansen <firstname.lastname@example.org> and Gabriel Alcaras <email@example.com> with help from Elijah Newren, Jeff Hostetler, Andrew Ardill and Jean-Noël Avila.