Git Rev News: Edition 54 (August 21st, 2019)

Welcome to the 54th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of July 2019.

An Introduction to git-filter-repo (written by Elijah Newren)

There is a new tool available for surgery on git repositories: git-filter-repo. It claims to have many new unique features, good performance, and an ability to scale – from making simple history rewrites trivial, to facilitating the creation of entirely new tools which leverage existing capabilities to handle more complex cases.

You can read more about common use cases and base capabilities of filter-repo, but in this article, I’d like to focus on two things: providing a simple example to give a brief flavor of git-filter-repo usage, and answering a few likely questions about its purpose and rationale (including a short comparison to other tools). I will provide several links along the way for curious folks to learn more.

A simple example

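Let’s start with a simple example that has come up a lot for me: extracting a piece of an existing repository and preparing it to be merged into some larger monorepository. So, we want to keep only the history of the src/ directory, move all of its files under a new my-module/ subdirectory, and rename every tag to carry a my-module- prefix.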

Doing this with filter-repo is as simple as the following command:

  git filter-repo --path src/ --to-subdirectory-filter my-module --tag-rename '':'my-module-'

(The single quotes are unnecessary, but make it clearer to a human that we are replacing the empty string as a prefix with my-module-.)
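
To make the OLD:NEW prefix idea concrete, here is a minimal shell sketch of the kind of renaming that --tag-rename is described as performing. It is not filter-repo's implementation: the function name rename_tag_prefix and the sample tag names are made up for illustration, and the sketch assumes that tags which do not start with OLD are left untouched.

```shell
# Hypothetical helper (not part of filter-repo): read tag names on stdin and
# replace a leading OLD prefix with NEW. An empty OLD matches every tag, so
# NEW is simply prepended.
rename_tag_prefix() {
    old=$1
    new=$2
    while IFS= read -r tag; do
        case $tag in
            "$old"*) printf '%s%s\n' "$new" "${tag#"$old"}" ;;
            *)       printf '%s\n' "$tag" ;;   # assumption: non-matching tags are kept as-is
        esac
    done
}

# Sample tag names are invented for the demo.
printf 'v1.0\nv2.0\n' | rename_tag_prefix '' 'my-module-'
# prints:
#   my-module-v1.0
#   my-module-v2.0
```

With a non-empty OLD, for example rename_tag_prefix 'v' 'release-', only tags beginning with v would be renamed in this sketch.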

By contrast, filter-branch comes with a pile of caveats even once you figure out the necessary (OS-dependent) invocation(s):

  git filter-branch \
      --index-filter 'git ls-files \
                          | grep -v ^src/ \
                          | xargs git rm -q --cached; \
                      git ls-files -s \
                          | sed "s%$(printf \\t)%&my-module/%" \
                          | git update-index --index-info; \
                      git ls-files \
                          | grep -v ^my-module/ \
                          | xargs git rm -q --cached' \
      --tag-name-filter 'echo "my-module-$(cat)"' --prune-empty -- --all
  git clone file://$(pwd) newcopy
  cd newcopy
  git for-each-ref --format="delete %(refname)" refs/tags/ \
      | grep -v refs/tags/my-module- \
      | git update-ref --stdin
  git gc --prune=now

BFG is not capable of this type of rewrite, and this type of rewrite is difficult to perform safely using fast-export and fast-import directly.

You can find a lot more examples in filter-repo’s manpage. (If you are curious about the “pile of caveats” mentioned above or the reasons for the extra steps for filter-branch, you can read more details about this example.)
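
The least obvious piece of the filter-branch invocation above is probably the sed expression, which inserts my-module/ right after the tab that separates the mode, hash and stage fields from the path in `git ls-files -s` output. The following is a small sketch of just that step, run on a hand-written line rather than on a real index; the helper name prefix_paths and the hash value are placeholders, and the prefix is assumed not to contain `%` or `&`.

```shell
# Hypothetical helper: prefix the path column of `git ls-files -s` style
# output (fields, then a tab, then the path) with a directory name.
prefix_paths() {
    tab=$(printf '\t')
    # "&" re-emits the matched tab; the prefix is inserted after it.
    sed "s%$tab%&$1/%"
}

# The object hash below is a made-up placeholder, not a real object.
printf '100644 0123456789abcdef0123456789abcdef01234567 0\tsrc/main.c\n' \
    | prefix_paths my-module
```

Only the first tab on each line is touched, so the mode, hash and stage fields pass through unchanged and the path gains the new leading directory.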

Why a new tool instead of contributing to other tools?

There are two well-known tools in the repository rewriting space, filter-branch and BFG, and two lesser-known tools, reposurgeon and the fast-export/fast-import pair.

(While fast-export and fast-import themselves are well known, they are usually thought of as export-to-another-VCS or import-from-another-VCS tools, though they also work for git->git transitions.)

I will briefly discuss each.

filter-branch and BFG

It’s natural to ask why, if these well-known tools lacked features I wanted, they could not have been extended instead of creating a new tool. In short, they were philosophically the wrong starting point for extension and they also had the wrong architecture or design to support such an effort.

From the philosophical angle:

I wanted something that made the easy cases simple like BFG, but which would scale up to more difficult cases and have versatility beyond that which filter-branch provides.

From the technical architecture/design angle:

reposurgeon

Some brief impressions about reposurgeon:

I have read the reposurgeon documentation multiple times over the years, and am almost at a point where I feel like I know how to get started with it. I haven’t had a need to convert a CVS or SVN repo in over a decade; if I had such a need, perhaps I’d persevere and learn more about it. I suspect it has some good ideas I could apply to filter-repo. But I haven’t managed to get started with reposurgeon, so clearly my impressions of it should be taken with a grain of salt.

fast-export and fast-import

Finally, fast-export and fast-import can be used with a little editing of the fast-export output to handle a number of history rewriting cases. I have done this many times, but it has some drawbacks:

However, fast-export and fast-import are the right architecture for building a repository filtering tool on top of; they are fast, provide access to almost all aspects of a repository in a very machine-parseable format, and will continue to gain features and capabilities over time (e.g. when replace refs were added, fast-export and fast-import immediately gained support). To create a full repository surgery tool, you “just” need to combine fast-export and fast-import together with a whole lot of parsing and glue, which, in a nutshell, is what filter-repo is.
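
As a rough illustration of the "little editing of the fast-export output" approach, and of why it is fragile, here is a minimal sketch of a stream filter that rewrites one email address. The function name rewrite_email and the addresses are hypothetical. The filter is deliberately naive: it does not track `data <n>` blocks, so a commit message or file line that happens to start with "author " and contains the old address would be edited too, leaving the byte count wrong, and binary blobs are not safe to pass through it.

```shell
# Hypothetical helper: rewrite <OLD> to <NEW> on the author, committer and
# tagger lines of a fast-export stream read from stdin. Uses index/substr so
# the address is matched literally rather than as a regular expression.
rewrite_email() {
    awk -v old="<$1>" -v new="<$2>" '
        /^(author|committer|tagger) / {
            i = index($0, old)
            if (i > 0) $0 = substr($0, 1, i - 1) new substr($0, i + length(old))
        }
        { print }
    '
}

# Demo on a hand-written fragment shaped like fast-export output.
printf '%s\n' \
    'commit refs/heads/master' \
    'author A U Thor <old@example.com> 1563000000 +0000' \
    'committer A U Thor <old@example.com> 1563000000 +0000' \
    'data 8' \
    'initial' |
    rewrite_email old@example.com new@example.com

# Against real repositories the same filter would sit between the two
# commands, importing into a freshly initialized repository, e.g.:
#   git -C old-repo fast-export --all \
#       | rewrite_email old@example.com new@example.com \
#       | git -C new-repo fast-import
```

A tool built on this architecture has to do the parsing that this sketch skips, which is the "parsing and glue" referred to above.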

Upstream improvements

But to circle back to the question of improving existing tools, during the development of filter-repo and its predecessor, lots of improvements to both fast-export and fast-import were submitted and included in git.git.

(Also, filter-repo started in early 2009 as git_fast_filter.py and therefore technically predates both BFG and reposurgeon.)

Why not a builtin command?

One could ask why this new command is not written in C like most of Git. While that would have several advantages, it doesn’t meet the necessary design requirements. See the “VERSATILITY” section of the manpage or see the “Versatility” section under the Design Rationale of the README.

Technically, we could perhaps provide a mechanism for people to write and compile plugins that a builtin command could load, but having users write filtering functions in C sounds suboptimal, and requiring gcc for filter-repo sounds more onerous than requiring Python.

Where to from here?

This was just a quick intro to filter-repo, and I’ve provided a lot of links above if you want to learn more. Just a few more that might be of interest:

Discussions

Reviews

Developer Spotlight: Jean-Noël Avila

Releases

Other News

Various

Light reading

Git tools and sites

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Gabriel Alcaras <gabriel.alcaras@telecom-paristech.fr> with help from Elijah Newren, Jeff Hostetler, Andrew Ardill and Jean-Noël Avila.