Git Rev News: Edition 60 (February 19th, 2020)

Welcome to the 60th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of January 2020.

The Pros and Cons of Reposurgeon (written by Eric S. Raymond)

On January 12th 2020, the history of the GNU Compiler Collection was lifted from Subversion to Git. At 280K commits, with a history containing traces of two previous version-control systems (CVS and RCS), this was the largest and most complex repository conversion of an open-source project ever. It swamped the previous record-holder – Emacs’s move from Bazaar to Git back in 2011 – by an order of magnitude.

Both those conversions were done by reposurgeon. Neither of them could practically have been performed by any other conversion tool available. This article will explain why that is, and under what circumstances you might consider using reposurgeon yourself.

Let’s start with a brief description of what reposurgeon actually does. When you use it, you start by reading in a version-control repository… but actually, that’s not quite right. What reposurgeon actually does is read in a git fast-import stream. It looks like it reads repositories because it knows how to call front ends that use exporters such as git-fast-export and cvs-fast-export to serialize a repository for it.

Actually, that’s not quite right either. Subversion doesn’t have an exporter – there is no svn-fast-export (well, not one that works for more than trivial cases, anyway). Instead, reposurgeon reads the native serialization produced by Subversion’s svnadmin dump tool. Internally, this is massaged into the equivalent of a git fast-import stream and represented as one inside reposurgeon.

There are reposurgeon-compatible exporters for RCS, CVS, bzr, hg, SRC, bk, and of course git itself. With a little extra work using sccs2rcs it’s possible to reach all the way back to collections of SCCS files.

Now that you’ve caught your repository, what can you do with it?

I observed earlier that what you have, internally, is a deserialized version of a git fast-import stream. A productive way to think about what reposurgeon does is to remember that this is basically just a DAG (directed acyclic graph) with text attached to the nodes. Now think of reposurgeon as an editor for this graph and its nodes. Then, think of it as a DSL (domain-specific language) designed to be scripted – that is, designed to reproducibly apply editing procedures to this graph.
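To make the "DAG with text attached" picture concrete, here is a minimal Python sketch that reads a tiny subset of the fast-import stream format into commit nodes linked by their parent marks. It is only an illustration and not reposurgeon's own code: the Commit class, the parse_stream function, and the SAMPLE stream are hypothetical names invented for this example. It ignores file operations and assumes ASCII payloads, so that the byte counts on "data" lines equal character counts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    """One node of the graph: a ref, a mark, a message, and parent marks."""
    ref: str
    mark: str = ""
    message: str = ""
    parents: List[str] = field(default_factory=list)


def parse_stream(stream: str) -> Dict[str, Commit]:
    """Parse a simplified fast-import stream into a {mark: Commit} mapping.

    Assumption: ASCII-only input with no inline file data, so every
    'data <n>' payload seen inside a commit is that commit's message.
    """
    commits: Dict[str, Commit] = {}
    current: Optional[Commit] = None
    pos = 0
    while pos < len(stream):
        end = stream.find("\n", pos)
        if end == -1:
            end = len(stream)
        line = stream[pos:end]
        pos = end + 1
        keyword = line.split(" ", 1)[0]
        if keyword == "commit":
            current = Commit(ref=line[len("commit "):])
        elif keyword in ("blob", "reset", "tag"):
            current = None  # not a commit; only its payload is skipped below
        elif keyword == "data":
            size = int(line[len("data "):])
            payload = stream[pos:pos + size]
            pos += size  # jump over the payload, which may contain newlines
            if current is not None:
                current.message = payload
        elif current is not None and keyword == "mark":
            current.mark = line[len("mark "):]
            commits[current.mark] = current
        elif current is not None and keyword in ("from", "merge"):
            current.parents.append(line[len(keyword) + 1:])
    return commits


# A hand-written two-commit stream; the identities are placeholders.
SAMPLE = (
    "commit refs/heads/master\n"
    "mark :1\n"
    "committer A U Thor <author@example.com> 1577836800 +0000\n"
    "data 8\n"
    "initial\n"
    "commit refs/heads/master\n"
    "mark :2\n"
    "committer A U Thor <author@example.com> 1577836900 +0000\n"
    "data 7\n"
    "second\n"
    "from :1\n"
)

if __name__ == "__main__":
    for mark, commit in parse_stream(SAMPLE).items():
        print(mark, commit.ref, commit.parents, repr(commit.message))
```

Once the stream is held as a mapping like this, "editing the repository" amounts to rewriting messages, marks, and parent lists before serializing the graph again.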

So the general answer to “what can you do with it” is “anything you want to”. I enjoy thinking about and implementing DSLs, and once I had the basic design idea it was pretty much inevitable that I was going to write the most general set of primitives I could imagine – and I have a very fertile imagination.

Elijah Newren’s aside on reposurgeon in Git Rev News 54 described it as “GDB for history rewriting”. That’s a pretty good analogy, actually. Better than even I knew until recently, because it turns out the Python Cmd library I originally used to write its command interpreter was designed to emulate the interface style of gdb and earlier symbolic debuggers.

Accordingly, you can immediately use reposurgeon for a lot of relatively simple tasks like (1) removing extremely bulky content that shouldn’t have been checked in, (2) partitioning and merging repositories, (3) transcoding Latin-1 metadata to UTF-8, and (4) debubbling an unnecessarily complex history to make reading it easier.

Often, though, those things can be done with other tools, such as Elijah Newren’s git-filter-repo. It’s repository conversions for which you are likely to actually need the full power of a domain-specific language designed for repository surgery.

Which brings us to how you write out your graph as a live repository. Reposurgeon doesn’t do that directly either. When it needs to write out a repository, it hands a git fast-import stream to an importer back end. That could be git fast-import itself, or the corresponding importers for hg, bzr, darcs, bk, RCS, or SRC.

Here’s what reading in and immediately converting a small Subversion dump would look like:

$ reposurgeon
reposurgeon% read <foo.svn
23 svn revisions (0K/s)
* foo
reposurgeon% prefer git
git is the preferred type.
reposurgeon% rebuild bar
reposurgeon: rebuild is complete.
reposurgeon: no preservations.
reposurgeon% 

In theory you now have a Git repository named “bar” in your current directory that is a perfect translation of foo. In practice, for any nontrivial repository, you probably have a bit of a mess on your hands.

If you had read in any Git repository and written it out again, you’d get a perfect copy. But when you’re moving histories between different version-control systems, you have to deal with the mismatch between the source system’s model of version control and the target’s.

A good example of this is the fact that Subversion doesn’t have anything directly corresponding to a Git tag. A Subversion tag is actually a directory copy operation with a target under the tags/ directory. The copy operation leaves a commit in place which, if moved literally to gitspace, would just be junk. What you want is to move the metadata of that commit to an annotated tag.
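The tag mismatch can be pictured with a small Python sketch. It is a hedged illustration of the idea only, not reposurgeon's implementation: SvnCopyCommit, AnnotatedTag, and tagify are hypothetical names, and the rule used here (a commit under refs/tags/ that carries no file changes becomes an annotated tag on its parent) is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SvnCopyCommit:
    """A commit as it would look after a literal translation to gitspace."""
    branch: str            # e.g. "refs/tags/v1.0" for a copy under tags/
    parent: Optional[str]  # mark or ID of the commit that was copied
    committer: str
    message: str
    file_ops: int          # number of file changes carried by the commit


@dataclass
class AnnotatedTag:
    name: str
    target: str
    tagger: str
    message: str


TAG_PREFIX = "refs/tags/"


def tagify(commit: SvnCopyCommit) -> Optional[AnnotatedTag]:
    """Turn a content-free tag-copy commit into an annotated tag.

    Returns None when the commit is not under refs/tags/, has no parent,
    or carries file changes (in which case dropping it would lose content).
    """
    if not commit.branch.startswith(TAG_PREFIX):
        return None
    if commit.parent is None or commit.file_ops:
        return None
    return AnnotatedTag(
        name=commit.branch[len(TAG_PREFIX):],
        target=commit.parent,        # the tag points at the copied commit
        tagger=commit.committer,     # metadata moves from commit to tag
        message=commit.message,
    )


if __name__ == "__main__":
    junk = SvnCopyCommit("refs/tags/v1.0", ":41",
                         "J Random <jr@example.com>", "Tag release 1.0\n", 0)
    print(tagify(junk))
```

The committer and message survive on the tag, while the otherwise empty commit can be dropped from the history.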

Many attempts at importers silently botch this in practice, but it can be handled automatically in theory – and reposurgeon does that. The mess you’re likely to have on your hands anyway is due to Subversion operator errors, scar tissue for a previous conversion out of CVS, and use of git-svn as a live gateway to the repository.

The most common symptom of all these error sources is misplaced branch joins; in extreme cases you may even have disconnected branches. Reposurgeon enables you to audit for and repair this kind of defect. Here are a few examples of that kind of repair done on the GCC repository:

# /branches/GC_5_0_ALPHA_1
<27855>|<27860> reparent --use-order
# /branches/apple-200511-release-branch
<105446>|<105574> reparent --use-order
# /branches/apple-gcc_os_35-branch
<90334>|<90607> reparent --use-order
# /branches/apple-tiger-release-branch
<96593>|<96595> reparent --use-order

The GCC conversion was pretty hairy – 343 lines of DSL scripting – but there are whole new levels of complexity when, as still sometimes happens, you need to recover history from pre-version-controlled sources to stitch the repository together.

In one extreme case, I ended up stitching together material from 18 different release tarballs, 11 unreleased snapshot tarballs, one release tarball I could reconstruct, one release tarball mined out of an obsolete Red Hat source RPM, two shar archives, a pax archive, five published patches, two zip files, a darcs archive, and a partial RCS history.

But reposurgeon can handle this, because it makes conversion experiments easy. The workflow it’s designed for is carefully building a script that assembles your source repository and other data into a simulacrum of what a Git repository tracking your project from the beginning of time would have looked like.

Almost never will you get this right the first time. It takes testing, polishing, tripping over assumptions you didn’t know you and your tools were making, and correcting for those assumptions. In the GCC case it took many hours of work to locate and develop fixes for the misplaced branch joins.

A subtle but important point is that I didn’t do that work myself. That kind of thing is not a job for reposurgeon’s maintainer, it’s a job for a “Mr. Inside” who knows the project’s history intimately – in this case it was actually the GCC project lead, Joseph Myers. One of reposurgeon’s requirements is that it has to be a tool that a “Mr. Inside” can learn to use with minimum friction.

And generally it is, if you’re being driven to it by the kind of problem it was designed to solve – it’s like gdb that way. I’ve been taken to task about the tool having no intro documentation; this is not because I’m lazy, it’s because there’s no plausible way to write any, any more than there is for gdb. You’re ready to learn reposurgeon, as Joseph Myers did, when you’re stuck into a conversion or editing problem so deep that the very complete reposurgeon command reference starts to make sense to you.

You can find more about conversions with reposurgeon here.

Discussions

Support

  • Broken branch after git commit

    Torsten Krah wrote that after using git restore --staged $my-files to remove some files from the index, but not the working tree, and then git commit to commit other changes, he found that the files were marked as deleted in the index even though they were still in the working tree.

    He didn’t understand what happened and didn’t know how to fix the situation. He later added that the git commit he had done had actually committed the files even though git status told him that the files wouldn’t go into the commit. And he also sent instructions to help reproduce the issue.

    Jeff King, alias Peff, replied to Torsten saying that he couldn’t reproduce the issue and asking for more details and examples.

    Torsten then sent an example with a lot more details which enabled Peff to reproduce the issue. Peff confirmed that there was a bug in git restore and also found that the index-reading code could segfault when it processes bogus cache-trees.

    Peff also mentioned that Emily Shaffer had recently found a similar segfault and had provided a fix for it. He then came up with a patch to fix the bug in git restore --staged.

    Junio Hamano, the Git maintainer, reviewed the patch and praised Peff, while Dennis Kaarsemaker reported having just been bitten by what seemed to be the same bug.

    In the meantime Torsten also thanked Peff for investigating the issue and asked for a way to fix his current index. Peff suggested using git reset --hard <original-commit> after finding the original commit in the reflog. Torsten replied that he actually had to remove the index using rm .git/index first to get back to a working state.

Developer Spotlight: Chris Webster

  • Who are you and what do you do?

    I am a retired software developer. After 25+ years in commercial software development I wanted to contribute to open source. Having used GNU Make and Perl extensively in a build system I wanted to give back. GitGitGadget caught my attention when I was contributing a fix to diff-highlight.pm in Git. Dscho guided me through my many errors to get the change approved. I decided to help with GitGitGadget to say thanks. Fun/scary facts: Primary language – IBM mainframe assembler, Secondary language: C++.

  • What would you name your most important contribution to Git?

    My contributions are mainly to GitGitGadget and I think adding the /preview command was really helpful. It only gets used once by most new contributors to Git but it gives them a chance to see what their change requests would look like. I hope seeing the email gives these contributors the confidence to submit their changes to the general Git community. It wasn’t a big change but it appears to get used by a lot of first time contributors.

  • What are you doing on the Git ecosystem these days, and why?

    I am still very much in learning mode. Working on GitGitGadget provides an opportunity to work with modern TypeScript and JavaScript features. There are some issues I would still like to help with. Contributing more to Git is a next step. Helping to make tools easier and more reliable is my motivation.

  • Can you briefly explain your user experience with Git so far?

    So far I have been mostly using Git from the command line and only recently started to look at Git-GUI and gitk. This is a stark contrast to my previous use of CM/Synergy and PVCS where the GUI was used almost exclusively. Working on GitGitGadget requires getting to know some of the plumbing of Git and GitHub, which is interesting. For personal projects, most use is pretty basic.

  • Have you ever felt that something that you’ve used in other version control software is missing from Git?

    VS Code shows a side-by-side diff for working objects. CM/Synergy allowed a side-by-side diff for any versions of the objects. I have not seen a way to do that. Sometimes it is easier to debug when you have the complete source to see changes.

    Note: Chris was told that git difftool might help him with getting side-by-side diff in Git. He acknowledged the same stating “Difftool looks like what I am looking for”.

  • What is your favorite Git-related tool/library, outside of Git itself?

    My answer should be GitGitGadget but I’m not really a user these days – need to be more of a contributor to Git to appreciate GitGitGadget. VS Code is providing more Git specific tooling (i.e. Timeline view) that looks really interesting along with the GitLens extension.

Releases

Other News

Various

Light reading

Git tools and sites

  • git-repo is a reimplementation in Golang by Alibaba Cloud of the Android repo tool.
  • OneDev is an all-in-one DevOps platform, with issue tracking, Git management, pull requests, and build farm; written in Java.
  • Introducing Scalar: Git at scale for everyone.
  • chronologer: visualize changes in program timing (performance) over git commit history; uses hyperfine for the benchmarking.
  • skeema is an open source pure-SQL database schema management system, with which you can diff, push, and pull schemas.
    There are various other tools to help version-control database schemas, including:
    • Sqitch is database-native change management for framework-free development and dependable deployment.
    • Alembic is a lightweight database migration tool for usage with the SQLAlchemy Database Toolkit for Python.
    • Flyway by Redgate, with Apache v2 licensed Community Edition version.
    • Liquibase by Datical, with free Liquibase Community edition.
  • get-lore-mbox by Konstantin Ryabitsev is a Python script that, given the message ID for any email in a thread of interest, downloads the entire thread from the lore.kernel.org archive into a local mbox file, so that it can easily be applied to a Git repository.
  • GitHub and Government is a site to gather, curate, and feature stories of public servants and civic hackers using GitHub as part of their open government innovations.
  • git-evtag: Extended verification for git tags, that can be used as a replacement for git-tag -s.
  • shit (shit == Shell Git) is an implementation of Git [plumbing] using (almost) entirely POSIX shell; more proof-of-concept than a practical tool.
  • git-tfs is a two-way bridge between TFS and git, similar to git-svn; it is no longer actively maintained because the authors are no longer users of TFS.

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Kaartic Sivaraam <kaartic.sivaraam@gmail.com> with help from Eric S. Raymond, Josh Steadmon and Johannes Schindelin.