Git Rev News: Edition 7 (September 9th, 2015)

Welcome to the 7th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of August 2015.

Discussions

Google Summer of Code Wrap Up (written by Matthieu Moy)

Both of the Git project's Google Summer of Code students, Paul Tan and Karthik Nayak, have passed the final evaluation.

Projects in 2015

Unification of git for-each-ref, git branch and git tag

Student: Karthik Nayak, mentored by Christian Couder and Matthieu Moy.

With Git, branches and tags are references to a state in history. Internally, a branch is almost the same as a tag, and both are particular cases of the concept of references, or “ref” for short. Still, because the use cases are different, we had two commands with different user interfaces and different implementations, plus the command git for-each-ref, which is meant to be more flexible and easier to use in scripts.

A lot of code was moved from the user-interface part of the code to the core library of Git, and duplicated ref-listing and formatting code was refactored.

A nice side effect for the user is that features that were available on one command but not another are now available everywhere. For example, we had git for-each-ref --format but no git branch --format, and had git branch --contains but not git for-each-ref --contains.
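As a rough illustration of the two long-standing options named above, the sketch below builds a throwaway repository and runs git for-each-ref --format and git branch --contains against it. The branch names "base" and "topic" and the identity passed with -c are made up for the example; it deliberately sticks to the two forms the text describes as already existing.

```shell
#!/bin/sh
# Throwaway repository; "base" and "topic" are hypothetical branch names.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
git symbolic-ref HEAD refs/heads/base

# Two empty commits on "base"; "topic" is left pointing at the first one.
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "first"
git branch topic
git -c user.name=Example -c user.email=example@example.com \
    commit -q --allow-empty -m "second"

# Script-friendly listing with a custom output format.
refs=$(git for-each-ref --format='%(refname:short)' refs/heads/)
echo "$refs"

# Branches whose history includes the commit HEAD points to.
containing=$(git branch --contains HEAD)
echo "$containing"

# Clean up the temporary repository.
cd - >/dev/null && rm -rf "$repo"
```

Here the first listing names both branches, while the second names only "base", because "topic" was created before the second commit.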

The biggest contribution is of course that we reduced the code duplication and made the code internally better.

Port git am and git pull from shell-script to C

Student: Paul Tan, mentored by Stefan Beller and Johannes Schindelin.

Git was initially implemented as a set of C commands for the core, and a set of shell scripts to provide a user-friendly interface. While implementing some commands in shell allowed quick prototyping, the presence of shell scripts raises a lot of issues. One is performance: shell scripts tend to create a lot of processes, and often manipulate a lot of temporary files. Another one is portability, especially on Windows.

Many Git commands written as shell scripts were later rewritten as builtin C commands. You can still see some of the initial scripts in the contrib/examples directory of Git’s source tree.

The second project was to port git am and git pull to C. One difficulty is that Git has strong backward-compatibility requirements.

Both commands have been completely ported, and the result merged to the master branch.

Retrospective

Both projects were really successful (and obviously the students “passed” in the GSoC jargon). One thing we’re happy about is that for one project, the code is already completely reviewed and merged, and for the other the reviewing process is almost over (and the student remains active after the pens down date). This means that the GSoC did not just bring us initial implementations, but solid, well-reviewed patches. From the maintainability point of view, the Git codebase is better after the GSoC than before.

Interestingly, both projects were essentially internal refactoring ones (the one I mentored last year was, too). Nothing really impressive for the end-user, but in both cases a substantial contribution to Git’s maintainability.

I’m positively surprised that students chose these topics. They are not the best subjects to show off to your friends (“see this new command you love so much, I implemented it!”), but are necessary work to make the codebase healthier.

Not all our past experiences were successes. We had failed GSoC projects with very promising students, and projects that we considered a “pass” from the GSoC point of view but that produced code that was never merged, or was merged much later after substantial work from the mentor. We actually took a break from GSoC and did not participate in 2013 because we thought we were not good enough at mentoring and integrating students at that time.

This year, we had long private discussions during the selection period of the GSoC: refactoring projects like these carry a relatively high risk. The “Wow” effect when it works is not so high, but the “Uh oh” effect if it breaks is.

Reading the students’ proposals and résumés is rarely sufficient to really evaluate their skills, which makes the risk particularly hard to evaluate. Since 2014, we’ve been using microprojects to select students. This turned out to be very effective, both at identifying good (or bad) applicants and at giving students a glimpse of what “contributing to Git” means. Students get an idea of how the community is organized before starting the project.

We’re now happy we took the risk. We’ve gained a bunch of useful changes to Git. We’ve had lively discussions with students and had fun doing it. And hopefully, these students now feel part of the community and the fun will continue long after the GSoC :-).

Reviews

Last June, David Turner, who works for Twitter, sent a message to the list titled “RFC/Pull Request: Refs db backend” that started with:

I’ve revived and modified Ronnie Sahlberg’s work on the refs db backend.

Last year Ronnie Sahlberg, working for Google, sent some emails about an experimental TDB backend and pluggable backends he had developed to store Git refs. Unfortunately his work was too experimental and was not merged.

This is interesting work because Git gets slower when the number of refs in a repository becomes very large, as its packed-refs backend is file-based and was not designed to handle a huge number of refs. That’s why a lot of prominent Git developers, like Jeff King, Shawn Pearce, Michael Haggerty, Stefan Beller, Duy Nguyen and Junio Hamano, were interested in David’s announcement last June and in some of the details it contained, like:

The db backend runs git for-each-ref about 30% faster than the files backend with fully-packed refs on a repo with ~120k refs. It’s also about 4x faster than using fully-unpacked refs. In addition, and perhaps more importantly, it avoids case-conflict issues on OS X.

One difference between Ronnie’s and David’s work is that David chose LMDB instead of TDB for the new database backend. David explained that with:

The advantage of tdb is that it’s smaller (~125k). The disadvantages are that tdb is hard to build on OS X. It’s also not in homebrew. So lmdb seemed simpler.

Since June, David, helped by many reviewers like Eric Sunshine, Johan Herland, Michael Haggerty, Jacob Keller, Duy Nguyen, Stefan Beller, Johannes Sixt, Philip Oakley and Junio, has worked on many related improvements that are very helpful to advance this topic. That’s why the latest version of his patch series contains the following:

This series depends on at least the following topics in pu: dt/refs-bisection dt/refs-pseudo dt/reflog-tests kn/for-each-tag (…)

It also contains the following interesting bits:

Also, now per-worktree refs live in the filesystem.

(…)

As Michael Haggerty suggested, I’m now using struct ref_transaction as a base struct for the ref transaction structs.

This shows that David’s series is using very recent improvements by other developers in the Git codebase.

Hopefully, we can expect that, with those new ref backends, users will be able to benefit soon from a huge amount of ground work that has been done during the last few years.

Johannes Schauer reported that git-archive does not use the zip64 extension and therefore is unable to properly create zip archives with more than 16k entries.

René Scharfe agreed that there is a problem. It comes from the fact that without the zip64 extension, the zip field for the number of entries has only two bytes. The limit is then 65535 entries.
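The 65535 figure follows directly from the field width: a two-byte unsigned field has 16 bits, so its largest value is 2^16 - 1. A minimal arithmetic sketch (the variable names are illustrative only):

```shell
#!/bin/sh
# Width of the entry-count field without zip64, as described above.
field_bytes=2

# Largest unsigned value that fits in that many bytes: 2^(8*bytes) - 1.
max_entries=$(( (1 << (8 * field_bytes)) - 1 ))
echo "$max_entries"   # prints 65535
```

An archive with one more entry than this cannot record its entry count in that field, which is where the zip64 extension comes in.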

René then sent a patch series to add tests for this problem and then fix it. The first patch contains the following code, which tests that a suitable zipinfo command is available on the current machine, and sets the ZIPINFO prerequisite if this is the case:

+ZIPINFO=zipinfo
+
+test_lazy_prereq ZIPINFO '
+	n=$("$ZIPINFO" "$TEST_DIRECTORY"/t5004/empty.zip | sed -n "2s/.* //p")
+	test "x$n" = "x0"
+'

Eric Sunshine replied that unfortunately the above would work neither on Mac OS X, where the zipinfo output is different, nor on FreeBSD, where zipinfo has been removed in favor of unzip -Z.
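To see why the output format matters, it helps to look at what the sed expression in the prerequisite does: it keeps only the last space-separated field of the second line. The sketch below applies the same expression to a made-up two-line listing; the sample text is a placeholder and is not real zipinfo output.

```shell
#!/bin/sh
# Hypothetical two-line listing; NOT real zipinfo output.
sample='Archive: example.zip
size: 22 bytes, entries: 0'

# -n suppresses default printing; "2s/.* //p" acts only on line 2,
# deletes everything up to the last space, and prints what remains.
n=$(printf '%s\n' "$sample" | sed -n "2s/.* //p")
echo "$n"   # prints 0

# Same comparison as in the prerequisite: true only when the field is "0".
test "x$n" = "x0" && echo "prerequisite would be satisfied"
```

If a tool puts the entry count on a different line or in a different position, the extracted field is no longer "0" and the check fails, even though the tool itself works.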

Eric then discussed in detail the possibility of using unzip -Z instead of zipinfo to have a portable test, but it appears that this doesn’t work well on files using the zip64 extension on Mac OS X and FreeBSD.

After further discussion, Eric, René and Junio agreed that it was good enough for the patch and the above zipinfo check to work on Linux, as we don’t need to test the archive generated by git archive on every platform.

Lars Schneider, who is in the process of migrating huge Perforce repositories to Git, posted an RFC (Request For Comments) patch to add an option to git-p4 to store some big files in Git LFS.

Luke Diamand, who has been working a lot on git-p4 during the past years, reviewed his patch, saying first that it would be better to have both a generic mechanism to handle big files and a separate Git LFS extension rather than a specific mechanism for Git LFS. Luke also noticed that the changes seem to require Python 3.

The discussion then focused on the merits of migrating git-p4 from Python 2 to Python 3, until John Keeping chimed in, writing:

Documentation/CodingGuidelines currently says:

  • As a minimum, we aim to be compatible with Python 2.6 and 2.7.

  • Where required libraries do not restrict us to Python 2, we try to also be compatible with Python 3.1 and later.

and then explaining the reason why it’s difficult to migrate.

Fortunately it seems that this has not discouraged Lars from sending a second version of his patch, which starts to address Luke’s concerns and which Luke has already reviewed.

Hopefully, thanks to this ongoing work, we will soon have an easy way to migrate Perforce repos that contain big files.

Developer Spotlight: Eric Sunshine

Q. Who are you and how did you get involved in open-source software?

A. I’m an old-time developer who has been giving away software and source code I’ve written ever since I learned to program – before readily-available Internet access, and long before the World Wide Web existed. Many of my projects – at least the ones which weren’t worth porting forward – are as long-dead as the platforms and operating systems for which they were written.

Q. How did your introduction to Git come about?

A. I may have seen a periodic reference to Git here and there, but what really brought it to my attention was a 2007 post by Ted Kulp which talked about Git’s distributed nature, painless branching, offline capability, and full local history. As a long-time RCS, CVS, and Subversion user, I was intrigued, and decided to learn about Git, but failed utterly. Partly this was due to Git tutorials I read being aimed at people already at least somewhat familiar with the concept of distributed version control, thus not really explaining it, and, as a long-time centralized version control user, I had difficulty grasping it. The tutorials also glossed over the Git “index”, thus making it difficult to understand its purpose or significance. Finally, Git terminology, such as “plumbing”, “porcelain”, “rebasing”, and “cherry picking”, which had no analog in other version control systems, or in my general experience, seemed genuinely opaque. This situation remained unchanged until late 2008 when I decided to try again. Watching a video of a Git talk at Google by Linus Torvalds, followed by one by Randal Schwartz, and reading improved tutorials helped me finally get a grip on distributed version control and Git itself.

Q. How did you get involved in the Git project itself?

A. Once I learned and began using Git, I quickly came to appreciate its underlying architecture, the thoughtfulness which went into its design, and how it enabled me to create sharp, well focused commits in a way that other VCS’s had not (or actively discouraged), as well as the ability to organize and polish history locally. Consequently, I joined the project, not because I wanted a shiny new feature or a bug fixed, but rather because I wanted to do my part to contribute back to a project from which I was benefitting. My participation takes the form of doing code reviews of submitted patches, diagnosing and fixing bugs, and adding or enhancing a feature here and there. My hope, particularly with the code reviews, has been to take some of the load off the shoulders of other regular project members.

Q. What are your favorite Git features?

A. My two favorite features are (1) the ability to stage changes in the index with precision to create sharp, well focused commits, and (2) interactive rebase for (repeatedly) refining a series of commits until I’m happy with the history they represent. More generally, given Git’s easy branching and built-in safety mechanisms, I appreciate the freedom it gives me to work on many changes at once (when necessary) without worrying about those changes clobbering one another.

Q. What is your preferred Git development model?

A. I’m partial to the mailing list approach used by the Git project itself, since all important information – discussions, patches, bug reports – is available in a centralized location (my mailbox) with no effort on my part. This allows for a more streamlined experience than if I have to actively seek out the information by consulting forums, bug trackers, and patch review websites. Moreover, the mailing list development model allows use of email-related tools best suited for the person and task, whereas web-based tools are often difficult to use, feature-poor, and sometimes outright crippled. I also appreciate the Git project’s heavy emphasis on well-engineered solutions, quality commit messages, and well-organized, highly focused patches.

Releases

Other News

Various

Light reading

Git tools and sites

Credits

This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Thomas Ferris Nicolaisen <tfnico@gmail.com> and Nicola Paolucci <npaolucci@atlassian.com>, with help from Matthieu Moy, Eric Sunshine and Johannes Schindelin.