Welcome to the 105th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the months of October 2023 and November 2023.
Git participates in Outreachy’s December 2023 to March 2024 round
Achu Luma will work on the “Move existing tests to a unit testing framework” project. They will be mentored by Christian Couder.
Congratulations to Luma for being selected!
Thanks to GitLab for sponsoring this Outreachy internship! Thanks also to the other contributors who applied and worked on micro-projects, but couldn’t be selected! We hope to continue to see you in the community!
[PATCH v2 0/2] Prevent re-reading 4 GiB files on every status
In May 2022 Jason Hatton sent an email to the mailing list about the fact that any file of a size that is an exact multiple of 8GiB makes Git extremely slow on the repository.
He said that he had already opened an issue about this on the Git for Windows issue tracker, where he, Philip Oakley, brian m. carlson and Johannes Schindelin, alias Dscho, had discussed it.
Git uses a uint32_t type, a 32-bit unsigned integer, for storing the file size in the index. This rolls over once the value reaches 2 to the power 32, so with file sizes of 4GiB or more. When the size is exactly 4GiB or a multiple of it, like 8GiB, the rollover makes it zero.
A zero file size in the index has a special meaning for Git, though. It tells Git that the file needs to be hashed again. Hashing a file is supposed to reset its file size in the index to a non-zero value, but with a 4GiB file size the rollover happens and the file size is still zero. So the hashing will be performed again and again by many different Git commands, making Git very slow.
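To make the arithmetic concrete, here is a minimal Python sketch of the truncation. It only mimics a 32-bit unsigned field by masking; the names `truncated_size`, `GIB` and `UINT32_MASK` are made up for this illustration and are not taken from Git's source.

```python
GIB = 1024 ** 3           # bytes in one GiB
UINT32_MASK = 0xFFFFFFFF  # largest value a 32-bit unsigned field can hold


def truncated_size(size):
    """Keep only the low 32 bits of a size, as a uint32_t field would."""
    return size & UINT32_MASK


# Sizes just below, at, and just above 4GiB, plus 8GiB.
for size in (4 * GIB - 1, 4 * GIB, 4 * GIB + 1, 8 * GIB):
    print(f"{size} bytes is stored as {truncated_size(size)}")
```

Only the sizes that are exact multiples of 4GiB come out as zero, which is the value that tells Git to hash the file again.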
Jason proposed, as a solution to this problem, to detect when the rollover would happen, and in that case set the size to 1 instead of zero.
Junio C Hamano, the Git maintainer, replied to Jason confirming the issue and explaining it in a bit more detail. Jason and Junio then discussed the issue a bit more, while Jason tested his suggested fix locally and proposed to send a real patch to fix the issue.
René Scharfe then chimed in, asking if a value other than one would be better and would avoid other possible issues. Philip Oakley replied to René suggesting using 0x80000000 instead of 1 when the rollover is detected. This would make it easier to detect “almost all incremental and decremental changes in file sizes”, as the file size in the index helps detect file changes.
Jason and Philip discussed the issue a bit more and agreed that using 0x80000000 only for exact multiples of 4GiB would likely be the best solution.
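The approach they agreed on can be sketched in a few lines of Python. This is an illustration of the idea as described in the discussion, not the code from the patch; the names `index_size` and `ROLLOVER_MARKER` are hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF
ROLLOVER_MARKER = 0x80000000  # value suggested on the list for exact multiples of 4GiB


def index_size(real_size):
    """Return the 32-bit size to record for a file of real_size bytes."""
    low_bits = real_size & UINT32_MASK
    if low_bits == 0 and real_size != 0:
        # A non-empty file whose size wrapped around to zero.
        return ROLLOVER_MARKER
    return low_bits
```

In this sketch an empty file is still recorded as zero, while a 4GiB or 8GiB file gets a non-zero size, so it would no longer look like a file that needs to be hashed again.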
Philip and Carlo Marcelo Arenas Belón also tried to help Jason properly submit a patch to the mailing list.
Jason then sent a patch to the mailing list with the changes and explanation that had been discussed. Torsten Bögershausen, Philip and Junio reviewed it, and suggested some improvements. Junio especially requested some tests to be added.
After some discussions with Jason to clarify what should be improved, Jason sent another version of his patch.
It looked like Jason had found an issue with the patch due to using 0x80000000 instead of 1. René and Philip discussed it with Jason, but there was no clear conclusion. It wasn’t even clear if there was an issue at all. In any case, work on this stopped for more than a year.
Fortunately a few weeks ago, brian m. carlson sent a new version of Jason’s patch along with another patch adding tests.
These patches were reviewed by Eric Sunshine, Jeff King, alias Peff, Junio and Jason. After some discussions it appeared that the patches were good enough for Junio, so he decided to apply a small change and then merge them. This issue is therefore fixed in Git 2.43.0 released on November 20th.
Who are you and what do you do?
I am Alexander Shopov - a backend engineer in the Amsterdam office of Uber working on money related systems. I am a long time translator of FOSS software to Bulgarian - I am coordinating translations of GNOME, Translation Project and many GNU modules. Bulgarian is an Eastern South Slavic language written in the Cyrillic alphabet.
What would you name your most important contribution to Git?
I made and now maintain the Bulgarian translation of the text interface of Git, Gitk, and Git Gui.
What is the typical workflow of a contributor engaged in Git translation?
There are 19 translations of the text interface of Git, and only 13 of them are above 80%, so I am not sure about “typical”. It is a fairly standard workflow for a FOSS project.
Generally one needs to do the following:
Currently the translation is a bit above 5500 messages, which is about 40k words, 250k characters, or about 150 pages of text. It can be intimidating for a new translator. But you can definitely make it: be patient and translate some messages every release, merge, publish and repeat. Even better, though harder, is getting more than one person translating.
Do you contribute to Git in ways other than providing translation? If so, could you elaborate about them?
Sadly not that much. On rare occasions I improve messages and mark strings for translations. Perhaps that will be the way I contribute unless I find a mentor and something that I find particularly interesting and important for me. So if anyone is willing to mentor me, especially in making large repos faster - ping me. I can be a competent tester at least.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?
Due to its enormous success, Git is being used on humongous code bases with a crazy number of files, directories, commits and branches. Working with repos larger than 10GB can be a bit slow. Improving the experience would be a great thing.
If you could remove something from Git without worrying about backwards compatibility, what would it be?
Backwards compatibility is massively important and I am thankful developers and users are all invested in this.
If we treat this as a hypothetical question, there are 3 things to Git:
The command-line interface is gradually being improved. The wire protocol is also a place where there are workarounds for versioning. The storage format however is another (quite conservative and public) API. I would remove the old versions and try to design it targeting projects that are 10-100 times larger than the Linux kernel first. In for a penny, in for a pound. If we break things, let us break them so hard that bards will sing songs about us!
What is your favorite Git-related tool/library, outside of Git itself?
I mainly use command line git plus gitk and git-gui. I do like using the meld diff tool when I work on translations.
Do you happen to have any memorable experience w.r.t. contributing to the Git project? If yes, could you share it with us?
The initial getting to 100% translated messages was a challenge. I decided that I should translate Git around December 2013. That was around 2200 messages at that time and it took me about 3 releases of Git to reach 100%. Getting to 100% was immensely hard, rewarding and memorable. Afterwards keeping the translation at 100% was much easier.
Is there something you feel could be done to ease the life of translators?
The terminology glossary of Git is much larger than 7 years ago, and we (the translators) should actually update git://repo.or.cz/git-gui.git::po/glossary and merge it in Git.
What is your advice for people who want to start Git development? Where and how should they start?
I don’t know, to be honest. If I knew, I might have started already.
If there’s one tip you would like to share with other Git developers, what would it be?
That would be the tip of master two years in the future. On a more serious note - perhaps more tools for migration out of the still existing proprietary version control systems would be helpful.
Various git repack tricks (including adjusting sparse clone filters), nicer looking reverts of reverts with git revert, fixed interaction between --subject-prefix and --rfc in git format-patch, custom log format options that simulate the decorations, etc.
Light reading
git log and git diff, as an addition to Julia Evans’ article about confusing Git terminology, the .. and … section.
git log -L by Caleb Hearth on his blog; the post also lists a few of his other articles about Git: git log -S and git log -G, or searching commit messages with git log --grep.
textconv gitattribute, by Garrit Franke on Garrit’s Notes.
.gitattributes file can be used to improve language detection on GitHub, which is using the Linguist library.
Easy watching and listening
Git tools and sites
.gitattributes files, similar to gitignore.io.
git-file-history command line tool (in Node.js). Mentioned in passing in Git Rev News Edition #48.
lei is a command-line tool for importing and searching email, regardless of whether it is from a personal mailbox or a public-inbox instance, like public-inbox.org or lore.kernel.org.
lei is still in its early stages and may destroy mail.
git blame (counting surviving lines). Written in Ruby.
.git files, with support for many SQL features such as grouping, ordering and aggregation functions.
mergestat-lite command line tool, which runs SQL queries against local Git repositories. First mentioned in Git Rev News Edition #82. Actively developed, mergestat-lite is written in Go.
Git/fs binary in Git9 (Git client for Plan 9 non-POSIX filesystem) serves repository history as a file system.
This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Kaartic Sivaraam <kaartic.sivaraam@gmail.com> with help from Alexander Shopov, Luca Milanesio, Bruno Brito, and Štěpán Němec.