Welcome to the 95th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the months of December 2022 and January 2023.
Question: How to execute git-gc correctly on the Git server?
ZheNing Hu asked how he could run git gc correctly on his own Git server. He seemed to be worried by the git gc documentation saying that there is a risk of failures and repository corruption when the command is run concurrently with other Git processes.
He said that he had read about git gc --cruft, which could overcome these issues, but that he was still using Git v2.35 on his server while --cruft was introduced in v2.38.
He also wondered if there was a need for git gc to set a repository-level lock blocking most or all other Git operations, and what these operations (especially git clone and git push) should do or report when hitting this lock.
Ævar Arnfjörð Bjarmason replied that running git gc on a “live” repo was always racy, but that the odds of corrupting the repo became very small as the value of the gc.pruneExpire config option was increased. He said that the default setting for this option, 2 weeks, was “more than enough for even the most paranoid user”.
About --cruft, Ævar thought that its purpose was not only to avoid possible repo corruption, but also to allow more aggressive gc (garbage collection).
He also wondered if this question was about large hosting sites like GitHub and GitLab, where git gc is run on live repos, and suggested not to worry in this case, but to take backups.
Jeff King, alias Peff, replied to Ævar saying he was “a bit less optimistic” about the corruption risk decreasing when gc.pruneExpire was increased, because there was no atomic view of the ref namespace. Renaming a branch, for example, was risky because any concurrent process could see it as removing one branch and adding a different one. Such a process could be another push, not just a gc.
Peff also said that using --cruft was not so much about avoiding corruption, but about keeping cruft objects out of the main pack to reduce the cost of lookups and bitmaps, and about avoiding exploding a lot of old objects into loose objects, which could be very bad for performance.
Ævar replied to Peff, discussing further when corruption was likely or unlikely to happen, which issues --cruft could help with, and a patch he had sent in the past to reduce possible corruption. He also suggested running git gc during the least busy hours of the day.
Later Taylor Blau replied to Ævar and Peff, discussing --cruft in the context of single-pack bitmaps or multi-pack (MIDX) bitmaps, and also in the context of GitHub.
In the meantime, Michal Suchánek replied to Ævar’s first email asking what the 2 week default expiration time applied to. He also said that he got corrupted repos with fewer than 100 users “and some scripting”, a problem that went away when gc was disabled.
Peff replied to Michal, saying that the expiration time applied to the mtime on the object file (or the pack containing it), and confirmed that it was “far from a complete race-free solution”.
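As a rough illustration of the mtime-based grace period Peff describes, the sketch below checks whether a file's modification time is older than a cutoff. It is not how Git implements pruning: the helper name is_past_grace_period and the 14-day constant are assumptions made for this example, the latter mirroring the two-week default mentioned earlier in the thread.

```python
import os
import time

# Assumption: two weeks, mirroring the default grace period from the thread.
GRACE_PERIOD_SECONDS = 14 * 24 * 60 * 60


def is_past_grace_period(path, now=None, grace=GRACE_PERIOD_SECONDS):
    """Return True if the file's mtime is older than the grace period.

    Hypothetical helper for illustration only. Real pruning also depends
    on whether the object is reachable, which this sketch ignores.
    """
    if now is None:
        now = time.time()
    return os.stat(path).st_mtime < now - grace
```

A check like this only looks at one file at one instant, which is part of why a time-based grace period alone cannot rule out races with concurrent processes.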
ZheNing also replied to Michal saying that he preferred “no error at all” to a “small probability of error”.
Michal replied to Peff listing some workflows that are more likely to lead to a corrupt repo, like deleting branches but pushing other branches that are variants of these branches, and different people pushing files from the same external source.
Peff confirmed that these workflows were indeed risky, and detailed a bit further how the race conditions can happen.
ZheNing then replied to Peff asking if there was “an easy and poor performance” way, like a lock on a repository, to avoid for example concurrent push and gc processes.
Ævar replied that there was no such way but that we should have one. He explained that it could perhaps be done using hooks, like ‘pre-receive’ and ‘post-receive’, when we were sure that all relevant operations were going through these hooks. (For example no local branch deletion should be possible.)
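To picture the kind of coarse repository-level lock being discussed, here is a minimal sketch of an advisory lock built on flock (Unix only). Git provides no such lock itself; the helper name repo_lock and the lock file path are hypothetical. A maintenance wrapper would take the lock exclusively, while anything handling pushes would take it in shared mode.

```python
import contextlib
import fcntl
import os


@contextlib.contextmanager
def repo_lock(lock_path, exclusive, blocking=True):
    """Hold an advisory lock on lock_path for the duration of the block.

    Hypothetical helper: shared mode lets several holders coexist,
    exclusive mode admits a single holder. With blocking=False an
    OSError is raised if the lock cannot be taken immediately.
    """
    flags = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not blocking:
        flags |= fcntl.LOCK_NB
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

Note that a flock lock is released when the holding process exits, so it cannot simply be taken in a ‘pre-receive’ hook and released in a separate ‘post-receive’ process. Wiring such a lock into the push path is exactly the open design question from the thread, and it only helps if every relevant operation goes through the locked path.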
ZheNing and Michal discussed a bit further the details of how repo corruption can happen with concurrent push and gc processes, and how that could possibly be avoided.
Who are you and what do you do?
My work is related to R&D efficiency tools development at Alibaba Cloud. Our team has built a code hosting service, codeup.aliyun.com, which provides free and high-quality code services for Chinese developers on the public cloud. In addition, I used to be a Gerrit contributor, because I wrote Java for nearly 10 years, and this process made me almost forget the C language, LOL.
As for contributions to the Git community, apart from me, Jiang Xin (the Git localization coordinator), ZheNing Hu, and Chen BoJun are also on the team.
What would you name your most important contribution to Git?
First of all, I have known Git for some years, but I’m new in the community. Git’s technical depth is obvious: it involves algorithms, operating systems, testing techniques, etc. Also, Git has many subcommands, which makes the implementation of Git itself involve many aspects, and I think it is difficult for a new contributor to understand everything, but long-term participation may make you an expert in one aspect of Git. Sadly, my time devoted to the Git community is actually limited.
I contributed a feature last year to allow the git ls-tree subcommand to support the --format option, which lets you print out the result as you want. This is helpful for some automated tools or scripted work, I think. If you want to know more about it, a better way is to read the blog by Taylor Blau.
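For scripted work of the kind mentioned above, a minimal sketch could look as follows. The helper names ls_tree_command, parse_entry and list_tree are hypothetical, and the chosen format string is just one possibility. The %(objecttype), %(objectname) and %(path) placeholders are among those documented for git ls-tree --format, and running list_tree assumes an installed Git that includes this option.

```python
import subprocess

# One possible format: type, object name, then path, separated by spaces.
DEFAULT_FORMAT = "%(objecttype) %(objectname) %(path)"


def ls_tree_command(treeish="HEAD", fmt=DEFAULT_FORMAT):
    """Build the argv for `git ls-tree --format=<fmt> <treeish>`."""
    return ["git", "ls-tree", f"--format={fmt}", treeish]


def parse_entry(line):
    """Split one output line produced with DEFAULT_FORMAT.

    The path comes last, so spaces inside it are preserved.
    """
    objecttype, objectname, path = line.split(" ", 2)
    return {"type": objecttype, "name": objectname, "path": path}


def list_tree(repo_dir, treeish="HEAD"):
    """Run the command inside repo_dir and parse each output line."""
    result = subprocess.run(
        ls_tree_command(treeish),
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return [parse_entry(line) for line in result.stdout.splitlines()]
```

Putting the path last in the format keeps parsing simple, although paths containing unusual characters may be quoted by Git and would need extra handling.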
What are you doing on the Git project these days, and why?
I’ve been following the evolution of the bundle-uri feature recently. I think the idea of this feature is great and attractive. If used properly, it can not only improve the speed of code download in some scenarios, but also reduce the load on the server.
I’m also reading about algorithms related code (like bitmap, multi-pack bitmap, bloom-filter), as I want to know some details about the combination of Git and algorithms. I think it’s interesting.
If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?
We all know that it can be a pain in terms of resource load and cost to provide large-scale Git services. I hope to be able to solve the problem of Git’s coupling of storage and computing, to let Git integrate better with cloud-native architecture. For example, should it be possible to store the refs, loose objects and packs in a distributed database?
I think this is one of the future development directions of the Git architecture, starting from lower cost and cloud friendliness. If you want to do these tasks based on Git, you may need to make the related internal implementations more adaptable, which requires a lot of professional work, I think.
If you could remove something from Git without worrying about backwards compatibility, what would it be?
Maybe introduce a new option --branches in git push to replace --all.
Option --all means to push all branches, --tags means to push all tags, but many people misunderstand it (at least those around me), because they think --all means to push all the branches and tags together. In fact, I made an RFC patch before, hoping to support the --branches parameter in the first step, and I’ll consider following up with this patch.
What is your favorite Git-related tool/library, outside of Git itself?
I prefer git-repo which supports doing code reviews or pull requests on the client, just like using a native Git subcommand.
Do you happen to have any memorable experience w.r.t. contributing to the Git project? If yes, could you share it with us?
It is still memorable when my first commit was merged in, even though it was a small fix. This process made me understand that contributing to Git is completely different from other workflows, and the process and results both feel good.
What is your toolbox for interacting with the mailing list and for development of Git?
First, I use https://public-inbox.org/git/?q=a%3Adyroneteng to check if there are any new mails related to me.
Then, I’ve been using git format-patch to create patchsets and git send-email to post them, and git am for local reviews. I don’t know if there’s a better way, but it seems to be enough for me.
What is your advice for people who want to start Git development? Where and how should they start?
Contributing to Git is not an easy task, after all, you are working with other excellent contributors in the community, but continuous understanding and participation may make you an expert in a certain direction.
If there’s one tip you would like to share with other Git developers, what would it be?
I think it would be “get used to the process of contribution slowly”.
The review process is sometimes frustrating, but most of the suggestions by reviewers are still valuable; you can learn a lot from the process, then you can better participate in the next contribution.
Various
Light reading
git checkout <remote-branch> trick, and does not mention the newer git switch <branch> command as alternative to git checkout <branch>.
git-lfs) is evil on shitty networks on the Somewhere Within Boredom blog (may be fixed by the time you are reading this).
Git tools and sites
git-sim command in the terminal, for example git-sim reset HEAD^ or git-sim merge dev, to generate a custom Git command visualization (.jpg, .mp4) from your repository. Written in Python, available as package on PyPI.
This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Kaartic Sivaraam <kaartic.sivaraam@gmail.com> with help from Teng Long.