Welcome to the 107th edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.
This edition covers what happened during the months of December 2023 and January 2024.
Jeremy Pridmore reported an issue to the Git mailing list. He used git bugreport, so his message looks like a filled-out form with questions and answers.
He was trying to cherry-pick changes from one repo (A) to another (B). Both A and B came from the same original TFS server, but with different sets of changes. He was disappointed, though, because some files that had been moved in repo A were matched up by the rename detection mechanism to files other than those he expected in repo B, and he wondered if the reason for this was the new ‘ort’ merge strategy described in a blog post by Elijah Newren.
While not obvious at first, Jeremy’s primary problem centered specifically on cases where there were multiple files with 100% identical content. For example, originally there could have been an orig/foo.txt file, while one of the descendant repos does not have that file anymore but instead has two files, dir2/foo.txt and dir3/foo.txt, both with contents identical to the original orig/foo.txt. So, Git has to figure out which one of dir2/foo.txt and dir3/foo.txt is the result of renaming orig/foo.txt.
Elijah replied to Jeremy, explaining extensively how rename detection works in Git. Elijah pointed out that Jeremy’s problem, as described, did not involve directory rename detection (despite looking somewhat like a directory rename detection problem). Also, since Jeremy pointed out that the “misdetected” renames had contents identical to what they were paired with, only exact renames were involved. Because of these two factors, Elijah said that the new ‘ort’ merge strategy, which he implemented, and which replaced the old ‘recursive’ strategy, should use the same rename detection rules as that old strategy for Jeremy’s problem. Elijah suggested adding the -s recursive option to the cherry-pick command to verify this and check whether it worked differently using the old ‘recursive’ strategy.
Elijah also pointed out that for exact renames in a setup like this, other than Git giving a preference to files with the same basename, if there are multiple choices with identical content then it will just pick one essentially at random.
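The behaviour Elijah describes for exact renames can be illustrated with a small sketch. This is not Git's implementation; the function name pair_exact_renames and the dictionary-based inputs are assumptions made for illustration only. It pairs each deleted path with an added path that has byte-identical content, prefers a candidate with the same basename, and otherwise takes whichever candidate happens to come first.

```python
import hashlib
import posixpath
from collections import defaultdict


def pair_exact_renames(deleted, added):
    """Pair deleted paths with added paths whose content is byte-identical.

    `deleted` and `added` map a path (str) to file content (bytes).
    Returns a dict mapping each matched deleted path to one added path.
    Hypothetical helper for illustration; not part of Git.
    """
    # Group the added paths by a hash of their content.
    by_hash = defaultdict(list)
    for path, content in added.items():
        by_hash[hashlib.sha1(content).hexdigest()].append(path)

    pairs = {}
    for src, content in deleted.items():
        candidates = by_hash.get(hashlib.sha1(content).hexdigest(), [])
        if not candidates:
            continue  # no added file has identical content
        # Prefer a candidate with the same basename as the source.
        same_name = [c for c in candidates
                     if posixpath.basename(c) == posixpath.basename(src)]
        # Otherwise fall back to the first candidate. With several
        # identical candidates, this choice is essentially arbitrary.
        chosen = (same_name or candidates)[0]
        candidates.remove(chosen)  # each added path is used at most once
        pairs[src] = chosen
    return pairs


# Jeremy's situation: two added files identical to the deleted one.
example = pair_exact_renames(
    {"orig/foo.txt": b"hello\n"},
    {"dir2/foo.txt": b"hello\n", "dir3/foo.txt": b"hello\n"},
)
print(example)
```

In this sketch both candidates share the basename foo.txt, so the tie is broken only by iteration order, which is the kind of outcome that can surprise a user who expected one specific pairing.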
Jeremy replied to Elijah saying that this sounded like what he was observing. He gave some more examples, showing that when there are multiple 100% matches, Git didn’t always match up the files the way he wanted. Jeremy suggested that filename similarity (beyond just basename matching) be added as a secondary criterion to content similarity for rename detection, since it would help in his case.
Elijah replied that he had tried a few filename similarity ideas, and had added a “same basename” criterion for inexact renames in the ort merge strategy along these lines. However, he said the other filename similarity measurements he tried didn’t work out so well. He mentioned that they risk being repository-specific (in a way where they help with merges in some repositories but actually hurt in others). He also mentioned a rather counter-intuitive result: filename comparisons could rival the cost of content comparisons, which means such measurements could adversely affect performance and possibly even throw a monkey wrench into several of the existing performance optimizations in the current merge algorithm.
The thread also involved additional explanations about various facts involving rename detection. This included details about how renames are just a hint for developers as they are not recorded, but are instead computed from scratch in response to user commands. It also included details about what things like “added by both” mean (namely that both sides added the same filename but with different contents), why you never see “deleted by both” as a conflict status (there is no conflict; the file can just be deleted), and other minor points.
Elijah also brought up a slightly more common case that mirrors the problems Jeremy saw, where users could be surprised by the per-file content similarity matching that Git does. This more general case arises from having multiple copies of a versioned library. For example, you may have a “base” version with a directory named “library-x-1.7/”, and a “stable” version has many changes in that directory, while a “development” branch has removed that directory but has added both a “library-x-1.8/” and a “library-x-1.9/” directory which both have changes compared to “library-x-1.7/”. In such a case, if you are trying to cherry-pick a commit involving several files modified under “library-x-1.7/”, where do the changes get applied? Some users might expect the changes in that commit to get applied to “library-x-1.8/”, while others might expect them to get applied to “library-x-1.9/”. In practice, though, it would not be uncommon for Git to apply the changes from some of the files in the commit to “library-x-1.8/” and changes from other files in the commit to “library-x-1.9/”. Elijah explained why this happens and suggested a hack for users dealing with this particular kind of case to work around rename detection.
Philip Oakley then chimed in to suggest using “BLOBSAME” for exact renames in the same way as “TREESAME” is used in git log for history simplification. Elijah replied to Philip that he thinks that ‘exact rename’ already works. Junio C Hamano, the Git maintainer, then pointed out that “TREESAME” is a property of commits, not trees, and suggested using words other than “BLOBSAME” and “TREESAME” in the context of rename detection.
Philip and Elijah discussed terminology at more length, agreeing that good terminology can sometimes help people coming from an “old centralised VCS” make the mind shift to understand Git’s model, but didn’t find anything that would help in this case.
Finally, Philip requested more information about how Git computes file content similarity (for inexact rename detection), referencing Elijah’s mention of “spanhash representation”. Elijah explained the internal data structure in detail, and supported his earlier claim that “comparison of filenames can rival the cost of file content similarity computations”.
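To give a rough feel for what a chunk-hash similarity measure of this kind looks like, here is a simplified sketch. It is only loosely modelled on the idea of a table of hashed content spans and is not Git's actual code; the function names chunk_counts and similarity, the 64-byte chunk limit, and the use of CRC32 as the hash are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def chunk_counts(data, max_chunk=64):
    """Split `data` (bytes) into chunks that end at a newline or after
    `max_chunk` bytes, and record how many bytes each chunk hash covers."""
    counts = Counter()
    start = 0
    for i, byte in enumerate(data):
        if byte == 0x0A or i - start + 1 == max_chunk:
            chunk = data[start:i + 1]
            counts[zlib.crc32(chunk)] += len(chunk)
            start = i + 1
    if start < len(data):
        # Trailing chunk without a final newline.
        chunk = data[start:]
        counts[zlib.crc32(chunk)] += len(chunk)
    return counts


def similarity(src, dst):
    """Return a score from 0.0 to 1.0: the number of bytes in chunks
    common to both files, divided by the size of the larger file."""
    if not src and not dst:
        return 1.0
    src_counts = chunk_counts(src)
    dst_counts = chunk_counts(dst)
    # A Counter returns 0 for hashes that are missing from dst.
    shared = sum(min(n, dst_counts[h]) for h, n in src_counts.items())
    return shared / max(len(src), len(dst))


print(similarity(b"a\nb\nc\nd\n", b"a\nb\nc\nX\n"))
```

Each file is reduced once to a small table of hashes and byte counts, and pairs of files are then compared through those tables rather than byte by byte.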
Various
Light reading
pull.rebase to true depends on whether the project prefers merges or rebases, and is very project-dependent.
Jujutsu: a new, Git-compatible version control system by Daroc Alden on LWN.net (free link). Jujutsu was first mentioned in Git Rev News Edition #85; there was also a Jujutsu: A Git-Compatible VCS talk by Martin von Zweigbergk at Git Merge 2022, mentioned in passing in Git Rev News Edition #91.
Git tools and sites
This edition of Git Rev News was curated by Christian Couder <christian.couder@gmail.com>, Jakub Narębski <jnareb@gmail.com>, Markus Jansen <mja@jansen-preisler.de> and Kaartic Sivaraam <kaartic.sivaraam@gmail.com> with help from Elijah Newren, Bruno Brito, Brandon Pugh and Štěpán Němec.