This is the idea page for Summer of Code 2020 for Git.
Please completely read the general application information page before reading the idea list below.
Students: Please consider these ideas as starting points for generating proposals. We are also more than happy to receive proposals for other ideas related to Git. Make sure you have read the “Note about refactoring projects versus projects that implement new features” in the general application information page though.
Git has an old problem of duplicated implementations of some logic. For example, Git had at least 4 different implementations to format command output for different commands. Our previous GSoC students and Outreachy interns unified some of the formating logic into ref-filter and got rid of similar logic in some command specific files. Current task is to continue this work and reuse ref-filter formatting logic in pretty.
See discussion in:
git log --onelineimprovements
A number of Git commands, like
git log, can show commit information
in a configurable way using
Such formats though don’t yet support some features that users would
like, for example to display a log like the following:
b9df16a4c (HEAD -> master) pathspec: don't error out on all-exclusionary pathspec patterns 91a491f05 pathspec magic: add '^' as alias for '!' c8e05fd6d ls-remote: add "--diff" option to show only refs that differ 20769079d (tag: v2.12.0-rc2, origin/master, origin/HEAD) Git 2.12-rc2 076c05393 Hopefully the final batch of mini-topics before the final
See discussions in:
A few components of Git are still in the form of shell and sometimes Perl scripts. This causes problems in production code – in particular on multiple platforms, e.g. Windows (think: POSIX-to-Windows path conversion issues).
The idea of this project is to dive into the Git source code and convert a couple of shell and/or Perl scripts into portable and performant C code, making it a so-called “built-in”.
It will be an important part of the project to discuss and find the most interesting scripts or parts of scripts to be ported.
See discussion in:
Git uses various clever methods for making operations on very large
repositories faster, from bitmap indices for
git fetch, to generation
numbers (also known as topological levels) in the commit-graph file for
commit graph traversal operations like
git log --graph.
Unfortunately it turned out that we can get worse performance when using those generation numbers than without them, with using committerdate as a heuristics[8,3] (and for selecting which commits to walk first). It can lead to a large increase in number of commits walked. The example we saw in the Linux kernel repository was a bug fix created on top of a very old commit, so there was a commit of low generation with very high commit-date that caused extra walking. (See  for a detailed description of the data shape in this case.)
Therefore the need for generation number v2 was born. Various candidates were examined (see e.g. https://github.com/derrickstolee/gen-test for initial list). New generation number needed to provide good performance, incremental updates, and (due to unfortunate problem[10,3]) also backward compatibility.
The generation number v2 that fulfills those requirements is ‘backward compatible date ceiling’ or ‘corrected commit date with monotonically increasing offsets[11,3]. What is stored in the commit-graph in place of generation number is value of date offset; it is chosen to be at least 1 more than maximum offsets of the parents of the commit, while committerdate+offset (corrected commit date) also meets this condition.
The task would be then to update the generation number to “v2” based
on the referenced definition. A part of this task would be moving the
generation member of
struct commit into a slab before making it a
64-bit value. (To learn how to store data on a slab one can see
ongoing work on adding Bloom filter for changed files to the commit
graph .) This part could be done with help of Coccinelle
scripts, like the ones in
Another part of this task could be examining performance
improvements, like in https://github.com/derrickstolee/gen-test
(perhaps extending it with
An alternative would be examining other possible improvements that can make Git even faster than just using generation numbers, like using min-post intervals labeling. The basis of this labeling is post-visit order of a depth-first search (DFS) traversal tree of a commit graph, let’s call it ‘post(v)’.
If for each commit ‘v’ we would compute and store in the commit-graph file two numbers: ‘post(v)’ and the minimum of ‘post(u)’ for all commits reachable from ‘v’, let’s call the latter ‘min_graph(v)’, then the following condition is true:
if ‘v’ can reach ‘u’, then min_graph(v) <= post(u) <= post(v)
This labeling can be used to quickly find which commits are unreachable (it is so called negative-cut filter).
If for each commit ‘v’ we would compute and store in the commit-graph file two numbers: ‘post(v)’ and the minimum of ‘post(u)’ for commits that were visited during the part of depth-first search that started from ‘v’ (which is the minimum of post-order number for subtree of a spanning tree that starts at ‘v’), let’s call the later ‘min_tree(v)’, then the following condition is true:
if min_tree(v) <= post(u) <= post(v), then ‘v’ can reach ‘u’
This labeling can be used to quickly find which commits are reachable, because if they are reachable in the spanning tree for commit graph, then they are reachable in commit graph itself. (Such labeling is called positive-cut filter).
The task would be to implement computing such labeling (or a more involved variant of it, for example as described in [4,5,6]), store it in the commit-graph file, and then use it for speeding up git commands, such as:
git merge-base --is-ancestor
git branch --contains
git tag --contains
git branch --merged
git tag --merged
git merge-base --all
git log --topo-sort
Before starting on this task, at the beginning of the GSoC, it might be good idea to have an exploration period, to determine which methods brings which performance improvements. This could be done with the help of “Reachability labels for version control graphs.ipynb” Jupyter Notebook on Google Colaboratory (in Python). This notebook was created to answer such questions, though the exploration didn’t get finished. It would be possible with this notebook to at least find the amount of false negatives for min-post interval labeling in git.git or Linux kernel repo. Alternatively this could be done by creating prototypes and perhaps examining performance in portable and repeatable way using trace2 API, like it was done for gen-test (experimenting with candidates for generation number v2, see above).
One can start this task with using min-post interval labeling for
making selected single command faster, for example for
queries (as it was done for generation numbers).
See also discussion in: