This is the idea page for Summer of Code 2014 for Git and libgit2.
It is strongly recommended that students who want to apply to the Git project for the Summer of Code 2014 complete a tiny, code-related “microproject” as part of their application. Please refer to our guidelines and suggestions for microprojects for more information. Completing a microproject is not a strict requirement, but we will definitely give more attention to applicants who do so. Doing a microproject will also help get you started in the Git project and help you judge whether you want to work with us.
Students: Please consider these ideas as starting points for generating proposals. We are also more than happy to receive proposals for other ideas related to Git or libgit2.
Vicent Martí and Jeff King have open-sourced a patchset that implements JGit’s pack-bitmap optimization on Core Git. Ironically enough, even though Vicent is the maintainer of libgit2, this feature is still not available in the library.
The goal of this SoC project would be to bring the implementation from Core Git back to libgit2 and add new APIs that optimize commonly used operations using bitmaps.
Removed. Michael Haggerty is already working on this.
Git reads objects from storage (loose and packed) through functions in sha1_file.c. Most commands only require very simple, opaque read and write access to the object storage. As a weather balloon, show that it is feasible to use libgit2 git_odb_* routines for these simple callers.
Aim for passing the git test suite using git_odb_* object storage access, except for tests that verify behavior in the face of storage corruption, replacement objects, alternate storage locations, and similar quirks. Of course it is even better if you pass the test suite without exception.
As an alternative to the diff3 conflict style, invent a conflict style that shows the original unpatched segment along with the raw patch text. The user can then apply the patch by hand.
It is common in the git world to have a “triangular” workflow in which commits are fetched from an upstream repository to the local repository, and then pushed up to a personal publishing point. This workflow is missing some convenience features, and there are many possible projects in this area.
For example, @{publish} is a feature like @{upstream}, showing the state of the publish-point in the case of triangular workflows. Implement this while sharing code with git-push, and polish it until the prompt shows publish-state.
When performing operations that may fail, git typically writes to a temporary file and then atomically moves it into place. During failures, some of these temporary files may be left in place. This is convenient for forensics, but inconvenient when the files are very large (especially if the operation failed due to running out of disk space). Refactor the handling of temporary packs and object files so that they can optionally be cleaned up automatically. The implementation should be shared with other files that are cleaned automatically, like lockfiles.
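As a rough illustration of the idea above, the following Python sketch keeps a registry of temporary files and removes any that were never moved into place. It is not how git implements lockfiles or temporary packs; the class name `TempFileRegistry`, its methods, and the `keep_on_failure` switch are all hypothetical, and only normal process exit is covered (git would also need to handle signals).

```python
import atexit
import os
import tempfile


class TempFileRegistry:
    """Tracks temporary files and removes leftovers (hypothetical sketch)."""

    def __init__(self, keep_on_failure=False):
        # keep_on_failure=True preserves leftovers for forensics.
        self.keep_on_failure = keep_on_failure
        self._pending = set()
        # Run cleanup when the process exits normally.
        atexit.register(self.cleanup)

    def create(self, directory, prefix="tmp_pack_"):
        """Create a temporary file and remember it until it is committed."""
        fd, path = tempfile.mkstemp(prefix=prefix, dir=directory)
        self._pending.add(path)
        return fd, path

    def commit(self, path, final_path):
        """Atomically move a finished temporary file into place."""
        os.replace(path, final_path)
        self._pending.discard(path)

    def cleanup(self):
        """Delete every temporary file that was never committed."""
        if self.keep_on_failure:
            return
        for path in list(self._pending):
            try:
                os.unlink(path)
            except FileNotFoundError:
                pass
            self._pending.discard(path)
```

A caller would create the file through the registry, write to it, and call `commit()` on success; anything still pending at exit is deleted unless `keep_on_failure` is set.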
git-bisect improvements
The student will become familiar with the git-bisect command and implement many small-to-medium fixes. Two examples:
An oft-requested feature is for bisect to swap the good and bad labels or to give them alternate names (for finding a fix rather than a regression). While this seems simple at the outset, there are many subtleties. The student will need to read and understand previous proposals in this area.
In some cases, git bisect may test too many merge bases, thus slowing down the bisection (making it closer to linear than logarithmic).
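To make the label-swapping idea concrete, here is a minimal Python sketch of a binary search over a linear history in which the two labels are parameters. It deliberately ignores merges, skipped commits, and everything else that makes the real git bisect subtle; the function name `bisect_linear` and its parameters are hypothetical.

```python
def bisect_linear(commits, classify, term_old="good", term_new="bad"):
    """Find the first commit labelled term_new in a linear history.

    `commits` is ordered oldest first. The first commit is assumed to be
    term_old and the last one term_new. `classify` returns one of the two
    terms for a given commit.
    """
    lo, hi = 0, len(commits) - 1
    steps = 0
    while hi - lo > 1:
        mid = (lo + hi) // 2
        label = classify(commits[mid])
        steps += 1
        if label == term_old:
            lo = mid
        elif label == term_new:
            hi = mid
        else:
            raise ValueError(f"unexpected label: {label!r}")
    return commits[hi], steps


# Looking for the commit that fixed a bug: "broken" is old, "fixed" is new.
history = [f"c{i}" for i in range(16)]
first_fixed, tests_run = bisect_linear(
    history,
    lambda c: "fixed" if int(c[1:]) >= 11 else "broken",
    term_old="broken",
    term_new="fixed",
)
```

With alternate names, the search itself is unchanged; only the vocabulary presented to the user differs, which is why the difficulty lies mostly in the user interface and stored state rather than in the algorithm.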
Students proposing projects in this area will be expected to communicate with the Git community and include specific projects in their proposal.
git branch -l, git tag -l, and git for-each-ref
These three commands are all about selecting a subset of a repository’s refs, and then printing the result. However, the implementations are not shared. Some commands know selection options the others do not (e.g., --contains, --merged), and others know formatting options the others do not (e.g., for-each-ref’s --format).
There have been experimental patches to unify the selection process, and some discussion on unifying the formatting. Based on this previous work, factor out a common library which can cleanly handle both selection and formatting, and use it consistently in the three commands.
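The split between selection and formatting could look roughly like the Python sketch below. It is only an illustration of the library boundary, not a proposal for the actual C API: the `Ref` type, `select_refs`, and `format_ref` are hypothetical names, the pattern matching uses Python's `fnmatch` rather than git's own rules, and the predicate stands in for options such as --contains or --merged.

```python
import fnmatch
import re
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Ref:
    refname: str
    objectname: str
    subject: str = ""


def select_refs(refs, patterns=(), predicate=None):
    """Selection step shared by all three commands (hypothetical)."""
    selected = []
    for ref in refs:
        if patterns and not any(
            fnmatch.fnmatchcase(ref.refname, p) for p in patterns
        ):
            continue
        # The predicate stands in for checks like --contains or --merged.
        if predicate is not None and not predicate(ref):
            continue
        selected.append(ref)
    return selected


_FIELD = re.compile(r"%\((\w+)\)")


def format_ref(ref, fmt):
    """Formatting step: expand %(fieldname) placeholders from the ref."""
    fields = asdict(ref)
    return _FIELD.sub(lambda m: str(fields[m.group(1)]), fmt)


refs = [
    Ref("refs/heads/master", "1111111", "Merge topic"),
    Ref("refs/heads/topic", "2222222", "WIP"),
    Ref("refs/tags/v1.0", "3333333", "Release"),
]
for ref in select_refs(refs, ["refs/heads/*"]):
    print(format_ref(ref, "%(objectname) %(refname)"))
```

Each command would then only differ in its default patterns and default format string.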
Libgit2 has support for the client side of the negotiation, but it’s missing server-side capabilities. We wouldn’t want to simply reimplement upload-pack or receive-pack, but instead create a framework that takes care of the protocol details and calls into user code, which would allow, e.g., limiting which references are shown to a particular user or making decisions about updates in callbacks instead of script hooks.
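The callback idea can be sketched in Python as follows. The framework side handles pkt-line framing (a four-digit hexadecimal length that includes the four length bytes, with 0000 as the flush packet), while the user-supplied callback decides which references a given user may see. This is a simplified illustration and not libgit2's API: `advertise_refs` and `may_see` are hypothetical names, and the capability list that a real advertisement attaches to the first line is omitted.

```python
def pkt_line(payload: bytes) -> bytes:
    """Frame a payload as a pkt-line: 4 hex digits of length, then data."""
    return b"%04x" % (len(payload) + 4) + payload


FLUSH_PKT = b"0000"


def advertise_refs(refs, user, may_see):
    """Build a simplified ref advertisement for `user`.

    `refs` maps reference names to object ids. `may_see(user, refname)` is
    the user-supplied callback that decides visibility (hypothetical).
    """
    out = []
    for name, oid in sorted(refs.items()):
        if may_see(user, name):
            out.append(pkt_line(f"{oid} {name}\n".encode()))
    out.append(FLUSH_PKT)
    return b"".join(out)


refs = {
    "refs/heads/master": "1" * 40,
    "refs/heads/secret": "2" * 40,
}


def hide_secret(user, refname):
    # Example policy: only "admin" may see the secret branch.
    return user == "admin" or not refname.endswith("/secret")


guest_view = advertise_refs(refs, "guest", hide_secret)
```

The same pattern would apply to updates: the framework parses the incoming commands and asks a callback whether each one is allowed.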
Sometimes git objects contain malformed or undesirable data. E.g., broken author emails, skewed dates, and trees with duplicate filenames are all malformed from git’s perspective. Something like invalid or non-normalized UTF-8 in pathnames is not an error, but may violate project policy.
Because git’s data model is additive, fixing these problems requires rewriting history to create new objects. Doing this with the current toolset is possible, but requires a high degree of specialized knowledge, and often requires running the slow and arcane git filter-branch.
There are several possible improvements that can be made in this area, including:
Broader git fsck coverage of git data errors
Teaching git fsck to optionally note policy problems (like UTF-8)
Teaching hash-object to perform stricter, fsck-like checks
A way to turn fsck errors into fixed git replace objects
A way to rewrite history using git replace, cementing replacement objects into place
A successful project would not have to hit each of these points, but should aim for producing a coherent workflow for non-experts to diagnose and repair broken history.
git rebase --interactive
One of the more powerful features in Git is the command git rebase --interactive, which allows recent commits to be reordered, squashed together, or even revised completely. The command creates a todo list and opens it in an editor. The original todo list might look like:
pick deadbee Implement feature XXX
pick c0ffeee The oneline of the next commit
pick 01a01a0 This change is questionable
pick f1a5c00 Fix to feature XXX
pick deadbab The oneline of the commit after
The user can edit the list to make changes to the history, for example to:
pick deadbee Implement feature XXX
squash f1a5c00 Fix to feature XXX
exec make
edit c0ffeee The oneline of the next commit
pick deadbab The oneline of the commit after
This would cause commits deadbee and f1a5c00 to be squashed together into one commit, followed by running make to test-compile the results. It would also delete commit 01a01a0 altogether, and stop after committing c0ffeee to allow the user to make changes.
It would be nice to support more flexibility in the todo-list commands by allowing the commands to take options. Some possibilities:
Convert a commit into a merge commit:
pick -p c0ffeee -p e1ee712 deadbab The oneline of the commit after
After squashing two commits, add a “Signed-off-by” line to the commit log message:
pick deadbee Implement feature XXX
squash --signoff f1a5c00 Fix to feature XXX
or GPG-sign a commit:
pick --gpg-sign=<keyid> deadbee Implement feature XXX
Reset the author of the commit to the current user or a specified user:
pick --reset-author deadbee Implement feature XXX
pick --author="A U Thor <author@example.com>" deadbab The oneline of the commit after
See this discussion on the mailing list for more related ideas.
The goal of this project would be (1) to add the infrastructure for handling options on todo-list lines, and (2) to implement some concrete options. A big part of the difficulty of this project is that git rebase --interactive is implemented via a sparsely-commented shell script. Adding comments and cleaning up the script as you go would be very welcome.
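The infrastructure part, parsing options out of a todo-list line, could be prototyped along the lines of the Python sketch below. The real implementation would live in the rebase script itself; this only illustrates one possible grammar for the example lines above. The names `TodoLine`, `parse_todo_line`, and `OPTIONS_WITH_SEPARATE_VALUE` are hypothetical, as is the assumption that -p is the only option whose value is a separate word.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Assumption: these options take their value as the next word.
OPTIONS_WITH_SEPARATE_VALUE = {"-p"}

# -x or --name, optionally followed by =value or ="quoted value".
_OPTION_RE = re.compile(r'(--?[A-Za-z][\w-]*)(?:=(?:"([^"]*)"|(\S+)))?(?:\s+|$)')
_WORD_RE = re.compile(r"(\S+)(?:\s+|$)")


@dataclass
class TodoLine:
    command: str
    options: list = field(default_factory=list)  # (name, value or None)
    commit: Optional[str] = None
    subject: str = ""


def parse_todo_line(line):
    rest = line.strip()
    word = _WORD_RE.match(rest)
    if word is None:
        raise ValueError("empty todo line")
    command, rest = word.group(1), rest[word.end():]
    if command == "exec":
        # The remainder is a shell command, not options and a commit.
        return TodoLine(command, subject=rest)
    options = []
    while rest.startswith("-"):
        opt = _OPTION_RE.match(rest)
        if opt is None:
            raise ValueError(f"malformed option in: {line!r}")
        name = opt.group(1)
        value = opt.group(2) if opt.group(2) is not None else opt.group(3)
        rest = rest[opt.end():]
        if value is None and name in OPTIONS_WITH_SEPARATE_VALUE:
            word = _WORD_RE.match(rest)
            if word is None:
                raise ValueError(f"{name} needs a value in: {line!r}")
            value, rest = word.group(1), rest[word.end():]
        options.append((name, value))
    word = _WORD_RE.match(rest)
    if word is None:
        raise ValueError(f"missing commit in: {line!r}")
    return TodoLine(command, options, word.group(1), rest[word.end():])
```

For example, `parse_todo_line("squash --signoff f1a5c00 Fix to feature XXX")` yields the command `squash`, the option `("--signoff", None)`, the commit `f1a5c00`, and the remaining text as the subject.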
There are many places in Git that need to read a configuration value. Currently, each such site calls git_config(), which reads and parses the configuration files every time that it is called. This is wasteful, because it results in the configuration files being processed multiple times during a single git invocation. It also prevents the implementation of potential new features, like adding syntax to allow a configuration file to unset a previously-set value.
The goal of this project is to make configuration work as follows:
Read the configuration from files once and cache the results in an appropriate data structure in memory.
Change git_config() to iterate through the pre-read values in memory rather than re-reading the configuration files. This function should remain backwards-compatible with the old implementation so that callers don’t have to all be rewritten at once.
Add new API functions that allow the cache to be queried easily and efficiently. Add helper functions to retrieve configuration values of various types (string, integer, boolean, etc.) from the cache by name.
Rewrite callers to use the new API wherever possible.
You will need to consider how to handle other config API entry points like git_config_early() and git_config_from_file(), as well as how to invalidate the cache correctly in the case that the configuration is changed while git is executing.
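The intended shape of the cache can be sketched in Python, with the understanding that the real work would be done in C inside git. In this sketch the `loader` callable stands in for reading and parsing the configuration files, and `ConfigCache`, `for_each`, `get_string`, `get_bool`, `get_int`, and `invalidate` are hypothetical names. Keys are assumed to be already normalized, and the last value of a key wins for single-valued lookups.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


class ConfigCache:
    """Reads configuration once and answers queries from memory (sketch)."""

    def __init__(self, loader):
        # loader() returns (key, value) pairs in file order; value is None
        # for a key written without "= value".
        self._loader = loader
        self._entries = None

    def _load(self):
        if self._entries is None:
            self._entries = list(self._loader())
        return self._entries

    def for_each(self, callback):
        """Backwards-compatible iteration, the role git_config() plays."""
        for key, value in self._load():
            callback(key, value)

    def get_all(self, key):
        return [v for k, v in self._load() if k == key]

    def get_string(self, key, default=None):
        values = self.get_all(key)
        return values[-1] if values else default

    def get_bool(self, key, default=False):
        values = self.get_all(key)
        if not values:
            return default
        value = values[-1]
        if value is None:  # a bare key counts as true
            return True
        text = value.strip().lower()
        if text in ("true", "yes", "on"):
            return True
        if text in ("false", "no", "off", ""):
            return False
        return int(text) != 0

    def get_int(self, key, default=0):
        value = self.get_string(key)
        if value is None:
            return default
        text = value.strip().lower()
        if text and text[-1] in _UNITS:
            return int(text[:-1]) * _UNITS[text[-1]]
        return int(text)

    def invalidate(self):
        """Forget cached values, e.g. after the configuration was changed."""
        self._entries = None
```

Existing callers would keep using the iteration interface unchanged, while new code would call the typed helpers directly; `invalidate()` marks where the cache-invalidation question from the text has to be answered.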
See this mailing list thread and this email for some discussion about this and related ideas.