Archive for December 2007
the woes of “git gc –aggressive” (and how git deltas work)
Today I found a gem in the git mailing lists that discusses a little bit about how git handles deltas in the pack (i.e. efficiently storing revisions) and why — somewhat non-obviously — the aggressive git garbage collect (invoked by doing git gc --aggressive) is (generally) a big no-no. The verbatim email from Linus explaining this is affixed as part of the full text of this article.
A quick summary
Since there is little point in simply reposting this information (other than for personal archival), I will condense it here for quick reading:
Git does not use your standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, it is legal to use any other stored version to derive another version. Contrast this to most version control systems where the only option is simply to compute the delta against the last version. The latter approach is so common probably because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not in any way tied to these deltas (which are arranged to minimize space usage) and the history is instead imposed at a higher level of abstraction.
Now that we have exposed how git has some greater flexibility in choosing what revisions to derive another revision from we get to the problem with --aggressive.
Here’s what the git-gc 1.5.3.7 man page has to say about it:
--aggressive
Usually git-gc runs very quickly while providing good disk space
utilization and performance. This option will cause git-gc to more
aggressively optimize the repository at the expense of taking much
more time. The effects of this optimization are persistent, so this
option only needs to be used occasionally; every few hundred
changesets or so.
Unfortunately, this characterization is very misleading. It can be true if one has a horrendous set of delta-derivations (for example: after doing a large git-fast-import), but its true behavior is to throw away all the old deltas and compute new ones from scratch. This may not sound so bad except that --aggressive isn’t aggressive enough at doing this to do a good job and may throw away better delta decisions made previously. For this reason --aggressive will probably be removed from the manpages and left as an undocumented feature for a while.
So now you ask: “Well, suppose I do really want to do the expensive thing because I just copied my company’s history into git and it has an inordinately large pack. How do I do it?”
Excerpted from Linus’ mail here is a terse recipe (with some explanation) that may take a very long time and require a lot of RAM to run but should deliver results:
So the equivalent of "git gc --aggressive" - but done *properly* - is to do (overnight) something like git repack -a -d --depth=250 --window=250 where that depth thing is just about how deep the delta chains can be (make them longer for old history - it's worth the space overhead), and the window thing is about how big an object window we want each delta candidate to scan. And here, you might well want to add the "-f" flag (which is the "drop all old deltas", since you now are actually trying to make sure that this one actually finds good candidates.
Other notes and observations
- If you have a development history where you constantly change between several particular versions of, say, a large binary blob — say a resource file of some kind — this operation can be very cheap under Git since it can delta against versions that are not adjacent in the development history.
- The delta derivations don’t have to obey causality: a commit made chronologically later can be used to derive one made earlier. It’s just a bunch of blobs in a graph, there isn’t even a strictly necessary notion of time attached to each blob at all to begin with! That data is maintained at a higher level. Repack doesn’t have to know or care about when a commit was made. (The only reason it may care is to help implement heuristics. Right now no such heuristic exists[0])
- Finding/verifying an optimal (space-minimizing) delta-derivation graph feels NP-hard. I now wave my hands furiously.
[0]: From the git-repack man page:
--window=[N], --depth=[N]
These two options affect how the objects contained in the pack are
stored using delta compression. The objects are first internally
sorted by type, size and optionally names and compared against the
other objects within --window to see if using delta compression
saves space. --depth limits the maximum delta depth; making it too
deep affects the performance on the unpacker side, because delta
data needs to be applied that many times to get to the necessary
object. The default value for --window is 10 and --depth is 50.