
the woes of “git gc --aggressive” (and how git deltas work)


Today I found a gem in the git mailing list that discusses how git handles deltas in its packs (i.e., how it stores revisions efficiently) and why, somewhat non-obviously, the aggressive git garbage collect (invoked by doing git gc --aggressive) is generally a big no-no. Linus' verbatim email explaining this is affixed as part of the full text of this article.

A quick summary

Since there is little point in simply reposting this information (other than for personal archival), I will condense it here for quick reading:

Git does not use the standard per-file, per-commit forward and/or backward delta chains to derive files. Instead, any stored version may be used to derive any other stored version. Contrast this with most version control systems, where the only option is to compute the delta against the previous version. That approach is probably so common because of a systematic tendency to couple the deltas to the revision history. In git, the development history is not in any way tied to these deltas (which are arranged purely to minimize space usage); the history is instead imposed at a higher level of abstraction.
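You can see this structure for yourself: git verify-pack will dump a pack's contents, including, for each deltified object, its chain depth and the base object it was derived from. A minimal sketch (the pack filename varies per repository):

    # list every object in the pack; deltified objects additionally
    # report their delta depth and the SHA-1 of their base object
    git verify-pack -v .git/objects/pack/pack-*.idx

Note that nothing in this output refers to commit parentage: a delta's base is simply whatever stored object compressed well against it.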

Now that we have seen how git has greater flexibility in choosing which revisions to derive another revision from, we can get to the problem with --aggressive.

Here’s what the git-gc 1.5.3.7 man page has to say about it:

       --aggressive
           Usually git-gc runs very quickly while providing good disk space
           utilization and performance. This option will cause git-gc to more
           aggressively optimize the repository at the expense of taking much
           more time. The effects of this optimization are persistent, so this
           option only needs to be used occasionally; every few hundred
           changesets or so.

Unfortunately, this characterization is very misleading. It can be true if one has a horrendous set of delta derivations (for example: after doing a large git-fast-import), but what --aggressive actually does is throw away all the old deltas and compute new ones from scratch. That may not sound so bad, except that --aggressive is not aggressive enough about the recomputation to do a good job, and so it may throw away better delta decisions made previously. For this reason --aggressive will probably be removed from the manpages and left as an undocumented feature for a while.
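If you want to judge for yourself whether a particular gc or repack helped, stock git can report pack statistics; compare the output before and after the operation:

    # "size-pack" is the total size of all packs in KiB; "count" and
    # "size" describe loose (unpacked) objects
    git count-objects -v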

So now you ask: “Well, suppose I really do want to do the expensive thing because I just copied my company’s history into git and it has an inordinately large pack. How do I do it?”

Excerpted from Linus’ mail, here is a terse recipe (with some explanation) that may take a very long time and a lot of RAM to run, but should deliver results:

So the equivalent of "git gc --aggressive" - but done *properly* - is to
do (overnight) something like

	git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be
(make them longer for old history - it's worth the space overhead), and
the window thing is about how big an object window we want each delta
candidate to scan.

And here, you might well want to add the "-f" flag (which is the "drop all
old deltas" option), since you now are actually trying to make sure that
this one actually finds good candidates.
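Putting the two halves of the mail together, the done-properly equivalent of git gc --aggressive is therefore a one-shot:

    # -f drops all existing deltas so the large window/depth search
    # actually gets a chance to find better ones
    git repack -a -d -f --depth=250 --window=250

Because the effects are persistent, as the man page notes, later ordinary repacks will keep whatever good deltas this run finds; it should only need repeating after something like another huge import.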

Other notes and observations

  • If you have a development history that constantly flips between several particular versions of, say, a large binary blob (a resource file of some kind, perhaps), storing this can be very cheap under git, since it can delta against versions that are not adjacent in the development history (a toy demonstration follows this list).
  • The delta derivations don’t have to obey causality: a commit made chronologically later can be used to derive one made earlier. It’s just a bunch of blobs in a graph; there isn’t even a strictly necessary notion of time attached to each blob to begin with, since that data is maintained at a higher level. Repack doesn’t have to know or care about when a commit was made. (The only reason it might care would be to help implement heuristics; right now no such heuristic exists.[0])
  • Finding/verifying an optimal (space-minimizing) delta-derivation graph feels NP-hard. I now wave my hands furiously.
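As a toy demonstration of the first point above, the following sketch (repository and file names are invented for the example) interleaves two families of mutually similar blobs in one history, repacks, and inspects the result; the deltified “A” blobs should use other “A” blobs as bases, skipping right over the unrelated “B” versions that sit between them in the history:

    # two interleaved families of blobs: each "A" version closely
    # resembles the other "A" versions (likewise "B"), but an "A"
    # and a "B" share almost no content
    git init delta-demo && cd delta-demo
    for i in $(seq 1 5); do
        { seq 1 1000;    echo "A variant $i"; } > blob.txt
        git add blob.txt && git commit -qm "A $i"
        { seq 5000 6000; echo "B variant $i"; } > blob.txt
        git add blob.txt && git commit -qm "B $i"
    done
    git repack -a -d -f
    # deltified blobs print their depth and base SHA-1 in the last
    # two columns; the bases are non-adjacent versions in history
    git verify-pack -v .git/objects/pack/pack-*.idx | grep blob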

[0]: From the git-repack man page:

--window=[N], --depth=[N]

    These two options affect how the objects contained in the pack are
    stored using delta compression. The objects are first internally
    sorted by type, size and optionally names and compared against the
    other objects within --window to see if using delta compression
    saves space. --depth limits the maximum delta depth; making it too
    deep affects the performance on the unpacker side, because delta
    data needs to be applied that many times to get to the necessary
    object. The default value for --window is 10 and --depth is 50.


Written by fdr

December 6, 2007 at 4:56 am
