Metalinguistic Abstraction

Computer Languages, Programming, and Free Software

the woes of “git gc --aggressive” (and how git deltas work)


Today I found a gem in the git mailing list that discusses how git handles deltas in the pack (i.e. how it stores revisions efficiently) and why, somewhat non-obviously, the aggressive git garbage collect (invoked by doing git gc --aggressive) is generally a big no-no. The verbatim email from Linus explaining this is included at the end of this article.

A quick summary

Since there is little point in simply reposting this information (other than for personal archival), I will condense it here for quick reading:

Git does not use the standard per-file/per-commit forward and/or backward delta chains to derive files. Instead, any other stored version may be used to derive another version. Contrast this with most version control systems, where the only option is to compute the delta against the previous version. That approach is probably so common because of a systematic tendency to couple the deltas to the revision history. In Git the development history is not tied to these deltas in any way (they are arranged purely to minimize space usage); the history is instead imposed at a higher level of abstraction.
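
If you want to see this structure for yourself, git verify-pack can dump it. The pack name below is just a placeholder; substitute whichever .idx file lives under .git/objects/pack in your repository:

	git verify-pack -v .git/objects/pack/pack-<sha1>.idx | less

Each deltified object is listed with its chain depth and the SHA-1 of the object it was deltified against; that base object can be any object in the pack, not merely the neighboring revision of the same file. A histogram of chain lengths is printed at the end.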

Now that we have seen that git has greater flexibility in choosing which stored revisions to derive another revision from, we get to the problem with --aggressive.

Here's what the git-gc 1.5.3.7 man page has to say about it:

       --aggressive
           Usually git-gc runs very quickly while providing good disk space
           utilization and performance. This option will cause git-gc to more
           aggressively optimize the repository at the expense of taking much
           more time. The effects of this optimization are persistent, so this
           option only needs to be used occasionally; every few hundred
           changesets or so.

Unfortunately, this characterization is very misleading. It can hold true if one has a horrendous set of delta derivations (for example, after doing a large git-fast-import), but what --aggressive actually does is throw away all the old deltas and compute new ones from scratch. That may not sound so bad, except that --aggressive isn't aggressive enough to do a good job of it, and so it may discard better delta decisions made previously. For this reason --aggressive will probably be removed from the man pages and left as an undocumented feature for a while.
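
Whether your existing pack is actually in such bad shape is something you can measure rather than guess at. git count-objects reports the current pack size (the size-pack figure, in KiB); run it before and after a repack to see whether the new delta decisions actually bought you anything:

	git count-objects -v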

So now you ask: "Well, suppose I do really want to do the expensive thing because I just copied my company's history into git and it has an inordinately large pack. How do I do it?"

Excerpted from Linus' mail, here is a terse recipe (with some explanation) that may take a very long time and a lot of RAM to run, but should deliver results:

So the equivalent of "git gc --aggressive" - but done *properly* - is to
do (overnight) something like

	git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be
(make them longer for old history - it's worth the space overhead), and
the window thing is about how big an object window we want each delta
candidate to scan.

And here, you might well want to add the "-f" flag (which is the "drop all
old deltas", since you now are actually trying to make sure that this one
actually finds good candidates.
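
Putting the excerpt together, the complete invocation, including the -f flag he mentions, comes out to something along these lines (schedule it for a quiet night; it is deliberately CPU- and memory-hungry):

	git repack -a -d -f --depth=250 --window=250

Here -a repacks all reachable objects into a single pack, -d deletes the old packs and loose objects that have become redundant, and -f drops the existing deltas so that the larger window and depth can find better ones.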

Other notes and observations

  • If you have a development history that constantly switches between several particular versions of, say, a large binary blob (a resource file of some kind, for instance), storing that history can be very cheap under Git, since it can delta against versions that are not adjacent in the development history.
  • The delta derivations don't have to obey causality: an object from a chronologically later commit can be used to derive one from an earlier commit. It's just a bunch of blobs in a graph; there isn't even a strictly necessary notion of time attached to each blob to begin with! That data is maintained at a higher level, and repack doesn't have to know or care about when a commit was made. (The only reason it might care would be to help implement heuristics; right now no such heuristic exists.[0])
  • Finding/verifying an optimal (space-minimizing) delta-derivation graph feels NP-hard. I now wave my hands furiously.

[0]: From the git-repack man page:

--window=[N], --depth=[N]

    These two options affect how the objects contained in the pack are
    stored using delta compression. The objects are first internally
    sorted by type, size and optionally names and compared against the
    other objects within --window to see if using delta compression
    saves space. --depth limits the maximum delta depth; making it too
    deep affects the performance on the unpacker side, because delta
    data needs to be applied that many times to get to the necessary
    object. The default value for --window is 10 and --depth is 50.
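
Incidentally, if you want later repacks to keep using more thorough settings than the defaults quoted above, the same knobs are exposed as configuration; the values here simply mirror Linus' recipe:

	git config pack.window 250
	git config pack.depth 250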

Linus' email to the list

Date:	Wed, 5 Dec 2007 22:09:12 -0800 (PST)
From:	Linus Torvalds
Subject: Re: Git and GCC

On Thu, 6 Dec 2007, Daniel Berlin wrote:
>
> Actually, it turns out that git-gc --aggressive does this dumb thing
> to pack files sometimes regardless of whether you converted from an
> SVN repo or not.

Absolutely. git --aggressive is mostly dumb. It's really only useful for
the case of "I know I have a *really* bad pack, and I want to throw away
all the bad packing decisions I have done".

To explain this, it's worth explaining (you are probably aware of it, but
let me go through the basics anyway) how git delta-chains work, and how
they are so different from most other systems.

In other SCM's, a delta-chain is generally fixed. It might be "forwards"
or "backwards", and it might evolve a bit as you work with the repository,
but generally it's a chain of changes to a single file represented as some
kind of single SCM entity. In CVS, it's obviously the *,v file, and a lot
of other systems do rather similar things.

Git also does delta-chains, but it does them a lot more "loosely". There
is no fixed entity. Delta's are generated against any random other version
that git deems to be a good delta candidate (with various fairly
successful heursitics), and there are absolutely no hard grouping rules.

This is generally a very good thing. It's good for various conceptual
reasons (ie git internally never really even needs to care about the whole
revision chain - it doesn't really think in terms of deltas at all), but
it's also great because getting rid of the inflexible delta rules means
that git doesn't have any problems at all with merging two files together,
for example - there simply are no arbitrary *,v "revision files" that have
some hidden meaning.

It also means that the choice of deltas is a much more open-ended
question. If you limit the delta chain to just one file, you really don't
have a lot of choices on what to do about deltas, but in git, it really
can be a totally different issue.

And this is where the really badly named "--aggressive" comes in. While
git generally tries to re-use delta information (because it's a good idea,
and it doesn't waste CPU time re-finding all the good deltas we found
earlier), sometimes you want to say "let's start all over, with a blank
slate, and ignore all the previous delta information, and try to generate
a new set of deltas".

So "--aggressive" is not really about being aggressive, but about wasting
CPU time re-doing a decision we already did earlier!

*Sometimes* that is a good thing. Some import tools in particular could
generate really horribly bad deltas. Anything that uses "git fast-import",
for example, likely doesn't have much of a great delta layout, so it might
be worth saying "I want to start from a clean slate".

But almost always, in other cases, it's actually a really bad thing to do.
It's going to waste CPU time, and especially if you had actually done a
good job at deltaing earlier, the end result isn't going to re-use all
those *good* deltas you already found, so you'll actually end up with a
much worse end result too!

I'll send a patch to Junio to just remove the "git gc --aggressive"
documentation. It can be useful, but it generally is useful only when you
really understand at a very deep level what it's doing, and that
documentation doesn't help you do that.

Generally, doing incremental "git gc" is the right approach, and better
than doing "git gc --aggressive". It's going to re-use old deltas, and
when those old deltas can't be found (the reason for doing incremental GC
in the first place!) it's going to create new ones.

On the other hand, it's definitely true that an "initial import of a long
and involved history" is a point where it can be worth spending a lot of
time finding the *really*good* deltas. Then, every user ever after (as
long as they don't use "git gc --aggressive" to undo it!) will get the
advantage of that one-time event. So especially for big projects with a
long history, it's probably worth doing some extra work, telling the delta
finding code to go wild.

So the equivalent of "git gc --aggressive" - but done *properly* - is to
do (overnight) something like

	git repack -a -d --depth=250 --window=250

where that depth thing is just about how deep the delta chains can be
(make them longer for old history - it's worth the space overhead), and
the window thing is about how big an object window we want each delta
candidate to scan.

And here, you might well want to add the "-f" flag (which is the "drop all
old deltas", since you now are actually trying to make sure that this one
actually finds good candidates.

And then it's going to take forever and a day (ie a "do it overnight"
thing). But the end result is that everybody downstream from that
repository will get much better packs, without having to spend any effort
on it themselves.

			Linus

Written by fdr

December 6, 2007 at 4:56 am

Posted in distributed, version-control


11 Responses


  1. This seems like an awful lot of bother for something I just want to use to track my source code. Which I guess is why I ended up using Bazaar instead of git. :p

    InfiniteVoid

    June 17, 2008 at 7:59 am

  2. * bzr, AFAIK, is a far stretch from being as fast as git
    * especially at finding things like content-moves
    * explicit move metadata is annoying to import, I prefer git’s heuristic approach (it’s not rocket science to discover a file move)
    * git gc is nowadays automatic (see the note just after this list)
    * this repacking only really need be an issue when you import large repos with git-fast-import or one of the tools that uses it.
    * newer git svn will automagically handle this too…
    * git’s history-rewriting friendliness is a godsend when refactoring patchsets and doing code review.
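
    A quick illustration of the "automatic" point above: certain porcelain commands invoke git gc --auto, which only does real work once enough loose objects have piled up. The threshold shown here is git's default; setting it to 0 disables auto-gc entirely.

    	git config gc.auto 6700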

    fdr

    June 24, 2008 at 9:39 pm

  3. [...] that you might want to call gc and repack to compress the git repository. $ git gc --aggressive $ git repack -a -d --depth=250 [...]

  4. Wait wait, am I reading you right, that git delta encoding has no relationship to the actual relationships between commits?

    Then why is git gc --aggressive wrong? Worst case, the repo becomes at most a tiny bit slower, which frankly doesn't seem like an issue at all given that git is basically lightning fast, but it will still be at its most compact.

    That git repack taking overnight, as written above, is totally not the case for anything humanly sized. It takes up 650 MB of RAM and runs within minutes on a 25,000-commit, 4-year-old repo with a 10-year-old codebase of around two million lines of text. The resulting pack is about 65 MB, compressed from the 800 MB initially produced by fast-import.

    • Yes, you are reading me right. Git has some heuristics as to how to organize objects into a window that is then used for delta-chaining, and that need not have anything to do with the lineage of the content being changed (although it optionally can).

      It is wrong when one already has a high-quality pack: --aggressive will forget all those good decisions and make a bunch of worse decisions instead. But the real problem seems to be that --aggressive was, at the time at least, not aggressive enough, so in the above post Linus recommends an even more aggressive formulation that seeks out longer delta chains and uses more memory.

      Your use case, fast-import, is one he specifically calls out as a winner: it is exactly the situation where one knows the delta derivations in the pack are very lousy.

      fdr

      August 10, 2012 at 11:59 am

    • Just for context — you can argue whether this is humanly sized — your repo is tiny, and your history shallow, compared to the one I deal with.

      * 271,568 commits imported from SVN.
      * 15 year history
      * nearly 2 GB of repository data (mostly contained in 10 packs)
      * 11,322,497 lines of Java code
      * 318,448 lines of Framemaker source
      * millions of lines of other stuff I got tired of waiting on the line counts for.

      I’m happy for you for your lightweight repository, but you shouldn’t dismiss the concerns about the time it takes, when there are repositories out there an order of magnitude bigger than yours.

      Still — you provide a useful datapoint, and I may give it a try sometime when I can spare one of my repositories a couple days (in case it takes that long to complete).

      Git is certainly fast — the performance issues I have relate to the size of the worktree itself, and anything that doesn’t have to traverse the actual worktree is really quick. ‘git status’ is probably the biggest annoyance, and I’m looking at splitting things up. Splitting the codebase into separate branches in the same repository speeds things up considerably, so it’s definitely the worktree size that’s the issue.

      But it does make me suspect that there may NOT be repositories out there another order of magnitude larger than this one.

      Bob Kerns

      October 4, 2012 at 5:45 am

  5. Any updates on this? The git gc --aggressive documentation reads as above, 4 1/2 years later. Is --aggressive still to be avoided under normal usage scenarios (i.e. no huge nasty imports, just normal everyday work)?

    • I think it’s still a bad idea except when one has a very disorganized pack, which basically never happens in common operation.

      Since then, though, one is able to specify the amount of memory (rather than the number of objects) used for the delta-chain seeking window, using the git repack --window-memory option. This is handy because the old formulation could blow out memory if some objects were very large.
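
      For example, something like the following, where the memory limit is just an illustrative value to be tuned to your machine:

      	git repack -a -d -f --depth=250 --window=250 --window-memory=1g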

      fdr

      August 10, 2012 at 11:55 am

  6. Has aggressive been fixed? Latest man page still says same as above. If the option is wrong, shouldn’t the man be updated?

    Guest

    November 27, 2012 at 3:36 pm

  7. [...] repack makes it a little smaller. Running garbage collection made it way smaller, but it’s not recommended. This, however, is [...]

  8. […] Check out this article mentioned by @spuder in the comments for a better […]

