Recently I’ve been thinking a lot about application partitioning, and
how ad-hoc it seems to be at large: everyone knows that they have to
do it (some people call it “sharding”). No approach seems to be
supported as a common pattern by web application frameworks that I
would consider using.
I also was thinking about GiST and GIN indexes in Postgres, which,
like all other index types, serve to partition parts of a database
heap to allow for rapid retrieval of data matching certain predicates.
The compelling property of GiST and GIN is that they rely on a small
number of operators that must be implemented by the author of a data
type: this was the principal contribution of the original GiST papers,
and GIN uses a smaller subset of the very same operators. I then
thought, “why isn’t there such a structured interface for Django?”
Take note that this is a very different question than a holy grail of
“automatically make my application support partitioning,” but rather
“can I even come up with a rigorous checklist to help reduce the
uncertainty in planning partitioning features and allow for design of
tools around a generally useful approach.” With this in mind, I think
that the research in GiST is useful as an inspiration to such an
Here’s a paragraph to ease one into my frame of thought, taken from
the Postgres documentation; the operator names (‘same’, ‘consistent’,
etc) are important to note.
There are seven methods that an index operator class for GiST must provide. Correctness of the index is ensured by proper implementation of the 'same', 'consistent' and 'union' methods, while efficiency (size and speed) of the index will depend on the 'penalty' and 'picksplit' methods. The remaining two methods are 'compress' and 'decompress,' which allow an index to have internal tree data of a different type than the data it indexes.
Very Roughly sketched:
- compress/decompress: I am unsure what these would mean, exactly; they may be the identity function.
- same: is a new partitioning scheme going to result in changed routing information? This is used to decide if new splits are necessary or not. This operator does not make a direct appearance in the original GiST paper, but does in the PostgreSQL implementation.
- consistent: finding the right chunk of hardware to serve a query or web request (contains the partition that query or web request is pertinent to)
- union: consolidation of servers
- picksplit: split a server into two partitions
- penalty: estimate the cost of putting a partition somewhere
GIN may be a more appropriate mental model for some of this, but keep
in mind that GIN requires a btree over the keys and doesn’t address
how to make that btree (which GiST does). GIN could probably be
defined as a GiST type, in theory, to take into account that btree.
I also considered GIN in this analysis, too, since in practice
implementing either one tend to require the same basic pieces; as a
result, many datatypes support both when there’s a useful non-range
query to be satisfied. GIN has a narrower interface, although it
“cheats” by relying on a self-balancing B-tree over the keys it
indexes, whereas GiST does not assume the existence of such a tree,
and forces you to define operators to build such structures.
Of more general note, it looks like consistent, but fine-grained
partitions could be pretty useful, so it’s not safe to assume that the
number of partitions is small; thus, some cleverness in managing a
large number of fine-grained partitions across servers can be called
for in some cases: