Large Binary Data is (not) a Weakness of Erlang
Update: Good news in the comments, with some good details. To sum up:
- split_binary/2 is the call to use.
- Binaries are refcounted rather than garbage collected.
- The next release will fix large-binary pattern matching; the current release mishandles bignums there.

So the internet wins again, and the original blog author should be happy.
One thing that does remain, however, is shrinking the region allocated for the binary data. There is a danger of keeping a big chunk around after taking a tiny slice of it, so either some GC work needs to be put in place to realloc() appropriately, or the user needs to be cognizant enough to copy the binary when doing so would release a large chunk of memory, and then let all previous references die.
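The shrink decision above can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not how the Erlang VM actually does it: the `slice_t` layout, `maybe_shrink` name, and `WASTE_THRESHOLD` value are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy: if the last live slice of a buffer keeps far more
 * memory alive than it uses, copy the slice out so the buffer can be freed. */

#define WASTE_THRESHOLD 4096  /* bytes of waste we tolerate before copying */

typedef struct {
    unsigned char *base;  /* start of the backing allocation */
    size_t cap;           /* size of the backing allocation */
    unsigned char *data;  /* start of the live slice */
    size_t len;           /* length of the live slice */
} slice_t;

/* Called when this slice becomes the last reference to `base`. */
static void maybe_shrink(slice_t *s) {
    if (s->cap - s->len <= WASTE_THRESHOLD)
        return;                        /* waste is tolerable; keep as-is */
    unsigned char *fresh = malloc(s->len ? s->len : 1);
    if (!fresh) return;                /* on OOM, just keep the big buffer */
    memcpy(fresh, s->data, s->len);    /* O(len) copy of the salient part */
    free(s->base);                     /* release the oversized region */
    s->base = s->data = fresh;
    s->cap = s->len;
}
```

The threshold is exactly the judgment call discussed above: calling realloc()/copying every time is predictable, while a threshold trades a bounded amount of waste for fewer copies.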
Everything below predates the update and should be reserved for archivists and the curious. I'm keeping it for… posterity?
Erlang’s binary manipulation procedures are definitely not suited for large-scale bit management, especially considering that a seemingly deficient pattern-matching implementation is the primary means of selecting slices of a binary. Jumping 2**27 – 1 bytes at a time just seems so wrong. I poked around for a while hoping to find a third-party Erlang binary++ package, or some sort of built-in, that would ease my disquiet, but alas, I found nothing. This is most saddening, because my naive instinct is that taking a slice of a binary should be practically free under Erlang’s no-mutate contract. It may, however, require addressing the following issues:
Obligatory Disclaimer: The following is based on my general understanding of the world and not the fine-level details of Erlang’s implementation in these areas. I would certainly invite anyone intimately familiar with Erlang’s internals to share corrections or insight.
- In the naive implementation, one could allocate a new Erlang binary with a pointer into the middle of another Erlang binary. This is happy because of no-mutate, right up until someone rudely disposes of the underlying binary as garbage or moves it (presuming Erlang has a compacting/moving garbage collector).
- We could solve some of this with another level of indirection (if not present already) that decouples the Erlang binary instance from the binary data itself and decorates the data with a refcount or somesuch. Hopefully the garbage collector has a finalizer-type hook that allows the refcount to be updated properly when one of our referrers gets destroyed. Obviously, for this kind of thing to work we want to keep the binary data more or less outside the garbage collector’s knowledge by using a fundamental allocation call à la [mc]alloc(), unless we want to get a whole lot more complicated, letting the data be compacted/moved and building the machinery to update all the referrers’ pointers.
- Finally, even if we get all this bookkeeping right, we have to deal with the possibility of obscene wastage. Suppose someone loads five hundred megabytes of binary data into memory, takes a five-hundred-byte slice, and discards the old instance. We can’t deallocate just part of an array with standard [mc]alloc(), so we have to decide how many bytes of wastage are worth not copying the salient part of the array to a fresh, appropriately sized heap allocation. Sometimes it may be clear, but with lots of small-ish binary instances each wasting some memory it may be less so. Or I guess we could just call realloc() every time, which may be a bit of overkill but would be simple-ish and predictable…
- Where should one expose this? Should the pattern matcher be ‘fixed’? Should it just be some functions in a module like “binary” (much as we have “lists” and “dict” and so on)? Should we write the former and then write some parse_transform statements to mangle it into the latter? (That sounds possible, but probably painful…)
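The indirection-plus-refcount idea from the list above can be made concrete with a small C sketch. All names (`bin_data_t`, `binary_t`, `bin_slice`, and so on) are hypothetical, and real VM internals would be far more involved; the point is only that a slice can be an (offset, length) view that bumps a shared refcount, making slicing O(1).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical refcounted binary: payload lives in one heap block that
 * carries a refcount; each binary instance is just a view into it. */

typedef struct {
    size_t refcount;
    size_t size;
    unsigned char bytes[];   /* flexible array member holding the payload */
} bin_data_t;

typedef struct {
    bin_data_t *owner;       /* shared, refcounted backing store */
    size_t off, len;         /* the slice this instance exposes */
} binary_t;

static binary_t bin_new(const unsigned char *src, size_t n) {
    bin_data_t *d = malloc(sizeof *d + n);
    d->refcount = 1;
    d->size = n;
    memcpy(d->bytes, src, n);
    return (binary_t){ d, 0, n };
}

/* O(1) slice: no bytes are copied, only the refcount moves. */
static binary_t bin_slice(binary_t b, size_t off, size_t len) {
    b.owner->refcount++;
    return (binary_t){ b.owner, b.off + off, len };
}

/* Finalizer-type hook: frees the payload once the last view dies. */
static void bin_release(binary_t b) {
    if (--b.owner->refcount == 0)
        free(b.owner);
}
```

This is exactly the scheme that invites the wastage problem above: a tiny `binary_t` view pins the whole `bin_data_t` alive until its refcount drops.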
In other words, a lot more work than one might think. Unless we just copy stuff and don’t fix the binary pattern matching, in which case it’s all relatively easy. I think. But then we don’t get O(1) slicing!
In this case, I’d probably vote for the Worse-is-Better approach: copy the data, accepting O(N) slices, rather than try to mimic copy-on-write while fighting with the garbage collector, and write additional functions instead of poking at the pattern matcher. Then throw that version away and write a better one that gets O(1) slicing. Then fix the pattern matcher.
On the subject of largish files from disk: where is my mmap functionality? Does it exist? Another question for another post… my cursory web search so far has not been fruitful, so either no one talks about it because it’s so nicely abstracted into Erlang somehow, or it simply isn’t being employed.
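For reference, here is roughly what an mmap-backed binary could look like underneath: map a file read-only and expose the pages directly as a byte array, with no read-through buffer. This is a POSIX-only sketch with abbreviated error handling; the `map_file` helper is my own invention, not an Erlang or OTP API.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only; returns NULL on any failure.
 * The kernel pages the bytes in lazily, so "loading" a huge file is cheap. */
static const unsigned char *map_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping survives the close */
    if (p == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return p;
}
```

A slice of such a mapping is again just a pointer and a length, which is precisely why O(1) slicing and mmap-backed binaries fit together so naturally.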
Getting momentum may be as simple as writing a low-level Port Driver that offers a somewhat painful (but functional) way to address these issues. If enough people care, it’ll be worth getting it into mainstream Erlang (and making sure it runs on all those platforms, et al.).
Actually, one might want to think of this as a Functor/Monad, although performing the advanced category mappings in C might be a truly painful task, depending on what one wants to support.