Metalinguistic Abstraction

Computer Languages, Programming, and Free Software

Large Binary Data is (not) a Weakness of Erlang

with 4 comments

Update: Good news in the comments, with some good details. To sum up:

split_binary/2 is the call to use. Binaries are also refcounted instead of garbage collected. The next release will fix large binary pattern matching due to improper handling of bignums in the current release. So the internet wins again and the original blog author should be happy.

Finally — even if we get all this bookkeeping right — we have to deal with the possibility of obscene wastage. Suppose someone loads five hundred megabytes of binary data into memory, takes a five hundred byte slice, and discards the old instance. We can’t deallocate just parts of an array with standard [mc]alloc(), so we have to make a decision on how many bytes of wastage is worth not copying the salient part of the array to a fresh, appropriately sized heap allocation. Sometimes it may be clear, but with lots of small-ish binary instances wasting some memory it may be less clear. Or I guess we could just call realloc() every time, which may be a little bit overkill but would be simple-ish and predictable…

One thing that does remain, however, is shrinking the region allocated for the binary memory. There is a danger of keeping big chunks of binary around when one has taken a tiny slice, so either some GC work needs to be put in place to realloc() properly or the user just needs to be cognizant and copy the binary if they feel that it may release a large chunk of memory and then let all previous references die.

Venturing onward on this posting should be reserved for archivists and the curious.

The following should only be read by the curious. I’m just keeping for…posterity?

Erlang’s binary manipulation procedures are definitely not suited for large-scale bit-management, especially considering how a seemingly deficient pattern matching implementation is the primary means of selecting slices of a binary. Jumping 2**27 – 1 bytes at a time just seems so wrong. I poked around a while to find an Erlang binary++ third-party package or some sort of built-in that would ease my disquiet, but alas, I found nothing. This is most saddening, because my naive instinct is that the act of taking a slice of a binary chunk should be practically free with the no-mutate contract with Erlang. It may, however, require addressing of the following issues:

Obligatory Disclaimer: The following is based on my general understanding of the world and not the fine-level details of Erlang’s implementation in these areas. I would certainly invite anyone intimately familiar with Erlang’s internals to share corrections or insight.

  • In the naive-implementation one could allocate a new Erlang binary with a pointer into somewhere in the middle of another Erlang binary. This is happy because of no-mutate until someone goes and rudely disposes the underlying binary as garbage or moves it (presuming Erlang has a compacting/moving garbage collector).
  • We could solve some of this with another level of indirection (if not present already) that decouples the Erlang binary instance from the binary data itself and decorating the data with a refcount or somesuch. Hopefully the garbage collector will have a finalizer-type hook that will allow the refcount to be updated properly when one of our referrers gets destroyed. Obviously for this kind of thing to work we want to keep our binary data more or less outside the garbage collector’s knowledge by using a fundamental system call ala [mc]alloc(), unless we want to get a whole lot more complicated and let it get compacted/moved and be capable of updating all the referrers’ pointers.
  • Finally — even if we get all this bookkeeping right — we have to deal with the possibility of obscene wastage. Suppose someone loads five hundred megabytes of binary data into memory, takes a five hundred byte slice, and discards the old instance. We can’t deallocate just parts of an array with standard [mc]alloc(), so we have to make a decision on how many bytes of wastage is worth not copying the salient part of the array to a fresh, appropriately sized heap allocation. Sometimes it may be clear, but with lots of small-ish binary instances wasting some memory it may be less clear. Or I guess we could just call realloc() every time, which may be a little bit overkill but would be simple-ish and predictable…
  • Where should one expose this? Should the pattern matcher be ‘fixed’? Should it just be some functions in a module like “binary” (much as we have “lists” and “dict” and so on)? Should we write the former and then write some parse_transform statements to mangle it into the latter? (That sounds possible, but probably painful…)

In other words, a lot more work than one may think. Unless we just copy stuff and not fix the binary pattern matching, and then it’s all relatively easy. I think. But then we don’t get O(1) slicing!

In this case, I’d probably vote for the Worse-is-Better approach: copy data to get O(N)-speed copies to avoid trying to mimic copy-on-write while fighting with the garbage collector and just write additional functions instead of poking at the pattern matcher. Then throw that copy away and write a better one that gets O(1) slicing. Then fix the pattern matcher.

On the subject of largish-files from disk: where is my mmap functionality? Does that exist? Another question for another post…my cursory web search insofar is not fruitful, so either no one talks about it because it’s so nicely abstracted into Erlang somehow or because it’s not being employed.

Addendum:

Getting momentum may be as simple as writing a low-level Port Driver that will offer a somewhat painful (but functional) way to address these issues. If enough people care it’ll be worth getting into mainstream Erlang. (And making sure it runs on all those platforms et al)

Actually, one might want to think of this as a Functor/Monad, although performing the advanced category mappings in C might be a truly painful task, depending on what one wants to support.

About these ads

Written by fdr

July 9, 2007 at 3:24 am

Posted in erlang, languages

4 Responses

Subscribe to comments with RSS.

  1. You can create a binary as a pointer into another binary:

    {_, B2} = split_binary(B1, 4711).

    And, from the Erlang FAQ:
    “Normal data in Erlang is put on the process heap, which is garbage collected. Large binaries, on the other hand, are reference counted.”

    Per Melin

    July 10, 2007 at 4:18 am

  2. Selecting slices in binaries is O(1). The reason there were problems to skip ahead in binaries was because binary pattern matching did not handle bignums correctly. This will be fixed in the next release.

    There is a danger in referencing binaries which can lead to wasted memory as you mentioned. As far as I know the gc does nothing to fix this.

    There is already built in functions to handle binaries e.g. split_binary/2. Which divides a binary into two at O(1) cost.

    If you want to have large binaries handled correctly now you should use a 64-bit machine, this holds true even if pattern matching is fixed. Because the sizes of binaries on 32-bit machines is limited. Definitely to smaller than 4 GB, but the limit is perhaps even smaller than that.

    Per Gustafsson

    July 10, 2007 at 6:41 am

  3. […] Large Binary Data is (not) a Weakness of Erlang Update: Good news in the comments, with some good details. To sum up: split_binary/2 is the call to use. Binaries are […] […]

  4. > the original blog author should be happy.

    I am indeed. :-)

    I posted that entry on the off chance that someone else might run into this rare problem and need a quick solution. Having the issue treated as a bug and fixed in the language itself was a most pleasant surprise.

    Philip Robinson

    July 12, 2007 at 4:17 am


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: