Metalinguistic Abstraction

Computer Languages, Programming, and Free Software

Django file and stream serving performance Gotcha


Recently I’ve been doing a little bit of work with the Django web framework for Python. Part of this project involves reasonably efficient binary file streaming to and from the server. There is currently a patch in trac (#2070) slated for acceptance, so I applied it and tried copying some files in and out through the web server. I have some problems with the particulars of that patch and intend to detail those complaints, but that’s for another post. What I discovered was an annoying performance gotcha in simply reading back binary files to be served to the user.

The gotcha is simple to expose:

In a Django view, use the documented functionality of passing a file-like object to the response object from the view; preferably a big, binary one. So you do something like this:

return HttpResponse(open('/path/to/big/file.bin'))

And then you surf on over to localhost and try grabbing this file. Your hard drive whirs and you notice your CPU usage is at 100% while the file is served slowly. Most people then rationalize it away, saying “well, of course, Python is slow, so it makes sense that it would suck at this. Set up a dedicated static file server written in C and use some URL routing incantations.”

The crucial information that I had to dig for is how Django emits bytes to users. Django calls iter() on the input object and then calls .next() to grab more bytes to write out to the stream. Once you factor in that the default iter() behavior for an open file in Python is to read lines, you realize there’s an enormous amount of wasted time and needlessly evil buffering going on just to emit chunks of the file separated by (in the case of binary files) completely arbitrarily spaced newline bytes. The result is lots of heap abuse as well as lots of burned CPU time looking for these needles in the haystack.
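
To see the difference concretely, here is a small sketch (the file path is a placeholder and neither function is Django code) contrasting the default line-based iteration with fixed-size read() calls:

def sum_by_lines(path):
  # What effectively happens today: iterating an open file scans for
  # newline bytes and yields arbitrarily sized "lines".
  total = 0
  for line in open(path, 'rb'):
    total += len(line)
  return total

def sum_by_chunks(path, chunk_size=1024 ** 2):
  # What we want instead: fixed-size reads, no newline scanning.
  total = 0
  flo = open(path, 'rb')
  while True:
    data = flo.read(chunk_size)
    if not data:
      break
    total += len(data)
  flo.close()
  return total

Run both against a large binary file and the chunked version should finish using a small fraction of the CPU time.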

The hack to address this is very simple: we write a tiny iterator wrapper that simply uses the read(size) call. It can look something like this:

class FileIterWrapper(object):
  def __init__(self, flo, chunk_size = 1024**2):
    self.flo = flo
    self.chunk_size = chunk_size

  def next(self):
    data = self.flo.read(self.chunk_size)
    if data:
      return data
    else:
      raise StopIteration

  def __iter__(self):
    return self

1024 ** 2 bytes is one megabyte per chunk. When using this iterator the logic is simple, and the result is that Python consumes very little CPU time and memory to rip through a file stream. It can be applied to the previous example like so:

return HttpResponse(FileIterWrapper(open('/path/to/big/file.bin')))

Now everything is fast and happy and running as it should.

So what should Django do about this? It could just be written off as an idiosyncrasy of the framework, but I think the case is strong that Django should inspect for file-like objects and use more aggressive calls to .read() to prevent such unpredictable behavior. One problem with such large (1 MB) read()s is that they may block for too long instead of trickling bytes to the user, so some asynchronous I/O strategy would be better.
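
A minimal sketch of the inspect-and-wrap part of that idea (a hypothetical helper, not actual Django internals; the real HttpResponse handling is more involved):

def coerce_content(content, chunk_size=1024 ** 2):
  # If the content looks file-like (it has a read() method), iterate it
  # in fixed-size chunks rather than relying on line-based iteration.
  if hasattr(content, 'read'):
    return FileIterWrapper(content, chunk_size)
  return content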

There’s no reason why a small to moderately sized site should get hosed performance-wise because several people are downloading binary files from a Django server via mod_python or WSGI.

Finally, proper error handling for disposing of the file descriptor in the above examples is left as an exercise for the reader. I suggest using the “with” statement, which can currently be imported from __future__.
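
For example, a rough sketch of a generator flavor of the wrapper, assuming Python 2.5 where the statement lives behind a __future__ import (file_chunks is a made-up name):

from __future__ import with_statement

def file_chunks(path, chunk_size=1024 ** 2):
  # The with block closes the file when iteration finishes or the
  # generator is discarded, even if an exception interrupts the stream.
  with open(path, 'rb') as flo:
    while True:
      data = flo.read(chunk_size)
      if not data:
        break
      yield data

This could then replace the open() call in the earlier examples, e.g. return HttpResponse(file_chunks('/path/to/big/file.bin')).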


Written by fdr

February 12, 2008 at 1:51 pm

Posted in django, projects, python


9 Responses


  1. It would be nice to just make use of the sendfile syscall, as present on Linux etc.

    Adomas

    February 12, 2008 at 2:33 pm

  2. Thanks for this. Also, including the `mimetype’ named param may help the client-side application interpret whatever file you’re dealing with properly. Not a shocking fact, but useful.

    I sympathise with your complaints about Django’s static file skittishness. It’s obnoxious for developing and stuff, but I doubt the Django team wants to accept the responsibility for creating a robust and reliable static web server. Besides, you can enable static files w/o too much trouble.

    wilsoniya

    March 24, 2008 at 3:37 pm

  3. @wilsoniya

    It is true, and I don’t ask it to be a *good* static file server, but the current performance penalty is atrocious…probably a factor of 50 to 100 in terms of CPU usage.

    Also, client-side interpretation is neither here nor there with regard to this issue; the issue is that it’s extremely easy for a naive developer to type something like “open(‘file’)” and then hand the file handle to Django to be sent. Django takes the very sensible action of making an iterator of the file object, and by default Python’s iterator over a file seeks to return one line at a time. Or, more to the point, any series of bytes terminated by a newline byte.

    When one wants to simply send something…such as open(‘movie.mkv’)…this seeking of newline bytes is a huge waste of time. Instead, something like the FileIterWrapper documented in the post will make the process much, much more efficient.

    My only beef with this is that it hits someone who makes this honest, common mistake far too hard. It also may reflect badly on Django’s performance, whereas it’s not really Django’s fault.

    fdr

    April 7, 2008 at 1:59 pm

  4. In my opinion, Django should at least try to make use of the optional file_wrapper stuff in WSGI (http://www.python.org/dev/peps/pep-0333/#optional-platform-specific-file-handling) if it exists, and provide a fast implementation itself for that specific case if the server doesn’t expose one.

    djc

    August 7, 2008 at 1:24 am

  5. Perhaps the built-in file class could be subclassed to return 16k chunks instead of lines for files that have been opened as binary:


    class file(file):
        def __iter__(self):
            if 'b' in self.mode:
                return iter(lambda: self.read(1024 * 16), '')
            else:
                return self

    Or Django could incorporate similar logic, or just always serve open files as chunks. There’s no point seeking for lines when it’s not doing anything with them.

    For big chunk sizes, I don’t think using alarms or more threads is going to help you. AFAIK, during a request, one thread is dedicated to servicing that request, right until it completes. Breaking out of a long write is not going to free that thread to service other requests unless you stop servicing the request entirely. The chunk size is probably only a concern if you’re planning on servicing a lot of concurrent downloads, when you might need to reduce memory consumption.

    Gavin Panella

    August 7, 2008 at 5:16 am

  6. @Gavin Panella

    Large read() calls may be problematic — for reasons besides memory usage — if they block for an uncomfortable period of time before returning, especially if your disk is highly contended for or slow (or over a network).

    What I’m really trying to approximate here is asynchronous I/O where you are regularly given chances to empty a buffer that is filled as quickly as possible to allow more responsive streaming of bytes. A basic consumer/producer issue.

    This problem is present any time read() may block for an extended period while you could be delivering chunks of the stream that you’ve already received.

    fdr

    August 7, 2008 at 12:10 pm

  7. hey! Saw this one coming and found your post via google. Splendid work, totally agree. I didn’t need to implement with, I just hit ‘close’ on self.flo just before raising StopIteration. Thanks for the help!

    bambam

    March 11, 2009 at 3:37 am

    • @bambam
      Just beware of any exceptions… ‘with’ is nice for avoiding a try/except/finally block to clean up correctly.

      fdr

      April 5, 2009 at 6:59 pm

  8. Hey.
    I’m going to use your class; it’s very useful, and not only for Django.
    However, as you mentioned, it’s much cleaner and more beautiful to use a with block. With blocks require __enter__ and __exit__ methods, so I added them to your class and am using it :)


    def __exit__(self, type, value, traceback):
      self.flo.__exit__(type, value, traceback)

    def __enter__(self):
      self.flo.__enter__()
      return self

    Nice article :)

    Fábio Santos

    September 20, 2011 at 7:14 am

