Thursday, July 31, 2014

Javascript Performance: Synchronous vs. Asynchronous FileReader

I learned something surprising about Javascript that didn't intuitively transfer over from other languages.

HTML5's FileReader API lets you read files from disk. The user must choose the file(s), but then you have permission to load them in text or binary format. If you've used it before, you know that it's asynchronous, kind of like XMLHttpRequest: you set up a FileReader with callback functions before you perform a read:

var reader = new FileReader();
reader.onload = function(event) {
    // ... event.target.result has file contents ...
};
reader.readAsText(file);

I've heard it said that asynchronous operations in Javascript run slower than synchronous ones. I assume this has to do with processor scheduling, the overhead of extra function calls, and so on. Asynchronous code can be a little cumbersome to use, too, and it can lend itself to spaghetti code.

But did you know that there is a synchronous FileReader? No callbacks necessary:
var contents = new FileReaderSync().readAsText(file);

Nice, huh! There's one catch, though: it's only available in web workers. This is because reading a file synchronously is a blocking operation and would lock up the web page. Therefore, you can only use FileReaderSync in a separate thread.
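
Here's a minimal sketch of what that looks like. The worker file name and the message shape are my own invention: the main thread posts a File to the worker, and the worker reads it synchronously and posts the text back.

```javascript
// worker.js -- FileReaderSync exists only in worker scope.
// Hypothetical message protocol: main thread posts a File, worker
// replies with its text.
function readFileSync(file) {
    // This blocks -- but it only blocks this worker thread, never the page.
    return new FileReaderSync().readAsText(file);
}

// The guard just keeps this snippet inert outside a worker context.
if (typeof self !== 'undefined' && typeof FileReaderSync !== 'undefined') {
    self.onmessage = function(event) {
        self.postMessage(readFileSync(event.data));
    };
}
```

On the main thread you'd spawn it with new Worker('worker.js'), call worker.postMessage(file), and listen for the reply in worker.onmessage.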

When I rewrote Papa Parse for version 3.0, which would support running in a web worker, I was excited to use this new FileReaderSync thing. One cool feature of Papa is that it can "stream" large files by loading them in pieces so that they fit into memory. Both FileReader APIs support this "chunking" mechanism, so I took advantage of it.
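
To make the idea concrete, here's a rough sketch of chunked reading with the asynchronous FileReader -- my own simplified illustration, not Papa Parse's actual code. computeChunkRanges is a hypothetical helper that splits a file of size bytes into [start, end) byte ranges:

```javascript
// Split a file of `size` bytes into [start, end) ranges of at most
// `chunkSize` bytes each. (Hypothetical helper for illustration.)
function computeChunkRanges(size, chunkSize) {
    var ranges = [];
    for (var start = 0; start < size; start += chunkSize) {
        ranges.push([start, Math.min(start + chunkSize, size)]);
    }
    return ranges;
}

// In the browser, each range becomes a small Blob via File.slice(),
// and each chunk is read only after the previous one finishes.
function readInChunks(file, chunkSize, onChunk, onDone) {
    var ranges = computeChunkRanges(file.size, chunkSize);
    var i = 0;
    (function next() {
        if (i >= ranges.length) { onDone(); return; }
        var range = ranges[i++];
        var reader = new FileReader();
        reader.onload = function(event) {
            onChunk(event.target.result); // one piece of the file's text
            next();                       // only then read the next piece
        };
        reader.readAsText(file.slice(range[0], range[1]));
    })();
}
```

File.slice() just creates a lightweight Blob reference into the file, so only the chunk currently being read has to be held in memory.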

Finally, it was ready to test. I executed it and roughly measured its performance. My first test was in the main thread using FileReader (asynchronous):

That looked right. It took about 11 seconds to process just under five million rows. I was happy with the results.

Then I tried it in the worker thread using FileReaderSync. Like its asynchronous sister code, this routine was periodically sending updates to the console so I could keep an eye on it. Rather abruptly, updates stopped coming. I had to resort to system-level monitors to ensure my thread hadn't locked up or died. It was still there, but its CPU usage dropped significantly and updates slowed to a crawl:

It took nearly 20x longer to process the file in a dedicated worker thread using a synchronous file reader! What!?

The jagged line was expected: the results had to be sent to the main thread after every chunk for reporting. (In Javascript, data has to be copied between threads, rather than sharing memory. D'oh.)
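
As an aside, that copy comes from the structured clone algorithm that postMessage uses. Strings -- like the parse results here -- always get copied, but binary data has an escape hatch: an ArrayBuffer can be transferred instead of cloned, which moves ownership to the other thread without copying the bytes. A quick sketch, guarded so it only does anything inside a worker:

```javascript
// Structured cloning copies values posted between threads. One exception:
// an ArrayBuffer listed as a transferable is handed over, not copied.
var bytes = new Uint8Array([1, 2, 3]);

// importScripts only exists in worker scope, so this guard means
// "only run inside a worker".
if (typeof postMessage === 'function' && typeof importScripts === 'function') {
    // Second argument lists buffers to transfer rather than copy.
    postMessage(bytes.buffer, [bytes.buffer]);
    // After the transfer, bytes.buffer is detached on this side.
}
```

This wouldn't have helped Papa's string results, which always get copied, but it matters for large binary payloads.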

Isn't it odd that its speed declines sharply and steadily at exactly 30 seconds?

Here are the two graphs on the same plot:

Obviously, this was unacceptable. What was going on?

I asked the folks on Stack Overflow and Google+, and guess where I got the answer? Google+, believe it or not. Tyler Ault suggested trying the regular, asynchronous FileReader in the worker thread.

I did and it worked.

Performance in the worker was then comparable to the main thread's. No slowdowns. Why?
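
For reference, the working setup looked roughly like this -- my own reconstruction, not Papa Parse's actual source. It's the same callback pattern from before, just living inside the worker:

```javascript
// worker.js -- the ordinary asynchronous FileReader, used inside the worker.
// handleMessage is a name of my choosing; the message shape (a File in,
// its text out) is illustrative.
function handleMessage(event) {
    var reader = new FileReader();
    reader.onload = function(e) {
        self.postMessage(e.target.result); // the string still gets copied back
    };
    reader.readAsText(event.data); // event.data is the File from the main thread
}

// The guard keeps this snippet inert outside a worker context.
if (typeof self !== 'undefined' && typeof FileReader !== 'undefined') {
    self.onmessage = handleMessage;
}
```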

I don't know for sure. A couple theories were discussed:

  • The garbage collector only runs when the CPU is idle on that Javascript thread. Since everything was synchronous, there was never a moment to breathe and clean things up. Memory had to be shifted around a lot and maybe even swapped out (though I never exhausted my system memory; maybe the browser puts a cap on it).
  • The fact that it slowed at exactly 30 seconds is interesting. It's possible the browser was doing some throttling of threads that ran for that long without a pause. (Though that seems to defeat the purpose of using worker threads.)
If you have more light to share on the topic, please feel free to comment (and +mention me to be sure I get it) or tell me on Twitter @mholt6 or something.