Tuesday, November 18, 2014

Papa Parse 4, the fastest CSV parser for the browser

Papa Parse 4 is here.

This version sports a complete rewrite of the core parser, making it about 10 times faster in general, and up to 270 times faster in special cases. The speed multiplier is inversely correlated to the input size, but the new algorithm is still faster all around.

For its feature set, it's the fastest of all the parsers I compared. I've run hundreds of benchmarks locally, but you can also run some yourself on jsPerf.

I rewrote Papa Parse almost 6 times and tried about a dozen different ways to parse CSV in total. I won't spend time going into detail about each algorithm. But here's a summary of how the old parser worked, then a few approaches I tried that either didn't work, or worked but were too slow and complex.

Various (slow) ways of parsing CSV

The old parser iterated on each character. Though character iteration allowed us to keep a very granular amount of state as we went, it was slow, plain and simple.

So I figured that splitting on newlines and delimiters may not be a bad idea after all. Sure, quoted fields and commented lines (not part of the standard) throw things off, but we could iterate through and "repair" the broken fields if we found any. But the repair pass meant a nested loop with too many iterations, and moving things around within arrays when making repairs was too slow.

What about doing the same thing, but starting with an empty array and appending each field to the row one at a time, then each row to the results, instead of splicing existing arrays? Still a nested loop; still too slow.

I tried reversing the split order: first split on delimiters, then on newlines. I didn't expect much, and sure enough, it wasn't faster.

Then I tried splitting on delimiters, and using substring to detect newlines and append rows to the final result set. This was actually working pretty well, until... quoted fields, especially quoted fields that end a line. Things got complicated quickly.

Then the opposite occurred to me: split on newlines, then use substring to extract fields between delimiters. Still a big fat nested for loop. Performance tanked before I could even get the algorithm correct for all the edge cases.

Then a novel idea: split on quotes??  Hey, that's neat: no more looking over my shoulder for those pesky quoted fields, no more skipping opening and closing quotes. Then I could just use indexOf() and substring() between delimiters and newlines without any trouble. It was a weird way of thinking, since an element in the array that ends with a delimiter or newline indicates that a quoted field just began. Once inside a quoted field, if I encountered two consecutive empty strings, I knew the data contained a single quote character. Strange, but plausible. The more quotes were in the data, though, the slower it performed. I couldn't win! And what if the data didn't have quotes? Then it was just plain substring extraction between newlines and delimiters anyway. Wait... that could work...

To heck with split(). Even though it's a super-fast JavaScript native, the repairs were too hard. The inner loops were too slow. But that last way of splitting on quotes showed me that using substring() was pretty easy and still fast enough.

That approach, hopping between indexOf() results and copying fields out with substring(), turned out to be the fastest, simplest algorithm yet. It uses a single tight loop without any nested loops except for -- you guessed it -- handling quoted fields. There is no splicing of array elements and no string splitting.

The new algorithm

The base algorithm is fairly simple, except for when you hit a quoted field:
  1. Start with cursor = 0, row = [], results = [], nextDelim = pos of first delimiter, and nextNewline = pos of first line break
  2. loop:
    1. if input[cursor] == quote:
      1. find the closing quote using indexOf (tricky)
      2. push input.substring(cursor+1, closingQuotePos) into row, replacing "" with "
      3. advance cursor and handle end of field, line, or file as needed
      4. nextDelim = index of next delimiter, and nextNewline = index of next line break;
    2. else if nextDelim != EOF && (nextNewline == EOF || nextDelim < nextNewline):
      1. push input.substring(cursor, nextDelim) into row
      2. advance cursor
      3. nextDelim = index of next delimiter;
    3. else if nextNewline != EOF:
      1. push input.substring(cursor, nextNewline) into row
      2. push row into results, then clear row
      3. advance cursor
      4. nextNewline = index of next newline;
    4. else: break;
  3. push input.substring(cursor, EOF) into row
  4. push row into results
  5. advance cursor
  6. return results;
Basically you just hop along between delimiters and newlines, whichever comes first, and copy substrings into the results array. Quoted fields are an exception, but they are not as complex as you'd think. (Papa Parse also handles commented lines and a few other features but the performance impact is negligible.)
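Here's a minimal sketch of that loop in JavaScript. It assumes a single-character delimiter, \n line breaks, and " as the quote character, and it treats "not found" (-1 from indexOf) explicitly where the outline above just says EOF. It's illustrative, not Papa Parse's actual source:

  function parseCSV(input, delim) {
    delim = delim || ',';
    var results = [], row = [], cursor = 0;
    var nextDelim = input.indexOf(delim);
    var nextNewline = input.indexOf('\n');

    for (;;) {
      if (input[cursor] === '"') {
        // Quoted field: scan for the closing quote, skipping escaped "" pairs
        var quoteEnd = cursor + 1;
        for (;;) {
          quoteEnd = input.indexOf('"', quoteEnd);
          if (quoteEnd === -1)
            throw new Error('Unclosed quoted field');
          if (input[quoteEnd + 1] === '"') { quoteEnd += 2; continue; }  // escaped quote
          break;
        }
        row.push(input.substring(cursor + 1, quoteEnd).replace(/""/g, '"'));
        cursor = quoteEnd + 1;
        if (input[cursor] === delim) {
          cursor++;                       // end of field
        } else if (input[cursor] === '\n') {
          results.push(row); row = [];    // end of line
          cursor++;
        } else if (cursor >= input.length) {
          results.push(row);              // end of file
          return results;
        }
        nextDelim = input.indexOf(delim, cursor);
        nextNewline = input.indexOf('\n', cursor);
      } else if (nextDelim !== -1 && (nextNewline === -1 || nextDelim < nextNewline)) {
        // Unquoted field ending at a delimiter
        row.push(input.substring(cursor, nextDelim));
        cursor = nextDelim + 1;
        nextDelim = input.indexOf(delim, cursor);
      } else if (nextNewline !== -1) {
        // Unquoted field ending the row
        row.push(input.substring(cursor, nextNewline));
        results.push(row); row = [];
        cursor = nextNewline + 1;
        nextNewline = input.indexOf('\n', cursor);
      } else {
        break;  // no delimiters or newlines left
      }
    }

    // Last field and row (note: a trailing newline yields a final empty row)
    row.push(input.substring(cursor));
    results.push(row);
    return results;
  }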

The hard logic is finding the correct closing quote, because you could theoretically have a field that looks like """,""" (a delimiter with escaped quotes on either side, in a quoted field). In context, it might look like ...",""",""","... and so, to make sure we get it right, there are a few fun if statements.
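With the sketch above, that nasty field comes out the way it should:

  parseCSV('""","""');  // -> [['","']]  i.e. one field whose value is ","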

As for speed, I think the slowest part, other than pushing the data, is replacing escaped quotes "" with a single quote " character.

As far as I can tell, the basic algorithm is correct (over 80 tests prove it, with more on the way). Feel free to open a pull request if you'd like to improve it for either speed or correctness! Just make sure the tests pass and the benchmarks hold up.

Benchmarks are helpful, but be wary

Now's a good time for a reminder that benchmarks are not scripture and should be taken with some skepticism, since most benchmark numbers demand more context than they're given. For example, CSV parsers behave very differently on different types of input: input with quoted fields will parse more slowly, and wide rows perform differently than short ones. Some parsers are faster with small input, but a parser that's slower on small input may slow down more gradually as the input grows (which is usually what matters more).

From what I can see, Papa Parse slows down the most gradually.

Fast Mode

Papa Parse 4 has a new "fastMode" setting which, when enabled, parses the input even faster. By enabling fast mode, however, you assert that the input does not contain quoted fields; if it does, the quotes will not be stripped, and the results may be incorrect if any field contains delimiters, newlines, or escaped quotes. Under the hood it's basically just String.split(), but the nice thing is that all the other Papa Parse features still work. You can see the difference in these benchmarks.
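To give a rough idea, this is approximately what fast mode reduces to (a simplification; the real library still layers headers, comments, and the other features on top):

  function fastParse(input, delim, newline) {
    // Safe only when the input is guaranteed to contain no quoted fields
    var rows = input.split(newline || '\n');
    var results = [];
    for (var i = 0; i < rows.length; i++)
      results.push(rows[i].split(delim || ','));
    return results;
  }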

The 1 GB Test

In my testing, Papa Parse can now parse a ~1 GB tab-delimited file in ~45 seconds in Chrome, and ~30 seconds in Firefox, a vast improvement over the previous 3 minutes in Chrome and 75 seconds in Firefox. And with fast mode, the same file took only about 19.48 seconds in Firefox.

Change log

Without further ado: Papa Parse 4 is a new major version. Along with several bug fixes, note these changes (a combined usage sketch follows the list):
  • Handling line breaks is no longer the job of the core parser. The line break sequence (\n, \r, or \r\n) must be specified by the caller or it will be auto-detected by Papa.parse() before parsing actually begins.
  • No more ParseAbort or UnexpectedQuotes errors. It's expected that if you, the user, abort the process, you know about it, and don't want to be bothered handling another error that tells you what you already know. Also, UnexpectedQuotes is more like a warning, as it does not interfere with parsing (except slowing it down). As such, both these errors are no longer generated.
  • New fastMode setting. Enable it for way faster parsing, but only if the input doesn't have quoted fields.
  • MissingQuotes errors used to include the line number as a reference point. Counting line breaks slows down parsing considerably because line breaks may occur in quoted fields, so the line number is no longer available. However, the row index and character position within the input are still there.
  • The keepEmptyRows setting is replaced with skipEmptyLines. The default value is false, so the default behavior is that empty lines are retained in the output, which is different from previous versions. This is important for single-column CSV files that have empty fields. This also means that fancy behavior like stripping empty rows now has to be explicitly turned on, whereas before Papa was lenient with empty lines and whitespace and would take them out for you without you asking. (This was usually bad!)
  • Lines with just white space are no longer considered empty and the white space will be included in the output.
  • Comments can now start with a certain string of arbitrary length, rather than a single character. For example, comment lines can start with "//" or "=N(" instead of just "#".
  • The chunk callback is now passed three arguments: results, streamer, and if streaming a local file, the File object. The streamer has pause and resume functions exposed so you can pause parsing when receiving chunked results. Before, you could only pause parsing with the step callback.
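Putting several of those changes together, a config might look something like this. The option names come from the list above; the values and the doSomethingAsync helper are purely illustrative:

  Papa.parse(file, {
    newline: "\r\n",       // explicit line break sequence; auto-detected if omitted
    fastMode: true,        // only safe if the input has no quoted fields
    skipEmptyLines: true,  // replaces keepEmptyRows; the default is false
    comments: "//",        // comment lines can now start with a string of any length
    chunk: function(results, streamer, file) {
      streamer.pause();    // pause/resume are now exposed on the streamer
      doSomethingAsync(results.data, function() {  // hypothetical async handler
        streamer.resume();
      });
    }
  });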
Anyway, hope you like it. Feel free to give a shout if you're using Papa Parse, I'd love to know! (You can also donate if you want to; it's nice to be able to get a free lunch once in a while.)