Thursday, July 31, 2014

Javascript Performance: Synchronous vs. Asynchronous FileReader

I learned something surprising about Javascript that doesn't carry over intuitively from other languages.

HTML5's FileReader API lets you read files from disk. The user must choose the file(s), but then you have permission to load them in text or binary format. If you've used it before, you know that it's asynchronous, kind of like XMLHttpRequest: you set up a FileReader with callback functions before you perform a read:

var reader = new FileReader();
reader.onload = function(event) {
  // ... event.target.result has file contents ...
};
reader.readAsText(file);

I've heard it said that asynchronous operations in Javascript run slower than synchronous ones. I assume this has to do with processor scheduling, the overhead of extra function calls, and so on. Asynchronous code can be a little cumbersome to use, too, and it can lend itself to spaghetti code.

But did you know that there is a synchronous FileReader? No callbacks necessary:
var contents = new FileReaderSync().readAsText(file);

Nice, huh! There's one catch, though: it's only available in web workers. This is because reading a file synchronously is a blocking operation and would lock up the web page. Therefore, you can only use FileReaderSync in a separate thread.
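
To illustrate with a minimal sketch (the worker file name and the message shapes are made up for this example), the page hands the File object to a worker, and the worker reads it synchronously:

// main.js: send the user's File object to a worker
var worker = new Worker('read-worker.js');
worker.onmessage = function(event) {
  console.log('File contents:', event.data);
};
worker.postMessage(file);

// read-worker.js: read the file synchronously, post the text back
self.onmessage = function(event) {
  var contents = new FileReaderSync().readAsText(event.data);
  self.postMessage(contents);
};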

When I rewrote Papa Parse for version 3.0, which would support running in a web worker, I was excited to use this new FileReaderSync thing. One cool feature of Papa is that it can "stream" large files by loading them in pieces so that they fit into memory. Both FileReader APIs support this "chunking" mechanism, so I took advantage of it.
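
Chunking works the same way with either reader: slice the File and read one piece at a time. Roughly, the synchronous flavor looks like this (a sketch with an arbitrary chunk size, not Papa's actual code):

var CHUNK_SIZE = 1024 * 1024 * 5; // 5 MB per chunk, for example
var reader = new FileReaderSync();
for (var offset = 0; offset < file.size; offset += CHUNK_SIZE) {
  var chunk = file.slice(offset, Math.min(offset + CHUNK_SIZE, file.size));
  var text = reader.readAsText(chunk);
  // ... parse this piece; carry any partial last row into the next chunk ...
}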

Finally, it was ready to test. I executed it and roughly measured its performance. My first test was in the main thread using FileReader (asynchronous):

That looked right. It took about 11 seconds to process just under five million rows. I was happy with the results.

Then I tried it in the worker thread using FileReaderSync. Like its asynchronous sister code, this routine was periodically sending updates to the console so I could keep an eye on it. Rather abruptly, updates stopped coming. I had to resort to system-level monitors to ensure my thread hadn't locked up or died. It was still there, but its CPU usage dropped significantly and updates slowed to a crawl:

It took nearly 20x longer to process the file in a dedicated worker thread using a synchronous file reader! What!?

The jagged line was expected: the results had to be sent to the main thread after every chunk for reporting. (In Javascript, data has to be copied between threads, rather than sharing memory. D'oh.)

Isn't it odd that its speed begins a sharp, steady decline at exactly 30 seconds?

Here are the two graphs on the same plot:

Obviously, this was unacceptable. What was going on?

I asked the folks on Stack Overflow and Google+, and guess where I got the answer? Google+, believe it or not. Tyler Ault suggested trying the regular, asynchronous FileReader in the worker thread.

I did and it worked.
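
The fix was just the familiar callback-style reader, moved into the worker (again, a sketch):

// inside the worker
self.onmessage = function(event) {
  var reader = new FileReader(); // the async reader works in workers, too
  reader.onload = function(e) {
    self.postMessage(e.target.result);
  };
  reader.readAsText(event.data);
};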

Performance in the worker then was comparable to the main thread. No slowdowns. Why?

I don't know for sure. A couple theories were discussed:

  • The garbage collector may only run when the CPU is idle on that Javascript thread. Since everything was synchronous, there was never any time to breathe and clean things up. Memory had to be shifted around a lot and maybe even swapped out (though I never exhausted my system memory; maybe the browser puts a cap on it).
  • The fact that it slowed at exactly 30 seconds is interesting. It's possible the browser was doing some throttling of threads that ran for that long without a pause. (Though that seems to defeat the purpose of using worker threads.)
If you have more light to share on the topic, please feel free to comment (and +mention me to be sure I get it) or tell me on Twitter @mholt6 or something.

Monday, July 28, 2014

GoConvey for Gophers

GoConvey is a testing utility for Go built around the standard testing package and go tools. If you're not familiar with GoConvey yet, take a look, maybe try it out, or watch the intro video before finishing this article.

Mike (@mdwhatcott) and I (@mholt6) made GoConvey to improve our workflow and to help us write clearer and more deliberate tests.

I wanted to point out a few things that may not be clear or obvious about GoConvey. Then I'll discuss why it is or isn't a good choice for you and your project.

"GoConvey" means a few things.

  • In your editor, it's a testing framework (namely the Convey and So functions).
  • In the terminal, goconvey is an HTTP server that auto-runs tests.
  • At http://localhost:8080, it's a graphical interface for using go test and viewing results.
  • In your compiled binaries, it's nothing. (More on this later.)

GoConvey is loosely coupled.

That means you can use the DSL without the goconvey server. And vice-versa.

Many developers use the web UI to auto-run and visualize standard Go tests that don't use GoConvey's special functions. GoConvey can run tests built with other testing frameworks as long as they also use go test.

By using GoConvey, you're not automatically building another barrier to entry: you're automatically building and running Go tests.

On the flip-side, it's totally normal to just keep running go test from the command line manually while using GoConvey's syntax to structure tests. (If that's what you prefer.)

GoConvey runs go test.

Look! Nothing up my sleeve. All goconvey really does is run go test. Yeah, it uses the -cover flag and stuff, but all the output and data it generates comes from the go command.

go test runs GoConvey.

GoConvey tests are in _test.go files and inside Test functions with the testing.T type, just like regular tests.

It's a good idea to use regular tests when you don't want the BDD structure or when it becomes unnatural to describe your tests in the BDD style. You can use both kinds of tests in the same package and file.
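
For instance, both styles can live in the same file (a sketch; Atoi here is a stand-in for whatever function you're testing):

package parser

import (
	"testing"

	. "github.com/smartystreets/goconvey/convey"
)

// A regular Go test...
func TestAtoi(t *testing.T) {
	if got := Atoi("42"); got != 42 {
		t.Errorf("expected 42, got %d", got)
	}
}

// ...and a GoConvey test, side by side.
func TestAtoiBehavior(t *testing.T) {
	Convey("Parsing the string \"42\"", t, func() {
		So(Atoi("42"), ShouldEqual, 42)
	})
}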

GoConvey is not a CI tool any more than go test is.

Sometimes I see people refer to GoConvey as a CI (continuous integration) utility. Maybe they define that differently than I do, but the only thing "continuous" about GoConvey is the auto-test feature and the only "integration" features it has are with go test (and maybe with the file system, if you count polling the project directory for changes).

If it works for you in a CI environment, great! (Would you tell me how?) You can write tests using the DSL and run those in CI jobs, but beyond that, the web UI is interactive and I don't think the goconvey server that powers it is useful in automated scripts...

Test dependencies are not compiled into your programs.

Except when running go test, your Go programs are not built with test files or their dependencies. This means that using a testing framework like GoConvey has no technical effect on your compiled program. The file size and function of your resulting binary are unchanged.

Even go-getting any package does not download its test dependencies. You need the -t flag for that.

Test dependencies are thus very deliberately obtained and are only needed by your library's contributors/hackers. Most people will never need to concern themselves with your testing framework or assertions package of choice.

GoConvey documents code beyond godoc.

Godoc is Go's convenient, built-in documentation system. It reads comments directly inline with your code to produce package, function, and variable documentation; examples; known bugs; and more.

Far from replacing godoc, GoConvey's DSL complements it. While godoc documents how to use a package, Convey tests document and enforce package functionality and behavior.

For example, suppose you wrote a bowling game. The godoc would probably explain how to use functions like NewGame, Score, and Roll, along with what they do. This is useful for users of the package, but it leaves a lot to be assumed by developers who want to start hacking away on it. To fix this, you could make your godoc much more verbose and describe how the package is supposed to behave, but that's just noise for users who don't need to know about the internals, and developers still don't have any assertions that prove the program works like you say. You could write regular tests, but then you have to hope the comments stay true to the code.

This is where behavioral tests come in. The GoConvey tests for the bowling game make it clear exactly what the program should do normally and in edge cases; plus, it asserts correct behavior in context of the test case. The tests actually convey intent and become that missing documentation.
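
A taste of what such a test might look like, using the NewGame, Roll, and Score functions from above (a sketch, not the actual tests from the GoConvey examples):

func TestGutterGame(t *testing.T) {
	Convey("Given a new bowling game", t, func() {
		game := NewGame()

		Convey("When all 20 rolls are gutter balls", func() {
			for i := 0; i < 20; i++ {
				game.Roll(0)
			}

			Convey("The final score should be 0", func() {
				So(game.Score(), ShouldEqual, 0)
			})
		})
	})
}

Each nested Convey reads as a sentence, so the test doubles as a spec of the game's behavior.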

See, by using godoc along with descriptive tests, you've now sufficiently documented your code for both users and developers.

Deciding when and how to use GoConvey

Should everyone use GoConvey for everything?

No.

GoConvey is all about workflow and intent. It's not for everyone in every situation. I don't even use it all the time.

It does some things very well, but other things not so much.

Does well:

  • Documents your code with descriptive, structured tests
  • Auto-runs tests
  • Assertions
  • Reports results in real-time in the browser or via desktop notifications
  • Integrates with go test and standard testing package
  • Stubs out test code for you

Doesn't do well:

  • Automated environments (regular go test works fine though)
  • Run outside of $GOPATH
  • Save battery life
  • Race detection
  • Truly randomized test execution order
  • Idiomatic Go (I'm referring to the DSL)

There’s no magic formula that I know of to determine whether you should use GoConvey one way or another. But here are a few ideas to try:

  • Execute the goconvey command and open the web UI on your existing Go tests. See how it feels.
  • If you are starting a new project, check out the examples folder and try a new way of writing tests using GoConvey's DSL (careful, you might like it). Make sure the GoConvey server is running so you get instant feedback after every save. (Or take a look at the tests for Go-AWS-Auth for a real use case.)
  • If you don't like the nested BDD structure, try using a single level of Convey instead. This way your tests can still be somewhat descriptive and you get the benefit of the many So assertions. (For example, you can still run table-driven tests using a for loop inside a single Convey level; see the sketch after this list.)
  • Use Go's dot-import for the convey package so you can make it "one of your own," so to speak. Way more convenient.
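
Here's roughly what that single-level, table-driven approach looks like (Square is a made-up function under test; this assumes the dot-import from the previous bullet):

func TestSquare(t *testing.T) {
	cases := []struct{ in, expected int }{
		{1, 1},
		{2, 4},
		{3, 9},
	}

	Convey("Squaring small integers", t, func() {
		for _, c := range cases {
			So(Square(c.in), ShouldEqual, c.expected)
		}
	})
}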

One of our hopes for GoConvey is to make writing tests more of a pleasure than any old task on your to-do list. Watching the coverage bars go up, up, up and seeing that orange "FAIL" turn into a big green "PASS" is seriously motivating.

Tuesday, July 15, 2014

Papa Parse 3.0 is here, and it's boss

After months of re-writing Papa Parse from the ground up, version 3.0 is finally here. In case you don't know, Papa Parse is a CSV library for Javascript. With it, you can parse CSV files or strings into JSON, and convert JSON back to CSV.

(Nerd alert: Today was also the day of csv,conf in Berlin. Isn't that awesome? I had no idea.)

Papa Parse 3 is a major release. Here's a quick look at what's new.

First, a quick warning: 3.0 is a breaking change. It's not a drop-in replacement for 2.1... you will break your app! The API is slightly different, and the results are structured differently. Read on for more information.

New results structure

Previously, parse results were returned as "results", "errors", and "meta", and "results" contained the parsed data, or if using a header row, "fields" and "rows". This was confusing and led to awkward code that looked like results.results.rows[0][2] to access any data. The new structure of results is much more consistent and intuitive:

{
  data:   // array of parse results
  errors: // array of errors
  meta:   // object with extra info
}

The "data" property only ever contains the parsed data as an array, where each element in the array represents a row. In the case of a header row, the "fields" have been moved into "meta" where they belong.

Header and dynamic typing are off by default

In Papa Parse 2.x, the header and dynamic typing were enabled by default. Now, the default is false/off, which is more intuitive. If you want any fanciness, you have to turn it on explicitly.
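
Turning them back on is just a couple of config options (a sketch; the callback body is elided):

Papa.parse(csvString, {
  header: true,        // was the default in 2.x; now opt-in
  dynamicTyping: true, // converts numeric and boolean values
  complete: function(results) { /* ... */ }
});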

Eliminated the jQuery dependency

Papa Parse is now fully "Papa Parse" and not the "jQuery Parse Plugin" - the files and repository have been renamed to accommodate. Where before you would invoke $.parse(), now you simply call Papa.parse(). Much more elegant.

Technically, Papa Parse is still a jQuery plugin. If jQuery is defined, it still has the familiar $('input[type=file]').parse(...) binding that you may have used to parse local files. This interface has been improved and parsing files has never been easier.

Since Papa has been completely de-coupled from jQuery, it's easier to use in Node and on pages that don't have or want jQuery brought in.

Unparse - convert JSON to CSV

Papa's specialty is parsing CSV into JSON, or Javascript objects. But now it can export CSV too. It's easy to use:

var csv = Papa.unparse([
  ["1-1", "1-2", "1-3"],
  ["2-1", "2-2", "2-3"]
]);
// 1-1,1-2,1-3
// 2-1,2-2,2-3

Here we passed in an array of arrays, but you could also pass in an array of objects. Even more settings are described in the documentation.
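
With an array of objects, the keys become the header row (sketch):

var csv = Papa.unparse([
  { name: "Alice", age: 30 },
  { name: "Bob",   age: 25 }
]);
// name,age
// Alice,30
// Bob,25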

Run in web worker

Long-running scripts, like parsing large files or strings, can lock up the browser. No bueno. Papa Parse 3 can spawn a worker thread and delegate the heavy lifting away from your page. This means your page will stay responsive to mouse clicks, scrolling, etc., while heavy-duty parsing is taking place.

Web workers are actually kind of a pain in some sense, but Papa makes it easy. Just say worker: true:

Papa.parse(file, {
  worker: true,
  complete: function(results, file) { ... }
});

Download and parse files over the Internet

Papa has been able to parse local files with FileReader for a while. But now it's easy to download remote files and parse them, too. This isn't hard to do even without Papa, but the advantage here is that Papa can stream the file. So if you have a large file, say 200 MB, sitting on another machine, you can give Papa the URL and it will download the file in chunks and feed you the results row-by-row, rather than loading the whole thing into memory. Big win!

Papa.parse("/files/big.csv", {
  download: true,
  step: function(data) { ... },
  complete: function(results) { ... }
});

Those are the most notable new features and changes in version 3.0. There's a bunch of other stuff under the hood, too, that you'll benefit from.

Now maybe get your feet wet with the demo page or visit it on GitHub.

Sunday, July 13, 2014

An AWS signing library for Go

Go-AWS-Auth is a comprehensive, lightweight AWS signing library for Go. Simply give it your http.Request and it will sign the request with the proper authentication for the service you're accessing.

Other libraries like goamz can be useful and convenient, but they do come with a cost: less flexibility and a larger code base. Though Go-AWS-Auth only does signing, it is a reliable and transparent way to interact with AWS from Go. And it works directly with your http.Request objects for any AWS service.

Now making requests to AWS with Go is extremely easy:
url := "https://iam.amazonaws.com/?Action=ListRoles&Version=2010-05-08"
client := new(http.Client)

req, err := http.NewRequest("GET", url, nil)
// ... handle err ...

awsauth.Sign(req) // signs the request using credentials from the environment

resp, err := client.Do(req)
// ... use resp and handle err ...

The library is thread-safe and supports multiple AWS authentication mechanisms (the project README lists the current set).

Feel free to use it and contribute if you find ways to improve it!