BashReduce 2009/06/27
I might be crazy but I’m loving Last.fm’s BashReduce tool [1]. So much so that I’ve taken time at work to make some improvements so we can use it for some internal data analysis (unrelated to the public DNS Stats I spoke about at Velocity [2]).
Changes
I had to change the argument structure a bit from Erik Frey’s original version to accommodate some of the bigger changes I made:
-h  is now how you specify a list of hosts.
-?  is now how you get the detailed help message.
-m  is now how you specify the map program.
-M  allows you to specify the merge program.
-i  now accepts a file or a directory.
-f  does some DFS-like things I’ll discuss in more detail later. (There’s an example invocation after this list.)
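To make the new argument structure concrete, here’s a hypothetical invocation. The host names, the inline map and reduce programs, and the -r flag for the reduce program (carried over from Erik Frey’s original, I’m assuming) are illustrations, not anything br requires:

    # Hypothetical example: count words in words.txt across two machines.
    # host1/host2 and the map/reduce pipelines are made up; -r is assumed
    # to still name the reduce program as in the original br.
    br -h "host1 host2" \
       -m "tr ' ' '\n'" \
       -r "sort | uniq -c" \
       -i words.txt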
I think specifying your own merge step can be pretty useful, especially in cases where you want to tack a re-reduce step on the end. A user still has to be careful to do as little work as possible in the merge step, since it serializes the output of the map and reduce steps.
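As a sketch of what such a merge program might look like: suppose each worker’s reduce step emits "word count" lines (that record format is my assumption, not something br dictates), so the same word can arrive once per worker. A re-reduce passed via -M just re-sums:

    #!/bin/sh
    # rereduce.sh: hypothetical merge program for -M.
    # Re-sum counts that were emitted independently by each worker.
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'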
The new -f option causes br to distribute filenames instead of lines of data over the network (via nc). The implication here is that each node has a mirror of the directory specified with -i.
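A minimal sketch of that workflow, assuming rsync is how you keep the mirrors in sync (the host names, paths, and map/reduce scripts are placeholders, and -r is again assumed from the original):

    # Every node needs an identical copy of the input directory,
    # since with -f br ships filenames rather than file contents.
    for host in node1 node2 node3; do
      rsync -az /data/logs/ "$host:/data/logs/"
    done
    br -h "node1 node2 node3" -f -i /data/logs \
       -m ./map.sh -r ./reduce.sh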
Because br can now act on a directory instead of a single file, it’s possible that you as a user would want to keep some or all of this data gzipped. br will transparently handle gzipped content when -i is specified (gzipped stdin is not supported, use zcat).
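For the stdin case, the workaround from that parenthetical looks like this (the log file, hosts, and scripts are again made up for illustration):

    # br won't gunzip stdin for you; decompress before piping in.
    zcat access_log.gz | br -h "node1 node2" -m ./map.sh -r ./reduce.sh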
One last thing: stderr from most parts of br is saved to $tmp/br_stderr, where $tmp is the -t option (which defaults to /tmp).
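So a debugging session might look like the following, with -t pointed somewhere with more room (the paths here are illustrative):

    # Run with a custom temp directory, then inspect the errors.
    br -t /var/tmp -h "node1 node2" -m ./map.sh -i input.txt
    less /var/tmp/br_stderr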
Performance
I’ve found that performance scales pretty well with the number of cores made available, with one caveat: if your dataset is small and your merge step is significant, the merge step will dominate and the performance gains will be reduced.
Distributed Filesystem
I want to spend some time talking about what a distributed filesystem for br will look like. For OpenDNS’ immediate purposes, a filesystem mirrored using rsync is desirable from both a performance and durability point-of-view. For more general use, being limited by the size of your smallest node is not acceptable.
Growing the dataset beyond the size of the smallest node introduces two challenges: first, how to partition and replicate the data; second, how to find those partitions later. Ultimately, I realize the answer is probably to stop hating and use Hadoop. Humor me.
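To humor myself for a moment, the simplest partitioning scheme I can imagine is deterministic hashing of filenames, so that any node can recompute where a partition lives. Everything below (the node list, the paths, the cksum trick) is a sketch of the idea, not anything br does today:

    #!/bin/bash
    # Naive sketch: route each file to a node by hashing its name.
    # Any node can rerun the same hash later to locate a partition.
    nodes=(node1 node2 node3)
    for f in /data/logs/*; do
      n=$(( $(cksum <<< "$f" | cut -d' ' -f1) % ${#nodes[@]} ))
      rsync -az "$f" "${nodes[$n]}:/data/logs/"
    done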
The simplest solution is probably Red Hat’s GFS [3], since it’s POSIX-compliant and designed to scale beyond one node’s capacity. The downside is that it isn’t able to take locality into account the way HDFS does.
The other interesting possibility is wiring MogileFS [4] or HDFS [5] itself underneath br as a frontend. With Hadoop Streaming, this isn’t too far-fetched and may represent the best route for a br user who outgrows a single node’s storage capacity.
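For context, the Hadoop Streaming equivalent of a br map/reduce pair looks roughly like this; the jar path, HDFS paths, and scripts are placeholders:

    # The same shell map/reduce pair run via Hadoop Streaming,
    # which is roughly the world a br user would graduate into.
    hadoop jar hadoop-streaming.jar \
      -input /logs -output /wordcount-out \
      -mapper ./map.sh -reducer ./reduce.sh \
      -file map.sh -file reduce.sh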
Licensed by the goodwill of Erik Frey
Here’s my version: http://github.com/rcrowley/bashreduce