I might be crazy, but I’m loving Last.fm’s BashReduce tool. So much so that I’ve taken time at work to make some improvements so we can use it for some internal data analysis (unrelated to the public DNS Stats I spoke about at Velocity).
I had to change the argument structure a bit from Erik Frey’s original version to accommodate some of the bigger changes I made:
-h is now how you specify a list of hosts.
-? is now how you get the detailed help message.
-m is now how you specify the map program.
-M allows you to specify the merge program.
-i now accepts a file or a directory.
-f does some DFS-like things I’ll discuss in more detail later.
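To make the flag semantics concrete, here’s a sketch of a trivial word-count map program and how it might be wired up with the flags above. The script name, hostnames, and input path are all hypothetical, not from br itself:

```shell
# wordmap.sh: a trivial map program for -m; emits "word<TAB>1" per word.
cat > /tmp/wordmap.sh <<'EOF'
#!/bin/sh
awk '{ for (i = 1; i <= NF; i++) print $i "\t" 1 }'
EOF
chmod +x /tmp/wordmap.sh

# Sanity-check it standalone before handing it to br:
printf 'foo bar foo\n' | /tmp/wordmap.sh

# A hypothetical invocation (hosts and input path are made up):
#   br -h "host1 host2" -m /tmp/wordmap.sh -i /var/log/queries
```

Since the map program just reads stdin and writes stdout, you can debug it entirely outside of br before going distributed.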
I think specifying your own merge step can be pretty useful, especially in cases where you want to add a re-reduce step on the end. A user still has to be careful to do as little work as possible in the merge step, since it serializes the output of the map and reduce steps.
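As a sketch of what a custom merge/re-reduce program for -M might look like (my own example, not code shipped with br): it collapses the concatenated per-reducer counts into one total per key, which is about as little serial work as a re-reduce can do:

```shell
# remerge.sh: a hypothetical -M program; sums counts per key from
# tab-separated "key<TAB>count" lines produced by each reducer.
cat > /tmp/remerge.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{ sum[$1] += $2 } END { for (k in sum) print k "\t" sum[k] }' | sort
EOF
chmod +x /tmp/remerge.sh

# Two reducers' outputs, concatenated, collapse into one total per key:
printf 'foo\t2\nbar\t1\nfoo\t3\n' | /tmp/remerge.sh
```

Everything flowing through this one process is exactly why the caveat above matters: whatever you put here runs serially.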
The -f option causes br to distribute filenames instead of lines of data over the network (via nc). The implication here is that each node has a mirror of the directory specified with -i. Since br can now act on a directory instead of a single file, it’s possible that you as a user would want to keep some or all of this data gzipped.
br will transparently handle gzipped content when -i is specified (gzipped stdin is not supported).
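The transparent handling can be imagined as something like the following sketch. This is my own assumption about the mechanism (extension-based dispatch); br’s actual detection may differ:

```shell
# Pick a decompressor per input file by extension; plain files pass through.
emit() {
  case "$1" in
    *.gz) gzip -dc "$1" ;;
    *)    cat "$1" ;;
  esac
}

# Demo: a plain file and a gzipped file produce identical streams.
mkdir -p /tmp/brgz
printf 'hello\n' > /tmp/brgz/plain.txt
printf 'hello\n' | gzip -c > /tmp/brgz/packed.txt.gz

emit /tmp/brgz/plain.txt
emit /tmp/brgz/packed.txt.gz
```

Because the map program only ever sees the decompressed stream, it needs no changes to work against a mixed directory.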
One last thing: stderr from most parts of br is saved under $tmp, where $tmp is the directory given by the -t option.
I’ve found performance to scale pretty well with the number of cores made available, with one caveat: if your dataset is small and your merge step is significant, the merge will dominate and the performance gains will be reduced.
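That caveat is just Amdahl’s law with the merge as the serial fraction. Assuming, purely for illustration, that the merge accounts for 20% of the single-core runtime, the best-case speedup works out to:

```shell
# speedup(n) = 1 / (s + (1 - s) / n), with serial merge fraction s = 0.2
# (s is a made-up illustrative value, not a measurement).
awk 'BEGIN {
  s = 0.2
  for (n = 1; n <= 16; n *= 2)
    printf "%2d cores: %.2fx\n", n, 1 / (s + (1 - s) / n)
}'
```

Even with 16 cores, a 20% serial merge caps the speedup at 4x, which is consistent with the merge dominating on small datasets.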
I want to spend some time talking about what a distributed filesystem for br will look like. For OpenDNS’ immediate purposes, a filesystem mirrored using rsync is desirable from both a performance and durability point-of-view. For more general use, being limited by the size of your smallest node is not acceptable.
Growing the dataset beyond the size of the smallest node introduces two challenges: first, how to partition and replicate the data, and second, how to find those partitions later. Ultimately, I realize the answer is probably to stop hating and use Hadoop. Humor me.
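For both challenges, the simplest thing that could possibly work is a stable hash of each filename, so any node can recompute where a partition lives without a lookup service. A sketch of the idea (not an actual br feature):

```shell
# Map a filename to one of N nodes by checksum; the same name always
# lands on the same node, which also answers "how do I find it later?"
partition() {
  printf '%s' "$1" | cksum | awk -v n="$2" '{ print $1 % n }'
}

partition access.log.2009-06-01 4
partition access.log.2009-06-01 4   # same input, same node
```

Replication falls out of the same scheme: write each partition to node h and to node (h + 1) mod N, and a reader can check both.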
The simplest solution is probably Red Hat GFS, since it’s POSIX-compliant and designed to scale beyond one node’s capacity. The downside is that it isn’t able to take any of the locality considerations of HDFS into account.
The other interesting possibility is wiring MogileFS or HDFS up with br as a frontend. With Hadoop Streaming, this isn’t too far-fetched and may represent the best route for a br user who outgrows a single node’s storage.
Licensed by the goodwill of Erik Frey
Here’s my version: http://github.com/rcrowley/bashreduce