BashReduce 2009/06/27
I might be crazy but I’m loving Last.fm’s BashReduce tool [1]. So much so that I’ve taken time at work to make some improvements so we can use it for some internal data analysis (unrelated to the public DNS Stats I spoke about at Velocity [2]).
Changes
I had to change the argument structure a bit from Erik Frey’s original version to accommodate some of the bigger changes I made:
-h  is now how you specify a list of hosts.
-?  is now how you get the detailed help message.
-m  is now how you specify the map program.
-M  allows you to specify the merge program.
-i  now accepts a file or a directory.
-f  does some DFS-like things I’ll discuss in more detail later. (There’s an example invocation after this list.)
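To make the new argument structure concrete, here’s a hypothetical invocation. The host names, the inline map and reduce programs, and the -r flag for the reduce program (carried over from Erik Frey’s original, I’m assuming) are illustrations, not anything br requires:

    # Hypothetical example: count words in words.txt across two machines.
    # host1/host2 and the map/reduce pipelines are made up; -r is assumed
    # to still name the reduce program as in the original br.
    br -h "host1 host2" \
       -m "tr ' ' '\n'" \
       -r "sort | uniq -c" \
       -i words.txt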
I think specifying your own merge step can be pretty useful, especially in cases where you want to tack a re-reduce step on the end. A user still has to be careful to do as little work as possible in the merge step, since it serializes the output of the map and reduce steps.
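As a sketch of what such a merge program might look like: suppose each worker’s reduce step emits "word count" lines (that record format is my assumption, not something br dictates), so the same word can arrive once per worker. A re-reduce passed via -M just re-sums:

    #!/bin/sh
    # rereduce.sh: hypothetical merge program for -M.
    # Re-sum counts that were emitted independently by each worker.
    awk '{ sum[$1] += $2 } END { for (k in sum) print k, sum[k] }'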
The new -f option causes br to distribute filenames instead of lines of data over the network (via nc). The implication here is that each node has a mirror of the directory specified with -i.
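A minimal sketch of that workflow, assuming rsync is how you keep the mirrors in sync (the host names, paths, and map/reduce scripts are placeholders, and -r is again assumed from the original):

    # Every node needs an identical copy of the input directory,
    # since with -f br ships filenames rather than file contents.
    for host in node1 node2 node3; do
      rsync -az /data/logs/ "$host:/data/logs/"
    done
    br -h "node1 node2 node3" -f -i /data/logs \
       -m ./map.sh -r ./reduce.sh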
Because br can now act on a directory instead of a single file, it’s possible that you as a user would want to keep some or all of this data gzipped. br will transparently handle gzipped content when -i is specified (gzipped stdin is not supported, use zcat).
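For the stdin case, the workaround from that parenthetical looks like this (the log file, hosts, and scripts are again made up for illustration):

    # br won't gunzip stdin for you; decompress before piping in.
    zcat access_log.gz | br -h "node1 node2" -m ./map.sh -r ./reduce.sh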
One last thing: stderr from most parts of br is saved to $tmp/br_stderr, where $tmp is the -t option (which defaults to /tmp).
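So a debugging session might look like the following, with -t pointed somewhere with more room (the paths here are illustrative):

    # Run with a custom temp directory, then inspect the errors.
    br -t /var/tmp -h "node1 node2" -m ./map.sh -i input.txt
    less /var/tmp/br_stderr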
Performance
I’ve found that performance scales pretty well with the number of cores made available, with one caveat: if your dataset is small and your merge step is significant, the merge step will dominate and the performance gains will be reduced.
Distributed Filesystem
I want to spend some time talking about what a distributed filesystem for br will look like. For OpenDNS’ immediate purposes, a filesystem mirrored using rsync is desirable from both a performance and durability point-of-view. For more general use, being limited by the size of your smallest node is not acceptable.
Growing the dataset beyond the size of the smallest node introduces two challenges: first, how to partition and replicate the data; second, how to find those partitions later. Ultimately, I realize the answer is probably to stop hating and use Hadoop. Humor me.
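To humor myself for a moment, the simplest partitioning scheme I can imagine is deterministic hashing of filenames, so that any node can recompute where a partition lives. Everything below (the node list, the paths, the cksum trick) is a sketch of the idea, not anything br does today:

    #!/bin/bash
    # Naive sketch: route each file to a node by hashing its name.
    # Any node can rerun the same hash later to locate a partition.
    nodes=(node1 node2 node3)
    for f in /data/logs/*; do
      n=$(( $(cksum <<< "$f" | cut -d' ' -f1) % ${#nodes[@]} ))
      rsync -az "$f" "${nodes[$n]}:/data/logs/"
    done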
The simplest solution is probably Red Hat’s GFS [3], since it’s POSIX-compliant and designed to scale beyond one node’s capacity. The downside is that it isn’t able to take locality into account the way HDFS does.
The other interesting possibility is wiring MogileFS [4] or HDFS [5] itself underneath br as a frontend. With Hadoop Streaming, this isn’t too far-fetched and may represent the best route for a br user who outgrows a single node’s storage capacity.
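For context, the Hadoop Streaming equivalent of a br map/reduce pair looks roughly like this; the jar path, HDFS paths, and scripts are placeholders:

    # The same shell map/reduce pair run via Hadoop Streaming,
    # which is roughly the world a br user would graduate into.
    hadoop jar hadoop-streaming.jar \
      -input /logs -output /wordcount-out \
      -mapper ./map.sh -reducer ./reduce.sh \
      -file map.sh -file reduce.sh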
Licensed by the goodwill of Erik Frey
Here’s my version: http://github.com/rcrowley/bashreduce