Graphite is a nearly undocumented thing of sweetness. We at Betable collect, store, and graph many tens of thousands of metrics and we quickly reached the limits of a single-instance Graphite infrastructure. That’s OK, though, because Graphite does actually support some pretty reasonable federated architectures if you know where to look. Consider this your treasure map.
Aside: https://gist.github.com/2311507 is how I build Debian packages for Graphite which I deploy via Freight.
You probably already have your firewalls configured to allow your entire infrastructure to send metrics to Graphite on port 2003. Your Graphite almost-cluster is going to need a bit more permission to communicate with itself.
Open ports 2014, 2114, and so on as needed between all Graphite nodes. These are how the carbon-relay.py instance on each Graphite node will communicate with the carbon-cache.py instance(s) on every Graphite node. Also open the port on which the web app listens to the other Graphite nodes; this is how the web apps communicate with each other.
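As a sketch of what those firewall rules might look like on an iptables host: the node addresses 10.0.0.1 and 10.0.0.2, the web app on port 80, and the pickle receiver ports 2014 and 2114 are all assumptions here, so substitute your own.

```shell
# Emit (don't apply) iptables rules allowing each Graphite node to reach
# the pickle receivers (2014, 2114) and the web app (assumed port 80).
# The node addresses below are hypothetical examples.
gen_rules() {
  for node in 10.0.0.1 10.0.0.2; do
    for port in 2014 2114 80; do
      echo "iptables -A INPUT -s $node -p tcp --dport $port -j ACCEPT"
    done
  done
}
gen_rules
```

Review the output, then pipe it through sh as root to apply.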
Each carbon-cache.py instance should listen on a non-loopback interface and on a port that’s open to the carbon-relay.pys. I have found two instances of carbon-cache.py per Graphite node to be advantageous on the Rackspace Cloud (YMMV). Their pickle receivers listen on port 2014 for the first instance and port 2114 for the second on each Graphite node.
Each carbon-cache.py instance must have a name that’s unique with respect to all carbon-cache.py instances on all Graphite nodes, for example 10.0.0.2:2114:d. This isn’t obvious from the documentation or even from reading the code, so keep it in mind.
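In carbon.conf, that layout looks something like the following; the ports and instance names here are examples under the two-node assumption, not prescriptions.

```ini
# carbon.conf on the first Graphite node; the second node would name
# its instances c and d instead. All ports are example values.
[cache:a]
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_PORT = 2014
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_PORT = 2113
PICKLE_RECEIVER_PORT = 2114
CACHE_QUERY_PORT = 7102
```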
I started the migration from one Graphite node to two by rsyncing all Betable’s Whisper databases from the old node to the new node.
rsync -avz --exclude=carbon graphite-ops.betable.com:/var/lib/graphite/whisper /var/lib/graphite/
The original plan (which is executed below) was to go back later and “garbage collect” the Whisper databases that didn’t belong after the Graphite cluster was up and running. If you’re feeling adventurous, you should use whisper-clean.py from https://gist.github.com/3153844 to choose which Whisper databases to rsync and delete up-front, because a Whisper database on the local filesystem will be preferred over querying a remote node.
Each Graphite node should run a single carbon-relay.py instance. It should listen on 2003 and 2004 so your senders don’t have to be reconfigured. In carbon.conf, it should list the carbon-cache.py instances on all Graphite nodes, each with its address, PICKLE_RECEIVER_PORT, and instance name. Sort the instances by their name. This is what allows metrics to arrive at any Graphite node and be routed to the right Whisper database.
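A sketch of the relay section of carbon.conf under the same assumptions as before (two nodes at 10.0.0.1 and 10.0.0.2 running instances a through d; substitute your own):

```ini
[relay]
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
# Every cache instance on every node, sorted by instance name.
DESTINATIONS = 10.0.0.1:2014:a, 10.0.0.1:2114:b, 10.0.0.2:2014:c, 10.0.0.2:2114:d
```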
To use consistent hashing or not to use consistent hashing? That’s your own problem. I use consistent hashing because I have better things to do than balance a Graphite cluster by hand.
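For the curious, the idea behind consistent hashing is easy to sketch. This toy ring is illustrative only, not carbon’s implementation (carbon also derives ring positions from MD5, but with its own details), and the node names are the hypothetical examples from above.

```python
import hashlib
from bisect import bisect_left

class ToyHashRing:
    """Illustrative consistent-hash ring; NOT carbon's implementation."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def _position(self, key):
        # A 16-bit ring position derived from MD5.
        return int(hashlib.md5(key.encode()).hexdigest()[:4], 16)

    def add_node(self, node):
        # Each node claims `replicas` points so metrics spread evenly.
        for i in range(self.replicas):
            self.ring.append((self._position("%s:%d" % (node, i)), node))
        self.ring.sort()

    def get_node(self, metric):
        # The first ring position at or after the metric's hash wins,
        # wrapping around to the start of the ring if necessary.
        i = bisect_left(self.ring, (self._position(metric),)) % len(self.ring)
        return self.ring[i][1]
```

The appeal is that adding or removing a node remaps only a fraction of the metrics instead of nearly all of them, which is exactly why I don’t balance the cluster by hand.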
With this configuration, Graphite’s write path is federated between two nodes. Its read path, however, appears to be missing half its data.
Each Graphite node should run the web app, even if you don’t plan to use it directly.
Each web app should list the local carbon-cache.py instances in CARBONLINK_HOSTS, each with its address, CACHE_QUERY_PORT, and instance name. Sort the instances by their name as before so the consistent hash ring works properly. If you’re not using consistent hashing, sort the instances by their name anyway to appease my OCD.
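In local_settings.py on the first node, under the same assumed layout (instances a and b with cache query ports 7002 and 7102), that might read:

```python
# local_settings.py on the node running instances a and b.
# Local cache instances only, sorted by instance name; the ports
# are the example values assumed above.
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
```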
Each web app should list the other web apps in CLUSTER_SERVERS, each with its address and port. If a web app lists itself in CLUSTER_SERVERS, it’s gonna have a bad time.
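Continuing the two-node example (web apps assumed to listen on port 80), the setting on the node at 10.0.0.1 would be:

```python
# local_settings.py on 10.0.0.1: every other web app, never this one.
CLUSTER_SERVERS = ["10.0.0.2:80"]
```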
Once you’re satisfied with your Graphite cluster, it’s time to collect the garbage left by rsyncing all those Whisper files around. whisper-clean.py from https://gist.github.com/3153844 does exactly that.
(Of course, the usual disclaimers about how this deletes data but is only known to work for me apply so tread lightly.)
Run it first on the node running instances a and b (10.0.0.1 in this example), then on the node running c and d (10.0.0.2), substituting your own addresses:
DJANGO_SETTINGS_MODULE="graphite.settings" python whisper-clean.py 10.0.0.1:a 10.0.0.1:b -10.0.0.2:c -10.0.0.2:d
DJANGO_SETTINGS_MODULE="graphite.settings" python whisper-clean.py -10.0.0.1:a -10.0.0.1:b 10.0.0.2:c 10.0.0.2:d
If you happen to have been smarter than I was with your rsyncing, this step probably won’t be necessary.
None of this configuration is done by hand. We use Puppet but Chef works just fine. I highly recommend using Exported Resources, puppet-related_nodes, or Chef Search to keep the Graphite cluster aware of itself.
Betable’s Graphite cluster learns its topology from related_nodes queries like

$carbonlink_hosts = related_nodes(Graphite::Carbon::Cache, true)
$cluster_servers = related_nodes(Package["graphite"])
$destinations = related_nodes(Graphite::Carbon::Cache, true)
that pick up Graphite nodes and the graphite::carbon::cache resources declared with each node stanza. These enforce that instance names (a, b, c, d, and so on) are unique across the cluster. Each graphite::carbon::cache resource generates a SysV-init script and a service resource that runs a carbon-cache.py instance. The local_settings.py templates Puppet uses are in https://gist.github.com/3248921.