Richard Crowley’s blog

Federated Graphite

Graphite is a nearly undocumented thing of sweetness. We at Betable collect, store, and graph many tens of thousands of metrics and we quickly reached the limits of a single-instance Graphite infrastructure. That’s OK, though, because Graphite does actually support some pretty reasonable federated architectures if you know where to look. Consider this your treasure map.

Aside: https://gist.github.com/2311507 is how I build Debian packages for Graphite which I deploy via Freight.

Firewalls

You probably already have your firewalls configured to allow your entire infrastructure to send metrics to Graphite on port 2003. Your Graphite almost-cluster is going to need a bit more permission to communicate with itself.

Open ports 2014, 2114, and so on as needed between all Graphite nodes. These are the ports the carbon-relay.py instance on each Graphite node uses to reach the carbon-cache.py instance(s) on every Graphite node.

Also open the port on which the web app listens to the other Graphite nodes. This is how each web app will query the other web apps.
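As a concrete sketch on 1.2.3.4, assuming the other node is 5.6.7.8, the cache pickle receivers are on 2014 and 2114, and the web app is served on port 80 (the web app port is an assumption; substitute whatever yours uses), plain iptables rules would look something like:

# Let the other node’s carbon-relay.py reach the local carbon-cache.py pickle receivers.
iptables -A INPUT -p tcp -s 5.6.7.8 --dport 2014 -j ACCEPT
iptables -A INPUT -p tcp -s 5.6.7.8 --dport 2114 -j ACCEPT
# Let the other node’s web app query the local web app.
iptables -A INPUT -p tcp -s 5.6.7.8 --dport 80 -j ACCEPT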

carbon-cache.py

Each carbon-cache.py instance should listen on a non-loopback interface and on a port that’s open to the carbon-relay.pys. I have found two instances of carbon-cache.py per Graphite node to be advantageous on the Rackspace Cloud (YMMV). Their pickle receivers listen on 2014 for instance a and 2114 for instance b on each Graphite node.

Each carbon-cache.py instance must have a name unique with respect to all carbon-cache.py instances on all Graphite nodes. For example: 1.2.3.4:2014:a, 1.2.3.4:2114:b, 5.6.7.8:2014:c, 5.6.7.8:2114:d. This isn’t obvious from the documentation or even from reading the code, so keep this in mind.
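Here’s a minimal carbon.conf sketch of the two instances on 1.2.3.4. The line receiver and cache query ports (2013/2113 and 7002/7102) are assumptions that mirror the pickle-port pattern, and per-instance [cache:a] sections assume a Carbon version that supports them; the pickle receiver ports are the ones described above.

[cache:a]
# Line receiver moved off 2003, which the relay owns below. Port is an assumption.
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2013
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2014
# The web app’s CARBONLINK_HOSTS must match these query ports. Port is an assumption.
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7002

[cache:b]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2113
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2114
CACHE_QUERY_INTERFACE = 0.0.0.0
CACHE_QUERY_PORT = 7102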

Whisper databases

I started the migration from one Graphite node to two by rsyncing all Betable’s Whisper databases from the old node to the new node.

rsync -avz --exclude=carbon graphite-ops.betable.com:/var/lib/graphite/whisper /var/lib/graphite/

The original plan (which is executed below) was to go back later and “garbage collect” the Whisper databases that didn’t belong after the Graphite cluster was up-and-running. If you’re feeling adventurous, use whisper-clean.py from https://gist.github.com/3153844 to choose which Whisper databases to rsync and which to delete up-front, because a Whisper database on the local filesystem will be preferred over a query to a remote carbon-cache.py instance.

carbon-relay.py

Each Graphite node should run a single carbon-relay.py instance. It should listen on 2003 and 2004 so your senders don’t have to be reconfigured.

List all carbon-cache.py instances on all Graphite nodes in DESTINATIONS in carbon.conf, each with its PICKLE_RECEIVER_INTERFACE, PICKLE_RECEIVER_PORT, and instance name. Sort the instances by their name. This is what allows metrics to arrive at any Graphite node and be routed to the right Whisper database.

To use consistent hashing or not to use consistent hashing? That’s your own problem. I use consistent hashing because I have better things to do than balance a Graphite cluster by hand.
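Concretely, the [relay] section of carbon.conf on every node might carry something like this sketch, using the example addresses and instance names from above:

[relay]
LINE_RECEIVER_INTERFACE = 0.0.0.0
LINE_RECEIVER_PORT = 2003
PICKLE_RECEIVER_INTERFACE = 0.0.0.0
PICKLE_RECEIVER_PORT = 2004
RELAY_METHOD = consistent-hashing
DESTINATIONS = 1.2.3.4:2014:a, 1.2.3.4:2114:b, 5.6.7.8:2014:c, 5.6.7.8:2114:d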

With this configuration, Graphite’s write path is federated between two nodes. Its read path, however, appears to be missing half its data; the web app configuration below fixes that.

Web app

Each Graphite node should run the web app, even if you don’t plan to use it directly.

Each web app should list the local carbon-cache.py instances in CARBONLINK_HOSTS each with its CACHE_QUERY_INTERFACE, CACHE_QUERY_PORT, and instance name. Sort the instances by their name as before so the consistent hash ring works properly. If you’re not using consistent hashing, sort the instances by their name to appease my OCD.

Each web app should list the other web apps in CLUSTER_SERVERS, each with its address and port. If a web app lists itself in CLUSTER_SERVERS, it’s gonna have a bad time.
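Put together, the web app on 1.2.3.4 might carry something like this in local_settings.py. The cache query ports (7002 and 7102) and the web app port (80) are assumptions carried over from the sketches above.

CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
CLUSTER_SERVERS = ["5.6.7.8:80"]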

Garbage collection

Once you’re satisfied with your Graphite cluster, it’s time to collect the garbage left by rsyncing all those Whisper files around. whisper-clean.py from https://gist.github.com/3153844 does exactly that.

(Of course, the usual disclaimers apply: this deletes data and is only known to work for me, so tread lightly.)

On 1.2.3.4:

DJANGO_SETTINGS_MODULE="graphite.settings" python whisper-clean.py 1.2.3.4:a 1.2.3.4:b -5.6.7.8:c -5.6.7.8:d

On 5.6.7.8:

DJANGO_SETTINGS_MODULE="graphite.settings" python whisper-clean.py -1.2.3.4:a -1.2.3.4:b 5.6.7.8:c 5.6.7.8:d

If you happen to have been smarter than I was with your rsyncing, this step probably won’t be necessary.

Configuration management

None of this configuration is done by hand. We use Puppet but Chef works just fine. I highly recommend using Exported Resources, puppet-related_nodes, or Chef Search to keep the Graphite cluster aware of itself.

Betable’s Graphite cluster learns its topology from related_nodes queries like

$carbonlink_hosts = related_nodes(Graphite::Carbon::Cache, true)
$cluster_servers = related_nodes(Package["graphite"])
$destinations = related_nodes(Graphite::Carbon::Cache, true)

that pick up Graphite nodes and the graphite::carbon::cache resources declared in each node stanza. These enforce that instance names (a, b, c, d, etc.) are unique across the cluster. Each graphite::carbon::cache resource generates a SysV-init script and a service resource that runs a carbon-cache.py instance.
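For flavor, a node stanza in that arrangement might look like the sketch below. The hostname and the parameter name on graphite::carbon::cache are made up for illustration and won’t match the real module; the resource titles are what become the instance names.

node "graphite-1.example.com" {
  include graphite
  # One declaration per carbon-cache.py instance; the title is the instance name.
  graphite::carbon::cache { "a": pickle_receiver_port => 2014 }
  graphite::carbon::cache { "b": pickle_receiver_port => 2114 }
}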

The carbon.conf and local_settings.py templates Puppet uses are in https://gist.github.com/3248921.