Getting metrics from logs and various other sources into Graphite is quite simple. The most interesting metrics represent critical performance data, and the "proactive monitoring" approach, meaning a person sitting there watching the dashboard, isn’t suited to our needs. We use Nagios with Centreon as our monitoring platform, and we want to alert on some of the metrics collected in Graphite. Also, since version 2.4 Centreon supports custom dashboard views and, although this might sound redundant, we wanted to get the metrics graphically integrated into the Centreon interface, as RRD graphs that is.
Looking around, I found the check_graphite plugin by obfuscurity and greatly enhanced it to support multiple metrics in one call, performance data with customizable metric short names, and retries in case there are no datapoints in the given duration. It’s called check_graphite_multi, is available from the perfdata branch of my nagios-scripts repository on GitHub, and is especially useful if you’d like to get multiple metrics of the same type into one RRD graph in Centreon, PNP4Nagios, or the like. Our use case is a graph with JVM heap generation usage and garbage collector statistics. We alert on a full old generation and on high GC durations.
Here are some short usage notes:
--metrics|-m accepts a string of metrics, separated by a pipe |
--shortname|-s accepts a comma-separated list of aliases used in the status output and the performance data
If no --shortname is specified for a given metric, it defaults to the full metric name.
--warn|-w also accepts a comma-separated list
At least one value is required; if only one value is given for multiple metrics, it applies to all of them
--critical|-c works the same as --warn
When specifying multiple metrics, make sure to keep the order for all parameters, like
-m "metric1|metric2" -s "alias1,alias2" -w "warn1,warn2" -c "crit1,crit2"
If at least one of the metrics returns a CRITICAL state, the plugin exits with the CRITICAL return code. The same goes for WARNING.
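The parameter handling and exit code rules from the notes above can be sketched in Python roughly as follows. This is not the plugin's actual code: the function names are made up for illustration, the thresholds are assumed to be upper bounds (higher values are worse), and the numeric states follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING and 2 for CRITICAL.

```python
# Hypothetical sketch; not the plugin's actual code.
OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios plugin exit codes


def expand(values, count):
    """Apply a single value to every metric, else require one value per metric."""
    if len(values) == 1:
        return values * count
    if len(values) != count:
        raise ValueError(f"expected 1 or {count} values, got {len(values)}")
    return values


def parse_args(metrics, shortnames, warn, crit):
    """Align the pipe-separated metrics with their comma-separated options."""
    names = metrics.split("|")
    given = shortnames.split(",") if shortnames else []
    # A missing or empty alias falls back to the full metric name.
    aliases = [given[i] if i < len(given) and given[i] else name
               for i, name in enumerate(names)]
    warns = expand([float(v) for v in warn.split(",")], len(names))
    crits = expand([float(v) for v in crit.split(",")], len(names))
    return list(zip(names, aliases, warns, crits))


def state(value, warn, crit):
    """Assumption: thresholds are upper bounds (higher values are worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def overall(states):
    """The worst individual state decides the exit code of the whole check."""
    return max(states, default=OK)
```

With this sketch, parse_args("metric1|metric2", "alias1", "80", "90,95") gives the second metric its full name as alias and applies the single warning value to both metrics.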
By default, if the metric has no datapoints in the given --duration timeframe, the plugin retries with 10 times the given duration. This is mostly cosmetic, to prevent holes in the RRD graphs, and I might make it configurable in the future. Unfortunately, the Graphite render API has no option to return just the last datapoint, so this is a hack to work around that.
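The retry behaviour can be sketched in Python like this. It is an illustration under assumptions, not the plugin's implementation: the function names are hypothetical, the duration is taken to be in minutes, and the value of interest is assumed to be the most recent non-null datapoint. The only Graphite specifics relied on are the render API's target, from and format=json parameters and its JSON layout of [value, timestamp] pairs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def render_url(base_url, metric, minutes):
    """Build a render API URL for the last `minutes` minutes as JSON."""
    query = urlencode({"target": metric, "from": f"-{minutes}min", "format": "json"})
    return f"{base_url}/render?{query}"


def fetch_datapoints(base_url, metric, minutes):
    """Return the [value, timestamp] pairs of the first matching series."""
    with urlopen(render_url(base_url, metric, minutes)) as resp:
        series = json.load(resp)
    return series[0]["datapoints"] if series else []


def last_value(datapoints):
    """Most recent non-null value, or None if the window holds no data."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def latest_with_retry(fetch, metric, minutes, factor=10):
    """Query once; if the window is empty, retry once with a wider window."""
    value = last_value(fetch(metric, minutes))
    if value is None:
        value = last_value(fetch(metric, minutes * factor))
    return value
```

Passing the fetch function in as an argument keeps the retry logic testable without a running Graphite server; against a real instance it could be wrapped as lambda m, d: fetch_datapoints(base_url, m, d), where base_url is the address of your Graphite web application.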