Prometheus: run queries against Elasticsearch and turn the results into metrics and alerts

Long time no read. I'm in the middle of re-implementing our monitoring solution, this time using Prometheus.

We have tons of web- and application-server logs in Elasticsearch and need to query them for HTTP 5xx error rates, application error rates, and similar things to get that data into Prometheus for alerting fun.

Enter braedon's prometheus-es-exporter. There's a ready-to-go Docker image on Docker Hub, so all it takes to get things working in Kubernetes is this:

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: prometheus-es-exporter
  namespace: prometheus
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: prometheus-es-exporter
    spec:
      containers:
      - image: braedon/prometheus-es-exporter:0.4.4
        name: prometheus-es-exporter
        args: ["--indices-stats-disable", "-e", "http://elasticsearch:9200"]
        ports:
          - containerPort: 9206
        volumeMounts:
          - mountPath: /usr/src/app/exporter.cfg
            subPath: "exporter.cfg"
            name: es-exporter-config
      volumes:
        - name: es-exporter-config
          configMap:
            name: es-exporter-config
---
apiVersion: v1
kind: Service
metadata:
  name: prometheus-es-exporter
  namespace: prometheus
spec:
  ports:
  - port: 9206
    targetPort: 9206
    protocol: TCP
  selector:
    app: prometheus-es-exporter
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-exporter-config
  namespace: prometheus
data:
  exporter.cfg: |
    [DEFAULT]
    QueryIntervalSecs = 60
    QueryTimeoutSecs = 10
    QueryIndices = _all
    
    # anything after the first query_ is used for the metric name
    [query_es_query_wildfly_errors_1h_mak]
    QueryIndices = 
    QueryJson = {
            "query": {
              "bool": {
                "must": [
                  { "match": { "env": "prod" } },
                  { "match": { "server_group": "mak" } },
                  { "match": { "severity": "error" } }
                ],
                "filter": {
                  "bool": {
                    "must": [
                      { "range": { "@timestamp": { "gte": "now-1h", "lt": "now" } } }
                    ]
                  }
                }
              }
            }
          }
    
    [query_es_query_wildfly_errors_1h_fin]
    QueryIndices = 
    QueryJson = {
            "query": {
              "bool": {
                "must": [
                  { "match": { "env": "prod" } },
                  { "match": { "server_group": "fin" } },
                  { "match": { "severity": "error" } }
                ],
                "filter": {
                  "bool": {
                    "must": [
                      { "range": { "@timestamp": { "gte": "now-1h", "lt": "now" } } }
                    ]
                  }
                }
              }
            }
          }
    
    [query_es_query_wildfly_errors_1h_common]
    QueryIndices = 
    QueryJson = {
            "query": {
              "bool": {
                "must": [
                  { "match": { "env": "prod" } },
                  { "match": { "server_group": "common" } },
                  { "match": { "severity": "error" } }
                ],
                "filter": {
                  "bool": {
                    "must": [
                      { "range": { "@timestamp": { "gte": "now-1h", "lt": "now" } } }
                    ]
                  }
                }
              }
            }
          }

In the example above we have a Wildfly domain running three server groups (mak, fin, common), and we want separate error counts for each of them. We need three separate queries for this and will end up with 3 × 2 = 6 separate metrics:

es_query_wildfly_errors_1h_common_hits
es_query_wildfly_errors_1h_common_took_milliseconds
es_query_wildfly_errors_1h_fin_hits
es_query_wildfly_errors_1h_fin_took_milliseconds
es_query_wildfly_errors_1h_mak_hits
es_query_wildfly_errors_1h_mak_took_milliseconds
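
The naming scheme can be sketched in a few lines of Python. This is a hypothetical helper, not the exporter's actual code; it just mirrors the rule from the comment in the config above: everything after the first query_ in the section name becomes the metric prefix, and the exporter appends the _hits and _took_milliseconds suffixes to it.

```python
# Hypothetical sketch of how the section names above map to metric names
# (mirrors the comment in the config; not prometheus-es-exporter's actual code).
def metric_names(section_name):
    # Everything after the first "query_" becomes the metric prefix ...
    prefix = section_name.split("query_", 1)[1]
    # ... and the exporter emits a _hits and a _took_milliseconds metric for it.
    return [prefix + "_hits", prefix + "_took_milliseconds"]

print(metric_names("query_es_query_wildfly_errors_1h_mak"))
```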

We don't want separate metrics, but labels instead. metric_relabel_configs to the rescue.
In our scrape_configs, we can relabel the metrics like so:

        metric_relabel_configs:
        - source_labels: [ __name__ ]
          regex: '(es_query_wildfly_errors_1h)_(mak|fin|common)_(hits|took_milliseconds)'
          replacement: 'wildfly'
          target_label: app
        - source_labels: [ __name__ ]
          regex: '(es_query_wildfly_errors_1h)_(mak|fin|common)_(hits|took_milliseconds)'
          replacement: '${2}'
          target_label: server_group
        - source_labels: [ __name__ ]
          regex: '(es_query_wildfly_errors_1h)_(mak|fin|common)_(hits|took_milliseconds)'
          replacement: '${1}_${3}'
          target_label: __name__

This will:

  • add an app: wildfly label
  • move the Wildfly server group name (mak|fin|common) into a label called server_group
  • rename the metric to es_query_wildfly_errors_1h_hits or es_query_wildfly_errors_1h_took_milliseconds
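
All three relabel rules apply the same regex, so you can check what the capture groups yield with plain Python. Prometheus anchors relabel regexes at both ends, so fullmatch is the right analogue; the sample metric name is just one of the six listed above:

```python
import re

# The same regex used in the metric_relabel_configs above.
pattern = re.compile(
    r"(es_query_wildfly_errors_1h)_(mak|fin|common)_(hits|took_milliseconds)"
)

# Prometheus anchors relabel regexes, hence fullmatch rather than search.
m = pattern.fullmatch("es_query_wildfly_errors_1h_fin_hits")

server_group = m.group(2)                 # ${2} -> the server_group label value
new_name = f"{m.group(1)}_{m.group(3)}"   # ${1}_${3} -> the renamed metric

print(server_group, new_name)
```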

Much better. Come to think of it, the _hits and _took_milliseconds suffixes could maybe be moved into a mode label as well. I'll leave that as an exercise.
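
With the server_group label in place, the alerting part becomes straightforward. A minimal sketch of a Prometheus rule file; the alert name, threshold, and durations here are made up for illustration:

```yaml
groups:
- name: wildfly-errors
  rules:
  - alert: WildflyErrorRateHigh                    # example name
    expr: es_query_wildfly_errors_1h_hits > 100    # threshold is an assumption
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: '{{ $labels.server_group }} logged {{ $value }} Wildfly errors in the last hour'
```

Because the relabeling moved the server group into a label, one rule covers all three groups, and the firing alert tells you which group is misbehaving.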
