Indexing and searching Weblogic logs using Logstash, Elasticsearch and Kibana

This is a re-edit of my previous post "Indexing and searching Weblogic logs using Logstash and Graylog2". In the meantime our setup has settled on Kibana instead of the Graylog2 frontend. This howto is meant to be a complete installation guide for the Elasticsearch ELK stack and for using it to index tons of Weblogic server and application logs, from DEV through UA to the production environment.

1 . Architecture overview

The common centralized setup consists of

  • a log shipper that reads the logs and forwards them without doing any heavy parsing
  • a message queue that temporarily stores the events
  • one or more indexers that pull the events from the queue, parse them, enrich them with metadata (fields, tags, types) and store them in ...
  • an Elasticsearch cluster
  • and Kibana as a frontend to search, analyze and report on the events

We decided to use SSHFS mounts and a single logstash instance with file input -> multiline filter -> redis output as the shipper, plus a bunch of logstash indexer instances that do redis input -> filters galore -> elasticsearch_http output.

SSHFS and the file input because they were the easiest to get up'n'running, with administrative SSH access to the UA and production machines already in place. An alternative (and most likely more elegant) approach would be to run logstash-forwarder instances on all machines, or, in the case of Java apps, logstash with the log4j input in server mode.

Regarding performance, the file-to-redis shipper instance isn't really CPU intensive, as the only filter here is the multiline filter - no grok parsing or field extraction whatsoever. The latter is the job of the indexer instances, which are horizontally scaled across both machines. We index about 3 million log entries a day, summing up to around 5GB of index data per ES cluster node. So our setup is actually quite a small one.

Regarding availability, that file-to-redis instance is a single point of complete failure, for sure. We wrote some cron scripts that check if the logstash instances are up and start them if they are not; a sketch follows below. The file input uses a sincedb, so it remembers where it stopped indexing a file, which is insurance enough for us.
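
A minimal sketch of such a watchdog, assuming the start script from section 8 is saved as /opt/logstash/run.sh (the name is made up; our real scripts are a little more elaborate):

#!/bin/bash
# cron watchdog: restart the shipper if its config name is missing from the process list
if ! pgrep -f 'etc/file-to-redis.conf' > /dev/null; then
  nohup /opt/logstash/run.sh file-to-redis >> /opt/logstash/log/watchdog.log 2>&1 &
fi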

So, our stack looks like <TODO: insert picture here> :-)

This is by no means a general recommendation to do it the same way. I just describe how we did it, and it should serve as an example (whether it's a good or a bad one heavily depends on your own situation, budget, requirements and so on). The centerpiece of this guide is how to get those Weblogic server and application logs in there.

2 . Software/Hardware used

  • 2 quad-core servers with 32GB RAM each
  • currently 2TB of iSCSI attached SAN for the index data
  • RHEL 6 x86_64
  • Elasticsearch 0.90.11
  • Redis 2.8.4
  • Logstash 1.3.3
  • Kibana3 latest

3 . Installing required packages

For CentOS/RHEL 6 the following repositories provide most of the required packages: remi (for redis) and elasticsearch (for elasticsearch). There is also a repo for logstash, but we run multiple logstash instances on each host and maintain the config in Git, and I haven't played with the RPM install yet. Please refer to the linked pages on how to add the repos for yum; a rough example follows below.
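
For reference, adding the two repos looked roughly like this at the time of writing (the URLs move around, so double-check them against the linked pages):

$ rpm -Uvh http://rpms.famillecollet.com/enterprise/remi-release-6.rpm
$ cat > /etc/yum.repos.d/elasticsearch.repo <<'EOF'
[elasticsearch-0.90]
name=Elasticsearch repository for 0.90.x packages
baseurl=http://packages.elasticsearch.org/elasticsearch/0.90/centos
gpgcheck=1
gpgkey=http://packages.elasticsearch.org/GPG-KEY-elasticsearch
enabled=1
EOF

Note that the remi repo ships disabled by default, so either enable it in /etc/yum.repos.d/remi.repo or pass --enablerepo=remi to yum when installing redis.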

Then install the required stuff on both server nodes (for httpd and Kibana one server is actually enough; you could also put them on a separate machine):

Elasticsearch and Redis

$ yum install elasticsearch redis httpd

Logstash

$ mkdir /opt/logstash && cd $_
$ wget http://download.elasticsearch.org/logstash/logstash/logstash-1.3.3-flatjar.jar
$ ln -s logstash-1.3.3-flatjar.jar logstash.jar
$ mkdir {etc,log,patterns,es-templates}

Kibana

Download the latest version from elasticsearch.org and extract it somewhere so the webserver can access it.
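
For example (the URL below is from memory; grab the current archive from the download page if it 404s):

$ cd /var/www/html
$ wget http://download.elasticsearch.org/kibana/kibana/kibana-latest.tar.gz
$ tar xzf kibana-latest.tar.gz && mv kibana-latest kibana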

JDK 1.7

It's best to use a recent Oracle 1.7 JDK, so download and install the rpm package from oracle.com. Then run

$ alternatives --install /usr/bin/java java /usr/java/latest/bin/java 2000
$ alternatives --config java  # select /usr/java/latest/bin/java
$ java -version  # verify it works

4 . Configure Elasticsearch

  • Edit /etc/sysconfig/elasticsearch and set a reasonable ES_HEAP_SIZE (about 1/3 of the physical memory, 12g here)
  • Edit /etc/elasticsearch/elasticsearch.yml and give your cluster a name (both settings are shown below)
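
For reference, the two settings end up looking like this (the cluster name is just a placeholder; use the same one on both nodes):

# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=12g

# /etc/elasticsearch/elasticsearch.yml
cluster.name: wls-logging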

5 . Configure Redis

Edit /etc/redis.conf and comment out or delete the "bind 127.0.0.1" line, so the logstash instances on both hosts are able to connect to it (poor man's cluster failover; a true redis cluster is still in the works).
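
The relevant line, for clarity:

# /etc/redis.conf
# bind 127.0.0.1   <- comment this out so the logstash instances on the other host can connect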

6 . Configure Logstash

Now it gets more interesting. As said above, we use separate logstash instances: a single file-to-redis shipper instance on one of the hosts, and a bunch of indexer instances per log type, scaled across both hosts.

File-to-redis shipper instance

etc/file-to-redis.conf

input {
  # server logs
  file {
    type => "weblogic"
    tags => "weblogic"
    path => [ "/data/logfiles/weblogic/*/*/weblogic.log" ]
    codec => plain { charset => "ISO-8859-1" }
  }
  # application logs
  file {
    type => "application"
    tags => "weblogic"
    path => [ "/data/logfiles/weblogic/*/*/app1.log",
              "/data/logfiles/weblogic/*/*/app2.log",
              "/data/logfiles/weblogic/*/*/app3.log",
              "/data/logfiles/weblogic/*/*/app4.log" ]
    codec => plain { charset => "ISO-8859-1" }
  }
}

filter {
  multiline {
    type => "weblogic"
    pattern => "^####"
    negate => true
    what => "previous"
  }
  multiline {
    type => "application"
    # this will work until Dec 31st 2099, so..
    pattern => "^20"
    negate => true
    what => "previous"
  }
}

output {
  redis {
    host => ["phes01", "phes02"]
    data_type => "list"
    key => "logstash-%{type}"
  }
}

Important note about the file input: each file{} input runs in a separate thread. This means: if one of the applications goes totally nuts, it can block the input from other logfiles. I recommend using separate file{} input definitions for the main environments (dev, ua, prod) and maybe also per application, so each has its own thread and cannot block the others if it starts vomiting exceptions into its logfile; see the sketch below.
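
A sketch of what that splitting could look like; the dev/prod path components below are made up to fit our layout:

input {
  # one file{} block (and thus one reader thread) per environment
  file {
    type => "application"
    tags => "weblogic"
    path => [ "/data/logfiles/weblogic/prod*/*/app1.log" ]
    codec => plain { charset => "ISO-8859-1" }
  }
  file {
    type => "application"
    tags => "weblogic"
    path => [ "/data/logfiles/weblogic/dev*/*/app1.log" ]
    codec => plain { charset => "ISO-8859-1" }
  }
}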

Important note about the multiline filter: first, we use the filter, not the codec, because it simplifies the config a bit when using a greater number of file inputs. The second thing to note is that the multiline filter has two issues:

  • By design, a multiline event can only be created when the next event comes in (the filter matches the pattern and holds back lines until it matches the pattern the next time). So if the application logs that it's about to start a very long-running task, for example, you will not see that message in Kibana (yet).
  • The last line of every logfile is held back for the same reason, even though in that EOF case logstash could actually finalize the event and pass it on. This could be considered a bug; the devs are aware of it.

If that's a problem for you, you're better off using something like a logstash log4j input in server mode and log4j SocketAppenders in the application, or other techniques that do not rely on multiline filtering.
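
For completeness, a minimal sketch of that log4j alternative; the port is arbitrary and none of this is part of our actual setup:

# logstash side: accept serialized log4j events over TCP
input {
  log4j {
    mode => "server"
    port => 4560
    type => "application"
  }
}

# application side, log4j.properties: send events to logstash
# (add the "logstash" appender to your existing rootLogger line)
log4j.appender.logstash=org.apache.log4j.net.SocketAppender
log4j.appender.logstash.RemoteHost=phes01
log4j.appender.logstash.Port=4560
log4j.appender.logstash.ReconnectionDelay=10000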

Redis-to-ES indexer instances

This is where all the magic happens. For Weblogic we need to define some grok patterns so logstash knows what to look for. Logstash comes with a set of pre-defined patterns which are very useful to build upon. We have a messy mixture of log formats across the applications, so we use multiple patterns and try to match the more detailed ones first. Our actual list is longer; below is just an excerpt. I prefer a list of patterns over one single complex pattern, as it's easier to debug when developers change the log format without notice and suddenly things don't match (this happens quite regularly).

patterns/weblogic-patterns

WLS_DATESTAMP %{YEAR}[/-]%{MONTHNUM}[/-]%{MONTHDAY}[- ]%{TIME}

# server logs
WLS_SRV_LOG_FMT1 ####<%{DATA:wls_timestamp}> <%{WORD:severity}> <%{DATA:wls_topic}> <%{HOST:hostname}> <(%{WORD:server})?>( <(\[%{DATA:thread_status}\] )?ExecuteThread: '%{INT:thread_nr}' for queue: '%{DATA:thread_queue}'>)? %{GREEDYDATA:logmessage}
WLS_SRV_LOG_FMT2 ####<%{DATA:wls_timestamp}> <%{WORD:severity}> <%{DATA:wls_topic}> <%{HOST:hostname}> <(%{WORD:server})?> %{GREEDYDATA:logmessage}
WLS_SRV_LOG %{WLS_SRV_LOG_FMT1}|%{WLS_SRV_LOG_FMT2}

# application logs
WLS_APP_LOG_FMT1 %{WLS_DATESTAMP:timestamp}(?: %{NUMBER:some_id})? %{WORD:severity} ( )?\[%{JAVACLASS:java_class}\](:)?( \(\[%{DATA:thread_status}\] ExecuteThread: '%{INT:thread_nr}' for queue: '%{DATA:thread_queue}':\))? %{DATA:logmessage}$
WLS_APP_LOG_FMT2 %{WLS_DATESTAMP:timestamp}(?: %{NUMBER:some_id})? %{WORD:severity} ( )?\[%{JAVACLASS:java_class}\](:)? %{GREEDYDATA:logmessage}
WLS_APP_LOG %{WLS_APP_LOG_FMT1}|%{WLS_APP_LOG_FMT2}

# specific stuff
WLS_SERVERS server[0-9]{2}
WLS_APPS a|list|of|valid|application|names
WLS_ENVS dev[0-9]{2}|int|qsu|au[1-5]|prosi|prod

The "specific stuff" in there is used to extract fields from the file paths, see below.

Then the configuration for the weblogic server log indexer(s):

etc/weblogic-server.conf

input {
  redis {
    type => "weblogic"
    host => "phes01"
    data_type => "list"
    key => "logstash-weblogic"
  }
  redis {
    type => "weblogic"
    host => "phes02"
    data_type => "list"
    key => "logstash-weblogic"
  }
}

filter {
  grok {
    # extract environment name from file path
    patterns_dir => "./patterns"
    match => ["path", "%{WLS_ENVS:env}"]
  }
  grok {
    # grok the cr*p out of it
    patterns_dir => "./patterns"
    match => ["message", "%{WLS_SRV_LOG}"]
    # set "app" field to "server" so we can search weblogic.log using app:server 
    add_field => ["app", "server"]
  }
  date {
    # localization mess...
    match => ["wls_timestamp", "dd.MM.yyyy HH:mm 'Uhr' 'MEZ'", "dd.MM.yyyy HH:mm 'Uhr' 'MESZ'", "dd.MM.yyyy HH:mm:ss.SSS Z"]
  }
  #
  # implement stuck thread alerting for production env
  #
  grep {
    # check if thread is stuck (by BEA-000337 code)
    add_tag => ["stuck_thread"]
    match => [ "logmessage", "BEA-000337" ]
    drop => false
  }
  grep {
    # check if thread is unstuck (by BEA-000339 code)
    add_tag => ["unstuck_thread"]
    match => [ "logmessage", "BEA-000339" ]
    drop => false
  }
  if ("stuck_thread" in [tags]){
    mutate {
      replace => ["thread_status", "STUCK" ]
    }
  }
  if ("unstuck_thread" in [tags]){
    mutate {
      replace => ["thread_status", "UNSTUCK" ]
    }
  }

  mutate {
    # cosmetic unification
    uppercase => [ "severity" ]
  }
}

output {
  elasticsearch_http {
    host => "phes02"
    index => "logstash-weblogic-%{+YYYY.MM.dd}"
    manage_template => false
  }
  
  # alert on stuck threads in production via mail
  if [env] == "prod" and ([thread_status] == "STUCK" or [thread_status] == "UNSTUCK") {
   # exclude the worker nodes as long running stuff is their job
   if !([server] == "server09" or [server] == "server10") {
    email {
      via => "smtp"
      options => [ "smtpIporHost", "mail.domain.com" ]
      from => "logstash-alerts@domain.com"
      to => "weblogic-admins@domain.com"
      subject => "[logstash-alert] %{thread_status} thread in Prod, %{server}"
      body => "%{message}"
    }
   }
  }
}

It uses two redis inputs. That's because the redis output plugin in the shipper connects to the first available redis server on startup (or a random one, I didn't check) and only fails over if that redis instance dies. So the indexer needs two inputs to always check both redis instances for new messages.

Now the config for the application log indexer(s):

etc/weblogic-application.conf

input {
  redis {
    type => "application"
    host => "phes01"
    data_type => "list"
    key => "logstash-application"
  }
  redis {
    type => "application"
    host => "phes02"
    data_type => "list"
    key => "logstash-application"
  }
}

filter {
  grok {
    # extract environment (dev, qsu, prod etc) from file path
    patterns_dir => "./patterns"
    match => ["path", "%{WLS_ENVS:env}"]
  }
  grok {
    # extract app name from file path
    patterns_dir => "./patterns"
    match => ["path", "%{WLS_APPS:app}"]
  }
  grok {
    # extract server node number from file path
    patterns_dir => "./patterns"
    match => ["path", "%{WLS_SERVERS:server}"]
  }
  grok {
    # parse!
    patterns_dir => "./patterns"
    match => ["message", "%{WLS_APP_LOG}"]
  }
  date {
    match => ["timestamp", "YYYY-MM-dd HH:mm:ss,SSS"]
  }

  mutate {
    uppercase => [ "severity" ]
  }
}

output {
  elasticsearch_http {
    host => "phes02"
    index => "logstash-weblogic-%{+YYYY.MM.dd}"
    manage_template => true
    template => "/opt/logstash/es-templates/logstash-weblogic.json"
    template_name => "logstash-weblogic"
    template_overwrite => true
  }
}

We need to adapt the default ES template to match our index naming scheme:

es-templates/logstash-weblogic.json

{
  "template" : "logstash-weblogic*",
  "settings" : {
    "number_of_shards" : 2,
    "index.refresh_interval" : "5s",
    "analysis" : {
      "analyzer" : {
        "default" : {
          "type" : "standard",
          "stopwords" : "_none_"
        }
      }
    }
  },
  "mappings" : {
    "_default_" : {
       "_all" : {"enabled" : true},
       "dynamic_templates" : [ {
         "string_fields" : {
           "match" : "*",
           "match_mapping_type" : "string",
           "mapping" : {
             "type" : "multi_field",
               "fields" : {
                 "{name}" : {"type": "string", "index" : "analyzed", "omit_norms" : true },
                 "raw" : {"type": "string", "index" : "not_analyzed", "ignore_above" : 256}
               }
           }
         }
       } ],
       "properties" : {
         "@version": { "type": "string", "index": "not_analyzed" }
       }
    }
  }
}

7 . Configure Kibana

Kibana is plain JavaScript plus an index.html file. Just unpack it somewhere in your DocumentRoot or create a virtual host for it. Configuring Apache/Nginx is beyond the scope of this guide.
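
One file you may have to touch is Kibana's config.js, which tells the browser where to reach Elasticsearch. By default it points at port 9200 on the host that serves Kibana, so adjust it if your ES nodes live elsewhere:

// config.js (excerpt)
elasticsearch: "http://"+window.location.hostname+":9200",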

Just one note about security: Kibana stores its dashboard configuration in Elasticsearch. You may want to protect it from accidentally being overwritten by putting a proxy with basic or LDAP auth in front of it. See https://github.com/elasticsearch/kibana/tree/master/sample for some virtual host examples that implement this.

8 . Start the show

First, start one of the elasticsearch nodes using the provided init script and wait until it's up and running. Then start the second one and check /var/log/elasticsearch/<clustername>.log to see whether they connected to each other correctly; one should be elected as master, and the other one will take over should it fail. Also check http://phes01:9200/_cluster/health on one or both of the nodes. You should see 2 nodes total, 2 data nodes, status: green.
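
A quick way to check that from the shell (output abridged):

$ curl -s 'http://phes01:9200/_cluster/health?pretty'
{
  "cluster_name" : "<your cluster name>",
  "status" : "green",
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  ...
}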

Start the redis servers using the init script.

Start logstash. We use a simple custom script for this:

#!/bin/bash
# usage: $0 <logtype> [additional logstash agent flags]
# <logtype> must match a config file in etc/, e.g. "weblogic-server"

JAVA_HOME="/usr"
JAVA_OPTS="-Xms128m -Xmx1024m -XX:PermSize=128m -XX:MaxPermSize=256m"
BASEDIR="/opt/logstash"

LOGTYPE=$1

if [ !  -r "${BASEDIR}/etc/${LOGTYPE}.conf" ]; then
   echo "ERROR: no config for ${LOGTYPE} exists in ${BASEDIR}/etc."
   exit 1
fi

cd ${BASEDIR}
test -d log || mkdir log

export LANG=en_US.UTF-8

$JAVA_HOME/bin/java ${JAVA_OPTS} -XX:OnOutOfMemoryError="kill -9 %p" -jar logstash.jar agent -f etc/${LOGTYPE}.conf --log log/logstash-${LOGTYPE}.log $2
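
Assuming that script is saved as /opt/logstash/run.sh (pick any name you like), starting the whole chain looks like this (the file-to-redis shipper runs on one host only, the indexers on both):

$ /opt/logstash/run.sh file-to-redis &         # shipper, one host only
$ /opt/logstash/run.sh weblogic-server &       # indexer for the Weblogic server logs
$ /opt/logstash/run.sh weblogic-application &  # indexer for the application logs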

9 . Housekeeping

Housekeeping here means: purge old indexes after some retention period (e.g. 60 days) and optimize indexes that no longer receive new data (e.g. "yesterday's" logstash indexes). This reduces disk space usage and the resource demands of Elasticsearch. There is a tool which does all of this in a very comfortable and configurable fashion: curator. A good read about it can be found at http://www.elasticsearch.org/blog/curator-tending-your-time-series-indices/.
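
If you want to see what that boils down to on the REST level (or script it yourself before adopting curator), it is essentially a delete and an optimize call per index; the index names below are just examples following our naming scheme:

# drop an index that has passed its retention time
$ curl -XDELETE 'http://phes01:9200/logstash-weblogic-2014.01.01'

# merge down the segments of an index that no longer receives new data
$ curl -XPOST 'http://phes01:9200/logstash-weblogic-2014.02.01/_optimize?max_num_segments=1'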

 

Comments

Maybe a line is missing?

Thanks! That should read WLS_APP_LOG (I edited the pattern names before posting). Corrected.

Thanks for creating this detailed guide. The official docs are good to get started but are missing the big picture for production use.

Why do you need redis in this setup? Can't logstash send it straight to elasticsearch?

Redis isn't a hard requirement for getting data from logstash into elasticsearch, but it provides fault tolerance and load balancing for larger throughput, and it eases maintenance of the indexers while the shipper(s) keep pushing data into redis. If losing events is not acceptable and throughput is high, having a message queue between the shippers and the indexer cluster is a good idea.