The road to Elasticsearch 2.0, or how to reindex your dotted fields with logstash and ruby filters

The Elasticsearch 2.0 release introduced a major annoyance by removing support for dots in field names. We use ES for our Apache logs, with a retention policy of 365 days, and of course _all_ of our indices contained fields with a dot in the name.

Even worse, at some point I had the idea of extracting the request parameters from the URI and running a kv filter on them. As we never used the resulting mess of request_params.* fields, those could simply be dropped.

The first step was to update our logstash configuration so that no dots are used in field names going forward.
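
For us that was mostly the useragent filter's ua.* fields. A minimal sketch of the dot-free variant (the source field name here is an assumption; prefix is an option of the useragent plugin):

filter {
  useragent {
    source => "agent"
    # prepend ua_ instead of ua. to the extracted keys
    prefix => "ua_"
  }
}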

Then we needed an automated way of reindexing all of our indices: replacing all dots (.) with underscores (_) in the field names, dropping irrelevant fields, and moving all data into a new index. I came up with a method using logstash and a ruby filter, wrapped in a bash script that iterates over all indices, sed's the index name into the template below, and runs logstash with it. Logstash shuts itself down once the index has been read completely.
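
The wrapper script is roughly the following sketch (the logstash path and file names are assumptions; the XXX_INDEX_XXX placeholder matches the template below):

#!/bin/bash
# list all indices on the source cluster (es01, as in the template)
for idx in $(curl -s 'http://es01:9200/_cat/indices?h=index'); do
  # substitute the index name into the template ...
  sed "s/XXX_INDEX_XXX/${idx}/" logstash.conf.template > logstash-reindex.conf
  # ... and run logstash with it; it exits when the input is exhausted
  /opt/logstash/bin/logstash -f logstash-reindex.conf
done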
 
The config makes use of Elasticsearch's "scan and scroll" feature via the elasticsearch input, two ruby filters, and an elasticsearch output that reuses the original document_id, document_type and index name from the input's @metadata.
 

Here's the logstash.conf template:

input {
  elasticsearch {
    hosts => [ "es01:9200" ]
    index => "XXX_INDEX_XXX"
    size => 1000
    scroll => "5m"
    docinfo => true
    scan => true
  }
}

filter {
  # first, drop irrelevant request_params.* fields
  ruby {
    code => "
      event.to_hash.keys.each { |k|
        if k.start_with?('request_params')
          event.remove(k)
        end
      }
    "
  }

  # now, rename other fields, mostly the useragent filter's ua. -> ua_
  ruby {
    code => "
      event.to_hash.keys.each { |k|
        event[ k.gsub('.', '_') ] = event.remove(k) if k.include?('.')
      }
    "
  }

}

output {
  elasticsearch {
    hosts => [ "phnfs01:9200" ]
    index => "es20-%{[@metadata][_index]}"
    document_type => "%{[@metadata][_type]}"
    document_id => "%{[@metadata][_id]}"
  }
#  stdout {
#    codec => "rubydebug"
#  }
}

It took almost one day for all indices, and then I realized that the kv filter on the request params had caused more of a mess than I thought at first. We still had fields in there like "request_params._SERVER[ADDRESS]" and more of that kind, caused by OpenVAS tests. The tests ran daily, so almost all indices were affected. The first ruby filter above didn't really delete those, for whatever reason. Heck!

I tried to adjust the filter with some "ks = k.to_s" logic, but nothing helped, so I decided to just drop those events completely:

  ruby {
    code => "
      event.to_hash.keys.each { |k|
        if k.start_with?('request_param')
          event.cancel
        end
      }
    "
  }

So it took two iterations of full reindexing. The first cleared up most things, the second removed the remaining mess.

Now we only needed to delete the old indices and create aliases with the old names pointing to the newly created es20-* indices, and all was good.
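
In curl terms, roughly (the index name is an example, hosts as in the config above):

# delete the old index on the source cluster
curl -XDELETE 'http://es01:9200/logstash-2015.11.01'

# alias the reindexed index back to the old name on the target cluster
curl -XPOST 'http://phnfs01:9200/_aliases' -d '{
  "actions": [
    { "add": { "index": "es20-logstash-2015.11.01", "alias": "logstash-2015.11.01" } }
  ]
}'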
