Tuesday, 1 April 2014

Elasticsearch housekeeping

Logstash is very cool, but the underlying Elasticsearch engine can take up a lot of space, so I wrote a small cleanup script that runs daily and either discards indices older than 30 days or optimizes the ones still in use.


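import pycurl
import json
import StringIO
from datetime import datetime, timedelta

retentionDays = 30

c = pycurl.Curl()
b = StringIO.StringIO()

# Fetch the list of indices (the URL is empty in this copy of the post).
c.setopt(c.URL, '')
c.setopt(pycurl.WRITEFUNCTION, b.write)
c.perform()

blob = json.loads(b.getvalue())

# Indices dated before this cutoff get deleted, newer ones get optimized.
old = datetime.now() - timedelta(days=retentionDays)

for index in blob['indices']:
    if 'logstash' in index:
        indexDate = datetime.strptime(index, "logstash-%Y.%m.%d")
        if old > indexDate:
            print "delete", index
            c.setopt(pycurl.CUSTOMREQUEST, "DELETE")
            c.setopt(c.URL, ('').format(index))
        else:
            print "optimize", index
            # Reset the verb so an earlier DELETE does not stick to the handle
            # (POST is an assumption; the optimize URL is empty here too).
            c.setopt(pycurl.CUSTOMREQUEST, "POST")
            c.setopt(c.URL, ('').format(index))
        c.perform()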
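The delete-or-optimize decision can also be pulled out of the HTTP calls so it is easy to check on its own. Here is a minimal sketch of that idea in Python 3; plan_housekeeping is a hypothetical helper of mine, not part of the script above, and it assumes daily indices named logstash-YYYY.MM.DD. Unlike the script, it skips any name that does not parse as a date instead of testing for the 'logstash' substring.

```python
from datetime import datetime, timedelta

RETENTION_DAYS = 30  # same retention window as the script above


def plan_housekeeping(index_names, now, retention_days=RETENTION_DAYS):
    """Split index names into (to_delete, to_optimize).

    Hypothetical helper for illustration; it makes no HTTP calls.
    """
    cutoff = now - timedelta(days=retention_days)
    to_delete, to_optimize = [], []
    for name in sorted(index_names):
        try:
            index_date = datetime.strptime(name, "logstash-%Y.%m.%d")
        except ValueError:
            # Not a daily logstash index: leave it alone.
            continue
        if index_date < cutoff:
            to_delete.append(name)
        else:
            to_optimize.append(name)
    return to_delete, to_optimize


# Example with made-up index names and a fixed "today".
names = ["logstash-2014.02.01", "logstash-2014.03.31", "kibana-int"]
print(plan_housekeeping(names, now=datetime(2014, 4, 1)))
```

Passing the current date in as an argument, rather than calling datetime.now() inside the function, keeps the result reproducible when trying it out.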

It turns out there is a much better tool for all Elasticsearch housekeeping, called curator, but sometimes it's nice to write your own scripts anyway :-)

