Tuesday, March 27, 2018

Lessons learned building a big data solution oriented towards security logs AKA a "security data lake"

My company decided to give big data security log search a go. Most information out there talks about implementing machine learning and other features using (cool) Jupyter notebooks and proof-of-concept style setups. However, building an infrastructure that does this at large scale, reliably, turned out to be an epic story.

The high level requirements go like this:

  • We want to be able to search events in near real time, perform full text search on them, and create dashboards for specific use cases depending on the investigation.
  • We want to store the data in both pristine and parsed formats for at least a year for audit and investigation purposes.
  • We want to be able to replay older logs through the pipeline to fix/update the parser-augmenter pipeline.
At this stage some of you might suggest Apache Metron or Apache Spot, and you would be right to do so. The problem we found was that the list of supported log parsers for Metron isn't that long, and the parsers are tied to specific product versions. That's not abnormal, but in our large and mixed environment (the telecom world has a lot of strange devices) the benefit wasn't so clear, and on top of that using two tools means two different data models, interfaces and so on to work with. The full story is a lot longer, and in retrospect some points were overlooked while some difficulties were over-estimated: in particular, the importance of giving analysts a tool to manipulate and query the data extensively was underestimated (on the Metron side that tool is called Stellar).

Laying out the tracks

The bird's eye overview boils down to three steps: ingestion, parsing and augmentation, and storage.

Ingestion is typically done through a syslog server in a Linux environment and a Windows Event Collector in a Microsoft shop. This is already a problem: neither provides high availability, and scaling can only be done via a divide-and-conquer approach (hosts A & B go to that server, B & C to that other server, and so on). One approach to address the issue is to use MiNiFi to read the log files from the syslog servers (active-active) and ship them to the NiFi cluster, where a DetectDuplicate processor makes sure only one copy of each log is shipped to the next layer. For Windows, you might want to use NXLog to ship your Windows logs either to files (and then use MiNiFi) or emit them directly towards Kafka. Once the logs have reached NiFi, the first thing to do is to wrap them in a JSON envelope to add metadata (time of ingestion, server the log came from, the type of logs ...) and then ship them to Kafka; a minimal sketch of that envelope step is shown further below. This will allow you to handle back pressure nicely and do point-in-time recovery if the subsequent parts of the pipeline ever have problems or slow down.
The above architecture results in something like this:



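For illustration, here's a minimal sketch of that envelope step written in plain Python rather than as a NiFi flow; the broker address, topic name and metadata fields are assumptions for the example, not what we actually run.

```python
# Minimal sketch: wrap a raw syslog line in a JSON envelope and publish it to Kafka.
# Assumes the kafka-python package; broker, topic and field names are illustrative.
import json
import socket
from datetime import datetime, timezone

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="kafka01:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)

def envelope(raw_line, log_type):
    """Add ingestion metadata around the pristine event."""
    return {
        "ingest_time": datetime.now(timezone.utc).isoformat(),
        "collector": socket.gethostname(),   # server the log came from
        "log_type": log_type,                # e.g. "linux_auth", "proxy"
        "raw": raw_line,                     # untouched original event
    }

with open("/var/log/remote/auth.log") as f:
    for line in f:
        producer.send("raw-logs", envelope(line.rstrip("\n"), "linux_auth"))

producer.flush()
```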
Parsing and augmentation is where the heavy-duty processing is done.
To parse an event is to extract business data out of it, like IPs, usernames and URLs. You can do it with various tools depending on the skills and constraints of your shop. Our proof of concept used Logstash for that task: the pros are that it's easy to implement and very much near-real-time oriented (low latency); the cons are that it doesn't handle back pressure very well (one slow parser will slow down all log sources) and it generally requires more fire power than the alternatives. NiFi, on the other hand, is micro-batch oriented, which means more delay; if you have a hard requirement on low latency (less than 60 seconds) it is most likely a no go. On top of that, any non-trivial parsing of logs that aren't record oriented (Linux auth logs or Windows logs) is a massive pain IMHO; there's some work being done on that with record-oriented processors that can use grok expressions to do the parsing, but it's still sub-optimal. It's a tough problem. Metron uses Apache Storm for its parsing to have enough fire power and flexibility to implement complex logic, but Storm is primarily Java, and since our team doesn't have enough Java devs we settled for Spark since it has a Python API, PySpark.
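As an illustration, a minimal PySpark sketch of the parsing stage could look like the following; the Kafka topics, the envelope fields and the regexes are assumptions for the example, not our production parser.

```python
# Minimal sketch of a Spark Structured Streaming parsing job (Spark 2.x).
# Topic names, envelope fields and regexes are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, regexp_extract, struct, to_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("parse-raw-logs").getOrCreate()

envelope_schema = StructType([
    StructField("ingest_time", StringType()),
    StructField("collector", StringType()),
    StructField("log_type", StringType()),
    StructField("raw", StringType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka01:9092")
       .option("subscribe", "raw-logs")
       .load())

events = (raw.select(from_json(col("value").cast("string"), envelope_schema).alias("e"))
          .select("e.*"))

# Example: pull a client IP and a URL out of a proxy-style log line with regexes.
parsed = (events
          .withColumn("src_ip", regexp_extract(col("raw"), r"(\d{1,3}(?:\.\d{1,3}){3})", 1))
          .withColumn("url", regexp_extract(col("raw"), r"(https?://\S+)", 1)))

query = (parsed.select(to_json(struct([parsed[c] for c in parsed.columns])).alias("value"))
         .writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka01:9092")
         .option("topic", "parsed-logs")
         .option("checkpointLocation", "/tmp/checkpoints/parse-raw-logs")
         .start())

query.awaitTermination()
```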
To augment an event is to add more information to it. A simple example is adding the geoip information for an IP address; a more complex example would be adding the username associated with an internal IP address, or an internal user and IP reputation system (there's a lot to be done here so I'll probably expand on the subject in another post) ...
Again we'll use Spark for the job; there's a fair amount of boilerplate code to help you start properly and focus on the business logic rather than fight the tools to get them to work. A small geoip example is sketched below.
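Here's a minimal sketch of geoip augmentation as a PySpark UDF; it assumes the geoip2 package and a local GeoLite2-City database on every executor, and the path and column names are illustrative.

```python
# Minimal sketch of geoip augmentation with a PySpark UDF.
# Assumes the geoip2 package and a GeoLite2-City.mmdb file present on each executor;
# the database path and column names are illustrative assumptions.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

import geoip2.database
import geoip2.errors

_reader = None  # opened lazily, once per executor process

def country_of(ip):
    global _reader
    if _reader is None:
        _reader = geoip2.database.Reader("/opt/geoip/GeoLite2-City.mmdb")
    try:
        return _reader.city(ip).country.iso_code
    except (geoip2.errors.AddressNotFoundError, ValueError):
        return None

country_udf = udf(country_of, StringType())

# 'parsed' is the DataFrame produced by the parsing sketch above.
augmented = parsed.withColumn("src_country", country_udf(parsed["src_ip"]))
```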

Storage is where we keep the events for later display and reporting, and again we face an amusing issue: the tools for long-term archival and for near-real-time search are very different. For real-time search the candidates were Solr and Elasticsearch, and at the time we settled for Solr since it was included in the stack we use (Ambari), with HDFS for long-term retention. While Solr is not a bad product, feature-wise it's lagging behind. On the UI front, the only generic interface we found is "banana", a fork of Kibana 3, which is old to say the least... On top of all that, Elastic has been moving extremely fast, with the machine learning component in X-Pack and the Vega scriptable UI. Moreover, the core business of Solr seems to be content-management search, where the typical workload is read-heavy with few writes, exactly the opposite of a log search tool, which sees massive and continuous writes and much fewer reads. As a result, a lot of the default values are not suitable for near-real-time search; see the Solr wiki link on the topic.

No matter which one you choose, keep in mind to spread your data over a large number of disks to maximize the number of IOPS the cluster can perform and maximize the effect of the buffer cache, increase the number of shards to match the number of physical disks to improve write and read performance, and keep the replica count down to a manageable size; a hedged sizing example follows.
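For Solr, creating a collection sized this way through the Collections API could look like the sketch below; the host, collection name and counts are assumptions to adapt to your own disk layout.

```python
# Minimal sketch: create a Solr collection sized to the physical disk layout
# through the Collections API. Host, names and counts are illustrative assumptions.
import requests

SOLR_COLLECTIONS_API = "http://solr01:8983/solr/admin/collections"

resp = requests.get(SOLR_COLLECTIONS_API, params={
    "action": "CREATE",
    "name": "proxy-logs",
    "numShards": 12,           # e.g. one shard per physical disk across the cluster
    "replicationFactor": 2,    # keep the replica count manageable
    "collection.configName": "logs_conf",
})
resp.raise_for_status()
print(resp.json())

# For near-real-time search you will also want to tune autoCommit/autoSoftCommit
# in solrconfig.xml, since the defaults target read-heavy content search.
```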
On the long-term storage front, there are also a couple of things to consider, like the file format. There's a lot of discussion about which file format is best, but basically you want one of the packed/binary formats like Avro, ORC or Parquet, preferably one of the column-oriented ones, which allow you to change the schema (add or remove fields) over time without it being a breaking change (that leaves out Avro). Knowing all that, we settled for Parquet because it has good support in Spark and its compression ratios are pretty good, which lets us archive a bit more before using up all the disk space. You will also want to partition the data, which essentially means storing the files in different directories based on time, something like /hadoop/proxy/year=2018/month=02/day=01/file.parquet. This makes it easy to write a scrubber job to delete old data per type depending on the retention period of your logs, pretty much the same way we used to do it on a syslog server; a small sketch of the archive write is shown below.
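A minimal sketch of writing events to partitioned Parquet in Spark, assuming a batch DataFrame of augmented events (hypothetically named events_df) with an event_time timestamp column; paths and column names are illustrative.

```python
# Minimal sketch: archive events as Parquet, partitioned by date.
# Assumes a batch DataFrame 'events_df' with an 'event_time' timestamp column;
# for the streaming pipeline you would use writeStream with the parquet sink instead.
from pyspark.sql.functions import col, year, month, dayofmonth

archive = (events_df
           .withColumn("year", year(col("event_time")))
           .withColumn("month", month(col("event_time")))
           .withColumn("day", dayofmonth(col("event_time"))))

(archive.write
 .mode("append")
 .partitionBy("year", "month", "day")
 .parquet("/hadoop/proxy"))

# The scrubber job then only has to delete expired partition directories,
# e.g. /hadoop/proxy/year=2017/month=01/..., much like rotating files on a syslog server.
```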

Now that we've laid out the tracks, there's still plenty to cover: metrics, out-of-memory errors, Kerberos, config management, writing a Spark job for the parsing, setting up Hive tables as external because you want to keep the data, ...
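On that last point, a hedged sketch of declaring such an external table on top of the Parquet archive, so that dropping the table never deletes the underlying data; the table, columns and location are assumptions.

```python
# Minimal sketch: expose the Parquet archive as an external Hive table so that
# DROP TABLE never deletes the data. Requires a SparkSession created with
# enableHiveSupport(); table, columns and location are illustrative assumptions.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS proxy_logs (
        src_ip STRING,
        url    STRING,
        raw    STRING
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    LOCATION '/hadoop/proxy'
""")

# Let Hive discover the partitions already written by the archive job.
spark.sql("MSCK REPAIR TABLE proxy_logs")
```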

