Monday, October 2, 2017

Incident response on a shoestring budget

This blog post is the first in a series that aims to set up basic security monitoring on a low budget: nothing but a few servers, gray matter, and a lot of open source software, while avoiding vendor$ as much as possible.

One of the problems to solve on a large network is knowing who is talking to whom. Host-to-host communication is easy to monitor using Bro or NetFlow, but knowing which software communicates over which connection is a whole different problem (sometimes called endpoint visibility).
The good news is that Facebook released osquery, which is now available on all major platforms; osquery lets you access pretty much all the data of a computer using SQL.
For instance, something like
 
osquery> SELECT uid, name FROM listening_ports l, processes p WHERE l.pid=p.pid;
+------+-------------+
| uid  | name        |
+------+-------------+
| 1000 | dbus-daemon |
+------+-------------+

would give you the uid and process name of every process that has a listening port (server processes, if you want).
It also ships with a daemon (osqueryd) that lets you run scheduled queries, and the output of those queries can be saved in Elasticsearch for stacking and analysis.

The plan is to have queries running at a regular interval that give you the process name, uid, checksum (SHA-1), and the connections.
Something like this:
osquery> select action, protocol, local_address, local_port, remote_address, remote_port, uid, name FROM processes p, socket_events s WHERE s.pid=p.pid;
scheduled at a regular interval would do the trick.
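To schedule it, you add the query to the "schedule" section of the osqueryd configuration (usually /etc/osquery/osquery.conf on Linux). A minimal sketch; the network_info name matches the output below, and the 300-second interval is just an example:

{
  "schedule": {
    "network_info": {
      "query": "SELECT action, protocol, local_address, local_port, remote_address, remote_port, uid, name FROM processes p, socket_events s WHERE s.pid=p.pid",
      "interval": 300
    }
  }
}

Note that on Linux the socket_events table is fed by the audit-based event publisher, so you may need to start osqueryd with something like --disable_audit=false --audit_allow_sockets=true before this query returns any rows.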
Once set up to run, it will return something like this at regular intervals:
{
  "name": "network_info",
  "hostIdentifier": "goldorak",
  "calendarTime": "Tue Dec 13 15:10:08 2016 UTC",
  "unixTime": "1481641808",
  "decorations": {
    "host_uuid": "FC1D7B01-5138-11CB-B85D-C04D3A0C6645",
    "username": "someusername"
  },
  "columns": {
    "action": "connect",
    "local_address": "",
    "local_port": "0",
    "name": "DNS Res~er #115",
    "protocol": "14578",
    "remote_address": "192.168.1.1",
    "remote_port": "53",
    "uid": "1000"
  },
  "action": "added"
}

The next thing I want to look at is the network forensics problem, which I define like this: I want to know, at any point in time, which host on my network talked to which other host. Ideally, I also want to know the protocol and the associated metadata such as SSL/TLS certificates, extracted files, and so on. Since we already get similar information from osquery (pid, process name, connection, ...), we should ideally be able to stitch it all together into a global view; the source could be NetFlow or Bro metadata, and the same stitching could be done for Snort or Suricata alerts to add more context and detail for the analyst.
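As a rough illustration of what that stitching could look like, here is a sketch of a join between the osquery socket events and Bro connection metadata, assuming both have been flattened into queryable tables; the table names, column names, and the 60-second time window are all assumptions made up for this example:

-- Hypothetical tables: osquery_socket_events (from the scheduled query above)
-- and bro_conn (Bro conn.log metadata); all names are illustrative only.
SELECT o.name, o.uid, b.orig_h, b.resp_h, b.resp_p, b.service, b.orig_bytes, b.resp_bytes
FROM osquery_socket_events o
JOIN bro_conn b
  ON o.remote_address = b.resp_h
 AND o.remote_port = b.resp_p
 AND abs(o.unix_time - b.ts) < 60;  -- loose time window, the two clocks are never perfectly in sync

The join deliberately ignores the local address because, as the output above shows, it is not always populated for connect events; in practice you would also want the local port and the transport protocol in the key to cut down on false matches.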

Friday, September 29, 2017

Hive external tables show no data

I'm currently working on a project that ingests a lot of security-related data (firewall logs, HTTP proxy logs, ...). The goal is to keep the data for a year or so, to be able to do analytics and to go back in time in case of a security incident and get a clue about what the "bad guy" did (follow a compromised asset or user).

The plan is to store the data in Avro format on HDFS and then create Hive tables on top. Ideally, the tables should be external so that Hive doesn't remove the original files (for instance when a table is dropped), in case I need them for something else.

To create an external table on top of Avro-formatted data, you typically do this:

CREATE EXTERNAL TABLE cyber.proxy_logs
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cyber/proxy_logs/'
TBLPROPERTIES ('avro.schema.literal'='{
  "type" : "record",
  "name" : "proxy",
  "fields" : [
    { "name" : "DeviceHostname", "type" : "string", "default" : "" },
    { "name" : "ResponseTime", "type" : "string", "default" : "" },
    { "name" : "SourceIp", "type" : "string", "default" : "" },
    { "name" : "SourcePort", "type" : "string", "default" : "" },
    { "name" : "Username", "type" : "string", "default" : "" }
  ]
}');

This creates a table mapped to our data. The files in HDFS are organized like this:

/hadoop/proxy/year=2017/month=09/day=19
/hadoop/proxy/year=2017/month=09/day=20
...

That way we have three levels of partitioning, making it easy to do operations on a yearly, monthly, or daily basis, or at least that's the plan.
Once you've done that, you need to tell Hive to add your partitions using:

ALTER TABLE proxy_logs ADD PARTITION(year="2017", month="09", day="19");
...

We need to do that every day since our partitioning is daily. At this point, when I ran
select count(*) from proxy_logs
I didn't get any result at all... It turns out you also need to recompute the statistics for the table metadata to be up to date. I did this with
ANALYZE TABLE proxy_logs PARTITION(year="2017", month="09", day="19") COMPUTE STATISTICS;
and only then was I able to access my logs. I will now need to do this for every partition of every table, so it's time to get scripting!
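Depending on your Hive version, you may be able to avoid part of that scripting: MSCK REPAIR TABLE can discover all the year=/month=/day= directories as partitions in one go, and ANALYZE TABLE accepts a partition spec without values to compute statistics for every partition. A sketch of what I have in mind, to be run after each daily load:

-- Discover any new year=.../month=.../day=... directories as partitions
MSCK REPAIR TABLE proxy_logs;
-- Recompute statistics for all partitions (partition spec without values)
ANALYZE TABLE proxy_logs PARTITION(year, month, day) COMPUTE STATISTICS;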

Hadoop / Spark2 snippets that took way too long to figure out

This is a collection of links and snippets that took me way too long to figure out; I've copied them here with a bit of documentation in...