Friday, September 29, 2017

Hive external table shows no data

I'm currently working on a project that ingests a lot of security-related data (firewall logs, HTTP proxy logs, ...). The goal is to keep the data for a year or so to be able to do analytics and, in case of a security incident, go back in time and get a clue about what the "bad guy" did (follow a compromised asset or user).

The plan is to store the data in Avro format on HDFS and then create a Hive table on top of it. Ideally the table should be external, so that Hive doesn't remove the original files if the table is dropped, in case I need them for something else.

To create an external table on top of Avro-formatted data, you typically do this:

CREATE EXTERNAL TABLE cyber.proxy_logs
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cyber/proxy_logs/'
TBLPROPERTIES ('avro.schema.literal'='{
  "type" : "record", "name" : "proxy",
  "fields" : [
    { "name" : "DeviceHostname", "type" : "string", "default" : "" },
    { "name" : "ResponseTime", "type" : "string", "default" : "" },
    { "name" : "SourceIp", "type" : "string", "default" : "" },
    { "name" : "SourcePort", "type" : "string", "default" : "" },
    { "name" : "Username", "type" : "string", "default" : "" }
  ] }');
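
A quick way to sanity-check the result (my own habit, not something the table needs) is to ask Hive what it derived from the schema literal; DESCRIBE FORMATTED should show the Avro SerDe and a table type of EXTERNAL_TABLE:

-- Show the columns derived from the Avro schema, the SerDe in use and the
-- table type (which should read EXTERNAL_TABLE, not MANAGED_TABLE).
DESCRIBE FORMATTED cyber.proxy_logs;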

This creates a table mapped to our data. The files in HDFS are organized like this:

/cyber/proxy_logs/year=2017/month=09/day=19
/cyber/proxy_logs/year=2017/month=09/day=20
...

That gives us three levels of partitioning, which makes it easy to run operations on a yearly, monthly, or daily basis, or at least that's the plan.
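
As an illustration (the query itself is mine, the column names come from the schema above): once the partitions are registered (see the next step), filtering on the partition columns means Hive only has to read the matching day directory instead of the whole year:

-- Partition pruning: only year=2017/month=09/day=19 is scanned.
SELECT SourceIp, Username
FROM cyber.proxy_logs
WHERE year = '2017' AND month = '09' AND day = '19'
LIMIT 10;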
Once you've done that, you need to tell Hive to add your partitions using:

ALTER TABLE cyber.proxy_logs ADD PARTITION (year = '2017', month = '09', day = '19');
...
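
To see which directories Hive actually knows about (a check I find useful, though not strictly required), list the registered partitions:

-- Partitions present in the metastore; a directory that exists in HDFS but is
-- not listed here will simply be ignored by queries.
SHOW PARTITIONS cyber.proxy_logs;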

We need to do that every day, since our partitioning is daily. At this point, when I ran

SELECT count(*) FROM cyber.proxy_logs;

I didn't get any result at all... It turns out you also need to recompute the statistics so that the table metadata is up to date. I did this with

ANALYZE TABLE cyber.proxy_logs PARTITION (year = '2017', month = '09', day = '19') COMPUTE STATISTICS;

and only then was I able to access my logs. I will now need to do this for every partition of every table, so time to get scripting!
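
Here is a minimal sketch of what that daily job could run, assuming the same table and layout as above; the IF NOT EXISTS clause makes it safe to re-run, and in practice the dates would be filled in by whatever scheduler drives the ingestion (for instance by generating a .hql file and running it with hive -f):

-- Hypothetical daily maintenance, one day per run: register the new directory
-- as a partition, then refresh the statistics so count(*) sees the data.
ALTER TABLE cyber.proxy_logs
  ADD IF NOT EXISTS PARTITION (year = '2017', month = '09', day = '20');
ANALYZE TABLE cyber.proxy_logs
  PARTITION (year = '2017', month = '09', day = '20')
  COMPUTE STATISTICS;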

Hadoop / Spark2 snippets that took way too long to figure out

This is a collection of links and snippets that took me way too long to figure out; I've copied them here with a bit of documentation in...