The plan is to store the data in Avro format on HDFS and then create a Hive table on top of it. Ideally the table should be external, so that Hive doesn't remove the original files in case I need them for something else.
To create an external table over Avro-formatted data, you typically do this:
CREATE EXTERNAL TABLE cyber.proxy_logs
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/hadoop/proxy/'
TBLPROPERTIES ('avro.schema.literal'='
{
  "type": "record",
  "name": "proxy",
  "fields": [
    { "name": "DeviceHostname", "type": "string", "default": "" },
    { "name": "ResponseTime",   "type": "string", "default": "" },
    { "name": "SourceIp",       "type": "string", "default": "" },
    { "name": "SourcePort",     "type": "string", "default": "" },
    { "name": "Username",       "type": "string", "default": "" }
  ]
}
');
This creates a table mapped to our data. The files in HDFS are organized like this:
/hadoop/proxy/year=2017/month=09/day=19
/hadoop/proxy/year=2017/month=09/day=20
...
That gives us three levels of partitioning, making it easy to run operations on a yearly, monthly, or daily basis, or at least that's the plan.
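For example, here is the kind of partition-pruned query this layout is meant to enable, a minimal sketch run through the hive CLI against the table defined above:

# Partition pruning: Hive only reads files under year=2017/month=09/,
# skipping every other year and month directory.
hive -e "SELECT day, COUNT(*) FROM cyber.proxy_logs WHERE year='2017' AND month='09' GROUP BY day;"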
Once you've done that, you need to tell Hive to add your partitions using:
ALTER TABLE proxy_logs ADD PARTITION (year='2017', month='09', day='19');
...
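To check which partitions the metastore actually knows about, SHOW PARTITIONS (standard HiveQL) lists them; here run through the hive CLI, assuming the cyber database from above:

hive -e "USE cyber; SHOW PARTITIONS proxy_logs;"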
We need to add a new partition like this every day, since our partitioning is daily. At this point, when I ran

select count(*) from proxy_logs;

I didn't get any result at all. It turns out you also need to recompute the statistics for the table metadata to be up to date, which I did with:

ANALYZE TABLE proxy_logs PARTITION (year='2017', month='09', day='19') COMPUTE STATISTICS;

Only then was I able to access my logs. I will now need to do this for every partition of every table, so time to get scripting!
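As a starting point, here is a minimal sketch of such a script, assuming the hive CLI is on the PATH, GNU date is available, and the job runs once a day (from cron, say) to register the previous day's partition:

#!/bin/bash
# Register yesterday's partition and refresh its statistics
# so that queries like count(*) return up-to-date results.
year=$(date -d yesterday +%Y)
month=$(date -d yesterday +%m)
day=$(date -d yesterday +%d)

hive -e "
USE cyber;
ALTER TABLE proxy_logs ADD IF NOT EXISTS PARTITION (year='${year}', month='${month}', day='${day}');
ANALYZE TABLE proxy_logs PARTITION (year='${year}', month='${month}', day='${day}') COMPUTE STATISTICS;
"

Extending it to loop over every table, and to backfill the historical partitions already sitting in HDFS, is straightforward from there.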