Le blog de Jéré

Friday, 26 June 2026

Lessons Learned Mapping the Criminal Underground Economy AKA Crimeware Has a People Problem

The starting point for this was fairly simple: there is a pile of leaked criminal infrastructure data sitting in https://github.com/D4RK-R4BB1T/Criminal-Leaks, and I wanted to see whether looking across it would tell a more useful story than looking at each leak by itself.

Most writeups around ransomware, forums, darknet markets, and malware crews tend to focus on the exciting part: the malware family, the takedown, the actor name, the Bitcoin address, the chat leak, the "here is the IOC table" section. This is useful, but it is also a little bit like trying to understand a company by reading only its outage reports.

If you line the datasets up next to each other, the criminal underground starts to look less like chaos and more like a slightly broken platform economy. There are operators, affiliates, builders, marketplaces, comms servers, payment systems, reputation systems, and customers. There are also unpaid invoices, bad contractors, internal leaks, abandoned accounts, and support processes that sometimes look more responsive than the legitimate companies they are extorting.

The high-level questions go like this:

* Who does what in this economy?
* Where are the dependencies?
* What failure modes show up again and again?
* Which parts are technically sophisticated, and which parts are just people being lazy?

The data used here comes from public leaks: Lockbit panel data, BlackBasta chats, BlackMarketReloaded Bitcoin addresses, COPP and NLCOPP forum data, ZooVille user lists, URSNIF material, and a few smaller infrastructure leaks. This is not live access, and it is not magic attribution. It is mostly parsing boring files, checking assumptions, and letting the shape of the data make the point.

Laying Out the Actors

The bird's-eye view comes down to four roles: operators, affiliates, marketplaces, and infrastructure providers.

This is not a perfect model, but it is good enough to avoid treating everything as "the group did X". In practice, the group is usually a platform, a few core people, and a lot of semi-independent participants who may or may not be loyal this week.

Operators: The Platform People

Lockbit is a good example of the operator side of ransomware-as-a-service. The operator provides the panel, builder, payment workflow, victim negotiation infrastructure, and commission logic. Affiliates bring access and do the dirty work.

The leaked Lockbit panel gives a compact view of that relationship:

* 75 total users in the system
* 1 admin account
* 40 active affiliates
* 35 paused affiliates
* 88 generated builds
* 246 victims
* 59,975 Bitcoin addresses in the panel
* 35 affiliates with registered Bitcoin payout addresses
* 7 paid commissions
* 0 decryptions provided

That last part is where the nice business model diagram starts to look less nice.

There are 246 victims in the panel, but only 7 paid commissions and no recorded decryptions. There are several possible explanations for that, and none of them are flattering. Maybe many victims did not pay. Maybe payment tracking was incomplete. Maybe the operators were selective about paying affiliates. Maybe victims paid and still did not get working decryption. Maybe some combination of all of the above.
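As a quick sanity check on those ratios, here is a minimal sketch that recomputes them from the counts in the list above. The dictionary keys and the `share` helper are my own naming for illustration, not fields from the leaked panel schema.

```python
# Counts copied from the panel summary listed above.
panel = {
    "users": 75,
    "paused_affiliates": 35,
    "victims": 246,
    "paid_commissions": 7,
    "decryptions": 0,
}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


commission_rate = share(panel["paid_commissions"], panel["victims"])
paused_rate = share(panel["paused_affiliates"], panel["users"])

print(f"paid commissions per victim: {commission_rate}%")
print(f"paused accounts: {paused_rate}%")
```

Fewer than 3 paid commissions per 100 victims is the number that makes the incentive problem concrete, whichever explanation turns out to be the right one.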

The important bit is not that we can prove one interpretation from this table alone. The important bit is that the incentives are unstable. Affiliates are useful only as long as they believe the operator will pay. Once that belief disappears, the same people who had access to the tooling and internal process become a leak risk.

At this stage it is tempting to describe ransomware groups as startups, but that is too generous. A startup can have broken processes and still fall back on contracts, employment law, investors, and courts. A ransomware operation has forum reputation, threats, private chats, and the hope that nobody with access gets angry enough to dump the database.

That is not a governance model. That is a countdown.

Affiliates: The Long Tail With Admin Access

The Lockbit affiliate data shows steady recruitment, but it also shows churn. Out of 75 users, 35 were paused. That is 47% of the accounts not currently active, which is a lot if you imagine this as a stable organization and less surprising if you imagine it as a contractor marketplace.

The rough lifecycle looks like this:

1. Register or get recruited.
2. Get access to the panel.
3. Generate a build.
4. Use existing access or buy access elsewhere.
5. Deploy ransomware.
6. Negotiate through the platform.
7. Wait for a commission, in theory.

Build creation follows the usual power-law shape. There were 23 unique builders behind 88 builds, but only a handful drove most of the volume. Most participants generated one or two builds, which could be testing, failed campaigns, or low-volume operations. A smaller core did the real work.
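The concentration check is just counting and sorting. Below is a minimal sketch, assuming you have already extracted one owner ID per generated build; the sample IDs are hypothetical and are not the leaked data.

```python
from collections import Counter


def concentration(build_owners, top_n=3):
    """Return (builds per owner, share of all builds made by the top_n owners)."""
    per_owner = Counter(build_owners)
    total = sum(per_owner.values())
    if total == 0:
        return per_owner, 0.0
    top = sum(count for _, count in per_owner.most_common(top_n))
    return per_owner, top / total


# Hypothetical owner IDs, one entry per build.
sample = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
per_owner, top_share = concentration(sample, top_n=2)
print(per_owner.most_common(2), top_share)
```

If the top two or three owners account for most of the builds, you are looking at a small productive core plus a long tail, which is the shape described above.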

This matters for defenders because "75 users" is not the same thing as "75 equally important operators". The useful target is the small set of people and workflows that keep the platform productive. The rest is noise, churn, and people trying to get rich with someone else's builder.

It also matters because the long tail is probably where a lot of the bad operational security lives. The core people might understand compartmentalization, or at least have learned by getting burned. The occasional affiliate is more likely to reuse handles, connect from home, paste the wrong thing into the wrong chat, or complain publicly when they do not get paid.

Marketplaces: Liquidity For Bad Ideas

BlackMarketReloaded sits in an older era, around the post-Silk-Road marketplace world. The leak contains 108,194 Bitcoin addresses, almost all legacy P2PKH addresses, which fits the 2013-2014 timeframe.

For comparison, the Lockbit panel had 59,975 Bitcoin addresses. The address counts are not directly comparable because the systems and years are different, but the scale is useful. These operations were not manually copying addresses from a wallet UI. They had automated address generation, order or victim separation, and enough workflow around payments that the wallet infrastructure became part of the product.

Marketplaces are the connective tissue. They make the rest of the economy easier to assemble:

* stolen data sellers find buyers;
* initial access brokers find ransomware affiliates;
* malware developers find packers and obfuscation services;
* cash-out specialists find people who need cryptocurrency turned into something spendable;
* reputation systems let strangers make just enough trust decisions to transact.

This is where the underground resembles a normal platform economy most closely. The marketplace does not need to commit every crime itself. It needs search, listings, escrow or pseudo-escrow, reputation, messaging, and enough dispute handling to keep fees flowing.

Of course, the problem is that dispute handling in a criminal market is mostly spectacle. If enough money disappears, the final appeal process is doxing, leaking, or rebranding.

Infrastructure Providers: The B2B Layer

The less visible part of the ecosystem is the supplier layer. Ransomware crews and forums need hosting, chat, crypting and packing, wallet automation, Tor hidden services, domain registration, and a steady flow of disposable infrastructure.

BlackBasta, for example, used a self-hosted Matrix server. That is a sensible choice from their perspective. Commercial chat providers can receive legal requests and suspend accounts. Self-hosted chat gives more control, but also creates a juicy central point of failure. If the server or backups leak, the organization leaks with it.

Lockbit and BlackMarketReloaded both show large-scale Bitcoin address management. That is specialized engineering. It might be custom code, it might be a library glued into a panel, but either way someone had to make payment operations work at scale.

Tor hidden services are another example. They look anonymous from the outside, but the private keys and deployment process become crown jewels. Once those keys are leaked, the hidden service identity is burned. Any trust attached to that address burns with it.

This is one of the more useful defender takeaways: infrastructure is not just an indicator source, it is an organizational dependency. When a group self-hosts, automates wallets, or relies on a particular supplier, it creates places where disruption has effects beyond a single malware sample.

Forums And Fringe Communities: Same Mistakes, Different Crime

The COPP, NLCOPP, and ZooVille data is not ransomware data, but it is still useful because the human failure modes rhyme.

COPP had 387,392 accounts and 88,154 connection logs. ZooVille had more than 71,000 user records. Between ZooVille and COPP, there were 3,851 shared usernames. Between COPP and NLCOPP, there were 1,050 shared usernames, around 21.6% of the 4,856 NLCOPP users.

This is not advanced correlation. It is string matching.

That is exactly the point. A lot of exposure does not require clever graph analytics or a classified dataset. People reuse handles because they want continuity, recognition, convenience, or because they simply do not think about it. Once one platform leaks, the same handle becomes a thread you can pull across other communities.
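To show how little machinery that string matching needs, here is a minimal sketch. The normalization step (trim and case-fold) is an assumption on my side; an exact, case-sensitive match would be stricter and return fewer hits. The handles are made up.

```python
def normalize(handle: str) -> str:
    """Trim and case-fold a handle so trivial variants compare equal."""
    return handle.strip().casefold()


def shared_handles(users_a, users_b):
    """Return the set of normalized handles that appear in both user lists."""
    return {normalize(u) for u in users_a} & {normalize(u) for u in users_b}


# Hypothetical handles for illustration only.
forum_a = ["DarkOwl", "night_fox ", "alice"]
forum_b = ["darkowl", "bob", "Night_Fox"]

overlap = shared_handles(forum_a, forum_b)
print(sorted(overlap))
```

A shared handle is a lead, not an identification: common names collide, so each hit still needs corroboration from other fields.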

The IP data is even less glamorous and more damaging. In the COPP logs, 99.5% of connections were residential ISP connections. Only 3 connections were classified as VPN usage.

That number is almost funny until you remember what it means. These were users of communities that should have been paranoid by default, and nearly everybody was connecting from normal home internet.

Security advice loses to convenience, again.

Where It Breaks

The failure modes are depressingly consistent.

Payment disputes create leaks. Operators can promise affiliates a percentage, but there is no court where an affiliate can recover unpaid commission. If the operator stops paying, the affiliate's leverage is reputational damage, leaking internal data, or attacking the operator.

That is what made the URSNIF material interesting: the leak was not only a technical event, it was a labor dispute in a criminal business. Someone believed they were owed something, and the enforcement mechanism was exposure.

Counter-hacking burns infrastructure. Criminal groups target each other for money, reputation, rivalry, or entertainment. When that happens, private keys, panels, chats, and databases become trophies. The practical result is the same as a law enforcement seizure in one respect: infrastructure that was trusted yesterday becomes unusable today.

Exit scams poison future markets. ExposedForums reportedly exits with users' money, then the same ecosystem reappears under another brand, then the new brand scams its own affiliates. This sounds absurd until you remember there is no regulator, no refund process, and no durable identity beyond reputation. Rebranding is cheap. Trust recovery is not.

Builder leaks fragment operations. Once a ransomware builder leaks, the operator does not merely lose exclusivity. The tooling can create copycat variants, confuse attribution, and spin up low-quality campaigns under adjacent names. Lockbit's history with leaked and copied builders is a good example of how control does not disappear cleanly. It splinters.

The common thread is simple: the technology can be professional, but the trust model is terrible.

The PGP And Reputation Problem

Criminal forums use reputation because they have to. Registration age, vouches, post count, transaction feedback, signed messages, and PGP keys all become substitutes for legal identity.

PGP is especially interesting because it gives a portable identity. If a user signs a message with the same key across forums, buyers can treat that as continuity. If the private key leaks, the continuity turns into a liability. Anyone with the key can impersonate the identity, and the old reputation becomes suspect.

Forum reputation has the same problem at a social level. It takes time to build and can disappear in one ugly thread. A failed escrow, an accusation, a leaked chat, a missing payout, or an exit scam can collapse years of credibility. The dispute process is public argument plus whatever screenshots people decide to post.

This is why leaks are not rare accidents in this ecosystem. They are one of the few enforcement mechanisms available.

If you cannot sue, you leak.

Cross-Dataset Things That Were Worth Checking

Some of the more useful findings came from metadata, not content.

The Lockbit and BlackMarketReloaded Bitcoin datasets had 168,169 total addresses with zero overlap. Lockbit was 100% segwit-native in the analyzed set, while BlackMarketReloaded was 99.97% legacy. Even before doing transaction graph work, the address formats tell you something about era, tooling, and compartmentalization.
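The format check can be done on prefixes alone. The sketch below is a rough classifier for mainnet addresses that does not validate checksums, so treat the labels as triage hints. The two sample addresses are well-known public examples (the genesis block address and the BIP-173 test vector), used here only as stand-ins for the leaked sets.

```python
from collections import Counter


def address_format(addr: str) -> str:
    """Guess a mainnet Bitcoin address type from its prefix (no checksum validation)."""
    a = addr.strip()
    low = a.lower()
    if low.startswith("bc1q"):
        return "segwit-v0 (bech32)"
    if low.startswith("bc1p"):
        return "taproot (bech32m)"
    if a.startswith("1"):
        return "legacy P2PKH"
    if a.startswith("3"):
        return "P2SH"
    return "unknown"


def compare(addrs_a, addrs_b):
    """Return per-set format counts and the addresses present in both sets."""
    formats_a = Counter(address_format(a) for a in addrs_a)
    formats_b = Counter(address_format(b) for b in addrs_b)
    return formats_a, formats_b, set(addrs_a) & set(addrs_b)


segwit_era = ["bc1qw508d6qejxtdg4y5r3zarvary0c5xw7kv8f3t4"]
legacy_era = ["1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"]

formats_a, formats_b, overlap = compare(segwit_era, legacy_era)
print(formats_a, formats_b, len(overlap))
```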

The username overlaps were also useful. ZooVille to COPP produced 3,851 shared handles. COPP to NLCOPP produced 1,050. These are not proof that any one account belongs to the same human without more context, but they are excellent triage leads.

The BlackBasta chat timing was another good example. The 195,881 messages peaked around 12:00 UTC, which is 3:00 PM Moscow time. That does not prove location by itself, but it does show an operational rhythm. People have workdays, even when the work is crime.
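A minimal sketch of that timing check follows, assuming the message timestamps have already been parsed into timezone-aware `datetime` objects. The sample timestamps are invented, and the fixed UTC+3 offset reflects Moscow time as it has been since late 2014.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Moscow has used a fixed UTC+3 offset (no DST) since late 2014.
MSK = timezone(timedelta(hours=3), "MSK")


def hourly_counts(timestamps):
    """Count messages per hour of day, in UTC."""
    return Counter(ts.astimezone(timezone.utc).hour for ts in timestamps)


def to_moscow_hour(utc_hour: int) -> int:
    """Convert an hour of day in UTC to the same instant's hour in Moscow."""
    ref = datetime(2024, 1, 1, utc_hour, tzinfo=timezone.utc)
    return ref.astimezone(MSK).hour


# Hypothetical timestamps for illustration only.
sample = [
    datetime(2024, 1, 1, h, m, tzinfo=timezone.utc)
    for h, m in [(12, 5), (12, 40), (12, 59), (9, 0), (18, 30)]
]

counts = hourly_counts(sample)
peak_utc = counts.most_common(1)[0][0]
peak_msk = to_moscow_hour(peak_utc)
print(peak_utc, peak_msk)
```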

Lockbit negotiation response time was also striking. Median response was around 3.5 minutes. That is not a casual mailbox someone checks twice a day. That implies process, alerting, dedicated staff, automation, or some mix of those.
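How you define "response time" changes the number, so here is one possible definition as a sketch: the delay between the earliest unanswered victim message and the next operator message. That definition, the `victim`/`operator` labels, and the sample messages are assumptions for illustration, not the schema of the leaked negotiation data.

```python
from datetime import datetime, timedelta
from statistics import median


def response_minutes(messages):
    """messages: iterable of (datetime, sender), sender is 'victim' or 'operator'."""
    waits = []
    pending = None  # time of the earliest victim message still waiting for a reply
    for ts, sender in sorted(messages, key=lambda m: m[0]):
        if sender == "victim":
            if pending is None:
                pending = ts
        elif sender == "operator" and pending is not None:
            waits.append((ts - pending).total_seconds() / 60)
            pending = None
    return waits


# Hypothetical conversation for illustration only.
t0 = datetime(2024, 1, 1, 10, 0)
chat = [
    (t0, "victim"),
    (t0 + timedelta(minutes=2), "operator"),
    (t0 + timedelta(minutes=10), "victim"),
    (t0 + timedelta(minutes=11), "victim"),
    (t0 + timedelta(minutes=15), "operator"),
    (t0 + timedelta(minutes=30), "victim"),
    (t0 + timedelta(minutes=33, seconds=30), "operator"),
]

waits = response_minutes(chat)
print(waits, median(waits))
```

Using the median rather than the mean keeps one overnight gap from dominating the result.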

Again, none of this requires a Hollywood interface. It is mostly counting, grouping, and being disciplined about not overclaiming.

Takeaways For Defenders

The first takeaway is to study the system, not just the malware. Malware changes quickly, but roles and incentives are stickier. Operators need affiliates. Affiliates need payment. Markets need reputation. Infrastructure needs maintenance. Every dependency is a place where pressure can produce intelligence.

The second takeaway is that payment disputes are useful signals. When affiliates complain, something is already breaking. Those complaints can precede leaks, rebrands, or splinter groups.

The third takeaway is that metadata is often enough to start. Address formats, username reuse, ISP classification, chat timing, and builder counts can reveal structure without needing to read every message or reverse every binary.

The fourth takeaway is that OPSEC failure is normal. We like to talk about the careful actors because they are hard. The bulk of many communities is not careful. COPP's 99.5% residential connection pattern is a reminder that most people choose convenience almost all the time.

The fifth takeaway is that criminal infrastructure is fragile in boring ways. It fails because people do not get paid, keys leak, forums burn reputations, admins rebrand, builders escape, and users reuse handles.

That is good news for defenders, because boring failures are measurable.

So What Did This Actually Show?

The underground economy is structured, but not stable.

It has platforms, but weak governance. It has contractors, but no enforceable contracts. It has reputation, but reputation is easy to poison. It has cryptographic identity, but private keys leak. It has payment automation, but payments still become disputes. It has self-hosted infrastructure, but that just moves the breach target closer to the operators.

The Lockbit panel numbers are a good summary of the whole thing: 75 users, 88 builds, 246 victims, 59,975 Bitcoin addresses, 7 paid commissions, and 0 decryptions recorded.

That is not just a ransomware dataset. That is a broken business process with malware attached.

For defenders, the lesson is not "wait for criminals to make mistakes" in the abstract. The lesson is to map the roles, find the dependencies, and watch the trust boundaries. The leaks will keep happening because the trust problem is not a bug in this economy. It is part of the design.

Data Sources

* Lockbit panel database: users, builds, victims, Bitcoin addresses, commission records
* BlackBasta internal communications: 195,881 messages across 79 rooms and 50 participants
* Lockbit negotiations: 3,087 messages
* BlackMarketReloaded: 108,194 Bitcoin addresses
* COPP: 387,392 accounts and 88,154 IP logs
* NLCOPP: 4,856 users
* ZooVille: 71,017 email addresses and 71,497 usernames

Friday, 19 June 2026

Following a GitHub Malware Trail into LuaJIT, PEB Walking, Smart Contracts, and Screenshot Exfiltration

The starting point for this analysis was the OrchidFiles article on GitHub repositories distributing malware.

That write-up describes a campaign where fake or cloned-looking GitHub repositories point users to archive downloads. The archive layout matched the sample I had in hand: a small Windows launcher, a Lua runtime, and an obfuscated Lua script.

The launcher was almost boring:

start lua51.exe rest.txt

The interesting part was `rest.txt`.

At first glance, it looked like an obfuscated Lua script.
After instrumentation, it became clear that Lua was mostly the outer shell.
The payload was using LuaJIT FFI to behave like native Windows malware: manually resolving APIs, reading process internals, capturing the screen, querying a Polygon smart contract, and trying to upload data to attacker infrastructure.

This post walks through the analysis in a practical way: what the payload does, what the Process Environment Block is, why the LuaJIT harness was useful, and how the smart contract added another layer of infrastructure indirection.

The Lua Script Was Not Just Lua

LuaJIT includes an FFI, or Foreign Function Interface. FFI lets Lua code declare C types and call native functions directly. That means a Lua script can call Windows APIs without being a normal PE executable with a nice import table.

In normal Lua, code mostly lives inside the Lua runtime and calls whatever libraries the interpreter exposes. LuaJIT FFI changes that boundary. A script can describe C structs, function prototypes, pointer types, arrays, and callbacks, then pass raw pointers around almost like C code. From the analyst's point of view, the payload stops looking like a simple script and starts behaving like an in-memory native loader.

For a threat actor, that has several practical advantages:

- The payload can ship as text-like script content instead of a conventional compiled executable.
- Native API use can be hidden behind runtime-generated FFI declarations and dynamic symbol resolution.
- The script can allocate memory, cast pointers, copy bytes, and call function pointers.
- Windows internals can be re-created from C structure definitions inside the script.
- Static detection based on PE imports becomes less useful because the Lua host process may not import the suspicious APIs directly.

This is why LuaJIT FFI is relevant tradecraft here.
It gives the operator the convenience of a script with many of the capabilities of native Windows code. The same payload can keep most of its logic in obfuscated Lua, then cross into native execution only when it needs lower-level behavior such as PEB walking, export parsing, GDI screen capture, or WinINet communication.

It also complicates analysis. A static strings pass might show fragments like `ffi`, `cdef`, or Windows type names, but the real behavior only appears when the FFI declarations are decoded and the script starts resolving and calling symbols. That is why the harness focused so heavily on wrapping `ffi.cdef`, `ffi.cast`, `ffi.copy`, `ffi.load`, and `ffi.C`.

The payload declared Windows structures such as:

IMAGE_DOS_HEADER
IMAGE_NT_HEADERS32
IMAGE_NT_HEADERS64
IMAGE_EXPORT_DIRECTORY
PEB
PEB_LDR_DATA
LDR_DATA_TABLE_ENTRY
UNICODE_STRING
BITMAPFILEHEADER
BITMAPINFOHEADER
BITMAP
BITMAPINFO

That set of types tells a story. The script wanted to understand PE files, walk the Windows loader structures, parse exports, and build bitmap data. In other words, it was preparing to resolve APIs manually and capture an image.

A Short Detour: What Is the PEB?

One of the clearest artifacts from the trace was a 7-byte native stub:

64 a1 30 00 00 00 c3

On 32-bit Windows, this disassembles to:

mov eax, fs:[0x30]
ret

That returns a pointer to the PEB, the Process Environment Block.
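The byte-level reading of that stub is short enough to spell out. The sketch below decodes only this specific pattern and is not a general disassembler: `0x64` is the FS segment override prefix, `0xA1` is `mov eax, moffs32`, the next four bytes are a little-endian offset, and `0xC3` is `ret`. The function name is mine.

```python
import struct

STUB = bytes.fromhex("64a130000000c3")


def describe_stub(code: bytes) -> str:
    """Decode the 7-byte 'mov eax, fs:[imm32]; ret' pattern, or raise ValueError."""
    if len(code) != 7 or code[0] != 0x64 or code[1] != 0xA1 or code[6] != 0xC3:
        raise ValueError("not the expected 7-byte pattern")
    # Bytes 2..5 hold the 32-bit offset in little-endian order.
    (offset,) = struct.unpack_from("<I", code, 2)
    return f"mov eax, fs:[{offset:#x}]; ret"


print(describe_stub(STUB))
```

On 32-bit Windows the FS segment points at the thread's TEB, and offset 0x30 inside the TEB holds the PEB pointer, which is why these seven bytes are enough.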

The PEB is an internal Windows process structure. It contains details about the current process, including a pointer to loader data. That loader data includes lists of modules already loaded into the process, such as `kernel32.dll`, `ntdll.dll`, and other DLLs.

Why does that matter for malware analysis?

Because malware can use the PEB to find DLLs without asking Windows through normal, easy-to-spot APIs. Once it finds a DLL base address, it can parse the module's PE export table and locate functions by name. This technique avoids ordinary imports and makes static analysis noisier.

In this sample, the Lua code used LuaJIT FFI to:

1. Allocate executable memory.
2. Copy the 7-byte PEB getter into that memory.
3. Cast the memory to a function pointer.
4. Call it.
5. Walk the PEB loader lists.
6. Parse fake or real PE export tables.
7. Resolve Windows APIs dynamically.

That is a native loader technique implemented from Lua.

Why a Linux Lua Harness?

The sample clearly expected Windows, but the analysis environment was Linux. Running it normally would either fail early or, worse, execute real behavior in an uncontrolled environment.

So the goal was not to "run the malware." The goal was to let the Lua logic progress while stubbing every dangerous Windows action.

The harness wrapped LuaJIT FFI and logged operations such as:

ffi.cdef
ffi.load
ffi.new
ffi.cast
ffi.copy
ffi.string
ffi.C.<symbol>

It also built fake Windows structures:

- a fake PEB
- fake loader lists
- fake PE modules
- fake export tables

The fake modules included:

lua51.dll
kernel32.dll
ntdll.dll
advapi32.dll
wininet.dll
shell32.dll
shlwapi.dll
user32.dll
gdi32.dll
winbrand.dll

When the malware tried to resolve exports, the harness returned fake function pointers. When it called those functions, the harness logged the arguments and returned controlled values.

This approach is useful for this campaign because the same LuaJIT loader pattern may appear across many archives. Rather than detonating each sample on a real Windows host, analysts can extract:

- decoded FFI definitions
- resolved API names
- file paths
- registry keys
- network destinations
- smart contract addresses
- HTTP request bodies
- generated exfiltration payloads

All of that can be done while blocking actual file, registry, process, service, and network side effects.

What the Payload Did

With the fake Windows environment in place, the script completed execution under the harness.

It hid the console:

GetConsoleWindow
ShowWindow(SW_HIDE)

It resolved a broad Windows API surface:

LdrLoadDll
RtlInitUnicodeString
InternetOpenW
InternetConnectW
HttpOpenRequestW
HttpSendRequestW
InternetReadFile
InternetOpenUrlW
RegOpenKeyExW
RegQueryValueExW
OpenProcessToken
GetTokenInformation
SHGetFolderPathW
PathFileExistsW
GetDC
CreateCompatibleDC
CreateDIBSection
BitBlt
WinExec
CreateThread
WaitForSingleObject

API resolution does not automatically mean execution. For example, `WinExec` and `CreateThread` were resolved, but this trace did not show process creation.

The trace did show a registry open attempt:

HKLM\Software\Microsoft\Cryptography

That key is commonly interesting because `MachineGuid` lives under it. The harness blocked the registry operation before a value query completed, so we should avoid overstating this. The evidence supports "host identity collection attempt," not a confirmed value read.

The script also tried to read:

C:\Users\analysis\AppData\Roaming\-1.json

In the harness, that read failed because the file did not exist.

No service creation was observed. No file creation or file write was observed.

Screen Capture and Multipart Upload

The payload performed classic GDI screen-capture behavior:

GetDC
CreateCompatibleDC
CreateDIBSection
SelectObject
BitBlt

Then it built a multipart request body with a file part that started with `BM`, the BMP magic. Because this was a Linux/ARM64 FFI harness emulating Windows structures, the generated BMP header had widened fields and host tooling reported it as generic data. Structurally, though, the intent is clear: the payload constructed a BMP-like screenshot object and prepared it for upload.

The upload target observed in the trace was:

http://217.119.129.99/api/NTE3YjdjNWU1NjYzNjU2YTA1N2Y=

The base64 path component decodes to:

517b7c5e5663656a057f
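That decoding step is easy to reproduce offline. A minimal sketch using only the standard library; nothing here contacts the server, the URL is only parsed as a string:

```python
import base64
from urllib.parse import urlsplit

url = "http://217.119.129.99/api/NTE3YjdjNWU1NjYzNjU2YTA1N2Y="

# Take the last path segment and decode it as standard base64.
token = urlsplit(url).path.rsplit("/", 1)[-1]
decoded = base64.b64decode(token).decode("ascii")

print(decoded)
```

The result is itself a 20-character hex string. I have not established what it identifies, so it is best treated as an opaque campaign or client token.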

The multipart body included:

- a `file` part with a random-looking filename
- a `data` JSON field
- a base64 value that decoded to a long hex-looking blob, likely encrypted or obfuscated metadata

That was the first clear C2/exfiltration path.

The Smart Contract Layer

The sample also performed repeated Polygon JSON-RPC calls. The request body was:

{
"jsonrpc": "2.0",
"method": "eth_call",
"params": [
{
"to": "0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc",
"data": "0x3bc5de30"
},
"latest"
],
"id": 1
}

The selector `0x3bc5de30` maps to `getData()` according to 4byte.directory.

Replaying the call safely with `eth_call` returned an ABI-encoded string:

http://85.137.52.21

That means the malware was not just using the smart contract as noise. The contract was acting as a configuration pointer.
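For reference, an ABI-encoded `string` return value is a 32-byte offset, a 32-byte length, and then the UTF-8 bytes padded to a multiple of 32. The sketch below decodes that layout. The encoded input is generated locally by a small helper so the example is self-contained; it is not the raw response captured from the RPC node, and both function names are mine.

```python
def abi_encode_string(text: str) -> str:
    """Build the ABI encoding of a single string return value (for the demo only)."""
    data = text.encode("utf-8")
    padded = data + b"\x00" * (-len(data) % 32)
    head = (32).to_bytes(32, "big") + len(data).to_bytes(32, "big")
    return "0x" + (head + padded).hex()


def abi_decode_string(result_hex: str) -> str:
    """Decode the 'result' field of an eth_call that returns one string."""
    digits = result_hex[2:] if result_hex.startswith("0x") else result_hex
    raw = bytes.fromhex(digits)
    offset = int.from_bytes(raw[0:32], "big")               # where the string starts
    length = int.from_bytes(raw[offset:offset + 32], "big")  # byte length of the string
    start = offset + 32
    return raw[start:start + length].decode("utf-8")


encoded = abi_encode_string("http://85.137.52.21")
print(abi_decode_string(encoded))
```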

Storage inspection showed:

slot 0 = 0xde275ad38c3352a7cb6b0d3efcbf45900c9716f2
slot 1 = "http://85.137.52.21"

So the contract appears to store an admin address in slot 0 and the active data/config string in slot 1.
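Reading slot 1 as a string works because of how Solidity stores short strings: values of 31 bytes or fewer sit left-aligned in the slot, with `length * 2` in the lowest byte. The sketch below decodes that layout, assuming the contract was compiled with Solidity and uses the standard storage layout. The slot value is rebuilt locally from the URL for the demo and is not the raw `eth_getStorageAt` response.

```python
def decode_short_string_slot(slot_hex: str) -> str:
    """Decode a Solidity string of 31 bytes or fewer stored inline in one slot."""
    digits = slot_hex[2:] if slot_hex.startswith("0x") else slot_hex
    raw = bytes.fromhex(digits.rjust(64, "0"))  # left-pad in case zeros were trimmed
    marker = raw[31]
    if marker & 1:
        # Odd marker: long string, the slot only holds length * 2 + 1.
        raise ValueError("long string: the data lives at keccak256(slot)")
    return raw[: marker // 2].decode("utf-8")


# Rebuild what slot 1 would look like for the 19-byte URL (demo input only).
url = b"http://85.137.52.21"
slot_value = "0x" + (url.ljust(31, b"\x00") + bytes([2 * len(url)])).hex()

print(decode_short_string_slot(slot_value))
```

A longer URL would flip the low bit of the marker and move the data to a different slot, so a hunting script should handle that case rather than assume the short form.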

Other selectors identified in the bytecode:

0x092a5cce -> destroyContract()
0x3bc5de30 -> getData()
0x68446ead -> updateData(string)
0xf851a440 -> admin()
0x58eea4ad -> unknown selector, also returns the stored string

This is a useful campaign design pattern. The malware can ship with a fixed contract address and function selector. The operator can update the live infrastructure by changing contract storage, without modifying the Lua payload.

The Contract Pointed to 85.137.52.21

The returned URL introduced a second infrastructure lead:

http://85.137.52.21

Enrichment on `85.137.52.21` classified it as external infrastructure in:

Network: 85.137.52.0/24
Provider context: Virtual Systems LLC / VSYS Host
ASN: AS43641
Location: Amsterdam, Netherlands
Abuse contact: abuse-ams@v-sys.org

RIPE RDAP confirms the network as `85.137.52.0 - 85.137.52.255`, named `VSYS-AMS`, country `NL`, with a Virtual Systems Amsterdam abuse contact at `abuse-ams@v-sys.org`.

The provider context is notable. A 2024 IBCAP press release announced a lawsuit against Virtual Systems, alleging that it advertised a "DMCA Ignored" policy and ignored more than 500 infringement notices.

TorrentFreak later reported that DISH won a default judgment against Virtual Systems in November 2025, with damages of $41,850,000 and a permanent injunction.

That context does not prove that VSYS knowingly hosted this malware infrastructure. It does make the infrastructure choice interesting: the smart contract pointed to hosting associated with an offshore provider publicly discussed in relation to DMCA-ignored hosting and abuse-handling concerns.

From a defender perspective, this is exactly why the smart contract pivot matters. The initial trace exposed `217.119.129.99`; the on-chain configuration exposed `85.137.52.21`. Without querying the contract, that second IP would have been easy to miss.

Practical Reproduction: Safe Read-Only Contract Calls

The contract data can be investigated without sending transactions. Use read-only JSON-RPC methods such as `eth_call`, `eth_getCode`, and `eth_getStorageAt`.

Replay the malware's call:

curl -sS -X POST https://polygon-public.nodies.app \
-H 'content-type: application/json' \
--data '{"jsonrpc":"2.0","method":"eth_call","params":[{"to":"0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc","data":"0x3bc5de30"},"latest"],"id":1}'

Read the admin function:

curl -sS -X POST https://polygon-public.nodies.app \
-H 'content-type: application/json' \
--data '{"jsonrpc":"2.0","method":"eth_call","params":[{"to":"0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc","data":"0xf851a440"},"latest"],"id":1}'

Read raw storage slot 1:

curl -sS -X POST https://polygon-public.nodies.app \
-H 'content-type: application/json' \
--data '{"jsonrpc":"2.0","method":"eth_getStorageAt","params":["0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc","0x1","latest"],"id":1}'

Avoid state-changing functions such as `updateData(string)` and `destroyContract()`. For this kind of investigation, read-only calls are enough.

Detection and Hunting Ideas

Useful behavioral leads:

- Lua or LuaJIT launched by a batch file from an archive.
- LuaJIT processes using FFI to declare PE, PEB, and loader structures.
- Small executable allocation containing `64 a1 30 00 00 00 c3`.
- LuaJIT resolving APIs through `LdrLoadDll` and manual export parsing.
- GDI screen capture followed by WinINet multipart upload.
- Polygon JSON-RPC `eth_call` from non-browser, non-wallet processes.
- Calls to `0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc` with selector `0x3bc5de30`.
- Outbound HTTP to `217.119.129.99` or `85.137.52.21`.
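A few of those leads translate directly into simple matchers. This is a minimal sketch, assuming you can already get at memory buffers, captured JSON-RPC request bodies, and destination IPs from your own telemetry; the function names are hypothetical and not tied to any particular EDR or proxy log format.

```python
import json

PEB_STUB = bytes.fromhex("64a130000000c3")
IOC_IPS = {"217.119.129.99", "85.137.52.21"}
CONTRACT = "0x1823a9a0ec8e0c25dd957d0841e3d41a4474badc"  # lowercased for comparison
SELECTOR = "0x3bc5de30"


def contains_peb_stub(buf: bytes) -> bool:
    """True if the 7-byte PEB getter appears anywhere in the buffer."""
    return PEB_STUB in buf


def is_config_lookup(body: str) -> bool:
    """True if a JSON-RPC body is an eth_call to the contract with the known selector."""
    try:
        req = json.loads(body)
    except ValueError:
        return False
    if not isinstance(req, dict) or req.get("method") != "eth_call":
        return False
    params = req.get("params")
    if not isinstance(params, list) or not params or not isinstance(params[0], dict):
        return False
    call = params[0]
    return (str(call.get("to", "")).lower() == CONTRACT
            and str(call.get("data", "")).lower().startswith(SELECTOR))


def is_ioc_destination(ip: str) -> bool:
    """True if the destination IP is one of the two addresses seen in this analysis."""
    return ip.strip() in IOC_IPS


body = ('{"jsonrpc":"2.0","method":"eth_call","params":[{"to":'
        '"0x1823A9a0Ec8e0C25dD957D0841e3D41a4474bAdc","data":"0x3bc5de30"},"latest"],"id":1}')
print(is_config_lookup(body), is_ioc_destination("85.137.52.21"))
```

The stub match on its own will be noisy, since the same seven bytes show up in plenty of 32-bit loaders; it is more useful when combined with the process being a Lua host.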

Takeaways

There are three main lessons from this sample.

First, script malware can still be native malware. LuaJIT FFI gave this script direct access to Windows internals and native APIs.

Second, PEB walking is not limited to compiled loaders. Here, the malware implemented a native loader pattern from Lua by allocating a tiny x86 thunk and parsing PE exports through FFI-defined structures.

Third, blockchain infrastructure can be used as a small, resilient configuration layer. The payload did not need to hardcode every active endpoint. It could query a Polygon smart contract and recover an operator-controlled URL.

For this campaign, the GitHub repository is the lure, the Lua runtime is the execution substrate, LuaJIT FFI is the native bridge, the smart contract is the configuration pointer, and the observed payload behavior is screenshot collection and exfiltration.

Sources

OrchidFiles, GitHub repositories distributing malware: https://orchidfiles.com/github-repositories-distributing-malware
RIPE RDAP for `85.137.52.21`: https://rdap.db.ripe.net/ip/85.137.52.21
4byte signature lookup for `0x3bc5de30`: https://www.4byte.directory/api/v1/signatures/?hex_signature=0x3bc5de30
IBCAP press release on the Virtual Systems lawsuit: https://www.globenewswire.com/news-release/2024/10/16/2963975/0/en/IBCAP-announces-41-million-lawsuit-against-Virtual-Systems.html
TorrentFreak report on the 2025 default judgment: https://torrentfreak.com/dish-wins-42m-default-judgment-against-dmca-ignored-host-virtual-systems-251114/

Thursday, 14 May 2020

Hadoop / Spark2 snippets that took way too long to figure out

This is a collection of links and snippets that took me way too long to figure out; I've copied them here with a bit of documentation in the hope that others find them useful.

Copy data from one cluster to another

This snippet was written to copy a large dataset (42 TB) from one cluster to another; the default options don't play very well at that size.
We increase the number of mappers (-m100 seems to work well with 40 data nodes), raise the task timeout (-D mapred.task.timeout), run in update mode so that a failed copy can be resumed and retried (-update), ignore errors (-i), log them to HDFS for investigation (-log), and disable the CRC check (-skipcrccheck) because we are copying between two different encryption zones.

hadoop distcp  -D mapred.task.timeout=1800000 -pb -pp -m100 -update -copybuffersize 65536 -i -log /application/distcp -skipcrccheck -v  hdfs://namenode1.com/application/logs/year=2018/month=11 /application/logs/year=2018/month=11

A snippet of python to convert data from Avro (old cluster) to parquet (new cluster)

Note that this requires Spark 2.3 and to run it you'll need to add an extra package to be able to read the Avro files:

/bin/spark-submit --queue batch_queue --packages com.databricks:spark-avro_2.11:3.2.0 sparkConversion.py

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, types
from pyspark.sql.functions import udf
import logging

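# The root logger defaults to WARNING, so without this the INFO message below is dropped
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()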

appName = "spark test"

conf = SparkConf().setAppName(appName)
conf.set('spark.hadoop.avro.mapred.ignore.inputs.without.extension', 'false')

sc = SparkContext(conf=conf)
sc.setLogLevel('ERROR')
sqlContext = SQLContext(sc)

df = sqlContext.read.format("com.databricks.spark.avro").load("/application/day=26")
logger.info("Data frame loaded, it contains {0} element".format(df.count()))

df.write.format("org.apache.spark.sql.execution.datasources.parquet").save("/application/app1/year=2018/month=11/day=26/")

The starting point for a _start_ script:

In order, this script will:

* initialize Kerberos from a keytab
* submit the job in client mode (usually the best for reading from HDFS)
* force the use of the official HortonWorks repo to pull external artefacts
* add a property file to override some Spark defaults
* set up an HTTP proxy
* force the use of a given jaas file for the Kafka connection
* pass on the external Python modules in a zip
* invoke the job from the src directory

#!/bin/bash -xp

export SPARK_MAJOR_VERSION=2
#export PYSPARK_PYTHON=python3.4

BASE_DIR=$(dirname $0)
TOOL_DIRECTORY=$(realpath ${BASE_DIR})
ROOT_DIRECTORY=$(realpath ${BASE_DIR}/../)
DIST_DIRECTORY=$(realpath ${TOOL_DIRECTORY}/../dist/)
SRC_DIRECTORY=$(realpath ${TOOL_DIRECTORY}/../src)

cd  ${TOOL_DIRECTORY}

kinit -kt ~/user.keytab user@REALM.NET

/bin/spark-submit --master=yarn --deploy-mode=client \
--repositories http://repo.hortonworks.com/content/groups/public \
--properties-file ${TOOL_DIRECTORY}/spark-streaming.conf \
--executor-memory 4g \
--driver-memory 60g \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=./kafkajaas.conf -Dhttp.proxyHost=proxy.net -Dhttp.proxyPort=3128 -Dhttps.proxyHost=proxy.net -Dhttps.proxyPort=3128 -Dhttp.nonProxyHosts='localhost|*.realm.net|10.*.*.*'  -Djava.io.tmpdir=/scratch/tmp " \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./kafkajaas.conf -Djavax.net.ssl.trustStore=cacerts.jks -Dhttps.proxyHost=proxy.net -Dhttps.proxyPort=3128 -Dhttp.proxyHost=proxy.net -Dhttp.proxyPort=3128 -Dhttp.nonProxyHosts='localhost|*.realm.net|10.*.*.*'  -Djava.io.tmpdir=/scratch/tmp" \
--keytab ~/user.keytab \
--principal user@REALM.NET \
--supervise \
--py-files  /scratch/libs.zip \
${SRC_DIRECTORY}/$1

Problem: reading from a remote HDFS cluster which has Kerberos enabled
You start the Spark job on a DR/newer cluster and you want to access data located on the other DR site or on an older cluster, and as soon as Spark tries to access the remote FS you are greeted with this error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 42, host3.net, executor 1): java.io.IOException: DestHost:destPort host1.net:8020 , LocalHost:localPort host2.net/10.4.16.51:0. Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]

hdfs dfs commands perform exactly as they should and the user can access the files using other methods; only Spark fails.
This error means that Spark failed to obtain a delegation token when accessing the HDFS RPC services on the remote cluster.
So what is a delegation token?

When a cluster is Kerberized, applications need to obtain tickets before they can access services and run jobs; Kerberos is not only used for authentication, it is also used to allow users to access services, for instance from a MapReduce job.
If every worker task had to authenticate via Kerberos using a delegated TGT (Ticket Granting Ticket), the Kerberos Key Distribution Center (KDC) would be overwhelmed by the number of requests, so Delegation Tokens were introduced as a lightweight authentication method to complement Kerberos authentication.
The way Delegation Tokens work is:

1. The client initially authenticates with each server via Kerberos, and obtains a Delegation Token from that server.
2. The client uses the Delegation Tokens for subsequent authentications with the servers instead of reaching out to the KDC again.
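
To make the load argument concrete, here is a toy model in Python. It is not Hadoop's implementation: the `Kdc` and `Service` classes, the principal and the task count are all made up for illustration, and the only thing it measures is how many times the KDC gets contacted with and without delegation tokens.

```python
import secrets


class Kdc:
    """Toy KDC that only counts how often it is asked for a ticket."""

    def __init__(self):
        self.requests = 0

    def issue_ticket(self, principal, service_name):
        self.requests += 1
        return (principal, service_name)


class Service:
    """Toy service (think NameNode) that issues and checks its own tokens."""

    def __init__(self, name):
        self.name = name
        self._tokens = {}

    def get_delegation_token(self, ticket):
        principal, service_name = ticket
        if service_name != self.name:
            raise PermissionError("ticket was issued for another service")
        token = secrets.token_hex(16)
        self._tokens[token] = principal
        return token

    def authenticate(self, token):
        # No round trip to the KDC: the service validates its own token.
        return self._tokens.get(token)


def run_with_kerberos_only(kdc, service, principal, tasks):
    # Every task goes back to the KDC for a ticket.
    for _ in range(tasks):
        kdc.issue_ticket(principal, service.name)


def run_with_delegation_token(kdc, service, principal, tasks):
    # Step 1: authenticate once via Kerberos and get a delegation token.
    ticket = kdc.issue_ticket(principal, service.name)
    token = service.get_delegation_token(ticket)
    # Step 2: every task reuses the token instead of contacting the KDC.
    for _ in range(tasks):
        assert service.authenticate(token) == principal


kdc_a, kdc_b = Kdc(), Kdc()
hdfs = Service("hdfs")
run_with_kerberos_only(kdc_a, hdfs, "user@REALM.NET", tasks=1000)
run_with_delegation_token(kdc_b, hdfs, "user@REALM.NET", tasks=1000)
print(kdc_a.requests, kdc_b.requests)  # 1000 1
```

With 1000 tasks the KDC is hit 1000 times in the first case and once in the second, which is the whole point of the mechanism.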

In this case, Spark was not able to get the delegation tokens for the HDFS and YARN services, the two main components for writing data and running jobs.

So we need to force Spark to obtain a delegation token for both clusters; we can do this by overriding a property (as always) using

spark.yarn.access.hadoopFileSystems=hdfs://,hdfs://

Note that Spark must have access to the filesystems listed and Kerberos must be properly configured to be able to access them (either in the same realm or in a trusted realm), then Spark will acquire security tokens for each of the file systems so that the Spark application can access those remote Hadoop file systems.
This also means that you can only list *ACTIVE* name nodes in that property and not the standby ones.
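
A minimal sketch for building that property value, assuming you keep the list of active name nodes somewhere in your launcher. The host names `nn-new.example.net` and `nn-old.example.net` are placeholders (not the clusters from this post) and `access_filesystems_conf` is a hypothetical helper, not part of Spark.

```python
def access_filesystems_conf(namenode_uris):
    """Build the value of a --conf argument listing every HDFS to get a token for."""
    cleaned = []
    for uri in namenode_uris:
        uri = uri.strip().rstrip("/")
        if not uri.startswith("hdfs://"):
            raise ValueError("expected an hdfs:// URI, got %r" % uri)
        if uri not in cleaned:  # avoid listing the same filesystem twice
            cleaned.append(uri)
    return "spark.yarn.access.hadoopFileSystems=" + ",".join(cleaned)


conf = access_filesystems_conf([
    "hdfs://nn-new.example.net:8020",   # placeholder: active name node, local cluster
    "hdfs://nn-old.example.net:8020/",  # placeholder: active name node, remote cluster
])
print("--conf", conf)
```

The resulting string can be passed to spark-submit with --conf or written to the properties file used by the start script.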

Useful resources

http://mkuthan.github.io/blog/2016/09/30/spark-streaming-on-yarn/ for all things related to monitoring your Spark jobs once they reach production maturity.
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/ on all things Kerberos and hadoop-ish. REALLY A MUST.
https://spark.apache.org/docs/latest/running-on-yarn.html#kerberos on Spark & Kerberos.
https://bigdatapath.wordpress.com/2018/01/09/hadoop-delegation-tokens-explained/ on delegation tokens.
http://lzhbrian.me/mywiki_mkdocs/pyspark/ convenient PySpark snippet to get you started the right way

mardi 27 mars 2018

Lessons learned building a big data solution oriented towards security logs AKA a "security data lake"

My company decided to have a go at big data security log search. Most of the information out there talks about implementing machine learning and other features using (cool) Jupyter notebooks and proof-of-concept sort of setups. However, trying to implement an infrastructure that does this at large scale, reliably, turned out to be an epic story.

The high level requirements go like this:

* We want to be able to search events in near real time, perform full text search on events and create dashboards for specific use cases depending on the investigation.
* We want to store the data in both pristine and parsed format for at least a year for audit and investigation purposes.
* We want to be able to replay older logs through the pipeline to fix/update the parser-augmenter pipeline.

At this stage some of you might say Apache Metron or Apache Spot, and you would be right to do so. The problem we found was that the list of supported log parsers for Metron isn't that long and the parsers are tied to specific versions. This is not abnormal, but in our large and mixed environment (the telecom world has a lot of strange devices) the benefits were not so clear; on top of that, using 2 tools means 2 different data models, interfaces and so on to work with. The full story is a lot longer, and in retrospect some points were overlooked and some difficulties were overestimated. In particular, we overlooked the importance of having a tool that lets the analyst manipulate and query the data extensively; it's called Stellar on the Metron side.

Laying out the tracks

The bird's eye overview is around 3 steps: ingestion, parsing and augmentation, and storage.

Ingestion is typically done through a syslog server in a Linux environment and using a Windows event collector in a Microsoft shop. This is already a problem: neither of these provides high availability, and scaling can only be done via a divide and conquer approach (hosts A&B go on that server, C&D on that other server and so on). One approach to address that issue would be to use MiNiFi to read the log files from the syslog servers (active-active) and ship them to the NiFi cluster, where a DetectDuplicate processor ships only one copy of each log to the next layer. For Windows, you might want to use NXLog to ship your Windows logs either to files, and then use MiNiFi, or emit them directly towards Kafka. Once the logs have reached NiFi, the 1st thing to do is to wrap them in a JSON envelope to add the metadata (time of ingestion, server they come from, type of logs ...) and then ship them to Kafka. This will allow you to handle back pressure nicely and do point in time recovery if the subsequent part of the pipeline ever has problems or slows down.
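
The JSON envelope mentioned above can be sketched in a few lines of Python. This is only an illustration of the idea, not the NiFi flow we run: the field names `Id`, `IngestionTimestamp` and `Raw` are the ones the Spark streaming job further down reads, while `SourceServer`, `LogType` and the `wrap_event` helper are names I made up for the example.

```python
import json
import uuid
from datetime import datetime, timezone


def wrap_event(raw_line, source_server, log_type):
    """Wrap one raw log line in a JSON envelope before shipping it to Kafka."""
    envelope = {
        "Id": str(uuid.uuid4()),                                       # unique id to trace the event
        "IngestionTimestamp": datetime.now(timezone.utc).isoformat(),  # time of ingestion, in UTC
        "SourceServer": source_server,                                 # hypothetical field name
        "LogType": log_type,                                           # hypothetical field name
        "Raw": raw_line,                                               # pristine copy, never modified
    }
    return json.dumps(envelope, ensure_ascii=False)


message = wrap_event(
    "<34>2018-03-27T10:00:00+02:00 fw01 sshd[4242]: Accepted publickey for admin",
    "syslog-a",
    "linux_auth",
)
print(message)
```

Keeping the raw line untouched inside the envelope is what makes replaying older logs through a fixed parser possible later on.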
The above architecture results in something like this:

Parsing and augmentation is where the heavy duty processing is done.
To parse an event is to extract business data out of it, like IPs, usernames and URLs. You can do it with various tools depending on the skills and constraints of your shop. Our proof of concept used Logstash for that task: the pros are that it's easy to implement and very near real time oriented (low latency); the cons are that it doesn't handle back pressure very well (one slow parser will slow down all log sources) and it will generally require more fire power than the others. NiFi on the other hand is micro-batch oriented, which means more delay; if you have a hard requirement on low latency (less than 60 seconds) this is most likely a no go. On top of that, any non-trivial parsing on logs that aren't record oriented (Linux auth logs or Windows logs) is a massive pain IMHO; there's some work being done on that with record-oriented processors that can use grok expressions to do the parsing, but it's still suboptimal. It's a tough problem. Metron uses Apache Storm for its parsing to have enough fire power and flexibility to implement complex logic, but Storm is primarily Java, and since our team doesn't have enough Java devs we settled for Spark, since it has a Python API (PySpark).
To augment an event is to add more information to it. A simple example is adding the GeoIP information for an IP address; a more complex example would be adding the username associated with an internal IP address, or an internal user and IP reputation system (there's a lot to be done here so I'll probably expand on the subject in another post) ...
Again we'll use Spark for the job; there is a fair amount of boilerplate code to help you start properly and focus on the business logic rather than fight the tools to get them to work.
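
As a rough idea of what the augmentation step does, here is a plain Python sketch that could be called on each parsed event. Everything in it is an assumption for the sake of the example: the lookup tables are hard-coded stand-ins for a real GeoIP database and for DHCP/VPN/authentication data, and the field names `SourceIp`, `SourceUser` and `SourceGeo` are hypothetical.

```python
import ipaddress

# Hypothetical lookup tables, hard-coded for the example.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
USER_BY_INTERNAL_IP = {"10.4.16.51": "jdoe"}
GEO_BY_NETWORK = {
    ipaddress.ip_network("203.0.113.0/24"): {"country": "NL", "asn": 64500},
}


def augment(event):
    """Return a copy of a parsed event enriched with user or geo context."""
    enriched = dict(event)
    ip = ipaddress.ip_address(event["SourceIp"])
    if any(ip in network for network in INTERNAL_NETWORKS):
        # Internal address: attach the user we believe was behind it.
        enriched["SourceUser"] = USER_BY_INTERNAL_IP.get(str(ip), "unknown")
        return enriched
    for network, geo in GEO_BY_NETWORK.items():
        if ip in network:
            # External address: attach the geo information.
            enriched["SourceGeo"] = geo
            break
    return enriched


print(augment({"SourceIp": "10.4.16.51"}))
print(augment({"SourceIp": "203.0.113.7"}))
```

In a Spark job the same function would be mapped over the records, with the lookup tables shipped to the workers rather than defined inline.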

Storage is where we store the events for later display and reporting, and again we face an amusing issue in that the tools for long term archival and near real time searches are very different. For the real time search the choice was between Solr and ElasticSearch, and at the time we settled for Solr since it was included in the stack we use (Ambari) for the long term retention (HDFS). While Solr is not a bad product, feature wise it's lagging behind. On the UI front, the only generic interface that we found is "banana", which is a fork of Kibana 3, which is old to say the least... On top of all that, Elastic has been moving extremely fast, with the machine learning component in X-Pack and the Vega scriptable UI. The core business of Solr also seems to be content management search, which means that the typical workload is read heavy with few writes, exactly the opposite of a log search tool, where there will be massive and continuous writes and much fewer reads. As a result a lot of default values are not suitable for near real time search; see the link for the Solr wiki on the topic.

No matter which one you choose, keep in mind to spread your data over a large number of disks to max out the number of IOPS the cluster can perform and maximize the effect of the buffer cache, increase the number of shards to match the number of physical disks to enhance write & read performance, and keep the replicas down to a manageable number.

On the long term storage front, there are also a couple of things to consider, like the file format. There are a lot of discussions about which file format is best, but basically you want one of the packed/binary file formats like Avro, ORC or Parquet. Preferably pick one of the column oriented ones (which leaves out Avro, a row oriented format) that still allow you to change the schema (add or remove fields) over time without it being a breaking change. Knowing all that, we settled for Parquet because it has good support in Spark and its compression ratios are pretty good, which allows us to archive a bit more before using up all the disk space. You will also want to partition the data, which essentially means storing the files in different directories based on the time, something like /hadoop/proxy/year=2018/month=02/day=01/file.parquet. This makes it easy to write a scrubber job to delete old data per type depending on the retention period of your logs, pretty much the same way we used to do it on a syslog server.
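
The partition layout and the scrubber logic can be sketched as follows. This is a minimal example under my own assumptions: the helper names and the 365 day retention are illustrative, and the sketch only computes which directories are expired; actually deleting them (for instance with hdfs dfs -rm -r) is left out on purpose.

```python
from datetime import date, timedelta


def partition_path(base_dir, day):
    """Directory holding the files for one day, e.g. .../year=2018/month=02/day=01."""
    return "{0}/year={1:04d}/month={2:02d}/day={3:02d}".format(
        base_dir.rstrip("/"), day.year, day.month, day.day)


def parse_partition(path):
    """Recover the date from a year=/month=/day= style path."""
    parts = dict(p.split("=", 1) for p in path.split("/") if "=" in p)
    return date(int(parts["year"]), int(parts["month"]), int(parts["day"]))


def expired_partitions(paths, today, retention_days):
    """Return the partition directories older than the retention period."""
    cutoff = today - timedelta(days=retention_days)
    return [p for p in paths if parse_partition(p) < cutoff]


paths = [
    partition_path("/hadoop/proxy", date(2017, 1, 15)),
    partition_path("/hadoop/proxy", date(2018, 2, 1)),
]
print(expired_partitions(paths, today=date(2018, 3, 27), retention_days=365))
# ['/hadoop/proxy/year=2017/month=01/day=15']
```

Because the date is encoded in the path, the scrubber never has to open a file to decide whether it can go.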

Now that we have laid out the tracks, the remaining topics are metrics, out of memory errors, Kerberos, config management, writing a Spark job for the parsing, setting up Hive tables as external because you want to keep the data, ...

Spark streaming with Kerberos enabled on Ambari

In the scope of a research project, I'm parsing syslog logs to stuff them into HDFS (another post on that will follow) and one of the requirements is to have Kerberos enabled on the Kafka brokers. Well, it turns out it's a bit more trouble than I bargained for and not so well documented.
So here's a brain dump in case you find yourself in the same situation.

My pySpark(2) job flow is as follows: it uses a Kafka stream to read from one topic, parses the log line (raw field), adds some fields and then emits the resulting JSON message to another topic. I kept the thing close to what I'm doing so that it's a somewhat relevant example and not just a simple word count that doesn't illustrate much in my mind.

Set up the jaas file

To tell the JVM we're using Kerberos, what credentials to use and all that, we need a jaas file. Once again, all the examples I've seen use password based auth, which is irrelevant in any production context (again ymmv), so here's a keytab based jaas file:

Client {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        keyTab="/home/user/user.keytab"
        useTicketCache=true
        serviceName="zookeeper"
        debug=true
        principal="user@DOMAIN.NET";
};
KafkaClient {
        com.sun.security.auth.module.Krb5LoginModule required
        useKeyTab=true
        keyTab="/home/user/user.keytab"
        useTicketCache=true
        serviceName="kafka"
        debug=true
        principal="user@DOMAIN.NET";
};

Create a keytab

We need to create the keytab file referenced by the jaas file; you can create it with these instructions:

$ ktutil
ktutil:  addent -password -p user@DOMAIN.NET -k 1 -e aes256-cts-hmac-sha1-96
Password for user@DOMAIN.NET: (ktutil is prompting for the password)
ktutil:  wkt user.keytab
ktutil:  q
$ kinit -kt user.keytab user@DOMAIN.NET

Now that the JVM Kerberos setup is ok and the keytab is initialized, we can do the adjustments in the Python code.

Setup spark streaming

ssc = StreamingContext(sc, 1)

kafkaStream = KafkaUtils.createStream(ssc, 'zookeeper:2181', consumer_name, {'ingest-topic': 1},
    kafkaParams={'security.protocol': 'PLAINTEXTSASL',
                 'sasl.kerberos.service.name': 'kafka',
                 'group.id': consumer_name,
                 'rebalance.backoff.ms': '5000',
                 'zookeeper.session.timeout.ms': '10000'},
    storageLevel=StorageLevel.MEMORY_AND_DISK_SER)

kafkaStream.foreachRDD(handler)

The important bit for Kerberos is 'kafkaParams'. Note that the security.protocol is *really* 'PLAINTEXTSASL'; some other parts of the ecosystem use SASL_PLAINTEXT but this one uses PLAINTEXTSASL.
The SASL service name is linked to the content of the jaas file, so make sure they match, as that's the way the JVM finds the other options required.

Setup kafka producer

Now we want to emit to Kafka, using KafkaProducer:

producer = KafkaProducer(security_protocol = "SASL_PLAINTEXT",
  sasl_mechanism="GSSAPI",
  bootstrap_servers="kafkabroker:6667",
  client_id=consumer_name)

Note that this time the security protocol is "SASL_PLAINTEXT" and no longer PLAINTEXTSASL. Even though these are 2 different libraries, a bit of consistency would be nice. Note also that you'll need the GSSAPI libraries (dev packages) and the Python gssapi module for it to support Kerberos.

Fire it up

Now we need to start our Spark job with the following launch script:

#!/bin/bash -xp
export SPARK_MAJOR_VERSION=2
/bin/spark-submit --master=yarn --deploy-mode=cluster \
--files kafka_jaas.conf,myuser.keytab \
--repositories http://repo.hortonworks.com/content/groups/public \
--packages  org.apache.spark:spark-streaming-kafka-0-8_2.11:2.1.1.2.6.1.0-129 \
--properties-file spark-streaming.conf \
--driver-memory 5g \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=kafka_jaas.conf" \
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=kafka_jaas.conf -Djavax.net.ssl.trustStore=ca_certs.jks" \
main.py

The important bits here are the extraJavaOptions pointing to the jaas file we want to use and the packages option that tells Spark to load the extra jars that allow Kafka streaming to work.
Note that running a PySpark job like this isn't really practical; I recommend using something like https://github.com/ekampf/PySpark-Boilerplate that sets up a zip file with all your data files and deps. Remember that since the Spark job will be distributed all over your workers, they all need the exact same copy of the script and dependencies.
You'll also need to bootstrap the keytab somewhere in your job, simply using

import subprocess
print('[+] Setting up Kerberos: \n%s\n' % subprocess.check_output(['kinit', '-kt', 'user.keytab', 'user@DOMAIN.NET']))

did the trick for me.

Debugging

One additional trick if it doesn't work is to add the following lines before doing anything else in your script:

# to debug kerberos problems
import logging
logging.basicConfig(filename='/tmp/example.log',level=logging.DEBUG)

It will generate a massive amount of logs but at least it gives you some data to help you debug the thing.

Putting everything together

# Import dependencies
from __future__ import print_function

import json
import subprocess

import pytz
from dateutil.parser import parse
from kafka import KafkaProducer
from pygrok import Grok
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

consumer_name = "SyslogImporter"
# Raw string so that \s and \[ reach the regex engine untouched
syslog_pattern = r"(<%{NUMBER:SyslogPriority}>)?%{TIMESTAMP_ISO8601:Timestamp}\s%{HOSTNAME:SourceHostname}\s(%{SYSLOGPROG:SyslogProgram}\[%{NUMBER:SyslogProgramId}\])?"


def grok_syslog(event):
    # Parse the 'Raw' field of the JSON envelope with the syslog grok pattern
    grok = Grok(syslog_pattern)
    raw = json.loads(event)
    parsed_syslog = grok.match(raw['Raw'])
    if parsed_syslog:
        parsed_syslog["tags"] = ["syslog"]
        parsed_syslog["parsers"] = [consumer_name]
    else:
        # Not a syslog line: keep the envelope and fall back to the ingestion time
        parsed_syslog = json.loads(event)
        parsed_syslog["Timestamp"] = parsed_syslog["IngestionTimestamp"]
        parsed_syslog["tags"] = ["not_syslog"]
        parsed_syslog["parsers"] = [consumer_name]
    # Merge the envelope and the parsed fields; parsed fields win on conflicts
    new_event = dict(raw)
    new_event.update(parsed_syslog)
    return new_event


def parse_syslog(event):
    # Normalize the event timestamp to UTC in ISO 8601 format
    utc_tz = pytz.timezone('UTC')
    log = grok_syslog(event)
    ts = log['Timestamp']
    ts_dt = parse(ts)
    log['Timestamp'] = ts_dt.astimezone(utc_tz).isoformat()
    return log


def handler(message):
    records = message.collect()
    for raw_record in records:
        # saving the original to send it as such in case of exception
        syslog_record_string = raw_record[1]
        event_id = "0"
        try:
            syslog_record = parse_syslog(raw_record[1])
            event_id = syslog_record["Id"]
            syslog_record_string = json.dumps(syslog_record, ensure_ascii=False).encode('utf8')
        except Exception as e:
            print("[-] Something barfed: " + str(e))
            print(raw_record)
        producer.send(b'work-topic', bytes(syslog_record_string))
        producer.flush()


def main():
    sc = SparkContext(appName=consumer_name)
    ssc = StreamingContext(sc, 1)

    kafkaStream = KafkaUtils.createStream(ssc, 'zookeeper:2181', consumer_name, {'ingest-topic': 1},
        kafkaParams={'security.protocol': 'PLAINTEXTSASL',
                     'sasl.kerberos.service.name': 'kafka',
                     'group.id': consumer_name,
                     'rebalance.backoff.ms': '5000',
                     'zookeeper.session.timeout.ms': '10000'},
        storageLevel=StorageLevel.MEMORY_AND_DISK_SER)

    kafkaStream.foreachRDD(handler)

    # Start the streaming context
    ssc.start()
    ssc.awaitTermination()


if __name__ == "__main__":
    # Get a Kerberos ticket from the keytab before talking to Kafka
    print('[+] Setting up Kerberos: \n%s\n' % subprocess.check_output(['kinit', '-kt', 'user.keytab', 'user@DOMAIN.NET']))
    # Module level on purpose: handler() uses this producer on the driver
    producer = KafkaProducer(security_protocol="SASL_PLAINTEXT",
                             sasl_mechanism="GSSAPI",
                             bootstrap_servers="kafkabroker:6667",
                             client_id=consumer_name)
    main()

lundi 2 octobre 2017

Incident response on a shoestring budget

This blog post is the first in a series that aims to set up basic security monitoring on a low budget: nothing but a few servers, gray matter and a lot of open source software, while trying to avoid vendor$ and all.

One of the problems to solve on a large network is to know who is talking to whom. Host to host communications are easy to monitor using bro or netflow, but knowing which software communicates over which connection is a whole different problem (sometimes called endpoint visibility).
The good news is that Facebook released osquery and it is now available on all platforms; osquery allows you to access pretty much all the data of a computer using SQL.
For instance something like

osquery> SELECT uid, name FROM listening_ports l, processes p WHERE l.pid=p.pid;
+------+-------------+
| uid  | name        |
+------+-------------+
| 1000 | dbus-daemon |
+------+-------------+

would give you the uid and process name of every process having listening ports (server processes if you want).
It also has a daemon (osqueryd) that allows you to schedule queries, and you can save the output of those queries in ElasticSearch for stacking and analysis.

The plan is to have queries running at regular intervals that give you the process name, uid, checksum (sha1) and the connections.
Something like this:

osquery> select action, protocol, local_address, local_port, remote_address, remote_port, uid, name FROM processes p, socket_events s WHERE s.pid=p.pid;

scheduled at regular intervals would do the trick.
Once set up, it will return something like this at regular intervals:

{
  "name": "network_info",
  "hostIdentifier": "goldorak",
  "calendarTime": "Tue Dec 13 15:10:08 2016 UTC",
  "unixTime": "1481641808",
  "decorations": {
    "host_uuid": "FC1D7B01-5138-11CB-B85D-C04D3A0C6645",
    "username": "someusername"
  },
  "columns": {
    "action": "connect",
    "local_address": "",
    "local_port": "0",
    "name": "DNS Res~er #115",
    "protocol": "14578",
    "remote_address": "192.168.1.1",
    "remote_port": "53",
    "uid": "1000"
  },
  "action": "added"
}
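
To give an idea of what stacking these results looks like before involving ElasticSearch, here is a small Python sketch. It assumes the results are read one JSON document per line, which is how osqueryd writes its results log; the `stack_connections` helper is hypothetical and the sample line is a trimmed copy of the record above.

```python
import json
from collections import Counter


def stack_connections(log_lines):
    """Count how often each (process, remote address, remote port) shows up."""
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        result = json.loads(line)
        # Only keep new rows coming from the network_info scheduled query.
        if result.get("name") != "network_info" or result.get("action") != "added":
            continue
        columns = result["columns"]
        key = (columns["name"], columns["remote_address"], columns["remote_port"])
        counts[key] += 1
    return counts


sample_line = json.dumps({
    "name": "network_info",
    "hostIdentifier": "goldorak",
    "columns": {
        "action": "connect",
        "name": "DNS Res~er #115",
        "remote_address": "192.168.1.1",
        "remote_port": "53",
        "uid": "1000",
    },
    "action": "added",
})
counts = stack_connections([sample_line, sample_line, ""])
# Rarest combinations first: on a quiet network those are the interesting ones.
for key, seen in sorted(counts.items(), key=lambda item: item[1]):
    print(seen, key)
```

Sorting by ascending count is the usual stacking trick: whatever only a single host or process does tends to deserve a look first.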

The next thing I want to look at is the network forensic problem, which I define like this: I want to know at any point in time which host on my network talks to which other host. Ideally, I also want to know the protocol and the associated metadata, like the SSL/TLS cert, extracted files and so on. Since we already have the same sort of information from osquery (pid, process name and connection ...), we should ideally be able to stitch it all together to have a global view. The source could be netflow or bro metadata, and the same stitching could be done for Snort or Suricata alerts to add more context and details for the analyst.

vendredi 29 septembre 2017

Hive external tables shows no data

I'm currently working on a project that ingests a lot of security related data (firewall logs, HTTP proxy logs, ...) and the goal is to keep them for a year or so to be able to do analytics and go back in time in case of a security incident to get a clue about what the "bad guy" did (follow a compromised asset or user).

The plan is to store the data in Avro format on HDFS and then create Hive tables on top; ideally the tables should be external so that Hive doesn't remove the original files in case I need them for something else.

To create an external table using avro formatted data you typically do this:

CREATE EXTERNAL TABLE cyber.proxy_logs
PARTITIONED BY (year STRING, month STRING, day STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/cyber/proxy_logs/'
TBLPROPERTIES ('avro.schema.literal'=' { "type" : "record", "name" : "proxy", "fields" : [ { "name" : "DeviceHostname", "type" : "string", "default" : "" }, { "name" : "ResponseTime", "type" : "string", "default" : "" }, { "name" : "SourceIp", "type" : "string", "default" : "" }, { "name" : "SourcePort", "type" : "string", "default" : "" }, { "name" : "Username", "type" : "string", "default" : "" } ] } ');

This will create a table mapped to our data, the files in HDFS are organized like this:

/hadoop/proxy/year=2017/month=09/day=19
/hadoop/proxy/year=2017/month=09/day=20
...

That gives us 3 levels of partitioning, making it easy to do operations on a yearly, monthly and daily basis, or at least that's the plan.
Once you've done that, you need to tell Hive to add your partitions using:

ALTER TABLE proxy_logs ADD PARTITION(year=2017, month="09", day=19);
...

We need to do that every day since our partitioning is done daily. At this point, when I ran a

select count(*) from proxy_logs

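I didn't get any result at all... It turns out you also need to recompute the statistics for the table metadata to be up to date (most likely because Hive can answer a count(*) straight from the metastore statistics when hive.compute.query.using.stats is enabled). I did this with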

ANALYZE TABLE proxy_logs partition(year=2017, month="09", day=19) compute statistics;

and only then was I able to access my logs. I will now need to do this for every partition of every table, time to get scripting!
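
A first pass at that scripting could look like the sketch below. It only generates the HiveQL statements for a range of days; the `daily_partition_statements` helper is a name I made up, the partition values are quoted as strings since the columns are declared as STRING, and I added IF NOT EXISTS so the script can be re-run safely.

```python
from datetime import date, timedelta


def daily_partition_statements(table, first_day, last_day):
    """Generate the ADD PARTITION and ANALYZE statements for each day in the range."""
    statements = []
    day = first_day
    while day <= last_day:
        spec = "year='{0:04d}', month='{1:02d}', day='{2:02d}'".format(
            day.year, day.month, day.day)
        statements.append(
            "ALTER TABLE {0} ADD IF NOT EXISTS PARTITION ({1});".format(table, spec))
        statements.append(
            "ANALYZE TABLE {0} PARTITION ({1}) COMPUTE STATISTICS;".format(table, spec))
        day += timedelta(days=1)
    return statements


statements = daily_partition_statements("proxy_logs", date(2017, 9, 19), date(2017, 9, 20))
print("\n".join(statements))
```

The output can be written to a file and fed to hive -f or beeline -f from a daily cron job, one call per table.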