Hardwood: Building a Parquet Parser From Scratch (With a Little Help From AI)

Gunnar Morling (@gunnarmorling)
© Dale Cruse https://flic.kr/p/2s68Vpa (CC BY 4.0)

#Hardwood #Parquet #Java · @gunnarmorling

What if Parquet parsing in Java didn’t need Hadoop, and used all your cores?

Agenda

Gunnar Morling
Technologist at Confluent
• Creator of Hardwood
• Former project lead of Debezium
• kcctl 🧸, JfrUnit, ModiTect, MapStruct
• One Billion Row Challenge 1⃣🐝🏎
• Java Champion

Apache Parquet © Barry Silver https://flic.kr/p/8A6XQV (CC BY 2.0)

Row-Oriented vs. Columnar-Oriented

Row-oriented (CSV, JSON, Avro)    | Columnar (Parquet, ORC, Arrow)
R1: id=1, name=Alice, fare=12.5   | id:   1 2 3 ...
R2: id=2, name=Bob, fare=8.3      | name: Alice Bob Carol
R3: id=3, name=Carol, fare=21.0   | fare: 12.5 8.3 21.0

• Row-oriented is best for: per-row writes, replay one record
• Columnar is best for: scan a few columns across millions of rows

Why Columnar?

Example
-----------------------------------------------------------------
SELECT AVG(fare) FROM trips WHERE pickup_date > '2025-01-01'

Row format: read every byte of every row
Columnar:   read 2 columns only    -> 10-100x less I/O
            similar values cluster -> 3-10x compression
-----------------------------------------------------------------

• Where you find it
  ◦ Data lakes on S3 etc., Open Table formats
  ◦ OLAP engines: Spark, Trino, DuckDB, ClickHouse, Snowflake
  ◦ In-memory analytics: Apache Arrow
  ◦ ML feature stores; analytical mirrors of OLTP databases
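To make the "read 2 columns only" point concrete, here is a minimal sketch of the example query running over a toy in-memory columnar layout. It is not Parquet and not Hardwood code; the class name, the arrays and the sample values are assumptions made up for illustration.

```java
import java.time.LocalDate;

// Illustrative only: a toy in-memory "columnar" table, not Parquet itself.
final class ColumnarScanSketch {

    // One array per column; index i across the arrays forms row i.
    // Further columns (id, name, ...) would live in their own arrays
    // and are never touched by the query below.
    static final LocalDate[] PICKUP_DATE = {
        LocalDate.of(2024, 12, 30), LocalDate.of(2025, 1, 2), LocalDate.of(2025, 3, 14)
    };
    static final double[] FARE = { 12.5, 8.0, 21.0 };

    // SELECT AVG(fare) FROM trips WHERE pickup_date > cutoff
    static double averageFareAfter(LocalDate cutoff) {
        double sum = 0;
        int count = 0;
        for (int i = 0; i < FARE.length; i++) {
            if (PICKUP_DATE[i].isAfter(cutoff)) {
                sum += FARE[i];
                count++;
            }
        }
        // No matching rows: there is no average to report.
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        System.out.println(averageFareAfter(LocalDate.of(2025, 1, 1)));
    }
}
```

Only the two arrays named in the query are scanned; in a row format the same loop would have to step over every field of every record.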

Nested Data: The Dremel Trick
Definition & Repetition Levels

Schema:
message Doc {
  required int64 id;
  repeated group authors {
    required string name;
  }
}
max D = 1, max R = 1 (authors is the one repeated parent)

Records:
Doc(id=1, authors=[Alice, Bob])
Doc(id=2, authors=[])
Doc(id=3, authors=[Carol])

Column 'authors.name' on disk:
value:   Alice  Bob  NULL  Carol
D-level: 1      1    0     1
R-level: 0      1    0     0

D=1 -> a name is present
D=0 -> the list was empty
R=0 -> start of a new Doc
R=1 -> next author in same Doc
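The level assignment above can be sketched in a few lines. This is a simplified illustration for exactly this schema (one repeated parent, so both levels are 0 or 1), not Hardwood's implementation; `DremelLevelsSketch`, `Entry`, `shred` and `assemble` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch for the Doc schema only: max definition level 1, max repetition level 1.
final class DremelLevelsSketch {

    /** One entry of the 'authors.name' column: a value plus its two levels. */
    static final class Entry {
        final String value; // null when the authors list is empty
        final int definitionLevel;
        final int repetitionLevel;

        Entry(String value, int definitionLevel, int repetitionLevel) {
            this.value = value;
            this.definitionLevel = definitionLevel;
            this.repetitionLevel = repetitionLevel;
        }
    }

    /** Turns each Doc's authors list into column entries. */
    static List<Entry> shred(List<List<String>> authorsPerDoc) {
        List<Entry> column = new ArrayList<>();
        for (List<String> authors : authorsPerDoc) {
            if (authors.isEmpty()) {
                // D=0: nothing defined below 'authors'; R=0: a new Doc starts.
                column.add(new Entry(null, 0, 0));
                continue;
            }
            for (int i = 0; i < authors.size(); i++) {
                // First author starts a new Doc (R=0), later ones repeat (R=1).
                column.add(new Entry(authors.get(i), 1, i == 0 ? 0 : 1));
            }
        }
        return column;
    }

    /** Inverse of shred; assumes well-formed input whose first entry has R=0. */
    static List<List<String>> assemble(List<Entry> column) {
        List<List<String>> docs = new ArrayList<>();
        for (Entry entry : column) {
            if (entry.repetitionLevel == 0) {
                docs.add(new ArrayList<>());
            }
            if (entry.definitionLevel == 1) {
                docs.get(docs.size() - 1).add(entry.value);
            }
        }
        return docs;
    }
}
```

Shredding the three example records yields the four entries shown on the slide, and `assemble` recovers the original lists, including the empty one for Doc 2.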

Apache Parquet: A Long Tail of Capabilities
• Different compression algorithms
• Many different encodings
• Predicate push-down (statistics, bloom filters)
• VARIANT column type
• Encryption

Hardwood © Michael Coghlan https://flic.kr/p/PDwXXS (CC BY-SA 2.0)

Why Build a New Parser?
parquet-java pain points
• Pulls in Hadoop: …and a truckload of transitive deps
• Reader is single-threaded: leaves cores idle on modern hardware
But also: Explore how far LLMs will take you?

Hardwood Goals
What we set out to build
• Light-weight: implement the Parquet format with zero mandatory transitive dependencies
• Compatible: read every file that parquet-java reads
• Fast: match or exceed parquet-java's read throughput
• Concurrent: multi-threaded at the core
• Embeddable: usable from native CLIs, S3-only pipelines etc.

Reading a Parquet File

try (ParquetFileReader reader = ParquetFileReader.open(InputFile.of(path));
     RowReader rows = reader.rowReader()) {
    while (rows.hasNext()) {
        rows.next();
        long id = rows.getLong("id");
        String name = rows.getString("name");
        LocalDate birth = rows.getDate("birth_date");
        // ... typed primitives, no auto-boxing
    }
}

Query Controls: Projection

// Column projection
try (RowReader r = reader.buildRowReader()
        .projection(ColumnProjection.columns("id", "name", "created_at"))
        .build()) {
    while (r.hasNext()) {
        r.next();
        long id = r.getLong("id");
        String name = r.getString("name");
        Instant ts = r.getTimestamp("created_at");
    }
}

Query Controls: Filters

// Predicate pushdown (row group + page + record)
FilterPredicate where = FilterPredicate.and(
    FilterPredicate.gtEq("salary", 50_000L),
    FilterPredicate.lt("age", 65),
    FilterPredicate.isNotNull("email"));

FilterPredicate after = FilterPredicate.gt("birth_date", LocalDate.of(2000, 1, 1));
FilterPredicate amount = FilterPredicate.gtEq("amount", new BigDecimal("99.99"));
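The deck does not show how the filter is evaluated against row groups and pages, so the following is only a sketch of the general min/max-statistics technique that Parquet readers commonly use for this. `StatsPushdownSketch` and `LongStats` are hypothetical names and not part of Hardwood's API.

```java
// Sketch of statistics-based skipping; not Hardwood's actual filter evaluation.
final class StatsPushdownSketch {

    /** Hypothetical min/max statistics of one column within one row group or page. */
    static final class LongStats {
        final long min;
        final long max;

        LongStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** column >= threshold can only match if the largest value reaches the threshold. */
    static boolean mightMatchGtEq(LongStats stats, long threshold) {
        return stats.max >= threshold;
    }

    /** column < bound can only match if the smallest value lies below the bound. */
    static boolean mightMatchLt(LongStats stats, long bound) {
        return stats.min < bound;
    }

    /** salary >= 50_000 AND age < 65: one impossible branch rules out the whole chunk. */
    static boolean mightMatchSalaryAndAge(LongStats salary, LongStats age) {
        return mightMatchGtEq(salary, 50_000L) && mightMatchLt(age, 65);
    }

    public static void main(String[] args) {
        // Salaries top out at 45_000, so this chunk can be skipped without decoding.
        System.out.println(mightMatchSalaryAndAge(new LongStats(20_000, 45_000), new LongStats(22, 60)));
        // Both ranges overlap the predicate, so this chunk has to be read and filtered per record.
        System.out.println(mightMatchSalaryAndAge(new LongStats(30_000, 90_000), new LongStats(25, 70)));
    }
}
```

A result of `true` only means "might match": statistics can rule a chunk out, but matching rows still need the record-level check.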

S3 and Avro: Same API, Different Plumbing

// S3: static credentials (hardwood-s3)
S3Source source = S3Source.builder()
    .region("us-east-1")
    .credentials(S3Credentials.of("AKIA...", "secret"))
    .build();

try (ParquetFileReader fr = ParquetFileReader.open(
         source.inputFile("s3://my-bucket/data/trips.parquet"));
     RowReader r = fr.rowReader()) {
    /* ... */
}

S3 and Avro: Same API, Different Plumbing

// Avro: read rows as GenericRecord
try (ParquetFileReader fr = ParquetFileReader.open(...);
     AvroRowReader r = AvroReaders.buildRowReader(fr)
         .build()) {
    Schema avroSchema = r.getSchema();
    while (r.hasNext()) {
        GenericRecord rec = r.next();
        long id = (Long) rec.get("id");
    }
}

Columnar API for Hot Loops

// Skip the per-row API. Pull an entire batch of values.
try (ColumnReader col = reader.columnReader()) {
    long sum = 0;
    while (col.hasNextBatch()) {
        PqLongList batch = col.readLongBatch();
        long[] vals = batch.values();
        for (int i = 0; i < vals.length; i++)
            sum += vals[i]; // JIT auto-vectorizes this
    }
}

Implementation © formulanone https://flic.kr/p/woErp4 (CC BY-SA 2.0)

Page-Level Parallelism
Why Pages, Not Row Groups or Column Chunks

Row-group parallel      Column-chunk parallel      Page parallel  <-- Hardwood

+--------+ <- W1        +--------+ <- W1           P1 P2 P3 P4 ... (100s)
|  RG 1  |              | Col A  | fast            ^  ^  ^  ^
+--------+              +--------+                 W1 W2 W3 W4 ...
+--------+ <- W2        +--------+ <- W2
|  RG 2  |              | Col B  | SLOW            (within one row group)
+--------+              +--------+
+--------+ <- W3        +--------+ <- W3           + bounded memory
|  RG 3  |              | Col C  | fast            + virtual threads coordinate
+--------+              +--------+                 + adaptive prefetch: slow columns get more workers

Row-group parallel: many files have only 1-2 row groups -> tiny fan-out; huge RGs -> memory pressure.
Column-chunk parallel: capped at # projected columns; columns decode at very different speeds -> fast workers idle waiting on slow.
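As a rough illustration of the page-parallel idea, the sketch below fans the pages of one column out to a fixed worker pool and collects the decoded results in page order. It is not Hardwood's scheduler: bounded memory, the virtual-thread coordinators and adaptive prefetch are left out, and "decoding" is a stand-in (a prefix sum over delta-encoded values). All names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: one decode task per page, executed on a fixed pool of workers.
final class PageParallelSketch {

    /** Stand-in for real page decoding: turns per-page deltas into absolute values. */
    static long[] decodePage(long[] deltas) {
        long[] values = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            values[i] = running;
        }
        return values;
    }

    /** Decodes all pages in parallel and returns them in their original order. */
    static List<long[]> decodeAll(List<long[]> pages, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Callable<long[]>> tasks = new ArrayList<>();
            for (long[] page : pages) {
                tasks.add(() -> decodePage(page));
            }
            List<long[]> decoded = new ArrayList<>();
            // invokeAll returns the futures in the same order as the submitted tasks.
            for (Future<long[]> future : pool.invokeAll(tasks)) {
                decoded.add(future.get());
            }
            return decoded;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while decoding pages", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("page decode failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because pages are independent units in this sketch, the number of tasks scales with the page count rather than with the number of row groups or projected columns.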


Modern Java FTW
Some Features We Lean On
• Virtual threads: per-column retriever/drain coordinators; decode itself runs on a fixed OS-thread pool
• Vector API: SIMD bit-unpacking & dict lookups
• FFM + libdeflate: native-speed gzip on Java 22+
• GraalVM: AOT-compiled native CLI binary
• JFR: ship-with-the-binary profiling

S3 I/O Path

Zero-Copy Backings: Local mmap, Remote Range Cache

Local files: MappedInputFile

file on disk
     |
     v
+------------------------------------------------+
| mmap into the process (MappedByteBuffer)       |
+------------------------------------------------+
     ^         ^         ^
   slice     slice     slice   <- ChunkHandle reads from the mapping directly

Zero-copy ByteBuffer slices, no read() syscalls, OS page cache does the rest.

Remote files (S3, ...): RangeBackedInputFile

first hit on range A..B      subsequent hits on A..B
+-----------------+          +-------------------+
| GET Range A..B  |          | served from mmap  |
+-----------------+          | (no GET, no copy) |
         |                   +-------------------+
         v                             ^
+-----------------------------------------------+
| sparse temp file, mmapped (cap: 2 GB)         |
| holes for unread regions                      |
| populated on demand                           |
+-----------------------------------------------+
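The local-file half of the picture can be approximated with plain JDK NIO. The sketch below maps a file once and hands out slices that share the mapping's memory. It is not the `MappedInputFile` implementation; the class and method names are assumptions, and the demo file is a stand-in that merely starts and ends with Parquet's "PAR1" magic bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch: map once, then slice; slices are views onto the mapping, not copies.
final class MmapSliceSketch {

    /** Maps the whole file read-only; a single mapping is limited to Integer.MAX_VALUE bytes. */
    static MappedByteBuffer map(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns a view of [offset, offset + length) without copying any bytes. */
    static ByteBuffer slice(ByteBuffer mapping, int offset, int length) {
        ByteBuffer view = mapping.duplicate(); // independent position/limit, shared content
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }

    /** Demo: writes a small stand-in file and reads its last four bytes through a slice. */
    static String trailingMagic() {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, "PAR1-payload-PAR1".getBytes(StandardCharsets.US_ASCII));
            MappedByteBuffer mapping = map(file);
            ByteBuffer magic = slice(mapping, mapping.capacity() - 4, 4);
            // Copying into a byte[] here is only for printing; decoders can read the slice directly.
            byte[] bytes = new byte[magic.remaining()];
            magic.get(bytes);
            return new String(bytes, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(trailingMagic());
    }
}
```

The remote variant on the slide applies the same slicing to a sparse temp file that is filled by ranged GETs; that part is not sketched here.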

Performance Numbers
FlatPerformanceTest
• 9.2 GB NYC taxi ride data
• 20 columns, summing up values from three
• Hardwood: row reader by name (1), row reader by index (2), column reader (3)
• parquet-java: row-based (4), columnar (5)

CLI © FlipFlopFlorida https://flic.kr/p/5ZWGsP (CC BY 2.0)

Hardwood CLI: Swiss Army Knife for Parquet

$ hardwood print -n 20 -f data.parquet
$ hardwood convert --format json -f s3://bucket/data.parquet
$ hardwood inspect columns -f data.parquet

Hardwood CLI binaries for macOS, Linux & Windows: https://github.com/hardwood-hq/hardwood/releases/

dive: Interactive Parquet Exploration

dive in Action

What AI Does Well
• Implementing a spec: encodings, page headers, Thrift Compact Protocol
• Driving test suites: parquet-testing has hundreds of files; great oracle
• Triaging failures: “what does this hex dump tell us?”
• Boilerplate: JMH harnesses, Testcontainers setup, GraalVM hints
• Pair-debugging: the rubber duck talks back

Where You Need to Stay Sharp
• Duplicated logic: two near-identical decoders instead of one, because it’s easier than refactoring
• Papering over corner cases: another if/else instead of fixing the underlying bug
• Quietly excluding test cases: instead of figuring out why the unexpected result happened
• “Plausible” performance “wins”: changes that benchmarks reveal as no-ops or regressions

Make or Buy, Reframed by AI
Example: S3 Request Signer
• The classical answer: pull in the AWS Java SDK
• The Hardwood answer: write SigV4 from scratch
  ◦ 289 lines, JDK crypto only
  ◦ Validated against official AWS SigV4 test vectors
What it bought
• Zero mandatory deps for the S3 reader
• No transitive surface, no SDK version pin, no shading
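To give a feel for why "JDK crypto only" is feasible, here is a sketch of one step of SigV4: deriving the signing key through a chain of HMAC-SHA256 calls, as described in AWS's public Signature Version 4 documentation. This is not Hardwood's signer. Building the canonical request and the string to sign is omitted, and all class and method names are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch of the SigV4 signing-key derivation using only JDK classes.
final class SigV4KeySketch {

    static byte[] hmacSha256(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HmacSHA256 unavailable or key rejected", e);
        }
    }

    /**
     * Derives the signing key for one day, region and service.
     * The date is in yyyyMMdd form, e.g. "20250101".
     */
    static byte[] signingKey(String secretKey, String date, String region, String service) {
        byte[] kSecret = ("AWS4" + secretKey).getBytes(StandardCharsets.UTF_8);
        byte[] kDate = hmacSha256(kSecret, date);
        byte[] kRegion = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }

    /** Lower-case hex, the form in which the final signature is sent. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Placeholder credentials; a real signer would then compute
        // hex(hmacSha256(key, stringToSign)) as the request signature.
        byte[] key = signingKey("secret", "20250101", "us-east-1", "s3");
        System.out.println(hex(key));
    }
}
```

The whole derivation needs nothing beyond `javax.crypto`, which is the kind of trade-off the slide describes: a small amount of code to own in exchange for not pulling in an SDK.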

Code Reviews
Custom Skills
• /hardwood-review PR xyz
• /hardwood-address-review xyz

Built with AI ≠ Vibe-Coded
LLM-assisted contributions are welcome. Vibe coding is not.
• You read every diff. Every. Diff.
• You own the architecture: invariants, threading, allocation budget.
• Benchmarks are the ground truth, not Claude’s narration.
AI-generated code is a starting point, not an end state.

Status & Roadmap
• 1.0.0.CR1
  ◦ Row + columnar reader; flat and nested files
  ◦ All types, encodings, compression schemes
  ◦ Local and remote files, predicate push-down
  ◦ CLI
• 1.1
  ◦ Writer
  ◦ Bloom filters
  ◦ Encryption

hardwood.dev

Summary
• Apache Parquet: a richer file format than “just columnar”
• Hardwood: lean, fast, dependency-light
  ◦ Page-level parallelism + adaptive prefetching
  ◦ Modern Java pays off in real numbers
  ◦ AI is a force multiplier, but you stay the engineer
  ◦ ❤❤❤

Get In Touch
• 📧 gmorling@confluent.io
• @gunnarmorling
• @gunnarmorling.dev
• morling.dev

