Hardwood: Building a Parquet Parser From Scratch (With a Little Help From AI)

Apache Parquet has become the lingua franca of the modern data ecosystem, powering data lakes and table formats like Apache Iceberg—but for Java, the go-to library parquet-java pulls in Hadoop and a truckload of other dependencies, and its reader is single-threaded. This was bugging me enough to start Hardwood, a brand-new Parquet parser written from scratch in modern Java, applying some of the performance lessons learned from the One Billion Row Challenge.

Come and join me for this session, where we'll look at:

* The internals of the Parquet format and what makes parallelizing its decoding surprisingly tricky
* Techniques for achieving high throughput, such as page-level parallelism, adaptive prefetching, and avoiding auto-boxing in hot loops
* How to use JDK Flight Recorder for identifying performance bottlenecks
* Practical learnings from using AI (specifically, Claude Code) as a coding companion—what works well, where you need to stay sharp, and why "built with AI" doesn't mean "vibe-coded"

Whether you're interested in file formats, Java performance, or getting a realistic take on AI-assisted development, there should be something in here for you.

Gunnar Morling

May 21, 2026

Transcript

  1. Hardwood: Building a Parquet Parser From Scratch (With a Little Help From AI)

    Gunnar Morling @gunnarmorling
    © Dale Cruse https://flic.kr/p/2s68Vpa (CC BY 4.0)
  2. What if Parquet parsing in Java didn’t need Hadoop, and used all your cores?

    #Hardwood #Parquet #Java · @gunnarmorling
  3. Gunnar Morling

    Technologist at Confluent
    • Creator of Hardwood
    • Former project lead of Debezium
    • kcctl 🧸, JfrUnit, ModiTect, MapStruct
    • One Billion Row Challenge 1⃣🐝🏎
    • Java Champion
  4. Row-Oriented vs. Columnar-Oriented

    Row-oriented (CSV, JSON, Avro)     | Columnar (Parquet, ORC, Arrow)
    R1: id=1, name=Alice, fare=12.5    | id:   1 2 3 ...
    R2: id=2, name=Bob,   fare=8.3     | name: Alice Bob Carol
    R3: id=3, name=Carol, fare=21.0    | fare: 12.5 8.3 21.0

    • Row-oriented is best for: per-row writes, replay one record
    • Columnar is best for: scan a few columns across millions of rows
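
The two layouts on slide 4 can be mimicked in plain Java to show why a single-column scan favors the columnar form. This is only an in-memory sketch: the class and method names (`LayoutDemo`, `Trip`, `TripColumns`) are made up for illustration and have nothing to do with Hardwood or the on-disk Parquet encoding.

```java
import java.util.List;

class LayoutDemo {
    // Row-oriented: one object per record, all fields stored together.
    static final class Trip {
        final long id;
        final String name;
        final double fare;

        Trip(long id, String name, double fare) {
            this.id = id;
            this.name = name;
            this.fare = fare;
        }
    }

    // Columnar: one array per field, values of the same field stored together.
    static final class TripColumns {
        final long[] id;
        final String[] name;
        final double[] fare;

        TripColumns(long[] id, String[] name, double[] fare) {
            this.id = id;
            this.name = name;
            this.fare = fare;
        }
    }

    // Walks every record object, even though only one field is needed.
    static double avgFareRows(List<Trip> rows) {
        double sum = 0;
        for (Trip t : rows) {
            sum += t.fare;
        }
        return sum / rows.size();
    }

    // Touches only the fare array; id and name are never read.
    static double avgFareColumns(TripColumns cols) {
        double sum = 0;
        for (double f : cols.fare) {
            sum += f;
        }
        return sum / cols.fare.length;
    }
}
```

Both methods return the same average; the difference is how much unrelated data sits between the values being read.
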
  5. Why Columnar?

    Example
    -----------------------------------------------------------------
    SELECT AVG(fare) FROM trips WHERE pickup_date > '2025-01-01'

    Row format: read every byte of every row
    Columnar:   read 2 columns only     -> 10-100x less I/O
                similar values cluster  -> 3-10x compression
    -----------------------------------------------------------------
    • Where you find it
      ◦ Data lakes on S3 etc., Open Table formats
      ◦ OLAP engines: Spark, Trino, DuckDB, ClickHouse, Snowflake
      ◦ In-memory analytics: Apache Arrow
      ◦ ML feature stores; analytical mirrors of OLTP databases
  6. Inside a Parquet File

    +------------------------------------------------------------+
    | PAR1 (4-byte magic, file start)                            |
    +------------------------------------------------------------+
    |                                                            |
    |  +=============== ROW GROUP 1 ======================+      |
    |  | Column Chunk A   | Column Chunk B   | ...        |      |
    |  | +--------------+ | +--------------+ |            |      |
    |  | | DictPage     | | | DataPage v2  | |            |      |
    |  | | DataPage     | | | DataPage v2  | |            |      |
    |  | | DataPage     | | | DataPage v2  | |            |      |
    |  | +--------------+ | +--------------+ |            |      |
    |  +==================================================+      |
    |                                                            |
    |  +=============== ROW GROUP 2 ======================+      |
    |  | ...                                              |      |
    |  +==================================================+      |
    |                                                            |
    +------------------------------------------------------------+
    | FOOTER (Thrift)  schema | rg meta | stats | page index     |
    +------------------------------------------------------------+
    | footer length (4B) | PAR1                                  |
    +------------------------------------------------------------+

    Read back-to-front: last 8 bytes -> footer -> jump to pages
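
The "read back-to-front" step on slide 6 can be sketched in a few lines. The Parquet format stores the footer length as a 4-byte little-endian integer directly before the trailing `PAR1` magic; the sketch below only locates the footer and does not decode the Thrift metadata. `ParquetTrailer` and its methods are hypothetical names for illustration, not Hardwood's API, and the whole file is held in a byte array purely to keep the example small.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ParquetTrailer {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    // Reads the last 8 bytes: 4-byte little-endian footer length, then the magic.
    static int footerLength(byte[] file) {
        // Smallest conceivable file: leading magic + footer length + trailing magic.
        if (file.length < 12) {
            throw new IllegalArgumentException("too short to be a Parquet file");
        }
        ByteBuffer tail = ByteBuffer.wrap(file, file.length - 8, 8)
                .order(ByteOrder.LITTLE_ENDIAN);
        int length = tail.getInt();
        byte[] magic = new byte[4];
        tail.get(magic);
        if (!Arrays.equals(magic, MAGIC)) {
            throw new IllegalArgumentException("missing trailing PAR1 magic");
        }
        if (length < 0 || length > file.length - 12) {
            throw new IllegalArgumentException("implausible footer length " + length);
        }
        return length;
    }

    // Offset of the first footer byte, counted from the start of the file.
    static int footerStart(byte[] file) {
        return file.length - 8 - footerLength(file);
    }
}
```

A real reader would fetch only the tail of the file (for remote storage, a ranged request) instead of loading everything, then parse the footer to find the row groups and pages.
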
  7. Nested Data: The Dremel Trick

    Definition & Repetition Levels

    Schema                         Records
    message Doc {                  Doc(id=1, authors=[Alice, Bob])
      required int64 id;           Doc(id=2, authors=[])
      repeated group authors {     Doc(id=3, authors=[Carol])
        required string name;
      }                            max D = 1   max R = 1
    }                              (authors is the one repeated parent)

    Column 'authors.name' on disk:
      value:    Alice  Bob  NULL  Carol
      D-level:  1      1    0     1
      R-level:  0      1    0     0

    D=1 -> a name is present       R=0 -> start of a new Doc
    D=0 -> the list was empty      R=1 -> next author in same Doc
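
The level rules on slide 7 are enough to rebuild the three records from the flat column. The sketch below follows the slide's simplified layout, with one value slot per level entry and `null` standing in for the empty list, and it handles only this one-level schema (max D = 1, max R = 1). `DremelAssembly` is a hypothetical name, not part of Hardwood.

```java
import java.util.ArrayList;
import java.util.List;

class DremelAssembly {
    // Rebuilds the authors list of each Doc from the flat 'authors.name' column.
    static List<List<String>> assemble(String[] values, int[] defLevels, int[] repLevels) {
        List<List<String>> docs = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (repLevels[i] == 0) {
                // R=0: this entry starts a new Doc.
                docs.add(new ArrayList<>());
            }
            if (defLevels[i] == 1) {
                // D=1: a name is present, so append it to the current Doc.
                docs.get(docs.size() - 1).add(values[i]);
            }
            // D=0: the list was empty, so the Doc stays without authors.
        }
        return docs;
    }

    public static void main(String[] args) {
        List<List<String>> docs = assemble(
                new String[] {"Alice", "Bob", null, "Carol"},
                new int[] {1, 1, 0, 1},
                new int[] {0, 1, 0, 0});
        System.out.println(docs); // prints [[Alice, Bob], [], [Carol]]
    }
}
```

Deeper nesting needs higher maximum levels and a small state machine per level, which is where decoding nested columns gets considerably more involved.
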
  8. Apache Parquet: A Long Tail of Capabilities

    • Different compression algorithms
    • Many different encodings
    • Predicate push-down (statistics, bloom filters)
    • VARIANT column type
    • Encryption
  9. Why Build a New Parser?

    parquet-java pain points
    • Pulls in Hadoop: …and a truckload of transitive deps
    • Reader is single-threaded: leaves cores idle on modern hardware

    But also: Explore how far LLMs will take you?
  10. Hardwood Goals

    What we set out to build
    • Light-weight: implement the Parquet format with zero mandatory transitive dependencies
    • Compatible: read every file that parquet-java reads
    • Fast: match or exceed parquet-java's read throughput
    • Concurrent: multi-threaded at the core
    • Embeddable: usable from native CLIs, S3-only pipelines etc.
  11. Reading a Parquet File

    try (ParquetFileReader reader = ParquetFileReader.open(InputFile.of(path));
            RowReader rows = reader.rowReader()) {
        while (rows.hasNext()) {
            rows.next();
            long id = rows.getLong("id");
            String name = rows.getString("name");
            LocalDate birth = rows.getDate("birth_date");
            // ... typed primitives, no auto-boxing
        }
    }
  12. Query Controls: Projection

    // Column projection
    try (RowReader r = reader.buildRowReader()
            .projection(ColumnProjection.columns("id", "name", "created_at"))
            .build()) {
        while (r.hasNext()) {
            r.next();
            long id = r.getLong("id");
            String name = r.getString("name");
            Instant ts = r.getTimestamp("created_at");
        }
    }
  13. Query Controls: Filters

    // Predicate pushdown (row group + page + record)
    FilterPredicate where = FilterPredicate.and(
        FilterPredicate.gtEq("salary", 50_000L),
        FilterPredicate.lt("age", 65),
        FilterPredicate.isNotNull("email"));

    FilterPredicate after = FilterPredicate.gt("birth_date", LocalDate.of(2000, 1, 1));
    FilterPredicate amount = FilterPredicate.gtEq("amount", new BigDecimal("99.99"));
  14. S3 and Avro: Same API, Different Plumbing

    // S3: static credentials (hardwood-s3)
    S3Source source = S3Source.builder()
        .region("us-east-1")
        .credentials(S3Credentials.of("AKIA...", "secret"))
        .build();

    try (ParquetFileReader fr = ParquetFileReader.open(
                source.inputFile("s3://my-bucket/data/trips.parquet"));
            RowReader r = fr.rowReader()) {
        /* ... */
    }
    { … }
  15. S3 and Avro: Same API, Different Plumbing

    // Avro: read rows as GenericRecord
    try (ParquetFileReader fr = ParquetFileReader.open(...);
            AvroRowReader r = AvroReaders.buildRowReader(fr)
                .build()) {
        Schema avroSchema = r.getSchema();
        while (r.hasNext()) {
            GenericRecord rec = r.next();
            long id = (Long) rec.get("id");
        }
    }
  16. Columnar API for Hot Loops

    // Skip the per-row API. Pull an entire batch of values.
    try (ColumnReader col = reader.columnReader()) {
        long sum = 0;
        while (col.hasNextBatch()) {
            PqLongList batch = col.readLongBatch();
            long[] vals = batch.values();
            for (int i = 0; i < vals.length; i++)
                sum += vals[i]; // JIT auto-vectorizes this
        }
    }
  17. Page-Level Parallelism

    Why Pages, Not Row Groups or Column Chunks

    Row-group parallel        Column-chunk parallel          Page parallel  <-- Hardwood

    +--------+ <- W1          +--------+ <- W1               P1  P2  P3  P4  ... (100s)
    |  RG 1  |                | Col A  |  fast               ^   ^   ^   ^
    +--------+                +--------+                     W1  W2  W3  W4  ...
    +--------+ <- W2          +--------+ <- W2
    |  RG 2  |                | Col B  |  SLOW               (within one row group)
    +--------+                +--------+
    +--------+ <- W3          +--------+ <- W3               + bounded memory
    |  RG 3  |                | Col C  |  fast               + virtual threads coordinate
    +--------+                +--------+                     + adaptive prefetch: slow
                                                               columns get more workers
    Many files have only      Capped at # projected
    1-2 row groups -> tiny    columns; columns decode at
    fan-out; huge RGs ->      very different speeds -> fast
    memory pressure.          workers idle waiting on slow.
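
The core idea of slide 17, decoding the pages of one column on several workers and stitching the results back together in page order, can be sketched with a plain fixed thread pool. Everything here is a stand-in: `PageParallelDecode` is a hypothetical name, a "page" is just an array of deltas, and "decoding" is a running sum. Hardwood's actual scheduling, prefetching, and virtual-thread coordination are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PageParallelDecode {
    // Stand-in for real page decoding: each page holds deltas that are
    // turned into absolute values independently of every other page.
    static long[] decodePage(long[] deltas) {
        long[] out = new long[deltas.length];
        long running = 0;
        for (int i = 0; i < deltas.length; i++) {
            running += deltas[i];
            out[i] = running;
        }
        return out;
    }

    // Decodes all pages of one column on a fixed pool and concatenates
    // the results in page order, regardless of which page finishes first.
    static long[] decodeColumn(List<long[]> pages, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            int total = 0;
            for (long[] page : pages) {
                futures.add(pool.submit(() -> decodePage(page)));
                total += page.length;
            }
            long[] column = new long[total];
            int pos = 0;
            for (Future<long[]> future : futures) {
                long[] decoded = future.get(); // blocks until this page is done
                System.arraycopy(decoded, 0, column, pos, decoded.length);
                pos += decoded.length;
            }
            return column;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because pages are small and numerous, the pool stays busy even when a file has a single row group or only one projected column, which is the advantage the slide attributes to page-level fan-out.
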
  18. Modern Java FTW

    Some Features We Lean On
    • Virtual threads: per-column retriever/drain coordinators
    • Vector API: SIMD bit-unpacking & dict lookups
    • FFM + libdeflate: native-speed gzip on Java 22+
    • GraalVM: AOT-compiled native CLI binary
    • JFR: ship-with-the-binary profiling
  19. Modern Java FTW

    Some Features We Lean On
    • Virtual threads: per-column retriever/drain coordinators; decode itself runs on a fixed OS-thread pool
    • Vector API: SIMD bit-unpacking & dict lookups
    • FFM + libdeflate: native-speed gzip on Java 22+
    • GraalVM: AOT-compiled native CLI binary
    • JFR: ship-with-the-binary profiling
  20. Zero-Copy Backings: Local mmap, Remote Range Cache

    Local files: MappedInputFile

    file on disk
         |
         v
    +------------------------------------------------+
    | mmap into the process (MappedByteBuffer)       |
    +------------------------------------------------+
         ^          ^          ^
       slice      slice      slice   <- ChunkHandle reads from the mapping directly

    Zero-copy ByteBuffer slices, no read() syscalls, OS page cache does the rest.

    Remote files (S3, ...): RangeBackedInputFile

    first hit on range A..B      subsequent hits on A..B
    +-----------------+          +-------------------+
    | GET Range A..B  |          | served from mmap  |
    +-----------------+          | (no GET, no copy) |
             |                   +-------------------+
             v                             ^
    +-----------------------------------------------+
    | sparse temp file, mmapped (cap: 2 GB)         |
    | holes for unread regions                      |
    | populated on demand                           |
    +-----------------------------------------------+
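
The local-file half of slide 20 relies on standard JDK calls: map the file once, then hand out slices that share the mapping. The sketch below shows that mechanism only; `MmapSlices` and its methods are hypothetical names, and Hardwood's `MappedInputFile` and `ChunkHandle` may well be structured differently.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class MmapSlices {
    // Maps the whole file read-only. The mapping stays valid after the
    // channel is closed. A single mapping is limited to 2 GB.
    static ByteBuffer mapReadOnly(Path file) {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a view of [offset, offset + length) that shares the mapped
    // memory: no bytes are copied and the parent's position is untouched.
    static ByteBuffer slice(ByteBuffer mapping, int offset, int length) {
        ByteBuffer view = mapping.duplicate();
        view.position(offset);
        view.limit(offset + length);
        return view.slice();
    }

    // Helper for trying the sketch out: writes bytes to a temporary file.
    static Path writeTemp(byte[] data) {
        try {
            Path file = Files.createTempFile("mmap-sketch", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, data);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each page or column chunk can then be decoded from its own slice, and the operating system's page cache decides which parts of the file are actually resident.
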
  21. Performance Numbers

    FlatPerformanceTest
    • 9.2 GB NYC taxi ride data
    • 20 columns, summing up values from three
    • Hardwood: row reader, named (1) and indexed (2); column reader (3)
    • parquet-java: row-based (4), columnar (5)
  22. Hardwood CLI: Swiss Army Knife for Parquet

    $ hardwood print -n 20 -f data.parquet
    $ hardwood convert --format json -f s3://bucket/data.parquet
    $ hardwood inspect columns -f data.parquet
  23. Hardwood CLI: Swiss Army Knife for Parquet

    Binaries for macOS, Linux & Windows
    https://github.com/hardwood-hq/hardwood/releases/
  24. What AI Does Well

    • Implementing a spec: encodings, page headers, Thrift Compact Protocol
    • Driving test suites: parquet-testing has hundreds of files; great oracle
    • Triaging failures: “what does this hex dump tell us?”
    • Boilerplate: JMH harnesses, Testcontainers setup, GraalVM hints
    • Pair-debugging: the rubber duck talks back
  25. Where You Need to Stay Sharp

    • Duplicated logic: two near-identical decoders instead of one, because it’s easier than refactoring
    • Paper-over corner cases: another if/else instead of fixing the underlying bug
    • Quietly excluding test cases: instead of figuring out why the unexpected result happened
    • “Plausible” performance “wins”: that benchmarks reveal as no-ops or regressions
  26. Make or Buy: Reframed by AI

    Example: S3 Request Signer
    • The classical answer: pull in the AWS Java SDK
    • The Hardwood answer: write SigV4 from scratch
      ◦ 289 lines, JDK crypto only
      ◦ Validated against official AWS SigV4 test vectors

    What it bought
    • Zero mandatory deps for the S3 reader
    • No transitive surface, no SDK version pin, no shading
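
To give a feel for what "JDK crypto only" means on slide 26, here is one step of AWS Signature Version 4: deriving the signing key through a chain of HMAC-SHA256 operations over the date, region, and service. This is not Hardwood's signer, and `SigV4KeySketch` is a hypothetical name. Building the canonical request and the string to sign, which make up most of a real implementation, are left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class SigV4KeySketch {
    // HMAC-SHA256 using only classes that ship with the JDK.
    static byte[] hmacSha256(byte[] key, String data) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(key, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    // SigV4 signing key: each step keys the next HMAC with the previous result.
    // 'date' is in yyyyMMdd form, for example "20260521".
    static byte[] signingKey(String secretKey, String date, String region, String service) {
        byte[] kSecret = ("AWS4" + secretKey).getBytes(StandardCharsets.UTF_8);
        byte[] kDate = hmacSha256(kSecret, date);
        byte[] kRegion = hmacSha256(kDate, region);
        byte[] kService = hmacSha256(kRegion, service);
        return hmacSha256(kService, "aws4_request");
    }
}
```

The final request signature is the hex-encoded HMAC-SHA256 of the string to sign under this key, which is the part a signer would check against the AWS test vectors the slide mentions.
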
  27. Code Reviews

    Custom Skills
    • /hardwood-review PR xyz
    • /hardwood-address-review xyz
  28. Built with AI ≠ Vibe-Coded

    LLM-assisted contributions are welcome. Vibe coding is not.
    • You read every diff. Every. Diff.
    • You own the architecture: invariants, threading, allocation budget.
    • Benchmarks are the ground truth, not Claude’s narration.

    AI-generated code is a starting point, not an end state.
  29. Status & Roadmap

    • 1.0.0.CR1
      ◦ Row + columnar reader; flat and nested files
      ◦ All types, encodings, compression schemes
      ◦ Local and remote files, predicate push-down
      ◦ CLI
    • 1.1
      ◦ Writer
      ◦ Bloom filters
      ◦ Encryption
  30. Summary

    • Apache Parquet: a richer file format than “just columnar”
    • Hardwood: lean, fast, dependency-light
      ◦ Page-level parallelism + adaptive prefetching
      ◦ Modern Java pays off in real numbers
      ◦ AI is a force multiplier, but you stay the engineer
      ◦ ❤❤❤