Slide 1

InfluxDB IOx data lifecycle and object store persistence
Paul Dix – CTO & co-founder, @pauldix

Slide 2

Terms
• Writer ID – Unique u32 server identifier
• Mutable Buffer – In-memory, writable and queryable DB
• Read Buffer – In-memory, read-only DB optimized for queries
• Object Store – Storage backed by S3, Azure, Google Cloud Storage, local disk, or memory

Slide 3

Terms
• Partition – User-defined region of data; partitions do not overlap
• Chunk – Block of data within a partition; chunks may overlap each other
• WAL Segment – Collection of writes, deletes, and schema modifications

Slide 4

The Write, Replication, & Subscription Lifecycle

Slide 5

Input converted to Flatbuffers, tagged with:
• Writer ID
• Sequence Number
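
A minimal sketch of the tagging step, in Rust (the language IOx is written in). `SequencedWrite` and `Sequencer` are hypothetical names, and the payload is plain bytes standing in for the Flatbuffers-encoded write. The only point illustrated is that every write carries its writer ID and a per-writer, monotonically increasing sequence number.

```rust
/// Hypothetical write envelope. `payload` stands in for the Flatbuffers bytes.
#[derive(Debug, Clone, PartialEq)]
pub struct SequencedWrite {
    pub writer_id: u32,
    pub sequence: u64,
    pub payload: Vec<u8>,
}

/// Hypothetical helper that hands out monotonically increasing
/// sequence numbers for a single writer.
pub struct Sequencer {
    writer_id: u32,
    next_sequence: u64,
}

impl Sequencer {
    pub fn new(writer_id: u32) -> Self {
        // Starting at 1 is an arbitrary choice for this sketch.
        Sequencer { writer_id, next_sequence: 1 }
    }

    /// Stamp an encoded payload with this writer's ID and the next sequence number.
    pub fn sequence(&mut self, payload: Vec<u8>) -> SequencedWrite {
        let write = SequencedWrite {
            writer_id: self.writer_id,
            sequence: self.next_sequence,
            payload,
        };
        self.next_sequence += 1;
        write
    }
}
```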

Slide 6

The WAL Buffer

Slide 7

WAL Buffer
• Max Buffer Size
• Segment Size
• Persist? – On rollover and/or after a segment has been open for a set time
• Max Behavior (when the buffer is full)
  – Reject write
  – Drop new write
  – Drop un-persisted segment
• Segments have a monotonically increasing ID and a writer ID
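
The WAL Buffer settings could be modeled roughly as below. This is a sketch under assumptions: all type and field names are hypothetical, and "open time" is interpreted as a maximum number of seconds a segment may stay open before it is persisted.

```rust
/// What to do when the WAL buffer is at its max size (the three options above).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MaxBehavior {
    RejectWrite,
    DropNewWrite,
    DropUnpersistedSegment,
}

/// When to persist closed segments; either trigger may be enabled.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct PersistPolicy {
    pub on_rollover: bool,
    /// Persist a segment once it has been open this long (hypothetical unit: seconds).
    pub max_open_secs: Option<u64>,
}

/// Hypothetical WAL buffer configuration.
#[derive(Debug, Clone, PartialEq)]
pub struct WalBufferConfig {
    pub max_buffer_size: u64,
    pub segment_size: u64,
    /// `None` means segments are never persisted (persistence is optional).
    pub persist: Option<PersistPolicy>,
    pub max_behavior: MaxBehavior,
}

impl WalBufferConfig {
    /// Number of full segments the buffer can hold at once.
    pub fn max_segments(&self) -> u64 {
        self.max_buffer_size / self.segment_size
    }
}
```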

Slide 8

WAL Buffer Example (success)
• 300MB max, 100MB segments
• Writes fill up segment 1
• Close segment, attempt to persist in background
• New writes go into segment 2, which fills up
• Close segment, attempt to persist in background
• New writes go into segment 3, which fills up
• Close segment; if segment 1 has persisted, clear it
• New writes create segment 4 only if segment 1 was cleared; otherwise the Max Behavior applies
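
The example can be replayed with a toy model. Assumptions: sizes are abstract units, capacity is counted in whole segments (max size / segment size), a segment is closed by the first write that would overflow it, background persistence is signaled by calling the hypothetical `mark_persisted`, and the Max Behavior is fixed to "reject write". Writes larger than one segment are not handled.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq)]
pub enum WriteOutcome {
    Accepted { segment_id: u64 },
    Rejected,
}

/// Hypothetical toy WAL buffer that tracks only segment IDs and sizes.
pub struct WalBuffer {
    max_segments: usize,
    segment_size: u64,
    next_id: u64,
    closed: VecDeque<(u64, bool)>, // (segment ID, persisted?), oldest first
    open: Option<(u64, u64)>,      // (segment ID, bytes used)
}

impl WalBuffer {
    pub fn new(max_buffer_size: u64, segment_size: u64) -> Self {
        WalBuffer {
            max_segments: (max_buffer_size / segment_size) as usize,
            segment_size,
            next_id: 1,
            closed: VecDeque::new(),
            open: None,
        }
    }

    /// Stand-in for the background persist task reporting success.
    pub fn mark_persisted(&mut self, id: u64) {
        if let Some(entry) = self.closed.iter_mut().find(|e| e.0 == id) {
            entry.1 = true;
        }
    }

    pub fn write(&mut self, bytes: u64) -> WriteOutcome {
        // Close the open segment if this write would overflow it
        // (a real buffer would start the background persist here).
        if let Some((id, used)) = self.open {
            if used + bytes > self.segment_size {
                self.closed.push_back((id, false));
                self.open = None;
            }
        }
        // Clear closed segments that have been persisted.
        self.closed.retain(|&(_, persisted)| !persisted);
        if self.open.is_none() {
            // Closed segments plus a new open one must fit in the buffer.
            if self.closed.len() + 1 > self.max_segments {
                return WriteOutcome::Rejected; // "Reject write" max behavior
            }
            self.open = Some((self.next_id, 0));
            self.next_id += 1;
        }
        let open = self.open.as_mut().expect("an open segment exists here");
        open.1 += bytes;
        WriteOutcome::Accepted { segment_id: open.0 }
    }
}
```

With `WalBuffer::new(300, 100)` and four writes of 100, the fourth write is rejected because segments 1 to 3 fill the buffer; after `mark_persisted(1)`, the next write clears segment 1 and opens segment 4.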

Slide 9

WAL Buffer Properties
• Persistence is optional
  – Replication for durability
  – Local disk (later, and only if persisting)
• At steady state, the buffer uses its full max size in memory
• Segments are closed on a schedule or when full
• Serverless replication via segments in object store

Slide 10

WAL Object Store Location
• //wal///.segment
• 1/mydb/wal/000/000/001.segment
• Up to 999,999,998 segments
  – Could be increased by adding extra zero padding at the top level
• 1 list operation to get DB dirs
• 3 list operations to get the latest segment
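
A sketch of the layout implied by the example path, assuming the segment ID is zero-padded to nine digits and split into three 3-digit directory levels. `list` is a hypothetical stand-in for an object store list call that returns the immediate child names under a prefix. Because every name is zero-padded to the same width, the lexicographic maximum at each level is also the numeric maximum, so three list calls reach the latest segment.

```rust
/// Build a WAL segment path in the layout of the example
/// "1/mydb/wal/000/000/001.segment" (assumed nine-digit, three-level split).
pub fn segment_path(writer_id: u32, db: &str, segment_id: u64) -> String {
    assert!(segment_id <= 999_999_999, "segment ID does not fit in nine digits");
    let padded = format!("{:09}", segment_id);
    format!(
        "{}/{}/wal/{}/{}/{}.segment",
        writer_id, db, &padded[0..3], &padded[3..6], &padded[6..9]
    )
}

/// Find the newest segment ID with one list call per directory level.
/// `list(prefix)` is hypothetical: it returns the child names under `prefix`.
pub fn latest_segment_id<F>(db_prefix: &str, list: F) -> Option<u64>
where
    F: Fn(&str) -> Vec<String>,
{
    let mut path = format!("{}/wal", db_prefix);
    let mut digits = String::new();
    for _ in 0..3 {
        // Zero padding makes the lexicographic max equal the numeric max.
        let newest = list(path.as_str()).into_iter().max()?;
        digits.push_str(newest.trim_end_matches(".segment"));
        path = format!("{}/{}", path, newest);
    }
    digits.parse().ok()
}
```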

Slide 11

What Segments to Read on Restart?

Slide 12

The Write, Replication, & Subscription Lifecycle

Slide 13

Mutable Buffer to Chunk Persistence

Slide 14

Mutable Buffer Structure
• Data is partitioned
  – User configurable
  – Default based on time (every 2h, for instance)
  – Rows can only exist in a single partition
• Each partition has chunks
  – Used to persist blocks of data within a partition
  – Closing a chunk and persisting it:
    ▪ Triggered on size
    ▪ Triggered on time
    ▪ Triggered by explicit API call
  – Chunks may have overlapping data
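
A sketch of the default time-based partitioning (2-hour windows, per the example) and of the three chunk close triggers. The function names and thresholds are hypothetical; the slides give no concrete size or time limits. Each timestamp maps to exactly one window start, which is why a row can live in only one partition.

```rust
/// Two hours in nanoseconds (2h is the example interval).
pub const TWO_HOURS_NANOS: i64 = 2 * 60 * 60 * 1_000_000_000;

/// Hypothetical partition key: start of the 2-hour window containing the
/// row's timestamp (nanoseconds since the Unix epoch).
pub fn partition_start(timestamp_nanos: i64) -> i64 {
    // rem_euclid keeps pre-1970 (negative) timestamps in the right window.
    timestamp_nanos - timestamp_nanos.rem_euclid(TWO_HOURS_NANOS)
}

/// Why the open chunk in a partition should be closed and persisted.
#[derive(Debug, PartialEq)]
pub enum CloseReason {
    Size,
    Time,
    Explicit,
}

/// Hypothetical check; `max_bytes` and `max_open_secs` are illustrative knobs.
pub fn should_close(
    chunk_bytes: u64,
    open_secs: u64,
    explicit_request: bool,
    max_bytes: u64,
    max_open_secs: u64,
) -> Option<CloseReason> {
    if explicit_request {
        Some(CloseReason::Explicit)
    } else if chunk_bytes >= max_bytes {
        Some(CloseReason::Size)
    } else if open_secs >= max_open_secs {
        Some(CloseReason::Time)
    } else {
        None
    }
}
```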

Slide 15

Chunk Persistence to Object Store
• Parquet files, 1 per table (measurement)
• Metadata file
  – Tables
  – Columns & types
    ▪ Summary stats (min, max, count)
  – Writer stats
    ▪ For each writer, min and max sequence
    ▪ For each writer, min and max segment (if applicable)
• Catalog file (for whole DB)
  – All partitions & chunks
  – New file each time a chunk is persisted
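
The per-chunk metadata could be accumulated as below. This is a sketch only: the struct names are hypothetical, column stats cover `i64` values alone, and a real chunk would write this metadata out next to its Parquet files rather than keep it in in-memory maps.

```rust
use std::collections::BTreeMap;

/// Summary stats for one column; only i64 values are handled in this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
    pub count: u64,
}

/// Min/max sequence for one writer, plus min/max WAL segment if applicable.
#[derive(Debug, Clone, PartialEq)]
pub struct WriterStats {
    pub min_sequence: u64,
    pub max_sequence: u64,
    pub min_segment: Option<u64>,
    pub max_segment: Option<u64>,
}

/// Hypothetical chunk metadata.
#[derive(Debug, Default)]
pub struct ChunkMetadata {
    /// table -> column -> stats
    pub tables: BTreeMap<String, BTreeMap<String, ColumnStats>>,
    /// writer ID -> stats
    pub writers: BTreeMap<u32, WriterStats>,
}

impl ChunkMetadata {
    /// Record one column value from a write with the given writer/sequence/segment.
    pub fn record(&mut self, table: &str, column: &str, value: i64,
                  writer_id: u32, sequence: u64, segment: Option<u64>) {
        let stats = self.tables.entry(table.to_string()).or_default()
            .entry(column.to_string())
            .or_insert(ColumnStats { min: value, max: value, count: 0 });
        stats.min = stats.min.min(value);
        stats.max = stats.max.max(value);
        stats.count += 1;

        let w = self.writers.entry(writer_id).or_insert(WriterStats {
            min_sequence: sequence,
            max_sequence: sequence,
            min_segment: segment,
            max_segment: segment,
        });
        w.min_sequence = w.min_sequence.min(sequence);
        w.max_sequence = w.max_sequence.max(sequence);
        if let Some(seg) = segment {
            w.min_segment = Some(w.min_segment.map_or(seg, |m| m.min(seg)));
            w.max_segment = Some(w.max_segment.map_or(seg, |m| m.max(seg)));
        }
    }
}
```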

Slide 16

Chunk Properties
• Immutable once persisted
• Can compact chunks
• Bulk import new chunks
• Read replicas
• Load only recent chunks

Slide 17

Chunk Object Store Location
• //chunks///…
• 1/mydb/chunks/2020-01-13/1/cpu.parquet

Slide 18

Catalog Object Store Location
• //catalog/<0 padded id>.checkpoint
• 1/mydb/catalog/000000000001.checkpoint
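
The checkpoint name in the example is a 12-digit zero-padded ID. A sketch, with hypothetical function names, of building that path and of picking the newest checkpoint from a listing; plain string comparison is enough because the padding gives every name the same width.

```rust
/// Catalog checkpoint path in the layout of the example
/// "1/mydb/catalog/000000000001.checkpoint" (12-digit zero-padded ID).
pub fn checkpoint_path(writer_id: u32, db: &str, id: u64) -> String {
    format!("{}/{}/catalog/{:012}.checkpoint", writer_id, db, id)
}

/// Pick the newest checkpoint ID from a list of file names.
/// Fixed-width padding means the lexicographic max is also the highest ID.
pub fn latest_checkpoint(names: &[&str]) -> Option<u64> {
    names
        .iter()
        .copied()
        .filter_map(|n| n.strip_suffix(".checkpoint"))
        .max()
        .and_then(|stem| stem.parse().ok())
}
```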

Slide 19

Catalog Data
• Partitions
• Chunks
  – Metadata (when persisted, when last queried)
  – Schema
  – Summary stats
  – Writer stats
• Writer stats on open chunks
  – Min segment ID with data

Slide 20

Catalog Properties
• Schema information available instantly (validation!)
• Cheap schema renames
• Point-in-time recovery

Slide 21

Restart/Recovery (single database)
• Get checkpoint from catalog
• Determine oldest WAL segment to start from
• Read WAL into Mutable Buffer
  – Only buffer writes that have not already been persisted
• Fetch recently queried chunks
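
Two of the recovery decisions, sketched under assumptions: the checkpoint exposes the min segment ID with data for each open chunk, and from the persisted chunks' writer stats one can derive, per partition and writer, the highest sequence number already persisted. The `WalEntry` shape and the per-partition keying are hypothetical.

```rust
use std::collections::HashMap;

/// Hypothetical shape of a WAL entry seen during replay.
#[derive(Debug, Clone, PartialEq)]
pub struct WalEntry {
    pub partition: String,
    pub writer_id: u32,
    pub sequence: u64,
    pub segment_id: u64,
}

/// Oldest segment to replay: the smallest "min segment ID with data"
/// across open chunks in the checkpoint (`None` if no open chunks are listed).
pub fn replay_start(open_chunk_min_segments: &[u64]) -> Option<u64> {
    open_chunk_min_segments.iter().copied().min()
}

/// Keep only entries not already persisted. `persisted` maps
/// (partition, writer ID) to the highest sequence number in persisted chunks.
pub fn entries_to_replay(
    entries: Vec<WalEntry>,
    persisted: &HashMap<(String, u32), u64>,
) -> Vec<WalEntry> {
    entries
        .into_iter()
        .filter(|e| {
            persisted
                .get(&(e.partition.clone(), e.writer_id))
                .map_or(true, |&max| e.sequence > max)
        })
        .collect()
}
```

Keying by partition matters in this sketch because one writer's writes land in different partitions, whose chunks are persisted at different times; a single per-writer high-water mark could skip writes that were never persisted.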

Slide 22

Handling Deletes
• Tombstones in the WAL (and in memory)
• Applied at query time
  – Apply to open chunks (cached delete results)
  – Apply to the Read Buffer incrementally as chunks get read
• When the next chunk is persisted, mark tombstones in the catalog
  – Each tombstone matched to the chunks it applies to
• Compact old chunks
  – In background
  – On different servers
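
A sketch of query-time tombstone application. The tombstone predicate (table, inclusive time range, one tag equality) is an assumption, as are all names. `applies_to_chunk` shows one way a tombstone could be matched to chunks, using each chunk's time summary stats.

```rust
/// Hypothetical tombstone: delete rows of `table` in [min_time, max_time]
/// whose tag `tag` equals `value`.
#[derive(Debug, Clone)]
pub struct Tombstone {
    pub table: String,
    pub min_time: i64,
    pub max_time: i64,
    pub tag: String,
    pub value: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub time: i64,
    pub tags: Vec<(String, String)>,
    pub field: f64,
}

impl Tombstone {
    fn deletes(&self, table: &str, row: &Row) -> bool {
        table == self.table
            && row.time >= self.min_time
            && row.time <= self.max_time
            && row.tags.iter().any(|(k, v)| *k == self.tag && *v == self.value)
    }
}

/// Query-time application: rows stay in the chunk, but matches are filtered out.
pub fn visible_rows<'a>(table: &str, rows: &'a [Row], tombstones: &[Tombstone]) -> Vec<&'a Row> {
    rows.iter()
        .filter(|r| !tombstones.iter().any(|t| t.deletes(table, r)))
        .collect()
}

/// Only chunks whose time range overlaps the tombstone's range need it.
pub fn applies_to_chunk(t: &Tombstone, chunk_min_time: i64, chunk_max_time: i64) -> bool {
    t.min_time <= chunk_max_time && chunk_min_time <= t.max_time
}
```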

Slide 23

Delete Properties
• They’re expensive!
  – Expensive to rewrite data
  – Should be cheap(ish) at query time
• You can do them in the background!
  – Even on a different server
  – Compact chunks into new ones and write a new catalog
• Drops are incredibly cheap
  – New structure means you can drop individual measurements!
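
Background delete processing can be sketched as a compaction that merges chunks, physically drops rows matched by tombstones, and swaps the catalog entries. Here rows stand in for Parquet data and the `Catalog` type is hypothetical; old chunks are removed from the catalog rather than modified, consistent with chunks being immutable once persisted.

```rust
use std::collections::BTreeMap;

/// Minimal row for this sketch: a timestamp plus one tag value.
#[derive(Debug, Clone, PartialEq)]
pub struct Row {
    pub time: i64,
    pub host: String,
}

/// Hypothetical catalog: chunk ID -> rows (standing in for Parquet files).
pub struct Catalog {
    pub next_chunk_id: u64,
    pub chunks: BTreeMap<u64, Vec<Row>>,
}

impl Catalog {
    /// Compact `ids` into one new chunk, dropping rows for which
    /// `is_deleted` (the tombstone check) returns true. Returns the new ID.
    pub fn compact<F: Fn(&Row) -> bool>(&mut self, ids: &[u64], is_deleted: F) -> u64 {
        let mut merged: Vec<Row> = ids
            .iter()
            .filter_map(|id| self.chunks.get(id))
            .flatten()
            .cloned()
            .filter(|r| !is_deleted(r))
            .collect();
        merged.sort_by_key(|r| r.time);
        let new_id = self.next_chunk_id;
        self.next_chunk_id += 1;
        // Swap catalog entries: old chunks out, compacted chunk in.
        for id in ids {
            self.chunks.remove(id);
        }
        self.chunks.insert(new_id, merged);
        new_id
    }
}
```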

Slide 24

Project Update
• Mutable Buffer + Read Buffer lifecycle
  – In process
  – Query (without optimizations)
• WAL segment persistence PR up
• Chunk persistence within two weeks
• Recovery shortly after
• Arrow Flight RPC!
• Some optimization and numbers
• Builds in Feb (hopefully)

Slide 25

Thank You