
The Architecture of a Distributed Analytics and Storage Engine for Massive Time-Series Data

The numerical analysis of time-series data isn't new. The scale of today's problems is. With millions of concurrent data streams, some running at one million samples per second, storing the data and keeping it continuously available for analysis is a daunting challenge.

Theo Schlossnagle

February 26, 2015

Transcript

  1. A fly-by tour of the design of Snowth, a distributed database for storage and analysis of time-series telemetry. http://l42.org/FQE
  2. Problem Space • System Availability • Significant Retention (10 years) • > 10^7 different metrics • Frequency Range [1mHz - 1GHz] • ~1ms for time-range retrieval • Support tomorrow’s “data scientist” https://www.flickr.com/photos/design-dog/4358548056
  3. A rather epic data storage problem. What we are scribing to disk: 1 @ 1/min : 525,000/yr; 10MM @ 1/min : 5.25×10^12/yr; 1 @ 1kHz : 31.5×10^9/yr; 10MM @ 1kHz : 3.15×10^18/yr. Photo by: Nicolas Buffler (CC BY 2.0) (modified)
  4. Storing data requires a Data Format (stats). In: some number of samples. Out: number of samples, average, stddev, counter, counter stddev, derivative, derivative stddev (in 32 bytes).
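
To make the 32-byte figure concrete, here is a minimal sketch of what such a rollup record could look like; the field names and widths are assumptions for illustration, not Snowth's actual on-disk format.

```c
#include <stdint.h>

/* One per-period rollup record: a plausible 32-byte packing.
 * Field names and widths are illustrative, not Snowth's actual layout. */
typedef struct stats_record {
  uint64_t count;             /* number of samples in the period       */
  float    average;           /* mean of the samples                   */
  float    stddev;            /* standard deviation of the samples     */
  float    counter;           /* rate when interpreted as a counter    */
  float    counter_stddev;    /* standard deviation of that rate       */
  float    derivative;        /* first derivative (rate of change)     */
  float    derivative_stddev; /* standard deviation of the derivative  */
} stats_record_t;

_Static_assert(sizeof(stats_record_t) == 32, "rollup record packs to 32 bytes");
```
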
  5. Storing data requires a Data Format (histogram). In: lots of measurements. Out: a set of buckets representing two significant digits of precision in base ten, and a count of samples seen in each bucket.
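
A minimal sketch of that bucketing idea, keeping two significant decimal digits per sample; the types and function below are illustrative (sign handling and exponent clamping are omitted), not the exact Snowth encoding.

```c
#include <math.h>

/* A bucket keeps two significant decimal digits: the pair (mantissa, power
 * of ten) with mantissa in [10,100).  E.g. 0.00123 -> 12 x 10^-4 and
 * 4375 -> 43 x 10^2.  The histogram itself is then a count per bucket.  */
typedef struct { int exponent; int mantissa; } bucket_t;

static bucket_t bucket_of(double v) {
  bucket_t b = { 0, 0 };
  double av = fabs(v);                        /* sign ignored for brevity  */
  if (av == 0.0) return b;                    /* zero gets its own bucket  */
  b.exponent = (int)floor(log10(av)) - 1;     /* scale mantissa into 10..99 */
  b.mantissa = (int)(av / pow(10.0, b.exponent));
  if (b.mantissa > 99) { b.mantissa /= 10; b.exponent++; }  /* fp rounding guard */
  return b;
}
```
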
  6. Managing the economics: Histograms. We solve this problem by supporting “histogram” as a first-class datatype within Snowth. Introduce some controlled time error. Introduce some controlled value error.
  7. I didn’t come to talk about on-disk format. [Diagram: ZFS on-disk layout (Illustration 14): Meta Object Set, uberblock_phys_t array, dnode_phys_t entries.] We use a combination of a fork of leveldb and proprietary on-disk formats… it has also changed a bit over time and stands to change a bit going forward… but that would be a different talk.
  8. [Screenshot: a dashboard listing 91 operational graphs, e.g. snowth6 IO latency, Snowth Cluster Peer Lag, Snowth NNT Aggregate Put Calls, Snowth Space, Metric Velocity, API request rate, metrics seen by broker.]
  9. Understanding the Data: science + big data. This is not a new world, but we felt our constraints made the solution space new.
  10. Quick Recap ❖ Multi-petabyte scale ❖ Zero downtime ❖ Fast retrieval ❖ Fast data-local math
  11. High-level architecture: Consistent Hashing. 2^256 buckets, not v-buckets. K-V, but V are append-only. http://www.flickr.com/photos/colinzhu/312559485/
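
As a rough illustration of that ring (not Snowth's code), the sketch below hashes a metric key to a 2^256 position using SHA-256 (OpenSSL here is an assumed stand-in for any 256-bit hash) and walks clockwise to the first virtual node at or past it.

```c
#include <openssl/sha.h>   /* assumed dependency for a 256-bit hash */
#include <string.h>

#define NVNODES 24         /* e.g. 6 physical nodes x 4 virtual nodes each */

typedef struct {
  unsigned char pos[SHA256_DIGEST_LENGTH]; /* position on the 2^256 ring      */
  int owner;                               /* physical node owning this vnode */
} vnode_t;

/* Ring positions compare as 256-bit big-endian integers. */
static int ring_cmp(const unsigned char *a, const unsigned char *b) {
  return memcmp(a, b, SHA256_DIGEST_LENGTH);
}

/* Hash the metric key onto the ring and walk clockwise to the first virtual
 * node at or after it, wrapping to the lowest vnode if none (n must be > 0). */
static int owner_of(const char *key, const vnode_t *ring, int n) {
  unsigned char h[SHA256_DIGEST_LENGTH];
  SHA256((const unsigned char *)key, strlen(key), h);
  int best = -1, lowest = 0;
  for (int i = 0; i < n; i++) {
    if (ring_cmp(ring[i].pos, ring[lowest].pos) < 0) lowest = i;
    if (ring_cmp(ring[i].pos, h) >= 0 &&
        (best < 0 || ring_cmp(ring[i].pos, ring[best].pos) < 0))
      best = i;
  }
  return ring[best < 0 ? lowest : best].owner;
}
```
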
  12. [Diagram: a consistent-hash ring of virtual nodes n1-1 … n6-4, four per physical node.]
  13. [Diagram: the ring with an object o1 hashed onto it.]
  14. [Diagram: the ring with object o1 (continued).]
  15. [Diagram: the ring with the virtual node ordering changed.]
  16. [Diagram: the ring of virtual nodes n1-1 … n6-4.]
  17. [Diagram: the ring with nodes labeled Availability Zone 1 and Availability Zone 2.]
  18. [Diagram: the zone-labeled ring with object o1.]
  19. [Diagram: Availability Zone 1 and Availability Zone 2 with object o1 on the ring.]
  20. [Diagram: two rings side by side, one per availability zone.]
  21. Problems ❖ It turns out we spend a ton of time writing logs. ❖ So we wrote a log subsystem, optionally asynchronous: ❖ a non-blocking MPSC FIFO between publishers and the log writer ❖ one thread dedicated per log sink (usually a file) ❖ support for POSIX files, jlogs, and pluggable log writers (modules) ❖ We also have a synchronous in-memory ring buffer log (with debugger support) ❖ DTrace instrumentation of logging calls (this is life-alteringly useful)
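
For a feel of the "non-blocking MPSC FIFO" between publishers and the writer thread, here is a sketch based on the well-known Vyukov intrusive MPSC queue; it is an assumption about the shape of such a queue, not libmtev's actual implementation.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Intrusive multi-producer / single-consumer queue (Vyukov-style sketch).
 * Producers are wait-free; only the dedicated log-writer thread pops. */
typedef struct log_node {
  _Atomic(struct log_node *) next;
  char line[256];                 /* formatted log line (illustrative) */
} log_node_t;

typedef struct {
  _Atomic(log_node_t *) head;     /* producers swap themselves in here      */
  log_node_t *tail;               /* consumer-private cursor                */
  log_node_t stub;                /* dummy node to avoid empty-queue races  */
} log_fifo_t;

static void fifo_init(log_fifo_t *q) {
  atomic_store(&q->stub.next, NULL);
  atomic_store(&q->head, &q->stub);
  q->tail = &q->stub;
}

/* Called by any publisher thread. */
static void fifo_push(log_fifo_t *q, log_node_t *n) {
  atomic_store(&n->next, NULL);
  log_node_t *prev = atomic_exchange(&q->head, n);  /* publish point */
  atomic_store(&prev->next, n);                     /* link in       */
}

/* Called only by the log-writer thread; returns NULL if (momentarily) empty. */
static log_node_t *fifo_pop(log_fifo_t *q) {
  log_node_t *tail = q->tail;
  log_node_t *next = atomic_load(&tail->next);
  if (tail == &q->stub) {                 /* skip over the stub node */
    if (next == NULL) return NULL;
    q->tail = next;
    tail = next;
    next = atomic_load(&next->next);
  }
  if (next != NULL) { q->tail = next; return tail; }
  if (tail != atomic_load(&q->head)) return NULL;   /* push in progress */
  fifo_push(q, &q->stub);                           /* re-insert stub   */
  next = atomic_load(&tail->next);
  if (next != NULL) { q->tail = next; return tail; }
  return NULL;
}
```

Each log sink (a file, a jlog, a module) would pair one such queue with its dedicated writer thread, which drains nodes and writes them out.
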
  22. Write Path Architecture 1.5 [Diagram: data submission enters the event loop, which hands the work to an I/O worker job; access and error logs are written.]
  23. Write Path Architecture 2.0 [Diagram: data submission enters the event loop and fans out to WL1, WL2, and WL3 worker lanes; access and error logs are written.]
  24. Problems ❖ The subtasks have contention based on key locality: ❖ updating different metrics vs. different times of one metric* (*only for some backends)
  25. Write Path Architecture 3.0 [Diagram: data submission enters the event loop; WL1 and WL2 feed worker lanes WL3.1, WL3.2 … WL3.n chosen by a hash on the resource; access and error logs are written.]
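
A hedged sketch of the "hash on resource" step: derive a worker lane from the metric's identity so every update to one metric lands on the same lane, and lanes stop contending on key locality. The hash (FNV-1a) and lane count are illustrative choices, not Snowth's.

```c
#include <stdint.h>

#define N_LANES 8   /* WL3.1 … WL3.n; illustrative lane count */

/* FNV-1a over the metric identifier; any stable hash works here. */
static uint64_t resource_hash(const char *metric_id) {
  uint64_t h = 14695981039346656037ULL;
  for (const unsigned char *p = (const unsigned char *)metric_id; *p; p++) {
    h ^= *p;
    h *= 1099511628211ULL;
  }
  return h;
}

/* All writes for a given metric land on the same lane, so different lanes
 * never contend over one metric's key range. */
static int lane_for(const char *metric_id) {
  return (int)(resource_hash(metric_id) % N_LANES);
}
```
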
  26. Problems ❖ plockstat showed we had significant contention ❖ writing replication journals ❖ we have several operations in each subtask ❖ operations that can be performed asynchronously to the subtask
  27. Write Path Architecture [Diagram: the 3.0 write path plus separate jobs writing replication journals for Journal node 1, Journal node 2 … Journal node n; access and error logs are written.]
  28. Job Queues
     [EVENTLOOP THREAD:X] ❖ while true ❖ while try jobJ <- queue:BX ❖ jobJ do “asynch cleanup” ❖ eventloop sleep for activity ❖ some event -> callback ❖ jobJ -> queue:W1 & sem_post()
     [JOBQ:W1 THREAD:Y] ❖ while true ❖ wakes up from sem_wait() ❖ jobJ <- queue:W1 ❖ jobJ do “asynch work” ❖ insert queue:BX ❖ wakeup eventloop on thr:X
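
A compressed sketch of that handoff, under assumptions of my own: mutex-protected queues (order ignored for brevity), a POSIX semaphore, and a pipe standing in for the port_send/kevent/eventfd wakeup mentioned on the next slide.

```c
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>
#include <stddef.h>

typedef struct job {
  struct job *next;
  void (*work)(struct job *);     /* runs on the jobq thread      */
  void (*cleanup)(struct job *);  /* runs on the eventloop thread */
} job_t;

typedef struct { job_t *head; pthread_mutex_t lock; } queue_t;

static queue_t queue_W1, queue_BX;  /* work queue, back/completion queue  */
static sem_t   w1_sem;              /* counts jobs waiting in W1          */
static int     wake_pipe[2];        /* wakes the eventloop thread (thr:X) */

static void q_push(queue_t *q, job_t *j) {
  pthread_mutex_lock(&q->lock); j->next = q->head; q->head = j;
  pthread_mutex_unlock(&q->lock);
}
static job_t *q_pop(queue_t *q) {
  pthread_mutex_lock(&q->lock);
  job_t *j = q->head; if (j) q->head = j->next;
  pthread_mutex_unlock(&q->lock);
  return j;
}

static void jobq_init(void) {
  pthread_mutex_init(&queue_W1.lock, NULL);
  pthread_mutex_init(&queue_BX.lock, NULL);
  sem_init(&w1_sem, 0, 0);
  (void)pipe(wake_pipe);
}

/* [JOBQ:W1 THREAD:Y] - spawned with pthread_create; does the blocking work. */
static void *jobq_thread(void *arg) {
  (void)arg;
  for (;;) {
    sem_wait(&w1_sem);                 /* wakes up when a job is posted */
    job_t *j = q_pop(&queue_W1);
    if (!j) continue;
    j->work(j);                        /* "asynch work"                 */
    q_push(&queue_BX, j);              /* hand back for cleanup         */
    (void)write(wake_pipe[1], "x", 1); /* wake the eventloop on thr:X   */
  }
  return NULL;
}

/* [EVENTLOOP THREAD:X] - drain completions, then sleep for activity. */
static void eventloop_iteration(void) {
  job_t *j;
  while ((j = q_pop(&queue_BX)) != NULL)
    j->cleanup(j);                     /* "asynch cleanup"              */
  /* ... poll() on sockets and wake_pipe[0]; a callback that needs blocking
   *     work does: q_push(&queue_W1, job); sem_post(&w1_sem); ...          */
}
```
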
  29. Job Queues: implementation ❖ online thread concurrency is mutable ❖ smoothed mean wait time and run time ❖ will return a job to origin thread for synchronous completion ❖ BFM job abortion using signals with sigsetjmp/siglongjmp [DRAGONS] ❖ we don’t use this feature in Snowth ❖ eventloop wakeup using port_send/kevent/eventfd. Photograph by Annie Mole
  30. Job Completion - simple refcnt ❖ begin: ❖ refcnt -> 1 ❖ add initial jobs… ❖ dec(refcnt) ->? 0 : complete ❖ add job: ❖ inc(refcnt) ❖ complete job: ❖ dec(refcnt) ->? 0 : complete
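
That pattern is just an atomic reference count over a batch of jobs; a minimal sketch with C11 atomics (the names here are mine, not Snowth's):

```c
#include <stdatomic.h>

typedef struct batch {
  atomic_int refcnt;
  void (*complete)(struct batch *);
} batch_t;

/* begin: the submitter holds one reference of its own. */
static void batch_begin(batch_t *b, void (*complete)(batch_t *)) {
  atomic_init(&b->refcnt, 1);
  b->complete = complete;
}

/* add job: take a reference before handing work to a queue. */
static void batch_add_job(batch_t *b) {
  atomic_fetch_add(&b->refcnt, 1);
}

/* complete job (and the final drop of the submitter's reference):
 * whoever takes the count to zero runs the completion. */
static void batch_release(batch_t *b) {
  if (atomic_fetch_sub(&b->refcnt, 1) == 1)
    b->complete(b);
}
```
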

  31. So, how does all this play out? What’s the performance look like? A telemetry store has benefits: highly different workloads, mostly uni-modal.
  32. Visualizing all I/O latency. The slice: 3.2×10^6 samples. The graph: 300×10^6 samples. The retrieval pipeline is simple.
  33. Nothing is ever as simple as it seems. Retrieval seems easy… but: accessible to data scientists; want to run math near data; make it safe and make it fast.
  34. Computation is cheap. Movement is expensive.* It’s like packing a truck and driving it to another state to have the inventory counted, vs. just packing a truck and counting. https://www.flickr.com/photos/kafka4prez/ *usually
  35. Allowing data-local analysis: Enabling Data Scientists. Code in C? (no) Must be fast. Must be process-local. LuaJIT.
  36. Problems ❖ Lua (and LuaJIT) ❖ are not multi-thread safe ❖ garbage collection can wreak havoc in high-performance systems ❖ Lua’s math support is somewhat limited
  37. Leveraging multiple cores for computation: Threads ❖ Separate lua state per OS thread: NPT ❖ Shared state requires lua/C crossover ❖ lua is very good at this, but… still presents significant impedance.
  38. Tail collection: Garbage Collection woes ❖ NPTs compete for work: ❖ wait for work (consume) ❖ disable GC ❖ do work -> report completion ❖ enable GC ❖ force full GC run https://www.flickr.com/photos/neate_photos/6160275942
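
A minimal sketch of that loop using the standard Lua C API (which LuaJIT also implements); the work-queue hooks are placeholders, and each NPT is assumed to own its own lua_State as the previous slide describes.

```c
#include <lua.h>

/* Placeholders standing in for the real work-queue and analysis hooks. */
extern void *wait_for_work(void);
extern void  run_lua_work(lua_State *L, void *work);
extern void  report_completion(void *work);

/* Each native processing thread (NPT) owns its own lua_State, since Lua
 * states are not thread-safe.  GC is paused while work is in flight and a
 * full collection is forced only in the idle "tail" between work items. */
static void npt_loop(lua_State *L) {
  for (;;) {
    void *work = wait_for_work();      /* consume a task              */
    lua_gc(L, LUA_GCSTOP, 0);          /* disable GC while working    */
    run_lua_work(L, work);             /* do the data-local math      */
    report_completion(work);
    lua_gc(L, LUA_GCRESTART, 0);       /* enable GC again             */
    lua_gc(L, LUA_GCCOLLECT, 0);       /* force a full GC run         */
  }
}
```
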
  39. Tail collection: Maths, Math, and LuaJIT ❖ We use (a forked) numlua: ❖ FFTW*, BLAS, LAPACK, CDFs ❖ It turns out that LuaJIT is wicked fast for our use-case. ❖ Memory management is an issue.
  40. Overall (simplified) Architecture [Diagram: the full write path: event loop with data access, WL1/WL2 and the hashed worker lanes WL3.1, WL3.2 … WL3.n, journal jobs for Journal node 1 … Journal node n, an NPT job for data-local computation, and access/error logs.]
  41. Heavy lifting: libmtev. The birth of mtev - https://github.com/circonus-labs/libmtev. mtev was a project to make the eventer itself multi-core and make it all a library. https://www.flickr.com/photos/kartlasarn/6477880613
  42. Mount Everest Framework [Diagram: libmtev building blocks: log subsystem (access and error logs), config management, multi-core eventloop, dynamic job queues, online console (# show mem, # write mem, # shutdown), POSIX/TLS and HTTP protocol listeners, hook framework, DSO modules, LuaJIT integration, journal/eventloop job wiring; https:// support marked COMING SOON.]
  43. References ❖ Circonus - http://www.circonus.com ❖ libmtev - https://github.com/circonus-labs/libmtev ❖ Concurrency Kit - http://concurrencykit.org ❖ LuaJIT - http://luajit.org ❖ More on Snowth - http://l42.org/EwE ❖ plockstat - https://www.illumos.org/man/1M/plockstat