Slide 1

Slide 1 text

A fly-by tour of the design of Snowth, a distributed database for storage and analysis of time-series telemetry. http://l42.org/FQE

Slide 2

Slide 2 text

The many faces of Theo Schlossnagle (@postwait), CEO, Circonus

Slide 3

Slide 3 text

Problem Space
• System Availability
• Significant Retention (10 years)
• > 10^7 different metrics
• Frequency Range [1mHz - 1GHz]
• ~1ms for time range retrieval
• Support tomorrow’s “data scientist”
Image: https://www.flickr.com/photos/design-dog/4358548056

Slide 4

Slide 4 text

A rather epic data storage problem. What we are scribing to disk:
1@1/min : 525,000/yr
10MM@1/min : 5.25×10^12/yr
1@1kHz : 31.5×10^9/yr
10MM@1kHz : 3.15×10^17/yr
Photo by: Nicolas Buffler (CC BY 2.0) (modified)
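A quick sanity check of those rates as a back-of-the-envelope calculation; the figures here are simply derived from seconds-per-year, not taken from Snowth:

```c
/* Back-of-the-envelope check of the ingestion rates on the slide above. */
#include <stdio.h>

int main(void) {
    const double SEC_PER_YEAR = 365.0 * 24 * 60 * 60;       /* 31,536,000 */
    double per_min = SEC_PER_YEAR / 60.0;                   /* one metric at 1/min */
    double per_khz = SEC_PER_YEAR * 1000.0;                 /* one metric at 1 kHz */

    printf("1@1/min    : %.3g samples/yr\n", per_min);        /* ~5.25e5  */
    printf("10MM@1/min : %.3g samples/yr\n", per_min * 1e7);  /* ~5.25e12 */
    printf("1@1kHz     : %.3g samples/yr\n", per_khz);        /* ~3.15e10 */
    printf("10MM@1kHz  : %.3g samples/yr\n", per_khz * 1e7);  /* ~3.15e17 */
    return 0;
}
```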

Slide 5

Slide 5 text

Storing data requires a Data Format (stats)
In: some number of samples
Out: number of samples, average, stddev, counter, counter stddev, derivative, derivative stddev (in 32 bytes)
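Seven aggregates plus a sample count fit in 32 bytes if each is a 4-byte field. A minimal sketch of such a layout; the field names and types are illustrative, not Snowth's actual on-disk record:

```c
/* Illustrative 32-byte rollup record: a 4-byte count plus seven 4-byte
 * aggregates (one slot reserved to pad to exactly 32 bytes). */
#include <stdint.h>

typedef struct {
    uint32_t count;             /* number of samples in the period       */
    float    average;           /* mean of the raw values                */
    float    stddev;
    float    counter;           /* rate, treating the value as a counter */
    float    counter_stddev;
    float    derivative;        /* first derivative of the value         */
    float    derivative_stddev;
    float    reserved;          /* pad to 32 bytes                       */
} rollup_t;

_Static_assert(sizeof(rollup_t) == 32, "rollup must be exactly 32 bytes");
```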

Slide 6

Slide 6 text

Storing data requires a Data Format (histogram)
In: lots of measurements
Out: a set of buckets representing two significant digits of precision in base ten and a count of samples seen in that bucket.
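Bucketing to two significant base-10 digits means every value collapses to a pair (mantissa 10..99, power of ten). A sketch of that mapping, as an illustration rather than Circonus's implementation:

```c
/* Map a value to its two-significant-digit, base-10 bucket:
 * v falls in [m * 10^e, (m+1) * 10^e) with m in 10..99. */
#include <math.h>
#include <stdio.h>

typedef struct { int mantissa; int exponent; } bucket_t;

static bucket_t bucket_of(double v) {
    bucket_t b = {0, 0};
    if (v == 0) return b;                          /* zero gets its own bucket */
    double av = fabs(v);
    b.exponent = (int)floor(log10(av)) - 1;        /* keep two leading digits  */
    b.mantissa = (int)(av / pow(10, b.exponent));  /* 10..99                   */
    if (v < 0) b.mantissa = -b.mantissa;
    return b;
}

int main(void) {
    bucket_t b = bucket_of(4237.5);
    printf("%d x 10^%d\n", b.mantissa, b.exponent); /* 42 x 10^2: [4200, 4300) */
    return 0;
}
```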

Slide 7

Slide 7 text

Managing the economics: Histograms
We solve this problem by supporting “histogram” as a first-class datatype within Snowth.
Introduce some controlled time error. Introduce some controlled value error.

Slide 8

Slide 8 text

I didn’t come to talk about on-disk format.
[Diagram: ZFS on-disk layout illustration, uberblock_phys_t / dnode_phys_t / Meta Object Set structures]
We use a combination of a fork of leveldb and proprietary on-disk formats… it also has changed a bit over time and stands to change a bit going forward… but, that would be a different talk.

Slide 9

Slide 9 text

[Screenshot: Circonus UI listing 91 graphs, including snowth6 IO latency, Snowth Cluster Peer Lag, Snowth NNT Aggregate Put Calls, Snowth Space, Metric Velocity, and Metrics / second]

Slide 10

Slide 10 text

Understanding the Data: science + big data
This is not a new world, but we felt our constraints made the solution space new.

Slide 11

Slide 11 text

Quick Recap
❖ Multi-petabyte scale
❖ Zero downtime
❖ Fast retrieval
❖ Fast data-local math

Slide 12

Slide 12 text

High-level architecture: Consistent Hashing
2^256 buckets, not v-buckets
K-V, but V are append-only
Image: http://www.flickr.com/photos/colinzhu/312559485/
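The ring has 2^256 positions (a 256-bit hash space) and values are append-only logs keyed by metric. A toy lookup illustrating the idea; it shrinks the hash to 64-bit FNV-1a purely for brevity, and the metric key is hypothetical:

```c
/* Toy consistent-hash owner lookup on a ring (64-bit stand-in for 2^256). */
#include <stdint.h>
#include <stdio.h>

static uint64_t fnv1a(const char *s) {
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (uint8_t)*s++; h *= 1099511628211ULL; }
    return h;
}

int main(void) {
    const char *nodes[] = { "n1-1", "n1-2", "n2-1", "n2-2" };
    uint64_t key = fnv1a("web01`cpu`idle");   /* hypothetical metric key */

    /* Owner: node with the smallest hash >= key, wrapping to the ring's
     * smallest hash when nothing is larger (the "walk clockwise" rule). */
    const char *owner = NULL, *first = NULL;
    uint64_t best = UINT64_MAX, min = UINT64_MAX;
    for (int i = 0; i < 4; i++) {
        uint64_t h = fnv1a(nodes[i]);
        if (h < min)              { min = h;  first = nodes[i]; }
        if (h >= key && h < best) { best = h; owner = nodes[i]; }
    }
    printf("owner: %s\n", owner ? owner : first);
    return 0;
}
```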

Slide 13

Slide 13 text

[Diagram: a consistent-hash ring of 24 nodes, n1-1 through n6-4]

Slide 14

Slide 14 text

[Diagram: the 24-node ring with an object o1 hashed to a position on it]

Slide 15

Slide 15 text

[Diagram: the 24-node ring with object o1, continuing the placement walkthrough]

Slide 16

Slide 16 text

[Diagram: the 24-node consistent-hash ring]

Slide 17

Slide 17 text

[Diagram: the 24-node consistent-hash ring]

Slide 18

Slide 18 text

[Diagram: the 24-node ring split across Availability Zone 1 and Availability Zone 2]

Slide 19

Slide 19 text

[Diagram: the ring across Availability Zone 1 and Availability Zone 2, with object o1 placed]

Slide 20

Slide 20 text

[Diagram: the ring across Availability Zone 1 and Availability Zone 2, with object o1]

Slide 21

Slide 21 text

[Diagram: two 24-node rings, one per Availability Zone]

Slide 22

Slide 22 text

A real ring. Keep it simple, stupid.
We actually don’t do split AZ.

Slide 23

Slide 23 text

This talk is about Basic System Architecture: Threads, Disk I/O, Network I/O

Slide 24

Slide 24 text

Write Path Architecture 0.1
[Diagram: Data Submission → event loop → I/O worker job]

Slide 25

Slide 25 text

Problems
❖ Slow. slow. slow. slow. slow.
❖ Apply DTrace (and plockstat… which is DTrace)

Slide 26

Slide 26 text

Write Path Architecture 1.0
[Diagram: Data Submission → event loop → I/O worker job]

Slide 27

Slide 27 text

Problems
❖ It turns out we spend a ton of time writing logs.
❖ So we wrote a log subsystem, optionally asynchronous:
❖ non-blocking mpsc fifo between publishers and log writer (see the sketch after this list)
❖ one thread dedicated per log sink (usually a file)
❖ support POSIX files, jlogs, and pluggable log writers (modules)
❖ We also have a synchronous in-memory ring buffer log (w/ debugger support)
❖ DTrace instrumentation of logging calls (this is life-alteringly useful)
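The core of the async path is a queue that publishers push onto without blocking, drained by one dedicated writer thread per sink. A minimal sketch of that shape; the real fifo is non-blocking and FIFO, while this sketch uses the simpler lock-free push-then-reverse trick:

```c
/* Publishers push log lines lock-free; the single writer thread drains. */
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct logline {
    struct logline *next;
    char msg[256];
} logline_t;

static _Atomic(logline_t *) head = NULL;

/* Any thread: O(1), never blocks the publisher. */
void log_publish(const char *msg) {
    logline_t *n = malloc(sizeof(*n));
    snprintf(n->msg, sizeof(n->msg), "%s", msg);
    n->next = atomic_load(&head);
    while (!atomic_compare_exchange_weak(&head, &n->next, n))
        ;  /* on failure the CAS refreshed n->next; retry */
}

/* The log-writer thread: take the whole batch, restore arrival order. */
void log_drain(FILE *sink) {
    logline_t *batch = atomic_exchange(&head, NULL), *rev = NULL;
    while (batch) {                       /* reverse LIFO into FIFO */
        logline_t *next = batch->next;
        batch->next = rev; rev = batch; batch = next;
    }
    for (logline_t *l = rev; l; ) {
        fputs(l->msg, sink); fputc('\n', sink);
        logline_t *done = l; l = l->next; free(done);
    }
}
```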

Slide 28

Slide 28 text

Write Path Architecture 1.5
[Diagram: Data Submission → event loop → I/O worker job, plus access and error log sinks]

Slide 29

Slide 29 text

Problems
❖ The subtasks have different workload characteristics due to different backends

Slide 30

Slide 30 text

Write Path Architecture 2.0
[Diagram: Data Submission → event loop → dedicated WL1, WL2, WL3 workers, plus access and error logs]

Slide 31

Slide 31 text

Problems
❖ The subtasks have contention based on key locality:
❖ updating different metrics vs. different times of one metric*
*only for some backends

Slide 32

Slide 32 text

Write Path Architecture 3.0
[Diagram: Data Submission → event loop → WL1, WL2, and WL3.1…WL3.n workers, jobs hashed on resource; access and error logs]
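"Hash on resource" pins all writes for one metric to one WL3 lane, so same-key updates serialize on a single thread instead of contending on locks. A sketch of that fan-out; the lane count and function name are illustrative:

```c
/* Pick a WL3 lane by hashing the metric key (32-bit FNV-1a). */
#include <stdint.h>

#define NLANES 8  /* WL3.1 .. WL3.n */

static uint32_t lane_for(const char *metric_key) {
    uint32_t h = 2166136261u;
    while (*metric_key) { h ^= (uint8_t)*metric_key++; h *= 16777619u; }
    return h % NLANES;   /* all updates to one key land on one lane */
}
```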

Slide 33

Slide 33 text

Problems
❖ plockstat showed we had significant contention writing replication journals
❖ we have several operations in each subtask
❖ operations that can be performed asynchronously to the subtask

Slide 34

Slide 34 text

Write Path Architecture
[Diagram: the 3.0 pipeline plus per-peer journal writers (Journal node 1 … Journal node n) fed asynchronously as jobs; access and error logs]

Slide 35

Slide 35 text

Job Queues
[EVENTLOOP THREAD:X]
❖ while true
❖ while try jobJ <- queue:BX
❖ jobJ do “asynch cleanup”
❖ eventloop sleep for activity
❖ some event -> callback
❖ jobJ -> queue:W1 & sem_post()
[JOBQ:W1 THREAD:Y]
❖ while true
❖ wakes up from sem_wait()
❖ jobJ <- queue:W1
❖ jobJ do “asynch work”
❖ insert queue:BX
❖ wakeup eventloop on thr:X
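The same handoff in compilable form, as a minimal sketch: the queues here are mutex-guarded and LIFO purely for brevity (the real ones are non-blocking FIFOs), and eventloop_wakeup is left extern (an eventfd flavor is sketched under the next slide):

```c
#include <pthread.h>
#include <semaphore.h>
#include <stddef.h>

typedef struct job {
    struct job *next;
    void (*work)(struct job *);     /* blocking part, runs on jobq thread   */
    void (*cleanup)(struct job *);  /* "asynch cleanup", runs on eventloop  */
} job_t;

typedef struct { job_t *head; pthread_mutex_t lock; } queue_t;

static queue_t W1 = { NULL, PTHREAD_MUTEX_INITIALIZER };  /* worker queue   */
static queue_t BX = { NULL, PTHREAD_MUTEX_INITIALIZER };  /* backqueue to X */
static sem_t W1_sem;                  /* sem_init(&W1_sem, 0, 0) at startup */
extern void eventloop_wakeup(void);

static void push(queue_t *q, job_t *j) {
    pthread_mutex_lock(&q->lock); j->next = q->head; q->head = j;
    pthread_mutex_unlock(&q->lock);
}
static job_t *pop(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    job_t *j = q->head;
    if (j) q->head = j->next;
    pthread_mutex_unlock(&q->lock);
    return j;
}

/* EVENTLOOP THREAD:X */
void eventloop_submit(job_t *j) { push(&W1, j); sem_post(&W1_sem); }
void eventloop_drain(void)      { job_t *j; while ((j = pop(&BX))) j->cleanup(j); }

/* JOBQ:W1 THREAD:Y */
void *jobq_worker(void *arg) {
    (void)arg;
    for (;;) {
        sem_wait(&W1_sem);                /* wakes up from sem_wait()  */
        job_t *j = pop(&W1);
        if (!j) continue;
        j->work(j);                       /* jobJ do "asynch work"     */
        push(&BX, j);                     /* insert queue:BX           */
        eventloop_wakeup();               /* wakeup eventloop on thr:X */
    }
    return NULL;
}
```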

Slide 36

Slide 36 text

Job Queues: implementation
❖ online thread concurrency is mutable
❖ smoothed mean wait time and run time
❖ will return a job to origin thread for synchronous completion.
❖ BFM job abortion using signals with sigsetjmp/siglongjmp [DRAGONS]
❖ we don’t use this feature in Snowth
❖ eventloop wakeup using: port_send/kevent/eventfd
Photograph by Annie Mole
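Of those three wakeup primitives, here is the eventfd flavor (Linux) as a minimal sketch; port_send (illumos) and kevent (BSD) fill the same role on their platforms, and the function names here are illustrative:

```c
/* Wake a sleeping eventloop from a jobq thread via eventfd. */
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int wakeup_fd = -1;

void wakeup_init(void) {
    /* register wakeup_fd for reads with the eventloop's poller */
    wakeup_fd = eventfd(0, EFD_NONBLOCK);
}

void eventloop_wakeup(void) {        /* callable from any jobq thread */
    uint64_t one = 1;
    write(wakeup_fd, &one, sizeof(one));
}

void eventloop_on_wakeup(void) {     /* poller fired on wakeup_fd */
    uint64_t count;
    read(wakeup_fd, &count, sizeof(count));  /* reset the counter */
    /* ...then drain the backqueue of completed jobs... */
}
```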

Slide 37

Slide 37 text

Job Completion - simple refcnt
❖ begin:
❖ refcnt -> 1
❖ add initial jobs…
❖ dec(refcnt) ->? 0 : complete
❖ add job:
❖ inc(refcnt)
❖ complete job:
❖ dec(refcnt) ->? 0 : complete
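The same pattern with C11 atomics, as a minimal sketch (type and function names are illustrative). Starting the count at 1 keeps the group alive while initial jobs are added; the deref that reaches zero, wherever it happens, fires completion exactly once:

```c
#include <stdatomic.h>

typedef struct {
    atomic_int refcnt;
    void (*on_complete)(void *);
    void *closure;
} jobgroup_t;

void jobgroup_begin(jobgroup_t *g) { atomic_init(&g->refcnt, 1); }      /* refcnt -> 1 */

void jobgroup_add(jobgroup_t *g)   { atomic_fetch_add(&g->refcnt, 1); } /* inc(refcnt) */

void jobgroup_deref(jobgroup_t *g) {           /* dec(refcnt) ->? 0 : complete */
    if (atomic_fetch_sub(&g->refcnt, 1) == 1)  /* we held the last reference   */
        g->on_complete(g->closure);
}
/* Usage: begin; jobgroup_add() per initial job; jobgroup_deref() once to
 * drop the seed; each completed job calls jobgroup_deref(). */
```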


Slide 38

Slide 38 text

So, how does all this play out? What does the performance look like?
A telemetry store has benefits: highly different workloads, mostly uni-modal.

Slide 39

Slide 39 text

Visualizing all I/O latency
the slice: 3.2×10^6 samples
the graph: 300×10^6 samples
retrieval pipeline is simple

Slide 40

Slide 40 text

Nothing is ever as simple as it seems.
Retrieval seems easy… but to be accessible to data scientists, who want to run math near the data, you have to make it safe and make it fast.

Slide 41

Slide 41 text

Computation is cheap. Movement is expensive.*
It’s like packing a truck and driving it to another state to have the inventory counted, vs. just packing a truck and counting.
https://www.flickr.com/photos/kafka4prez/
*usually

Slide 42

Slide 42 text

Allowing data-local analysis: Enabling Data Scientists
Code in C? (no)
Must be fast. Must be process-local.
LuaJIT.

Slide 43

Slide 43 text

Problems
❖ Lua (and LuaJIT):
❖ are not multi-thread safe
❖ garbage collection can wreak havoc in high performance systems
❖ lua’s math support is somewhat limited

Slide 44

Slide 44 text

Leveraging multiple cores for computation: Threads
❖ Separate lua state per OS thread: NPT
❖ Shared state requires lua/C crossover
❖ lua is very good at this, but… still presents significant impedance.
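One lua_State per native processing thread means no Lua-level locking at all. A minimal sketch of that arrangement using thread-specific storage; the lazy-create helper is an assumption for illustration, not libmtev's API:

```c
/* One LuaJIT VM per OS thread, created lazily on first use. */
#include <lua.h>
#include <lualib.h>
#include <lauxlib.h>
#include <pthread.h>

static pthread_key_t  lua_key;
static pthread_once_t lua_once = PTHREAD_ONCE_INIT;

static void state_destroy(void *v) { lua_close((lua_State *)v); }
static void make_key(void) { pthread_key_create(&lua_key, state_destroy); }

lua_State *thread_lua(void) {
    pthread_once(&lua_once, make_key);
    lua_State *L = pthread_getspecific(lua_key);
    if (!L) {
        L = luaL_newstate();          /* this thread's private VM */
        luaL_openlibs(L);
        pthread_setspecific(lua_key, L);
    }
    return L;                         /* never shared across threads */
}
```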

Slide 45

Slide 45 text

Tail collection: Garbage Collection woes
❖ NPTs compete for work:
❖ wait for work (consume)
❖ disable GC
❖ do work -> report completion
❖ enable GC
❖ force full GC run
https://www.flickr.com/photos/neate_photos/6160275942
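That discipline maps directly onto the stock Lua C API, sketched here under the assumption that work_fn_ref is a luaL_ref'd Lua function: collection stops while a work item runs, then a full cycle is forced in the gap before the next item:

```c
/* Bracket each work item so GC pauses land between items, not inside them. */
#include <lua.h>

void npt_run_work(lua_State *L, int work_fn_ref) {
    lua_gc(L, LUA_GCSTOP, 0);                       /* disable GC          */
    lua_rawgeti(L, LUA_REGISTRYINDEX, work_fn_ref); /* fetch the work fn   */
    lua_pcall(L, 0, 0, 0);                          /* do work -> report   */
    lua_gc(L, LUA_GCRESTART, 0);                    /* enable GC           */
    lua_gc(L, LUA_GCCOLLECT, 0);                    /* force a full GC run */
}
```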

Slide 46

Slide 46 text

Tail collection: Maths, Math, and LuaJIT
❖ We use (a forked) numlua: FFTW*, BLAS, LAPACK, CDFs
❖ It turns out that LuaJIT is wicked fast for our use-case.
❖ Memory management is an issue.

Slide 47

Slide 47 text

Overall (simplified) Architecture
[Diagram: event loop with Data Access; WL1, WL2, and WL3.1…WL3.n workers hashed on resource; per-peer journal writers (Journal node 1 … Journal node n); NPT job pool; access and error logs]

Slide 48

Slide 48 text

The birth of mtev - https://github.com/circonus-labs/libmtev
Heavy lifting: libmtev
mtev was a project to make the eventer itself multi-core and make it all a library.
https://www.flickr.com/photos/kartlasarn/6477880613

Slide 49

Slide 49 text

Mount Everest Framework
[Diagram: libmtev components: Log Subsystem (access/error logs), Config management, Multi-core Eventloop, Dynamic Job Queues, Online Console (# show mem, # write mem, # shutdown), POSIX/TLS, HTTP Protocol Listener, Hook Framework, DSO Modules, LuaJIT Integration, https:// COMING SOON]

Slide 50

Slide 50 text

Thanks! We’re Hiring!

Slide 51

Slide 51 text

References
❖ Circonus - http://www.circonus.com
❖ libmtev - https://github.com/circonus-labs/libmtev
❖ Concurrency Kit - http://concurrencykit.org
❖ LuaJIT - http://luajit.org
❖ More on Snowth - http://l42.org/EwE
❖ plockstat - https://www.illumos.org/man/1M/plockstat