A fly-by tour of the design of Snowth
A distributed database for storage and analysis of time-series telemetry
http://l42.org/FQE
Slide 2
The many faces of
Theo Schlossnagle @postwait
CEO Circonus
Slide 3
Problem Space
• System Availability
• Significant Retention (10 years)
• > 10^7 different metrics
• Frequency Range [1mHz - 1GHz]
• ~1ms for time range retrieval
• Support tomorrow’s “data scientist”
https://www.flickr.com/photos/design-dog/4358548056
Slide 4
A rather epic data storage problem.
What we are scribing to disk
1 @ 1/min : 525,000/yr
10MM @ 1/min : 5.25×10^12/yr
1 @ 1kHz : 31.5×10^9/yr
10MM @ 1kHz : 3.15×10^17/yr
Photo by: Nicolas Buffler (ccby20) (modified)
Slide 5
Storing data requires a
Data Format (stats)
In:
some number of samples
Out:
number of samples, average, stddev,
counter, counter stddev,
derivative, derivative stddev
(in 32 bytes)
Slide 6
Storing data requires a
Data Format (histogram)
In:
lots of measurements
Out:
a set of buckets representing two
significant digits of precision in base
ten and a count of samples seen in
that bucket.
Slide 7
Managing the economics
Histograms
We solve this problem by supporting
“histogram” as a first-class datatype
within Snowth.
Introduce some controlled time error.
Introduce some controlled value error.
Slide 8
I didn’t come to talk about
[Figure: ZFS on-disk format — Meta Object Set layout: uberblock_phys_t array, dnode_phys_t metadnode, and the object directory (root_dataset = 2, config = 4, sync_bplist = 1023)]
On-disk format
We use a combination of a fork of
leveldb and proprietary on-disk
formats…
it also has changed a bit over time and
stands to change a bit going forward…
but, that would be a different talk.
Slide 9
[Screenshot: Circonus dashboard listing graphs — snowth6 IO latency, Snowth Cluster Peer Lag, Snowth NNT Aggregate Put Calls, Snowth Space, Metric Velocity, API request rate, anomaly examples, and others; view Jan 04–05 2015]
Slide 10
Understanding the
Data science + big data
This is not a new world,
but we felt our constraints made the
solution space new.
Slide 11
Quick Recap
❖ Multi petabyte scale
❖ Zero downtime
❖ Fast retrieval
❖ Fast data-local math
Slide 12
High-level architecture: Consistent Hashing
2^256 buckets, not v-buckets
K-V, but V are append-only
http://www.flickr.com/photos/colinzhu/312559485/
Problems
❖ It turns out we spend a ton of time writing logs.
❖ So we wrote a log subsystem, optionally asynchronous
❖ non-blocking mpsc fifo between publishers and log writer
❖ one thread dedicated per log sink (usually a file)
❖ support POSIX files, jlogs, and pluggable log writers (modules)
❖ We also have a synchronous in-memory ring buffer log (w/ debugger support)
❖ DTrace instrumentation of logging calls (this is life-alteringly useful)
Problems
❖ The subtasks have
❖ contention based on key locality:
❖ updating different metrics vs. different times of one metric*
*only for some backends
Problems
❖ plockstat showed we had significant contention
❖ writing replication journals
❖ we have several operations in each subtask
❖ operations that can be performed
asynchronously to the subtask
Job Queues
[EVENTLOOP THREAD:X]
❖ while true
❖ while try jobJ <- queue:BX
❖ jobJ do “asynch cleanup”
❖ eventloop sleep for activity
❖ some event -> callback
❖ jobJ -> queue:W1 & sem_post()
[JOBQ:W1 THREAD:Y]
❖ while true
❖ wakes up from sem_wait()
❖ jobJ <- queue:W1
❖ jobJ do “asynch work”
❖ insert queue:BX
❖ wakeup eventloop on thr:X
Slide 36
Job Queues: implementation
❖ online thread concurrency is mutable
❖ smoothed mean wait time and run time
❖ will return a job to origin thread for
synchronous completion.
❖ BFM job abortion using signals with
sigsetjmp/siglongjmp [DRAGONS]
❖ we don’t use this feature in Snowth
❖ eventloop wakeup using:
port_send/kevent/eventfd
Photograph by Annie Mole
So, how does all this play out? What’s the performance look like?
A telemetry store serves highly different workloads
mostly uni-modal
Slide 39
Visualizing all I/O latency
the slice: 3.2×10^6 samples
the graph: 300×10^6 samples
retrieval pipeline is simple
Slide 40
Nothing is ever as simple as it seems.
Retrieval seems easy… but data scientists
want to run math near data
make it safe and make it fast
Slide 41
Computation is cheap
Movement is expensive*
It’s like packing a truck, driving it to
another state to have the inventory
counted
vs.
just packing a truck and counting.
https://www.flickr.com/photos/kafka4prez/
*usually
Slide 42
Allowing data-local analysis
Enabling Data Scientists
Code in C? (no)
Must be fast.
Must be process-local.
LuaJIT.
Slide 43
Problems
❖ Lua (and LuaJIT)
❖ are not multi-thread safe
❖ garbage collection can wreak havoc in high performance systems
❖ lua’s math support is somewhat limited
Slide 44
Leveraging multiple cores for computation
Threads
❖ Separate lua state per OS thread: NPT
❖ Shared state requires lua/C crossover
❖ lua is very good at this, but…
still presents significant impedance.
Slide 45
Tail collection
Garbage Collection woes
❖ NPTs compete for work:
❖ wait for work (consume)
❖ disabled GC
❖ do work -> report completion
❖ enable GC
❖ force full GC run
https://www.flickr.com/photos/neate_photos/6160275942
Slide 46
Tail collection
Maths, Math, and LuaJIT
❖ We use (a forked) numlua:
❖ FFTW*, BLAS, LAPACK, CDFs
❖ It turns out that LuaJIT is:
wicked fast for our use-case.
❖ Memory management is an issue.
The birth of mtev - https://github.com/circonus-labs/libmtev
Heavy lifting: libmtev mtev was a project to make
the eventer itself multi-core
and make it all a library
https://www.flickr.com/photos/kartlasarn/6477880613
Slide 49
Mount Everest Framework
[Diagram: libmtev components — Log Subsystem (access/error logs, journal), Config management, Multi-core Eventloop, Dynamic Job Queues, Online Console (# show mem, # write mem, # shutdown), POSIX/TLS and HTTP Protocol Listeners, Hook Framework, DSO Modules, LuaJIT Integration]
COMING SOON