
The Architecture of a Distributed Analytics and Storage Engine for Massive Time-Series Data


The numerical analysis of time-series data isn't new. The scale of today's problems is. With millions of concurrent data streams, some of which run at 1MM samples per second, storing the data and keeping it continuously available for analysis is a daunting challenge.

Theo Schlossnagle

February 26, 2015



Transcript

  1. A fly-by tour of the design of Snowth
    A distributed database for
    storage and analysis of
    time-series telemetry
    http://l42.org/FQE


  2. The many faces of
    Theo Schlossnagle @postwait
    CEO Circonus


  3. Problem Space
    • System Availability
    • Significant Retention (10 years)
• > 10^7 different metrics
    • Frequency Range [1mHz - 1GHz]
    • ~1ms for time range retrieval
    • Support tomorrow’s “data scientist”
    https://www.flickr.com/photos/design-dog/4358548056


  4. A rather epic data storage problem.
What we are scribing to disk:
    1 stream @ 1/min : 525,000/yr
    10MM streams @ 1/min : 5.25×10^12/yr
    1 stream @ 1kHz : 31.5×10^9/yr
    10MM streams @ 1kHz : 3.15×10^17/yr
    Photo by: Nicolas Buffler (ccby20) (modified)

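    Sanity check on the arithmetic above: each figure is streams × rate × seconds per year, with roughly 3.15×10^7 seconds in a year. The worst case, for example:

    10^7 streams × 10^3 samples/s × 3.15×10^7 s/yr = 3.15×10^17 samples/yr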

  5. Storing data requires a
    Data Format (stats)
    In:
some number of samples
    Out:
    number of samples, average, stddev,
    counter, counter stddev,
    derivative, derivative stddev
    (in 32 bytes)

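    A minimal sketch of what a 32-byte rollup record like that could look like; the field names and layout here are assumptions for illustration, not Snowth's actual format:

    /* Hypothetical 32-byte rollup record, one per metric per period.
     * Field names and layout are illustrative, not Snowth's format. */
    #include <stdint.h>

    typedef struct {
      uint32_t count;             /* number of samples              */
      float    average;           /* mean of the raw samples        */
      float    stddev;            /* stddev of the raw samples      */
      float    counter;           /* reset-corrected counter rate   */
      float    counter_stddev;
      float    derivative;        /* signed first derivative        */
      float    derivative_stddev;
      uint32_t pad;               /* pad out to exactly 32 bytes    */
    } rollup_t;

    _Static_assert(sizeof(rollup_t) == 32, "rollup must be 32 bytes");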

  6. Storing data requires a
    Data Format (histogram)
    In:
    lots of measurements
    Out:
    a set of buckets representing two
    significant digits of precision in base
    ten and a count of samples seen in
    that bucket.

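    A toy sketch of that bucketing rule, illustrative only (the production datatype is Circonus's log-linear histogram): keep a two-digit mantissa and a base-10 exponent per positive sample.

    #include <math.h>
    #include <stdio.h>

    /* Map a positive sample to its bucket: 4237.0 -> 42 x 10^2,
     * i.e. the bucket [4200, 4300). Illustrative, not the real code. */
    static void bucket_of(double v, int *mantissa, int *exponent) {
      *exponent = (int)floor(log10(v)) - 1;   /* keep two digits */
      *mantissa = (int)(v / pow(10.0, *exponent));
    }

    int main(void) {
      int m, e;
      bucket_of(4237.0, &m, &e);
      printf("4237.0 -> %d x 10^%d\n", m, e);  /* 42 x 10^2 */
      return 0;
    }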

  7. Managing the economics
    Histograms
We solve this problem by supporting
    “histogram” as a first-class datatype
    within Snowth.
    Introduce some controlled time error.
    Introduce some controlled value error.


  8. I didn’t come to talk about
    On-disk format
    [diagram: ZFS on-disk layout, showing the uberblock_phys_t array, meta object set, and dnode_phys_t field detail]
    We use a combination of a fork of
    leveldb and proprietary on-disk
    formats…
    it has also changed a bit over time and
    stands to change a bit going forward…
    but, that would be a different talk.


  9. [screenshot: a Circonus dashboard listing 91 graphs, 21 per page: Snowth IO latency, cluster peer lag, NNT aggregate put calls, Snowth disk space, beacon and API request rates, metric velocity, anomaly examples, and more]


  10. Understanding the
    Data science + big data
    This is not a new world,
    but we felt our constraints made the
    solution space new.


  11. Quick Recap
    ❖ Multi petabyte scale
    ❖ Zero downtime
    ❖ Fast retrieval
    ❖ Fast data-local math


  12. High-level architecture
    Consistent Hashing
    2^256 buckets, not v-buckets
    K-V, but V are append-only
    http://www.flickr.com/photos/colinzhu/312559485/

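    A toy version of that lookup, shrunk to 64-bit ring positions for illustration; the positions, node count, and key hash below are made up (Snowth's real ring has 2^256 buckets):

    /* Toy consistent-hash ring: each node owns several points; a key
     * lives on the first point at or clockwise from its hash. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { uint64_t pos; int node; } ring_pt_t;

    static int cmp_pt(const void *a, const void *b) {
      uint64_t x = ((const ring_pt_t *)a)->pos;
      uint64_t y = ((const ring_pt_t *)b)->pos;
      return (x > y) - (x < y);
    }

    /* First ring point at or after h, wrapping past the top. */
    static int owner(const ring_pt_t *ring, int n, uint64_t h) {
      for (int i = 0; i < n; i++)
        if (ring[i].pos >= h) return ring[i].node;
      return ring[0].node;
    }

    int main(void) {
      ring_pt_t ring[] = {           /* 3 nodes, 2 points each */
        {0x1000, 1}, {0x3000, 2}, {0x5000, 3},
        {0x9000, 1}, {0xb000, 2}, {0xe000, 3},
      };
      int n = (int)(sizeof(ring) / sizeof(ring[0]));
      qsort(ring, n, sizeof(ring[0]), cmp_pt);
      printf("key 0x4242 -> node %d\n", owner(ring, n, 0x4242));
      return 0;
    }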

  13. [ring diagram: the consistent-hash ring with 24 positions, n1-1 through n6-4, four per node across six nodes]


  14. [ring diagram: the same ring with a new node o1 joining]


  15. [ring diagram: the ring with o1 placed on it]


  16. [ring diagram: the 24-position ring without o1]


  17. [ring diagram: the full 24-position ring]


  18. [ring diagram: the ring split across Availability Zone 1 and Availability Zone 2]


  19. [ring diagram: the two-availability-zone ring with a new node o1 joining]


  20. [ring diagram: the two-availability-zone ring with o1 placed]


  21. [ring diagram: two rings side by side, one per availability zone, each with all 24 positions]


  22. A real ring
    Keep it simple, stupid.
    We actually don’t do split AZ.


  23. This talk is about
Basic System Architecture
    Threads,
    Disk I/O,
    Network I/O


  24. Write Path Architecture 0.1
    [diagram: Data Submission -> event loop -> I/O worker job]


  25. Problems
    ❖ Slow. slow. slow. slow. slow.
    ❖ Apply DTrace (and plockstat… which is DTrace)


  26. Write Path Architecture 1.0
    [diagram: Data Submission -> event loop -> I/O worker job]


  27. Problems
    ❖ It turns out we spend a ton of time writing logs.
    ❖ So we wrote a log subsystem, optionally asynchronous
    ❖ non-blocking mpsc fifo between publishers and log writer
    ❖ one thread dedicated per log sink (usually a file)
    ❖ support POSIX files, jlogs, and pluggable log writers (modules)
    ❖ We also have a synchronous in-memory ring buffer log (w/ debugger support)
    ❖ DTrace instrumentation of logging calls (this is life-alteringly useful)

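    A compact sketch of a non-blocking MPSC fifo in that spirit, a Vyukov-style intrusive queue (illustrative; not libmtev's implementation):

    /* Vyukov-style intrusive MPSC queue: many publishers, one
     * dedicated log-writer thread. Illustrative sketch only. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef struct node { _Atomic(struct node *) next; const char *msg; } node_t;
    typedef struct { _Atomic(node_t *) head; node_t *tail; node_t stub; } mpsc_t;

    static void mpsc_init(mpsc_t *q) {
      atomic_store(&q->stub.next, NULL);
      atomic_store(&q->head, &q->stub);
      q->tail = &q->stub;
    }

    /* Producers (any thread): swing head, then link the predecessor. */
    static void mpsc_push(mpsc_t *q, node_t *n) {
      atomic_store(&n->next, NULL);
      node_t *prev = atomic_exchange(&q->head, n);
      atomic_store(&prev->next, n);
    }

    /* Consumer (the single log-writer thread). NULL means empty or a
     * producer is mid-push; the writer retries on its next wakeup. */
    static node_t *mpsc_pop(mpsc_t *q) {
      node_t *tail = q->tail, *next = atomic_load(&tail->next);
      if (tail == &q->stub) {                 /* skip the stub node */
        if (next == NULL) return NULL;
        q->tail = next; tail = next; next = atomic_load(&next->next);
      }
      if (next) { q->tail = next; return tail; }
      if (tail != atomic_load(&q->head)) return NULL;  /* push in flight */
      mpsc_push(q, &q->stub);                 /* re-seed the stub */
      next = atomic_load(&tail->next);
      if (next) { q->tail = next; return tail; }
      return NULL;
    }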

  28. Write Path Architecture 1.5
    [diagram: Data Submission -> event loop -> I/O worker job, plus access and error logs]


  29. Problems
    ❖ The subtasks have
    ❖ different workload characteristics due to different backends



  30. Write Path Architecture 2.0
    [diagram: Data Submission -> event loop -> per-workload workers WL1, WL2, WL3, plus access and error logs]


  31. Problems
    ❖ The subtasks have
    ❖ contention based on key locality:
    ❖ updating different metrics vs. different times of one metric*


    *only for some backends


  32. Write Path Architecture 3.0
    [diagram: Data Submission -> event loop -> WL1, WL2, and WL3.1…WL3.n workers, hashed on resource, plus access and error logs]


  33. Problems
    ❖ plockstat showed we had significant contention
    ❖ writing replication journals
    ❖ we have several operations in each subtask
❖ operations that can be performed asynchronously to the subtask




  34. Write Path Architecture
    [diagram: Data Submission -> event loop -> WL1, WL2, and WL3.1…WL3.n workers, hashed on resource; journal-writer jobs for nodes 1…n; access and error logs]


  35. Job Queues
    [EVENTLOOP THREAD:X]
    ❖ while true
    ❖ while try jobJ <- queue:BX
    ❖ jobJ do “asynch cleanup”
    ❖ eventloop sleep for activity
    ❖ some event -> callback
    ❖ jobJ -> queue:W1 & sem_post()

    [JOBQ:W1 THREAD:Y]
    ❖ while true
    ❖ wakes up from sem_wait()
    ❖ jobJ <- queue:W1
    ❖ jobJ do “asynch work”
    ❖ insert queue:BX
    ❖ wakeup eventloop on thr:X

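    A condensed sketch of that handoff; the names, the simple locked list, and the worker shape are illustrative, not libmtev's implementation, and the eventloop wakeup (port_send/kevent/eventfd) is elided:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stddef.h>

    typedef struct job {
      struct job *next;
      void (*work)(struct job *);     /* "asynch work", off the event loop */
      void (*cleanup)(struct job *);  /* "asynch cleanup", back on thr:X   */
    } job_t;

    typedef struct { job_t *head; pthread_mutex_t lock; sem_t ready; } jobq_t;

    static void jobq_enqueue(jobq_t *q, job_t *j) {  /* eventloop thread X */
      pthread_mutex_lock(&q->lock);
      j->next = q->head; q->head = j;
      pthread_mutex_unlock(&q->lock);
      sem_post(&q->ready);                           /* wake one worker */
    }

    static void *jobq_worker(void *arg) {            /* JOBQ:W1 thread Y */
      jobq_t *q = arg;
      for (;;) {
        sem_wait(&q->ready);                         /* sleep until posted */
        pthread_mutex_lock(&q->lock);
        job_t *j = q->head;
        if (j) q->head = j->next;
        pthread_mutex_unlock(&q->lock);
        if (j) {
          j->work(j);   /* do the blocking work here */
          /* then return jobJ to queue:BX and wake the eventloop on
             thr:X, which runs j->cleanup before sleeping again */
        }
      }
      return NULL;
    }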

  36. Job Queues: implementation
    ❖ online thread concurrency is mutable
    ❖ smoothed mean wait time and run time
❖ will return a job to origin thread for synchronous completion
    ❖ BFM job abortion using signals with sigsetjmp/siglongjmp [DRAGONS]
    ❖ we don’t use this feature in Snowth
    ❖ eventloop wakeup using: port_send/kevent/eventfd
    Photograph by Annie Mole


  37. Job Completion - simple refcnt
❖ begin:
    ❖ refcnt -> 1
    ❖ add initial jobs…
    ❖ dec(refcnt) -> 0 ? complete
    ❖ add job:
    ❖ inc(refcnt)
    ❖ complete job:
    ❖ dec(refcnt) -> 0 ? complete


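    The same idiom as a C11-atomics sketch (illustrative, not Snowth's code): the initiator holds one reference, each added job holds one more, and whoever drops the count to zero runs completion.

    #include <stdatomic.h>

    typedef struct {
      atomic_int refcnt;
      void (*on_complete)(void *closure);
      void *closure;
    } task_t;

    static void task_begin(task_t *t)   { atomic_store(&t->refcnt, 1); }
    static void task_add_job(task_t *t) { atomic_fetch_add(&t->refcnt, 1); }

    static void task_release(task_t *t) {
      /* fetch_sub returns the prior value: 1 means we were last out */
      if (atomic_fetch_sub(&t->refcnt, 1) == 1)
        t->on_complete(t->closure);
    }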

  38. So, how does all this play out? What’s the performance look like?
A telemetry store serves highly different workloads,
    each mostly uni-modal


  39. Visualizing all I/O latency
    the slice: 3.2×10^6 samples
    the graph: 300×10^6 samples
    retrieval pipeline is simple


  40. Nothing is ever as simple as it seems.
Retrieval seems easy… but it must be accessible to
    data scientists, who want to run math near the data:
    make it safe and make it fast


  41. Computation is cheap
    Movement is expensive*
    It’s like packing a truck, driving it to
    another state to have the inventory
    counted
    vs.
    just packing a truck and counting.
    https://www.flickr.com/photos/kafka4prez/
    *usually


  42. Allowing data-local analysis
    Enabling Data Scientists
    Code in C? (no)
    Must be fast.
    Must be process-local.
    LuaJIT.

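    A minimal sketch of that choice: embedding Lua/LuaJIT through the standard Lua C API and exposing a process-local data accessor. The function name and the data source are made up for illustration, not Snowth's API.

    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>
    #include <stdio.h>

    /* A C function exposed to Lua: stands in for a real data read. */
    static int l_sample(lua_State *L) {
      int i = (int)luaL_checkinteger(L, 1);
      lua_pushnumber(L, i * 0.5);
      return 1;
    }

    int main(void) {
      lua_State *L = luaL_newstate();
      luaL_openlibs(L);
      lua_register(L, "sample", l_sample);
      /* analyst code runs process-local, next to the data */
      if (luaL_dostring(L, "local s=0 for i=1,10 do s=s+sample(i) end print(s)"))
        fprintf(stderr, "%s\n", lua_tostring(L, -1));
      lua_close(L);
      return 0;
    }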

  43. Problems
    ❖ Lua (and LuaJIT)
    ❖ are not multi-thread safe
    ❖ garbage collection can wreak havoc in high performance systems
    ❖ lua’s math support is somewhat limited


  44. Leveraging multiple cores for computation
    Threads
    ❖ Separate lua state per OS thread: NPT
    ❖ Shared state requires lua/C crossover
❖ lua is very good at this, but… still presents significant impedance.


  45. Tail collection
    Garbage Collection woes
    ❖ NPTs compete for work:
    ❖ wait for work (consume)
❖ disable GC
    ❖ do work -> report completion
    ❖ enable GC
    ❖ force full GC run
    https://www.flickr.com/photos/neate_photos/6160275942

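    From the embedding side, that discipline might look like this (an illustrative sketch using the stock Lua C API, not Snowth's code): the collector stays off while a job runs, then a full collection runs at the tail.

    #include <lua.h>
    #include <lauxlib.h>
    #include <lualib.h>

    /* Run one unit of analyst work with the collector paused, then
     * collect between jobs so GC pauses never land mid-computation. */
    static void run_job_with_tail_gc(lua_State *L, const char *code) {
      lua_gc(L, LUA_GCSTOP, 0);        /* disable GC: no mid-job pauses */
      if (luaL_dostring(L, code))      /* do work -> report completion */
        lua_pop(L, 1);                 /* discard the error message */
      lua_gc(L, LUA_GCRESTART, 0);     /* enable GC */
      lua_gc(L, LUA_GCCOLLECT, 0);     /* force a full GC run */
    }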

  46. Tail collection
    Maths, Math, and LuaJIT
    ❖ We use (a forked) numlua:
    ❖ FFTW*, BLAS, LAPACK, CDFs
❖ It turns out that LuaJIT is: wicked fast for our use-case.
    ❖ Memory management is an issue.


  47. Overall (simplified) Architecture
    [diagram: event loop serving Data Access -> WL1, WL2, and WL3.1…WL3.n workers, hashed on resource; journal-writer jobs for nodes 1…n; NPT Lua job queue; access and error logs]


  48. The birth of mtev - https://github.com/circonus-labs/libmtev
Heavy lifting: libmtev
    mtev was a project to make
    the eventer itself multi-core
    and make it all a library
    https://www.flickr.com/photos/kartlasarn/6477880613


  49. Mount Everest Framework
    [diagram: libmtev components: multi-core eventloop, dynamic job queues, log subsystem (access/error), config management, online console (# show mem, # write mem, # shutdown), journals, POSIX/TLS listeners, HTTP protocol listener (https:// coming soon), hook framework, DSO modules, LuaJIT integration]


  50. Thanks! We’re Hiring!


  51. References
    ❖ Circonus - http://www.circonus.com
    ❖ libmtev - https://github.com/circonus-labs/libmtev
    ❖ Concurrency Kit - http://concurrencykit.org
    ❖ LuaJIT - http://luajit.org
    ❖ More on Snowth - http://l42.org/EwE
    ❖ plockstat - https://www.illumos.org/man/1M/plockstat
