Evolution of a Real-Time Web Analytics Platform

A talk on the data stores in use at GoSquared, given at the AllYourBase conference.

Geoff Wagstaff

October 18, 2013

Transcript

  1. The Evolution of a Real-Time Analytics Platform
    Geoff Wagstaff
    @TheDeveloper

  2. The Now dashboard

  3. The Trends dashboard

  4. Building Real-Time Analytics
    Behind the “Now” dashboard

  5. Back in 2009
    1 server
    LAMP stack
    Conventional hosting

  6. LiveStats v1

  7. Problem?
    First taste of scale
    WRITES

  8. Reads are easy to scale
    Primary: Writes
    Replica 1: Reads
    Replica 2: Reads
    Replica 3: Reads

  9. Writes? Not so much.
    Primary: MANY WRITES! :(
    Replica 1: Reads
    Replica 2: Reads
    Replica 3: Reads

  10. Scale Horizontally

  11. Node Node Node
    Requests Requests Requests
    NginX -> PHP-FPM <--> Memcache

  12. Stupidly high data transfer: several TB per day
    DB -> app -> DB round trips
    High latency on DB ops
    Race conditions
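
The race conditions on this slide come from the read, modify, write round trip between app and database. A minimal sketch of that lost-update problem follows; the dict is only a stand-in for a database row, and the function names are hypothetical, not taken from the talk.

```python
# Sketch of the lost-update race behind "DB -> app -> DB round trips".
# The dict is a stand-in for a database row; this is not GoSquared code.

def read_modify_write_interleaved(store, key):
    """Two clients each read, then each write back (value read) + 1."""
    a = store[key]          # client A reads
    b = store[key]          # client B reads the same value before A writes
    store[key] = a + 1      # client A writes
    store[key] = b + 1      # client B overwrites A's update


def atomic_increment(store, key, amount=1):
    """Server-side increment: read and write happen as a single step."""
    store[key] = store.get(key, 0) + amount
    return store[key]


racy = {"pageviews": 10}
read_modify_write_interleaved(racy, "pageviews")   # two updates, one is lost

safe = {"pageviews": 10}
atomic_increment(safe, "pageviews")
atomic_increment(safe, "pageviews")                # both updates are kept

print(racy["pageviews"], safe["pageviews"])
```

Moving the increment to the data store also removes one network round trip per update, which is the other cost the slide lists.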

  13. Redis to the rescue!
    “Advanced in-memory key-value store”

  14. Rich Data types

  15. Rich Data types
    Keys: GET, SET
    Hashes: HGET, HSET, HMSET
    Lists: LPUSH, LPOP, BLPOP
    Sets: SADD, SREM, SMEMBERS
    Sorted Sets: ZADD, ZREM, ZRANGE, ZINTERSTORE

  16. Solved concurrency problems
    Distributed locks
    Fast counters
    Fan-out Pub/Sub broadcast
    Message queues
    Diagram: Service, Service, Service; redis-1, redis-2
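
As an illustration of the distributed-lock idea, here is a minimal sketch of the common "set if absent, with expiry" pattern. `InMemoryStore` is a hypothetical stand-in that mimics only the `nx` and `px` options of a Redis-style `SET`; the lock name and helper functions are assumptions for the example, not the talk's implementation.

```python
import time
import uuid


class InMemoryStore:
    """Hypothetical stand-in for a Redis client; mimics SET with NX and PX only."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry time in monotonic seconds)

    def set(self, key, value, nx=False, px=None):
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] <= now:
            current = None  # treat an expired key as absent
        if nx and current is not None:
            return None  # key is held by someone else
        expires = now + px / 1000.0 if px is not None else float("inf")
        self._data[key] = (value, expires)
        return True

    def get(self, key):
        current = self._data.get(key)
        if current is None or current[1] <= time.monotonic():
            return None
        return current[0]

    def delete(self, key):
        return 1 if self._data.pop(key, None) is not None else 0


def acquire_lock(store, name, ttl_ms=5000):
    """Return a random token if the lock was taken, else None."""
    token = uuid.uuid4().hex
    return token if store.set(name, token, nx=True, px=ttl_ms) else None


def release_lock(store, name, token):
    """Release only if we still own the lock.

    Against a real server the check and the delete would need to run
    atomically (for example in a server-side script); here they do not.
    """
    if store.get(name) == token:
        store.delete(name)
        return True
    return False


store = InMemoryStore()
first = acquire_lock(store, "lock:flush")
second = acquire_lock(store, "lock:flush")   # refused while the first holder is active
released = release_lock(store, "lock:flush", first)
third = acquire_lock(store, "lock:flush")    # succeeds once released
```

The expiry keeps a crashed service from holding the lock forever, and the random token stops one service from releasing a lock that another has since acquired.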

  17. ACID
    Atomic
    Consistent
    Isolated
    Durable
    Other ACID DBs: MySQL, MongoDB

  18. Fast
    Redis 2.6.16 on 2.4GHz i7 MBP

  19. Redis deployment
    Single-process, one per core
    Run on m1.medium - 1 core, 3.5GB memory
    Redis cluster is coming!
    Now on ElastiCache

  20. Building Historical Analytics
    Behind the “Trends” dashboard

  21. Trends v1
    Sharded MySQL from outset
    Aging
    Unreliable

  22. The Trends dashboard

  23. MongoDB vs Cassandra

  24. MongoDB
    Document store: no schema, flexible
    Compelling replication & sharding features
    Fast in-place field updates similar to Redis

  25. Attempt #1: Store & aggregate
    Document for each list item, timestamp and site
    Aggregation framework: match, group, sort
    Collection per list type
    Pros: Flexible; Made app simpler
    Cons: Huge number of documents; Slow aggregate queries: ~1s+

  26. Attempt #2
    Document per list, timestamp and site
    Collection per list type
    Pros: Faster lookups (no aggregation); Fewer documents; Smaller _id
    Cons: Document size limit; Unordered; High data transfer
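
To make the difference between the two MongoDB attempts concrete, here is a rough sketch using plain Python dicts in place of documents. The field names (`site`, `ts`, `item`, `count`, `items`) and the sample values are assumptions for illustration; the slides do not show the real schema.

```python
# Attempt #1 shape: one document per list item, timestamp and site.
item_docs = [
    {"site": 1, "ts": 1382054400, "item": "/pricing", "count": 7},
    {"site": 1, "ts": 1382054400, "item": "/blog", "count": 3},
    {"site": 1, "ts": 1382140800, "item": "/pricing", "count": 2},
]


def list_from_item_docs(docs, site, ts):
    """Rebuild one list by matching, grouping and sorting many documents."""
    totals = {}
    for doc in docs:
        if doc["site"] == site and doc["ts"] == ts:
            totals[doc["item"]] = totals.get(doc["item"], 0) + doc["count"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Attempt #2 shape: one document per list, timestamp and site.
# The tuple key plays the role of a compact _id.
list_docs = {
    (1, 1382054400): {"items": {"/pricing": 7, "/blog": 3}},
}


def list_from_list_doc(docs, site, ts):
    """Single lookup, but the whole document is fetched and sorted client-side."""
    doc = docs.get((site, ts), {"items": {}})
    return sorted(doc["items"].items(), key=lambda kv: kv[1], reverse=True)


from_items = list_from_item_docs(item_docs, 1, 1382054400)
from_list = list_from_list_doc(list_docs, 1, 1382054400)
```

Both calls return the same list. The second avoids the aggregation step but pulls the entire document on every read and lets it grow without bound, which matches the size-limit, ordering and data-transfer drawbacks listed on the slide.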

  27. Downsides
    High random I/O
    Document size & relocation
    Fragmentation
    Database lock

  28. K.O. MongoDB

  29. Cassandra
    Distributed hash ring: masterless
    Linear scalability
    Built for scale + write throughput

  30. CQL
    SELECT sql AS cql FROM mysql WHERE query_language = 'good'
    Not as scary as Column Families + Thrift
    SQL Schemas + Querying

  31. CQL
    CREATE TABLE d_aggregate_day (
      sid int,      -- partition key
      ts int,       -- clustering key
      s text,       -- clustering key
      v counter,
      PRIMARY KEY (sid, ts, s));
    Distributed counters!
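
Counter columns are modified with `UPDATE ... SET v = v + n` rather than inserted. The sketch below builds such a statement for the table on this slide. The `%s` placeholder style, the `day_bucket` helper, and the idea that `ts` holds a day-aligned Unix timestamp are assumptions for illustration; adapt them to whatever driver and bucketing scheme you use.

```python
SECONDS_PER_DAY = 86400


def day_bucket(epoch_seconds):
    """Round a Unix timestamp down to the start of its UTC day (assumed ts format)."""
    return epoch_seconds - epoch_seconds % SECONDS_PER_DAY


def counter_increment(table, sid, ts, s, amount=1):
    """Build a parameterised CQL counter update for one (sid, ts, s) row."""
    if not table.isidentifier():
        raise ValueError("unexpected table name: %r" % table)
    statement = (
        "UPDATE " + table + " SET v = v + %s "
        "WHERE sid = %s AND ts = %s AND s = %s"
    )
    return statement, (amount, sid, ts, s)


stmt, params = counter_increment(
    "d_aggregate_day", sid=42, ts=day_bucket(1382054400 + 3600), s="/pricing", amount=3
)
print(stmt)
print(params)
```

Because `sid` is the partition key, all counters for one site live on the same replicas, and the clustering keys keep them ordered by `ts` and then `s` within that partition.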

  32. BASE
    Basically Available
    Soft-state
    Eventually consistent

  33. Eventual consistency isn’t a problem
    More efficient with the disk
    Low maintenance
    Cheap

  34. Redis + Cassandra = win
    Redis as a speed layer + aggregator for lists
    Cassandra as timeseries counter storage
    Collector -> Redis -> Cassandra
    Periodic flushes to Cassandra
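
The collector, speed layer and periodic flush can be sketched roughly as follows. Both classes are in-memory stand-ins for the Redis and Cassandra roles, and every name here is hypothetical; the sketch only shows how buffering many increments turns them into one write per key at flush time.

```python
from collections import Counter


class SpeedLayer:
    """In-memory stand-in for the Redis role: absorb increments quickly."""

    def __init__(self):
        self._pending = Counter()

    def record(self, sid, ts, item, amount=1):
        self._pending[(sid, ts, item)] += amount

    def drain(self):
        """Hand over everything buffered so far and start a fresh buffer."""
        drained, self._pending = self._pending, Counter()
        return drained


class CounterStore:
    """In-memory stand-in for the Cassandra role: time-series counters."""

    def __init__(self):
        self.rows = Counter()
        self.writes = 0

    def add(self, key, amount):
        self.rows[key] += amount
        self.writes += 1


def flush(speed_layer, counter_store):
    """One periodic flush: a single counter write per distinct key."""
    for key, amount in speed_layer.drain().items():
        counter_store.add(key, amount)


redis_like = SpeedLayer()
cassandra_like = CounterStore()
for _ in range(1000):
    redis_like.record(sid=1, ts=1382054400, item="/pricing")
redis_like.record(sid=1, ts=1382054400, item="/blog", amount=5)
flush(redis_like, cassandra_like)
print(cassandra_like.rows, cassandra_like.writes)
```

In this run 1001 recorded events become two writes to the counter store, which is the kind of reduction a periodic flush is meant to provide.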

  35. Exploit DBs' strengths
    Build an indestructible service
    Use the best tools for the job

  36. Thanks!
    Geoff Wagstaff
    @TheDeveloper
    engineering.gosquared.com
