MongoDB for Analytics

Presented at MongoSF on May 4th, 2012.

John Nunemaker

May 04, 2012

Transcript

  1. GitHub
    John Nunemaker
    MongoSF 2012
    May 4, 2012
    MongoDB for Analytics
    A loving conversation with @jnunemaker

  3. Background
    How hernias can be good for you

  6. 1 month
    Of evenings and weekends

  7. 1 year
    Since public launch

  8. 13 tiny servers
    2 web, 6 app, 3 db, 2 queue

  9. 7-8 Million
    Page views per day

  14. Implementation
    Imma show you how we do what we do baby

  15. Doing It (mostly) Live
    No aggregate querying

  18. get('/track.gif') do
      track_service.record(...)
      TrackGif
    end

  19. class TrackService
      def record(attrs)
        message = MessagePack.pack(attrs)
        @client.set(@queue, message)
      end
    end

  20. class TrackProcessor
      def run
        loop { process }
      end

      def process
        record(@client.get(@queue))
      end

      def record(message)
        attrs = MessagePack.unpack(message)
        Hit.record(attrs)
      end
    end

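The endpoint, service, and processor on slides 18 to 20 form a producer/consumer pipeline: the web request only enqueues a packed message, and a separate loop dequeues and records it. The following is a minimal sketch of that flow that runs on the Ruby standard library alone. It assumes an in-memory `Queue` in place of the real queue client and JSON in place of MessagePack; `InMemoryQueueClient` and `recorded` are hypothetical names that do not appear in the talk.

```ruby
require 'json'

# Hypothetical stand-in for the queue client: same set/get shape as on the
# slides, backed by Ruby's thread-safe Queue instead of a network queue.
class InMemoryQueueClient
  def initialize
    @queues = Hash.new { |hash, name| hash[name] = Queue.new }
  end

  def set(queue, message)
    @queues[queue] << message
  end

  # Returns nil when the queue is empty instead of blocking.
  def get(queue)
    @queues[queue].pop(true)
  rescue ThreadError
    nil
  end
end

class TrackService
  def initialize(client, queue)
    @client = client
    @queue = queue
  end

  # JSON stands in for MessagePack here (assumption, to stay stdlib-only).
  def record(attrs)
    @client.set(@queue, JSON.generate(attrs))
  end
end

class TrackProcessor
  attr_reader :recorded

  def initialize(client, queue)
    @client = client
    @queue = queue
    @recorded = []
  end

  # Processes one message; returns false when the queue is empty.
  def process
    message = @client.get(@queue)
    return false if message.nil?
    record(message)
    true
  end

  def record(message)
    # The talk calls Hit.record(attrs) here; this sketch only collects attrs.
    @recorded << JSON.parse(message)
  end
end

client = InMemoryQueueClient.new
service = TrackService.new(client, 'hits')
processor = TrackProcessor.new(client, 'hits')

service.record('site_id' => 'abc', 'screenx' => 1440)
processor.process
```

Keeping the request path down to a single enqueue is what lets the tracking endpoint return quickly while the heavier counter updates happen in the processor.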

  21. http://bit.ly/rt-kestrel

  22. class Hit
      def record
        site.atomic_update(site_updates)
        Resolution.record(self)
        Technology.record(self)
        Location.record(self)
        Referrer.record(self)
        Content.record(self)
        Search.record(self)
        Notification.record(self)
        View.record(self)
      end
    end

  23. class Resolution
      def record(hit)
        query = {'_id' => "..."}
        # One $inc touches all three counters in a single atomic update.
        update = {'$inc' => {}}
        update['$inc']["sx.#{hit.screenx}"] = 1
        update['$inc']["bx.#{hit.browserx}"] = 1
        update['$inc']["by.#{hit.browsery}"] = 1
        collection(hit.created_on)
          .update(query, update, :upsert => true)
      end
    end

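The update on slide 23 relies on two MongoDB behaviors: an upsert creates the document when no match exists, and `$inc` on a dotted path creates the nested counter if it is missing. The sketch below imitates both on plain Ruby hashes so the effect can be seen without a database. `apply_inc`, `upsert`, and the `'site-1:10'` id are hypothetical and only approximate what the server does.

```ruby
# Applies a MongoDB-style $inc to a plain Hash, creating nested hashes for
# dotted paths and treating missing counters as 0.
def apply_inc(doc, increments)
  increments.each do |path, amount|
    *parents, leaf = path.split('.')
    target = parents.reduce(doc) { |node, key| node[key] ||= {} }
    target[leaf] = target.fetch(leaf, 0) + amount
  end
  doc
end

# In-memory stand-in for a collection, keyed by _id. An upsert starts from
# a document holding only the _id when nothing matches.
def upsert(store, id, update)
  doc = store[id] ||= { '_id' => id }
  apply_inc(doc, update.fetch('$inc'))
end

store = {}
update = { '$inc' => { 'sx.1440' => 1, 'bx.1280' => 1, 'by.768' => 1 } }
upsert(store, 'site-1:10', update) # first hit creates the document
upsert(store, 'site-1:10', update) # second hit increments in place
```

Because the write never needs to read the document first, each hit costs one update per report rather than a read followed by a write.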

  28. Pros
    Space
    RAM
    Reads
    Live

  33. Cons
    Writes
    Constraints
    More Forethought
    No raw data

  34. http://bit.ly/rt-counters
    http://bit.ly/rt-counters2

  35. Time Frame
    Minute, hour, month, day, year, forever?

  36. # of Variations
    One document vs many

  37. Single Document
    Per Time Frame

  39. {
      "t" => 336381,
      "u" => 158951,
      "2011" => {
        "02" => {
          "18" => {
            "t" => 9,
            "u" => 6
          }
        }
      }
    }

  40. {
      '$inc' => {
        't' => 1,
        'u' => 1,
        '2011.02.18.t' => 1,
        '2011.02.18.u' => 1,
      }
    }

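The `$inc` on slide 40 bumps the all-time counters and the per-day counters in one update, using dotted keys derived from the date. A small sketch of how such keys could be generated follows. It assumes that `t` means total views and `u` means unique views, which the slides do not state, and `time_frame_update` is a hypothetical helper.

```ruby
require 'date'

# Builds a slide-40 style $inc for a given day. `unique` marks whether the
# hit also counts as a unique visit (assumption: "t" is total, "u" is unique).
def time_frame_update(date, unique:)
  day = date.strftime('%Y.%m.%d')
  inc = { 't' => 1, "#{day}.t" => 1 }
  if unique
    inc['u'] = 1
    inc["#{day}.u"] = 1
  end
  { '$inc' => inc }
end

first_visit = time_frame_update(Date.new(2011, 2, 18), unique: true)
repeat_visit = time_frame_update(Date.new(2011, 2, 18), unique: false)
```

Zero-padding the month and day keeps the keys fixed-width, matching the `"2011" => {"02" => {"18" => ...}}` nesting shown on slide 39.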

  41. Single Document
    For all ranges in time frame

  43. {
      "_id" => "...:10",
      "bx" => {
        "320" => 85,
        "480" => 318,
        "800" => 1938,
        "1024" => 5033,
        "1280" => 6288,
        "1440" => 2323,
        "1600" => 3817,
        "2000" => 137
      },
      "by" => {
        "480" => 2205,
        "600" => 7359,
        "768" => 4515,
        "900" => 3833,
        "1024" => 2026
      },
      "sx" => {
        "320" => 191,
        "480" => 179,
        "800" => 195,
        "1024" => 1059,
        "1280" => 5861,
        "1440" => 3533,
        "1600" => 7675,
        "2000" => 1279
      }
    }

  45. {
      '$inc' => {
        'sx.1440' => 1,
        'bx.1280' => 1,
        'by.768' => 1,
      }
    }

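The resolution document keeps a fixed set of range keys (320, 480, 800, and so on) rather than one key per exact pixel value, which is what bounds its size. The slides do not show how a raw value is mapped to a range, so the sketch below assumes one plausible rule: pick the smallest range that fits, and put anything larger into the last range. `WIDTH_BUCKETS`, `bucket_for`, and `width_increment` are hypothetical.

```ruby
# Range keys as they appear in the "bx" and "sx" sub-documents on the slides.
WIDTH_BUCKETS = [320, 480, 800, 1024, 1280, 1440, 1600, 2000].freeze

# Assumed rule: use the smallest bucket that fits the value, and put anything
# larger than the last bucket into the last bucket.
def bucket_for(value, buckets)
  buckets.find { |limit| value <= limit } || buckets.last
end

# Builds the single $inc entry for a width field such as "sx" or "bx".
def width_increment(prefix, pixels)
  { "#{prefix}.#{bucket_for(pixels, WIDTH_BUCKETS)}" => 1 }
end

increment = width_increment('sx', 1366)
```

Whatever the real rule is, mapping to a small fixed key set means the document stops growing once every range has been seen.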

  46. Many Documents
    Search terms, content, referrers...

  48. [
      {
        "_id" => ":",
        "t" => "ruby class variables",
        "sid" => BSON::ObjectId(''),
        "v" => 352
      },
      {
        "_id" => ":",
        "t" => "ruby unless",
        "sid" => BSON::ObjectId(''),
        "v" => 347
      },
    ]

  49. Writes
    {'_id' => "#{sid}:#{hash}"}

  50. Reads
    [['sid', 1], ['v', -1]]

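For high-cardinality data such as search terms, each (site, term) pair gets its own document: writes address it by a deterministic `_id`, and reads fetch a site's documents ordered by `v` descending. The sketch below imitates that pattern in memory. It assumes MD5 as the hash in the `_id`, which the slides do not specify, and `term_id`, `record_term`, and `top_terms` are hypothetical helpers.

```ruby
require 'digest/md5'

# Assumption: MD5 of the term is the "hash" part of the _id; the slide does
# not say which hash function is used.
def term_id(site_id, term)
  "#{site_id}:#{Digest::MD5.hexdigest(term)}"
end

# In-memory stand-in for an upsert with $inc on one document per (site, term).
def record_term(store, site_id, term)
  id = term_id(site_id, term)
  doc = store[id] ||= { '_id' => id, 'sid' => site_id, 't' => term, 'v' => 0 }
  doc['v'] += 1
  doc
end

# Mirrors a query on sid sorted by v descending, the access pattern that
# [['sid', 1], ['v', -1]] appears to describe.
def top_terms(store, site_id, limit)
  store.values
       .select { |doc| doc['sid'] == site_id }
       .sort_by { |doc| -doc['v'] }
       .first(limit)
end

store = {}
3.times { record_term(store, 'site-1', 'ruby unless') }
5.times { record_term(store, 'site-1', 'ruby class variables') }
record_term(store, 'site-2', 'mongodb')
top = top_terms(store, 'site-1', 2).map { |doc| [doc['t'], doc['v']] }
```

A deterministic `_id` means the write path never has to look a term up first; the same term always lands on the same document.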

  51. Growth
    Don’t say shard, don’t say shard...

  52. Partition Hot Data
    Currently using collections for time frames

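Using a collection per time frame keeps current writes concentrated in a small, recent collection, which is presumably what `collection(hit.created_on)` on slide 23 selects. A sketch of one possible naming scheme follows. The monthly granularity and the `base.YYYY.MM` format are assumptions, and `collection_name` is a hypothetical helper.

```ruby
require 'date'

# Assumed naming scheme: one collection per month, e.g. "resolutions.2012.05".
def collection_name(base, date)
  "#{base}.#{date.strftime('%Y.%m')}"
end

may_name = collection_name('resolutions', Date.new(2012, 5, 4))
june_name = collection_name('resolutions', Date.new(2012, 6, 1))
```

With a scheme like this, hits from different months never touch the same collection, so older collections go cold on their own.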

  53. Bigger, Faster Server
    More CPU, RAM, Disk Space

  54. Users
    Sites
    Content
    Referrers
    Terms
    Engines
    Resolutions
    Locations
    Users
    Sites
    Content
    Referrers
    Terms
    Engines
    Resolutions
    Locations

  55. Partition by Function
    Spread writes across a few servers

  56. Users
    Sites
    Content
    Referrers
    Terms
    Engines
    Resolutions
    Locations

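Partitioning by function means each kind of data lives on a specific server, so writes for different reports do not compete for the same machine. The sketch below shows a static routing table of that kind. The host names and the grouping of collections are invented for illustration and are not taken from the talk.

```ruby
# Hypothetical assignment of collections to servers; host names are made up.
SERVERS = {
  'db1.example.com' => %w[users sites],
  'db2.example.com' => %w[content referrers terms engines],
  'db3.example.com' => %w[resolutions locations]
}.freeze

# Inverts the table so a collection name can be looked up directly.
COLLECTION_TO_SERVER = SERVERS.flat_map { |host, names|
  names.map { |name| [name, host] }
}.to_h.freeze

# Raises KeyError for unknown collections rather than guessing a server.
def server_for(collection)
  COLLECTION_TO_SERVER.fetch(collection)
end

terms_host = server_for('terms')
```

A fixed table like this is much simpler to operate than sharding, at the cost of rebalancing by hand when one function outgrows its server.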

  57. Partition by Server
    Spread writes across a ton of servers,
    way down the road, not worried yet

  58. GitHub
    Thank you!
    [email protected]
    John Nunemaker
    MongoSF 2012
    May 4, 2012
    @jnunemaker
