Analytics at Wikipedia with Druid

Druid
December 11, 2017

Presented at Strata NYC 2017

Transcript

  1. Analytics at
    Wikipedia

  2. (image-only slide)

  3. “Imagine a world in which every single
    human being can freely share in the
    sum of all knowledge.”

  4. Senior Systems Engineer
    I have been with Wikimedia since 2012, mostly on
    infrastructure for analytics.
    Previously Lead SysAdmin at CouchSurfing.org
    http://www.mediawiki.org/wiki/User:Ottomata
    Andrew Otto

  5. Fangjin Yang
    Druid Engineer
    One of the original authors behind Druid. Co-founder @
    Imply.

  6. This is an
    overview of
    Wikimedia data
    sources, their
    history and
    scale.

  7. (image-only slide)

  8. Data Sources

  9. Data Sources
    2: Technical Challenges

  10. Data Sources
    2: Technical Challenges
    Scale and Complexity

  11. Data Sources
    1: Philosophical Stand

  12. Data Sources
    1: Philosophical Stand
    Privacy

  13. Wikipedia is ranked the
    6th largest website when
    it comes to pageviews [Alexa]
    Scale is an obvious issue
    ...

  14. Complexity is less
    obvious
    MediaWiki is not designed
    for analytics.
    It is hard to find the pages created
    on German Wikipedia in 2016 by users
    with more than 5 edits.

  15. Privacy
    We do not send data to
    third parties.
    Commercial solutions to gather user
    data are not an option.

  16. Data Sources
    Privacy

  17. Event Logs

  18. Captures rich events that conform
    to JSON schemas.
    Events are validated and sent off
    to various data stores.
    EventLogging
    Event Logs
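
    For illustration, a minimal sketch of schema-based event validation in Python,
    assuming a hypothetical "PageIssues" schema and the jsonschema package (this is
    not EventLogging's actual code):

    import jsonschema

    # Hypothetical EventLogging-style schema: a few typed, required fields.
    PAGE_ISSUES_SCHEMA = {
        "type": "object",
        "properties": {
            "action": {"type": "string"},
            "issue": {"type": "string"},
            "editCountBucket": {"type": "string"},
        },
        "required": ["action", "issue"],
        "additionalProperties": False,
    }

    def is_valid(event):
        """Return True if the event conforms to the schema."""
        try:
            jsonschema.validate(event, PAGE_ISSUES_SCHEMA)
            return True
        except jsonschema.ValidationError:
            return False

    # Valid events get forwarded to the downstream stores; invalid ones are dropped.
    print(is_valid({"action": "pageLoaded", "issue": "refs"}))  # True
    print(is_valid({"action": 42}))                             # False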

  19. Event Logs

  20. ~300 events per second (for now!)
    ~30 active schemas
    Event Logs

  21. Balance between
    transparency and
    privacy

  22. Must host all data in our
    datacenters (no Google Analytics)
    Event Logs

  23. Must host all data in our
    datacenters
    Event Logs
    Store as little private data as
    possible:
    - No IPs
    - Only sanitized user agents
    - Historical data (older than 90 days) must be
    fully cleansed of sensitive information
    - Should not be able to reconstruct browsing
    sessions
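
    A rough sketch of the kind of sanitization those rules imply (hypothetical
    field names, not the production purging code):

    from datetime import datetime, timedelta, timezone

    RETENTION = timedelta(days=90)

    def sanitize(event):
        """Drop or coarsen sensitive fields before long-term storage."""
        event = dict(event)
        event.pop("ip", None)                       # never store IP addresses
        ua = event.pop("user_agent", "")
        event["ua_family"] = ua.split("/")[0][:32]  # keep only a coarse UA family
        event.pop("session_token", None)            # no stable token, so browsing
        return event                                # sessions can't be rebuilt

    def needs_purging(event_time, now=None):
        """Events older than 90 days must be fully cleansed."""
        now = now or datetime.now(timezone.utc)
        return now - event_time > RETENTION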

  24. Event Logs

  25. Event Logs
    This part is
    holding us back!
    ...Druid?

  26. Future work:
    Teach folks to design event
    schemas that fit an OLAP data
    model
    Event Logs
    i.e. dimensions and metrics.

  27. Future work:
    Teach folks to design event
    schemas that fit an OLAP data
    model
    Event Logs
    i.e. dimensions and metrics.
    Auto ingest these into Druid to
    allow for use with nice GUIs
    (Pivot / Superset)
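
    As a sketch, an OLAP-friendly event separates low-cardinality dimensions from
    numeric metrics, which is what Pivot and Superset expect (hypothetical schema,
    not a real Wikimedia one):

    # Hypothetical reading event, reshaped for an OLAP data model.
    event = {
        # dimensions: fields you group and filter by
        "wiki": "enwiki",
        "platform": "mobile web",
        "country": "DE",
        # metrics: fields you aggregate
        "page_load_ms": 812,
        "events": 1,
    }

    DIMENSIONS = ["wiki", "platform", "country"]
    METRICS = ["page_load_ms", "events"]

    # Generic tooling could build a Druid ingestion spec from DIMENSIONS and
    # METRICS instead of hand-writing one per schema.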

  28. Data Sources
    Complexity

  29. MediaWiki
    Databases

  30. MediaWiki
    databases
    MediaWiki metadata is not huge,
    but analytics of edit history since
    the beginning of wiki time is
    dimensionally complex.

  31. ~1.5 million article revisions per day
    MediaWiki
    databases

  32. We want to learn about users and
    edit history.
    - How many new users over time?
    - How many edits over time?
    - How many edits do new users make?
    - How do editors stay active?
    - How many edits are made by bots?
    - Etc. etc. etc.
    MediaWiki
    databases

  33. Simple questions
    might be
    hard to answer

  34. MediaWiki
    databases
    How many
    new
    editors
    across all
    wikis?
    2 joins, 3 subqueries
    3 tables on 800+ wikis
    5 days to run

  35. MySQL (OLTP) works fine as
    application storage
    MediaWiki
    databases

  36. MySQL (OLTP) works fine as
    application storage
    but not so great for historical
    analytics.
    MediaWiki
    databases

  37. MySQL (OLTP) works fine as
    application storage
    but not so great for historical
    analytics.
    Can only query current state.
    MediaWiki
    databases

  38. However...
    Historical events are saved in a ‘log’ table.
    Log has juicy information about:
    MediaWiki
    databases

  39. page (AKA article) events
    (move, delete and restore)
    user events
    (rename, block, rights)
    ...which both go back almost
    to the beginning of
    wiki-time
    MediaWiki
    databases
    However...
    Historical events are saved in a ‘log’ table.
    Log has juicy information about:

  40. MediaWiki
    databases
    However...HOWEVER!

  41. MediaWiki
    databases
    However...HOWEVER!
    MediaWiki is an old(ish) popular
    open source project.

  42. MediaWiki
    databases
    However...HOWEVER!
    MediaWiki is an old(ish) popular
    open source project.
    Super cool! But this also leads us
    to the land of database
    design insanity!

  43. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!

  44. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.

  45. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!

  46. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record

  47. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record
    sometimes in PHP encoded
    blobs!

  48. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record
    page and user properties might
    be in different fields
    depending on age of event!
    sometimes in PHP encoded
    blobs!

  49. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record
    Durations of user blocks are
    stored in 7 different date
    formats!
    page and user properties might
    be in different fields
    depending on age of event!
    sometimes in PHP encoded
    blobs!

  50. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record
    Durations of user blocks are
    stored in 7 different date
    formats!
    page and user properties might
    be in different fields
    depending on age of event!
    Some revisions are
    historically orphaned
    (nonexistent page IDs!)
    sometimes in PHP encoded
    blobs!

  51. MediaWiki
    databases
    However...HOWEVER!
    Primary ids are missing
    for most events!
    Instead they have page
    title or user name.
    Pages and users are renamed all the time!
    Data is stored in different
    formats depending on
    MediaWiki version that created
    the log record
    Durations of user blocks are
    stored in 7 different date
    formats!
    page and user properties might
    be in different fields
    depending on age of event!
    Some revisions are
    historically orphaned
    (nonexistent page IDs!)
    Some page IDs were reused for
    different pages! (I guess IDs were a
    scarce commodity in 2006?)
    sometimes in PHP encoded
    blobs!

  52. MediaWiki
    databases
    Hooooboy,
    we have fun,
    don’t we?!

  53. MediaWiki history
    Solution: history reconstruction
    A complex series of Spark jobs that iterates
    over an imported snapshot of all MediaWiki
    databases.
    Final output: a large denormalized Hive table
    of all MediaWiki history events across
    all Wikimedia sites
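
    A greatly simplified sketch of the denormalization step in PySpark (illustrative
    table and column names; the real jobs also rebuild renames, missing IDs, and
    historical states):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("mediawiki-history-sketch").getOrCreate()

    # Imported snapshots of the MediaWiki tables, one row per wiki x record.
    revision = spark.table("wmf_raw.mediawiki_revision")
    page = spark.table("wmf_raw.mediawiki_page")
    user = spark.table("wmf_raw.mediawiki_user")

    # Join everything once, up front, so analysts never have to.
    history = (
        revision
        .join(page, ["wiki_db", "page_id"], "left")
        .join(user, ["wiki_db", "user_id"], "left")
        .select(
            "wiki_db", "rev_timestamp", "page_id", "page_title",
            "user_id", "user_name",
            F.col("user_editcount").alias("user_edit_count"),
        )
    )

    history.write.mode("overwrite").saveAsTable("wmf.mediawiki_history_sketch")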

  54. MediaWiki history
    How many
    new
    editors
    across all
    wikis?
    0 joins, 0 subqueries
    1 table
    5 minutes to run
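
    Against the denormalized table, the "new editors" question becomes a single
    scan; roughly, with illustrative table and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    new_editors = spark.sql("""
        SELECT substr(event_timestamp, 1, 7) AS month,
               COUNT(*)                      AS new_editors
        FROM wmf.mediawiki_history            -- one table, no joins, no subqueries
        WHERE event_entity = 'user'
          AND event_type = 'create'
        GROUP BY substr(event_timestamp, 1, 7)
        ORDER BY month
    """)
    new_editors.show()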

  55. MediaWiki history

  56. Data Sources
    Scale

  57. Webrequests
    Unlike MediaWiki data,
    webrequest data is huge, but
    dimensionally simple (url, http
    status, user agent, etc.)

  58. Webrequests
    We want to know
    - How many HTTP requests?
    - What browsers do people use?
    - How many of those requests are
    actual article page views?
    - How many of those are unique
    devices?

  59. ~150,000 HTTP requests per second
    Webrequests

  60. ~30 million pageviews per hour
    Webrequests

  61. ~150 million unique devices per day
    Webrequests

  62. Scaling
    issues: 2005 to
    2014

  63. Traffic can peak at > 200,000 requests per second.
    2014 World Cup
    Webrequests
    We lost data because:
    soccer!
    (the infamous Brazil vs Germany semi-final)

  64. Varnish
    Webrequests are handled by Varnish
    in multiple datacenters.
    Shared memory log
    varnishlog apps can read Varnish’s logs from
    shared memory.
    Varnishncsa
    A varnishlog -> stdout formatter.
    Wikimedia patched this to send logs over UDP.
    Webrequests

  65. udp2log
    Webrequests

  66. udp2log
    Listens to a UDP traffic stream.
    Delimits messages by newlines.
    Tees out and samples traffic to custom filters.
    multicast relay
    socat relay sends varnishncsa traffic to
    a multicast group, allowing for multiple
    udp2log consumers.
    Webrequests
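
    A toy sketch of the udp2log idea, assuming newline-delimited log lines arriving
    over UDP (the real tool is a C program; the port and filter are made up):

    import random
    import socket

    SAMPLE = 100  # keep roughly 1 in 100 lines
    FILTERS = [
        lambda line: line if " /wiki/" in line else None,  # e.g. a pageview filter
    ]

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 8420))  # hypothetical port

    while True:
        datagram, _addr = sock.recvfrom(65535)
        for line in datagram.decode("utf-8", errors="replace").splitlines():
            if random.randrange(SAMPLE) != 0:
                continue              # sampling
            for f in FILTERS:         # tee each surviving line to every filter
                out = f(line)
                if out is not None:
                    print(out)        # the real udp2log pipes to filter processes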

  67. works great but
    ...
    Webrequests
    udp2log

  68. doesn’t scale - every
    udp2log instance must see every
    network packet.
    Works for simple use
    cases and lower traffic
    scenarios.
    Webrequests - udp2log

  69. What we want:
    all requests saved for easy
    and fast analysis.
    Webrequests

  70. scalable log
    transport
    Webrequests

  71. (image-only slide)

  72. Scaling
    issues under
    control: Kafka and
    Hadoop since 2014

  73. Webrequests

  74. Webrequests
    See that FutureDB?
    This diagram was made in
    2014.
    Our FutureDB today is
    Druid!

  75. (image-only slide)

  76. Why?
    Druid

  77. Why?
    Druid
    Can’t use proprietary software
    (No Tableau)

  78. Why?
    Druid
    Can’t use proprietary software
    (No Tableau)
    Druid works with some really nice GUIs
    (Pivot, Superset, etc.), which provide a lot of
    value for us.

  79. Why?
    Druid
    Can’t use proprietary software
    (No Tableau)
    Druid works with some really nice GUIs
    (Pivot, Superset, etc.), which provide a lot of
    value for us.
    and we get a fast scalable OLAP database,
    with realtime support

  80. Ingestion
    Druid
    Batch from Hadoop
    Streaming from Kafka with Spark
    and Tranquility

  81. Ingestion
    Druid
    Batch
    Data is partially pre-aggregated in
    Hadoop into useful Hive tables
    (e.g. MediaWiki history, Pageview
    counts, etc.)
    This is then regularly loaded into
    Druid using custom tooling and
    Oozie
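
    A trimmed sketch of what such a batch load looks like: an index_hadoop task
    spec posted to the Druid overlord (dataSource, paths, and fields are
    illustrative; the exact spec layout varies by Druid version):

    import json
    import requests

    ingestion_spec = {
        "type": "index_hadoop",
        "spec": {
            "dataSchema": {
                "dataSource": "pageviews_hourly",
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",
                    "queryGranularity": "hour",
                    "intervals": ["2017-09-01/2017-09-02"],
                },
                "parser": {
                    "type": "hadoopyString",
                    "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "dt", "format": "iso"},
                        "dimensionsSpec": {
                            "dimensions": ["project", "access_method", "agent_type"],
                        },
                    },
                },
                "metricsSpec": [
                    {"type": "longSum", "name": "view_count", "fieldName": "view_count"},
                ],
            },
            "ioConfig": {
                "type": "hadoop",
                "inputSpec": {
                    "type": "static",
                    "paths": "hdfs:///wmf/data/pageviews/2017-09-01",
                },
            },
        },
    }

    # The Oozie-driven tooling essentially does this on a schedule.
    requests.post(
        "http://druid-overlord:8090/druid/indexer/v1/task",
        data=json.dumps(ingestion_spec),
        headers={"Content-Type": "application/json"},
    )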

  82. Ingestion
    Druid
    Streaming
    Spark Streaming job consumes
    from Kafka and crunches
    dimensions and metrics.
    Spark-Tranquility used to beam
    directly to Druid for indexing (WIP)
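
    For the streaming side, a rough Structured Streaming equivalent of "consume
    from Kafka and crunch dimensions and metrics" (broker, topic, and fields are
    made up; the hand-off to Druid itself is left out):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("webrequest-stream-sketch").getOrCreate()

    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", "kafka1001:9092")
           .option("subscribe", "webrequest_text")
           .load())

    # Parse the JSON payload and aggregate into dimensions plus a count metric.
    parsed = F.from_json(F.col("value").cast("string"),
                         "dt STRING, uri_host STRING, http_status STRING")
    counts = (raw.select(parsed.alias("r")).select("r.*")
              .withColumn("ts", F.to_timestamp("dt"))
              .groupBy(F.window("ts", "1 minute"), "uri_host", "http_status")
              .count())

    # In production the per-batch results would be beamed to Druid (Tranquility)
    # or written back to Kafka for Druid to index.
    query = counts.writeStream.outputMode("update").format("console").start()
    query.awaitTermination()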

  83. Druid
    July 2016 - Issue T141506 is submitted:
    “Suddenly outrageous higher
    pageviews for main pages”

  84. Druid
    http://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2016-07-01&end=2016-08-01&pages=Main_Page

  85. Druid
    Suddenly outrageous higher
    pageviews for main pages
    Lots of slicing and dicing in Hive,
    but we didn’t see a trend.

  86. Druid
    Suddenly outrageous higher
    pageviews for main pages
    Suspicion about browsers.
    Used Druid/Pivot to slice
    Lots of slicing and dicing in Hive,
    but we didn’t see a trend.

  87. Druid. Traffic Spike Investigation
    Definitely something weird happening here

  88. Is it Chrome?
    Looks like mostly Chrome

  89. Answer: Chrome 41 on Windows 7
    Chrome 51 users,
    upgrading to 52, cool
    And there’s our problem,
    definitely Chrome 41

  90. Let’s dig into the
    technology a little bit
    more...

  91. OLAP is about
    splitting apart a large
    dataset into smaller
    subsets and
    analyzing the subsets
    Aggregations are
    often involved
    (counts, sums,
    quantiles, etc)
    The OLAP Problem

  92. A/B Testing Marketing
    Networking Troubleshooting
    OLAP Use Cases

  93. High dimensionality: 10s, 100s, or even 1000s of
    columns
    High cardinality: each column may have millions of
    unique values
    Multi-tenancy: many concurrent users
    Freshness: load from streams, not batch files
    OLAP at scale is hard

  94. How Druid tackles OLAP
    Column storage format
    Compression & encoding
    Intelligent search filters
    Approximate algorithms
    Distributed architecture

  95. Storing data
    timestamp domain gender clicked
    2011-01-01T00:01:35Z bieber.com Female 1
    2011-01-01T00:03:03Z bieber.com Female 0
    2011-01-01T00:04:51Z ultra.com Male 1
    2011-01-01T00:05:33Z ultra.com Male 1
    2011-01-01T00:05:53Z ultra.com Female 0
    2011-01-01T00:06:17Z ultra.com Female 1
    2011-01-01T00:23:15Z bieber.com Female 0
    2011-01-01T00:38:51Z ultra.com Male 1
    2011-01-01T00:49:33Z bieber.com Female 1
    2011-01-01T00:49:53Z ultra.com Female 0

  96. timestamp domain gender clicked
    2011-01-01T00:00:00Z bieber.com Female 1
    2011-01-01T00:00:00Z ultra.com Female 2
    2011-01-01T00:00:00Z ultra.com Male 3
    timestamp domain gender clicked
    2011-01-01T00:01:35Z bieber.com Female 1
    2011-01-01T00:03:03Z bieber.com Female 0
    2011-01-01T00:04:51Z ultra.com Male 1
    2011-01-01T00:05:33Z ultra.com Male 1
    2011-01-01T00:05:53Z ultra.com Female 0
    2011-01-01T00:06:17Z ultra.com Female 1
    2011-01-01T00:23:15Z bieber.com Female 0
    2011-01-01T00:38:51Z ultra.com Male 1
    2011-01-01T00:49:33Z bieber.com Female 1
    2011-01-01T00:49:53Z ultra.com Female 0
    Rollup
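
    Rollup is just a group-by at ingestion time. A pandas sketch of the idea
    (rows here are illustrative, not tied to the exact table above):

    import pandas as pd

    raw = pd.DataFrame({
        "timestamp": pd.to_datetime([
            "2011-01-01T00:01:35Z", "2011-01-01T00:03:03Z",
            "2011-01-01T00:04:51Z", "2011-01-01T00:05:33Z",
        ]),
        "domain": ["bieber.com", "bieber.com", "ultra.com", "ultra.com"],
        "gender": ["Female", "Female", "Male", "Male"],
        "clicked": [1, 0, 1, 1],
    })

    # Truncate the timestamp to the hour and sum the metric per dimension combo.
    rolled = (raw.assign(timestamp=raw["timestamp"].dt.floor("H"))
                 .groupby(["timestamp", "domain", "gender"], as_index=False)["clicked"]
                 .sum())
    print(rolled)  # one row per (hour, domain, gender): far fewer rows to store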

  97. How does Druid work?
    Partitioning data

  98. Segments
    Fundamental storage unit in Druid
    No contention between reads and writes
    One thread scans one segment

  99. Create IDs
    - Justin Bieber -> 0, Ke$ha -> 1
    Store
    - page → [0 0 0 1 1 1]
    - language → [0 0 0 0 0 0]
    Compression
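
    A toy version of that dictionary encoding, in plain Python:

    values = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
              "Ke$ha", "Ke$ha", "Ke$ha"]

    # Create IDs: each distinct string gets a small integer.
    dictionary = {}
    encoded = []
    for v in values:
        encoded.append(dictionary.setdefault(v, len(dictionary)))

    print(dictionary)  # {'Justin Bieber': 0, 'Ke$ha': 1}
    print(encoded)     # [0, 0, 0, 1, 1, 1] -- small ints compress far better than strings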

  100. Justin Bieber → [0, 1, 2] → [111000]
    Ke$ha → [3, 4, 5] → [000111]
    Justin Bieber OR Ke$ha → [111111]
    Compression!
    Fast filters
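
    And the bitmap-index idea behind those fast filters, using Python ints as
    bitmaps (Druid itself uses compressed bitmaps such as Roaring or Concise):

    rows = ["Justin Bieber", "Justin Bieber", "Justin Bieber",
            "Ke$ha", "Ke$ha", "Ke$ha"]

    # One bitmap per dictionary value: bit i is set if row i holds that value.
    bitmaps = {}
    for i, v in enumerate(rows):
        bitmaps[v] = bitmaps.get(v, 0) | (1 << i)

    either = bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]  # OR filter = bitwise OR
    matching = [i for i in range(len(rows)) if (either >> i) & 1]
    print(matching)  # [0, 1, 2, 3, 4, 5] -- every row matches the OR filter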

  101. timestamp domain gender clicked
    2011-01-01T00:00:00Z bieber.com Female 1
    2011-01-01T00:00:00Z ultra.com Female 2
    2011-01-01T00:00:00Z ultra.com Male 3
    timestamp domain gender clicked
    2011-01-01T00:01:35Z bieber.com Female 1
    2011-01-01T00:03:03Z bieber.com Female 0
    2011-01-01T00:04:51Z ultra.com Male 1
    2011-01-01T00:05:33Z ultra.com Male 1
    2011-01-01T00:05:53Z ultra.com Female 0
    2011-01-01T00:06:17Z ultra.com Female 1
    2011-01-01T00:23:15Z bieber.com Female 0
    2011-01-01T00:38:51Z ultra.com Male 1
    2011-01-01T00:49:33Z bieber.com Female 1
    2011-01-01T00:49:53Z ultra.com Female 0
    Approximation algorithms

  102. Approximation algorithms
    timestamp domain user gender clicked
    2011-01-01T00:01:35Z bieber.com 4312345532 Female 1
    2011-01-01T00:03:03Z bieber.com 3484920241 Female 0
    2011-01-01T00:04:51Z ultra.com 9530174728 Male 1
    2011-01-01T00:05:33Z ultra.com 4098310573 Male 1
    2011-01-01T00:05:53Z ultra.com 5832058870 Female 0
    2011-01-01T00:06:17Z ultra.com 5789283478 Female 1
    2011-01-01T00:23:15Z bieber.com 4730093842 Female 0
    2011-01-01T00:38:51Z ultra.com 9530174728 Male 1
    2011-01-01T00:49:33Z bieber.com 4930097162 Female 1
    2011-01-01T00:49:53Z ultra.com 3081837193 Female 0

  103. timestamp domain gender clicked unique_users
    2011-01-01T00:00:00Z bieber.com Female 1 [approximate_sketch]
    2011-01-01T00:00:00Z ultra.com Female 2 [approximate_sketch]
    2011-01-01T00:00:00Z ultra.com Male 3 [approximate_sketch]
    timestamp domain gender clicked
    2011-01-01T00:01:35Z bieber.com Female 1
    2011-01-01T00:03:03Z bieber.com Female 0
    2011-01-01T00:04:51Z ultra.com Male 1
    2011-01-01T00:05:33Z ultra.com Male 1
    2011-01-01T00:05:53Z ultra.com Female 0
    2011-01-01T00:06:17Z ultra.com Female 1
    2011-01-01T00:23:15Z bieber.com Female 0
    2011-01-01T00:38:51Z ultra.com Male 1
    2011-01-01T00:49:33Z bieber.com Female 1
    2011-01-01T00:49:53Z ultra.com Female 0
    Approximation algorithms
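
    The unique_users column stores a small sketch rather than the raw user IDs.
    A toy HyperLogLog-style counter shows the flavor (simplified; nothing like
    Druid's production sketches):

    import hashlib

    M = 256  # number of registers

    def add(registers, value):
        h = int(hashlib.md5(str(value).encode()).hexdigest(), 16)
        idx, rest = h % M, h // M
        rank = 1
        while rest % 2 == 0 and rank < 64:  # position of the lowest set bit
            rank += 1
            rest //= 2
        registers[idx] = max(registers[idx], rank)

    def estimate(registers):
        alpha = 0.7213 / (1 + 1.079 / M)
        return int(alpha * M * M / sum(2.0 ** -r for r in registers))

    registers = [0] * M
    for user_id in range(100000):
        add(registers, user_id)
    print(estimate(registers))  # roughly 100000, from only 256 tiny registers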

  104. Architecture
    Streams and files are ingested by Indexers;
    completed segments are handed off to Historicals.

  105. Architecture
    Streams and files are ingested by Indexers and
    handed off to Historicals; Brokers receive
    queries and fan them out across both,
    merging the results.

  106. Scale
    (largest cluster I know about – not a hard limit)
    >3M events/sec ingested (peak)
    10 – 100k events/sec/cpu
    >500 TB of segments (>50 trillion raw events, >50
    PB raw data)
    mean 500ms query time
    90%ile < 1s, 95%ile < 5s, 99%ile < 10s

  107. Integration
    Apps emit events onto a message bus; stream and
    batch processors consume them and load Druid.

  108. Works well with:
    - Kafka
    - Hadoop
    - S3
    - Spark Streaming
    - Storm
    - Samza
    - I bet some other things that also start with S
    Connect

  109. http://druid.io/downloads.html
    https://imply.io/get-started

  110. Questions?
    https://xkcd.com/285
