Slide 1

Slide 1 text

Analytics at Wikipedia

Slide 2

Slide 2 text

No content

Slide 3

Slide 3 text

“Imagine a world in which every single human being can freely share in the sum of all knowledge.”

Slide 4

Slide 4 text

Andrew Otto, Senior Systems Engineer. I have been with Wikimedia since 2012, mostly working on infrastructure for analytics. Previously Lead SysAdmin at CouchSurfing.org. http://www.mediawiki.org/wiki/User:Ottomata

Slide 5

Slide 5 text

Fangjin Yang, Druid Engineer. One of the original authors behind Druid. Co-founder @ Imply.

Slide 6

Slide 6 text

This is an overview of Wikimedia data sources, their history and scale.

Slide 7

Slide 7 text

No content

Slide 8

Slide 8 text

Data Sources

Slide 9

Slide 9 text

Data Sources: 2 Technical Challenges

Slide 10

Slide 10 text

Data Sources: 2 Technical Challenges - Scale and Complexity

Slide 11

Slide 11 text

Data Sources: 1 Philosophical Stand

Slide 12

Slide 12 text

Data Sources: 1 Philosophical Stand - Privacy

Slide 13

Slide 13 text

Wikipedia is ranked the 6th largest website when it comes to pageviews [Alexa]. Scale is an obvious issue ...

Slide 14

Slide 14 text

Complexity is less obvious. MediaWiki is not designed for analytics. It is hard to know, for example, which pages were created in 2016 on the German Wikipedia by users with more than 5 edits.

Slide 15

Slide 15 text

Privacy We do not send data to third parties. Commercial solutions to gather user data are not an option.

Slide 16

Slide 16 text

Data Sources Privacy

Slide 17

Slide 17 text

Event Logs

Slide 18

Slide 18 text

Event Logs: EventLogging captures rich events of JSON-schema'd data. Events are validated and sent off to various data stores.
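
To make the validation step concrete, here is a minimal Python sketch (not the actual EventLogging code) of validating an incoming event against a JSON schema before forwarding it, using the jsonschema library. The schema and field names are invented for illustration.

    import json
    import jsonschema  # third-party library: pip install jsonschema

    # Hypothetical "PageInteraction" schema, not a real Wikimedia schema.
    PAGE_INTERACTION_SCHEMA = {
        "type": "object",
        "properties": {
            "wiki": {"type": "string"},
            "action": {"type": "string", "enum": ["click", "hover", "scroll"]},
            "timestamp": {"type": "string"},
        },
        "required": ["wiki", "action", "timestamp"],
    }

    def handle_raw_event(raw):
        """Parse and validate one event before handing it to downstream stores."""
        event = json.loads(raw)
        jsonschema.validate(event, PAGE_INTERACTION_SCHEMA)  # raises ValidationError on bad events
        # The real pipeline would produce the event to Kafka / storage here.
        return event

    print(handle_raw_event('{"wiki": "enwiki", "action": "click", "timestamp": "2017-01-01T00:00:00Z"}'))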

Slide 19

Slide 19 text

Event Logs

Slide 20

Slide 20 text

Event Logs: ~300 events per second (for now!), ~30 active schemas.

Slide 21

Slide 21 text

Balance between transparency and privacy

Slide 22

Slide 22 text

Event Logs: must host all data in our datacenters (no Google Analytics).

Slide 23

Slide 23 text

Event Logs: must host all data in our datacenters. Store as little private data as possible:
- No IPs
- Only sanitized user agents
- Historical data (older than 90 days) must be fully cleansed of sensitive information
- Should not be able to reconstruct browsing sessions
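
As an illustration of the 90-day rule, here is a rough Python sketch of the kind of sanitization pass such a policy implies. The field names and the concrete sanitization choices are assumptions for illustration, not the actual Wikimedia purging job.

    from datetime import datetime, timedelta, timezone

    RETENTION = timedelta(days=90)

    def coarse_user_agent(ua):
        """Keep only a browser family, dropping version and OS details (toy parser)."""
        for family in ("Chrome", "Firefox", "Safari"):
            if family in ua:
                return family
        return "Other"

    def sanitize(event, now=None):
        """Return a copy of an event record with sensitive fields removed.

        Events older than the retention window are fully cleansed; recent events
        still lose raw IPs and keep only a coarse user agent (illustrative policy).
        """
        now = now or datetime.now(timezone.utc)
        out = dict(event)
        out.pop("ip", None)                                        # never keep raw IPs
        out["user_agent"] = coarse_user_agent(out.get("user_agent", ""))
        age = now - datetime.fromisoformat(out["timestamp"])
        if age > RETENTION:
            # Fully cleanse: drop anything that could identify a browsing session.
            for field in ("session_id", "user_agent", "city"):
                out.pop(field, None)
        return out

    print(sanitize({
        "timestamp": "2017-01-01T00:00:00+00:00",
        "ip": "203.0.113.7",
        "user_agent": "Mozilla/5.0 ... Chrome/56.0",
        "session_id": "abc123",
    }))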

Slide 24

Slide 24 text

Event Logs

Slide 25

Slide 25 text

Event Logs This part is holding us back! ...Druid?

Slide 26

Slide 26 text

Event Logs. Future work: teach folks to design event schemas that fit an OLAP data model, i.e. dimensions and metrics.

Slide 27

Slide 27 text

Event Logs. Future work: teach folks to design event schemas that fit an OLAP data model, i.e. dimensions and metrics. Auto-ingest these into Druid to allow use with nice GUIs (Pivot / Superset).
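
A sketch of what "auto ingest" could look like: derive a Druid batch ingestion spec from an event schema whose fields are already labeled as dimensions or metrics. The annotated schema layout, datasource name, and HDFS path are invented, and the spec is simplified; a real spec would be submitted to Druid's indexing service.

    import json

    # Hypothetical event schema annotated with OLAP roles (assumption, not EventLogging's format).
    SCHEMA = {
        "name": "PageInteraction",
        "fields": [
            {"name": "wiki",   "role": "dimension"},
            {"name": "action", "role": "dimension"},
            {"name": "count",  "role": "metric", "agg": "longSum"},
        ],
    }

    def druid_ingestion_spec(schema, hdfs_path):
        """Build a (simplified) Druid Hadoop batch ingestion spec from an annotated schema."""
        dimensions = [f["name"] for f in schema["fields"] if f["role"] == "dimension"]
        metrics = [{"type": f["agg"], "name": f["name"], "fieldName": f["name"]}
                   for f in schema["fields"] if f["role"] == "metric"]
        return {
            "type": "index_hadoop",
            "spec": {
                "dataSchema": {
                    "dataSource": schema["name"],
                    "parser": {"type": "hadoopyString", "parseSpec": {
                        "format": "json",
                        "timestampSpec": {"column": "timestamp", "format": "iso"},
                        "dimensionsSpec": {"dimensions": dimensions},
                    }},
                    "metricsSpec": metrics,
                    "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "hour"},
                },
                "ioConfig": {"type": "hadoop", "inputSpec": {"type": "static", "paths": hdfs_path}},
            },
        }

    print(json.dumps(druid_ingestion_spec(SCHEMA, "hdfs:///events/PageInteraction/2017-01-01"), indent=2))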

Slide 28

Slide 28 text

Data Sources Complexity

Slide 29

Slide 29 text

MediaWiki Databases

Slide 30

Slide 30 text

MediaWiki databases MediaWiki metadata is not huge, but analytics of edit history since the beginning of wiki time is dimensionally complex.

Slide 31

Slide 31 text

~1.5 million article revisions per day MediaWiki databases

Slide 32

Slide 32 text

MediaWiki databases. We want to learn about users and edit history:
- How many new users over time?
- How many edits over time?
- How many edits do new users make?
- How do editors stay active?
- How many edits are made by bots?
- Etc. etc. etc.

Slide 33

Slide 33 text

Simple questions might be hard to answer

Slide 34

Slide 34 text

MediaWiki databases. How many new editors across all wikis? 2 joins, 3 subqueries, 3 tables, on 800+ wikis. 5 days to run.
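
To give a feel for why this is slow, a hedged sketch of the shape of such a job: loop over every wiki's database and run a multi-join query, then combine the results. The database names, table schema, and the query itself are simplified stand-ins, not the actual query the team ran.

    import pymysql  # assumption: MySQL replica access via pymysql

    WIKI_DBS = ["enwiki", "dewiki", "frwiki"]  # ... in reality 800+ databases

    NEW_EDITORS_SQL = """
    SELECT COUNT(*) FROM user u
    JOIN revision r ON r.rev_user = u.user_id            -- join 1
    JOIN page p     ON p.page_id  = r.rev_page           -- join 2
    WHERE u.user_registration >= %s
      AND r.rev_timestamp = (SELECT MIN(rev_timestamp)   -- subquery: user's first edit
                             FROM revision WHERE rev_user = u.user_id)
    """

    def new_editors_everywhere(since):
        total = 0
        for db in WIKI_DBS:  # one heavy query per wiki: this loop is what takes days
            conn = pymysql.connect(host="db-replica", database=db)
            with conn.cursor() as cur:
                cur.execute(NEW_EDITORS_SQL, (since,))
                total += cur.fetchone()[0]
            conn.close()
        return total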

Slide 35

Slide 35 text

MediaWiki databases. MySQL (OLTP) works fine as application storage

Slide 36

Slide 36 text

MediaWiki databases. MySQL (OLTP) works fine as application storage, but not so great for historical analytics.

Slide 37

Slide 37 text

MediaWiki databases. MySQL (OLTP) works fine as application storage, but not so great for historical analytics. You can only query the current state.

Slide 38

Slide 38 text

MediaWiki databases. However... historical events are saved in a ‘log’ table. The log has juicy information about:

Slide 39

Slide 39 text

MediaWiki databases. However... historical events are saved in a ‘log’ table. The log has juicy information about: page (AKA article) events (move, delete and restore) and user events (rename, block, rights), both of which go back to almost the beginning of wiki-time.

Slide 40

Slide 40 text

MediaWiki databases However...HOWEVER!

Slide 41

Slide 41 text

MediaWiki databases However...HOWEVER! MediaWiki is an old(ish) popular open source project.

Slide 42

Slide 42 text

MediaWiki databases However...HOWEVER! MediaWiki is an old(ish) popular open source project. Super cool! But this also leads us to the land of database design insanity!

Slide 43

Slide 43 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events!

Slide 44

Slide 44 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name.

Slide 45

Slide 45 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time!

Slide 46

Slide 46 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time! Data is stored in different formats depending on the MediaWiki version that created the log record.

Slide 47

Slide 47 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time! Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs!

Slide 48

Slide 48 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time! Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs! Page and user properties might be in different fields depending on the age of the event!

Slide 49

Slide 49 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time! Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs! Page and user properties might be in different fields depending on the age of the event! Durations of user blocks are stored in 7 different date formats!

Slide 50

Slide 50 text

MediaWiki databases. However...HOWEVER! Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time! Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs! Page and user properties might be in different fields depending on the age of the event! Durations of user blocks are stored in 7 different date formats! Some revisions are historically orphaned (their page IDs don't exist!)

Slide 51

Slide 51 text

MediaWiki databases. However...HOWEVER!
- Primary IDs are missing for most events! Instead they have page title or user name. Pages and users are renamed all the time!
- Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs!
- Page and user properties might be in different fields depending on the age of the event!
- Durations of user blocks are stored in 7 different date formats!
- Some revisions are historically orphaned (their page IDs don't exist!)
- Some page IDs were reused for different pages! (I guess IDs were a scarce commodity in 2006?)

Slide 52

Slide 52 text

MediaWiki databases Hooooboy, we have fun, don’t we?!

Slide 53

Slide 53 text

MediaWiki history. Solution: history reconstruction. A complex series of Spark jobs that iterate over imported snapshots of all MediaWiki databases. Final output: a large, denormalized history Hive table of all MediaWiki events across all Wikimedia sites.
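
A highly simplified PySpark sketch of the denormalization idea (not the actual reconstruction jobs, which also have to repair titles, reused IDs, and format drift): join imported revision, page, and user snapshots into one wide history table. Table and column names are illustrative.

    from pyspark.sql import SparkSession

    # Assumes the MediaWiki tables were already imported into Hive (e.g. via Sqoop)
    # as wmf_raw.revision / wmf_raw.page / wmf_raw.user.
    spark = SparkSession.builder.appName("mediawiki-history-sketch").enableHiveSupport().getOrCreate()

    rev = spark.table("wmf_raw.revision")
    page = spark.table("wmf_raw.page")
    user = spark.table("wmf_raw.user")

    history = (
        rev.join(page, rev.rev_page == page.page_id, "left")
           .join(user, rev.rev_user == user.user_id, "left")
           .select(
               rev.wiki_db,
               rev.rev_timestamp.alias("event_timestamp"),
               page.page_title,
               user.user_name.alias("event_user_text"),
               rev.rev_id,
           )
    )

    # One big denormalized table that downstream queries can hit without joins.
    history.write.mode("overwrite").saveAsTable("wmf.mediawiki_history_sketch")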

Slide 54

Slide 54 text

MediaWiki history. How many new editors across all wikis? 0 joins, 0 subqueries, 1 table. 5 minutes to run.
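
With the denormalized table, the same question becomes a single scan. A hedged example (the table and column names loosely follow Wikimedia's public mediawiki_history dataset, but treat them as approximate):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # One table, no joins, no subqueries: count new editors per month across all wikis.
    new_editors = spark.sql("""
        SELECT substr(event_timestamp, 1, 7) AS month,
               COUNT(*)                      AS new_editors
        FROM   wmf.mediawiki_history
        WHERE  event_entity = 'user'
          AND  event_type   = 'create'
        GROUP BY substr(event_timestamp, 1, 7)
        ORDER BY month
    """)
    new_editors.show()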

Slide 55

Slide 55 text

MediaWiki history

Slide 56

Slide 56 text

Data Sources Scale

Slide 57

Slide 57 text

Webrequests. Unlike MediaWiki data, webrequest data is huge, but dimensionally simple (URL, HTTP status, user agent, etc.).

Slide 58

Slide 58 text

Webrequests. We want to know:
- How many HTTP requests?
- What browsers do people use?
- How many of those requests are actual article page views?
- How many of those are unique devices?

Slide 59

Slide 59 text

~150,000 HTTP requests per second Webrequests

Slide 60

Slide 60 text

~30 million pageviews per hour Webrequests

Slide 61

Slide 61 text

~150 million unique devices per day Webrequests

Slide 62

Slide 62 text

Scaling issues: 2005 to 2014

Slide 63

Slide 63 text

Webrequests. Can max out at > 200,000 requests per second (2014 World Cup Final). We lost data because: soccer! (This is the infamous Brazil vs Germany game.)

Slide 64

Slide 64 text

Webrequests. Varnish: webrequests are handled by Varnish in multiple datacenters. Shared memory log: varnishlog apps can access Varnish’s logs in memory. varnishncsa: a varnishlog -> stdout formatter; Wikimedia patched this to send logs over UDP.

Slide 65

Slide 65 text

udp2log Webrequests

Slide 66

Slide 66 text

Webrequests. udp2log: listens to a UDP traffic stream, delimits messages by newlines, and tees out and samples traffic to custom filters. Multicast relay: a socat relay sends varnishncsa traffic to a multicast group, allowing for multiple udp2log consumers.
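
For flavor, a toy Python version of the udp2log idea, under the assumption that log lines arrive newline-delimited over UDP: read datagrams, split on newlines, sample 1-in-N, and tee to a filter process. The real udp2log is a C program driven by a config of filter pipelines; port and filter command here are placeholders.

    import socket
    import subprocess

    SAMPLE_EVERY = 1000          # keep 1 line in 1000, like udp2log's sampling factor
    FILTER_CMD = ["gzip", "-c"]  # stand-in for a custom filter pipeline

    def run(port=8420):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind(("0.0.0.0", port))
        filt = subprocess.Popen(FILTER_CMD, stdin=subprocess.PIPE,
                                stdout=open("sampled.gz", "wb"))
        seen = 0
        while True:
            datagram, _addr = sock.recvfrom(65535)
            for line in datagram.splitlines(True):   # messages are newline-delimited
                seen += 1
                if seen % SAMPLE_EVERY == 0:          # simple 1-in-N sampling
                    filt.stdin.write(line)

    if __name__ == "__main__":
        run()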

Slide 67

Slide 67 text

Webrequests. udp2log works great, but ...

Slide 68

Slide 68 text

Webrequests. udp2log doesn’t scale - every udp2log instance must see every network packet. It works for simple use cases and lower-traffic scenarios.

Slide 69

Slide 69 text

What we want: all requests saved for easy and fast analysis. Webrequests

Slide 70

Slide 70 text

scalable log transport Webrequests

Slide 71

Slide 71 text

No content

Slide 72

Slide 72 text

Scaling issues under control: Kafka and Hadoop since 2014

Slide 73

Slide 73 text

Webrequests

Slide 74

Slide 74 text

Webrequests See that FutureDB? This diagram was made in 2014. Our FutureDB today is Druid!

Slide 75

Slide 75 text

No content

Slide 76

Slide 76 text

Why Druid?

Slide 77

Slide 77 text

Why Druid? Can’t use proprietary software (no Tableau).

Slide 78

Slide 78 text

Why Druid? Can’t use proprietary software (no Tableau). Druid works with some really nice GUIs (Pivot, Superset, etc.), which provide a lot of value for us.

Slide 79

Slide 79 text

Why Druid? Can’t use proprietary software (no Tableau). Druid works with some really nice GUIs (Pivot, Superset, etc.), which provide a lot of value for us. And we get a fast, scalable OLAP database with realtime support.

Slide 80

Slide 80 text

Druid ingestion: batch from Hadoop; streaming from Kafka with Spark and Tranquility.

Slide 81

Slide 81 text

Druid ingestion, batch: data is partially pre-aggregated in Hadoop into useful Hive tables (e.g. MediaWiki history, pageview counts, etc.). This is then regularly loaded into Druid using custom tooling and Oozie.

Slide 82

Slide 82 text

Druid ingestion, streaming: a Spark Streaming job consumes from Kafka and crunches dimensions and metrics. Spark-Tranquility is used to beam events directly to Druid for indexing (WIP).
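
A rough sketch of the streaming half, assuming Spark Streaming's Kafka direct stream and JSON webrequest messages; topic, broker, and field names are placeholders, and the final "send to Druid" step is elided (the real job uses Tranquility for that last hop).

    import json
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="webrequest-druid-sketch")
    ssc = StreamingContext(sc, batchDuration=60)  # one batch per minute

    stream = KafkaUtils.createDirectStream(
        ssc, ["webrequest"], {"metadata.broker.list": "kafka1001:9092"})

    def to_dim_metric(pair):
        event = json.loads(pair[1])               # Kafka gives (key, value) pairs
        dims = (event.get("uri_host"), event.get("http_status"))
        return dims, 1                            # metric: request count

    counts = stream.map(to_dim_metric).reduceByKey(lambda a, b: a + b)

    # Placeholder: the real job would hand each batch to Tranquility for Druid indexing.
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()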

Slide 83

Slide 83 text

Druid July 2016 - Issue T141506 is submitted: “Suddenly outrageous higher pageviews for main pages”

Slide 84

Slide 84 text

Druid http://tools.wmflabs.org/pageviews/?project=en.wikipedia.org&platform=all-access&agent=user&start=2016-07-01&end=2016-08-01&pages=Main_Page

Slide 85

Slide 85 text

Druid. “Suddenly outrageous higher pageviews for main pages.” Lots of slicing and dicing in Hive, but we didn’t see a trend.

Slide 86

Slide 86 text

Druid. “Suddenly outrageous higher pageviews for main pages.” Lots of slicing and dicing in Hive, but we didn’t see a trend. Suspicion about browsers: used Druid/Pivot to slice.

Slide 87

Slide 87 text

Druid: traffic spike investigation. Definitely something weird happening here.

Slide 88

Slide 88 text

Is it Chrome? Looks like mostly Chrome

Slide 89

Slide 89 text

Chrome 51 users upgrading to 52: cool. And there’s our problem: definitely Chrome 41. Answer: Chrome 41 on Windows 7.

Slide 90

Slide 90 text

Let’s dig into the technology a little bit more...

Slide 91

Slide 91 text

The OLAP problem: OLAP is about splitting apart a large dataset into smaller subsets and analyzing those subsets. Aggregations are often involved (counts, sums, quantiles, etc.).

Slide 92

Slide 92 text

OLAP use cases: A/B testing, marketing, networking, troubleshooting.

Slide 93

Slide 93 text

OLAP at scale is hard:
- High dimensionality: 10s, 100s, or even 1000s of columns
- High cardinality: each column may have millions of unique values
- Multi-tenancy: many concurrent users
- Freshness: load from streams, not batch files

Slide 94

Slide 94 text

How Druid tackles OLAP:
- Column storage format
- Compression & encoding
- Intelligent search filters
- Approximate algorithms
- Distributed architecture

Slide 95

Slide 95 text

Storing data

timestamp              domain      gender  clicked
2011-01-01T00:01:35Z   bieber.com  Female  1
2011-01-01T00:03:03Z   bieber.com  Female  0
2011-01-01T00:04:51Z   ultra.com   Male    1
2011-01-01T00:05:33Z   ultra.com   Male    1
2011-01-01T00:05:53Z   ultra.com   Female  0
2011-01-01T00:06:17Z   ultra.com   Female  1
2011-01-01T00:23:15Z   bieber.com  Female  0
2011-01-01T00:38:51Z   ultra.com   Male    1
2011-01-01T00:49:33Z   bieber.com  Female  1
2011-01-01T00:49:53Z   ultra.com   Female  0

Slide 96

Slide 96 text

Rollup

Raw data:
timestamp              domain      gender  clicked
2011-01-01T00:01:35Z   bieber.com  Female  1
2011-01-01T00:03:03Z   bieber.com  Female  0
2011-01-01T00:04:51Z   ultra.com   Male    1
2011-01-01T00:05:33Z   ultra.com   Male    1
2011-01-01T00:05:53Z   ultra.com   Female  0
2011-01-01T00:06:17Z   ultra.com   Female  1
2011-01-01T00:23:15Z   bieber.com  Female  0
2011-01-01T00:38:51Z   ultra.com   Male    1
2011-01-01T00:49:33Z   bieber.com  Female  1
2011-01-01T00:49:53Z   ultra.com   Female  0

Rolled up:
timestamp              domain      gender  clicked
2011-01-01T00:00:00Z   bieber.com  Female  1
2011-01-01T00:00:00Z   ultra.com   Female  2
2011-01-01T00:00:00Z   ultra.com   Male    3
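
The same rollup in a few lines of Python, to make the operation explicit: truncate the timestamp to the query granularity, group by the dimensions, and aggregate the metrics. The slide's numbers are illustrative; here we simply sum clicks and count rows per group.

    from collections import defaultdict

    raw = [
        ("2011-01-01T00:01:35Z", "bieber.com", "Female", 1),
        ("2011-01-01T00:03:03Z", "bieber.com", "Female", 0),
        ("2011-01-01T00:04:51Z", "ultra.com",  "Male",   1),
        ("2011-01-01T00:05:33Z", "ultra.com",  "Male",   1),
        ("2011-01-01T00:05:53Z", "ultra.com",  "Female", 0),
        ("2011-01-01T00:06:17Z", "ultra.com",  "Female", 1),
        ("2011-01-01T00:23:15Z", "bieber.com", "Female", 0),
        ("2011-01-01T00:38:51Z", "ultra.com",  "Male",   1),
        ("2011-01-01T00:49:33Z", "bieber.com", "Female", 1),
        ("2011-01-01T00:49:53Z", "ultra.com",  "Female", 0),
    ]

    def rollup(rows):
        """Aggregate raw rows to (hour, domain, gender) with summed clicks and row counts."""
        agg = defaultdict(lambda: [0, 0])                 # key -> [clicked_sum, row_count]
        for ts, domain, gender, clicked in rows:
            hour = ts[:13] + ":00:00Z"                    # truncate to hour granularity
            agg[(hour, domain, gender)][0] += clicked
            agg[(hour, domain, gender)][1] += 1
        return agg

    for key, (clicks, n_rows) in sorted(rollup(raw).items()):
        print(key, "clicked_sum =", clicks, "rows =", n_rows)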

Slide 97

Slide 97 text

How does Druid work? Partitioning data

Slide 98

Slide 98 text

Segments:
- Fundamental storage unit in Druid
- No contention between reads and writes
- One thread scans one segment

Slide 99

Slide 99 text

Compression
Create IDs:
- Justin Bieber -> 0, Ke$ha -> 1
Store:
- page → [0 0 0 1 1 1]
- language → [0 0 0 0 0 0]

Slide 100

Slide 100 text

Fast filters
- Justin Bieber → rows [0, 1, 2] → bitmap [111000]
- Ke$ha → rows [3, 4, 5] → bitmap [000111]
- Justin Bieber OR Ke$ha → [111111]
Compression!
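
A tiny Python sketch of these two ideas together: dictionary-encode a column's values to integer IDs, build a bitmap (one bit per row) per value, and answer OR filters with a bitwise OR. Real Druid uses compressed bitmaps (Concise/Roaring), which this toy code ignores.

    rows = ["Justin Bieber", "Justin Bieber", "Justin Bieber", "Ke$ha", "Ke$ha", "Ke$ha"]

    # Dictionary encoding: value -> small integer ID, column stored as IDs.
    ids = {}
    column = [ids.setdefault(v, len(ids)) for v in rows]       # [0, 0, 0, 1, 1, 1]

    # One bitmap per value: bit i is set if row i holds that value.
    bitmaps = {v: 0 for v in ids}
    for row_index, value in enumerate(rows):
        bitmaps[value] |= 1 << row_index

    def as_bits(bm, n=len(rows)):
        return "".join("1" if bm >> i & 1 else "0" for i in range(n))

    print(column)                                                # [0, 0, 0, 1, 1, 1]
    print(as_bits(bitmaps["Justin Bieber"]))                     # 111000
    print(as_bits(bitmaps["Ke$ha"]))                             # 000111
    print(as_bits(bitmaps["Justin Bieber"] | bitmaps["Ke$ha"]))  # 111111 -- the OR filter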

Slide 101

Slide 101 text

Approximation algorithms

Raw data:
timestamp              domain      gender  clicked
2011-01-01T00:01:35Z   bieber.com  Female  1
2011-01-01T00:03:03Z   bieber.com  Female  0
2011-01-01T00:04:51Z   ultra.com   Male    1
2011-01-01T00:05:33Z   ultra.com   Male    1
2011-01-01T00:05:53Z   ultra.com   Female  0
2011-01-01T00:06:17Z   ultra.com   Female  1
2011-01-01T00:23:15Z   bieber.com  Female  0
2011-01-01T00:38:51Z   ultra.com   Male    1
2011-01-01T00:49:33Z   bieber.com  Female  1
2011-01-01T00:49:53Z   ultra.com   Female  0

Rolled up:
timestamp              domain      gender  clicked
2011-01-01T00:00:00Z   bieber.com  Female  1
2011-01-01T00:00:00Z   ultra.com   Female  2
2011-01-01T00:00:00Z   ultra.com   Male    3

Slide 102

Slide 102 text

Approximation algorithms

timestamp              domain      user        gender  clicked
2011-01-01T00:01:35Z   bieber.com  4312345532  Female  1
2011-01-01T00:03:03Z   bieber.com  3484920241  Female  0
2011-01-01T00:04:51Z   ultra.com   9530174728  Male    1
2011-01-01T00:05:33Z   ultra.com   4098310573  Male    1
2011-01-01T00:05:53Z   ultra.com   5832058870  Female  0
2011-01-01T00:06:17Z   ultra.com   5789283478  Female  1
2011-01-01T00:23:15Z   bieber.com  4730093842  Female  0
2011-01-01T00:38:51Z   ultra.com   9530174728  Male    1
2011-01-01T00:49:33Z   bieber.com  4930097162  Female  1
2011-01-01T00:49:53Z   ultra.com   3081837193  Female  0

Slide 103

Slide 103 text

Approximation algorithms

Raw data:
timestamp              domain      gender  clicked
2011-01-01T00:01:35Z   bieber.com  Female  1
2011-01-01T00:03:03Z   bieber.com  Female  0
2011-01-01T00:04:51Z   ultra.com   Male    1
2011-01-01T00:05:33Z   ultra.com   Male    1
2011-01-01T00:05:53Z   ultra.com   Female  0
2011-01-01T00:06:17Z   ultra.com   Female  1
2011-01-01T00:23:15Z   bieber.com  Female  0
2011-01-01T00:38:51Z   ultra.com   Male    1
2011-01-01T00:49:33Z   bieber.com  Female  1
2011-01-01T00:49:53Z   ultra.com   Female  0

Rolled up, with an approximate unique-users sketch per row:
timestamp              domain      gender  clicked  unique_users
2011-01-01T00:00:00Z   bieber.com  Female  1        [approximate_sketch]
2011-01-01T00:00:00Z   ultra.com   Female  2        [approximate_sketch]
2011-01-01T00:00:00Z   ultra.com   Male    3        [approximate_sketch]
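
A toy illustration of the "approximate sketch" column: a K-Minimum-Values sketch that keeps only the k smallest hashed user IDs per row group, estimates distinct users from them, and can be merged across rows, unlike a raw COUNT(DISTINCT). Druid actually uses HyperLogLog / DataSketches; this is just the shape of the idea.

    import hashlib

    K = 64  # sketch size: more values -> better accuracy, more memory

    def h(value):
        """Hash a value to a float in [0, 1)."""
        digest = hashlib.sha1(str(value).encode()).hexdigest()
        return int(digest[:12], 16) / float(16 ** 12)

    class KMVSketch:
        """Keep the K smallest hash values seen; estimate distinct count from the K-th."""
        def __init__(self):
            self.values = []                      # sorted, at most K smallest hashes

        def add(self, item):
            x = h(item)
            if x not in self.values:
                self.values = sorted(self.values + [x])[:K]

        def merge(self, other):
            merged = KMVSketch()
            merged.values = sorted(set(self.values) | set(other.values))[:K]
            return merged

        def estimate(self):
            if len(self.values) < K:
                return len(self.values)           # saw fewer than K distinct items: exact
            return int((K - 1) / self.values[K - 1])

    # Per (domain, gender) group we would keep one sketch instead of raw user IDs:
    sketch = KMVSketch()
    for user_id in [4312345532, 3484920241, 4730093842, 4930097162]:
        sketch.add(user_id)
    print(sketch.estimate())   # 4 (exact here; approximate once groups get large)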

Slide 104

Slide 104 text

Architecture (diagram): Files and Streams are ingested by Indexers; data is handed off to Historicals.

Slide 105

Slide 105 text

Architecture (diagram): Files and Streams are ingested by Indexers; data is handed off to Historicals; Brokers receive Queries and fan them out across the cluster.

Slide 106

Slide 106 text

Scale (largest cluster I know about – not a hard limit):
- >3M events/sec ingested (peak)
- 10 – 100k events/sec/cpu
- >500 TB of segments (>50 trillion raw events, >50 PB raw data)
- mean 500 ms query time; 90%ile < 1 s, 95%ile < 5 s, 99%ile < 10 s

Slide 107

Slide 107 text

Integration (diagram): Apps emit Events onto a Message bus; a Stream Processor and a Batch Processor feed Druid.

Slide 108

Slide 108 text

Connect. Works well with:
- Kafka
- Hadoop
- S3
- Spark Streaming
- Storm
- Samza
- I bet some other things that also start with S

Slide 109

Slide 109 text

http://druid.io/downloads.html https://imply.io/get-started

Slide 110

Slide 110 text

Questions? https://xkcd.com/285