“Imagine a world in which every single
human being can freely share in the
sum of all knowledge.”
Slide 4
Senior Systems Engineer
I have been with Wikimedia since 2012, mostly on
infrastructure for analytics.
Previously Lead SysAdmin at CouchSurfing.org
http://www.mediawiki.org/wiki/User:Ottomata
Andrew Otto
Slide 5
Fangjin Yang
Druid Engineer
One of the original authors behind Druid. Co-founder @
Imply.
Slide 6
This is an overview of Wikimedia data sources, their history and scale.
Slide 7
No content
Slide 8
Data Sources
Slide 9
Data Sources
2 Technical Challenges:
Scale and Complexity
Slide 11
Data Sources
1 Philosophical Stand:
Privacy
Slide 13
Wikipedia is ranked the 6th-largest website by pageviews [Alexa].
Scale is an obvious issue
...
Slide 14
Complexity is less
obvious
MediaWiki is not designed
for analytics.
It is hard to answer, for example: which pages were created in 2016 on the German Wikipedia by users with more than 5 edits?
Slide 15
Privacy
We do not send data to
third parties.
Commercial solutions to gather user
data are not an option.
Slide 16
Data Sources
Privacy
Slide 17
Event Logs
Slide 18
Captures rich events of JSON-schemaed data.
Events are validated and sent off
to various data stores.
EventLogging
Event Logs
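EventLogging's validate-then-forward flow can be sketched in a few lines. This is a hedged, stdlib-only stand-in, not the real EventLogging code: the schema, field names, and sink interface below are all hypothetical.

```python
import json

# Hypothetical EventLogging-style schema: field name -> expected type.
SCHEMA = {"wiki": str, "action": str, "ts": int}

def validate(raw):
    """Parse a raw JSON event and check it against SCHEMA.
    Returns the event dict if valid, else None (invalid events are dropped)."""
    try:
        event = json.loads(raw)
    except ValueError:
        return None
    for field, ftype in SCHEMA.items():
        if not isinstance(event.get(field), ftype):
            return None
    return event

def dispatch(raw, sinks):
    """Send a validated event to each downstream data store (sink callables)."""
    event = validate(raw)
    if event is not None:
        for sink in sinks:
            sink(event)
    return event
```

Invalid events simply never reach a sink, which mirrors the validate-then-store-everywhere pattern described above.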
Slide 19
Event Logs
Slide 20
~300 events per second (for now!)
~30 active schemas
Event Logs
Slide 21
Balance between
transparency and
privacy
Slide 22
Event Logs
Must host all data in our datacenters (no Google Analytics).
Store as little private data as possible:
- No IPs
- Only sanitized user agents
- Historical data (older than 90 days) must be fully cleansed of sensitive information
- It should not be possible to reconstruct browsing sessions
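A minimal sketch of those sanitization rules, assuming hypothetical field names (`ip`, `user_agent`, `session_token`); the real pipeline's rules are more involved than this.

```python
import re

def sanitize(event):
    """Apply privacy rules to a raw event dict (field names are hypothetical):
    - drop the client IP entirely
    - drop any token that could link requests into a browsing session
    - reduce the user agent to a coarse browser family + major version
    """
    clean = dict(event)
    clean.pop("ip", None)
    clean.pop("session_token", None)
    ua = clean.pop("user_agent", "")
    m = re.search(r"(Chrome|Firefox|Safari|Edge)/(\d+)", ua)
    clean["ua_family"] = f"{m.group(1)} {m.group(2)}" if m else "Other"
    return clean
```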
Slide 24
Event Logs
Slide 25
Event Logs
This part is
holding us back!
...Druid?
Slide 26
Event Logs
Future work:
Teach folks to design event schemas that fit an OLAP data model, i.e. dimensions and metrics.
Auto-ingest these into Druid to allow for use with nice GUIs (Pivot / Superset).
Slide 28
Data Sources
Complexity
Slide 29
MediaWiki
Databases
Slide 30
MediaWiki
databases
MediaWiki metadata is not huge,
but analytics of edit history since
the beginning of wiki time is
dimensionally complex.
Slide 31
~1.5 million article revisions per day
MediaWiki
databases
Slide 32
We want to learn about users and
edit history.
- How many new users over time?
- How many edits over time?
- How many edits do new users make?
- How do editors stay active?
- How many edits are made by bots?
- Etc. etc. etc.
MediaWiki
databases
Slide 33
Simple questions
might be
hard to answer
Slide 34
MediaWiki
databases
How many new editors across all wikis?
2 joins, 3 subqueries
3 tables on 800+ wikis
5 days to run
Slide 35
MediaWiki databases
MySQL (OLTP) works fine as application storage, but not so great for historical analytics: you can only query the current state.
Slide 38
MediaWiki databases
However...
Historical events are saved in a ‘log’ table, which has juicy information about:
- page (AKA article) events (move, delete and restore)
- user events (rename, block, rights)
...both of which go back almost to the beginning of wiki-time.
Slide 40
MediaWiki databases
However...HOWEVER!
MediaWiki is an old(ish), popular open source project.
Super cool! But this also leads us to the land of database design insanity!
Slide 43
MediaWiki databases
However...HOWEVER!
- Primary IDs are missing for most events! Instead they have a page title or user name, and pages and users are renamed all the time!
- Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs!
- Page and user properties might be in different fields depending on the age of the event!
- Durations of user blocks are stored in 7 different date formats!
- Some revisions are historically orphaned (no existing page ID!)
- Some page IDs were reused for different pages! (I guess IDs were a scarce commodity in 2006?)
Slide 52
MediaWiki
databases
Hooooboy,
we have fun,
don’t we?!
Slide 53
MediaWiki history
Solution: history reconstruction
A complex series of Spark jobs that iterate over an imported snapshot of all MediaWiki databases.
Final output: large denormalized history
Hive table of all MediaWiki events across
all Wikimedia sites
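The reconstruction problem can be shown in miniature. Below is a hedged, pure-Python toy of one core step (the real pipeline is the series of Spark jobs described above, and every name here is hypothetical): replaying page-move events from newest to oldest to recover which page ID each historical title referred to, since log events record titles rather than primary IDs.

```python
def reconstruct_titles(current_pages, move_events):
    """current_pages: {title: page_id} for the current state.
    move_events: [(ts, old_title, new_title)] in any order.
    Returns {title: page_id} covering historical titles too."""
    history = dict(current_pages)
    # Walk moves from newest to oldest: if the post-move title already maps
    # to an ID, the pre-move title referred to that same page before the move.
    for _, old_title, new_title in sorted(move_events, reverse=True):
        if new_title in history:
            history.setdefault(old_title, history[new_title])
    return history
```

Chained renames resolve too: if A became B and later B became C, both A and B end up mapped to C's page ID.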
Slide 54
MediaWiki history
How many new editors across all wikis?
0 joins, 0 subqueries
1 table
5 minutes to run
Slide 55
MediaWiki history
Slide 56
Data Sources
Scale
Slide 57
Webrequests
Unlike MediaWiki data,
webrequest data is huge, but
dimensionally simple (url, http
status, user agent, etc.)
Slide 58
Webrequests
We want to know
- How many HTTP requests?
- What browsers do people use?
- How many of those requests are actual article page views?
- How many of those are unique
devices?
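Deciding which requests are "actual article page views" is itself a classification step. This is a heavily simplified, hypothetical stand-in for the real pageview definition (which lives in Wikimedia's analytics pipeline and has many more rules): count only successful GETs for article paths from clients that do not self-identify as bots.

```python
import re

# Article paths look like /wiki/Druid; a colon would indicate a non-article
# namespace such as /wiki/Special:Search. (Pattern is illustrative only.)
ARTICLE_PATH = re.compile(r"^/wiki/[^:]+$")

def is_pageview(method, status, path, user_agent):
    """Classify one webrequest as an article pageview (simplified sketch)."""
    if method != "GET" or status != 200:
        return False
    if "bot" in user_agent.lower():
        return False
    return bool(ARTICLE_PATH.match(path))
```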
Slide 59
~150,000 HTTP requests per second
Webrequests
Slide 60
~30 million pageviews per hour
Webrequests
Slide 61
~150 million unique devices per day
Webrequests
Slide 62
Scaling
issues: 2005 to
2014
Slide 63
Webrequests
2014 World Cup Final: traffic can peak at more than 200,000 requests per second.
We lost data because: soccer!
(this is the infamous Brazil vs Germany game)
Slide 64
Varnish
Webrequests are handled by Varnish in multiple datacenters.
Shared memory log: varnishlog apps can access Varnish’s logs in memory.
varnishncsa: a varnishlog-to-stdout formatter; Wikimedia patched it to send logs over UDP.
Webrequests
Slide 65
udp2log
Webrequests
Slide 66
udp2log
Listens for UDP traffic stream.
Delimits messages by newlines.
Tees out and samples traffic to custom filters.
multicast relay
socat relay sends varnishncsa traffic to
a multicast group, allowing for multiple
udp2log consumers.
Webrequests
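The tee-and-sample behavior described above is the heart of udp2log. A hedged Python stand-in for that logic (the real daemon is not Python, and the fan-out interface here is invented for illustration):

```python
def tee_to_filters(lines, filters):
    """Mimic udp2log's fan-out: each filter is (sample_rate, sink).
    A filter with sample_rate N sees every Nth line (1 = unsampled)."""
    counts = [0] * len(filters)
    for line in lines:
        for i, (rate, sink) in enumerate(filters):
            counts[i] += 1
            if counts[i] % rate == 0:
                sink(line)

# The daemon itself reads the newline-delimited multicast UDP stream;
# a sketch of that receive loop (port is hypothetical):
#   sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
#   sock.bind(("0.0.0.0", 8420))
#   buf = sock.recv(65535).decode()
#   tee_to_filters(buf.splitlines(), filters)
```

Sampling happens per filter, which is why one instance can feed both a full archive and cheap 1-in-1000 aggregations.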
Slide 67
works great but
...
Webrequests
udp2log
Slide 68
doesn’t scale - every
udp2log instance must see every
network packet.
Works for simple use
cases and lower traffic
scenarios.
Webrequests - udp2log
Slide 69
What we want:
all requests saved for easy
and fast analysis.
Webrequests
Slide 70
scalable log
transport
Webrequests
Slide 71
No content
Slide 72
Scaling
issues under
control: Kafka and
Hadoop since 2014
Slide 73
Webrequests
Slide 74
Webrequests
See that FutureDB?
This diagram was made in
2014.
Our FutureDB today is
Druid!
Slide 75
No content
Slide 76
Druid
Why?
Can’t use proprietary software (no Tableau).
Druid works with some really nice GUIs (Pivot, Superset, etc.), which provide a lot of value for us.
And we get a fast, scalable OLAP database with realtime support.
Slide 80
Ingestion
Druid
Batch from Hadoop
Streaming from Kafka with Spark
and Tranquility
Slide 81
Ingestion
Druid
Batch
Data is partially pre-aggregated in
Hadoop into useful Hive tables
(e.g. Mediawiki History, Pageview
counts, etc.)
This is then regularly loaded into
Druid using custom tooling and
Oozie
Slide 82
Ingestion
Druid
Streaming
Spark Streaming job consumes
from Kafka and crunches
dimensions and metrics.
Spark-Tranquility used to beam
directly to Druid for indexing (WIP)
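For the batch path, the load is described by a Druid ingestion spec submitted as a task. Below is a hedged sketch of what such a spec might look like for a pre-aggregated hourly pageview table (Druid of roughly the 0.9 era, to match the talk); the dataSource name, column names, interval, and HDFS path are all hypothetical.

```python
import json

# Hedged sketch of a Hadoop batch ingestion spec. In production, tooling
# submits JSON like this to the Druid overlord as an indexing task.
spec = {
    "type": "index_hadoop",
    "spec": {
        "dataSchema": {
            "dataSource": "pageviews_hourly",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
                "intervals": ["2017-01-01/2017-01-02"],
            },
            "metricsSpec": [
                {"type": "longSum", "name": "view_count", "fieldName": "view_count"}
            ],
        },
        "ioConfig": {
            "type": "hadoop",
            "inputSpec": {"type": "static", "paths": "hdfs://analytics/pageview/2017-01-01"},
        },
        "tuningConfig": {"type": "hadoop"},
    },
}

task_json = json.dumps(spec, indent=2)
```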
Slide 83
Druid
July 2016 - Issue T141506 is submitted:
“Suddenly outrageous higher pageviews for main pages”
Lots of slicing and dicing in Hive, but we didn’t see a trend.
Suspicion about browsers. Used Druid/Pivot to slice.
Slide 87
Druid. Traffic Spike Investigation
Definitely something weird happening here
Slide 88
Is it Chrome?
Looks like mostly Chrome
Slide 89
Answer: Chrome 41 on Windows 7
Chrome 51 users,
upgrading to 52, cool
And there’s our problem,
definitely Chrome 41
Slide 90
Let’s dig into the
technology a little bit
more...
Slide 91
OLAP is about splitting apart a large dataset into smaller subsets and analyzing the subsets.
Aggregations are often involved (counts, sums, quantiles, etc.)
The OLAP Problem
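The split-then-aggregate operation amounts to a GROUP BY over chosen dimensions with an aggregation over a metric. A tiny illustrative sketch (all column names hypothetical):

```python
from collections import defaultdict

def olap_rollup(rows, dimensions, metric):
    """Split rows into subsets by dimension values and sum the metric:
    the core OLAP operation (GROUP BY dims, SUM(metric))."""
    out = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        out[key] += row[metric]
    return dict(out)
```

Slicing by more dimensions just lengthens the grouping key; that is what tools like Pivot do interactively.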
Slide 92
OLAP Use Cases:
A/B Testing, Marketing, Networking, Troubleshooting
Slide 93
High dimensionality: 10s, 100s, or even 1000s of
columns
High cardinality: each column may have millions of
unique values
Multi-tenancy: many concurrent users
Freshness: load from streams, not batch files
OLAP at scale is hard
Slide 94
How Druid tackles OLAP
Column storage format
Compression & encoding
Intelligent search filters
Approximate algorithms
Distributed architecture
Scale
(largest cluster I know about – not a hard limit)
>3M events/sec ingested (peak)
10 – 100k events/sec/cpu
>500 TB of segments (>50 trillion raw events, >50
PB raw data)
mean 500ms query time
90%ile < 1s, 95%ile < 5s, 99%ile < 10s
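Two of those techniques, column-oriented dictionary encoding and search (bitmap) indexes, can be illustrated in a few lines. This is a toy stand-in for what Druid does per dimension column, not Druid's actual storage code:

```python
def encode_column(values):
    """Dictionary-encode a string column and build bitmap-style indexes:
    each distinct value gets a small integer id, plus the set of row
    offsets where it occurs (stored as compressed bitmaps in Druid)."""
    dictionary, encoded, bitmaps = {}, [], {}
    for row, v in enumerate(values):
        vid = dictionary.setdefault(v, len(dictionary))
        encoded.append(vid)
        bitmaps.setdefault(v, set()).add(row)
    return dictionary, encoded, bitmaps

# A filter like browser = Chrome AND os = Win7 then becomes a bitmap
# intersection over row ids, with no row-by-row scan:
#   rows = browser_bitmaps["Chrome"] & os_bitmaps["Win7"]
```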
Slide 107
Integration
[Diagram: Apps emit Events onto a Message bus; a Stream Processor and a Batch Processor load them into Druid]
Slide 108
Works well with:
- Kafka
- Hadoop
- S3
- Spark Streaming
- Storm
- Samza
- I bet some other things that also start with S
Connect