Andrew Otto
Senior Systems Engineer
I have been with Wikimedia since 2012, mostly on analytics infrastructure. Previously Lead SysAdmin at CouchSurfing.org.
http://www.mediawiki.org/wiki/User:Ottomata
Complexity is less obvious. MediaWiki is not designed for analytics. It is hard to answer questions like: which pages were created in 2016, on German Wikipedia, by users with more than 5 edits?
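As a rough illustration, here is what that question looks like against the raw MediaWiki tables. This is a minimal Spark SQL sketch assuming the pre-actor-migration schema (revision, page, user); the 'dewiki' database name and the exact setup are assumptions, not how this is actually run. Note that user_editcount is the user's edit count today, not as of 2016, which is part of why the question is harder than it looks.

# Hypothetical sketch: "pages created in 2016 by users with > 5 edits" on dewiki,
# queried straight from a replica of the raw MediaWiki tables.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dewiki-page-creations-sketch").getOrCreate()

pages_2016 = spark.sql("""
  SELECT p.page_id, p.page_title
  FROM dewiki.revision r
  JOIN dewiki.page p ON p.page_id = r.rev_page
  JOIN dewiki.`user` u ON u.user_id = r.rev_user
  WHERE r.rev_parent_id = 0              -- first revision == page creation
    AND r.rev_timestamp LIKE '2016%'     -- MediaWiki timestamps are yyyymmddhhmmss strings
    AND u.user_editcount > 5             -- careful: edit count *now*, not as of 2016!
""")
pages_2016.show()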
Event Logs
Must host all data in our datacenters. Store as little private data as possible:
- No IPs
- Only sanitized user agents
- Historical data (older than 90 days) must be fully cleansed of sensitive information
- Should not be able to reconstruct browsing sessions
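A minimal sketch of the kind of sanitization this implies, before an event goes into long-term storage. The field names and coarsening rules here are assumptions for illustration, not the actual Wikimedia pipeline.

# Hypothetical sketch: strip private fields from an event before long-term storage.
# Field names ("ip", "user_agent", "session_id") and the coarsening rules are
# assumptions, not the real Wikimedia sanitization code.
def sanitize_event(event: dict) -> dict:
    clean = dict(event)
    clean.pop("ip", None)                      # no IPs at all
    ua = clean.get("user_agent", "")
    # keep only a coarse browser family, drop the full UA string
    clean["user_agent"] = "Firefox" if "Firefox" in ua else \
                          "Chrome"  if "Chrome"  in ua else "Other"
    clean.pop("session_id", None)              # nothing that links requests into a session
    return clean

print(sanitize_event({"ip": "203.0.113.7",
                      "user_agent": "Mozilla/5.0 ... Firefox/57.0",
                      "session_id": "abc123",
                      "page": "Main_Page"}))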
Event Logs
Future work: teach folks to design event schemas that fit an OLAP data model, i.e. dimensions and metrics. Auto-ingest these into Druid to allow use with nice GUIs (Pivot / Superset).
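For illustration, here is a hypothetical event that maps cleanly onto an OLAP model: low-cardinality descriptive fields become Druid dimensions, numeric fields become metrics, and a timestamp drives partitioning. The schema itself is an invented example, not an actual Wikimedia EventLogging schema.

# Hypothetical event shaped for OLAP ingestion into Druid.
example_event = {
    # dimensions: what you group and filter by
    "wiki":            "dewiki",
    "platform":        "desktop",
    "browser_family":  "Firefox",
    "action":          "edit_saved",
    # metrics: what you aggregate
    "edit_count":      1,
    "time_to_save_ms": 4200,
    # timestamp: Druid's primary partitioning column
    "dt": "2017-11-01T12:34:56Z",
}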
MediaWiki databases
We want to learn about users and edit history:
- How many new users over time?
- How many edits over time?
- How many edits do new users make?
- How do editors stay active?
- How many edits are made by bots?
- Etc. etc. etc.
MediaWiki databases
However... Historical events are saved in a ‘log’ table. Log has juicy information about:
- page (AKA article) events (move, delete and restore)
- user events (rename, block, rights)
...which both go back to almost the beginning of wiki-time.
MediaWiki databases
However... HOWEVER! MediaWiki is an old(ish), popular open source project. Super cool! But this also leads us to the land of database design insanity!
MediaWiki databases
However... HOWEVER!
- Primary IDs are missing for most events! Instead they have page title or user name, and pages and users are renamed all the time!
- Data is stored in different formats depending on the MediaWiki version that created the log record, sometimes in PHP-encoded blobs!
- Page and user properties might be in different fields depending on the age of the event!
- Durations of user blocks are stored in 7 different date formats!
- Some revisions are historically orphaned (no existent page IDs!)
- Some page IDs were reused for different pages! (I guess IDs were a scarce commodity in 2006?)
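As an example of the kind of cleanup this forces, here is a minimal sketch of normalizing block expiries that arrive in several different shapes. The formats handled below are illustrative only; the real reconstruction job covers more cases than the seven referenced on the slide.

# Hypothetical sketch: normalize block duration/expiry values from old log records.
from datetime import datetime

def parse_block_expiry(raw: str):
    raw = raw.strip()
    if raw in ("infinite", "indefinite", "infinity", "never"):
        return None                               # no expiry
    for fmt in ("%Y%m%d%H%M%S",                   # MediaWiki binary(14) timestamp
                "%Y-%m-%dT%H:%M:%SZ",             # ISO 8601
                "%a, %d %b %Y %H:%M:%S GMT"):     # old RFC-1123-style strings
        try:
            return datetime.strptime(raw, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized block expiry format: {raw!r}")

print(parse_block_expiry("20170101000000"))
print(parse_block_expiry("indefinite"))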
MediaWiki history
Solution: history reconstruction. A complex series of Spark jobs iterates over an imported snapshot of all MediaWiki databases. Final output: a large denormalized history Hive table of all MediaWiki events across all Wikimedia sites.
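Once that denormalized table exists, the earlier "pages created in 2016 by users with more than 5 edits on German Wikipedia" question becomes a single scan. A minimal sketch follows; the table name echoes the public mediawiki_history dataset, but the exact field names used here are assumptions.

# Hypothetical sketch against the denormalized history table.
# Field names (event_entity, event_type, event_user_revision_count, ...) are
# assumptions loosely based on the public mediawiki_history dataset.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mediawiki-history-sketch").getOrCreate()

spark.sql("""
  SELECT page_title
  FROM wmf.mediawiki_history
  WHERE wiki_db = 'dewiki'
    AND event_entity = 'page'
    AND event_type = 'create'
    AND event_timestamp LIKE '2016%'
    AND event_user_revision_count > 5   -- the user's edit count *at event time*,
                                        -- which the raw tables cannot give you
""").show()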
Webrequests
We want to know:
- How many HTTP requests?
- What browsers do people use?
- How many of those requests are actual article pageviews?
- How many of those come from unique devices?
Webrequests
Varnish: webrequests are handled by Varnish in multiple datacenters.
Shared memory log: varnishlog apps can access Varnish's logs in shared memory.
varnishncsa: a varnishlog-to-stdout formatter. Wikimedia patched this to send logs over UDP.
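Conceptually, the patched varnishncsa behaves like this minimal sketch: format each request as an NCSA-style log line and fire it at a UDP endpoint. The endpoint, the line layout, and the field values are made up for illustration; this is not the actual patch.

# Hypothetical sketch of "send NCSA-style log lines over UDP".
import socket

UDP_DEST = ("127.0.0.1", 8420)   # hypothetical relay endpoint (localhost for the sketch)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def emit(client, timestamp, method, url, status, bytes_sent, user_agent):
    line = f"{client} - - [{timestamp}] \"{method} {url} HTTP/1.1\" {status} {bytes_sent} \"{user_agent}\"\n"
    sock.sendto(line.encode("utf-8"), UDP_DEST)

emit("hashed-client", "01/Nov/2017:12:34:56 +0000", "GET",
     "/wiki/Main_Page", 200, 12345, "Mozilla/5.0 ...")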
Webrequests
udp2log: listens to the UDP traffic stream, delimits messages by newlines, and tees out and samples traffic to custom filters.
multicast relay: a socat relay sends varnishncsa traffic to a multicast group, allowing for multiple udp2log consumers.
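A minimal sketch of the udp2log idea (not the real C implementation): listen on UDP, split datagrams on newlines, and hand a 1-in-N sample of lines to a filter.

# Hypothetical sketch of udp2log's core behavior.
import socket

PORT = 8420          # must match whatever the relay sends to
SAMPLE_EVERY = 1000  # keep 1 line in 1000

def filter_line(line: str):
    print("sampled:", line)   # a real filter would be a separate process

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))

seen = 0
while True:
    datagram, _addr = sock.recvfrom(65535)
    for line in datagram.decode("utf-8", errors="replace").splitlines():
        seen += 1
        if seen % SAMPLE_EVERY == 0:
            filter_line(line)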
Druid
Why? We can't use proprietary software (no Tableau). Druid works with some really nice GUIs (Pivot, Superset, etc.), which provide a lot of value for us. And we get a fast, scalable OLAP database with realtime support.
Druid
Ingestion: batch. Data is partially pre-aggregated in Hadoop into useful Hive tables (e.g. MediaWiki History, pageview counts, etc.). This is then regularly loaded into Druid using custom tooling and Oozie.
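The pre-aggregation step shapes data the way Druid likes it: a timestamp, a handful of dimensions, and additive metrics. A minimal sketch of that kind of rollup follows; table and column names are assumptions, not the actual Wikimedia jobs.

# Hypothetical sketch of batch pre-aggregation: roll raw pageview records up to
# hourly counts per (wiki, country, browser) before handing them to Druid.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("druid-batch-preagg-sketch").getOrCreate()

hourly = spark.sql("""
  SELECT date_format(dt, 'yyyy-MM-dd HH:00:00') AS hour,
         wiki,
         country_code,
         browser_family,
         COUNT(*) AS view_count
  FROM wmf.pageview_raw
  GROUP BY date_format(dt, 'yyyy-MM-dd HH:00:00'), wiki, country_code, browser_family
""")

# Write somewhere a Druid batch indexing task (driven by Oozie in production)
# can pick up.
hourly.write.mode("overwrite").parquet("/tmp/druid_staging/pageviews_hourly")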
Druid
Ingestion: streaming. A Spark Streaming job consumes from Kafka and crunches dimensions and metrics. Spark-Tranquility is used to beam directly to Druid for indexing (WIP).
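For illustration, here is what the consume-and-crunch part looks like as a Structured Streaming sketch in Python. The real job uses Spark Streaming plus Tranquility to index into Druid; the topic name, event schema, and console sink below are assumptions.

# Hypothetical sketch of "consume from Kafka, crunch dimensions and metrics".
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

spark = SparkSession.builder.appName("druid-streaming-sketch").getOrCreate()

schema = StructType([
    StructField("dt",      TimestampType()),
    StructField("wiki",    StringType()),
    StructField("browser", StringType()),
])

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "webrequest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Dimensions: wiki, browser. Metric: request count per 1-minute window.
counts = events.groupBy(window(col("dt"), "1 minute"), "wiki", "browser").count()

(counts.writeStream
       .outputMode("update")
       .format("console")   # a real job would hand these rows to a Druid indexer
       .start()
       .awaitTermination())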
Druid
Suddenly, outrageously higher pageviews for main pages. Lots of slicing and dicing in Hive, but we didn't see a trend. Suspicion about browsers: used Druid/Pivot to slice.
The OLAP Problem
OLAP is about splitting a large dataset apart into smaller subsets and analyzing the subsets. Aggregations are often involved (counts, sums, quantiles, etc.).
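In miniature (pure Python, toy data): slice by some dimensions, then aggregate a metric over each subset.

# Toy illustration of the OLAP pattern.
from collections import defaultdict

requests = [
    {"country": "DE", "browser": "Firefox", "views": 3},
    {"country": "DE", "browser": "Chrome",  "views": 5},
    {"country": "US", "browser": "Firefox", "views": 2},
    {"country": "US", "browser": "Firefox", "views": 4},
]

# Slice by (country, browser), aggregate views with a sum.
totals = defaultdict(int)
for r in requests:
    totals[(r["country"], r["browser"])] += r["views"]

print(dict(totals))
# {('DE', 'Firefox'): 3, ('DE', 'Chrome'): 5, ('US', 'Firefox'): 6}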
OLAP at scale is hard
- High dimensionality: 10s, 100s, or even 1000s of columns
- High cardinality: each column may have millions of unique values
- Multi-tenancy: many concurrent users
- Freshness: load from streams, not batch files
Scale (largest cluster I know about, not a hard limit)
- >3M events/sec ingested (peak); 10-100k events/sec/cpu
- >500 TB of segments (>50 trillion raw events, >50 PB raw data)
- mean query time 500ms; 90%ile < 1s, 95%ile < 5s, 99%ile < 10s