Slide 1

Slide 1 text

Lean Data Diet and Lessons Learned

Slide 2

Slide 2 text

Lean Data: Do I need this data to provide the value I’m trying to deliver?

Slide 3

Slide 3 text

6 years leading data engineering @ [email protected]

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

No content

Slide 6

Slide 6 text

Free Knowledge Movement

Slide 7

Slide 7 text

“Imagine a world in which every single human being can freely share in the sum of all knowledge”.

Slide 8

Slide 8 text

Free Knowledge Movement: Core Belief around Privacy
No one should have to provide any information to participate in the free knowledge movement.
There cannot be access to free knowledge without a strong guarantee of privacy.

Slide 9

Slide 9 text

Privacy is a Hidden Feature

Slide 10

Slide 10 text

Slide 11

Slide 11 text

Anyone can edit!

Slide 12

Slide 12 text

https://dumps.wikimedia.org

Slide 13

Slide 13 text

Usage Data - Web Request
project     es.wikipedia
ip_address  3x.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
page        COVID-19

Slide 14

Slide 14 text

Slide 15

Slide 15 text

Free Knowledge Movement: Core Belief around Privacy
No one should have to provide any information to participate in the free knowledge movement.
There cannot be access to free knowledge without a strong guarantee of privacy.

Slide 16

Slide 16 text

How is this guarantee of Privacy Expressed?

Slide 17

Slide 17 text

https://foundation.wikimedia.org/wiki/Privacy_policy

Slide 18

Slide 18 text

Build the wiki way

Slide 19

Slide 19 text

Build the wiki way
Discussion total: 150,000 words

Slide 20

Slide 20 text

Privacy Guarantees
- Read or edit without account.
- Register account without name, email or any other info.
- Never selling/sharing your info with third parties.
- After at most 90 days, user data will be deleted, aggregated, or de-identified.

Slide 21

Slide 21 text

In practice, the Privacy Policy has strong implications for how we do engineering

Slide 22

Slide 22 text

Lesson Learned #1: Principles are useful when a choice is needed

Slide 23

Slide 23 text

Privacy Guarantees
- Read or edit without account.
- Register account without name, email or any other info.
- Never selling/sharing your info with third parties.
- After at most 90 days, user data will be deleted, aggregated, or de-identified.

Slide 24

Slide 24 text

Wikipedia runs on-prem https://github.com/wikimedia/puppet

Slide 25

Slide 25 text

If AWS goes dark tomorrow, Wikipedia will continue working just the same
https://github.com/wikimedia/puppet

Slide 26

Slide 26 text

Privacy Guarantees
- Read or edit without account.
- Register account without name, email or any other info.
- Never selling/sharing your info with third parties.
- After at most 90 days, user data will be deleted, aggregated, or de-identified.

Slide 27

Slide 27 text

Lesson Learned #2: Deleting (and aggregating) data at scale is quite hard

Slide 28

Slide 28 text

Deleting Data At Scale!

Slide 29

Slide 29 text

Usage Data - Web Request
project     es.wikipedia
ip_address  3x.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
page        COVID-19

Slide 30

Slide 30 text

Usage Data - Behavioural
username    pepito_23
ip_address  3x.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
session_id  8c878625792be023
button      red-top-right
ui_skin     minerva

Slide 31

Slide 31 text

200,000 web requests PER sec (at peak)

Slide 32

Slide 32 text

200,000 web requests PER sec (at peak)
4,000 analytics events PER sec

Slide 33

Slide 33 text

Deleting Data
Are you sure? [Cancel] [Delete]

Slide 34

Slide 34 text

--dry-run            undef -> execute
--tables-to-delete   undef -> all

--execute            undef -> dry-run
--tables-to-delete   undef -> none
                     *     -> all
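The second set of defaults on this slide (nothing destructive happens unless asked for explicitly) can be sketched with Python's argparse. This is a minimal illustration, not the actual Wikimedia script: the program name, the `resolve` helper, and treating `--execute` as a plain boolean flag are all assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # "drop-partitions" is a hypothetical program name used for illustration.
    parser = argparse.ArgumentParser(prog="drop-partitions")
    # Destructive mode must be requested explicitly; omitting it means dry-run.
    parser.add_argument("--execute", action="store_true")
    # No tables are selected by default; "*" is the explicit way to say "all".
    parser.add_argument("--tables-to-delete", default="")
    return parser


def resolve(argv, known_tables):
    """Return (mode, tables) for the given command line arguments."""
    args = build_parser().parse_args(argv)
    if args.tables_to_delete == "*":
        tables = sorted(known_tables)
    else:
        tables = [t for t in args.tables_to_delete.split(",") if t]
        unknown = set(tables) - set(known_tables)
        if unknown:
            raise ValueError(f"unknown tables: {sorted(unknown)}")
    mode = "execute" if args.execute else "dry-run"
    return mode, tables
```

With these defaults, forgetting a flag produces a harmless dry-run over no tables, whereas the first set of defaults on the slide would delete everything.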

Slide 35

Slide 35 text

--database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
Executing tests…
Tests passed.
Starting DRY-RUN.
Checking partitions to delete…
Partitions that would be deleted by execution:
- year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
DRY-RUN finished.
Parameter checksum: 57ca7987d987e9e98a6c79
--execute=
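How a dry-run might pick hourly partitions older than a retention window can be sketched as follows. The `Partition` type and `partitions_to_delete` function are hypothetical names, and treating "older than 90" as "strictly before today minus 90 days" is an assumption; the real tool may handle the boundary differently.

```python
from datetime import date, timedelta
from typing import List, NamedTuple


class Partition(NamedTuple):
    # Mirrors the year/month/day/hour/wiki layout printed by the dry-run.
    year: int
    month: int
    day: int
    hour: int
    wiki: str


def partitions_to_delete(partitions: List[Partition], today: date,
                         older_than_days: int) -> List[Partition]:
    """Return the partitions whose date falls strictly before the cutoff."""
    cutoff = today - timedelta(days=older_than_days)
    return [p for p in partitions if date(p.year, p.month, p.day) < cutoff]
```

Because the function only returns a list, the same selection can be printed by a dry-run and later handed to the code that performs the deletion.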

Slide 36

Slide 36 text

--database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
Executing tests…
Tests passed.
Starting DRY-RUN.
Checking partitions to delete…
Partitions that would be deleted by execution:
- year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
- year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
- year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
- year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
DRY-RUN finished.
Parameter checksum: 57ca7987d987e9e98a6c79
--execute=

#1 Dry-run
#2 Execute
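The two-step flow on this slide (dry-run first, then execute) can be approximated by making the execute step require the checksum printed by the dry-run. The sketch below assumes the checksum covers only the parameters and that `--execute=` takes that checksum as its value; the function names, the SHA-256 choice, and the truncation length are illustrative, not taken from the real tool.

```python
import hashlib
import json


def parameter_checksum(params: dict) -> str:
    """Stable checksum of the parameters (assumed scheme, for illustration)."""
    payload = json.dumps(params, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:22]


def run(params: dict, partitions: list, execute_checksum=None, delete=print):
    """Dry-run unless a checksum matching these exact parameters is given."""
    expected = parameter_checksum(params)
    if execute_checksum is None:
        # Step 1: report what would be deleted and hand back the checksum.
        return {"mode": "dry-run", "would_delete": list(partitions),
                "checksum": expected}
    if execute_checksum != expected:
        raise ValueError("checksum mismatch: repeat the dry-run first")
    # Step 2: the parameters are the ones that were reviewed, so delete.
    for partition in partitions:
        delete(partition)
    return {"mode": "execute", "deleted": list(partitions)}
```

The design point is that a destructive run cannot be started from scratch: any change to the parameters invalidates the checksum and forces a fresh dry-run and review.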

Slide 37

Slide 37 text

Sanitizing Data

Slide 38

Slide 38 text

Diagram (analytics data): Clients -> HTTP Beacon Endpoint -> Varnish -> Varnishkafka -> Kafka -> Event Processor (Spark) -> HDFS

Slide 39

Slide 39 text

Diagram (behavioural data): Clients -> HTTP Beacon Endpoint -> Varnish -> Varnishkafka -> Kafka -> Event Processor (Spark) -> HDFS (Events <90 days) -> Allow-list -> Sanitized Events
https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job

Slide 40

Slide 40 text

Unsanitized
date        2019-01-01
ip          31.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
wiki        en.wikipedia
action      click
target      menu

Slide 41

Slide 41 text

Do-not-allow-list

Unsanitized
date        2019-01-01
ip          31.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
wiki        en.wikipedia
action      click
target      menu

Slide 42

Slide 42 text

Do-not-allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                NULL
user_agent  Mozilla/5.0 (X11; Linux ...   NULL
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu

Slide 43

Slide 43 text

Do-not-allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                NULL
user_agent  Mozilla/5.0 (X11; Linux ...   NULL
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu
cookie_id   724310                        724310
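The weakness shown on this slide, where a newly added `cookie_id` field passes through untouched, follows directly from how a do-not-allow-list works. A minimal sketch, with a hypothetical function name and a deny-list chosen to match the slide:

```python
def sanitize_with_deny_list(event: dict, deny_list: set) -> dict:
    """Null out only the fields named in the deny-list; keep everything else."""
    return {field: (None if field in deny_list else value)
            for field, value in event.items()}


# Fields someone remembered to list as sensitive.
DENY_LIST = {"ip", "user_agent"}

event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,  # added to the schema later, never added to the list
}

sanitized = sanitize_with_deny_list(event, DENY_LIST)
```

Here `sanitized["cookie_id"]` is still 724310: any field nobody thought to list is retained by default.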

Slide 44

Slide 44 text

Allow-list

Unsanitized
date        2019-01-01
ip          31.214.189.167
user_agent  Mozilla/5.0 (X11; Linux ...
wiki        en.wikipedia
action      click
target      menu

Slide 45

Slide 45 text

Allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                NULL
user_agent  Mozilla/5.0 (X11; Linux ...   NULL
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu

Slide 46

Slide 46 text

Allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                NULL
user_agent  Mozilla/5.0 (X11; Linux ...   NULL
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu
cookie_id   724310                        NULL

Slide 47

Slide 47 text

Allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                Spain
user_agent  Mozilla/5.0 (X11; Linux ...   NULL
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu
cookie_id   724310                        NULL

Slide 48

Slide 48 text

Allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                Spain
user_agent  Mozilla/5.0 (X11; Linux ...   Linux
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu
cookie_id   724310                        NULL

Slide 49

Slide 49 text

Allow-list

field       Unsanitized                   Sanitized
date        2019-01-01                    2019-01-01
ip          31.214.189.167                Spain
user_agent  Mozilla/5.0 (X11; Linux ...   Linux
wiki        en.wikipedia                  en.wikipedia
action      click                         click
target      menu                          menu
cookie_id   724310                        8d56ab209e10 (#: hashed)
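The allow-list progression on these slides (keep a field, null it, reduce it to something coarser, or hash it) can be sketched as a mapping from field name to transformation, where anything unlisted becomes NULL. Everything below is an illustrative assumption rather than the refinery implementation: the function names, the tiny `COUNTRY_BY_IP` table standing in for a geolocation database, the user-agent matching, and the salted HMAC used for hashing.

```python
import hashlib
import hmac

# Hypothetical stand-in for a real geolocation lookup.
COUNTRY_BY_IP = {"31.214.189.167": "Spain"}


def keep(value):
    return value


def ip_to_country(ip):
    return COUNTRY_BY_IP.get(ip)


def user_agent_to_os(user_agent):
    # Very rough reduction of a user agent string to an OS family.
    for os_name in ("Android", "Windows", "Linux"):
        if os_name in user_agent:
            return os_name
    return None


def make_hasher(salt: bytes):
    def hash_value(value):
        digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:12]
    return hash_value


def sanitize(event: dict, allow_list: dict) -> dict:
    """Apply the listed transformation per field; unlisted fields become None."""
    sanitized = {}
    for field, value in event.items():
        transform = allow_list.get(field)
        if transform is None or value is None:
            sanitized[field] = None
        else:
            sanitized[field] = transform(value)
    return sanitized


ALLOW_LIST = {
    "date": keep,
    "wiki": keep,
    "action": keep,
    "target": keep,
    "ip": ip_to_country,
    "user_agent": user_agent_to_os,
    "cookie_id": make_hasher(b"hypothetical-salt"),
}
```

Unlike the do-not-allow-list, a field added to the schema later is dropped until someone deliberately adds it to `ALLOW_LIST`, so forgetting is safe by default.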

Slide 50

Slide 50 text

Sanitizing Data: Advanced

Slide 51

Slide 51 text

Privacy Guarantees
- Read or edit without account.
- Register account without name, email or any other info.
- Never selling/sharing your info with third parties.
- After at most 90 days, user data will be deleted, aggregated, or de-identified.

Slide 52

Slide 52 text

(Last) Lesson Learned: Privacy cannot be the responsibility of one team. It is a shared commitment. Tech is easy, building culture is hard.

Slide 53

Slide 53 text

Privacy First Metric Computation

Slide 54

Slide 54 text

Unique Device - DAU or MAU
[Diagram labels: "UUID, REQ", "UUID, REQ", "UUID"]
SELECT COUNT(DISTINCT uuid) FROM database.table WHERE date = '2021-03-01';

Slide 55

Slide 55 text

Unique Device
[Diagram labels: "UUID, REQ", "UUID"]
SELECT page_title, uuid FROM database.table WHERE date = '2021-03-01' and uuid =

Slide 56

Slide 56 text

Unique Device
LAST ACCESS: 2021-09-01
https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/

Slide 57

Slide 57 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"; dates shown: 2021-09-01, 2021-09-01]

Slide 58

Slide 58 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"; dates shown: 2021-09-01, 2021-09-01]
Timestamp   IP       Page     Cookies
2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01

Slide 59

Slide 59 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"; dates shown: 2021-10-15, 2021-10-15]
Timestamp   IP       Page     Cookies
2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01

Slide 60

Slide 60 text

Unique Device
LAST ACCESS: 2021-10-15

Slide 61

Slide 61 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"; dates shown: 2021-10-15, 2021-10-15]

Slide 62

Slide 62 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"; dates shown: 2021-10-15, 2021-10-15]
Timestamp   IP       Page     Cookies
2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
2021-10-15  123.9.*  Everest  Last-Access=2021-10-15

Slide 63

Slide 63 text

Unique Device (today: 2021-10-15)
[Diagram: LAST ACCESS; requests labelled "LA, REQ", "LA, REQ"]
SELECT COUNT(*) FROM database.table WHERE (`last-access-date` IS NULL OR `last-access-date` < timestamp) AND date = '2021-10-15';

Slide 64

Slide 64 text

Unique Device
    Timestamp   IP       Page     Cookies
->  2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
    2021-10-15  123.9.*  Everest  Last-Access=2021-10-15
SELECT COUNT(*) FROM database.table WHERE (`last-access-date` IS NULL OR `last-access-date` < timestamp) AND date = '2021-10-15';
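The counting rule in the query above can be checked against the two example rows with a small Python sketch. The `Request` type and function names are hypothetical, and the real computation handles cases that this sketch does not model.

```python
from datetime import date
from typing import Iterable, NamedTuple, Optional


class Request(NamedTuple):
    day: date                    # date of the request
    page: str
    last_access: Optional[date]  # value of the Last-Access cookie, if any


def is_first_visit_of_day(request: Request) -> bool:
    """A device is counted only when its cookie is missing or from an earlier day."""
    return request.last_access is None or request.last_access < request.day


def count_unique_devices(requests: Iterable[Request], day: date) -> int:
    return sum(1 for r in requests if r.day == day and is_first_visit_of_day(r))


def cookie_to_set(request: Request) -> date:
    """The response refreshes the cookie to the request date, nothing else."""
    return request.day


requests = [
    Request(date(2021, 10, 15), "Titanic", date(2021, 9, 1)),
    Request(date(2021, 10, 15), "Everest", date(2021, 10, 15)),
]
daily_uniques = count_unique_devices(requests, date(2021, 10, 15))
```

Only the Titanic request is counted, because its cookie predates the request day. The cookie carries just a date, so the count needs no per-device identifier, and a query like the earlier `SELECT page_title, uuid` has nothing to join on.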

Slide 65

Slide 65 text

Strong Privacy Stance: The Lean Data Diet
Pros
- Less work related to data requests
- Easier to make data public
- Privacy as a hidden feature
Cons
- Extra work
- Building privacy culture takes time
- Data analysis needs a different mindset

Slide 66

Slide 66 text

Questions?
https://xkcd.com/285
@pantojacoder
All pictures: https://creativecommons.org/publicdomain/zero/1.0/

Slide 67

Slide 67 text

Bonus Lesson: You are never done with data quality issues, but solving the big ones will help tremendously with other, seemingly unrelated, problems.

Slide 68

Slide 68 text

Example: Vandalism

Slide 69

Slide 69 text

No content

Slide 70

Slide 70 text

Questions?
https://xkcd.com/285
@pantojacoder
All pictures: https://creativecommons.org/publicdomain/zero/1.0/

Slide 71

Slide 71 text

No content