
Wikipedia and The Lean Data Diet

nuria_ruiz
September 23, 2020

Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify a wiki anonymously, without fear of giving away personal information, edit history, or browsing history. In this talk we will go into the challenges that this strong privacy stance poses for the Wikimedia Foundation (WMF), including how it affects data collection, and some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.

Transcript

  1. Wikipedia and the Lean Data Diet

  2. analytics@wikimedia.org @pantojacoder

  3. Wikipedia and the Lean Data Diet

  4. None
  5. Free Knowledge Movement

  6. “Imagine a world in which every single human being can freely share in the sum of all knowledge.”
  7. Free Knowledge Movement: Core Beliefs around Privacy
    No one should have to provide any information to participate in the free knowledge movement.
    There cannot be access to free knowledge without a strong guarantee of privacy.
  8. Anyone can edit!

  9. How is this guarantee of Privacy Expressed?

  10. https://foundation.wikimedia.org/wiki/Privacy_policy

  11. Build the wiki way

  12. Build the wiki way. Discussion total: 150,000 words
  13. Read or edit without an account.
    Register an account without a name, email, or any other info.
    Your info is never sold to or shared with third parties.
    After at most 90 days, data will be deleted, aggregated, or de-identified.
  14. In practice, the Privacy Policy has strong implications for how we do engineering.
  15. Read or edit without an account.
    Register an account without a name, email, or any other info.
    Your info is never sold to or shared with third parties.
    After at most 90 days, data will be deleted, aggregated, or de-identified.
  16. Wikipedia runs on-prem https://github.com/wikimedia/puppet

  17. We compute metrics in privacy-conscious ways, aggregate, release publicly, and delete a lot of data.
  18. Deleting Data

  19. Sanitizing Data

  20. Privacy Culture

  21. Deleting Data at Scale!

  22. Usage Data - Web Request
    project     es.wikipedia
    ip_address  31.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    page        COVID-19
  23. Usage Data - Behavioural
    username    pepito_editor
    ip_address  3x.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    session_id  8c878625792be023
    edit_count  4257
    ui_skin     minerva
  24. 200,000 web requests per second (at peak)

  25. 200,000 web requests per second (at peak); 2,000 events per second
  26. Deleting Data (confirmation dialog): "Deleting Data. Are you sure?" [Cancel] [Delete]

  27. --dry-run: undef -> execute
    --tables-to-delete: undef -> all
    --execute: undef -> dry-run
    --tables-to-delete: undef -> none, * -> all
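One way to read slide 27 is as a contrast between flag defaults: with `--dry-run`, forgetting the flag deletes data, while with `--execute`, forgetting the flag only previews. The sketch below illustrates that reading with Python's `argparse`. It is an assumption-laden illustration, not WMF's actual tool: `build_parser`, `resolve_tables`, and the comma-separated table syntax are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical deletion CLI whose defaults are the harmless choice."""
    parser = argparse.ArgumentParser(description="Sketch of a deletion CLI with safe defaults")
    # Leaving --execute unset means dry run; deleting needs an explicit opt-in.
    parser.add_argument("--execute", default=None, metavar="CHECKSUM")
    # Leaving --tables-to-delete unset selects nothing; "*" must be typed to select all.
    parser.add_argument("--tables-to-delete", default="", metavar="TABLES")
    return parser


def resolve_tables(spec: str, known_tables: list) -> list:
    """Map the flag value to concrete tables: unset -> none, '*' -> all."""
    if spec == "":
        return []
    if spec == "*":
        return list(known_tables)
    # Assumed syntax: a comma-separated list of table names.
    wanted = [name.strip() for name in spec.split(",")]
    return [name for name in wanted if name in known_tables]
```

With these defaults, running the command with no arguments selects no tables and deletes nothing, so a typo or a forgotten flag fails safe.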
  28. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
    Executing tests… Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished. Parameter checksum: 57ca7987d987e9e98a6c79
    --execute=<checksum>
  29. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
    Executing tests… Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished. Parameter checksum: 57ca7987d987e9e98a6c79
    --execute=<checksum>
    #1 Dry-run
    #2 Execute
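The two-step flow on these slides (dry run first, then execute with the checksum the dry run printed) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the partition listing is an in-memory list with a hypothetical `age_days` field, and hashing the parameters with SHA-256 is a stand-in, since the slides do not say how the real checksum is computed.

```python
import hashlib
import json
from typing import Optional


def parameter_checksum(params: dict) -> str:
    """Hash the parameters so an execute run can be tied to a reviewed dry run."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_partitions(partitions: list, params: dict) -> list:
    """Pick partitions older than the cutoff (hypothetical in-memory listing)."""
    return [p for p in partitions if p["age_days"] > params["older_than"]]


def run(partitions: list, params: dict, execute: Optional[str] = None) -> dict:
    """Dry run unless `execute` equals the checksum of these exact parameters."""
    doomed = select_partitions(partitions, params)
    checksum = parameter_checksum(params)
    if execute is None:
        # Step 1: report what would be deleted and hand back the checksum.
        return {"mode": "dry-run", "would_delete": doomed, "checksum": checksum}
    if execute != checksum:
        # Changed parameters invalidate the earlier review.
        raise ValueError("checksum mismatch: re-run the dry run and review its output")
    # Step 2: delete only what the matching dry run reported.
    remaining = [p for p in partitions if p not in doomed]
    return {"mode": "execute", "deleted": doomed, "remaining": remaining}
```

Because the checksum depends on every parameter, an operator cannot review one dry run and then execute with different arguments.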
  30. Sanitizing Data

  31. Sanitizing Data: Advanced

  32. Behavioural data pipeline (diagram): Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), HDFS
  33. Behavioural data pipeline (diagram): Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), Allow-list, Events <90 days, Sanitized Events, HDFS
    https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job
  34. Unsanitized
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu
  35. Do-not-allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu
  36. Do-not-allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu
    Sanitized:   date=2019-01-01 | ip=NULL | user_agent=NULL | wiki=en.wikipedia | action=click | target=menu
  37. Do-not-allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
    Sanitized:   date=2019-01-01 | ip=NULL | user_agent=NULL | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
  38. Allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu
  39. Allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu
    Sanitized:   date=2019-01-01 | ip=NULL | user_agent=NULL | wiki=en.wikipedia | action=click | target=menu
  40. Allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
    Sanitized:   date=2019-01-01 | ip=NULL | user_agent=NULL | wiki=en.wikipedia | action=click | target=menu | cookie_id=NULL
  41. Allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
    Sanitized:   date=2019-01-01 | ip=Spain | user_agent=NULL | wiki=en.wikipedia | action=click | target=menu | cookie_id=NULL
  42. Allow-list
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
    Sanitized:   date=2019-01-01 | ip=Spain | user_agent=Linux | wiki=en.wikipedia | action=click | target=menu | cookie_id=NULL
  43. Allow-list #
    Unsanitized: date=2019-01-01 | ip=31.214.189.167 | user_agent=Mozilla/5.0 (X11; Linux ... | wiki=en.wikipedia | action=click | target=menu | cookie_id=724310
    Sanitized:   date=2019-01-01 | ip=Spain | user_agent=Linux | wiki=en.wikipedia | action=click | target=menu | cookie_id=8d56ab209e10
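The allow-list slides suggest a default-deny rule: a field survives sanitization only if it is explicitly listed, so a newly added field such as `cookie_id` is nulled until someone decides how to treat it. Below is a minimal sketch of that idea. The `ALLOW_LIST` format, the `sanitize` function, and the salted HMAC used for hashing are all assumptions for illustration; the real job lives in the refinery-job repository linked on slide 33 and may work differently. Coarsening (ip to country, user_agent to OS family) is left out because it needs external lookups.

```python
import hashlib
import hmac

# Hypothetical allow-list: field name -> action. Fields not listed are nulled.
ALLOW_LIST = {
    "date": "keep",
    "wiki": "keep",
    "action": "keep",
    "target": "keep",
    "cookie_id": "hash",
}


def sanitize(event: dict, allow_list: dict, salt: bytes) -> dict:
    """Return a copy of the event in which only allow-listed fields survive."""
    clean = {}
    for field, value in event.items():
        rule = allow_list.get(field)
        if rule == "keep":
            clean[field] = value
        elif rule == "hash" and value is not None:
            # Salted hash: equal inputs still match, but the raw value is gone.
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            clean[field] = digest.hexdigest()[:12]
        else:
            # Default-deny: anything not explicitly allowed becomes NULL.
            clean[field] = None
    return clean
```

The contrast with a do-not-allow-list is in the `else` branch: a deny-list would keep unknown fields by default, which is how `cookie_id` slips through unsanitized on slide 37.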
  44. Privacy Culture

  45. Privacy is not the responsibility of one team. All processes and metrics take privacy into account from beginning to end.
  46. Unique Device - DAU or MAU
    Diagram labels: UUID, REQ; UUID, REQ; UUID
    SELECT COUNT(DISTINCT uuid) FROM database.table WHERE date = '2019-01-01';
  47. Unique Device
    Diagram labels: UUID, REQ; UUID
    SELECT page_title, uuid FROM database.table WHERE date = '2019-01-01' AND uuid = <some>;
  48. Unique Device: LAST ACCESS 2020-09-01
    https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/

  49. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-09-01; 2020-09-01
  50. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-09-01; 2020-09-01
    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  51. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-09-01; 2020-09-01
    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  52. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-10-15; 2020-10-15
    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  53. Unique Device: LAST ACCESS 2020-10-15

  54. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-10-15; 2020-10-15
  55. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2020-10-15; 2020-10-15
    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
    2020-10-15  776.9.*  Everest  Last-Access=2020-10-15
  56. Unique Device: LAST ACCESS (today: 2020-10-15)
    Diagram labels: LA, REQ; LA, REQ
    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < date) AND date = '2020-10-15';
  57. Unique Device
       Timestamp   IP       Page     Cookies
    -> 2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
       2020-10-15  776.9.*  Everest  Last-Access=2020-10-15
    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < date) AND date = '2020-10-15';
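The last-access counting rule from these slides can be sketched in a few lines: a request counts toward the day's unique devices only if its Last-Access cookie is missing or carries an earlier date, so no per-device identifier is ever stored. The function names and the `(timestamp, last_access)` tuple layout are assumptions for illustration, and the sketch ignores complications such as clients that never store cookies.

```python
from datetime import date
from typing import Optional


def is_first_visit_of_day(last_access: Optional[date], request_day: date) -> bool:
    """A request counts once per day: no cookie yet, or a cookie from an earlier day."""
    return last_access is None or last_access < request_day


def daily_unique_devices(requests: list, day: date) -> int:
    """Count requests on `day` whose Last-Access cookie is missing or older than `day`."""
    return sum(
        1
        for timestamp, last_access in requests
        if timestamp == day and is_first_visit_of_day(last_access, day)
    )


# The two requests from slide 55: only the Titanic request counts, because
# the Everest request already carries today's Last-Access date.
today = date(2020, 10, 15)
requests = [
    (today, date(2020, 9, 1)),  # Titanic, Last-Access=2020-09-01
    (today, today),             # Everest, Last-Access=2020-10-15
]
unique_today = daily_unique_devices(requests, today)
```

Unlike the `COUNT(DISTINCT uuid)` approach on slide 46, nothing here links the two requests to each other, so a device's reading history cannot be reconstructed from the stored data.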
  58. The Lean Data Diet
    Pro: Less work related to data requests; Easier to make data public; Guarantee of privacy
    Con: Extra work; Privacy culture needs time; Data analysis needs a different mindset
  59. Privacy is a Feature

  60. Questions?
    https://xkcd.com/285
    analytics@wikimedia.org @pantojacoder
    All pictures: https://creativecommons.org/publicdomain/zero/1.0/