Wikipedia and The Lean Data Diet

nuria_ruiz
September 23, 2020

Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify a wiki anonymously, without fear of giving away personal information, editorship or browsing history. In this talk we will go into the challenges that this strong privacy stance poses for the Wikimedia Foundation (WMF), including how it affects data collection, and some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.

Transcript

  1. “Imagine a world in which every single human being can freely share in the sum of all knowledge”.
  2. Free Knowledge Movement: Core Beliefs around Privacy. You should not have to provide any information to participate in the free knowledge movement. There cannot be access to free knowledge without a strong guarantee of privacy.
  3. Read or edit without an account. Register an account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, data will be deleted, aggregated, or de-identified.
  5. Usage Data - Behavioural

    field       value
    username    pepito_editor
    ip_address  3x.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    session_id  8c878625792be023
    edit_count  4257
    ui_skin     minerva
  6. Flag defaults:

    --dry-run undef -> execute; --tables-to-delete undef -> all
    --execute undef -> dry-run; --tables-to-delete undef -> none, * -> all
  7. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true

    Executing tests…
    Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished.
    Parameter checksum: 57ca7987d987e9e98a6c79

    --execute=<checksum>
  8. #1 Dry-run

    --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true

    Executing tests…
    Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished.
    Parameter checksum: 57ca7987d987e9e98a6c79

    #2 Execute

    --execute=<checksum>
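The two-step flow on slides 7 and 8 (a dry run that prints a parameter checksum, then an execution that must be handed that checksum) can be sketched in Python as below. This is only an illustration, not the Wikimedia deletion script: the function names, the SHA-256 hash over sorted JSON parameters, and the in-memory partition list are all assumptions.

```python
import hashlib
import json
from datetime import date, timedelta


def parameter_checksum(params: dict) -> str:
    """Hash the parameters in a stable order (hypothetical scheme)."""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def partitions_to_delete(partitions, older_than_days, today):
    """Return partitions whose day is more than older_than_days before today."""
    cutoff = today - timedelta(days=older_than_days)
    return [p for p in partitions if p["day"] < cutoff]


def run(params, partitions, today, execute=None):
    """Dry-run by default; delete only when `execute` matches the checksum."""
    expected = parameter_checksum(params)
    doomed = partitions_to_delete(partitions, params["older_than"], today)
    if execute is None:
        return {"mode": "dry-run", "would_delete": doomed, "checksum": expected}
    if execute != expected:
        raise ValueError("checksum does not match these parameters; dry-run first")
    kept = [p for p in partitions if p not in doomed]
    return {"mode": "execute", "deleted": doomed, "kept": kept}


# Hypothetical parameters and partitions, loosely modelled on the slide.
params = {"database": "event", "tables": "menuClicks", "older_than": 90}
partitions = [
    {"day": date(2019, 1, 1), "wiki": "en.wikipedia"},
    {"day": date(2019, 5, 1), "wiki": "en.wikipedia"},
]
today = date(2019, 6, 1)

dry = run(params, partitions, today)                             # step 1: dry-run
done = run(params, partitions, today, execute=dry["checksum"])   # step 2: execute
```

Because the checksum is derived from the parameters, changing any flag between the dry run and the execution invalidates it, so what gets deleted is what was previewed.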
  9. Behavioural data pipeline (diagram). Components shown: Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), HDFS, Events (<90 days), Allow-list, Sanitized Events. Code: https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job
  10. Unsanitized

    field       Unsanitized
    date        2019-01-01
    ip          31.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    wiki        en.wikipedia
    action      click
    target      menu
  11. Do-not-allow-list

    field       Unsanitized
    date        2019-01-01
    ip          31.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    wiki        en.wikipedia
    action      click
    target      menu
  12. Do-not-allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                NULL
    user_agent  Mozilla/5.0 (X11; Linux ...   NULL
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
  13. Do-not-allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                NULL
    user_agent  Mozilla/5.0 (X11; Linux ...   NULL
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
    cookie_id   724310                        724310
  14. Allow-list

    field       Unsanitized
    date        2019-01-01
    ip          31.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    wiki        en.wikipedia
    action      click
    target      menu
  15. Allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                NULL
    user_agent  Mozilla/5.0 (X11; Linux ...   NULL
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
  16. Allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                NULL
    user_agent  Mozilla/5.0 (X11; Linux ...   NULL
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
    cookie_id   724310                        NULL
  17. Allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                Spain
    user_agent  Mozilla/5.0 (X11; Linux ...   NULL
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
    cookie_id   724310                        NULL
  18. Allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                Spain
    user_agent  Mozilla/5.0 (X11; Linux ...   Linux
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
    cookie_id   724310                        NULL
  19. Allow-list

    field       Unsanitized                   Sanitized
    date        2019-01-01                    2019-01-01
    ip          31.214.189.167                Spain
    user_agent  Mozilla/5.0 (X11; Linux ...   Linux
    wiki        en.wikipedia                  en.wikipedia
    action      click                         click
    target      menu                          menu
    cookie_id   724310                        8d56ab209e10
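The allow-list idea walked through on slides 14 to 19 (keep only fields that are explicitly listed, optionally transforming them, and null everything else) might look roughly like this in Python. The rule names (`keep`, `hash`), the salt, and the use of HMAC-SHA256 are assumptions made for illustration, and the IP-to-country and user-agent-to-OS transformations from the slides are left out; the actual job lives in the refinery-job repository linked on slide 9 and may work differently.

```python
import hashlib
import hmac

# Hypothetical allow-list: field name -> rule. Fields not listed are nulled.
ALLOW_LIST = {
    "date": "keep",
    "wiki": "keep",
    "action": "keep",
    "target": "keep",
    "cookie_id": "hash",
}


def sanitize(event: dict, allow_list: dict, salt: bytes) -> dict:
    """Return a copy of `event` in which only allow-listed fields survive."""
    clean = {}
    for field, value in event.items():
        rule = allow_list.get(field)
        if rule == "keep":
            clean[field] = value
        elif rule == "hash" and value is not None:
            # Keyed hash, truncated to 12 hex characters for readability.
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            clean[field] = digest.hexdigest()[:12]
        else:
            clean[field] = None  # unlisted field: nulled by default
    return clean


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,
    "new_field": "added later by a client",  # hypothetical field nobody reviewed
}
sanitized = sanitize(event, ALLOW_LIST, salt=b"rotate-me")
```

A field that a client starts sending later, such as `new_field` here, comes out as `None` without anyone updating a list, which is the opposite of what happens to `cookie_id` under the do-not-allow-list on slide 13.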
  20. Privacy is not the responsibility of one team. All processes and metrics take privacy into account from the beginning until the end.
  21. Diagram: Unique Device sends UUID, REQ.

    SELECT page_title, uuid FROM database.table WHERE date = '2019-01-01' AND uuid = <some>
  22. LAST ACCESS (today: 2020-10-15). Diagram: Unique Device sends LA, REQ. Last-Access value shown: 2020-09-01.

    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  23. LAST ACCESS (today: 2020-10-15). Diagram: Unique Device sends LA, REQ. Last-Access value shown: 2020-09-01.

    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  24. LAST ACCESS (today: 2020-10-15). Diagram: Unique Device sends LA, REQ. Last-Access value shown: 2020-10-15.

    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
  25. LAST ACCESS (today: 2020-10-15). Diagram: Unique Device sends LA, REQ. Last-Access value shown: 2020-10-15.

    Timestamp   IP       Page     Cookies
    2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
    2020-10-15  776.9.*  Everest  Last-Access=2020-10-15
  26. LAST ACCESS (today: 2020-10-15). Diagram: Unique Device sends LA, REQ.

    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < date) AND date = '2020-10-15';
  27. Unique Device

       Timestamp   IP       Page     Cookies
    -> 2020-10-15  776.9.*  Titanic  Last-Access=2020-09-01
       2020-10-15  776.9.*  Everest  Last-Access=2020-10-15

    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < date) AND date = '2020-10-15';
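The last-access counting shown on slides 22 to 27 can be sketched as follows: a request counts toward the day's unique devices when it carries no Last-Access cookie or one dated before the request, after which the cookie is set to the request date. The function and variable names are hypothetical, and a real deployment would set the cookie in an HTTP response rather than in a local variable.

```python
from datetime import date


def handle_request(request_date, cookie_last_access):
    """Return (counts_as_unique, new_cookie_value) for a single request."""
    counts = cookie_last_access is None or cookie_last_access < request_date
    return counts, request_date


def daily_unique_devices(requests, day):
    """Mirror of the COUNT(*) query: rows for `day` with a missing or older cookie."""
    return sum(
        1
        for req_date, last_access in requests
        if req_date == day and (last_access is None or last_access < req_date)
    )


today = date(2020, 10, 15)
cookie = date(2020, 9, 1)  # value on the device before today's first visit
log = []
for page in ["Titanic", "Everest"]:
    log.append((today, cookie))                # server logs the cookie it received
    _, cookie = handle_request(today, cookie)  # then refreshes it to today

unique_today = daily_unique_devices(log, today)
```

Two page views from the same device yield one counted row, and the count relies only on a date, not on a per-device identifier like the UUID on slide 21.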
  28. The Lean Data Diet

    Pro:
    - Less work related to data requests
    - Easier to make data public
    - Guarantee of Privacy
    Con:
    - Extra work
    - Privacy culture needs time
    - Data Analysis needs a different mindset