Wikipedia's Lean Data Diet and Lessons Learned

nuria_ruiz
October 08, 2021

Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stance on privacy allows users to access and modify a wiki anonymously, without fear of giving away personal information, editorship or browsing history. As of this writing, readers and editors send more than 2000 custom analytics events per second to the Wikipedia analytics pipeline, constantly feeding 200+ data sets. That is in addition to the 10 billion (US) web request logs that are ingested daily into the Hadoop cluster and used to populate several important tools, such as the public analytics APIs [1][2][3]. The long-term existence of this data is key to the work the foundation does to assess product efforts. Beyond that, Wikipedia's public data feeds are used by researchers all over the world and, most importantly, community members need data to better target their editing efforts. Is it possible to retain value from these data sets when they are controlled by strict privacy policies?

[1] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[2] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Unique_Devices
[3] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

Transcript

  1. Lean Data: Do I need this data to provide the value I’m trying to deliver?
  2. “Imagine a world in which every single human being can freely share in the sum of all knowledge”.
  3. Free Knowledge Movement Core Beliefs around Privacy: Should not have to provide any information to participate in free knowledge movement. There cannot be access to free knowledge without a strong guarantee of privacy.
  4. Free Knowledge Movement Core Beliefs around Privacy: Should not have to provide any information to participate in free knowledge movement. There cannot be access to free knowledge without a strong guarantee of privacy.
  5. Privacy Guarantees: Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  6. Privacy Guarantees: Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  7. If AWS goes dark tomorrow, Wikipedia will continue working just the same. https://github.com/wikimedia/puppet
  8. Privacy Guarantees: Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  9. Usage Data - Behavioural
    username    pepito_23
    ip_address  3x.214.189.167
    user_agent  Mozilla/5.0 (X11; Linux ...
    session_id  8c878625792be023
    button      red-top-right
    ui_skin     minerva
  10. --dry-run undef -> execute
    --tables-to-delete undef -> all
    --execute undef -> dry-run
    --tables-to-delete undef -> none, * -> all
  11. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
    Executing tests…
    Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished.
    Parameter checksum: 57ca7987d987e9e98a6c79
    --execute=<checksum>
  12. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
    Executing tests…
    Tests passed.
    Starting DRY-RUN.
    Checking partitions to delete…
    Partitions that would be deleted by execution:
    - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
    - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
    - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
    - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
    DRY-RUN finished.
    Parameter checksum: 57ca7987d987e9e98a6c79
    --execute=<checksum>
    #1 Dry-run, #2 Execute
  13. Behavioural data (pipeline diagram). Components shown: Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), Allow-list, HDFS, Events <90 days, Sanitized Events. https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job
  14. Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu
  15. Do-not-allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu
  16. Do-not-allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu
    Sanitized:    date 2019-01-01 | ip NULL | user_agent NULL | wiki en.wikipedia | action click | target menu
  17. Do-not-allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu | cookie_id 724310
    Sanitized:    date 2019-01-01 | ip NULL | user_agent NULL | wiki en.wikipedia | action click | target menu | cookie_id 724310
  18. Allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu
  19. Allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu
    Sanitized:    date 2019-01-01 | ip NULL | user_agent NULL | wiki en.wikipedia | action click | target menu
  20. Allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu | cookie_id 724310
    Sanitized:    date 2019-01-01 | ip NULL | user_agent NULL | wiki en.wikipedia | action click | target menu | cookie_id NULL
  21. Allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu | cookie_id 724310
    Sanitized:    date 2019-01-01 | ip Spain | user_agent NULL | wiki en.wikipedia | action click | target menu | cookie_id NULL
  22. Allow-list
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu | cookie_id 724310
    Sanitized:    date 2019-01-01 | ip Spain | user_agent Linux | wiki en.wikipedia | action click | target menu | cookie_id NULL
  23. Allow-list #
    Unsanitized:  date 2019-01-01 | ip 31.214.189.167 | user_agent Mozilla/5.0 (X11; Linux ... | wiki en.wikipedia | action click | target menu | cookie_id 724310
    Sanitized:    date 2019-01-01 | ip Spain | user_agent Linux | wiki en.wikipedia | action click | target menu | cookie_id 8d56ab209e10
  24. Privacy Guarantees: Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  25. (Last) Lesson Learned: Privacy cannot be the responsibility of one team. It is a shared commitment. Tech is easy, building culture is hard.
  26. Unique Device: UUID
    Diagram labels: UUID, REQ; UUID
    SELECT page_title, uuid FROM database.table WHERE date = '2021-03-01' AND uuid = <some>
  27. Unique Device: LAST ACCESS (today: 2021-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2021-09-01; 2021-09-01
    Timestamp   IP       Page     Cookies
    2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
  28. Unique Device: LAST ACCESS (today: 2021-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2021-10-15; 2021-10-15
    Timestamp   IP       Page     Cookies
    2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
  29. Unique Device: LAST ACCESS (today: 2021-10-15)
    Diagram labels: LA, REQ; LA, REQ; 2021-10-15; 2021-10-15
    Timestamp   IP       Page     Cookies
    2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
    2021-10-15  123.9.*  Everest  Last-Access=2021-10-15
  30. Unique Device: LAST ACCESS (today: 2021-10-15)
    Diagram labels: LA, REQ; LA, REQ
    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < timestamp) AND date = '2021-10-15';
  31. Unique Device
    Timestamp      IP       Page     Cookies
    -> 2021-10-15  776.9.*  Titanic  Last-Access=2021-09-01
       2021-10-15  123.9.*  Everest  Last-Access=2021-10-15
    SELECT COUNT(*) FROM database.table WHERE (last_access_date IS NULL OR last_access_date < timestamp) AND date = '2021-10-15';
  32. Strong Privacy Stand: The Lean Data Diet
    Pro: Less work related to data requests. Easier to make data public. Privacy as a hidden feature.
    Con: Extra work. Building privacy culture takes time. Data analysis needs a different mindset.
  33. Bonus Lesson: You are never done with data quality issues, but solving the big ones will help tremendously with other, seemingly unrelated, problems.
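
Slides 10 to 12 outline a two-step deletion workflow: running without `--execute` is a dry run that lists what would be deleted and prints a parameter checksum, and the real deletion only happens when that checksum is passed back. The following Python sketch illustrates the idea under stated assumptions; it is not the actual Wikimedia script. The function names, the use of truncated SHA-256 as the checksum, and the in-memory partition list are all hypothetical.

```python
import hashlib
import json
from datetime import date, timedelta


def parameter_checksum(params: dict) -> str:
    """Hash the parameters so an execute call must match an earlier dry run.
    (Assumption: truncated SHA-256; the slides do not name the algorithm.)"""
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def select_partitions(partitions, params, today):
    """Partitions of the listed tables older than the cutoff.
    An empty table list selects nothing (undef -> none)."""
    cutoff = today - timedelta(days=params["older_than"])
    return [
        p for p in partitions
        if p["table"] in params["tables"] and p["day"] < cutoff
    ]


def drop_partitions(partitions, params, today, execute=None):
    """Dry run by default; delete only when `execute` equals the checksum."""
    doomed = select_partitions(partitions, params, today)
    checksum = parameter_checksum(params)
    if execute is None:
        # Safe default: report what would happen and how to confirm it.
        return {"mode": "dry-run", "would_delete": doomed, "checksum": checksum}
    if execute != checksum:
        raise ValueError("checksum does not match these parameters")
    remaining = [p for p in partitions if p not in doomed]
    return {"mode": "execute", "deleted": doomed, "remaining": remaining}


# Hypothetical data: one partition older than 90 days, one newer.
params = {"database": "event", "tables": ["menuClicks"], "older_than": 90}
partitions = [
    {"table": "menuClicks", "day": date(2019, 1, 1)},
    {"table": "menuClicks", "day": date(2019, 3, 30)},
]
today = date(2019, 4, 15)

preview = drop_partitions(partitions, params, today)           # step 1: dry run
result = drop_partitions(partitions, params, today,
                         execute=preview["checksum"])          # step 2: execute
print(preview["mode"], result["mode"])
```

Because the checksum is derived from the parameters, changing any flag between the dry run and the execute step invalidates the confirmation, so the operator always deletes exactly what was previewed.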
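
Slides 14 to 23 contrast a do-not-allow-list, where a newly added field such as `cookie_id` slips through unsanitized, with an allow-list, where every field is nulled unless it is explicitly listed, optionally with a transformation (IP to country, user agent to OS, cookie id to a hash). Here is a minimal Python sketch of the allow-list approach. It is an illustration only, not the refinery-job code: the helper names, the hard-coded country lookup, the substring-based OS detection, and the keyed hash are assumptions made for the example.

```python
import hashlib
import hmac


def keep(value):
    """Allow-listed as-is."""
    return value


def to_country(ip):
    """Hypothetical stand-in for a real geolocation lookup."""
    return {"31.214.189.167": "Spain"}.get(ip)


def to_os_family(user_agent):
    """Very rough OS detection by substring; illustrative only."""
    if not user_agent:
        return None
    for family in ("Android", "Windows", "Mac OS X", "Linux"):
        if family in user_agent:
            return family
    return None


def keyed_hash(secret: bytes):
    """Build a transform that replaces a value with a truncated HMAC-SHA256.
    (Assumption: the slides show a hashed cookie_id but not the method.)"""
    def transform(value):
        digest = hmac.new(secret, str(value).encode("utf-8"), hashlib.sha256)
        return digest.hexdigest()[:12]
    return transform


def sanitize(event: dict, allow_list: dict) -> dict:
    """Apply the listed transform to allow-listed fields; null everything else."""
    return {
        field: allow_list[field](value) if field in allow_list else None
        for field, value in event.items()
    }


allow_list = {
    "date": keep,
    "wiki": keep,
    "action": keep,
    "target": keep,
    "ip": to_country,
    "user_agent": to_os_family,
    "cookie_id": keyed_hash(b"example-secret"),
}

event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,
    "new_field": "added later, never reviewed",  # not on the allow-list
}

sanitized = sanitize(event, allow_list)
print(sanitized)
```

The design choice is that the default is to drop: a field nobody reviewed, like `new_field` here, comes out as NULL rather than being retained past 90 days by accident.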
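
Slides 27 to 31 describe counting unique devices without a per-device identifier: each request carries a Last-Access cookie holding only a date, and a request counts as a new device for the day when that cookie is missing or older than the request date. The Python sketch below mirrors the `SELECT COUNT(*)` query on those slides. The `Request` type and function names are hypothetical, and the two sample rows are the ones shown on slide 29.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Request(NamedTuple):
    day: date                    # date of the request
    page: str
    last_access: Optional[date]  # value of the Last-Access cookie, if any


def is_first_visit_of_day(req: Request) -> bool:
    """True when the cookie is absent or dated before the request day."""
    return req.last_access is None or req.last_access < req.day


def count_unique_devices(requests: List[Request], day: date) -> int:
    """Count requests on `day` that are the device's first visit that day."""
    return sum(1 for r in requests if r.day == day and is_first_visit_of_day(r))


def next_cookie_value(req: Request) -> date:
    """The response resets Last-Access to the request day."""
    return req.day


today = date(2021, 10, 15)
requests = [
    Request(today, "Titanic", date(2021, 9, 1)),  # first visit today: counted
    Request(today, "Everest", today),             # same device, later: not counted
]

unique_today = count_unique_devices(requests, today)
print(unique_today)
```

Under this sketch a client that never stores the cookie would be counted on every request; the slides do not cover how that case is handled.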