Wikipedia's Lean Data Diet and Lessons Learned

nuria_ruiz
October 08, 2021

Privacy is one of the lesser-known charms of Wikipedia. Wikipedia’s stand on privacy allows users to access and modify a wiki anonymously, without fear of giving away personal information, editorship or browsing history. As of this writing, readers and editors send more than 2,000 custom analytics events per second to the Wikipedia analytics pipeline, constantly feeding 200+ data sets. That is in addition to the 10 billion (US) web request logs that are ingested daily into the Hadoop cluster and used to populate several important tools, such as the public analytics APIs [1][2][3]. The long-term existence of this data is key to the work the Foundation does to assess product efforts. Beyond that, Wikipedia’s public data feeds are used by researchers all over the world and, most importantly, community members need data to better target their editing efforts. Is it possible to retain value from these data sets when they are controlled by strict privacy policies?

[1] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
[2] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Unique_Devices
[3] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

Transcript

  1. Wikipedia's Lean Data Diet and Lessons Learned

  2. Lean Data: Do I need this data to provide the value I’m trying to deliver?
  3. 6 years leading data engineering @ analytics@lists.wikimedia.org

  4. None
  5. None
  6. Free Knowledge Movement

  7. “Imagine a world in which every single human being can freely share in the sum of all knowledge”.
  8. Free Knowledge Movement: Core Beliefs around Privacy. Should not have to provide any information to participate in the free knowledge movement. There cannot be access to free knowledge without a strong guarantee of privacy.
  9. Privacy is a Hidden Feature

  10. <interlude>

  11. Anyone can edit!

  12. https://dumps.wikimedia.org

  13. Usage Data - Web Request
      project      es.wikipedia
      ip_address   3x.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      page         COVID-19
  14. </interlude>

  15. Free Knowledge Movement: Core Beliefs around Privacy. Should not have to provide any information to participate in the free knowledge movement. There cannot be access to free knowledge without a strong guarantee of privacy.
  16. How is this guarantee of Privacy Expressed?

  17. https://foundation.wikimedia.org/wiki/Privacy_policy

  18. Build the wiki way

  19. Build the wiki way. Discussion total: 150,000 words
  20. Privacy Guarantees. Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  21. In Practice, the Privacy Policy has strong implications on how

    we do engineering
  22. Lesson Learned #1: Principles are useful when choice is needed

  23. Privacy Guarantees. Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  24. Wikipedia runs on-prem https://github.com/wikimedia/puppet

  25. If tomorrow AWS goes black Wikipedia will continue working just

    the same https://github.com/wikimedia/puppet
  26. Privacy Guarantees. Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  27. Lesson Learned #2: Deleting (and aggregating) data at scale is

    quite hard
  28. Deleting Data At Scale!

  29. Usage Data - Web Request
      project      es.wikipedia
      ip_address   3x.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      page         COVID-19
  30. Usage Data - Behavioural
      username     pepito_23
      ip_address   3x.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      session_id   8c878625792be023
      button       red-top-right
      ui_skin      minerva
  31. 200,000 web requests PER sec (at peak)

  32. 200,000 web requests PER sec (at peak) 4,000 analytics events

    PER sec
  33. Deleting Data Deleting Data Are you sure? Cancel Delete

  34. --dry-run           undef -> execute
      --tables-to-delete  undef -> all

      --execute           undef -> dry-run
      --tables-to-delete  undef -> none
                          *     -> all
  35. --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
      Executing tests…
      Tests passed.
      Starting DRY-RUN.
      Checking partitions to delete…
      Partitions that would be deleted by execution:
      - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
      - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
      - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
      DRY-RUN finished.
      Parameter checksum: 57ca7987d987e9e98a6c79
      --execute=<checksum>
  36. #1 Dry-run, #2 Execute
      --database=event --tables=menuClicks --wikis=en.wikipedia --older-than=90 --skip-trash=true
      Executing tests…
      Tests passed.
      Starting DRY-RUN.
      Checking partitions to delete…
      Partitions that would be deleted by execution:
      - year=2019, month=1, day=1, hour=0, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=0, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=0, wiki=de.wikibooks
      - year=2019, month=1, day=1, hour=1, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=1, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=1, wiki=de.wikibooks
      - year=2019, month=1, day=1, hour=2, wiki=en.wikipedia
      - year=2019, month=1, day=1, hour=2, wiki=es.wiktionary
      - year=2019, month=1, day=1, hour=2, wiki=de.wikibooks
      DRY-RUN finished.
      Parameter checksum: 57ca7987d987e9e98a6c79
      --execute=<checksum>
  37. Sanitizing Data

  38. [Diagram: analytics data pipeline. Components: Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), HDFS]

  39. [Diagram: behavioural data pipeline with sanitization. Components: Clients, HTTP Beacon Endpoint, Varnish, Varnishkafka, Kafka, Event Processor (Spark), Allow-list, HDFS holding Events <90 days and Sanitized Events] https://github.com/wikimedia/analytics-refinery-source/tree/master/refinery-job
  40. Unsanitized
      date         2019-01-01
      ip           31.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      wiki         en.wikipedia
      action       click
      target       menu
  41. Do-not-allow-list: Unsanitized
      date         2019-01-01
      ip           31.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      wiki         en.wikipedia
      action       click
      target       menu
  42. Do-not-allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                NULL
      user_agent   Mozilla/5.0 (X11; Linux ...   NULL
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
  43. Do-not-allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                NULL
      user_agent   Mozilla/5.0 (X11; Linux ...   NULL
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
      cookie_id    724310                        724310
  44. Allow-list: Unsanitized
      date         2019-01-01
      ip           31.214.189.167
      user_agent   Mozilla/5.0 (X11; Linux ...
      wiki         en.wikipedia
      action       click
      target       menu
  45. Allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                NULL
      user_agent   Mozilla/5.0 (X11; Linux ...   NULL
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
  46. Allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                NULL
      user_agent   Mozilla/5.0 (X11; Linux ...   NULL
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
      cookie_id    724310                        NULL
  47. Allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                Spain
      user_agent   Mozilla/5.0 (X11; Linux ...   NULL
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
      cookie_id    724310                        NULL
  48. Allow-list
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                Spain
      user_agent   Mozilla/5.0 (X11; Linux ...   Linux
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
      cookie_id    724310                        NULL
  49. Allow-list #
      field        Unsanitized                   Sanitized
      date         2019-01-01                    2019-01-01
      ip           31.214.189.167                Spain
      user_agent   Mozilla/5.0 (X11; Linux ...   Linux
      wiki         en.wikipedia                  en.wikipedia
      action       click                         click
      target       menu                          menu
      cookie_id    724310                        8d56ab209e10
  50. Sanitizing Data: Advanced

  51. Privacy Guarantees. Read or edit without account. Register account without name, email or any other info. Never selling/sharing your info with third parties. After at most 90 days, user data will be deleted, aggregated, or de-identified.
  52. (Last) Lesson Learned: Privacy cannot be the responsibility of one team. It is a shared commitment. Tech is easy, building culture is hard.
  53. Privacy First Metric Computation

  54. Unique Device - DAU or MAU
      SELECT COUNT(DISTINCT uuid) FROM database.table WHERE date = '2021-03-01';
      [Diagram labels: UUID, REQ; UUID, REQ; UUID]
  55. Unique Device
      SELECT page_title, uuid FROM database.table WHERE date = '2021-03-01' AND uuid = <some>
      [Diagram labels: UUID, REQ; UUID]
  56. Unique Device: LAST ACCESS 2021-09-01 https://diff.wikimedia.org/2016/03/30/unique-devices-dataset/

  57. Unique Device: LAST ACCESS (today: 2021-10-15)
      [Diagram labels: LA, REQ; LA, REQ; 2021-09-01; 2021-09-01]
  58. Unique Device: LAST ACCESS (today: 2021-10-15)
      [Diagram labels: LA, REQ; LA, REQ; 2021-09-01; 2021-09-01]
      Timestamp    IP        Page      Cookies
      2021-10-15   776.9.*   Titanic   Last-Access=2021-09-01
  59. Unique Device: LAST ACCESS (today: 2021-10-15)
      [Diagram labels: LA, REQ; LA, REQ; 2021-10-15; 2021-10-15]
      Timestamp    IP        Page      Cookies
      2021-10-15   776.9.*   Titanic   Last-Access=2021-09-01
  60. Unique Device: LAST ACCESS 2021-10-15

  61. Unique Device: LAST ACCESS (today: 2021-10-15)
      [Diagram labels: LA, REQ; LA, REQ; 2021-10-15; 2021-10-15]
  62. Unique Device: LAST ACCESS (today: 2021-10-15)
      [Diagram labels: LA, REQ; LA, REQ; 2021-10-15; 2021-10-15]
      Timestamp    IP        Page      Cookies
      2021-10-15   776.9.*   Titanic   Last-Access=2021-09-01
      2021-10-15   123.9.*   Everest   Last-Access=2021-10-15
  63. Unique Device: LAST ACCESS (today: 2021-10-15)
      SELECT COUNT(*) FROM database.table
      WHERE (last-access-date IS NULL OR last-access-date < timestamp)
        AND date = '2021-10-15';
      [Diagram labels: LA, REQ; LA, REQ]
  64. Unique Device
         Timestamp    IP        Page      Cookies
      -> 2021-10-15   776.9.*   Titanic   Last-Access=2021-09-01
         2021-10-15   123.9.*   Everest   Last-Access=2021-10-15
      SELECT COUNT(*) FROM database.table
      WHERE (last-access-date IS NULL OR last-access-date < timestamp)
        AND date = '2021-10-15';
  65. Strong Privacy Stand: The Lean Data Diet
      Pros:
      - Less work related to data requests
      - Easier to make data public
      - Privacy as a hidden feature
      Cons:
      - Extra work
      - Building Privacy culture takes time
      - Data Analysis needs a different mindset
  66. Questions? https://xkcd.com/285 @pantojacoder All pictures https://creativecommons.org/publicdomain/zero/1.0/

  67. Bonus Lesson: You are never done with data quality issues, but solving the big ones will help tremendously with other, seemingly unrelated, problems.
  68. Example: Vandalism

  69. None
  70. Questions? https://xkcd.com/285 @pantojacoder All pictures https://creativecommons.org/publicdomain/zero/1.0/

  71. None
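
Slides 34 to 36 describe a deletion tool that defaults to a dry-run and only executes when it is handed the parameter checksum printed by that dry-run. The following is a minimal sketch of that idea, not the Foundation's actual script: the function names, the use of SHA-256, the truncation length and the representation of partitions as plain dates are all assumptions made for illustration.

```python
import hashlib
from datetime import date, timedelta


def parameter_checksum(params):
    """Hash the parameters in sorted order so equal arguments give equal checksums."""
    canonical = "&".join(f"{key}={params[key]}" for key in sorted(params))
    # Truncation length is arbitrary (assumption); the real tool may differ.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:22]


def partitions_older_than(partition_days, older_than_days, today):
    """Return the partition dates that fall before the retention cutoff."""
    cutoff = today - timedelta(days=older_than_days)
    return sorted(day for day in partition_days if day < cutoff)


def drop_partitions(params, partition_days, today, execute=None):
    """Dry-run unless `execute` carries the checksum printed by a previous dry-run."""
    doomed = partitions_older_than(partition_days, params["older_than"], today)
    checksum = parameter_checksum(params)
    if execute is None:
        # Safe default: report what would be deleted and change nothing.
        return {"mode": "dry-run", "would_delete": doomed, "checksum": checksum}
    if execute != checksum:
        raise ValueError("checksum does not match these parameters; run a dry-run first")
    # A real job would drop the partitions here; this sketch only reports them.
    return {"mode": "execute", "deleted": doomed}


# Hypothetical parameters and partitions, loosely modelled on slide 35.
params = {"database": "event", "tables": "menuClicks", "older_than": 90}
days = [date(2019, 1, 1), date(2021, 10, 1)]
report = drop_partitions(params, days, today=date(2021, 10, 15))
result = drop_partitions(params, days, today=date(2021, 10, 15), execute=report["checksum"])
```

Tying execution to the checksum means an operator cannot change a flag between the dry-run and the real run without being forced to look at a fresh dry-run report first.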
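
Slides 41 to 49 contrast a do-not-allow-list, where a newly added field such as cookie_id survives sanitization by default, with an allow-list, where anything not explicitly listed becomes NULL. A rough sketch of the allow-list approach follows. The rule names `keep` and `hash`, the `sanitize` function, the salt handling and the 12-character digest are assumptions for illustration and are not taken from the refinery code; the coarsening shown on slides 47 and 48 (IP to country, user agent to OS) is left out because it needs external lookup data.

```python
import hashlib
import hmac

KEEP = "keep"  # copy the value through unchanged
HASH = "hash"  # replace the value with a salted hash

# Hypothetical allow-list: every field not named here is nulled.
ALLOW_LIST = {
    "date": KEEP,
    "wiki": KEEP,
    "action": KEEP,
    "target": KEEP,
    "cookie_id": HASH,
}


def sanitize(event, allow_list, salt):
    """Return a copy of `event` where only allow-listed fields keep information."""
    sanitized = {}
    for field, value in event.items():
        rule = allow_list.get(field)
        if rule is None or value is None:
            # Unknown fields are dropped by default, including ones added later.
            sanitized[field] = None
        elif rule == KEEP:
            sanitized[field] = value
        elif rule == HASH:
            digest = hmac.new(salt, str(value).encode("utf-8"), hashlib.sha256)
            sanitized[field] = digest.hexdigest()[:12]
        else:
            raise ValueError(f"unknown allow-list rule for {field}: {rule}")
    return sanitized


event = {
    "date": "2019-01-01",
    "ip": "31.214.189.167",
    "user_agent": "Mozilla/5.0 (X11; Linux ...",
    "wiki": "en.wikipedia",
    "action": "click",
    "target": "menu",
    "cookie_id": 724310,
}
clean = sanitize(event, ALLOW_LIST, salt=b"example-salt")
```

With this shape, adding a new field to an event schema is safe by default: until someone adds it to the allow-list, it is nulled when the data is sanitized.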
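
Slides 56 to 64 count unique devices without any per-device identifier: each request carries only a Last-Access date cookie, and a device is counted the first time it shows up on a given day. The sketch below mirrors the query on slide 63 in plain Python. The record layout and the function name are assumptions, and the production computation may apply refinements that the slides do not show.

```python
from datetime import date


def count_unique_devices(requests, day):
    """Count requests on `day` whose Last-Access cookie is missing or older than `day`.

    Each device contributes at most one such request per day, because the
    server resets the cookie to today's date when it answers.
    """
    count = 0
    for request in requests:
        if request["date"] != day:
            continue
        last_access = request.get("last_access")
        if last_access is None or last_access < day:
            count += 1
    return count


today = date(2021, 10, 15)
# The two rows shown on slide 62.
requests = [
    {"date": today, "page": "Titanic", "last_access": date(2021, 9, 1)},
    {"date": today, "page": "Everest", "last_access": today},
]
devices_today = count_unique_devices(requests, today)  # 1: only the Titanic request counts
```

The Everest request is skipped because its cookie already carries today's date, meaning the same device was counted earlier in the day. No UUID is stored, so the per-device browsing history query shown on slide 55 cannot be written against this data.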