Privacy and Data Science - PyData London

Jason McFall
February 07, 2017

Data science on customer data opens up huge opportunities, both for economic benefit and social good. But as datasets become richer, individual privacy comes under threat, and indeed responsible organisations are blocked from innovating because they have no way to guarantee privacy.

Technology has created this problem, and technology can solve it.

I talk about Privacy Engineering techniques that enable the safe and effective use of data, including tokenisation and masking, statistical generalisation and blurring of data (such as k-anonymity), controlled privacy-preserving querying of data (such as differential privacy), homomorphic encryption and randomised response. I describe the state of the art, and outline the hard problems that must be solved next.

  1. Privacy and Data Science. 31st PyData London Meetup, 7 Feb 2017. Jason McFall, CTO, Privitar. [email protected]
  2. Credit: James Cridland

  3. None
  4. Egor Tsvetkov https://birdinflight.com/ru/vdohnovenie/fotoproect/06042016-face-big-data.html

  5. Credit: Jorge Láscar

  6. ©Egor Tsvetkov

  7. ©Egor Tsvetkov

  8. ©Egor Tsvetkov

  9. ©Egor Tsvetkov

  10. None
  11. http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

  12. None
  13. None
  14. Using Private Data for Good

  15. https://www.newscientist.com/article/2086454-revealed-google-ai-has-access-to-huge-haul-of-nhs-patient-data/

    https://deepmind.com/health

  16. Replace this

  17. Location data

  18. None
  19. Credit: Highways Agency

  20. Human mobility dynamics in Pakistan. Amy Wesolowski et al., "Impact of human mobility on the emergence of dengue epidemics in Pakistan", PNAS 2015;112:11887-11892. http://www.pnas.org/content/112/38/11887.long ©2015 by National Academy of Sciences
  21. None
  22. Publish and be Damned?

  23. Credit: Matthew W. Hutchins, Harvard Law Record

  24. Hospital visits: Date of birth, Gender, ZIP code, Ethnicity, Visit date, Diagnosis, Procedure, Medication

    Voter registration: Name, Address, Phone Number, Date of birth, Gender, ZIP code
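The two field lists on slide 24 share date of birth, gender and ZIP code, which is what makes linkage possible. A minimal sketch of such a join follows; the records, field names and the `link` helper are hypothetical and are not taken from the talk.

```python
# Hypothetical records, invented for illustration only.
hospital_visits = [
    {"dob": "1961-03-14", "gender": "F", "zip": "02138", "diagnosis": "Asthma"},
    {"dob": "1975-11-02", "gender": "M", "zip": "02139", "diagnosis": "Cancer"},
]
voter_registration = [
    {"name": "Alice Example", "dob": "1961-03-14", "gender": "F", "zip": "02138"},
    {"name": "Bob Example", "dob": "1980-01-20", "gender": "M", "zip": "02139"},
]

# Fields present in both datasets (the overlap shown on the slide).
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Join the two datasets on the shared quasi-identifiers."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)

    matches = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the patient
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link(hospital_visits, voter_registration)
print(reidentified)  # [('Alice Example', 'Asthma')]
```

Neither dataset contains both a name and a diagnosis, yet the join produces the pair for any record whose quasi-identifier combination is unique.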
  25. None
  26. None
  27. https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

  28. https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

  29. Technology to the rescue?

  30. Step 0: Remove Identifiers

    Original data:
    John Smith   984598498  33  Male    N6 6DT  Heart disease
    James Brown  804528909  39  Male    N6 5PA  Cancer
    Sarah Jones  976234507  36  Female  N6 5LB  Asthma
    David Evans  789453297  38  Male    N6 4TA  HIV

    Names removed:
    984598498  33  Male    N6 6DT  Heart disease
    804528909  39  Male    N6 5PA  Cancer
    976234507  36  Female  N6 5LB  Asthma
    789453297  38  Male    N6 4TA  HIV

    Identifiers tokenised:
    dgkdakhkjf    33  Male    N6 6DT  Heart disease
    ajhfqricddk   39  Male    N6 5PA  Cancer
    mndbhbnai     36  Female  N6 5LB  Asthma
    lalhfkippfaj  38  Male    N6 4TA  HIV
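One common way to produce tokens like those on slide 30 is a keyed hash, so the same identifier always maps to the same token but cannot be reversed without the key. The sketch below assumes HMAC-SHA256; the talk does not say which scheme produced the slide's tokens, and `SECRET_KEY` and `tokenise` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be generated randomly and stored securely.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"


def tokenise(identifier: str, key: bytes = SECRET_KEY, length: int = 12) -> str:
    """Map an identifier to a consistent token using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability; shorter tokens collide more often at scale.
    return digest[:length]


# Rows from the slide: (name, identifier, age, gender, postcode, condition).
records = [
    ("John Smith", "984598498", 33, "Male", "N6 6DT", "Heart disease"),
    ("James Brown", "804528909", 39, "Male", "N6 5PA", "Cancer"),
    ("Sarah Jones", "976234507", 36, "Female", "N6 5LB", "Asthma"),
    ("David Evans", "789453297", 38, "Male", "N6 4TA", "HIV"),
]

# Drop the name and replace the identifier with its token.
tokenised = [(tokenise(pid), *rest) for _name, pid, *rest in records]

for row in tokenised:
    print(row)
```

Because the token is consistent, records for the same person can still be joined across tables. The remaining columns (age, gender, postcode) are untouched, so this step alone does not prevent the linkage shown earlier in the deck.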
  31. None
  32. Statistical Generalisation (k-anonymity)

    Before:
    33  Male    N6 6DT  Heart disease
    39  Male    N6 5PA  Cancer
    36  Female  N6 5LB  Asthma
    38  Male    N6 4TA  HIV

    After:
    30-39  *  N6  Heart disease
    30-39  *  N6  Cancer
    30-39  *  N6  Asthma
    30-39  *  N6  HIV

    l-diversity
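The generalisation on slide 32 can be sketched as follows, using the slide's four rows. The rules used here (ten-year age bands, suppressed gender, outward postcode only) are read off the slide's output, and the helper names are hypothetical.

```python
from collections import Counter

# Rows from the slide: (age, gender, postcode, condition).
rows = [
    (33, "Male", "N6 6DT", "Heart disease"),
    (39, "Male", "N6 5PA", "Cancer"),
    (36, "Female", "N6 5LB", "Asthma"),
    (38, "Male", "N6 4TA", "HIV"),
]


def generalise(row):
    """Coarsen the quasi-identifiers; leave the sensitive value unchanged."""
    age, _gender, postcode, condition = row
    decade = age // 10 * 10
    age_band = f"{decade}-{decade + 9}"
    outward_code = postcode.split()[0]  # "N6 6DT" -> "N6"
    return (age_band, "*", outward_code, condition)


def k_anonymity(table, n_quasi=3):
    """Size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(row[:n_quasi] for row in table)
    return min(groups.values())


def l_diversity(table, n_quasi=3):
    """Fewest distinct sensitive values found within any such group."""
    sensitive = {}
    for row in table:
        sensitive.setdefault(row[:n_quasi], set()).add(row[n_quasi])
    return min(len(values) for values in sensitive.values())


generalised = [generalise(row) for row in rows]
print(k_anonymity(rows), k_anonymity(generalised), l_diversity(generalised))  # 1 4 4
```

Before generalisation every row is unique (k = 1). Afterwards all four rows share one quasi-identifier combination (k = 4), and that group holds four distinct conditions (l = 4).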
  33. Analysis on Generalised Data

  34. None
  35. Tracker Attack Average Salary = £36,000 Average Salary = £35,300
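A tracker (differencing) attack compares two aggregate results that differ by a single person, as the two averages on slide 35 suggest. The slide does not give the group sizes, so the sketch below uses an invented staff table and a hypothetical aggregate-only query interface.

```python
# Hypothetical staff table (name, department, salary); values are invented.
STAFF = [
    ("ann", "eng", 30_000),
    ("bob", "eng", 35_000),
    ("cat", "eng", 40_000),
    ("dan", "eng", 45_000),
    ("eve", "eng", 50_000),
    ("fay", "ops", 28_000),
]
MIN_GROUP_SIZE = 3  # the interface refuses to answer about very small groups


def average_salary(predicate):
    """Aggregate-only query: returns (average, count) for the matching group."""
    selected = [salary for name, dept, salary in STAFF if predicate(name, dept)]
    if len(selected) < MIN_GROUP_SIZE:
        raise ValueError("group too small to release")
    return sum(selected) / len(selected), len(selected)


# Two permitted queries that differ only in whether they include the target.
avg_all, n_all = average_salary(lambda name, dept: dept == "eng")
avg_rest, n_rest = average_salary(lambda name, dept: dept == "eng" and name != "eve")

# Difference of the two group totals is the target's exact salary.
recovered = avg_all * n_all - avg_rest * n_rest
print(recovered)  # 50000.0
```

A minimum group size blocks the direct query about one person, but not the pair of larger queries whose difference isolates that person.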

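  36. Differential Privacy: $\frac{P(\text{result} \mid D)}{P(\text{result} \mid D')} \le 1 + \varepsilon$, where $D$ and $D'$ are datasets that differ in one individual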

  37. Add Laplace Noise to Query Result

    Adding noise of roughly 1/ε × (the effect any individual can have on the outcome) gives the desired ratio e^ε ≈ 1 + ε.

    [Plot: x-axis 30 to 42, y-axis 0 to 0.6]
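The rule of thumb on slide 37 corresponds to the Laplace mechanism: draw noise with scale (sensitivity / ε) and add it to the true result. A minimal standard-library sketch follows; the function names, the query value of 36 and the choice ε = 0.5 are assumptions for illustration.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_result(true_result, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to a query result.

    `sensitivity` is the largest effect any one individual can have on the
    result, e.g. 1 for a count.
    """
    if rng is None:
        rng = random.Random()
    return true_result + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)  # seeded only to make the demonstration repeatable
samples = [noisy_result(36, epsilon=0.5, rng=rng) for _ in range(20_000)]

mean = sum(samples) / len(samples)
mean_abs_error = sum(abs(s - 36) for s in samples) / len(samples)
print(round(mean, 1), round(mean_abs_error, 1))
```

Smaller ε means a larger noise scale and stronger privacy, at the cost of accuracy. Here the typical error is about sensitivity / ε = 2, while the noise averages out over many draws, which is why repeated queries must be budgeted in practice.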
  38. Replace this

  39. None
  40. Randomised Response: Answer honestly / Answer yes
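The two branches on slide 40 can be simulated as below. The sketch assumes each respondent privately flips a fair coin to choose between them; the slide text does not state the probability, and the population of 100,000 with a 30% true rate is invented.

```python
import random


def randomised_answer(truth, rng):
    """Heads: answer honestly. Tails: answer yes regardless of the truth."""
    if rng.random() < 0.5:
        return truth
    return True


def estimate_true_rate(answers):
    """Invert P(yes) = 0.5 * p + 0.5 to recover the population rate p."""
    yes_rate = sum(answers) / len(answers)
    return 2 * yes_rate - 1


rng = random.Random(2017)  # seeded only to make the demonstration repeatable
truths = [i < 30_000 for i in range(100_000)]  # hypothetical: 30% true "yes"
answers = [randomised_answer(t, rng) for t in truths]

estimate = estimate_true_rate(answers)
print(round(estimate, 2))  # close to 0.30
```

Any individual "yes" is deniable because it may have been forced by the coin, yet the analyst still recovers the overall rate from the aggregate.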

  41. https://github.com/google/rappor

  42. Some Lessons

  43. Lessons for Citizens

    1. Your data lives for a long time – maybe forever
    2. Linking datasets can be very revealing
    3. Today’s technology can match data and find sensitive patterns at huge scale
    4. Hold organisations to account!
  44. Lessons for Data Practitioners

    1. Only store data you need
    2. Always remove primary identifiers
    3. Aggregate and statistically anonymise data before sharing
       • You can’t foresee all future datasets and potential linkage risks
       • You can’t foresee the future power of machine learning on this data
    4. If data is too complex to anonymise:
       • extract the features of interest
       • anonymise and share only those
    5. Even better, don’t share the data itself, allow secure queries on the data
    6. Be open and clear about how you protect and use private data
  45. Please help. We’re looking to meet data scientists and analysts who are working with structured sensitive data, to discuss and user-test some of our advanced products. Please contact me at: [email protected]
  46. Thank you [email protected]