Privacy and Data Science - PyData London

Jason McFall
February 07, 2017

Data science on customer data opens up huge opportunities, both for economic benefit and social good. But as datasets become richer, individual privacy comes under threat, and indeed responsible organisations are blocked from innovating because they have no way to guarantee privacy.

Technology has created this problem, and technology can solve it.

I talk about Privacy Engineering techniques that enable the safe and effective use of data, including tokenisation and masking, statistical generalisation and blurring of data (such as k-anonymity), controlled privacy-preserving querying of data (such as differential privacy), homomorphic encryption and randomised response. I describe the state of the art, and outline the hard problems that must be solved next.

Transcript

  1. Privacy and Data Science
    31st PyData London Meetup
    7 Feb 2017
    Jason McFall
    CTO Privitar
    [email protected]

    View Slide

  2. Credit: James Cridland

    View Slide

  3. View Slide

  4. Egor Tsvetkov
    https://birdinflight.com/ru/vdohnovenie/fotoproect/06042016-face-big-data.html

    View Slide

  5. Credit: Jorge Láscar

    View Slide

  6. ©Egor Tsvetkov

    View Slide

  7. ©Egor Tsvetkov

    View Slide

  8. ©Egor Tsvetkov

    View Slide

  9. ©Egor Tsvetkov

    View Slide

  10. View Slide

  11. http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

    View Slide

  12. View Slide

  13. View Slide

  14. Using Private Data for Good

    View Slide

  15. https://www.newscientist.com/article/2086454-revealed-google-ai-has-access-to-huge-haul-of-nhs-patient-data/
    https://deepmind.com/health

    View Slide

  16. Replace this

    View Slide

  17. Location data

    View Slide

  18. View Slide

  19. Credit: Highways Agency

    View Slide

  20. Human mobility dynamics in Pakistan.
    Amy Wesolowski et al., "Impact of human mobility on the emergence of dengue epidemics in Pakistan", PNAS 2015;112:11887-11892
    http://www.pnas.org/content/112/38/11887.long
    ©2015 by National Academy of Sciences

    View Slide

  21. View Slide

  22. Publish and be Damned?

    View Slide

  23. Credit: Matthew W. Hutchins, Harvard Law Record

    View Slide

  24. Hospital visits: Date of birth, Gender, ZIP code, Ethnicity, Visit date, Diagnosis, Procedure, Medication
    Voter registration: Name, Address, Phone Number, Date of birth, Gender, ZIP code

    View Slide
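
The two attribute lists on this slide overlap in date of birth, gender and ZIP code, which is what makes a linkage attack possible. The following is a minimal sketch of that join in Python. Every record, name and value below is invented for illustration, and the field names (`dob`, `gender`, `zip`) are assumptions rather than anything shown in the deck.

```python
# Hypothetical records; all names and values are invented for illustration.
hospital_visits = [
    {"dob": "1950-03-14", "gender": "F", "zip": "02139", "diagnosis": "Asthma"},
    {"dob": "1962-11-02", "gender": "M", "zip": "02139", "diagnosis": "Cancer"},
    {"dob": "1962-11-02", "gender": "M", "zip": "02140", "diagnosis": "HIV"},
]
voter_registration = [
    {"name": "Alice Example", "dob": "1950-03-14", "gender": "F", "zip": "02139"},
    {"name": "Bob Example", "dob": "1962-11-02", "gender": "M", "zip": "02140"},
    {"name": "Carol Example", "dob": "1971-07-30", "gender": "F", "zip": "02141"},
]

# Attributes present in both datasets (the overlap on the slide).
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Join the two datasets on the shared quasi-identifiers.

    A voter counts as re-identified only when exactly one medical record
    shares their quasi-identifier values.
    """
    index = {}
    for rec in medical:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec)

    matches = []
    for voter in voters:
        candidates = index.get(tuple(voter[k] for k in keys), [])
        if len(candidates) == 1:
            matches.append((voter["name"], candidates[0]["diagnosis"]))
    return matches


reidentified = link(hospital_visits, voter_registration)
print(reidentified)
```

In this toy data two of the three voters are matched to a diagnosis even though the medical records contain no names.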

  25. View Slide

  26. View Slide

  27. https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

    View Slide

  28. https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

    View Slide

  29. Technology to the rescue?

    View Slide

  30. Step 0: Remove Identifiers
    Original records:
      John Smith 984598498 33 Male N6 6DT Heart disease
      James Brown 804528909 39 Male N6 5PA Cancer
      Sarah Jones 976234507 36 Female N6 5LB Asthma
      David Evans 789453297 38 Male N6 4TA HIV
    Names removed:
      984598498 33 Male N6 6DT Heart disease
      804528909 39 Male N6 5PA Cancer
      976234507 36 Female N6 5LB Asthma
      789453297 38 Male N6 4TA HIV
    ID numbers replaced with tokens:
      dgkdakhkjf 33 Male N6 6DT Heart disease
      ajhfqricddk 39 Male N6 5PA Cancer
      mndbhbnai 36 Female N6 5LB Asthma
      lalhfkippfaj 38 Male N6 4TA HIV

    View Slide
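
One possible way to carry out this step is sketched below in Python: drop the name and replace the ID number with a keyed hash, so the same ID always maps to the same token but cannot be reversed without the key. The deck does not say how its tokens were generated, so the use of HMAC-SHA256 is an assumption, as are the field names, the key and the 16-character token length.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would be generated randomly and
# stored separately from the data.
SECRET_KEY = b"replace-with-a-randomly-generated-secret"


def tokenise(identifier, key=SECRET_KEY):
    """Map an identifier to a consistent token using a keyed hash."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    # Truncated for readability; a shorter token raises the collision risk.
    return digest.hexdigest()[:16]


def remove_identifiers(record):
    """Return a copy of the record without the name and with a tokenised ID."""
    released = dict(record)
    del released["name"]
    released["id"] = tokenise(released["id"])
    return released


# Rows from the slide; the field names are assumptions.
records = [
    {"name": "John Smith", "id": "984598498", "age": 33, "gender": "Male",
     "postcode": "N6 6DT", "condition": "Heart disease"},
    {"name": "James Brown", "id": "804528909", "age": 39, "gender": "Male",
     "postcode": "N6 5PA", "condition": "Cancer"},
]

released = [remove_identifiers(r) for r in records]
for row in released:
    print(row)
```

Age, gender and postcode are left untouched here, so the released rows can still be linked to other datasets.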

  31. View Slide

  32. Statistical Generalisation (k-anonymity)
    Original:
      33 Male N6 6DT Heart disease
      39 Male N6 5PA Cancer
      36 Female N6 5LB Asthma
      38 Male N6 4TA HIV
    Generalised:
      30-39 * N6 Heart disease
      30-39 * N6 Cancer
      30-39 * N6 Asthma
      30-39 * N6 HIV
    l-diversity

    View Slide
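
The generalisation on this slide can be sketched in a few lines of Python: ages are binned into decades, gender is suppressed, and postcodes are cut to the outward code. The sketch then measures k (the smallest group sharing the same quasi-identifiers) and l (the smallest number of distinct sensitive values in any group). The function names and the choice of which columns count as quasi-identifiers are assumptions made for illustration.

```python
from collections import defaultdict

# Rows from the slide: (age, gender, postcode, condition).
rows = [
    (33, "Male", "N6 6DT", "Heart disease"),
    (39, "Male", "N6 5PA", "Cancer"),
    (36, "Female", "N6 5LB", "Asthma"),
    (38, "Male", "N6 4TA", "HIV"),
]


def generalise(row):
    """Bin age into a decade, suppress gender, keep only the outward postcode."""
    age, _gender, postcode, condition = row
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", "*", postcode.split()[0], condition)


def k_and_l(generalised_rows):
    """Return (k, l): smallest group size and fewest distinct sensitive values."""
    groups = defaultdict(list)
    for *quasi, sensitive in generalised_rows:
        groups[tuple(quasi)].append(sensitive)
    k_value = min(len(values) for values in groups.values())
    l_value = min(len(set(values)) for values in groups.values())
    return k_value, l_value


generalised = [generalise(r) for r in rows]
k_value, l_value = k_and_l(generalised)
print(generalised)
print(k_value, l_value)
```

All four generalised rows fall into one group with four different conditions, so this toy table is 4-anonymous and 4-diverse. If every row in a group shared the same condition, k would be unchanged but l would drop to 1, which is the weakness l-diversity addresses.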

  33. Analysis on Generalised Data

    View Slide

  34. View Slide

  35. Tracker Attack
    Average Salary = £36,000
    Average Salary = £35,300

    View Slide
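
A tracker (differencing) attack combines two aggregate answers to isolate one person. The sketch below assumes the two averages on the slide come from a query over a group and the same query with one person excluded; the deck does not state the group size, so the ten salaries here are invented so that the averages come out at £36,000 and £35,300.

```python
# Hypothetical salaries chosen so the two averages match the slide.
salaries = {
    "emp01": 30000, "emp02": 31000, "emp03": 33000,
    "emp04": 34000, "emp05": 35000, "emp06": 36000,
    "emp07": 37000, "emp08": 39000, "emp09": 42700,
    "target": 42300,
}


def average_salary(predicate):
    """Aggregate-only interface: returns (count, mean) for matching people."""
    matched = [pay for name, pay in salaries.items() if predicate(name)]
    return len(matched), sum(matched) / len(matched)


# Query 1: everyone. Query 2: everyone except the target.
n_all, avg_all = average_salary(lambda name: True)
n_rest, avg_rest = average_salary(lambda name: name != "target")

# The difference of the two sums is the target's exact salary.
recovered = n_all * avg_all - n_rest * avg_rest
print(avg_all, avg_rest, recovered)
```

A real attacker would not filter by name but by a combination of attributes that happens to single out one person; the arithmetic is the same.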

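  36. Differential Privacy
    $$\frac{\Pr[A(D) \in S]}{\Pr[A(D') \in S]} \le 1 + \varepsilon$$
    (for any set of outputs S of the query mechanism A, and any two datasets D and D' that differ in one individual's record)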

    View Slide

  37. Add Laplace Noise to Query Result
    Adding noise of roughly 1/ε × (the effect any individual can have on the outcome)
    gives the desired ratio e^ε ≈ 1 + ε
    [Plot: vertical axis 0 to 0.6, horizontal axis 30 to 42]

    View Slide
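
A minimal sketch of adding Laplace noise to a query result, using only the Python standard library. It assumes a counting query, where one individual can change the answer by at most 1, so the noise scale is 1/ε. The sample is drawn as the difference of two exponential variates, which follows a Laplace distribution. The function names, the example ages and the value of ε are all illustrative assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng):
    """Answer a counting query with Laplace noise of scale sensitivity/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1  # one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [33, 39, 36, 38]  # illustrative data
rng = random.Random(0)   # seeded only to make the sketch repeatable
noisy = private_count(ages, lambda age: age > 35, epsilon=1.0, rng=rng)
print(noisy)
```

A smaller ε gives stronger privacy but a wider noise distribution, so individual answers are less accurate. Repeated queries consume privacy budget, which this sketch does not track.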

  38. Replace this

    View Slide

  39. View Slide

  40. Randomised Response
    Answer honestly
    Answer yes

    View Slide
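
The two branches on this slide can be simulated as follows: each respondent privately flips a coin, answers honestly on heads and answers "yes" on tails. The sketch assumes a fair coin, which the slide does not state. Under that assumption P(yes) = 0.5 × p + 0.5, so the true rate p can be estimated from the aggregate while any individual "yes" remains deniable. The true rate of 20% and the population size are invented for the demonstration.

```python
import random


def randomised_answer(truth, rng):
    """Heads: answer honestly. Tails: answer yes regardless of the truth."""
    if rng.random() < 0.5:  # assumed fair coin
        return truth
    return True


def estimate_true_rate(answers):
    """Invert P(yes) = 0.5 * p + 0.5 to estimate the true rate p."""
    observed_yes = sum(answers) / len(answers)
    return 2 * observed_yes - 1


rng = random.Random(2017)  # seeded only to make the sketch repeatable
true_rate = 0.2            # hypothetical proportion of true "yes"
population = [rng.random() < true_rate for _ in range(100_000)]
answers = [randomised_answer(truth, rng) for truth in population]
estimate = estimate_true_rate(answers)
print(estimate)
```

The estimate is noisier than a direct survey of the same size, which is the price paid for the privacy of each response.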

  41. https://github.com/google/rappor

    View Slide

  42. Some Lessons

    View Slide

  43. Lessons for Citizens
    1. Your data lives for a long time – maybe forever
    2. Linking datasets can be very revealing
    3. Today’s technology can match data and find sensitive patterns at huge scale
    4. Hold organisations to account!

    View Slide

  44. Lessons for Data Practitioners
    1. Only store data you need
    2. Always remove primary identifiers
    3. Aggregate and statistically anonymise data before sharing
      • You can’t foresee all future datasets and potential linkage risks
      • You can’t foresee the future power of machine learning on this data
    4. If data is too complex to anonymise:
      • extract the features of interest
      • anonymise and share only those
    5. Even better, don’t share the data itself; allow secure queries on the data
    6. Be open and clear about how you protect and use private data

    View Slide

  45. Please help
    We’re looking to meet data scientists and analysts who are working
    with structured sensitive data, to discuss and user-test some of
    our advanced products. Please contact me at:
    [email protected]

    View Slide

  46. View Slide