Privacy and Data Science - PyData London

Data science on customer data opens up huge opportunities, both for economic benefit and social good. But as datasets become richer, individual privacy comes under threat, and indeed responsible organisations are blocked from innovating because they have no way to guarantee privacy.

Technology has created this problem, and technology can solve it.

I talk about Privacy Engineering techniques that enable the safe and effective use of data, including tokenisation and masking, statistical generalisation and blurring of data (such as k-anonymity), controlled privacy-preserving querying of data (such as differential privacy), homomorphic encryption and randomised response. I describe the state of the art, and outline the hard problems that must be solved next.

Jason McFall

February 07, 2017

Transcript

  1. 1.

    Privacy and Data Science. 31st PyData London Meetup, 7 Feb 2017.

    Jason McFall, CTO, Privitar. jason.mcfall@privitar.com
  2. 3.
  3. 10.
  4. 12.
  5. 13.
  6. 18.
  7. 20.

    Impact of human mobility on the emergence of dengue epidemics in Pakistan

    Figure: Human mobility dynamics in Pakistan. Amy Wesolowski et al., PNAS 2015;112:11887-11892. http://www.pnas.org/content/112/38/11887.long ©2015 by National Academy of Sciences
  8. 21.
  9. 24.

    Hospital visits: Date of birth, Gender, ZIP code, Ethnicity, Visit date, Diagnosis, Procedure, Medication

    Voter registration: Name, Address, Phone Number, Date of birth, Gender, ZIP code
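The fields shared by the two datasets on this slide (date of birth, gender, ZIP code) are what make a linkage attack possible. A minimal Python sketch of the idea follows. Every record, the field names (`dob`, `gender`, `zip`) and the `link` helper are invented for illustration and are not from the talk.

```python
# Illustrative only: every record and field name below is invented.
hospital_visits = [
    {"dob": "1961-03-14", "gender": "F", "zip": "02138", "diagnosis": "Asthma"},
    {"dob": "1975-11-02", "gender": "M", "zip": "02139", "diagnosis": "Cancer"},
]
voter_registration = [
    {"name": "Alice Example", "dob": "1961-03-14", "gender": "F", "zip": "02138"},
    {"name": "Bob Example", "dob": "1980-06-30", "gender": "M", "zip": "02139"},
]

# Fields present in both datasets; none is a direct identifier on its own.
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where exactly one voter matches a record."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)
    matches = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link(hospital_visits, voter_registration)
```

In this toy data only the first hospital record has a unique match in the voter list, so only that diagnosis is attached to a name.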
  10. 25.
  11. 26.
  12. 30.

    Step 0: Remove Identifiers

    Original records:

    John Smith   984598498  33  Male    N6 6DT  Heart disease
    James Brown  804528909  39  Male    N6 5PA  Cancer
    Sarah Jones  976234507  36  Female  N6 5LB  Asthma
    David Evans  789453297  38  Male    N6 4TA  HIV

    Names removed:

    984598498  33  Male    N6 6DT  Heart disease
    804528909  39  Male    N6 5PA  Cancer
    976234507  36  Female  N6 5LB  Asthma
    789453297  38  Male    N6 4TA  HIV

    Identifiers replaced with tokens:

    dgkdakhkjf    33  Male    N6 6DT  Heart disease
    ajhfqricddk   39  Male    N6 5PA  Cancer
    mndbhbnai     36  Female  N6 5LB  Asthma
    lalhfkippfaj  38  Male    N6 4TA  HIV
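The identifier-removal step on this slide can be sketched in Python. The slide does not say how the tokens are generated, so this sketch assumes keyed hashing (HMAC-SHA256), which is one common approach. `SECRET_KEY`, `tokenise`, `remove_identifiers` and the field names are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration; a real key would be stored outside source code.
SECRET_KEY = b"replace-with-a-secret-key"


def tokenise(value, key=SECRET_KEY, length=12):
    """Map an identifier to a consistent pseudonym using HMAC-SHA256.

    The digest is truncated for readability, which makes collisions
    more likely on very large datasets.
    """
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


def remove_identifiers(record):
    """Drop the name and replace the ID with a token; other fields are untouched."""
    cleaned = {k: v for k, v in record.items() if k != "name"}
    cleaned["id"] = tokenise(record["id"])
    return cleaned


row = {"name": "John Smith", "id": "984598498", "age": 33,
       "gender": "Male", "postcode": "N6 6DT", "diagnosis": "Heart disease"}
tokenised_row = remove_identifiers(row)
```

The same input always yields the same token under a given key, so tokenised tables can still be joined. Age, gender and postcode are left intact, so the record may still be re-identifiable through linkage.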
  13. 31.
  14. 32.

    Statistical Generalisation (k-anonymity)

    Before:

    33  Male    N6 6DT  Heart disease
    39  Male    N6 5PA  Cancer
    36  Female  N6 5LB  Asthma
    38  Male    N6 4TA  HIV

    After:

    30-39  *  N6  Heart disease
    30-39  *  N6  Cancer
    30-39  *  N6  Asthma
    30-39  *  N6  HIV

    l-diversity
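The generalisation on this slide (ages banded to 30-39, gender suppressed, postcode cut to N6) can be reproduced with a short Python sketch that also measures k and l. The helper names `generalise` and `k_and_l` are made up for this example, and the banding rules are inferred from the slide's output rather than stated in the talk.

```python
from collections import defaultdict

# Rows from the slide: age, gender, postcode, diagnosis (the sensitive value).
rows = [
    (33, "Male", "N6 6DT", "Heart disease"),
    (39, "Male", "N6 5PA", "Cancer"),
    (36, "Female", "N6 5LB", "Asthma"),
    (38, "Male", "N6 4TA", "HIV"),
]


def generalise(row):
    """Band age into decades, suppress gender, keep only the outward postcode."""
    age, _gender, postcode, diagnosis = row
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", "*", postcode.split()[0], diagnosis)


def k_and_l(table):
    """Return (k, l) for a table whose last column is the sensitive value.

    k is the size of the smallest group of rows sharing the same
    quasi-identifiers; l is the fewest distinct sensitive values in any group.
    """
    groups = defaultdict(list)
    for *quasi_identifiers, sensitive in table:
        groups[tuple(quasi_identifiers)].append(sensitive)
    k = min(len(values) for values in groups.values())
    l_div = min(len(set(values)) for values in groups.values())
    return k, l_div


generalised = [generalise(r) for r in rows]
before = k_and_l(rows)
after = k_and_l(generalised)
```

On these four rows every record is unique before generalisation, and all four fall into a single group with four different diagnoses afterwards.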
  15. 34.
  16. 37.

    Add Laplace Noise to Query Result

    Adding noise roughly 1/ε × (effect any individual can have on outcome) gives the desired ratio e^ε ≈ 1 + ε

    [Figure: plot with x-axis from 30 to 42 and y-axis from 0 to 0.6]
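A minimal Python sketch of adding Laplace noise to a count query, assuming any one individual can change the count by at most 1. The function names, the value of ε and the example count of 36 are illustrative choices, not taken from the talk.

```python
import math
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, sensitivity=1.0, rng=random):
    """Laplace mechanism: the noise scale is sensitivity / epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def laplace_pdf(x, centre, scale):
    """Density of a Laplace distribution centred on `centre`."""
    return math.exp(-abs(x - centre) / scale) / (2.0 * scale)


epsilon = 0.5
scale = 1.0 / epsilon

# Two neighbouring datasets whose true counts differ by one individual.
# For any released value x, the two output densities differ by at most e^epsilon.
worst_ratio = max(
    laplace_pdf(x / 10.0, 36, scale) / laplace_pdf(x / 10.0, 37, scale)
    for x in range(300, 421)
)

released = noisy_count(36, epsilon)
```

The bound on `worst_ratio` is the sense in which the released value reveals little about whether any one person is in the data. For small ε the bound e^ε is close to 1 + ε.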
  17. 39.
  18. 43.

    Lessons for Citizens

    1. Your data lives for a long time – maybe forever
    2. Linking datasets can be very revealing
    3. Today’s technology can match data and find sensitive patterns at huge scale
    4. Hold organisations to account!
  19. 44.

    Lessons for Data Practitioners

    1. Only store data you need
    2. Always remove primary identifiers
    3. Aggregate and statistically anonymise data before sharing
       • You can’t foresee all future datasets and potential linkage risks
       • You can’t foresee the future power of machine learning on this data
    4. If data is too complex to anonymise:
       • extract the features of interest
       • anonymise and share only those
    5. Even better, don’t share the data itself, allow secure queries on the data
    6. Be open and clear about how you protect and use private data
  20. 45.

    Please help

    We’re looking to meet data scientists and analysts who are working with structured sensitive data, to discuss and user-test some of our advanced products. Please contact me at: jason.mcfall@privitar.com