Slide 1

Slide 1 text

Privacy and Data Science
31st PyData London Meetup, 7 Feb 2017
Jason McFall, CTO, Privitar
[email protected]

Slide 2

Slide 2 text

Credit: James Cridland

Slide 3

Slide 3 text

No content

Slide 4

Slide 4 text

Egor Tsvetkov
https://birdinflight.com/ru/vdohnovenie/fotoproect/06042016-face-big-data.html

Slide 5

Slide 5 text

Credit: Jorge Láscar

Slide 6

Slide 6 text

©Egor Tsvetkov

Slide 7

Slide 7 text

©Egor Tsvetkov

Slide 8

Slide 8 text

©Egor Tsvetkov

Slide 9

Slide 9 text

©Egor Tsvetkov

Slide 10

Slide 10 text

No content

Slide 11

Slide 11 text

http://www.informationisbeautiful.net/visualizations/worlds-biggest-data-breaches-hacks/

Slide 12

Slide 12 text

No content

Slide 13

Slide 13 text

No content

Slide 14

Slide 14 text

Using Private Data for Good

Slide 15

Slide 15 text

https://www.newscientist.com/article/2086454-revealed-google-ai-has-access-to-huge-haul-of-nhs-patient-data/
https://deepmind.com/health

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Location data

Slide 18

Slide 18 text

No content

Slide 19

Slide 19 text

Credit: Highways Agency

Slide 20

Slide 20 text

Impact of human mobility on the emergence of dengue epidemics in Pakistan
Figure: Human mobility dynamics in Pakistan. Amy Wesolowski et al. PNAS 2015;112:11887-11892
http://www.pnas.org/content/112/38/11887.long
©2015 by National Academy of Sciences

Slide 21

Slide 21 text

No content

Slide 22

Slide 22 text

Publish and be Damned?

Slide 23

Slide 23 text

Credit: Matthew W. Hutchins, Harvard Law Record

Slide 24

Slide 24 text

Hospital visits: Date of birth, Gender, ZIP code, Ethnicity, Visit date, Diagnosis, Procedure, Medication
Voter registration: Name, Address, Phone Number, Date of birth, Gender, ZIP code
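The two field lists on this slide share date of birth, gender and ZIP code, which is what makes a linkage attack possible. A minimal sketch of such a join follows; every record, name and field key in it is invented for illustration and does not come from the talk.

```python
# Hypothetical records: all names and values are made up for illustration.
hospital_visits = [
    {"dob": "1950-02-11", "gender": "M", "zip": "02138", "diagnosis": "Hypertension"},
    {"dob": "1962-03-14", "gender": "F", "zip": "02139", "diagnosis": "Asthma"},
]
voter_registration = [
    {"name": "Alice Example", "dob": "1962-03-14", "gender": "F", "zip": "02139"},
    {"name": "Bob Example", "dob": "1950-02-11", "gender": "M", "zip": "02138"},
    {"name": "Carol Example", "dob": "1980-01-01", "gender": "F", "zip": "02139"},
]

# The fields that appear in both datasets (quasi-identifiers).
QUASI_IDENTIFIERS = ("dob", "gender", "zip")


def link(medical, voters, keys=QUASI_IDENTIFIERS):
    """Join two datasets on shared quasi-identifiers, keeping unique matches only."""
    index = {}
    for voter in voters:
        index.setdefault(tuple(voter[k] for k in keys), []).append(voter)
    matches = []
    for record in medical:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the patient
            matches.append((candidates[0]["name"], record["diagnosis"]))
    return matches


reidentified = link(hospital_visits, voter_registration)
print(reidentified)
```

Neither dataset contains both a name and a diagnosis, yet the join produces exactly that pairing for every record whose quasi-identifier combination is unique.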

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

Slide 28

Slide 28 text

https://research.neustar.biz/2014/09/15/riding-with-the-stars-passenger-privacy-in-the-nyc-taxicab-dataset/

Slide 29

Slide 29 text

Technology to the rescue?

Slide 30

Slide 30 text

Step 0: Remove Identifiers

Raw records:
John Smith 984598498 33 Male N6 6DT Heart disease
James Brown 804528909 39 Male N6 5PA Cancer
Sarah Jones 976234507 36 Female N6 5LB Asthma
David Evans 789453297 38 Male N6 4TA HIV

Names removed:
984598498 33 Male N6 6DT Heart disease
804528909 39 Male N6 5PA Cancer
976234507 36 Female N6 5LB Asthma
789453297 38 Male N6 4TA HIV

IDs replaced with tokens:
dgkdakhkjf 33 Male N6 6DT Heart disease
ajhfqricddk 39 Male N6 5PA Cancer
mndbhbnai 36 Female N6 5LB Asthma
lalhfkippfaj 38 Male N6 4TA HIV
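One possible way to carry out this step is sketched below: drop the name and replace the numeric ID with a keyed hash. The slide does not say how its tokens were generated, so the HMAC approach, the key, the field names and the 12-character token length are all assumptions made for this example.

```python
import hashlib
import hmac

# Assumption: a secret key held by the data owner and never shared with the data.
SECRET_KEY = b"hypothetical-secret-key"


def tokenise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifier to a stable token that cannot be reversed without the key."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]  # truncated for readability; the length is an arbitrary choice


def remove_identifiers(record: dict) -> dict:
    """Drop the name and swap the raw ID for its token."""
    cleaned = {k: v for k, v in record.items() if k != "name"}
    cleaned["id"] = tokenise(record["id"])
    return cleaned


# First two rows from the slide; the field names are hypothetical.
records = [
    {"name": "John Smith", "id": "984598498", "age": 33, "gender": "Male",
     "postcode": "N6 6DT", "diagnosis": "Heart disease"},
    {"name": "James Brown", "id": "804528909", "age": 39, "gender": "Male",
     "postcode": "N6 5PA", "diagnosis": "Cancer"},
]

pseudonymised = [remove_identifiers(r) for r in records]
for row in pseudonymised:
    print(row)
```

The same ID always maps to the same token, so records can still be joined across tables. Age, gender and postcode are untouched, which is why this is only step 0.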

Slide 31

Slide 31 text

No content

Slide 32

Slide 32 text

Statistical Generalisation (k-anonymity)
l-diversity

Before generalisation:
33 Male N6 6DT Heart disease
39 Male N6 5PA Cancer
36 Female N6 5LB Asthma
38 Male N6 4TA HIV

After generalisation:
30-39 * N6 Heart disease
30-39 * N6 Cancer
30-39 * N6 Asthma
30-39 * N6 HIV
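The generalisation shown on this slide can be sketched in a few lines, along with a check of the resulting k-anonymity and l-diversity levels. The function names and the specific rules (ten-year age bands, suppressed gender, outward postcode only) are assumptions read off the slide's example rows rather than a stated algorithm.

```python
from collections import defaultdict

# Rows from the slide: age, gender, postcode, diagnosis.
rows = [
    (33, "Male", "N6 6DT", "Heart disease"),
    (39, "Male", "N6 5PA", "Cancer"),
    (36, "Female", "N6 5LB", "Asthma"),
    (38, "Male", "N6 4TA", "HIV"),
]


def generalise(row):
    """Coarsen the quasi-identifiers; leave the sensitive value unchanged."""
    age, gender, postcode, diagnosis = row
    decade = age // 10 * 10
    age_band = f"{decade}-{decade + 9}"   # 33 becomes "30-39"
    district = postcode.split()[0]        # "N6 6DT" becomes "N6"
    return (age_band, "*", district, diagnosis)


def privacy_levels(generalised_rows):
    """Return (k, l): smallest group size and fewest distinct sensitive values per group."""
    groups = defaultdict(list)
    for *quasi, sensitive in generalised_rows:
        groups[tuple(quasi)].append(sensitive)
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l


generalised = [generalise(r) for r in rows]
k, l = privacy_levels(generalised)
print(generalised[0], "k =", k, "l =", l)
```

All four rows end up with identical quasi-identifiers, so this table is 4-anonymous, and because the four diagnoses differ it is also 4-diverse.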

Slide 33

Slide 33 text

Analysis on Generalised Data

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

Tracker Attack
Average Salary = £36,000
Average Salary = £35,300
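The slide gives two averages but not the query sets behind them. As a hedged illustration, assume the first average covers a group of 10 people and the second covers the same group with one target individual excluded. The group size and which query includes the target are assumptions. Under them, simple differencing, the basic idea behind a tracker attack, recovers the target's salary exactly.

```python
def differencing_attack(avg_with_target, group_size, avg_without_target):
    """Recover one person's value from two averages over overlapping groups."""
    total_with = avg_with_target * group_size
    total_without = avg_without_target * (group_size - 1)
    return total_with - total_without


# Averages from the slide; the group size of 10 is a hypothetical choice.
target_salary = differencing_attack(36_000, 10, 35_300)
print(target_salary)  # 42300
```

Neither query returns an individual record, yet together they disclose one. A full tracker attack pads the queries with extra conditions to get around minimum query-set-size restrictions, but the arithmetic is the same.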

Slide 36

Slide 36 text

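Differential Privacy

$$\frac{\Pr[A(D) \in S]}{\Pr[A(D') \in S]} \le 1 + \varepsilon$$

Here D and D' are datasets that differ in one individual's record, A is the randomised query mechanism, and S is any set of possible outputs.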

Slide 37

Slide 37 text

Add Laplace Noise to Query Result
Adding noise of scale roughly 1/ε × (effect any individual can have on outcome) gives the desired ratio e^ε ≈ 1 + ε
[Figure: plot with x-axis from 30 to 42 and y-axis from 0 to 0.6]
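A minimal sketch of the Laplace mechanism described here, using only the standard library. The function names, the example count of 36 and the choice ε = 0.5 are assumptions for illustration; sensitivity 1 corresponds to a counting query, where one individual can change the result by at most 1.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_query(true_value, epsilon, sensitivity=1.0, rng=None):
    """Return the true query result plus noise of scale sensitivity / epsilon."""
    if rng is None:
        rng = random.Random()
    return true_value + laplace_noise(sensitivity / epsilon, rng)


def laplace_density(x, centre, scale):
    """Probability density of a Laplace distribution centred on the true result."""
    return math.exp(-abs(x - centre) / scale) / (2 * scale)


epsilon = 0.5
noisy = private_query(36, epsilon, rng=random.Random(2017))
print(round(noisy, 2))

# Neighbouring datasets: the true count is 36 with the individual, 35 without.
# At every possible output the two densities differ by a factor of at most e^epsilon.
worst_ratio = max(
    laplace_density(x / 10, 36, 1 / epsilon) / laplace_density(x / 10, 35, 1 / epsilon)
    for x in range(300, 421)
)
print(worst_ratio <= math.exp(epsilon) + 1e-12)
```

A smaller ε means a larger noise scale, so stronger privacy comes at the cost of less accurate answers.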

Slide 38

Slide 38 text

No content

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

Randomised Response
Answer honestly
Answer yes
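The two branches on this slide can be simulated directly. The sketch below assumes a fair coin decides between them, which the slide text does not state, and uses an invented population in which about 30% of people would truthfully answer yes.

```python
import random


def randomised_answer(truth, rng):
    """Flip a coin: heads, answer honestly; tails, answer yes regardless."""
    if rng.random() < 0.5:  # assumption: a fair coin
        return truth
    return True


def estimate_true_rate(answers):
    """Invert the noise: P(yes) = 0.5 * true_rate + 0.5, so true_rate = 2 * P(yes) - 1."""
    observed_yes = sum(answers) / len(answers)
    return 2 * observed_yes - 1


rng = random.Random(0)
population = [rng.random() < 0.3 for _ in range(100_000)]  # hypothetical truths
answers = [randomised_answer(truth, rng) for truth in population]

actual = sum(population) / len(population)
estimate = estimate_true_rate(answers)
print(f"actual rate {actual:.3f}, estimated rate {estimate:.3f}")
```

No single "yes" reveals anything definite about the person who gave it, but the population rate can still be estimated from the aggregate.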

Slide 41

Slide 41 text

https://github.com/google/rappor

Slide 42

Slide 42 text

Some Lessons

Slide 43

Slide 43 text

Lessons for Citizens
1. Your data lives for a long time – maybe forever
2. Linking datasets can be very revealing
3. Today’s technology can match data and find sensitive patterns at huge scale
4. Hold organisations to account!

Slide 44

Slide 44 text

Lessons for Data Practitioners
1. Only store data you need
2. Always remove primary identifiers
3. Aggregate and statistically anonymise data before sharing
   • You can’t foresee all future datasets and potential linkage risks
   • You can’t foresee the future power of machine learning on this data
4. If data is too complex to anonymise:
   • extract the features of interest
   • anonymise and share only those
5. Even better, don’t share the data itself; allow secure queries on the data instead
6. Be open and clear about how you protect and use private data

Slide 45

Slide 45 text

Please help
We’re looking to meet data scientists and analysts who are working with structured sensitive data, to discuss and user-test some of our advanced products.
Please contact me at: [email protected]

Slide 46

Slide 46 text

Thank you [email protected]