Introduction to Data Privacy

SdgJlbl
September 19, 2018


Transcript

1. HEY, HERE’S SOME DATA!
➤ Data come in various shapes, but fall into specific categories:
➤ Non-personal / Personal
➤ Non-sensitive / Sensitive
➤ Non-protected / Protected
2. PERSONAL IDENTIFIERS
➤ First name
➤ Last name
➤ SSN
➤ Street address
➤ Email address
➤ IP address
➤ MAC address
➤ Coordinates
➤ Cookie ID
➤ Your phone advertising ID
3. PROTECTED DATA
Legally protected attributes; you can get in trouble for using them to discriminate:
➤ Gender identity
➤ Ethnicity
➤ Political beliefs
➤ Religion
➤ ...
4. SENSITIVE DATA
Anything that can be linked to a unique individual and that is not public information:
➤ Membership of a private community
➤ Medical data
➤ Earnings, savings, financial info
➤ Political beliefs, voting choices, ...
➤ Personal habits
➤ Religion
➤ ...
These are meaningless if not linked to someone in particular.
5. WHAT WE WANT
Extract the relevant information out of our terabytes of personal data to make the world a better place.
6. HOW DO WE ACHIEVE THIS?
1. By not collecting any personal data
2. By working on an anonymized dataset
 - Swapping personal information for ids: pseudonymization
 - Aggregating pseudo-identifiers using k-anonymity
 - Using differential privacy
7. REMINDER: QUESTIONS TO ASK BEFORE USING PERSONAL INFORMATION
Do I need it? Do I really need it? Would I collect it if I didn't have it in the first place?
If the answer to any of these questions is "no", then don't use it.
8. PERSONAL IDENTIFIERS
➤ First name
➤ Last name
➤ SSN
➤ Street address
➤ Email address
➤ IP address
➤ MAC address
➤ Coordinates
➤ Cookie ID
➤ Your phone advertising ID
9. PSEUDONYMIZATION TECHNIQUES
➤ Replace with fake information (faker, homomorphic pseudonymization, ...)
➤ Mask some part of the data
➤ Hash the data
➤ Replace with an id and keep the mapping in a super secret place
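As an illustration, here is a minimal Python sketch of two of these techniques: masking part of an email address, and replacing a name with a salted hash while keeping the mapping elsewhere. The function names, salt handling and example values are made up for illustration.

```python
import hashlib
import secrets

# Illustrative salt; in practice it lives in the same "super secret place"
# as the id mapping, never next to the pseudonymized data.
SALT = secrets.token_hex(16)

# Mapping from pseudonym back to the real identity, kept separately and secured.
secret_mapping = {}

def mask_email(email):
    """Mask part of the data: 'h.potter@hogwarts.uk' -> 'h***@hogwarts.uk'."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def pseudonymize(name):
    """Replace an identifier with a salted hash and keep the mapping secret."""
    pseudo_id = hashlib.sha256((SALT + name).encode("utf-8")).hexdigest()
    secret_mapping[pseudo_id] = name
    return pseudo_id

print(mask_email("h.potter@hogwarts.uk"))   # h***@hogwarts.uk
print(pseudonymize("Harry Potter"))         # 64-character hex digest
```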
10. PSEUDONYMIZATION
➤ Minimal requirement for GDPR compliance
➤ Pseudonymized data is still sensitive, and not ready for public release (or release to an untrusted third party)
11. EXERCISE 1: EXTRACT FROM MS POMFREY'S RECORDS

ID | Age | Gender | House      | Magical Disease
 1 | 15  | M      | Slytherin  | Dragon pox
 2 | 19  | M      | Hufflepuff | Black cat flu
 3 | 12  | F      | Gryffindor | Levitation sickness
 4 | 18  | F      | Slytherin  | Petrification
 5 | 14  | M      | Gryffindor | Hippogriff bite
 6 | 14  | M      | Gryffindor | Dragon pox
 7 | 19  | M      | Ravenclaw  | Black cat flu
 8 | 13  | F      | Ravenclaw  | Levitation sickness
 9 | 17  | F      | Slytherin  | Lycanthropy
10 | 15  | M      | Gryffindor | Hippogriff bite

What can you tell me about Harry, a 15-year-old Gryffindor boy?
12. K-ANONYMITY
Knowing all the public information about a person does not allow you to single out fewer than k rows in the dataset.
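A small sketch of what that property looks like in code, assuming a pandas DataFrame and hypothetical column names; it simply checks that every combination of quasi-identifier values covers at least k rows.

```python
import pandas as pd

def is_k_anonymous(df, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in at least k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

records = pd.DataFrame({
    "age":    [15, 19, 12, 18, 14, 14],
    "gender": ["M", "M", "F", "F", "M", "M"],
    "house":  ["Slytherin", "Hufflepuff", "Gryffindor", "Slytherin", "Gryffindor", "Gryffindor"],
})
print(is_k_anonymous(records, ["age", "gender", "house"], k=2))  # False: several rows are unique
```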
13. K-ANONYMITY
How do we achieve it?
➤ By generalizing over some attributes, so that they cannot be used as pseudo-identifiers.
➤ Even in large datasets, some combinations of gender, age and zipcode are often unique.
➤ E.g. use age ranges (20-30, 30-40) instead of the exact value. It's probably enough for your analysis.
➤ E.g. use larger regions rather than zipcodes, or remove zipcodes for less populated areas.
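A sketch of that age generalization, again assuming pandas; pd.cut buckets exact ages into the coarse ranges used in the next exercise.

```python
import pandas as pd

ages = pd.DataFrame({"age": [12, 13, 14, 15, 17, 18, 19]})

# Replace exact ages with coarse ranges; the exact value is rarely needed for analysis.
ages["age_range"] = pd.cut(ages["age"], bins=[10, 14, 20], labels=["10-14", "15-20"])
print(ages)
```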
14. EXERCISE 2: EXTRACT FROM MS POMFREY'S RECORDS WITH 2-ANONYMITY

ID | Age   | Gender | House         | Magical Disease
 1 | 15-20 | M      | Slyth./Gryff. | Dragon pox
 2 | 15-20 | M      | Huff./Rav.    | Black cat flu
 3 | 10-14 | F      | Gryff./Rav.   | Levitation sickness
 4 | 15-20 | F      | Slyth./Gryff. | Petrification
 5 | 10-14 | M      | Slyth./Gryff. | Hippogriff bite
 6 | 10-14 | M      | Slyth./Gryff. | Dragon pox
 7 | 15-20 | M      | Huff./Rav.    | Black cat flu
 8 | 10-14 | F      | Gryff./Rav.   | Levitation sickness
 9 | 15-20 | F      | Slyth./Gryff. | Lycanthropy
10 | 15-20 | M      | Slyth./Gryff. | Hippogriff bite

What can you tell me about Luna, a 14-year-old Ravenclaw girl?
15. L-DIVERSITY
➤ k-anonymity offers plausible deniability to individuals, since we cannot know for sure that they are in the dataset.
➤ However, it is possible that all k individuals in the same group share the same value for a protected attribute.
➤ l-diversity ensures that each k-anonymous group contains at least l different values of a sensitive attribute.
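Continuing the pandas sketch (hypothetical column names again), an l-diversity check only has to count distinct sensitive values per k-anonymous group:

```python
import pandas as pd

def is_l_diverse(df, quasi_identifiers, sensitive, l):
    """True if every quasi-identifier group contains at least l distinct sensitive values."""
    distinct = df.groupby(quasi_identifiers)[sensitive].nunique()
    return bool((distinct >= l).all())

records = pd.DataFrame({
    "age_range": ["10-14"] * 4,
    "gender":    ["F", "F", "F", "M"],
    "disease":   ["Levitation sickness", "Levitation sickness", "Levitation sickness", "Dragon pox"],
})
print(is_l_diverse(records, ["age_range", "gender"], "disease", l=2))  # False: all girls share one disease
```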
16. EXERCISE 3: EXTRACT FROM MS POMFREY'S RECORDS

ID | Age   | Gender | House | Magical Disease
 3 | 10-14 | F      | \     | Levitation sickness
 5 | 10-14 | M      | \     | Hippogriff bite
 6 | 10-14 | M      | \     | Dragon pox
 8 | 10-14 | F      | \     | Levitation sickness
12 | 10-14 | M      | \     | Common cold
18 | 10-14 | F      | \     | Levitation sickness
21 | 10-14 | M      | \     | Black cat flu
25 | 10-14 | F      | \     | Levitation sickness
28 | 10-14 | F      | \     | Dragon pox
30 | 10-14 | M      | \     | Hippogriff bite

What can you tell me about Luna, a 14-year-old girl?
17. T-CLOSENESS
➤ An l-diverse dataset does not protect against statistical attacks.
➤ If 90% of the persons in an l-diverse group have the same value for a protected attribute, then we can infer with high probability that a person in this group has that value.
➤ t-closeness ensures that, in each group, the distribution with respect to a sensitive attribute does not differ "too much" from the overall distribution.
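A sketch of a t-closeness check in the same pandas style. The original formulation measures the gap between distributions with the Earth Mover's Distance; total variation distance is used here only to keep the example short.

```python
import pandas as pd

def is_t_close(df, quasi_identifiers, sensitive, t):
    """True if, in every group, the sensitive-value distribution stays within
    total variation distance t of the overall distribution."""
    overall = df[sensitive].value_counts(normalize=True)
    for _, group in df.groupby(quasi_identifiers):
        group_dist = group[sensitive].value_counts(normalize=True)
        # Align both distributions on the full set of sensitive values before comparing.
        distance = overall.sub(group_dist, fill_value=0.0).abs().sum() / 2
        if distance > t:
            return False
    return True
```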
18. FINAL THOUGHTS
➤ k-anonymity, l-diversity and t-closeness all limit the amount of information a legitimate user can get from the data.
➤ Find the right trade-off between utility of the data and privacy.
19. EXERCISE 4: PUBLIC ANONYMOUS STATISTICS OF DUMBLEDORE'S ARMY
Wednesday: dataset size: 72 people, members of the army: 17 people.
Thursday: dataset size: 73 people, members of the army: 18 people.
What can you tell me about Ginny, who was added on Wednesday night?
20. DIFFERENTIAL PRIVACY
➤ The definition gets a bit mathy, so let's see it through an example.
Pr[K(D1) ∈ S] ≤ exp(ε) × Pr[K(D2) ∈ S]
where K is the randomized mechanism, D1 and D2 are datasets differing in a single record, and S is any set of possible outputs.
21. DIFFERENTIAL PRIVACY: AN EXAMPLE
➤ We have some data about student membership of Dumbledore's Army.
➤ It's highly sensitive information, so we cannot just store the true value.
➤ So let's sometimes lie about it. For each student ("Is this person a member of Dumbledore's Army?"), decide: with probability p, store the truth; with probability 1-p, store a random answer instead, "Yes" with probability k and "No" with probability 1-k.
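A minimal sketch of that randomized-response scheme, with p and k as in the diagram (both set to 0.5 here purely for illustration):

```python
import random

def randomized_response(is_member, p=0.5, k=0.5):
    """Store the true answer with probability p; otherwise store a coin flip
    that answers 'yes' with probability k, regardless of the truth."""
    if random.random() < p:
        return is_member              # store the truth
    return random.random() < k        # store a random answer instead
```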
22. DIFFERENTIAL PRIVACY: WHY IS IT SO GREAT?
➤ You can't draw any definitive conclusion just by watching the dataset grow over time.
➤ If the probabilities p and k are known, we can adjust our estimators to get an unbiased estimate of aggregated values, with a limited loss of precision.
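A sketch of that adjustment for the scheme above: the stored fraction of "yes" answers averages p * true_rate + (1 - p) * k, where true_rate is the real membership rate, so inverting that relation recovers an unbiased estimate.

```python
def estimate_true_rate(stored_yes_fraction, p=0.5, k=0.5):
    """Invert E[stored 'yes' fraction] = p * true_rate + (1 - p) * k."""
    return (stored_yes_fraction - (1 - p) * k) / p

# If 40% of the stored answers are 'yes' with p = k = 0.5,
# the estimated true membership rate is (0.40 - 0.25) / 0.5 = 30%.
print(estimate_true_rate(0.40))  # approximately 0.3
```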
23. KEEP IN MIND
➤ Some graph libraries/formats can leak information, even when the graph looks fine (think JS-based graphs, or SVGs).
➤ Some ML models can leak private information, especially on minority classes: consider privacy-preserving machine learning methods.
➤ Sometimes, even just knowing someone is part of a dataset is a privacy leak.
24. KEEP IN MIND
➤ Sometimes, sensitive information is not even linked to knowledge about an individual (think Strava).
➤ Knowing more public information about someone can help you identify them (background knowledge attack).
25. EXERCISE 4 - BACKGROUND KNOWLEDGE ATTACK
Has Hermione Granger been levitating lately?

ID | Age   | Gender | House | OWL  | Magical Disease
 3 | 10-14 | F      | \     | B    | Levitation sickness
 5 | 10-14 | M      | \     | C    | Hippogriff bite
 6 | 10-14 | M      | \     | B-   | Dragon pox
 8 | 10-14 | F      | \     | A-   | Levitation sickness
12 | 10-14 | M      | \     | B+   | Common cold
18 | 10-14 | F      | \     | F    | Levitation sickness
21 | 10-14 | M      | \     | C+   | Black cat flu
25 | 10-14 | F      | \     | A+++ | Levitation sickness
28 | 10-14 | F      | \     | A+   | Dragon pox
30 | 10-14 | M      | \     | E    | Hippogriff bite
26. REFERENCES
➤ Check out this tutorial on Data Privacy by Andreas Dewes and Katharine Jarmul: https://github.com/KIProtect/data-privacy-for-data-scientists
➤ k-Anonymity: A Model for Protecting Privacy, Latanya Sweeney
➤ Mondrian Multidimensional k-Anonymity, K. LeFevre, D. DeWitt, and R. Ramakrishnan
➤ Differential Privacy, Cynthia Dwork
➤ Encrypted Statistical Machine Learning: New Privacy Preserving Methods, L. Aslett, P. Esperança, and C. Holmes
➤ Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, B. Agüera y Arcas