
Introduction to Data Privacy

SdgJlbl
September 19, 2018


Transcript

  1. HEY, HERE'S SOME DATA!
     ➤ Data comes in various shapes but falls into specific categories
     ➤ Non-personal / Personal
     ➤ Non-sensitive / Sensitive
     ➤ Non-protected / Protected
  2. PERSONAL IDENTIFIERS
     ➤ First name
     ➤ Last name
     ➤ SSN
     ➤ Street address
     ➤ Email address
     ➤ IP address
     ➤ MAC address
     ➤ Coordinates
     ➤ Cookie ID
     ➤ Your phone's advertising ID
  3. PROTECTED DATA
     Legally protected attributes; you can get in trouble for using them to discriminate:
     ➤ Gender identity
     ➤ Ethnicity
     ➤ Political beliefs
     ➤ Religion
     ➤ ...
  4. SENSITIVE DATA
     Anything that can be linked to a unique individual and that is not public information:
     ➤ Membership in a private community
     ➤ Medical data
     ➤ Earnings, savings, financial info
     ➤ Political beliefs, voting choices, ...
     ➤ Personal habits
     ➤ Religion
     ➤ ...
     These attributes are meaningless if not linked to someone in particular.
  5. WHAT WE WANT
     Extract the relevant information out of our terabytes of personal data to make the world a better place.
  6. HOW DO WE ACHIEVE THIS?
     1. By not collecting any personal data
     2. By working on an anonymized dataset
        - Swapping personal information for ids: pseudonymization
        - Aggregating pseudo-identifiers using k-anonymity
        - Using differential privacy
  7. REMINDER: QUESTIONS TO ASK BEFORE USING PERSONAL INFORMATION
     Do I need it? Do I really need it? Would I collect it if I didn't have it in the first place?
     If the answer to any of these questions is "no", then don't use it.
  8. PERSONAL IDENTIFIERS
     ➤ First name
     ➤ Last name
     ➤ SSN
     ➤ Street address
     ➤ Email address
     ➤ IP address
     ➤ MAC address
     ➤ Coordinates
     ➤ Cookie ID
     ➤ Your phone's advertising ID
  9. PSEUDONYMIZATION TECHNIQUES
     ➤ Replace with fake information (faker, homomorphic pseudonymization, ...)
     ➤ Mask some part of the data
     ➤ Hash the data
     ➤ Replace with an id and keep the mapping in a super secret place
     (The last three techniques are sketched in code below.)
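
A minimal sketch of the masking, hashing and id-mapping techniques in Python; the record fields, the salt value and the in-memory mapping are illustrative assumptions, not production choices.

```python
import hashlib
import uuid

# Hypothetical record; the field names are made up for illustration.
record = {"email": "harry@hogwarts.example", "ssn": "123-45-6789"}

# Masking: keep only the part of the data you actually need.
masked_ssn = "***-**-" + record["ssn"][-4:]          # "***-**-6789"

# Hashing: one-way, but salt it with a secret; unsalted hashes of
# emails or SSNs can be reversed by brute-forcing common values.
SECRET_SALT = b"keep-me-out-of-the-dataset"
hashed_email = hashlib.sha256(SECRET_SALT + record["email"].encode()).hexdigest()

# Replace with an id, keeping the mapping "in a super secret place"
# (here a dict; in practice, a separate access-controlled store).
pseudonym_map = {}

def pseudonymize(value):
    """Return a stable random id for a value, remembering the mapping."""
    if value not in pseudonym_map:
        pseudonym_map[value] = str(uuid.uuid4())
    return pseudonym_map[value]

record_id = pseudonymize(record["email"])
```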
  10. PSEUDONYMIZATION
     ➤ Minimal requirement for GDPR compliance
     ➤ Pseudonymized data is still sensitive, and not ready for public release (or release to an untrusted third party)
  11. EXERCISE 1: EXTRACT FROM MS POMFREY'S RECORDS
     ID | Age | Gender | House      | Magical Disease
      1 | 15  | M      | Slytherin  | Dragon pox
      2 | 19  | M      | Hufflepuff | Black cat flu
      3 | 12  | F      | Gryffindor | Levitation sickness
      4 | 18  | F      | Slytherin  | Petrification
      5 | 14  | M      | Gryffindor | Hippogriff bite
      6 | 14  | M      | Gryffindor | Dragon pox
      7 | 19  | M      | Ravenclaw  | Black cat flu
      8 | 13  | F      | Ravenclaw  | Levitation sickness
      9 | 17  | F      | Slytherin  | Lycanthropy
     10 | 15  | M      | Gryffindor | Hippogriff bite
     What can you tell me about Harry, a 15-year-old Gryffindor boy?
  12. K-ANONYMITY
     Knowing all public information on a person does not allow singling out fewer than k rows in a dataset.
  13. K-ANONYMITY
     How do we achieve it?
     ➤ By generalizing over some attributes, so that they cannot be used as pseudo-identifiers.
     ➤ Even in large datasets, some combinations of gender, age and zip code are often unique.
     ➤ E.g. use ranges for age (20-30, 30-40) instead of the exact value. It's probably enough for your analysis.
     ➤ E.g. use larger regions rather than zip codes, or remove zip codes for less populated areas.
     (A small sketch of this generalization follows.)
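
A minimal sketch of this kind of generalization with pandas; the toy columns, ages and bin edges are assumptions for illustration.

```python
import pandas as pd

# Toy dataset; ages and zip codes are invented for illustration.
df = pd.DataFrame({
    "age": [15, 19, 12, 18, 14, 14],
    "zipcode": ["75011", "75012", "69001", "75011", "69002", "69001"],
})

# Generalize exact ages into ranges.
df["age"] = pd.cut(df["age"], bins=[9, 14, 20], labels=["10-14", "15-20"])

# Generalize zip codes to a larger region (here, the first two digits).
df["zipcode"] = df["zipcode"].str[:2] + "xxx"

# The size of the smallest group sharing the same pseudo-identifiers
# is the k of the generalized dataset.
k = df.groupby(["age", "zipcode"], observed=True).size().min()
print(f"The generalized dataset is {k}-anonymous")
```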
  14. EXERCISE 2: EXTRACT FROM MS POMFREY'S RECORDS WITH 2-ANONYMITY
     ID | Age   | Gender | House         | Magical Disease
      1 | 15-20 | M      | Slyth./Griff. | Dragon pox
      2 | 15-20 | M      | Huff./Rav.    | Black cat flu
      3 | 10-14 | F      | Griff./Rav.   | Levitation sickness
      4 | 15-20 | F      | Slyth./Griff. | Petrification
      5 | 10-14 | M      | Slyth./Griff. | Hippogriff bite
      6 | 10-14 | M      | Slyth./Griff. | Dragon pox
      7 | 15-20 | M      | Huff./Rav.    | Black cat flu
      8 | 10-14 | F      | Griff./Rav.   | Levitation sickness
      9 | 15-20 | F      | Slyth./Griff. | Lycanthropy
     10 | 15-20 | M      | Slyth./Griff. | Hippogriff bite
     What can you tell me about Luna, a 14-year-old Ravenclaw girl?
  15. L-DIVERSITY
     ➤ k-anonymity offers plausible deniability to individuals, since we cannot know for sure that they are in the dataset.
     ➤ However, it is possible that all k individuals in the same group share the same value for a protected attribute.
     ➤ l-diversity ensures that each k-anonymous group contains at least l different values of a sensitive attribute (see the sketch below).
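
A minimal sketch of an l-diversity check in pandas; the column names and values are assumptions that mirror the exercise tables.

```python
import pandas as pd

# Toy 2-anonymous extract; values echo the exercise tables above.
df = pd.DataFrame({
    "age":     ["10-14", "10-14", "10-14", "10-14", "15-20", "15-20"],
    "gender":  ["F", "F", "M", "M", "M", "M"],
    "disease": ["Levitation sickness", "Levitation sickness",
                "Dragon pox", "Hippogriff bite",
                "Black cat flu", "Black cat flu"],
})

def l_diversity(df, quasi_identifiers, sensitive_attribute):
    """Smallest number of distinct sensitive values found in any
    group of rows sharing the same quasi-identifiers."""
    return df.groupby(quasi_identifiers)[sensitive_attribute].nunique().min()

print(l_diversity(df, ["age", "gender"], "disease"))
# -> 1: the (10-14, F) group only contains "Levitation sickness",
#    so this dataset is not even 2-diverse.
```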
  16. EXERCISE 3: EXTRACT FROM MS POMFREY'S RECORDS
     ID | Age   | Gender | House | Magical Disease
      3 | 10-14 | F      | \     | Levitation sickness
      5 | 10-14 | M      | \     | Hippogriff bite
      6 | 10-14 | M      | \     | Dragon pox
      8 | 10-14 | F      | \     | Levitation sickness
     12 | 10-14 | M      | \     | Common cold
     18 | 10-14 | F      | \     | Levitation sickness
     21 | 10-14 | M      | \     | Black cat flu
     25 | 10-14 | F      | \     | Levitation sickness
     28 | 10-14 | F      | \     | Dragon pox
     30 | 10-14 | M      | \     | Hippogriff bite
     What can you tell me about Luna, a 14-year-old girl?
  17. T-CLOSENESS
     ➤ An l-diverse dataset does not protect against statistical attacks.
     ➤ If 90% of the people in an l-diverse group share the same value for a protected attribute, then we can infer with high probability that a person in this group has that value.
     ➤ t-closeness ensures that, in each group, the distribution of a sensitive attribute does not differ "too much" from the overall distribution (a sketch follows).
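
A minimal sketch of a t-closeness check; note that the original definition measures distance with the Earth Mover's Distance, and total variation distance is used here as a simpler stand-in for categorical attributes (an assumption, not the paper's exact metric).

```python
def t_closeness(df, quasi_identifiers, sensitive_attribute):
    """Largest distance between a group's distribution of the
    sensitive attribute and the overall distribution, measured
    with total variation distance."""
    overall = df[sensitive_attribute].value_counts(normalize=True)
    t = 0.0
    for _, group in df.groupby(quasi_identifiers):
        dist = group[sensitive_attribute].value_counts(normalize=True)
        # Align the two distributions; missing categories count as 0.
        t = max(t, overall.sub(dist, fill_value=0).abs().sum() / 2)
    return t

# Reusing the DataFrame from the l-diversity sketch:
# t_closeness(df, ["age", "gender"], "disease")
```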
  18. FINAL THOUGHTS
     ➤ k-anonymity, l-diversity and t-closeness all limit the amount of information a legitimate user can get from the data.
     ➤ Find the right trade-off between utility of the data and privacy.
  19. EXERCISE 4: PUBLIC ANONYMOUS STATISTICS OF DUMBLEDORE'S ARMY
     Wednesday: dataset size: 72 people; members of the army: 17 people
     Thursday: dataset size: 73 people; members of the army: 18 people
     What can you tell me about Ginny, who was added on Wednesday night?
  20. DIFFERENTIAL PRIVACY
     ➤ The definition gets a bit mathy, so let's see it through an example.
     Pr[𝒜(D₁) ∈ S] ≤ exp(ε) × Pr[𝒜(D₂) ∈ S]
     for any set of outputs S and any two datasets D₁ and D₂ differing in a single record, where 𝒜 is the randomized algorithm answering our queries.
  21. DIFFERENTIAL PRIVACY: AN EXAMPLE
     ➤ We have some data about student membership in Dumbledore's Army.
     ➤ It's highly sensitive information; we cannot just store the true value.
     ➤ So let's sometimes lie about it. For the question "Is this person a member of Dumbledore's Army?":
        - With probability p, store the truth (Yes or No).
        - With probability 1-p, store a random answer: Yes with probability k, No with probability 1-k.
     (This coin-flip mechanism is sketched in code below.)
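
A minimal sketch of this coin-flip mechanism (randomized response) in Python; the default values for p and k are illustrative assumptions.

```python
import random

def randomized_response(is_member, p=0.5, k=0.5):
    """Store the true answer with probability p; otherwise store
    a random answer, "member" with probability k."""
    if random.random() < p:
        return is_member        # store the truth
    return random.random() < k  # store a random answer (which may happen to be true)
```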
  22. DIFFERENTIAL PRIVACY: WHY IS IT SO GREAT?
     ➤ You can't draw any definitive conclusion just by watching the dataset grow over time.
     ➤ If the probabilities p and k are known, we can adjust our estimators to get an unbiased estimate of aggregated values, with a limited loss of precision (see the sketch below).
  23. KEEP IN MIND
     ➤ Some graph libraries/formats can leak information, even when the graph looks fine (think JS-based graphs, or SVGs)
     ➤ Some ML models can leak private information, especially on minority classes: consider privacy-preserving machine learning methods
     ➤ Sometimes, even just knowing someone is part of a dataset is a privacy leak
  24. KEEP IN MIND
     ➤ Sometimes, sensitive information is not even linked to individual data (think of the Strava heatmap)
     ➤ Knowing more public information on someone can help you identify them (background knowledge attack)
  25. EXERCISE 5 - BACKGROUND KNOWLEDGE ATTACK
     Has Hermione Granger been levitating lately?
     ID | Age   | Gender | House | OWL  | Magical Disease
      3 | 10-14 | F      | \     | B    | Levitation sickness
      5 | 10-14 | M      | \     | C    | Hippogriff bite
      6 | 10-14 | M      | \     | B-   | Dragon pox
      8 | 10-14 | F      | \     | A-   | Levitation sickness
     12 | 10-14 | M      | \     | B+   | Common cold
     18 | 10-14 | F      | \     | F    | Levitation sickness
     21 | 10-14 | M      | \     | C+   | Black cat flu
     25 | 10-14 | F      | \     | A+++ | Levitation sickness
     28 | 10-14 | F      | \     | A+   | Dragon pox
     30 | 10-14 | M      | \     | E    | Hippogriff bite
  26. REFERENCES
     ➤ Tutorial on Data Privacy by Andreas Dewes and Katharine Jarmul: https://github.com/KIProtect/data-privacy-for-data-scientists
     ➤ k-anonymity: a model for protecting privacy, Latanya Sweeney
     ➤ Mondrian Multidimensional k-Anonymity, K. LeFevre, D. DeWitt, and R. Ramakrishnan
     ➤ Differential Privacy, Cynthia Dwork
     ➤ Encrypted statistical machine learning: new privacy preserving methods, L. Aslett, P. Esperança, and C. Holmes
     ➤ Communication-Efficient Learning of Deep Networks from Decentralized Data, H. Brendan McMahan, E. Moore, D. Ramage, S. Hampson, B. Agüera y Arcas