
Privacy Preserving Machine Learning (PPML) @ SciPy 2023

Privacy guarantees are **the** most crucial requirement when analysing sensitive data. These requirements can sometimes be so stringent that they become a real barrier for the entire pipeline. The reasons are manifold: data often cannot be _shared_ nor moved out of the silos where they reside, let alone analysed in their _raw_ form. As a result, _data anonymisation techniques_ are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy will be completely preserved. Moreover, the _memoisation_ effect of deep learning models can be maliciously exploited to _attack_ the models and _reconstruct_ sensitive information about the samples used in training, even if that information was never explicitly provided.

*Privacy-preserving machine learning* (PPML) methods hold the promise to overcome all of these issues, allowing machine learning models to be trained with strong privacy guarantees.

This workshop is organised in **three** parts. In the first part, we will introduce the main concepts of **differential privacy**: what it is, and how it differs from more classical _anonymisation_ techniques (e.g. `k-anonymity`). In the second part, we will focus on machine learning experiments: we will start by demonstrating how DL models can be exploited (i.e. via _inference attacks_) to reconstruct original data solely by analysing model predictions, and then explore how **differential privacy** can help us protect the privacy of our model, with _minimum disruption_ to the original pipeline. Finally, we will conclude the tutorial by considering more complex ML scenarios in which deep learning networks are trained on encrypted data, using specialised _distributed federated learning_ strategies.

Valerio Maggio

July 10, 2023

Transcript

  1. Privacy Preserving Machine Learning
    Machine Learning on Data you’re not allowed to see
    @leriomaggio
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-scipy
    [email protected]


  2. Who?
    • Background in CS
    • PhD in Machine Learning
    • Research: ML/DL for BioMedicine
    • SSI Fellow
    • Python Geek
    • Data Scientist Advocate _at_ Anaconda


  3. Aim of this Tutorial
    Privacy-Preserving Machine Learning (PPML)
    Provide an overview of the emerging tools (in the ecosystem) for Privacy Enhancing Technologies (a.k.a. PETs), with a focus on Machine Learning.


  4. SSI Fellowship: PPML
    What I would like to do
    • Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future
    • Joint effort of Open Source & ML & Security Communities
    • I wish to disseminate the knowledge about these new methods and technologies among researchers
    • Focus on Reproducibility of PPML workflows
    Any help or suggestions about use/data cases or, more generally, case studies, or any contribution to shape the repository, will be very much appreciated!
    Looking forward to collaborations and contributions ☺
    Awarded by JGI Seed-Corn Funding 2021
    jeangoldinginstitute.blogs.bristol.ac.uk/2021/01/07/seed-corn-funding-winner-announcement/


  5. Let’s Introduce Privacy


  6. Why Privacy is important
    The Facebook-Cambridge Analytica Scandal (2014-18)
    2014: A Facebook quiz called “This Is Your Digital Life” invited users to find out their personality type. The app collected data from participants but also recorded public data from those in their friends list.
    2015: The Guardian reported that Cambridge Analytica had data from this app and used it to psychologically profile voters in the US. 305K people installed the app; 87M people’s info was gathered.
    2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users’ consent. Facebook apologised for the data scandal and announced changes to the privacy settings.


  7. What about Machine Learning?
    Human Learning ≠ Machine Learning


  8. What about Machine Learning?
    Human Learning ≠ Machine Learning: Different Challenges
    Machine Learning instead may require millions of samples even for a “simple” task
    > ML models are data hungry


  9. What about Machine Learning?
    Human Learning ≠ Machine Learning: Different Challenges
    MS Researchers demonstrated that different models (even fairly simple ones) performed almost identically on Natural Language disambiguation tasks [1]
    Data >> Model ?
    [1]: Scaling to very very large corpora for natural language disambiguation, Banko M., Brill E., ACL '01: Proceedings of the 39th Annual ACL Meeting, doi.org/10.3115/1073012.1073017
    See Also: “The Unreasonable Effectiveness of Data”, Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009


  10. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of “Curated” in the next slides!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos!


  11. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of “Curated” in the next slide!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos! → Data accounting for privacy (privacy preserving data)


  12. Privacy-Preserving Data
    Privacy as a property of Data
    Data Anonymisation Techniques: e.g. k-anonymity
    • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Dataset → 🔒 K-Anonymised Dataset → Data Sharing (Algorithm #1, Algorithm #2, …, Algorithm #k)
    https://github.com/leriomaggio/privacy-preserving-data-science
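
To make the k-anonymity idea concrete, here is a minimal sketch of how the property can be checked with pandas; the column names and records are made up for illustration and are not taken from the tutorial repository.

```python
import pandas as pd

# Toy "released" table: quasi-identifiers plus one sensitive attribute.
df = pd.DataFrame({
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "zip_code":  ["BS8",   "BS8",   "BS8",   "BS1",   "BS1"],
    "diagnosis": ["flu",   "asthma", "flu",  "flu",   "asthma"],
})

quasi_identifiers = ["age_band", "zip_code"]

def k_anonymity(data, quasi_ids):
    """Return k: the size of the smallest group of records sharing
    the same combination of quasi-identifier values."""
    return int(data.groupby(quasi_ids).size().min())

print(k_anonymity(df, quasi_identifiers))  # 2 -> the table is 2-anonymous
```

The table is k-anonymous when every individual is indistinguishable from at least k-1 others on the quasi-identifiers; generalisation and suppression are the usual ways to raise k before sharing.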


  13. Privacy-Preserving Data
    Privacy as a property of Data
    Patients Medical Records → (Sanitised) History of Medical Prescriptions
    Even from the sanitised history one can still infer:
    • which pharmacy patients bought their meds from, and which pharmacies they visited
    • rough ZIP codes and residency (even without address info: e.g. from the most visited pharmacy)
    • correlations between medications and disease


  14. Data Privacy Issues
    Linking Attack
    “[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.”
    Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
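
The linking attack mentioned here can be illustrated with a tiny, entirely fabricated example: an “anonymised” release is joined with a public auxiliary dataset on shared quasi-identifiers, re-identifying individuals. This is a hedged sketch of the general technique, not the Netflix Prize analysis itself.

```python
import pandas as pd

# "Anonymised" release: names removed, quasi-identifiers kept.
released = pd.DataFrame({
    "zip":        ["02139", "02139", "90210"],
    "birth_year": [1971,    1985,    1971],
    "sex":        ["F",     "M",     "M"],
    "diagnosis":  ["heart disease", "flu", "asthma"],
})

# Public auxiliary data that does contain names (e.g. a voter roll).
auxiliary = pd.DataFrame({
    "name":       ["Alice", "Bob"],
    "zip":        ["02139", "90210"],
    "birth_year": [1971,    1971],
    "sex":        ["F",     "M"],
})

# Re-identification: join on the quasi-identifiers both tables share.
linked = auxiliary.merge(released, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
# Alice -> heart disease, Bob -> asthma
```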


  15. Threats and Attacks for ML systems
    2008: De-Anonymisation (re-identification)
    2011: Reconstruction Attack
    2013: Parameter Inference Attack
    2015: Model Inversion Attacks
    2017: Membership Inference Attacks
    ML Model / MLaaS perimeter: White Box Attacks vs Black Box Attacks
    • Privacy has to be implemented systematically, without using arbitrary mechanisms.
    • ML applications are prone to different privacy and security threats.
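
As a rough intuition for the “Membership Inference Attacks” entry above, the simplest black-box variant just thresholds the model’s confidence: overfitted models tend to be much more confident on records they were trained on. A hypothetical sketch (the `model.predict_proba` interface follows the scikit-learn convention and is not code from the tutorial):

```python
import numpy as np

def membership_inference(model, X, threshold=0.9):
    """Toy black-box membership inference via a confidence threshold:
    guess 'training member' whenever the model is very sure of its prediction."""
    probabilities = model.predict_proba(X)      # shape: (n_samples, n_classes)
    confidence = np.max(probabilities, axis=1)  # confidence of the predicted class
    return confidence > threshold               # True -> likely seen during training
```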


  16. Utility vs Privacy Dilemma
    The real challenge is balancing privacy and performance in ML applications so that we can better utilise the data while ensuring the privacy of the individuals.
    Image Credits: Johannes Stutz (@daflowjoe)


  17. PPML Interactive Session
    - Approach: Data Scientist
      - Always favour dev/practical aspects (tools & sw)
      - Work on the full pipeline
    - Perspective: Researcher
      - References and Further Readings to know more
    - Live Coding 🧑‍💻 (wish me luck! 🤞)
      - Non-live coding bits will have exercises to play with.


  18. 1. Model Threats


  19. Model Vulnerabilities
    Adversarial Examples
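
A common way to craft the adversarial examples this slide refers to is the Fast Gradient Sign Method (FGSM); the following is a minimal PyTorch sketch, not necessarily the attack demonstrated in the tutorial notebooks.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft adversarial examples by nudging the input in the direction
    that increases the loss (sign of the input gradient)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixels in the valid range
```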


  20. Model Stealing
    Model Inversion Attacks


  21. Model Stealing
    Model Inversion Attacks


  22. Introducing Differential Privacy
    Inspired by: Differential Privacy on PyTorch | PyTorch Developer Day 2020
    youtu.be/l6fbl2CBnq0


  24. Source: pinterest.com/agirlandaglobe/


  28. PPML with Differential Privacy
    https://ppml-workshop.github.io
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
    Like k-Anonymity, DP is a formal notion of privacy (i.e. it’s possible to prove that a data release has the property).
    Unlike k-Anonymity, however, differential privacy is a property of algorithms, and not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
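
For reference, the formal statement behind this slide: a randomised algorithm M is ε-differentially private if, for all pairs of datasets D and D' differing in a single individual's record and for every set of outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

Smaller ε means the output distribution barely changes when any one person's record is added or removed, which is exactly why DP is a property of the algorithm rather than of the released data.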


  29. Learning from Aggregates
    Introducing OPACUS
    Differential Privacy within the ML Pipeline
    ppml-tutorial/2-mia-differential-privacy
    • Aggregate Count on the Data
    • Computing Mean
    • (Complex) Train ML model
    (a minimal sketch of the first two queries follows below)
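
A minimal NumPy sketch of differentially private count and mean via the Laplace mechanism; the data, clipping bounds and budget split are illustrative and simplified compared to the tutorial notebooks.

```python
import numpy as np

def dp_count(values, epsilon=1.0):
    """Noisy count: a counting query has sensitivity 1."""
    return len(values) + np.random.laplace(scale=1.0 / epsilon)

def dp_mean(values, lower, upper, epsilon=1.0):
    """Crude DP mean: clip values to [lower, upper], then spend half of the
    budget on a noisy sum and half on a noisy count."""
    clipped = np.clip(values, lower, upper)
    noisy_sum = clipped.sum() + np.random.laplace(scale=(upper - lower) / (epsilon / 2))
    noisy_count = len(clipped) + np.random.laplace(scale=1.0 / (epsilon / 2))
    return noisy_sum / noisy_count

ages = np.array([34, 29, 41, 55, 23, 38])
print(dp_count(ages, epsilon=0.5))
print(dp_mean(ages, lower=18, upper=90, epsilon=0.5))
```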


  30. Learning from Aggregates
    Introducing OPACUS
    ppml-tutorial/2-mia-differential-privacy
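
Opacus hooks differential privacy (DP-SGD: per-sample gradient clipping plus Gaussian noise) into a standard PyTorch training loop. A self-contained sketch with a toy model and random data; the hyperparameters are placeholders, not the tutorial's settings.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data, standing in for the tutorial's real pipeline.
model = nn.Sequential(nn.Linear(20, 2))
optimizer = optim.SGD(model.parameters(), lr=0.05)
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,))),
    batch_size=32,
)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # scale of the Gaussian noise added to gradients
    max_grad_norm=1.0,     # per-sample gradient clipping threshold
)

criterion = nn.CrossEntropyLoss()
for x, y in train_loader:  # one epoch of DP-SGD
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f} at delta = 1e-5")
```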


  31. Why don’t we allow AI without moving data from their silos?


  32. Introducing: Federated Learning
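
Federated learning in one picture: data never leaves the clients; only model updates travel. Below is a minimal FedAvg sketch in plain PyTorch; treat it as an illustration of the idea under simplified assumptions, not the workshop code, which relies on dedicated libraries.

```python
import copy
import torch
from torch import nn

def federated_averaging(global_model, client_loaders, rounds=5, lr=0.05):
    """Each round: every client trains a copy of the model on its own silo,
    then the server averages the returned weights. Raw data is never shared."""
    for _ in range(rounds):
        client_states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for x, y in loader:  # one local epoch
                opt.zero_grad()
                nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            client_states.append(local.state_dict())
        # Server step: parameter-wise average of the client weights.
        averaged = {
            key: torch.stack([state[key].float() for state in client_states]).mean(dim=0)
            for key in client_states[0]
        }
        global_model.load_state_dict(averaged)
    return global_model
```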


  33. So that’s it?
    Federated Learning to rule them all?


  34. Federated Learning & Encryption


  35. Federated Learning & Homomorphic Encryption
    https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/
    ppml-tutorial/3-federeted-learning
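
The blog post above covers CKKS, a homomorphic encryption scheme for approximate arithmetic over encrypted real numbers. As a taste of what that enables, here is a small sketch with TenSEAL (an OpenMined library); the encryption parameters are typical example values, not a vetted production configuration.

```python
import tenseal as ts

# CKKS context: approximate arithmetic on encrypted vectors of real numbers.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

plain = [1.0, 2.0, 3.0, 4.0]
encrypted = ts.ckks_vector(context, plain)

# Computations happen directly on the ciphertext.
result = encrypted * 2 + [1, 1, 1, 1]
print(result.decrypt())  # approximately [3.0, 5.0, 7.0, 9.0]
```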


  36. Wrap up
    • Part 1: Data Anonymisation
      • K-anonymity
    • Part 2: Differential Privacy
      • Properties & DP for ML
    • Part 3: Model Vulnerabilities and Attacks
      • Adversarial Examples (in Practice)
      • Model Inversion Attack
    • Part 4: Federated Machine Learning
      • Federated Data
      • Federated Learning (SplitNN)


  37. Thank you very much for your kind attention
    Valerio Maggio
    @leriomaggio
    [email protected]
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-scipy
