PPML Mozfest23

Privacy guarantees are one of the most crucial requirements when it comes to analysing sensitive information. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, Machine Learning (ML) models can also be exploited to _leak_ sensitive data when _attacked_ if no countermeasure is put in place.

*Privacy-preserving machine learning* (PPML) methods hold the promise of overcoming all those issues, allowing machine learning models to be trained with full privacy guarantees.

This workshop is organised in **two parts**. In the first part, we will explore one example of ML model exploitation (i.e. an _inference attack_) to reconstruct original data from a trained model, and we will then see how **differential privacy** can help us protect the privacy of our model, with _minimum disruption_ to the original pipeline. In the second part of the workshop, we will examine a more complicated ML scenario: training deep learning networks on encrypted data, with specialised _distributed federated learning_ strategies.

Valerio Maggio

March 22, 2023

Transcript

  1. Privacy Preserving Machine Learning
    Machine Learning on Data you’re not allowed to see
    @leriomaggio
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-mozfest23
    [email protected]

  2. also me
    • Background in CS
    • PhD in Machine Learning
    • Research: ML/DL for BioMedicine
    • SSI Fellow
    • Python Geek
    • Data Scientists Advocate _at_ Anaconda
    me
    pun
    Who?

  3. Let’s Introduce Privacy

  4. Why Privacy is important
    The Facebook-Cambridge Analytica Scandal (2014-18)
    2014: A Facebook quiz called “This Is Your Digital Life” invited users to find out their personality type. The app collected data from participants but also recorded public data from those in their friends list.
    2015: The Guardian reported that Cambridge Analytica had data from this app and used it to psychologically profile voters in the US.
    • 305K people installed the app
    • 87M people's info gathered
    2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users’ consent. Facebook apologised for the data scandal and announced changes to the privacy settings.

  5. What about Machine Learning ?
    Human Learning ≠ Machine Learning
    [Figure labels: R, G, B]

  6. What about Machine Learning ?
    Human Learning ≠ Machine Learning: Different Challenges
    [Figure labels: APPLE, APPLE??]
    Machine Learning instead may require millions of samples even for a “simple” task
    > ML models are data hungry

  7. What about Machine Learning ?
    Human Learning ≠ Machine Learning: Different Challenges
    [Figure labels: APPLE, APPLE??]
    MS Researchers demonstrated that different models (even fairly simple ones) performed almost identically on Natural Language disambiguation tasks [1]
    Data >> Model ?
    [1]: Scaling to very very large corpora for natural language disambiguation, Banko M., Brill E., ACL '01: Proceedings of the 39th Annual ACL Meeting doi.org/10.3115/1073012.1073017
    See Also: “The Unreasonable Effectiveness of Data”, Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009

  8. The Data Vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of Curated in the next slides!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos!

  9. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of Curated in the next slide!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos!
    Data accounting for privacy (privacy preserving data)

  10. Privacy-Preserving Data
    Data Anonymisation Techniques: e.g. k-anonymity
    • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Privacy as a property of Data
    [Diagram labels: Dataset, 🔒 K-Anonymised Dataset, Algorithm #1, Algorithm #2, Algorithm #k, Data Sharing]
    https://github.com/leriomaggio/privacy-preserving-data-science

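To make the k-anonymity idea concrete, here is a minimal sketch that measures the k of a small table: the size of the smallest group of rows sharing the same quasi-identifier values. The function name `k_anonymity`, the column names, and the records are all hypothetical and chosen only for illustration; they do not come from the talk or its repository.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    k is the size of the smallest group of rows that share the same
    combination of quasi-identifier values.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical, already generalised records (age ranges, truncated ZIP codes).
records = [
    {"age": "30-39", "zip": "481**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "481**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "482**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "482**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "482**", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age", "zip"])
print(f"The table is {k}-anonymous with respect to (age, zip)")
```

Note that k depends on which columns are treated as quasi-identifiers, which is one reason the guarantee is a property of the released data rather than of any algorithm applied to it.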

  11. Privacy-Preserving Data
    Privacy as a property of Data
    Patients Medical Records
    (Sanitised) History of Medical Prescriptions
    Patients bought meds from which pharmacy
    Pharmacies visited
    Roughly infer ZIP codes and residency (even without address info: e.g. most visited pharmacy)
    Correlation between medications and disease

  12. Data Privacy Issues
    Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
    […] (we) show how these methods can be used in practice to
    de-anonymize the Netflix Prize dataset, a 500,000-record
    public dataset.
    Linking Attack


  13. Threats and Attacks for ML systems
    2008: De-Anonymisation (re-identification)
    2011: Reconstruction Attack
    2013: Parameter Inference Attack
    2015: Model Inversion Attacks
    2017: Membership Inference Attacks
    [Diagram labels: ML Model, MLaaS, White Box Attacks, Black Box Attacks, Perimeter]
    • privacy has to be implemented systematically without using arbitrary mechanisms.
    • ML applications are prone to different privacy and security threats.

  14. Utility vs Privacy Dilemma
    The real challenge is balancing privacy and performance in ML
    applications so that we can better utilise the data while ensuring the
    privacy of the individuals.
    Image Credits: Johannes Stutz (@daflowjoe)

  15. PPML Interactive Session
    - Approach: Data Scientist
      - Always favour dev/practical aspects (tools & sw)
      - Work on the full pipeline
    - Perspective: Researcher
    - References and Further Readings to know more
    - Live Coding 🧑‍💻 (wish me luck! 🤞)
    - non-live coding bits will have exercises to play with.

  16. 1. Model Threats

  17. Getting Started
    github.com/leriomaggio/ppml-tutorial
    Let’s switch to code to check that we’re all ready to start

  18. Model Vulnerabilities
    Adversarial Examples
    ppml-tutorial/1-fast-gradient-sign-method

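The fast gradient sign method (FGSM) named in the notebook path perturbs an input by a small step in the direction of the sign of the loss gradient: x_adv = x + eps * sign(dL/dx). The sketch below applies it to a hand-written logistic regression so that it runs with the standard library alone. The model weights, the input, and `eps` are assumptions made for this example; the tutorial notebook itself may use a different model and framework.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def fgsm(w, b, x, y, eps):
    """Return x + eps * sign(dL/dx), with L the binary cross-entropy loss."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]          # closed-form dL/dx for this model
    sign = [(g > 0) - (g < 0) for g in grad]   # -1, 0 or +1 per feature
    return [xi + eps * s for xi, s in zip(x, sign)]

# Hypothetical model and input, for illustration only.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.5)
print("clean prediction:      ", round(predict(w, b, x), 3))
print("adversarial prediction:", round(predict(w, b, x_adv), 3))
```

With these values the clean input is classified as class 1 and the perturbed input falls below the 0.5 decision threshold, although each feature moved by at most `eps`.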

  19. Model Stealing
    Model Inversion Attacks
    ppml-tutorial/2-mia-differential-privacy

  20. Introducing Differential Privacy
    Inspired by: Differential Privacy on PyTorch | PyTorch Developer Day 2020
    youtu.be/l6fbl2CBnq0

  21. Source: pinterest.com/agirlandaglobe/

  22. PPML with Differential Privacy
    https://ppml-workshop.github.io
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
    Like k-Anonymity, DP is a formal notion of privacy (i.e. it’s possible to prove that a data release has the property).
    Unlike k-Anonymity, however, differential privacy is a property of algorithms, and not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.

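For reference, the usual formal statement behind "a property of algorithms" is ε-differential privacy. The symbols below are not on the slide and follow the common textbook convention: a randomised algorithm M, two neighbouring datasets D and D' that differ in one individual's record, any set S of possible outputs, and a privacy budget ε ≥ 0.

```latex
\Pr\bigl[\mathcal{M}(D) \in S\bigr] \;\le\; e^{\varepsilon} \, \Pr\bigl[\mathcal{M}(D') \in S\bigr]
\qquad \text{for all neighbouring } D, D' \text{ and all } S \subseteq \operatorname{Range}(\mathcal{M})
```

The smaller ε is, the closer the two output distributions must be, and so the less any single individual's record can influence what is released.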

  23. Learning from Aggregates
    Introducing OPACUS
    Differential Privacy within the ML Pipeline
    ppml-tutorial/2-mia-differential-privacy
    • Aggregate Count on the Data
    • Computing Mean
    • (Complex) Train ML model

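The first two aggregates listed above (count and mean) can be released with differential privacy using the Laplace mechanism, which adds noise scaled to sensitivity / ε. This is a minimal sketch under stated assumptions: the function names and the `ages` data are hypothetical, the dataset size is treated as public, and neighbouring datasets are assumed to differ by replacing one record. It is not code from the tutorial notebook.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def dp_count(values, predicate, epsilon, rng):
    """Noisy count: one record changes the true count by at most 1."""
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)

def dp_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of values clipped to [lower, upper], with len(values) assumed public."""
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    sensitivity = (upper - lower) / n      # effect of replacing one clipped record
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical data, for illustration only.
ages = [23, 35, 41, 52, 67]
rng = random.Random(0)

print("noisy count of ages > 40:", dp_count(ages, lambda a: a > 40, epsilon=1.0, rng=rng))
print("noisy mean age:          ", dp_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` mean more noise and stronger privacy. Clipping is what makes the sensitivity of the mean bounded in the first place.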

  24. Learning from Aggregates
    Introducing OPACUS
    ppml-tutorial/2-mia-differential-privacy

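Training a model with differential privacy is commonly done with DP-SGD: clip each per-sample gradient to a maximum L2 norm, sum the clipped gradients, add Gaussian noise proportional to that norm, and only then take the optimiser step. The sketch below shows one such step in plain Python. It does not use Opacus and is not its API; the function names, the parameter names (borrowed from common DP-SGD terminology), and the gradients are assumptions made for illustration.

```python
import math
import random

def clip(grad, max_grad_norm):
    """Scale a gradient down so that its L2 norm is at most max_grad_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_grad_norm / (norm + 1e-12))
    return [g * factor for g in grad]

def dp_sgd_step(params, per_sample_grads, lr, max_grad_norm, noise_multiplier, rng):
    """One DP-SGD update: clip per sample, sum, add Gaussian noise, average, step."""
    clipped = [clip(g, max_grad_norm) for g in per_sample_grads]
    summed = [sum(column) for column in zip(*clipped)]
    std = noise_multiplier * max_grad_norm
    noisy = [s + rng.gauss(0.0, std) for s in summed]
    batch_size = len(per_sample_grads)
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]

# Hypothetical per-sample gradients for a two-parameter model.
params = [0.0, 0.0]
per_sample_grads = [[3.0, 4.0], [0.3, 0.4]]   # L2 norms 5.0 and 0.5
rng = random.Random(0)

new_params = dp_sgd_step(params, per_sample_grads, lr=0.1,
                         max_grad_norm=1.0, noise_multiplier=1.1, rng=rng)
print(new_params)
```

Clipping bounds how much any one training example can move the model, and the noise hides what remains of that influence. Tracking the privacy budget spent over many steps is a separate accounting problem that this sketch leaves out.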

  25. Why don’t we allow AI without moving data from their silos?
  26. Introducing: Federated Learning

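The core loop of federated learning can be sketched with federated averaging (FedAvg): each client improves the current global model on its own data, and a server averages the resulting weights, weighted by client dataset size, without ever seeing the raw data. The example below is an illustrative sketch only. The tiny 1-D linear regression model, the client data, and the hyperparameters are assumptions, and the talk's own notebook may use a different strategy.

```python
def local_update(weights, data, lr):
    """One gradient-descent step of y ≈ w * x + b on a single client's data."""
    w, b = weights
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return (w - lr * grad_w, b - lr * grad_b)

def federated_average(client_weights, client_sizes):
    """Average client weights, weighting each client by its number of samples."""
    total = sum(client_sizes)
    return tuple(
        sum(weights[i] * size for weights, size in zip(client_weights, client_sizes)) / total
        for i in range(len(client_weights[0]))
    )

def train(clients, rounds=1000, lr=0.05):
    """Server loop: broadcast the global weights, collect local updates, average."""
    global_weights = (0.0, 0.0)
    sizes = [len(data) for data in clients]
    for _ in range(rounds):
        updates = [local_update(global_weights, data, lr) for data in clients]
        global_weights = federated_average(updates, sizes)
    return global_weights

# Hypothetical silos: both clients hold samples of y = 2x + 1, never pooled.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]

w, b = train(clients)
print(f"learned w = {w:.3f}, b = {b:.3f}")
```

Only model weights cross the network in this scheme. Those weights can still leak information about the training data, which is one motivation for combining federated learning with differential privacy or encryption.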

  27. So that’s it ?

    Federated Learning to rule them all ?

  28. Federated Learning & Encryption

  29. Federated Learning & Homomorphic Encryption
    https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/
    ppml-tutorial/3-federeted-learning

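To show what "computing on encrypted data" means, here is a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts. Note the substitution: the linked post covers CKKS, a different scheme that supports approximate arithmetic on real-valued vectors. Paillier is used here only because it fits in a few lines of standard-library Python (3.8 or later). The key size is far too small to be secure, and all function names and values are assumptions made for illustration.

```python
import math
import random

def keygen(p, q):
    """Build a Paillier key pair from two primes, using the generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                                # modular inverse of lam mod n
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Encrypt an integer 0 <= m < n under the public key n."""
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n

def add_encrypted(n, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    return (c1 * c2) % (n * n)

# Toy primes (two Mersenne primes); real deployments need far larger keys.
n, private_key = keygen(2**31 - 1, 2**61 - 1)
rng = random.Random(0)   # fixed seed for reproducibility; not cryptographically secure

# Hypothetical integer updates from three clients, encrypted before sending.
client_updates = [3, 5, 7]
ciphertexts = [encrypt(n, u, rng) for u in client_updates]

# The server aggregates ciphertexts without seeing any individual update.
encrypted_sum = ciphertexts[0]
for c in ciphertexts[1:]:
    encrypted_sum = add_encrypted(n, encrypted_sum, c)

total = decrypt(n, private_key, encrypted_sum)
print("decrypted sum of updates:", total)
```

In a federated setting, a server could aggregate encrypted client updates this way, with only the key holder able to decrypt the sum.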

  30. Wrap up
    • Part 1: Model Vulnerabilities and Attacks
      • Adversarial Examples (in Practice)
      • Model Inversion Attack
    • Part 3: DL and Differential Privacy (DP)
      • Model Training with DP
    • Part 2: Federated Machine Learning
      • Federated Data
      • Federated Learning (SplitNN)

  31. Thank you very much for your kind attention
    Valerio Maggio
    @leriomaggio
    [email protected]
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-mozfest23