
PPML JGI


**Privacy-Preserving Machine Learning**: Machine learning on data you _cannot see_.

Privacy guarantees are among the most crucial requirements when analysing sensitive information. However, data anonymisation techniques alone do not always provide complete privacy protection; moreover, Machine Learning (ML) models can be exploited to _leak_ sensitive data when _attacked_ and no counter-measure is in place.

*Privacy-preserving machine learning* (PPML) methods hold the promise of overcoming these issues, making it possible to train machine learning models with strong privacy guarantees.

This workshop is organised in **two parts**. In the first part, we will explore an example of ML model exploitation (an _inference attack_) that reconstructs original data from a trained model, and we will then see how **differential privacy** can help us protect the privacy of our model with _minimum disruption_ to the original pipeline. In the second part, we will examine a more complicated scenario: training deep learning networks on encrypted data with specialised _distributed federated learning_ strategies.

Valerio Maggio

June 15, 2022

Transcript

  1. Privacy-Preserving Machine Learning
    Machine Learning on Data you’re not allowed to see
    [email protected]
    @leriomaggio
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-jgi


  2. Aim of this Tutorial
    Provide an overview of the emerging tools (in the ecosystem) for Privacy-Enhancing Technologies (a.k.a. PETs), with a focus on Machine Learning: Privacy-Preserving Machine Learning (PPML)


  3. SSI Fellowship Plans: PPML
    What I would like to do:
    • Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future
    • Joint effort of the Open Source, ML, and Security communities
    • I wish to disseminate the knowledge about these new methods and technologies among researchers
    • Focus on reproducibility of PPML workflows
    How I would like to do it: gather.town
    Any help or suggestions about use/data cases (or, more generally, case studies), or any contribution to shape the repository will be very much appreciated! Looking forward to collaborations and contributions ☺
    Awarded by JGI Seed-Corn Funding 2021: jeangoldinginstitute.blogs.bristol.ac.uk/2021/01/07/seed-corn-funding-winner-announcement/


  4. PPML Tutorial
    - Approach: Data Scientist
    - Always favour dev/practical aspects (tools & software)
    - Work on the full pipeline
    - Perspective: Researcher
    - References and further readings to learn more
    - Live Coding 🧑💻 (wish me luck! 🤞)
    - Non-live-coding bits will have exercises to play with.
    github.com/leriomaggio/ppml-tutorial
    Let’s switch to code to check that we’re all ready to start.


  5. Warm up
    DL Basics & PyTorch Quick Refresher


  6. Deep Learning Terms
    Everyone on the same page?


  7. Deep Learning Terms: everyone on the same page?
    • Epochs
    • Batches and mini-batch learning
    • Parameters vs hyperparameters (e.g. weights vs layers)
    • Loss & optimiser (e.g. cross-entropy & SGD)
    • Transfer learning
    • Gradient & backward propagation
    • Tensor
    • Generative Adversarial Networks (GANs)
    also ref: bit.ly/nvidia-dl-glossary


  8. Python has its say: Machine Learning & Deep Learning
    “There should be one-- and preferably only one --obvious way to do it.”
    The Zen of Python


  9. Multiple Frameworks?
    Data APIs: Standardization of N-dimensional arrays and dataframes, by Stephannie Jimenez Gacha

    https://2022.pycon.de/program/BMFVFG/


  10. Main features overview
    A review of the basic PyTorch features we will see soon


  11. Tensors, NumPy, Devices
    NumPy-like API: tensor -> ndarray, tensor <- ndarray
    CUDA support: torch.cuda 🙋
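    A minimal sketch of the tensor/NumPy interop and device handling mentioned above (shapes and values are illustrative):

```python
import numpy as np
import torch

# Tensor -> ndarray and back (shares memory when on CPU)
t = torch.ones(3, 2)
a = t.numpy()             # tensor -> ndarray
t2 = torch.from_numpy(a)  # ndarray -> tensor

# Move computation to the GPU when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
t_gpu = t.to(device)
```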


  12. torch.nn
    Module subclassing
    Definition of layers (i.e. tensors); definition of the graph (i.e. the network) 🙋
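    A minimal nn.Module subclass along these lines (the layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # definition of layers (i.e. the learnable tensors)
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        # definition of the graph (i.e. how data flows through the network)
        x = F.relu(self.fc1(x))
        return self.fc2(x)
```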


  13. Loss and Gradients
    Optimiser (torch.optim); criterion & loss (torch.nn); backprop & update 🙋
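    A typical PyTorch training step tying these pieces together; Net and train_loader come from the adjacent sketches:

```python
import torch.nn as nn
import torch.optim as optim

model = Net()                                       # from the previous sketch
criterion = nn.CrossEntropyLoss()                   # criterion & loss (torch.nn)
optimizer = optim.SGD(model.parameters(), lr=0.01)  # optimiser (torch.optim)

for inputs, targets in train_loader:                # DataLoader from the next slide
    inputs = inputs.view(inputs.size(0), -1)        # flatten images for the MLP
    optimizer.zero_grad()                           # reset accumulated gradients
    loss = criterion(model(inputs), targets)
    loss.backward()                                 # backprop: compute gradients
    optimizer.step()                                # update the parameters
```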


  14. Dataset and DataLoader (torch.utils.data)
    Transforms; Dataset; DataLoader 🙋
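    A minimal sketch using the usual torchvision/MNIST combination as an illustration:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# transforms: preprocessing applied to each sample
transform = transforms.Compose([transforms.ToTensor()])

# Dataset: an indexable collection of (sample, label) pairs
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)

# DataLoader: batching, shuffling, and parallel loading
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```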


  15. Let’s Introduce Privacy


  16. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for high-quality and curated* open datasets
    * More on the possible meanings of “curated” in the next slides!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage.
    Data and/or models are kept in silos!


  17. The Data vs Privacy AI Dilemma (cont.)
    Data accounting for privacy (privacy-preserving data)

  18. Privacy-Preserving Data
    Data Anonymisation Techniques: e.g. k-anonymity
    • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Data Anonymity: Dataset → 🔒 k-Anonymised Dataset (Algorithm #1, Algorithm #2, …, Algorithm #k) → Data Sharing
    https://github.com/leriomaggio/privacy-preserving-data-science
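    Concretely, a release satisfies k-anonymity when every combination of quasi-identifiers appears in at least k records. A minimal pandas check (the column names here are hypothetical):

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True if every quasi-identifier combination occurs in >= k rows."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

df = pd.DataFrame({
    "zip": ["37210", "37210", "37211", "37211"],
    "age_band": ["20-30", "20-30", "30-40", "30-40"],
    "diagnosis": ["A", "B", "A", "C"],  # sensitive attribute
})
print(is_k_anonymous(df, ["zip", "age_band"], k=2))  # True
```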


  19. Privacy-Preserving Data: Data Anonymity Issues
    Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
    “[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.”
    Linking Attack

  20. Why don’t we allow AI without
    moving data from their silos?


  21. Introducing: Federated Learning
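    Federated learning trains a shared model while the data never leaves each silo: clients train locally and only model updates travel to the server. A minimal sketch of the canonical FedAvg aggregation step (McMahan et al., 2017); the surrounding client loop is illustrative, not the tutorial's exact code:

```python
import copy
import torch

def federated_average(client_states):
    """FedAvg: element-wise mean of the clients' model parameters."""
    avg_state = copy.deepcopy(client_states[0])
    for key in avg_state:
        avg_state[key] = torch.stack(
            [state[key].float() for state in client_states]
        ).mean(dim=0)
    return avg_state

# One communication round (sketch): each client trains locally on its own
# data, then the server averages the resulting weights.
# client_states = [train_locally(copy.deepcopy(global_model), d).state_dict()
#                  for d in client_datasets]
# global_model.load_state_dict(federated_average(client_states))
```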


  22. So that’s it? Federated Learning to rule them all?


  23. Model Vulnerabilities
    Adversarial Examples
    ppml-tutorial/1-fast-gradient-sign-method
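    The fast gradient sign method perturbs an input by ε in the direction of the sign of the loss gradient. A minimal sketch (model, image, and label are assumed given; ε is illustrative):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.1):
    """Craft an adversarial example with the Fast Gradient Sign Method."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction that increases the loss
    adv_image = image + epsilon * image.grad.sign()
    return adv_image.clamp(0, 1).detach()  # keep a valid pixel range
```

    Even a tiny ε can flip the model's prediction while the perturbed image looks unchanged to a human.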


  24. Model Stealing
    Model Inversion Attacks
    ppml-tutorial/2-model-inversion-attack
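    In a model-inversion attack the adversary optimises a candidate input so that the model's confidence for a target class is maximised, gradually reconstructing a representative of the training data. A simplified gradient-descent sketch (input shape and step count are illustrative, not the notebook's exact attack):

```python
import torch

def invert_class(model, target_class, shape=(1, 1, 28, 28), steps=500, lr=0.1):
    """Reconstruct a representative input for `target_class` from a trained model."""
    x = torch.zeros(shape, requires_grad=True)
    optimizer = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Maximise the model's output for the target class
        loss = -model(x)[0, target_class]
        loss.backward()
        optimizer.step()
        x.data.clamp_(0, 1)  # stay in a valid input range
    return x.detach()
```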


  25. Introducing Differential Privacy
    Inspired from: Differential Privacy on PyTorch | PyTorch Developer Day 2020
    youtu.be/l6fbl2CBnq0


  26.–30. [Image-only slides. Source: pinterest.com/agirlandaglobe/]

  31. PPML with Differential Privacy
    https://ppml-workshop.github.io
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
    Like k-anonymity, DP is a formal notion of privacy (i.e. it’s possible to prove that a data release has the property). Unlike k-anonymity, however, differential privacy is a property of algorithms, and not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
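    The standard way for an algorithm to satisfy ε-differential privacy is the Laplace mechanism: perturb the query answer with noise scaled to the query's sensitivity. A minimal sketch for a counting query, whose sensitivity is 1:

```python
import numpy as np

def dp_count(data, predicate, epsilon=1.0):
    """ε-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one
    individual changes the true count by at most 1.
    """
    true_count = sum(1 for row in data if predicate(row))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [23, 45, 31, 67, 52]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```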


  32. Learning from Aggregates with Differential Privacy
    • Aggregate count on the data
    • Computing a mean
    • (Complex) training an ML model
    Differential Privacy within the ML pipeline
    ppml-tutorial/3-differential-privacy
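    For the "train an ML model" case, DP-SGD clips per-sample gradients and adds calibrated Gaussian noise. A hedged sketch using Opacus (the PyTorch DP library behind the Developer Day talk referenced on slide 25); the hyperparameters are illustrative and reflect recent Opacus versions:

```python
import torch
from opacus import PrivacyEngine

model = Net()  # from the earlier sketch
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

privacy_engine = PrivacyEngine()
# Wraps model/optimizer/loader so each step clips per-sample
# gradients (max_grad_norm) and adds Gaussian noise.
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # illustrative
    max_grad_norm=1.0,      # illustrative
)
# ...train as usual; the spent privacy budget can then be queried:
# epsilon = privacy_engine.get_epsilon(delta=1e-5)
```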


  33. Going back to: Federated Learning


  34. Federated Learning & Encryption


  35. Federated Learning & Homomorphic Encryption
    https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/
    ppml-tutorial/4-federeted-learning
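    The linked post uses the CKKS scheme, which supports approximate arithmetic directly on encrypted vectors. A minimal sketch with TenSEAL, OpenMined's wrapper around Microsoft SEAL (the parameter choices are standard illustrative values):

```python
import tenseal as ts

# CKKS context: scheme parameters trade off precision, depth, and security
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

# Encrypt two vectors and compute on the ciphertexts
enc_a = ts.ckks_vector(context, [1.0, 2.0, 3.0])
enc_b = ts.ckks_vector(context, [4.0, 5.0, 6.0])
enc_sum = enc_a + enc_b      # homomorphic addition
enc_dot = enc_a.dot(enc_b)   # homomorphic dot product

print(enc_sum.decrypt())  # ≈ [5.0, 7.0, 9.0]
print(enc_dot.decrypt())  # ≈ [32.0]
```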


  36. Thank you very much for your kind attention
    Valerio Maggio
    [email protected]
    @leriomaggio
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-jgi
