
PPML PyConDE

Privacy is to date one of the major impediments for Machine Learning (ML) when applied to sensitive datasets. One popular example is ML applied to the medical domain, but this generally extends to any scenario in which sensitive data cannot be shared, or simply cannot be used. Moreover, data anonymisation methods are not enough to guarantee that privacy will be completely preserved: it is possible to exploit the _memoisation_ effect of DL models to extract sensitive information about individual samples, and about the original dataset used for training. However, *privacy-preserving machine learning* (PPML) methods promise to overcome all these issues, allowing Machine Learning models to be trained on "data that cannot be seen".

The workshop will be organised in **two parts**: (1) in the first part, we will work on attacks against Deep Learning models, leveraging their vulnerabilities to extract insights about the original (sensitive) data. We will then explore potential counter-measures to work around these issues.
Examples will include image data as well as textual data, where attacks and counter-measures highlight different nuances and corner cases.
(2) In the second part of the workshop, we will delve into PPML methods, focusing on mechanisms to train DL networks on encrypted data, as well as on specialised _distributed federated_ training strategies for multiple _sensitive_ datasets.

Valerio Maggio

April 13, 2022

Transcript

  1. Privacy Preserving Machine Learning
    Machine Learning on Data you’re not allowed to see
    [email protected]
    @leriomaggio
    github.com/leriomaggio/ppml-pyconde
    speakerdeck.com/leriomaggio/ppml-pyconde


  2. Aim of this Tutorial
    Privacy-Preserving Machine Learning (PPML)
    - Approach: Data Scientist
      - Always favour dev/practical aspects (tools & sw)
      - Work on the full pipeline
    - Perspective: Researcher
      - References and Further Readings to know more
    - Live Coding 🧑‍💻 (wish me luck! 🤞)
      - non-live coding bits will have exercises to play with.
    Provide an overview of the emerging technologies (in the Python ecosystem) for Privacy Protection, with specific focus on Machine Learning


  3. SSI Fellowship Plans: PPML
    • Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future
    • Joint effort of the Open Source & ML & Security Communities
    • I wish to disseminate the knowledge about these new methods and technologies among researchers
    • Focus on Reproducibility of PPML workflows
    What I would like to do / How I would like to do it
    gather.town
    Any help or suggestions about Use/Data cases or, more generally, Case studies, or any contribution to shape the repository will be very much appreciated!
    Looking forward to collaborations and contributions ☺


  4. Warm up
    DL Basics & PyTorch Quick Refresher


  5. Deep Learning Terminology
    Everyone on the same page? (also ref: bit.ly/nvidia-dl-glossary)
    Epochs
    Batches and mini-batch learning
    Parameters vs HyperParameters (e.g. weights vs layers)
    Loss & Optimiser (e.g. Cross Entropy & SGD)
    Transfer learning
    Gradient & Backward Propagation
    Tensor


  6. Python has its say
    Machine Learning
    Deep Learning
    “There should be one-- and preferably only one --obvious way to do it”
    The Zen of Python


  7. Multiple Frameworks?
    Data APIs: Standardization of N-dimensional arrays and dataframes, by Stephannie Jimenez Gacha

    https://2022.pycon.de/program/BMFVFG/


  8. Deep Learning Frameworks
    Static Graph vs Dynamic Graph
    Computational Graph Models: a Linear (or Dense) layer computes σ(xᵀW + b) from input x, weights W, and bias b.
    [figure: the same network (layers fc1..fc5, prediction y’ vs target y, loss L) built once as a static graph, vs rebuilt at every step as a dynamic graph (epoch 1, batch 1; epoch 1, batch 2)]


  9. Deep Learning Frameworks
    Static Graph vs Dynamic Graph
    Backwards and Gradient Computation for the Linear (or Dense) layer σ(xᵀW + b)
    [figure: in a static graph, backprop runs over the pre-compiled graph; in a dynamic graph, Autograd uses Record & Replay over the graph traced at each forward pass]
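    A minimal sketch of the dynamic (define-by-run) style: PyTorch records the graph of the Linear unit σ(xᵀW + b) as the forward pass executes, then Autograd replays it backwards. The shapes here are arbitrary.

```python
import torch

# A linear unit sigma(x^T W + b): the graph is recorded dynamically
# as the operations execute (define-by-run).
x = torch.randn(1, 4)                      # input sample
W = torch.randn(4, 3, requires_grad=True)  # weights, tracked by autograd
b = torch.zeros(3, requires_grad=True)     # bias, tracked by autograd

y = torch.sigmoid(x @ W + b)  # forward pass builds the graph
y.sum().backward()            # "replay": gradients via backprop

print(W.grad.shape, b.grad.shape)  # gradients w.r.t. W and b
```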

  10. Main features overview
    review of basic PyTorch features we will see soon


  11. Tensors, NumPy, Devices
    NumPy-like API
    tensor -> ndarray
    tensor <- ndarray
    CUDA support
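    A quick illustrative sketch of these features (shapes and values are arbitrary):

```python
import numpy as np
import torch

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)  # NumPy-like API

a = t.numpy()             # tensor -> ndarray (zero-copy on CPU)
t2 = torch.from_numpy(a)  # ndarray -> tensor (also shares memory)

# CUDA support: move tensors to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
t_dev = t.to(device)
```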

  12. Neural Module subclassing
    Definition of layers (i.e. tensors)
    Definition of graph (i.e. network)
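    A minimal sketch of nn.Module subclassing; the MLP below and its layer sizes are hypothetical (the actual networks used in the tutorial live in the workshop repository):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_features: int, n_classes: int):
        super().__init__()
        # definition of layers (i.e. parameter tensors)
        self.fc1 = nn.Linear(in_features, 64)
        self.fc2 = nn.Linear(64, n_classes)

    def forward(self, x):
        # definition of the graph (i.e. the network), traced per call
        return self.fc2(torch.relu(self.fc1(x)))
```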

  13. Loss and Gradients
    optimiser
    criterion & loss
    backprop & update
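    A sketch of one training step tying these together, reusing the hypothetical MLP above (Cross Entropy & SGD, as in the terminology slide; the batch is random):

```python
import torch

model = MLP(in_features=20, n_classes=2)
criterion = torch.nn.CrossEntropyLoss()                   # criterion & loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # optimiser

x, y = torch.randn(8, 20), torch.randint(0, 2, (8,))  # one mini-batch
optimizer.zero_grad()          # reset gradients from the previous step
loss = criterion(model(x), y)  # forward pass + loss
loss.backward()                # backprop: compute gradients
optimizer.step()               # update the parameters
```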

  14. Dataset and DataLoader
    transforms
    Dataset
    DataLoader

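    A sketch of a map-style Dataset with an optional transform, fed to a DataLoader for mini-batch learning; the data here is random and purely illustrative:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ArrayDataset(Dataset):
    """Minimal map-style Dataset wrapping in-memory tensors."""
    def __init__(self, X, y, transform=None):
        self.X, self.y = X, y
        self.transform = transform  # optional per-sample transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        x = self.X[idx]
        if self.transform is not None:
            x = self.transform(x)
        return x, self.y[idx]

ds = ArrayDataset(torch.randn(100, 20), torch.randint(0, 2, (100,)))
loader = DataLoader(ds, batch_size=16, shuffle=True)  # mini-batches
```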

  15. Let’s Talk about Privacy


  16. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of Curated in the next slides!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos!


  17. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of Curated in the next slide!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos! → Data accounting for privacy (privacy-preserving data)

  18. Privacy-Preserving Data
    Data Anonymisation Techniques: e.g. k-anonymity
    • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population, and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Data Anonymity. Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
    “We then show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.”
    Linking Attack
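    To make the definition concrete, a small sketch using pandas (not from the deck), with made-up records, that checks the k of a table over a chosen set of quasi-identifiers:

```python
import pandas as pd

# Hypothetical records: (age_band, zip3) are the quasi-identifiers
df = pd.DataFrame({
    "age_band":  ["30-40", "30-40", "30-40", "40-50", "40-50"],
    "zip3":      ["101",   "101",   "101",   "102",   "102"],
    "diagnosis": ["A",     "B",     "A",     "C",     "A"],
})

# A table is k-anonymous if every combination of quasi-identifier
# values occurs at least k times.
k = df.groupby(["age_band", "zip3"]).size().min()
print(f"table is {k}-anonymous on (age_band, zip3)")
```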

  19. Why don’t we allow analysis
    without moving data from their
    silos at all?


  20. Introducing: Federated Learning

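    One canonical scheme (not necessarily the one used later in the workshop, which features SplitNN) is Federated Averaging: each silo trains locally and only model parameters travel. A much-simplified sketch, assuming equally sized clients and an unweighted average:

```python
import copy
import torch

def fedavg_round(global_model, client_loaders, lr=0.01):
    """One round of Federated Averaging: the model travels to the data,
    trains locally in each silo, and only the weights are averaged."""
    states = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)  # ship the model, not the data
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        criterion = torch.nn.CrossEntropyLoss()
        for x, y in loader:                  # local training pass
            opt.zero_grad()
            criterion(local(x), y).backward()
            opt.step()
        states.append(local.state_dict())
    # aggregate: element-wise mean of the clients' parameters
    avg = {k: torch.stack([s[k].float() for s in states]).mean(0)
           for k in states[0]}
    global_model.load_state_dict(avg)
    return global_model
```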

  21. So that’s it? Federated Learning to rule them all?


  22. Model Vulnerabilities
    Adversarial Examples

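    A classic example is the Fast Gradient Sign Method (FGSM): nudge the input in the direction that increases the loss. A sketch, assuming an image classifier with inputs in [0, 1]:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.1):
    """Fast Gradient Sign Method: perturb x by eps in the direction
    that increases the loss, yielding an adversarial example."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
    return x_adv.clamp(0, 1)  # keep pixels in the valid range
```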

  23. Model Stealing
    Model Inversion Attacks


  24. Federated Learning & Encryption


  25. Federated Learning & Homomorphic Encryption
    https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/

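    The linked OpenMined post uses the TenSEAL library for CKKS. A minimal sketch of encrypted arithmetic; the parameter choices follow common TenSEAL examples and may need tuning:

```python
import tenseal as ts  # pip install tenseal

# CKKS: approximate arithmetic over encrypted vectors of reals
ctx = ts.context(ts.SCHEME_TYPE.CKKS,
                 poly_modulus_degree=8192,
                 coeff_mod_bit_sizes=[60, 40, 40, 60])
ctx.global_scale = 2 ** 40
ctx.generate_galois_keys()

enc = ts.ckks_vector(ctx, [1.0, 2.0, 3.0])  # encrypt a vector
result = enc * 2 + [1.0, 1.0, 1.0]          # compute on the ciphertext
print(result.decrypt())                     # ~ [3.0, 5.0, 7.0]
```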

  26. PPML with Differential Privacy
    https://ppml-workshop.github.io
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
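    The simplest instance is the Laplace mechanism: a count query has sensitivity 1, so adding Laplace(1/ε) noise to the true count makes the released value ε-differentially private. A small illustrative sketch (the data is made up):

```python
import numpy as np

def dp_count(values, predicate, epsilon=1.0):
    """epsilon-DP count via the Laplace mechanism: the sensitivity of
    a count is 1, so Laplace(1/epsilon) noise hides any individual."""
    true_count = sum(predicate(v) for v in values)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

ages = [34, 51, 29, 44, 61, 38]
print(dp_count(ages, lambda a: a > 40, epsilon=0.5))
```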

  27. Introducing Differential Privacy
    Inspired from: Differential Privacy on PyTorch | PyTorch Developer Day 2020
    youtu.be/l6fbl2CBnq0

  28. (image-only slide)

  29. (image slide) Source: pinterest.com/agirlandaglobe/

  30. (image-only slide)

  31. (image-only slide)

  32. (image-only slide)

  33. Learning from Aggregates with Differential Privacy
    • Aggregate Count on the Data
    • Computing the Mean
    • (Complex) Training an ML model
    Differential Privacy within the ML Pipeline
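    For the "(complex) training an ML model" case, the PyTorch Developer Day talk referenced on slide 27 covers Opacus, which wraps a model/optimizer/loader for DP-SGD. A sketch, assuming the model, optimizer and loader from the PyTorch refresher above; the hyperparameters are illustrative:

```python
from opacus import PrivacyEngine  # pip install opacus

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,  # scale of the noise added to gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)
# The training loop is unchanged: per-sample gradients are clipped
# and noised before each optimizer step (DP-SGD).
epsilon = privacy_engine.get_epsilon(delta=1e-5)  # spent privacy budget
```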

  34. Let’s switch to code now, so that we can see all of this in action ☺
    github.com/leriomaggio/ppml-pyconde


  35. Agenda
    • Part 1: Model Vulnerabilities and Attacks
      • Adversarial Examples (in Practice)
      • Model Inversion Attack
    • Part 2: Federated Machine Learning
      • Federated Data
      • Federated Learning (SplitNN)
    • Part 3: DL and Differential Privacy (DP)
      • Model Training with DP


  36. Thank you very much for your kind attention
    Valerio Maggio
    [email protected]
    @leriomaggio
    github.com/leriomaggio/ppml-pyconde
    speakerdeck.com/leriomaggio/ppml-pyconde
