
PPML PyConDE

Privacy is to date one of the major impediments for Machine Learning (ML) when applied to sensitive datasets. One popular example is ML applied to the medical domain, but this extends to any scenario in which sensitive data cannot be shared, or simply cannot be used at all. Moreover, data anonymisation methods are not enough to guarantee that privacy will be completely preserved. In fact, it is possible to exploit the _memoisation_ effect of DL models to extract sensitive information about individual samples, and about the original dataset used for training. However, *privacy-preserving machine learning* (PPML) methods promise to overcome these issues, making it possible to train Machine Learning models on "data that cannot be seen".

The workshop will be organised in **two parts**: (1) in the first part, we will work on attacks against Deep Learning models, leveraging their vulnerabilities to extract insights about the original (sensitive) data. We will then explore potential counter-measures to work around these issues.
Examples will include image data as well as textual data, where attacks and counter-measures highlight different nuances and corner cases.
(2) In the second part of the workshop, we will delve into PPML methods, focusing on mechanisms to train DL networks on encrypted data, as well as on specialised _distributed federated_ training strategies for multiple _sensitive_ datasets.

Valerio Maggio

April 13, 2022

Transcript

  1. Privacy Preserving Machine Learning: Machine Learning on data you're not allowed to see. [email protected] @leriomaggio github.com/leriomaggio/ppml-pyconde speakerdeck.com/leriomaggio/ppml-pyconde
  2. Aim of this Tutorial: Privacy-Preserving Machine Learning (PPML). Provide an overview of the emerging technologies (in the Python ecosystem) for privacy protection, with a specific focus on Machine Learning.
     - Approach: Data Scientist. Always favour dev/practical aspects (tools & software); work on the full pipeline.
     - Perspective: Researcher. References and further readings to know more.
     - Live coding 🧑💻 (wish me luck! 🤞); non-live coding bits will have exercises to play with.
  3. SSI Fellowship Plans: PPML.
     What I would like to do:
     - Privacy-Preserving Machine Learning (PPML) technologies have huge potential to be the Data Science paradigm of the future.
     - Joint effort of the Open Source, ML, and Security communities.
     - Disseminate knowledge about these new methods and technologies among researchers.
     - Focus on reproducibility of PPML workflows.
     How I would like to do it: gather.town. Any help or suggestions about use/data cases or, more generally, case studies, or any contribution to shape the repository, will be very much appreciated! Looking forward to collaborations and contributions ☺
  4. Deep Learning Terminology: everyone on the same page? (also ref: bit.ly/nvidia-dl-glossary; a minimal code sketch follows this list)
     - Epochs
     - Batches and mini-batch learning
     - Parameters vs Hyperparameters (e.g. weights vs layers)
     - Loss & Optimiser (e.g. Cross Entropy & SGD)
     - Transfer learning
     - Gradient & Backward Propagation
     - Tensor
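To keep everyone on the same page, here is a minimal, hypothetical PyTorch sketch (not from the deck) that maps each term above onto code:

```python
import torch
import torch.nn as nn

# Parameters are the learned weights; hyperparameters are our choices
# (number of layers, learning rate, batch size, number of epochs, ...).
model = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))

loss_fn = nn.CrossEntropyLoss()                          # Loss: Cross Entropy
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # Optimiser: SGD

x = torch.randn(32, 784)           # one mini-batch of 32 samples (Tensors)
y = torch.randint(0, 10, (32,))    # integer class labels

for epoch in range(2):             # Epochs: full passes over the data
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)    # forward pass
    loss.backward()                # Gradient & Backward Propagation
    optimizer.step()               # update the parameters
```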
  5. Python has its say: Machine Learning, Deep Learning. "There should be one-- and preferably only one --obvious way to do it." The Zen of Python
  6. Multiple frameworks? See also "Data APIs: Standardization of N-dimensional arrays and dataframes" by Stephannie Jimenez Gacha: https://2022.pycon.de/program/BMFVFG/
  7. Deep Learning Frameworks: Static Graph vs Dynamic Graph. Computational graph models: a Linear (or Dense) layer computes σ(xᵀW + b), built from the nodes x, W, b, *, +, and σ. [Figure: the same five-layer network (fc1..fc5, with loss L comparing prediction y' to target y) as a static graph, defined once for all epochs and batches, vs a dynamic graph, rebuilt at every step (epoch 1 batch 1, epoch 1 batch 2, ...).]
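As an illustrative sketch (mine, not the deck's code), this is how a dynamic, define-by-run framework such as PyTorch builds that Linear-layer graph operation by operation during the forward pass:

```python
import torch

# sigma(x^T W + b): each operation adds a node to the computational
# graph as it executes (define-by-run / dynamic graph)
x = torch.randn(4, 3)                      # batch of 4 inputs, 3 features
W = torch.randn(3, 2, requires_grad=True)  # weights
b = torch.randn(2, requires_grad=True)     # bias

z = x @ W + b            # graph nodes: matmul, add
out = torch.sigmoid(z)   # graph node: sigma
print(out.grad_fn)       # e.g. <SigmoidBackward0 ...>: the recorded graph
```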
  8. Deep Learning Frameworks: Static Graph vs Dynamic Graph. Backwards and gradient computation for the Linear (or Dense) layer σ(xᵀW + b): classic Backprop over the static graph vs Autograd over the dynamic graph, which records the forward operations and replays them in reverse. [Figure: the fc1..fc5 network with loss L between prediction y' and target y, traversed backwards; "Record & Replay".]
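A tiny sketch (assumptions mine) of Autograd's record & replay in PyTorch:

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = (3 * x + 1).sum()   # forward pass: autograd *records* each operation

y.backward()            # backward pass: the recorded tape is *replayed*
print(x.grad)           # dy/dx = 3 for every element of x
```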
  9. Tensors, NumPy, Devices: NumPy-like API; tensor -> ndarray; tensor <- ndarray; CUDA support.
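A minimal sketch of those four points with PyTorch and NumPy (illustrative only):

```python
import numpy as np
import torch

a = np.arange(6.0).reshape(2, 3)  # a plain ndarray

t = torch.from_numpy(a)           # ndarray -> tensor (zero-copy, shares memory)
back = t.numpy()                  # tensor -> ndarray (CPU tensors only)

m = t.mean(dim=0)                 # NumPy-like API (cf. a.mean(axis=0))

if torch.cuda.is_available():     # CUDA support: move computation to the GPU
    t_gpu = t.to("cuda")
    back_gpu = t_gpu.cpu().numpy()  # copy back to CPU before .numpy()
```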
  10. The Data vs Privacy AI Dilemma. AI models are data hungry:
      - The more the data, the better the model.
      - Push for high-quality and curated* open datasets. (*More on the possible meanings of "curated" in the next slides!)
      But highly sensitive data must be kept safe from both intentional and accidental leakage, so data &| models are kept in silos!
  11. The Data vs Privacy AI Dilemma (same slide, now with the way out): data accounting for privacy (privacy-preserving data).
  12. Privacy-Preserving Data. Data anonymisation techniques, e.g. k-anonymity. From Wikipedia: "In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population, and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population." Anonymity alone is fragile, though: see https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/ and the linking attack on Netflix data: "We then show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset."
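A toy sketch (column names hypothetical, not from the deck) of what k measures: the size of the smallest group of records sharing the same quasi-identifier values:

```python
import pandas as pd

# Toy table: quasi-identifiers (zip code, age band) could link rows back
# to individuals even without names; "diagnosis" is the sensitive attribute.
df = pd.DataFrame({
    "zip": ["37210", "37210", "37210", "90210", "90210"],
    "age_band": ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "diagnosis": ["flu", "cold", "flu", "asthma", "flu"],
})

def k_anonymity(table, quasi_identifiers):
    """k = size of the smallest group sharing the same quasi-identifiers."""
    return int(table.groupby(quasi_identifiers).size().min())

print(k_anonymity(df, ["zip", "age_band"]))  # 2 -> the table is 2-anonymous
```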
  13. PPML with Differential Privacy (https://ppml-workshop.github.io). Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
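The classic intuition behind that definition is randomized response; this small sketch (mine, not from the deck) shows group-level patterns surviving while each individual answer stays deniable:

```python
import random

def randomized_response(true_answer: bool, p: float = 0.5) -> bool:
    """With probability p answer truthfully, otherwise answer uniformly
    at random: every respondent keeps plausible deniability."""
    if random.random() < p:
        return true_answer
    return random.random() < 0.5

# The aggregate pattern survives, individuals are protected:
truth = [random.random() < 0.3 for _ in range(100_000)]   # true rate: 30%
noisy = [randomized_response(t) for t in truth]
observed = sum(noisy) / len(noisy)             # ~= p*0.3 + (1-p)*0.5
estimate = (observed - (1 - 0.5) * 0.5) / 0.5  # de-bias the noisy rate
print(f"estimated rate: {estimate:.3f}")       # close to 0.300
```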
  14. Introducing Differential Privacy. Inspired by "Differential Privacy on PyTorch | PyTorch Developer Day 2020": youtu.be/l6fbl2CBnq0
  15. Learning from Aggregates with Differential Privacy. Differential privacy within the ML pipeline (a sketch of the first two follows this list):
      - Aggregate count on the data
      - Computing a mean
      - (Complex) training an ML model
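A hedged sketch of the first two aggregates via the Laplace mechanism (simplified: it assumes the dataset size is public and that values can be clamped to a known range):

```python
import numpy as np

rng = np.random.default_rng(0)
ages = rng.integers(18, 90, size=1_000)  # a toy "sensitive" column

def dp_count(data, epsilon):
    # A count has sensitivity 1: one person changes it by at most 1.
    return len(data) + rng.laplace(scale=1.0 / epsilon)

def dp_mean(data, epsilon, lo=18, hi=90):
    # Clamp values so one record moves the sum by at most (hi - lo).
    clipped = np.clip(data, lo, hi)
    noisy_sum = clipped.sum() + rng.laplace(scale=(hi - lo) / epsilon)
    return noisy_sum / len(data)   # simplification: len(data) is public

print(dp_count(ages, epsilon=0.5))
print(dp_mean(ages, epsilon=0.5))
```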
  16. To see all of that in action ☺ let's switch to code now: github.com/leriomaggio/ppml-pyconde
  17. Agenda
      - Part 1: Model Vulnerabilities and Attacks. Adversarial Examples (in practice); Model Inversion Attack.
      - Part 2: Federated Machine Learning. Federated Data; Federated Learning (SplitNN).
      - Part 3: DL and Differential Privacy (DP). Model Training with DP (a sketch follows this list).
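For Part 3, the workshop repo holds the actual notebooks; as a hedged orientation only, this is a minimal DP-SGD training sketch assuming the Opacus library's PrivacyEngine API (per-sample gradient clipping plus calibrated noise):

```python
import torch
import torch.nn as nn
from opacus import PrivacyEngine  # pip install opacus

model = nn.Sequential(nn.Linear(20, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = torch.utils.data.TensorDataset(
    torch.randn(256, 20), torch.randint(0, 2, (256,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

# Wrap model/optimizer/loader so that training runs DP-SGD:
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model, optimizer=optimizer, data_loader=loader,
    noise_multiplier=1.0,  # scale of the noise added to clipped gradients
    max_grad_norm=1.0,     # per-sample gradient clipping bound
)

loss_fn = nn.CrossEntropyLoss()
for x, y in loader:        # each step: clip per-sample grads, add noise
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```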
  18. Thank you very much for your kind attention. Valerio Maggio [email protected] @leriomaggio github.com/leriomaggio/ppml-pyconde speakerdeck.com/leriomaggio/ppml-pyconde