
Privacy Preserving Machine Learning (PPML) @ SciPy 2023

Privacy guarantees are **the** most crucial requirement when it comes to analysing sensitive data. These requirements can be so stringent that they become a real barrier for the entire pipeline. The reasons are manifold: data often cannot be _shared_ or moved out of the silos where they reside, let alone analysed in their _raw_ form. As a result, _data anonymisation techniques_ are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy is preserved. Moreover, the _memorisation_ effect of Deep learning models can be maliciously exploited to _attack_ the models and _reconstruct_ sensitive information about the samples used in training, even if this information was not originally provided.
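To make the limits of anonymisation concrete, here is a minimal sketch of a linking attack in plain Python. All records, field names and the `link_records` helper are made up for illustration; the only assumption is that an attacker holds a second, public table sharing a few quasi-identifiers with the sanitised one.

```python
# Hypothetical data: a "sanitised" medical table (names removed) and a
# public register that still carries names. Both are invented for this sketch.
sanitised_records = [
    {"zip": "78701", "birth_year": 1985, "sex": "F", "diagnosis": "asthma"},
    {"zip": "78702", "birth_year": 1990, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "78703", "birth_year": 1985, "sex": "F", "diagnosis": "flu"},
]
public_register = [
    {"name": "Alice", "zip": "78701", "birth_year": 1985, "sex": "F"},
    {"name": "Bob", "zip": "78702", "birth_year": 1990, "sex": "M"},
    {"name": "Carol", "zip": "78704", "birth_year": 1972, "sex": "F"},
]

# Columns that are not identifiers on their own but are shared by both tables.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link_records(sanitised, public, keys=QUASI_IDENTIFIERS):
    """Re-identify sanitised rows whose quasi-identifiers match exactly one public row."""
    # Index the public table by its quasi-identifier tuple.
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = {}
    for row in sanitised:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-identifies the record
            matches[names[0]] = row["diagnosis"]
    return matches


reidentified = link_records(sanitised_records, public_register)
print(reidentified)
```

In this toy setting two of the three sanitised rows are re-identified even though no names were released, which is the kind of leakage that removing direct identifiers does not prevent.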

*Privacy-preserving machine learning* (PPML) methods hold the promise of overcoming these issues, making it possible to train machine learning models with full privacy guarantees.

This workshop is organised in **three** main parts. In the first part, we will introduce the main concepts of **differential privacy**: what it is, and how it differs from more classical _anonymisation_ techniques (e.g. `k-anonymity`). In the second part, we will focus on Machine learning experiments. We will start by demonstrating how DL models can be exploited (i.e. _inference attack_) to reconstruct original data solely by analysing model predictions; we will then explore how **differential privacy** can help us protect the privacy of our model, with _minimum disruption_ to the original pipeline. Finally, we will conclude the tutorial by considering more complex ML scenarios to train Deep learning networks on encrypted data, with specialised _distributed federated learning_ strategies.
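As a point of comparison for the first part, `k-anonymity` is a property one can check directly on a released table. The sketch below is one possible way to measure it with the standard library; the `k_anonymity` helper, the column names and the rows are assumptions made up for this example and are not taken from the workshop material.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows sharing the same
    quasi-identifier values (0 for an empty table)."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical generalised table: ages are banded and ZIP codes truncated.
rows = [
    {"age_band": "30-39", "zip_prefix": "787", "condition": "asthma"},
    {"age_band": "30-39", "zip_prefix": "787", "condition": "flu"},
    {"age_band": "40-49", "zip_prefix": "787", "condition": "diabetes"},
    {"age_band": "40-49", "zip_prefix": "787", "condition": "asthma"},
    {"age_band": "40-49", "zip_prefix": "787", "condition": "flu"},
]

k = k_anonymity(rows, ("age_band", "zip_prefix"))
print(f"The table is {k}-anonymous with respect to age_band and zip_prefix")
```

The check only looks at the released data, which is the sense in which k-anonymity is a property of a dataset rather than of the algorithm that produced it.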

Valerio Maggio

July 10, 2023

Transcript

  1. Privacy Preserving Machine Learning: Machine Learning on Data you’re not allowed to see
    @leriomaggio github.com/leriomaggio/ppml-tutorial speakerdeck.com/leriomaggio/ppml-scipy [email protected]
  2. Who? me pun / also me
    • Background in CS • PhD in Machine Learning • Research: ML/DL for BioMedicine • SSI Fellow • Python Geek • Data Scientists Advocate _at_ Anaconda
  3. Aim of this Tutorial
    Provide an overview of the emerging tools (in the ecosystem) for Privacy Enhancing Technologies (a.k.a. PETs) with focus on Machine Learning: Privacy-Preserving Machine Learning (PPML)
  4. SSI Fellowship: PPML
    • Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future • Joint effort of Open Source & ML & Security Communities • I wish to disseminate the knowledge about these new methods and technologies among researchers • Focus on Reproducibility of PPML workflows
    What I would like to do: any help or suggestions about Use/Data cases or more generally Case studies, or any contribution to shape the repository will be very much appreciated! Looking forward to collaborations and contributions ☺
    Awarded by JGI Seed-Corn Fundings 2021 jeangoldinginstitute.blogs.bristol.ac.uk/2021/01/07/seed-corn-funding-winner-announcement/
  5. Why Privacy is important: The Facebook-Cambridge Analytica Scandal (2014-18)
    2014: A Facebook quiz called “This Is Your Digital Life” invited users to find out their personality type. The app collected data from participants but also recorded public data from those in their friends list.
    2015: The Guardian reported that Cambridge Analytica had data from this app and used it to psychologically profile voters in the US. 305K people installed the app; 87M people info gathered.
    2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users’ consent. Facebook apologised for the data scandal and announced changes to the privacy settings.
  6. What about Machine Learning? Human Learning ≠ Machine Learning: Different Challenges
    APPLE / APPLE?? Machine Learning instead may require millions of samples even for a “simple” task > ML models are data hungry
  7. What about Machine Learning? Human Learning ≠ Machine Learning: Different Challenges
    APPLE / APPLE?? MS Researchers demonstrated that different models (even fairly simple ones) performed almost identically on Natural Language disambiguation tasks [1]. Data >> Model?
    [1] Scaling to very very large corpora for natural language disambiguation, Banko M., Brill E., ACL '01: Proceedings of the 39th Annual ACL Meeting, doi.org/10.3115/1073012.1073017
    See Also: “The Unreasonable Effectiveness of Data”, Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009
  8. The Data vs Privacy AI Dilemma
    AI models are data hungry: • The more the data, the better the model • Push for High-quality and Curated* Open Datasets (* More on the Curated possible meanings in the next slides!)
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage. Data &| Models are kept in silos!
  9. The Data vs Privacy AI Dilemma
    AI models are data hungry: • The more the data, the better the model • Push for High-quality and Curated* Open Datasets (* More on the Curated possible meanings in the next slide!)
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage. Data &| Models are kept in silos!
    Data accounting for privacy (privacy preserving data)
  10. Privacy-Preserving Data: Privacy as a property of Data
    Data Anonymisation Techniques: e.g. k-anonymity • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Dataset → 🔒 K-Anonymised Dataset → Algorithm #1, Algorithm #2, Algorithm #k → Data Sharing
    https://github.com/leriomaggio/privacy-preserving-data-science
  11. Privacy-Preserving Data: Privacy as a property of Data
    Patients Medical Records (Sanitised) • History of Medical Prescriptions: correlation between medications and disease • Pharmacies visited: patients bought meds from which pharmacy • Roughly infer ZIP codes and residency (even without address info: e.g. most visited pharmacy)
  12. Data Privacy Issues: Linking Attack
    Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
    “[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.”
  13. Threats and Attacks for ML systems
    2008: De-Anonymisation (re-identification) • 2011: Reconstruction Attack • 2013: Parameter Inference Attack • 2015: Model Inversion Attacks • 2017: Membership Inference Attacks
    ML Model • MLaaS • White Box Attacks • Black Box Attacks • Perimeter
    • Privacy has to be implemented systematically without using arbitrary mechanisms. • ML applications are prone to different privacy and security threats.
  14. Utility vs Privacy Dilemma
    The real challenge is balancing privacy and performance in ML applications so that we can better utilise the data while ensuring the privacy of the individuals. Image Credits: Johannes Stutz (@daflowjoe)
  15. PPML Interactive Session
    - Approach: Data Scientist - Always favour dev/practical aspects (tools & sw) - Work on the full pipeline
    - Perspective: Researcher - References and Further Readings to know more
    - Live Coding 🧑💻 (wish me luck! 🤞) - non-live coding bits will have exercises to play with.
  16. Introducing Differential Privacy
    Inspired from: Differential Privacy on PyTorch | PyTorch Developer Day 2020, youtu.be/l6fbl2CBnq0
  17. PPML with Differential Privacy (https://ppml-workshop.github.io)
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. Like k-Anonymity, DP is a formal notion of privacy (i.e. it’s possible to prove that a data release has the property).
    Unlike k-Anonymity, however, differential privacy is a property of algorithms, and not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
  18. Learning from Aggregates: Differential Privacy within the ML Pipeline
    • Aggregate Count on the Data • Computing Mean • (Complex) Train ML model
    Introducing OPACUS: ppml-tutorial/2-mia-differential-privacy
  19. Wrap up
    • Part 1: Data Anonymisation • K-anonymity
    • Part 2: Differential Privacy • Properties & DP for ML
    • Part 3: Model Vulnerabilities and Attacks • Adversarial Examples (in Practice) • Model Inversion Attack
    • Part 4: Federated Machine Learning • Federated Data • Federated Learning (SplitNN)
  20. Thank you very much for your kind attention
    Valerio Maggio @leriomaggio [email protected] github.com/leriomaggio/ppml-tutorial speakerdeck.com/leriomaggio/ppml-scipy
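
The two simplest aggregates listed on slide 18 (a count and a mean) are a common place to see differential privacy as a property of the algorithm. What follows is a minimal sketch of the Laplace mechanism using only the standard library, not the code from the tutorial repository. The function names, the clipping bounds and the sample ages are assumptions made up for illustration. The sketch also assumes that neighbouring datasets differ by replacing one record, so the dataset size is treated as public.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(values, predicate, epsilon, rng):
    """Noisy count of records matching `predicate`.

    Replacing one record changes the true count by at most 1,
    so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def dp_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of values clipped to [lower, upper].

    With n records (n assumed public), replacing one clipped record moves
    the mean by at most (upper - lower) / n, which is the sensitivity.
    """
    if not values:
        raise ValueError("dp_mean needs at least one record")
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical ages; a seeded generator keeps the example repeatable.
ages = [23, 35, 41, 29, 52, 38, 47, 31]
rng = random.Random(2023)

noisy_count = dp_count(ages, lambda age: age >= 40, epsilon=1.0, rng=rng)
noisy_mean = dp_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)
print(f"noisy count of ages >= 40: {noisy_count:.2f} (true value: 3)")
print(f"noisy mean age: {noisy_mean:.2f} (true value: 37.0)")
```

A smaller `epsilon` means more noise and stronger privacy. Each released answer consumes part of the privacy budget, so publishing both results above costs a total epsilon of 2 under basic composition.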