
Privacy Preserving Machine Learning (PPML) @ SciPy 2023

Privacy guarantees are **the** most crucial requirement when analysing sensitive data. These requirements can sometimes be so stringent that they become a real barrier for the entire pipeline. The reasons are manifold: data often cannot be _shared_ nor moved out of the silos where they reside, let alone analysed in their _raw_ form. As a result, _data anonymisation techniques_ are sometimes used to generate a sanitised version of the original data. However, these techniques alone are not enough to guarantee that privacy will be completely preserved. Moreover, the _memoisation_ effect of deep learning models can be maliciously exploited to _attack_ the models and _reconstruct_ sensitive information about the samples used in training, even if that information was never explicitly provided.

*Privacy-preserving machine learning* (PPML) methods hold the promise to overcome all of these issues, allowing machine learning models to be trained with strong privacy guarantees.

This workshop is organised in **three** parts. In the first part, we will introduce the main concepts of **differential privacy**: what it is, and how it differs from more classical _anonymisation_ techniques (e.g. `k-anonymity`). In the second part, we will focus on machine learning experiments: we will start by demonstrating how DL models can be exploited (i.e. via _inference attacks_) to reconstruct original data solely by analysing model predictions, and then explore how **differential privacy** can help us protect the privacy of our model, with _minimum disruption_ to the original pipeline. Finally, we will conclude the tutorial by considering more complex ML scenarios in which deep learning networks are trained on encrypted data, using specialised _distributed federated learning_ strategies.

Valerio Maggio

July 10, 2023

Transcript

  1. Privacy Preserving Machine Learning
    Machine Learning on Data you’re not allowed to see
    @leriomaggio
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-scipy
    [email protected]


  2. Who?
    • Background in CS
    • PhD in Machine Learning
    • Research: ML/DL for BioMedicine
    • SSI Fellow
    • Python Geek
    • Data Scientist Advocate _at_ Anaconda


  3. Aim of this Tutorial
    Privacy-Preserving Machine Learning (PPML)
    Provide an overview of the emerging tools (in the ecosystem) for Privacy Enhancing Technologies (a.k.a. PETs), with a focus on Machine Learning.


  4. SSI Fellowship: PPML
    What I would like to do
    • Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future
    • Joint effort of Open Source & ML & Security Communities
    • I wish to disseminate the knowledge about these new methods and technologies among researchers
    • Focus on Reproducibility of PPML workflows
    Any help or suggestions about use/data cases or, more generally, case studies, or any contribution to shape the repository, will be very much appreciated!
    Looking forward to collaborations and contributions ☺
    Awarded by JGI Seed-Corn Funding 2021
    jeangoldinginstitute.blogs.bristol.ac.uk/2021/01/07/seed-corn-funding-winner-announcement/


  5. Let’s Introduce Privacy


  6. Why Privacy is important
    The Facebook-Cambridge Analytica Scandal (2014-18)
    2014: A Facebook quiz called “This Is Your Digital Life” invited users to find out their personality type. The app collected data from participants but also recorded public data from those in their friends list.
    2015: The Guardian reported that Cambridge Analytica had data from this app and used it to psychologically profile voters in the US. 305K people installed the app; 87M people’s info was gathered.
    2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users’ consent. Facebook apologised for the data scandal and announced changes to the privacy settings.


  7. What about Machine Learning?
    Human Learning ≠ Machine Learning


  8. What about Machine Learning?
    Human Learning ≠ Machine Learning: Different Challenges
    Machine Learning instead may require millions of samples even for a “simple” task
    > ML models are data hungry


  9. What about Machine Learning?
    Human Learning ≠ Machine Learning: Different Challenges
    MS Researchers demonstrated that different models (even fairly simple ones) performed almost identically on Natural Language disambiguation tasks [1]
    Data >> Model ?
    [1]: Scaling to very very large corpora for natural language disambiguation, Banko M., Brill E., ACL '01: Proceedings of the 39th Annual ACL Meeting, doi.org/10.3115/1073012.1073017
    See Also: “The Unreasonable Effectiveness of Data”, Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009


  10. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of “Curated” in the next slides!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos!


  11. The Data vs Privacy AI Dilemma
    AI models are data hungry:
    • The more the data, the better the model
    • Push for High-quality and Curated* Open Datasets
    * More on the possible meanings of “Curated” in the next slide!
    Highly sensitive data: we need to keep data safe from both intentional and accidental leakage
    Data &| Models are kept in silos! → Data accounting for privacy (privacy preserving data)


  12. Privacy-Preserving Data
    Privacy as a property of Data
    Data Anonymisation Techniques: e.g. k-anonymity
    • (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
    Dataset → 🔒 K-Anonymised Dataset → Data Sharing (Algorithm #1, Algorithm #2, …, Algorithm #k)
    https://github.com/leriomaggio/privacy-preserving-data-science
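
To make the k-anonymity idea concrete, here is a minimal sketch of how the property can be checked with pandas; the column names and records are made up for illustration and are not taken from the tutorial repository.

```python
import pandas as pd

# Toy "released" table: quasi-identifiers plus one sensitive attribute.
df = pd.DataFrame({
    "age_band":  ["30-40", "30-40", "30-40", "20-30", "20-30"],
    "zip_code":  ["BS8",   "BS8",   "BS8",   "BS1",   "BS1"],
    "diagnosis": ["flu",   "asthma", "flu",  "flu",   "asthma"],
})

quasi_identifiers = ["age_band", "zip_code"]

def k_anonymity(data, quasi_ids):
    """Return k: the size of the smallest group of records sharing
    the same combination of quasi-identifier values."""
    return int(data.groupby(quasi_ids).size().min())

print(k_anonymity(df, quasi_identifiers))  # 2 -> the table is 2-anonymous
```

The table is k-anonymous when every individual is indistinguishable from at least k-1 others on the quasi-identifiers; generalisation and suppression are the usual ways to raise k before sharing.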


  13. Privacy-Preserving Data
    Privacy as a property of Data
    Patients Medical Records → (Sanitised) History of Medical Prescriptions
    Even from the sanitised history one can still infer:
    • which pharmacy patients bought their meds from, and which pharmacies they visited
    • rough ZIP codes and residency (even without address info: e.g. from the most visited pharmacy)
    • correlations between medications and disease


  14. Data Privacy Issues
    Linking Attack
    “[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset.”
    Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
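
The linking attack mentioned here can be illustrated with a tiny, entirely fabricated example: an “anonymised” release is joined with a public auxiliary dataset on shared quasi-identifiers, re-identifying individuals. This is a hedged sketch of the general technique, not the Netflix Prize analysis itself.

```python
import pandas as pd

# "Anonymised" release: names removed, quasi-identifiers kept.
released = pd.DataFrame({
    "zip":        ["02139", "02139", "90210"],
    "birth_year": [1971,    1985,    1971],
    "sex":        ["F",     "M",     "M"],
    "diagnosis":  ["heart disease", "flu", "asthma"],
})

# Public auxiliary data that does contain names (e.g. a voter roll).
auxiliary = pd.DataFrame({
    "name":       ["Alice", "Bob"],
    "zip":        ["02139", "90210"],
    "birth_year": [1971,    1971],
    "sex":        ["F",     "M"],
})

# Re-identification: join on the quasi-identifiers both tables share.
linked = auxiliary.merge(released, on=["zip", "birth_year", "sex"])
print(linked[["name", "diagnosis"]])
# Alice -> heart disease, Bob -> asthma
```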


  15. Threats and Attacks for ML systems
    2008: De-Anonymisation (re-identification)
    2011: Reconstruction Attack
    2013: Parameter Inference Attack
    2015: Model Inversion Attacks
    2017: Membership Inference Attacks
    ML Model / MLaaS perimeter: White Box Attacks vs Black Box Attacks
    • Privacy has to be implemented systematically, without using arbitrary mechanisms.
    • ML applications are prone to different privacy and security threats.
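
As a rough intuition for the “Membership Inference Attacks” entry above, the simplest black-box variant just thresholds the model’s confidence: overfitted models tend to be much more confident on records they were trained on. A hypothetical sketch (the `model.predict_proba` interface follows the scikit-learn convention and is not code from the tutorial):

```python
import numpy as np

def membership_inference(model, X, threshold=0.9):
    """Toy black-box membership inference via a confidence threshold:
    guess 'training member' whenever the model is very sure of its prediction."""
    probabilities = model.predict_proba(X)      # shape: (n_samples, n_classes)
    confidence = np.max(probabilities, axis=1)  # confidence of the predicted class
    return confidence > threshold               # True -> likely seen during training
```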


  16. Utility vs Privacy Dilemma
    The real challenge is balancing privacy and performance in ML applications so that we can better utilise the data while ensuring the privacy of the individuals.
    Image Credits: Johannes Stutz (@daflowjoe)


  17. PPML Interactive Session
    - Approach: Data Scientist
      - Always favour dev/practical aspects (tools & sw)
      - Work on the full pipeline
    - Perspective: Researcher
      - References and Further Readings to know more
    - Live Coding 🧑‍💻 (wish me luck! 🤞)
      - Non-live coding bits will have exercises to play with.


  18. 1. Model Threats


  19. Model Vulnerabilities
    Adversarial Examples
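
A common way to craft the adversarial examples this slide refers to is the Fast Gradient Sign Method (FGSM); the following is a minimal PyTorch sketch, not necessarily the attack demonstrated in the tutorial notebooks.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Craft adversarial examples by nudging the input in the direction
    that increases the loss (sign of the input gradient)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()  # keep pixels in the valid range
```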


  20. Model Stealing
    Model Inversion Attacks


  21. Model Stealing
    Model Inversion Attacks


  22. Introducing Differential Privacy
    Inspired by: Differential Privacy on PyTorch | PyTorch Developer Day 2020
    youtu.be/l6fbl2CBnq0


  24. Source: pinterest.com/agirlandaglobe/


  28. PPML with Differential Privacy
    https://ppml-workshop.github.io
    Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
    Like k-Anonymity, DP is a formal notion of privacy (i.e. it’s possible to prove that a data release has the property).
    Unlike k-Anonymity, however, differential privacy is a property of algorithms, and not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
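
For reference, the formal statement behind this slide: a randomised algorithm M is ε-differentially private if, for all pairs of datasets D and D' differing in a single individual's record and for every set of outputs S,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

Smaller ε means the output distribution barely changes when any one person's record is added or removed, which is exactly why DP is a property of the algorithm rather than of the released data.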


  29. Learning from Aggregates
    Introducing OPACUS
    Differential Privacy within the ML Pipeline
    ppml-tutorial/2-mia-differential-privacy
    • Aggregate Count on the Data
    • Computing Mean
    • (Complex) Train ML model
    (a minimal sketch of the first two queries follows below)
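
A minimal NumPy sketch of differentially private count and mean via the Laplace mechanism; the data, clipping bounds and budget split are illustrative and simplified compared to the tutorial notebooks.

```python
import numpy as np

def dp_count(values, epsilon=1.0):
    """Noisy count: a counting query has sensitivity 1."""
    return len(values) + np.random.laplace(scale=1.0 / epsilon)

def dp_mean(values, lower, upper, epsilon=1.0):
    """Crude DP mean: clip values to [lower, upper], then spend half of the
    budget on a noisy sum and half on a noisy count."""
    clipped = np.clip(values, lower, upper)
    noisy_sum = clipped.sum() + np.random.laplace(scale=(upper - lower) / (epsilon / 2))
    noisy_count = len(clipped) + np.random.laplace(scale=1.0 / (epsilon / 2))
    return noisy_sum / noisy_count

ages = np.array([34, 29, 41, 55, 23, 38])
print(dp_count(ages, epsilon=0.5))
print(dp_mean(ages, lower=18, upper=90, epsilon=0.5))
```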


  30. Learning from Aggregates
    Introducing OPACUS
    ppml-tutorial/2-mia-differential-privacy
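
Opacus hooks differential privacy (DP-SGD: per-sample gradient clipping plus Gaussian noise) into a standard PyTorch training loop. A self-contained sketch with a toy model and random data; the hyperparameters are placeholders, not the tutorial's settings.

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy model and data, standing in for the tutorial's real pipeline.
model = nn.Sequential(nn.Linear(20, 2))
optimizer = optim.SGD(model.parameters(), lr=0.05)
train_loader = DataLoader(
    TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,))),
    batch_size=32,
)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,  # scale of the Gaussian noise added to gradients
    max_grad_norm=1.0,     # per-sample gradient clipping threshold
)

criterion = nn.CrossEntropyLoss()
for x, y in train_loader:  # one epoch of DP-SGD
    optimizer.zero_grad()
    criterion(model(x), y).backward()
    optimizer.step()

print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f} at delta = 1e-5")
```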


  31. Why don’t we allow AI without moving data from their silos?


  32. Introducing: Federated Learning
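
Federated learning in one picture: data never leaves the clients; only model updates travel. Below is a minimal FedAvg sketch in plain PyTorch; treat it as an illustration of the idea under simplified assumptions, not the workshop code, which relies on dedicated libraries.

```python
import copy
import torch
from torch import nn

def federated_averaging(global_model, client_loaders, rounds=5, lr=0.05):
    """Each round: every client trains a copy of the model on its own silo,
    then the server averages the returned weights. Raw data is never shared."""
    for _ in range(rounds):
        client_states = []
        for loader in client_loaders:
            local = copy.deepcopy(global_model)
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for x, y in loader:  # one local epoch
                opt.zero_grad()
                nn.functional.cross_entropy(local(x), y).backward()
                opt.step()
            client_states.append(local.state_dict())
        # Server step: parameter-wise average of the client weights.
        averaged = {
            key: torch.stack([state[key].float() for state in client_states]).mean(dim=0)
            for key in client_states[0]
        }
        global_model.load_state_dict(averaged)
    return global_model
```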


  33. So that’s it?
    Federated Learning to rule them all?


  34. Federated Learning & Encryption


  35. Federated Learning & Homomorphic Encryption
    https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/
    ppml-tutorial/3-federeted-learning
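
The blog post above covers CKKS, a homomorphic encryption scheme for approximate arithmetic over encrypted real numbers. As a taste of what that enables, here is a small sketch with TenSEAL (an OpenMined library); the encryption parameters are typical example values, not a vetted production configuration.

```python
import tenseal as ts

# CKKS context: approximate arithmetic on encrypted vectors of real numbers.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2 ** 40
context.generate_galois_keys()

plain = [1.0, 2.0, 3.0, 4.0]
encrypted = ts.ckks_vector(context, plain)

# Computations happen directly on the ciphertext.
result = encrypted * 2 + [1, 1, 1, 1]
print(result.decrypt())  # approximately [3.0, 5.0, 7.0, 9.0]
```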


  36. Wrap up
    • Part 1: Data Anonymisation
      • K-anonymity
    • Part 2: Differential Privacy
      • Properties & DP for ML
    • Part 3: Model Vulnerabilities and Attacks
      • Adversarial Examples (in Practice)
      • Model Inversion Attack
    • Part 4: Federated Machine Learning
      • Federated Data
      • Federated Learning (SplitNN)


  37. Thank you very much for your kind attention
    Valerio Maggio
    @leriomaggio
    [email protected]
    github.com/leriomaggio/ppml-tutorial
    speakerdeck.com/leriomaggio/ppml-scipy
