Slide 1

Slide 1 text

Privacy-Preserving Machine Learning: Machine Learning on Data You're Not Allowed to See
@leriomaggio · github.com/leriomaggio/ppml-tutorial · speakerdeck.com/leriomaggio/ppml-scipy · [email protected]

Slide 2

Slide 2 text

Who? (also me)
• Background in CS
• PhD in Machine Learning
• Research: ML/DL for BioMedicine
• SSI Fellow
• Python Geek
• Data Scientist Advocate at Anaconda

Slide 3

Slide 3 text

Aim of this Tutorial: Privacy-Preserving Machine Learning (PPML)
Provide an overview of the emerging tools in the ecosystem for Privacy Enhancing Technologies (a.k.a. PETs), with a focus on Machine Learning.

Slide 4

Slide 4 text

SSI Fellowship: PPML. What I would like to do
• Privacy-Preserving Machine Learning (PPML) technologies have the huge potential to be the Data Science paradigm of the future
• Joint effort of the Open Source, ML, and Security communities
• I wish to disseminate knowledge about these new methods and technologies among researchers
• Focus on reproducibility of PPML workflows
Any help or suggestions about use/data cases (or, more generally, case studies), or any contribution to shape the repository, will be very much appreciated! Looking forward to collaborations and contributions ☺
Awarded by JGI Seed-Corn Funding 2021: jeangoldinginstitute.blogs.bristol.ac.uk/2021/01/07/seed-corn-funding-winner-announcement/

Slide 5

Slide 5 text

Let’s Introduce Privacy

Slide 6

Slide 6 text

Why Privacy Is Important: The Facebook-Cambridge Analytica Scandal (2014-18)
2014: A Facebook quiz called "This Is Your Digital Life" invited users to find out their personality type. The app collected data from participants, but also recorded public data from the people on their friends lists.
2015: The Guardian reported that Cambridge Analytica had obtained data from this app and used it to psychologically profile voters in the US. 305K people installed the app; information on 87M people was gathered.
2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users' consent. Facebook apologised for the data scandal and announced changes to its privacy settings.

Slide 7

Slide 7 text

What about Machine Learning? Human Learning ≠ Machine Learning

Slide 8

Slide 8 text

What about Machine Learning? Human Learning ≠ Machine Learning: Different Challenges
Machine Learning, by contrast, may require millions of samples even for a "simple" task (e.g. recognising an APPLE): ML models are data hungry.

Slide 9

Slide 9 text

What about Machine Learning? Human Learning ≠ Machine Learning: Different Challenges
Data >> Model? MS researchers demonstrated that different models (even fairly simple ones) performed almost identically on natural language disambiguation tasks [1].
[1] Banko M., Brill E., "Scaling to very very large corpora for natural language disambiguation", ACL '01: Proceedings of the 39th Annual ACL Meeting. doi.org/10.3115/1073012.1073017
See also: "The Unreasonable Effectiveness of Data", Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009.

Slide 10

Slide 10 text

The Data vs Privacy AI Dilemma
AI models are data hungry:
• The more the data, the better the model
• Push for high-quality and curated* open datasets (*more on the possible meanings of "curated" in the next slides!)
Highly sensitive data: we need to keep data safe from both intentional and accidental leakage.
As a result, data and/or models are kept in silos!

Slide 11

Slide 11 text

The Data vs Privacy AI Dilemma
AI models are data hungry:
• The more the data, the better the model
• Push for high-quality and curated* open datasets (*more on the possible meanings of "curated" in the next slide!)
Highly sensitive data: we need to keep data safe from both intentional and accidental leakage.
As a result, data and/or models are kept in silos!
→ Data accounting for privacy (privacy-preserving data)

Slide 12

Slide 12 text

Privacy-Preserving Data: Privacy as a Property of Data
Data anonymisation techniques, e.g. k-anonymity:
• (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population, and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population.
Diagram: Dataset → 🔒 K-Anonymised Dataset → Algorithm #1, Algorithm #2, …, Algorithm #k (Data Sharing)
https://github.com/leriomaggio/privacy-preserving-data-science
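As a rough illustration of the definition above (my addition, not part of the original slides), the sketch below checks the k-anonymity level of a pandas DataFrame with respect to a chosen set of quasi-identifier columns; the column names and data are hypothetical.

```python
import pandas as pd

def k_anonymity_level(df: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """Size of the smallest group of rows sharing the same combination of
    quasi-identifier values, i.e. the dataset's k."""
    return int(df.groupby(quasi_identifiers).size().min())

# Hypothetical, already-generalised records
records = pd.DataFrame({
    "zip_code": ["4130*", "4130*", "4130*", "4867*", "4867*"],
    "age_band": ["20-30", "20-30", "20-30", "40-50", "40-50"],
    "diagnosis": ["flu", "cold", "flu", "asthma", "flu"],
})

k = k_anonymity_level(records, quasi_identifiers=["zip_code", "age_band"])
print(f"This table is {k}-anonymous w.r.t. zip_code and age_band")
```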

Slide 13

Slide 13 text

Privacy-Preserving Data: Privacy as a Property of Data
Even from patients' (sanitised) medical records, an inference chain can be followed:
• History of medical prescriptions → which pharmacy patients bought their meds from
• Pharmacies visited → roughly infer ZIP codes and residency (even without any address info, e.g. from the most-visited pharmacy)
• Correlation between medications and disease

Slide 14

Slide 14 text

Data Privacy Issues: Linking Attack
Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
"[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset."
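To make the idea of a linking attack concrete (my addition, not from the slides), here is a minimal sketch: an "anonymised" release is re-identified by joining it with a public auxiliary dataset on shared quasi-identifiers. All column names and records are hypothetical.

```python
import pandas as pd

# "Anonymised" release: direct identifiers removed, quasi-identifiers kept
released = pd.DataFrame({
    "zip_code":   ["02139", "02139", "94305"],
    "birth_date": ["1970-01-02", "1985-06-15", "1970-01-02"],
    "sex":        ["F", "M", "F"],
    "diagnosis":  ["diabetes", "flu", "asthma"],
})

# Public auxiliary data (e.g. a voter roll) that still contains names
voter_roll = pd.DataFrame({
    "name":       ["Alice", "Bob"],
    "zip_code":   ["02139", "94305"],
    "birth_date": ["1970-01-02", "1970-01-02"],
    "sex":        ["F", "M"],
})

# Joining on quasi-identifiers links names back to sensitive attributes
reidentified = released.merge(voter_roll, on=["zip_code", "birth_date", "sex"])
print(reidentified[["name", "diagnosis"]])  # Alice -> diabetes
```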

Slide 15

Slide 15 text

Threats and Attacks for ML Systems
• 2008: De-anonymisation (re-identification)
• 2011: Reconstruction attacks
• 2013: Parameter inference attacks
• 2015: Model inversion attacks
• 2017: Membership inference attacks (a minimal sketch follows below)
Diagram: attacks target the ML model and the MLaaS perimeter, as either white-box or black-box attacks.
• Privacy has to be implemented systematically, without using arbitrary mechanisms.
• ML applications are prone to different privacy and security threats.
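As an illustration of the last attack in the timeline (my addition, assuming a generic classifier that outputs class probabilities), the simplest membership inference baseline thresholds the model's per-example loss: records the model was trained on tend to have lower loss than unseen records. The threshold and the example predictions are placeholders.

```python
import numpy as np

def per_example_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Cross-entropy loss of each example, given predicted class probabilities."""
    eps = 1e-12
    return -np.log(probs[np.arange(len(labels)), labels] + eps)

def membership_guess(probs, labels, threshold=0.5):
    """Guess 'member of the training set' when the loss falls below a threshold.
    In a real attack the threshold is calibrated on data the attacker controls."""
    return per_example_loss(probs, labels) < threshold

# Hypothetical predictions of a target model on candidate records
probs = np.array([[0.95, 0.05], [0.55, 0.45], [0.10, 0.90]])
labels = np.array([0, 0, 0])
print(membership_guess(probs, labels))  # confident, low-loss records look like members
```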

Slide 16

Slide 16 text

Utility vs Privacy Dilemma The real challenge is balancing privacy and performance in ML applications so that we can better utilise the data while ensuring the privacy of the individuals. Image Credits: Johannes Stutz (@daflowjoe)

Slide 17

Slide 17 text

PPML Interactive Session
• Approach: Data Scientist
  - Always favour dev/practical aspects (tools & software)
  - Work on the full pipeline
• Perspective: Researcher
  - References and further readings to learn more
• Live coding 🧑💻 (wish me luck! 🤞)
  - Non-live-coding bits will have exercises to play with.

Slide 18

Slide 18 text

1. Model Threats

Slide 19

Slide 19 text

Model Vulnerabilities Adversarial Examples
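Since the slide only names adversarial examples, here is a minimal FGSM (Fast Gradient Sign Method) sketch in PyTorch, the classic way such inputs are crafted. This is my addition, not the tutorial's code; the model and tensors are placeholders.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """Perturb input x in the direction of the loss gradient's sign (FGSM)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()      # single-step perturbation
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in a valid range

# Hypothetical usage with any image classifier `model`:
# x_adv = fgsm_attack(model, images, labels, epsilon=0.03)
# fooling_rate = (model(x_adv).argmax(1) != labels).float().mean()
```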

Slide 20

Slide 20 text

Model Stealing Model Inversion Attacks

Slide 21

Slide 21 text

Model Stealing Model Inversion Attacks
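As a rough sketch of what a model inversion attack does (my addition, a simplified gradient-based variant rather than the tutorial's implementation), the attacker optimises an input to maximise the model's confidence for a target class, gradually reconstructing a representative example of that class. The model and input shape are placeholders.

```python
import torch
import torch.nn.functional as F

def invert_class(model, target_class, shape=(1, 1, 28, 28), steps=500, lr=0.1):
    """Reconstruct an input the model strongly associates with `target_class`."""
    x = torch.zeros(shape, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(x)
        # Maximise the target class probability (minimise its negative log-prob)
        loss = -F.log_softmax(logits, dim=1)[0, target_class]
        loss.backward()
        optimizer.step()
        x.data.clamp_(0.0, 1.0)  # stay in the valid pixel range
    return x.detach()

# Hypothetical usage: reconstruct what the model "thinks" digit 3 looks like
# reconstruction = invert_class(trained_mnist_model, target_class=3)
```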

Slide 22

Slide 22 text

Introducing Differential Privacy
Inspired by: Differential Privacy on PyTorch | PyTorch Developer Day 2020, youtu.be/l6fbl2CBnq0

Slide 23

Slide 23 text

No content

Slide 24

Slide 24 text

Source: pinterest.com/agirlandaglobe/

Slide 25

Slide 25 text

No content

Slide 26

Slide 26 text

No content

Slide 27

Slide 27 text

No content

Slide 28

Slide 28 text

PPML with Differential Privacy
https://ppml-workshop.github.io
Differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset.
Like k-anonymity, DP is a formal notion of privacy (i.e. it is possible to prove that a data release has the property). Unlike k-anonymity, however, differential privacy is a property of algorithms, not a property of data. That is, we can prove that an algorithm satisfies differential privacy; to show that a dataset satisfies differential privacy, we must show that the algorithm which produced it satisfies differential privacy.
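To make the "property of algorithms" point concrete (my addition, following the standard textbook construction rather than anything shown on the slide), here is a minimal Laplace mechanism for an ε-differentially-private counting query; the dataset and ε value are placeholders.

```python
import numpy as np

def dp_count(values, predicate, epsilon: float) -> float:
    """Counting query released via the Laplace mechanism.
    A counting query has sensitivity 1: adding or removing one individual
    changes the true count by at most 1, so noise is drawn from Laplace(1/eps)."""
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical query: how many patients are over 60?
ages = [34, 67, 71, 45, 62, 58, 80]
print(dp_count(ages, lambda age: age > 60, epsilon=0.5))
```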

Slide 29

Slide 29 text

Learning from Aggregates: Differential Privacy within the ML Pipeline
Introducing OPACUS
ppml-tutorial/2-mia-differential-privacy
• Aggregate count on the data
• Computing a mean
• (Complex) Training an ML model
A minimal Opacus sketch follows below.
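A minimal sketch of how Opacus wraps a standard PyTorch training setup with DP-SGD (my addition; the model, data, and hyperparameters are placeholders, and the calls reflect the Opacus 1.x API as I understand it):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Hypothetical model and data
model = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 2))
dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()

# Attach DP-SGD: per-sample gradient clipping plus calibrated Gaussian noise
privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    noise_multiplier=1.1,   # noise added to the clipped, summed gradients
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

# The training loop itself is unchanged
for epoch in range(3):
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

print("epsilon spent:", privacy_engine.get_epsilon(delta=1e-5))
```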

Slide 30

Slide 30 text

Learning from Aggregates: Introducing OPACUS
ppml-tutorial/2-mia-differential-privacy

Slide 31

Slide 31 text

Why don't we enable AI without moving data out of their silos?

Slide 32

Slide 32 text

Introducing: Federated Learning
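To give a flavour of what federated learning means in code (my addition: a deliberately simplified, single-machine simulation of FedAvg, not the tutorial's implementation), each client trains locally on its own data, and only model weights, never raw data, are averaged on the server.

```python
import copy
import torch
from torch import nn

def local_update(model, data, targets, epochs=1, lr=0.1):
    """Train a copy of the global model on one client's private data."""
    local = copy.deepcopy(model)
    opt = torch.optim.SGD(local.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.cross_entropy(local(data), targets).backward()
        opt.step()
    return local.state_dict()

def federated_average(states):
    """FedAvg: average the clients' weights; raw data never leaves the clients."""
    avg = copy.deepcopy(states[0])
    for key in avg:
        avg[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return avg

# Hypothetical setup: 3 clients, each holding its own private data shard
global_model = nn.Linear(10, 2)
clients = [(torch.randn(64, 10), torch.randint(0, 2, (64,))) for _ in range(3)]

for communication_round in range(5):
    states = [local_update(global_model, x, y) for x, y in clients]
    global_model.load_state_dict(federated_average(states))
```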

Slide 33

Slide 33 text

So that's it? Federated Learning to rule them all?

Slide 34

Slide 34 text

Federated Learning 
 & Encryption

Slide 35

Slide 35 text

Federated Learning 
 & Homomorphic Encryption https://blog.openmined.org/ckks-homomorphic-encryption-pytorch-pysyft-seal/ ppml-tutorial/3-federeted-learning

Slide 36

Slide 36 text

Wrap up
• Part 1: Data Anonymisation
  - K-anonymity
• Part 2: Differential Privacy
  - Properties & DP for ML
• Part 3: Model Vulnerabilities and Attacks
  - Adversarial Examples (in Practice)
  - Model Inversion Attack
• Part 4: Federated Machine Learning
  - Federated Data
  - Federated Learning (SplitNN)

Slide 37

Slide 37 text

Thank you very much for your kind attention!
Valerio Maggio · @leriomaggio · [email protected]
github.com/leriomaggio/ppml-tutorial · speakerdeck.com/leriomaggio/ppml-scipy