
OpenMined: Introduction to Privacy Preserving Machine Learning

This one-hour live webinar will introduce participants to the fundamentals of Privacy Preserving Machine Learning (PPML). The session explores essential PPML concepts including Federated Learning, Differential Privacy, and Homomorphic Encryption, providing participants with a foundational understanding of balancing privacy and transparency in ML model development. Through practical demonstrations, attendees will learn to integrate privacy-preserving techniques into ML workflows using OpenMined. Participants will explore PySyft, a powerful open-source framework for secure and private machine learning, alongside SyftBox—OpenMined's latest project designed to make development with Privacy-Enhancing Technologies more intuitive and developer-friendly.

More info: https://openmined.github.io/intro-to-ppml-workshop/

Valerio Maggio

December 18, 2024

Transcript

1. Mission: Unlock non-public data for research by building open-source privacy software that empowers researchers to receive more benefits (e.g. co-authorship, citations, grants) while mitigating risks related to privacy, security, and IP. Non-profit 501(c)(3) & Open Source Community: openmined.org
2. Why Privacy is important: The Facebook-Cambridge Analytica Scandal (2014-18). 2014: a Facebook quiz called "This Is Your Digital Life" invited users to find out their personality type; the app collected data from participants, but also recorded public data from those on their friends list. 2015: The Guardian reported that Cambridge Analytica had data from this app and used it to psychologically profile voters in the US. 305K people installed the app; 87M people had their info gathered. 2018: US and British lawmakers demanded that Facebook explain how the firm was able to harvest personal information without users' consent. Facebook apologised for the data scandal and announced changes to the privacy settings.
3. Human vs Machine Learning? Human Learning ≠ Machine Learning. A human needs only a few examples to answer "Is this an APPLE?"; Machine Learning instead may require millions of samples even for a "simple" task ("Are these apples?"). ML models are data hungry: data is extremely important in Machine Learning.
4. MS Researchers demonstrated that different models (even fairly simple ones) performed almost identically on Natural Language disambiguation tasks [1]. Data >> Model? Human Learning ≠ Machine Learning: data is extremely important in Machine Learning. [1] "Scaling to very very large corpora for natural language disambiguation", Banko M., Brill E., ACL '01: Proceedings of the 39th Annual ACL Meeting, doi.org/10.3115/1073012.1073017. See also: "The Unreasonable Effectiveness of Data", Halevy, Norvig, and Pereira, IEEE Intelligent Systems, 2009.
5. To get started? Task: develop new automated methods to detect breast cancer tumours. • Look for any available resource (dataset/benchmark/paper).
6. 3.4M scans, of which 60K scans (~2%?) are accessible. EMBED License: "Permission for external collaboration by research and industry partners are reviewed on a case-by-case basis […] we have released 20% of the dataset […] allowing researchers to review the structure and content of EMBED […]" (360K scans, ~10%?). "[…] Emory grants you permission to view and use EMBED for personal, non-commercial research. You may not otherwise copy, reproduce, retransmit, distribute, publish, commercially exploit or otherwise transfer any material."
8. 5K scans (4 views). PhysioNet restricted-data terms: "The LICENSEE will not attempt to identify any individual or institution referenced in PhysioNet restricted data. […] The LICENSEE will not share access […] The LICENSEE will use the data for […] scientific research and no other." • Not all data is available • 60K from the original 3.4M • Research cannot immediately start • Additional admin required • A license to agree to and abide by • No (real) control over data use/misuse
10. ISSUE #1: Data Availability. Not every dataset can be open. ISSUE #1.1: We're trying to move data to where computation lives (i.e. the Copy Problem). ISSUE #1.2: There is NO control over data use/misuse.
11. Threats and Attacks for ML systems: 2008 De-Anonymisation (re-identification); 2011 Reconstruction Attack; 2013 Parameter Inference Attack; 2015 Model Inversion Attacks; 2017 Membership Inference Attacks (against the ML Model / MLaaS perimeter, as White Box or Black Box Attacks). • Privacy has to be implemented systematically, without using arbitrary mechanisms. • ML applications are prone to different privacy and security threats.
12. PETs? HOMOMORPHIC ENCRYPTION, K-ANONYMIZATION, SECURE ENCLAVES, FUNCTIONAL ENCRYPTION, ZERO-KNOWLEDGE PROOFS, SYNTHETIC DATA, DIFFERENTIAL PRIVACY, FEDERATED LEARNING, SECURE MULTI-PARTY COMPUTATION.
13. PETs and PPML • PETs have the huge potential to become the Data Science paradigm of the future • Joint effort of the ML, Security, Math, and Open Source communities • Privacy-Preserving Machine Learning (PPML) refers to PETs integrated within Machine Learning workflows.
14. What are PETs? Techniques: HOMOMORPHIC ENCRYPTION, K-ANONYMIZATION, SECURE ENCLAVES, FUNCTIONAL ENCRYPTION, ZERO-KNOWLEDGE PROOFS, SYNTHETIC DATA, DIFFERENTIAL PRIVACY, FEDERATED LEARNING, SECURE MULTI-PARTY COMPUTATION. They map onto four guarantees: Input Privacy, Output Privacy, Input Verification (ZERO-KNOWLEDGE PROOFS, CRYPTOGRAPHIC SIGNATURES, TRUST OVER IP INFRA), and Output Verification (ACTIVE SECURITY, SECURE ENCLAVES).
15. PETs, PPML, and the "Data vs Privacy" Dilemma. AI models are data hungry: • The more the data, the better the model • Push for high-quality and curated* open datasets (*more on the possible meanings of "curated" in the next slide!). Highly sensitive data: we need to keep data safe from both intentional and accidental leakage. Data &| models are kept in silos! Privacy Enhancing Technologies (PETs).
16. Utility vs Privacy Trade-off. The real challenge is balancing privacy and performance in ML applications, so that we can better utilise the data while ensuring the privacy of the individuals. [Illustration: "No PETs" vs "PETs". Image credits: Johannes Stutz (@dafflowjoe)]
17. PETS MAKE IT POSSIBLE TO: answer a question using data you cannot see, whether it lives in another dept., another org, or another country. This is the ability that matters! HOMOMORPHIC ENCRYPTION, K-ANONYMIZATION, SECURE ENCLAVES, FUNCTIONAL ENCRYPTION, ZERO-KNOWLEDGE PROOFS, SYNTHETIC DATA, DIFFERENTIAL PRIVACY, FEDERATED LEARNING, SECURE MULTI-PARTY COMPUTATION are just the algorithms!
18. OUR VISION / SOLUTION: remotely study data on another organisation's server. The Researcher can answer a "specific" question without seeing nor copying the data; the Org With Data retains governance over the information and never shares a copy of the data, safeguarding privacy. In a systematic way! HOW?
19. Structured Transparency Framework. Information Hierarchy mapped to Stakeholder Access: Public → General, Internal → Team, Private → Admin, under a common Governance. Goal: make transparency systematic and sustainable rather than ad hoc, creating trust while protecting sensitive information.
20. Structured Transparency Policy Framework. Input Policy: Data Validation, Access Control, Input Filtering. Processing Policy: Data Transform, Compliance Check, Audit Logging. Output Policy: Data Filtering, Privacy Check, Format Control.
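To make the three layers concrete, here is how such a policy could be written down as plain configuration. This is an illustrative structure only, not an actual OpenMined schema:

```python
# Illustrative only: the three policy layers from the slide expressed as a
# plain Python configuration dict (not an actual OpenMined schema).
structured_transparency_policy = {
    "input_policy": ["data_validation", "access_control", "input_filtering"],
    "processing_policy": ["data_transform", "compliance_check", "audit_logging"],
    "output_policy": ["data_filtering", "privacy_check", "format_control"],
}
```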
21. Step 1: University launches the Datasite. 1.1 Upload the Dataset into the Datasite. 1.2 Create an account for an external researcher. 1.3 … go and have a coffee… (or tea). (Univ. Wisconsin Datasite; Data Owner.)
25. Step 1 (cont.): on the Univ. Wisconsin Datasite, the Data Owner creates the external researcher's account: [email protected] / Dr. Rachel Science / *************.
26. bye! With the dataset uploaded (1.1) and the researcher account created (1.2), Owen, the Data Owner, leaves the Univ. Wisconsin Datasite running and goes to have a coffee… (or tea) (1.3).
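In code, Step 1 looks roughly like the following. This is a minimal sketch against the PySyft tutorial-style API (method names can differ across PySyft versions); the Datasite name, credentials, and data are illustrative stand-ins:

```python
# Sketch of Step 1 with PySyft (tutorial-style API assumed; names,
# credentials, and data are illustrative, not from the slides).
import syft as sy
import numpy as np

# 1.0 Launch a local development Datasite and log in as admin
server = sy.orchestra.launch(name="univ-wisconsin", reset=True)
admin = server.login(email="info@openmined.org", password="changethis")

# 1.1 Upload the dataset, pairing private data with open-access mock data
dataset = sy.Dataset(
    name="Breast Cancer Dataset",
    asset_list=[
        sy.Asset(
            name="scans",
            data=np.random.rand(100, 30),  # stand-in for the real data
            mock=np.random.rand(100, 30),  # fake values, same shape/structure
        )
    ],
)
admin.upload_dataset(dataset)

# 1.2 Create an account for the external researcher
admin.register(
    name="Dr. Rachel Science",
    email="rachel@university.edu",
    password="*************",
    password_verify="*************",
)

# 1.3 ...coffee (or tea)
```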
27. Step 2: External Researcher proposes a project. Rachel, Data Scientist, submits a Project Proposal with her Research Code to the Univ. Wisconsin Datasite.
28. Step 2: External Researcher proposes a project. 2.1 Login to the Domain Server. 2.2 Prepare code using public mock data. 2.3 Submit the Research Project. (Rachel, Data Scientist; Univ. Wisconsin Datasite.)
30. Open-access Mock Data vs Securely-hosted Real Data: this is the Structured Transparency + PETs part. 2.1 Login to the Domain Server. 2.2 Prepare code using the public mock data. 2.3 Submit the Research Project. (Rachel, Data Scientist.)
33. 2.1 Login to the Domain Server. 2.2 Prepare code using the public mock data: Rachel's queries (Q) apply Diff. Priv., the ST+PETs part. 2.3 Submit the Research Project to the Univ. Wisconsin Datasite. A sketch of the researcher's side follows below.
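A minimal sketch of Step 2 from the researcher's side, again assuming the PySyft tutorial-style API (the dataset, query, and project names are illustrative; the exact submission call varies across versions):

```python
# Sketch of Step 2 with PySyft (tutorial-style API assumed; names and
# credentials are illustrative, not from the slides).
import syft as sy

# 2.1 Login to the Datasite
client = sy.login(
    url="http://localhost:8080",
    email="rachel@university.edu",
    password="*************",
)

# 2.2 Prepare code against the open-access mock data
asset = client.datasets[0].assets[0]
mock = asset.mock            # safe to inspect: fake values, real structure
print(mock.mean())           # prototype the analysis locally on the mock

@sy.syft_function_single_use(data=asset)
def mean_intensity(data):
    # This body will run server-side on the real data, once approved
    return data.mean()

# 2.3 Submit the Research Project for the Data Owner's review
project = sy.Project(
    name="Mean scan intensity",
    description="Aggregate-only analysis over the securely-hosted scans",
    members=[client],
)
project.create_code_request(mean_intensity, client)
project.send()  # the name of the submission call varies across versions
```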
39. Step 3: Admin reviews code. Rachel's submitted queries (Q) are waiting at the Univ. Wisconsin Datasite; Owen, the Data Owner, reviews them and will send back the answers (A).
40. Step 3: Admin reviews code (ORGANISATION ADMIN CODE REVIEW). 3.1 Reviews the incoming code. 3.2 Executes the audit code. 3.3 Submits the results. (Univ. Wisconsin Datasite; Data Owner.)
44. Step 3: Admin reviews code. 3.1 Reviews the incoming code. 3.2 Executes the code. 3.3 Submits the results. APPROVED! A sketch of the Data Owner's side follows below.
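Step 3 on the Data Owner's side, sketched with the same assumed PySyft tutorial-style API (accept_by_depositing_result is the call used in older PySyft tutorials; newer versions expose an approve flow):

```python
# Sketch of Step 3 with PySyft (tutorial-style API assumed; credentials
# are the illustrative defaults from the PySyft tutorials).
import syft as sy

# Owen logs in as admin of the Datasite
admin = sy.login(
    url="http://localhost:8080",
    email="info@openmined.org",
    password="changethis",
)

# 3.1 Review the incoming code request
request = admin.requests[0]
print(request.code)  # inspect exactly what will run on the real data

# 3.2 Execute the reviewed code against the real, private asset
asset = admin.datasets[0].assets[0]
result = request.code.unsafe_function(data=asset.data)

# 3.3 Submit the results back by approving the request
request.accept_by_depositing_result(result)
```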
45. Step 4: External Researcher downloads answers. Rachel, Data Scientist, retrieves the approved answers (A) from the Univ. Wisconsin Datasite.
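And a sketch of Step 4: once the request is approved, the researcher re-runs the submitted function and downloads only the result (same assumed PySyft tutorial-style API; names illustrative):

```python
# Sketch of Step 4 with PySyft (tutorial-style API assumed).
import syft as sy

client = sy.login(
    url="http://localhost:8080",
    email="rachel@university.edu",
    password="*************",
)

# After approval, calling the submitted function returns a pointer to the
# approved result; the raw data never leaves the Datasite
asset = client.datasets[0].assets[0]
result = client.code.mean_intensity(data=asset)
print(result.get())  # download only the answer
```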
47. Privacy-Preserving Data: privacy as a property of Data. Data Anonymisation Techniques, e.g. k-anonymity. (From Wikipedia) In the context of k-anonymization problems, a database is a table with n rows and m columns. Each row of the table represents a record relating to a specific member of a population, and the entries in the various rows need not be unique. The values in the various columns are the values of attributes associated with the members of the population. Dataset → K-Anonymised Dataset → Algorithm #1 / Algorithm #2 / … / Algorithm #k (Data Sharing). https://github.com/leriomaggio/privacy-preserving-data-science
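As a minimal illustration of the definition above: a table is k-anonymous with respect to a set of quasi-identifiers if every combination of their values occurs in at least k rows. A sketch with pandas (the columns are illustrative, not from the slides):

```python
# A minimal k-anonymity check with pandas (illustrative data, not from the
# slides). The smallest equivalence class over the quasi-identifiers gives k.
import pandas as pd

df = pd.DataFrame({
    "zip":       ["53703", "53703", "53703", "53711", "53711"],
    "age":       ["20-30", "20-30", "20-30", "30-40", "30-40"],
    "diagnosis": ["flu", "cold", "flu", "flu", "cold"],  # sensitive attribute
})

def k_anonymity(table: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    # Size of the smallest group sharing the same quasi-identifier values
    return int(table.groupby(quasi_identifiers).size().min())

print(k_anonymity(df, ["zip", "age"]))  # 2 -> this table is 2-anonymous
```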
48. Privacy-Preserving Data: privacy as a property of Data. From patients' (sanitised) medical records plus the history of medical prescriptions (which patients bought meds from which pharmacy, which pharmacies were visited), one can roughly infer ZIP codes and residency (even without address info: e.g. from the most visited pharmacy) and correlations between medications and disease.
49. Data Privacy Issues: the Linking Attack. "[…] (we) show how these methods can be used in practice to de-anonymize the Netflix Prize dataset, a 500,000-record public dataset." Source: https://venturebeat.com/2020/04/07/2020-census-data-may-not-be-as-anonymous-as-expected/
50. Introducing OUTPUT PRIVACY #2: Differential Privacy. Inspired by: Differential Privacy on PyTorch | PyTorch Developer Day 2020, youtu.be/l6ffbl2CBnq0
51. Differential Privacy: a more formal definition. Like k-anonymity, differential privacy is a formal notion of privacy (i.e. it is possible to prove that a data release has the property). Unlike k-anonymity, however, differential privacy is a property of algorithms, not a property of data.
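A minimal sketch of this idea with the classic Laplace mechanism (a textbook DP technique, not code from the slides): a counting query has sensitivity 1, so adding Laplace(1/ε) noise makes the released count ε-differentially private:

```python
# The Laplace mechanism for a counting query (textbook DP, illustrative data).
import numpy as np

rng = np.random.default_rng(0)

def dp_count(values, predicate, epsilon: float) -> float:
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # one person changes a count by at most 1
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

ages = [34, 45, 29, 61, 50, 38]
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))  # noisy count of 40+
```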
52. Learning from Aggregates: Differential Privacy within the ML Pipeline. Introducing OPACUS. Examples of increasing complexity: • an aggregate count on the data • computing a mean • (complex) training an ML model. leriomaggio/ppml-tutorial/3-differential-privacy
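For the "train an ML model" case, Opacus wraps a standard PyTorch training loop with per-sample gradient clipping and noise. A minimal sketch, assuming Opacus' PrivacyEngine.make_private API (the model, data, and hyperparameters are illustrative):

```python
# DP-SGD training with Opacus (illustrative model/data/hyperparameters).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

model = nn.Linear(30, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
criterion = nn.CrossEntropyLoss()
data = TensorDataset(torch.randn(256, 30), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=32)

# Wrap model/optimizer/loader: gradients get clipped per-sample and noised
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.0,   # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,      # per-sample gradient clipping bound
)

for X, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

print(f"epsilon spent: {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```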
53. Federated Learning & Homomorphic Encryption. What is Homomorphic Encryption? • An encryption method allowing computations on encrypted data • The result matches the encryption of the same computation on unencrypted data • E(x) ⊕ E(y) = E(x + y). Types: 1. Partially Homomorphic (PHE): a single operation (addition OR multiplication); examples: RSA, Paillier. 2. Fully Homomorphic (FHE): multiple operations (both addition AND multiplication); enables arbitrary computations; higher computational overhead.
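A minimal sketch of the additive case using the phe (python-paillier) library, assuming its keypair/encrypt API; the values are illustrative:

```python
# Additive homomorphic encryption with Paillier via the `phe` library.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

x, y = 5, 12
enc_x = public_key.encrypt(x)
enc_y = public_key.encrypt(y)

# Addition happens on ciphertexts: E(x) (+) E(y) = E(x + y)
enc_sum = enc_x + enc_y

print(private_key.decrypt(enc_sum))  # 17, without decrypting x or y alone
```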
54. FL Workflow with SyftBox (https://syftbox-documentation.openmined.org/). Running the "syftbox client" command prints the SyftBox ASCII banner, installs uv and SyftBox (with managed Python 3.12), and starts the client, which logs its metadata: data_dir "/Users/leriomaggio/SyftBox", server_url "https://syftbox.openmined.org/", client_url "http://127.0.0.1:8080/", email "[email protected]", client_syftbox_version "0.2.11", python_version "3.12.7", platform "macOS-15.1.1-arm64" (auth tokens omitted). Join the Live SyftBox Network.
55. FL Workflow with SyftBox (https://syftbox-documentation.openmined.org/). The fl_aggregator API (main.py, run.sh) sits in the aggregator's $HOME/SyftBox/ under apis/, next to datasites/; the participants Ana, Bob, and John each run: git clone https://github.com/OpenMined/fl_client
56. FL Workflow with SyftBox. After the clone, the fl_client API also appears under $HOME/SyftBox/apis/, alongside the fl_aggregator API (main.py, run.sh) on the aggregator's side.
57. FL Workflow with SyftBox. Each participant runs the fl_client API in their own workspace: /Ana/SyftBox/ (apis/, datasites/), /Bob/SyftBox/ (apis/, datasites/), /John/SyftBox/ (apis/, datasites/), while the fl_aggregator API (main.py, run.sh) coordinates across their datasites.
58. FL Workflow with SyftBox. A Local Model is trained in each participant's datasite, e.g. /Ana/SyftBox/datasites/[email protected]/, /Bob/SyftBox/datasites/[email protected]/, /John/SyftBox/datasites/[email protected]/; the fl_aggregator combines the Local Models into a Global Model and tracks its Accuracy. A generic aggregation sketch follows below.
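The aggregation step at the heart of this workflow can be sketched generically as federated averaging (FedAvg). This is plain PyTorch, not the actual fl_aggregator code; the point is that only model weights leave each datasite, never the training data:

```python
# Generic FedAvg sketch (plain PyTorch, not the fl_aggregator implementation).
import torch
from torch import nn

def make_model() -> nn.Module:
    return nn.Linear(30, 2)

# Local models trained independently by Ana, Bob, and John on their own data
local_models = [make_model() for _ in range(3)]

# Global model = parameter-wise average of the local models
global_model = make_model()
with torch.no_grad():
    global_state = global_model.state_dict()
    for key in global_state:
        global_state[key] = torch.stack(
            [m.state_dict()[key] for m in local_models]
        ).mean(dim=0)
    global_model.load_state_dict(global_state)
```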