Towards Building a Responsible Data Economy (Dawn Song, UC Berkeley)

Data is a key driver of the modern economy and of AI/machine learning. However, much of this data is sensitive, and handling it has created unprecedented challenges for both individuals and businesses; these challenges will only become more severe as we move further into the digital era. In this talk, I will discuss the technologies needed for responsible data use, including secure computing, differential privacy, federated learning, and blockchain technologies for data rights. I will also discuss how to combine privacy computing technologies and blockchain to build a platform for a responsible data economy: one that enables more responsible use of data, maximizes social welfare and economic efficiency, protects users’ data rights, and enables fair distribution of the value created from data.

Anyscale

July 20, 2021

Transcript

  1. $3T Global Data Economy

    2.5 quintillion bytes of data generated a day. 8% of the EU’s GDP = value from personalized data.
  2. Individuals have lost control over how their data is used

    https://www.fastcompany.com/90310803/here-are-the-data-brokers-quietly-buying-and-selling-your-personal-information https://www.vice.com/en_us/article/evjekz/the-california-dmv-is-making-dollar50m-a-year-selling-drivers-personal-information
  3. Anonymization doesn’t adequately protect user privacy

    The NYTimes was able to track the location of a Secret Service agent with former president Trump from an anonymized mobile phone location dataset. https://www.nytimes.com/interactive/2019/12/20/opinion/location-data-national-security.html
    7:10am -- At Mar-a-Lago
    9:24am -- At the Jupiter Golf Club
    1:14pm -- Stop for lunch in West Palm Beach
    5:08pm -- Phone was back in Mar-a-Lago
  4. Regulations like CCPA and GDPR are becoming costly for businesses

    $55B: the cost businesses will have to pay to comply with CCPA.
    https://www.cnbc.com/2019/10/05/california-consumer-privacy-act-ccpa-could-cost-companies-55-billion.html
    https://www.csoonline.com/article/3410278/the-biggest-data-breach-fines-penalties-and-settlements-so-far.html
  5. “The biggest obstacle to using advanced data analysis isn’t skill base or technology; it’s plain old access to the data.”

    - Edd Wilder-James, Harvard Business Review. https://hbr.org/2016/12/breaking-down-data-silos
  6. Goals/principles of a Responsible Data Economy

    • Establishing & enforcing data rights
      ◦ Foundation of data economy and preventing misuse/abuse of data
    • Fair distribution of value created from data
      ◦ Users should be able to gain benefit from their data
    • Efficient data use to maximize social welfare & economic efficiency
      ◦ Data useful for individuals, organizations, governments, societies
  7. Unique Challenges & Complexity

    • Natural tension between utility & privacy
    • Data is non-rival; different from physical objects
    • Once data is given out (copied), one cannot take it back (undo it)
    • Data dependency & data externality
    • Processed data can reveal information about original data
    • Data about one entity can reveal information about another entity
    • Cannot simply copy concepts and methods in analog world
  8. A Framework for a Responsible Data Economy

    Technical Solutions; Legal Framework; Incentive Models
    • Requires a combination of technical & non-technical solutions
  9. Traditional solutions are insufficient:

    • Data is either not used, or is copied, making it difficult to control usage
    • Data encryption only protects data at rest or in transit
    • Anonymizing data doesn’t always protect privacy
    We need technology that provides:
    • Data protection in use
    • Protection of computation outputs from leaking sensitive information
    • Control over the use of data without copying the raw data
  10. Rapid Advancement in Responsible Data Technologies

    • Differential privacy ensures computation output won’t leak sensitive information about individuals
    • Secure computing (secure hardware, MPC, FHE, etc.) keeps data confidential even while in use by an application
    • Federated learning means data never leaves an owner’s machine, and models are trained in a distributed manner
    • Distributed ledger provides an immutable log to ensure data usage is compliant
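
The federated learning idea on this slide (data stays on each owner’s machine and only model updates are shared) can be illustrated with a minimal sketch of federated averaging. The one-parameter model, the client datasets, and the names `local_update` and `federated_average` are assumptions made for illustration; they are not taken from the talk.

```python
def local_update(w, data, lr=0.05, steps=1):
    """Run gradient descent on one client's private data for the model y ~ w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w, clients, rounds=20):
    """Each round: clients train locally, then the server averages their weights."""
    total = sum(len(data) for data in clients)
    for _ in range(rounds):
        local_weights = [local_update(w, data) for data in clients]
        # Weight each client's model by its number of examples.
        w = sum(lw * len(data) for lw, data in zip(local_weights, clients)) / total
    return w


# Hypothetical private datasets (all generated from y = 2x); the raw (x, y)
# pairs are only ever read inside local_update, never by the server loop.
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w_global = federated_average(0.0, clients)
```

Only the scalar weights cross the client/server boundary here; a real deployment would typically add secure aggregation or differential privacy on top, since model updates can themselves leak information.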
  11. Secure computation techniques Trusted hardware Fully homomorphic encryption (FHE) Secure

    multi-party computation (MPC) Zero-knowledge proof Performance Support for general-purpose computation Security mechanisms Secure hardware Cryptography, distributed trust Cryptography Cryptography, local computation
  12. Performance of Homomorphic Encryption (HE) Methods

    MNIST: Amortized inference time per instance (ms) [Jiang-CCS-2018, Bourse-CRYPTO-2018, Boemer-CF-2019, Boemer-WAHC-2019]. Performance of HE-based methods for NN inference has improved by 2 orders of magnitude in the last several years.
    Training time [Lou-NeurIPS-2020]. HE-based methods are still too slow for large-network inference, and currently impractical for training even small networks.
  13. Performance of Secure Multi-party Computation (MPC) Methods

    2PC-based method for inference: ImageNet (ResNet50): ~550s per instance [Rathee-CCS-2020] vs ~3ms on GPU, ~155ms on CPU. Training is too slow even for small networks.
    3PC performance for inference & training: FALCON [Wagh-PETS-2021] is ~3-4 orders of magnitude slower than the GPU version and ~2-3 orders of magnitude slower than the CPU version (numbers are for a batch of size 128, in milliseconds). ImageNet (VGG-16) training takes more than 7 years [Wagh-PETS-2021].
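
To give a feel for what MPC protocols compute, here is a minimal sketch of additive secret sharing, one of the basic building blocks behind such protocols. It is not the protocol used in the cited 2PC or 3PC systems; the modulus, the function names, and the salary inputs are assumptions chosen for illustration.

```python
import secrets

PRIME = 2**61 - 1  # modulus; must exceed the largest sum we want to recover


def share(value, n_parties):
    """Split value into n additive shares that sum to value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares):
    """Recombine shares by adding them mod PRIME."""
    return sum(shares) % PRIME


def secure_sum(private_inputs):
    """Compute the sum of all inputs without any party seeing another's input."""
    n = len(private_inputs)
    # Party i splits its input and sends the j-th share to party j.
    all_shares = [share(v, n) for v in private_inputs]
    # Each party adds the shares it received; one partial sum alone looks random.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


salaries = [52_000, 61_000, 58_500]  # hypothetical private inputs
total = secure_sum(salaries)
```

Addition is nearly free in this scheme; multiplications and non-linear operations need extra rounds of communication, which is a large part of why the neural-network workloads above are so much slower than plaintext execution.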
  14. Secure computation techniques Trusted hardware Fully homomorphic encryption Secure multi-party

    computation Zero-knowledge proof Performance Support for general-purpose computation Security mechanisms Secure hardware Cryptography, distributed trust Cryptography Cryptography, local computation
  15. Secure Hardware OS Applications Secure Enclave Smart contract & data

    Enclave contents Integrity Confidentiality Remote Attestation
  16. Secure Enclave as a Cornerstone Security Primitive

    • Strong security capabilities
      ◦ Authenticate itself (device)
      ◦ Authenticate software
      ◦ Guarantee the integrity and confidentiality of execution
    • Platform for building new security applications
      ◦ Couldn’t be built otherwise for the same practical performance
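
The “authenticate itself” and “authenticate software” capabilities are what remote attestation provides. The toy sketch below shows only the shape of that exchange. It assumes a shared `device_key` and uses an HMAC as a stand-in for the hardware-backed asymmetric signature and vendor certificate chain that real enclaves use; every name in it is hypothetical.

```python
import hashlib
import hmac
import secrets


def measure(enclave_code: bytes) -> str:
    """Measurement: a hash of the code loaded into the enclave."""
    return hashlib.sha256(enclave_code).hexdigest()


def make_report(device_key: bytes, enclave_code: bytes, nonce: bytes) -> dict:
    """Device side: bind the measurement to the verifier's nonce and authenticate it."""
    measurement = measure(enclave_code)
    tag = hmac.new(device_key, measurement.encode() + nonce, hashlib.sha256).hexdigest()
    return {"measurement": measurement, "nonce": nonce, "tag": tag}


def verify_report(device_key: bytes, report: dict, expected_measurement: str, nonce: bytes) -> bool:
    """Verifier side: check the tag, the fresh nonce, and the expected code hash."""
    expected_tag = hmac.new(
        device_key, report["measurement"].encode() + nonce, hashlib.sha256
    ).hexdigest()
    return (
        hmac.compare_digest(report["tag"], expected_tag)
        and report["nonce"] == nonce
        and report["measurement"] == expected_measurement
    )


device_key = secrets.token_bytes(32)   # stand-in for a key fused into the chip
trusted_code = b"def handler(data): return len(data)"
nonce = secrets.token_bytes(16)        # freshness challenge from the verifier

report = make_report(device_key, trusted_code, nonce)
ok = verify_report(device_key, report, measure(trusted_code), nonce)
```

The nonce prevents replay of an old report, and the measurement check is what lets a remote party refuse to send data to an enclave running unexpected code.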
  17. 2016 SEV: Secure Encrypted Virtualization - Introduced in EPYC server

    processor line - Provides confidentiality but not integrity 2017 2014 SGX: Software Guard Extensions Built into all Core™ processors (6th-generation and later) Trusted Execution Environment - Hardware-based isolation - TLK: open-source stack for TEE 2015 ARM TrustZone Hardware-based isolation for embedded devices 2018 - Remedies issues in previous secure hardware - Can be publicly analyzed and verified - Can be manufactured by any manufacturer - First release: Fall 2018 Keystone: Open-source secure enclave https://keystone-enclave.github.io Trusted hardware timeline Closed source Open source Intel SGX version 2 - In pipeline - Drivers already available
  18. Keystone: an Open Framework for Customizable TEEs

    • Simpler Abstraction
      ◦ Decoupling core security primitives (SM) from all the other features (Runtime)
      ◦ Memory isolation with RISC-V standard feature (PMP)
    • Modular and Flexible Design
      ◦ Extensible functional and security plugins
      ◦ Implement new features without changing core primitive
    • Minimal Trusted Computing Base
      ◦ Base SM: 6000 LoC, Base Runtime: 3000 LoC
    [Diagram labels: U-mode, S-mode, M-mode; User process, OS, Hypervisor, Enclave App, Enclave Runtime, Security Monitor, Root of Trust; Trusted, Untrusted; Privilege: Higher, Lower; Standard RISC-V HW]
    keystone-enclave.org
  19. Experimental Results

    I/O Benchmark (IOZone); Machine Learning Benchmark (Torch); User CPU Overhead < 0.6% (Beebs, Coremark, RV8)
    D. Lee, D. Kohlbrenner, S. Shinde, D. Song, K. Asanovic. "Keystone: An Open Framework for Architecting TEEs", EuroSys 2020.
  20. In 10 Years: Secure Computing will Become Commonplace

    • In 10 years, most chips will have secure enclave (secure execution environment) capabilities
    • In 10 years, most computation will use secure enclaves (secure execution environments)
    • In 10 years, hardware accelerators for cryptographic methods for secure computation will be widely available
  21. Rapid Advancement in Responsible Data Technologies

    • Differential privacy ensures computation output won’t leak sensitive information about individuals
    • Secure computing (secure hardware, MPC, FHE, etc.) keeps data confidential even while in use by an application
    • Federated learning means data never leaves an owner’s machine, and models are trained in a distributed manner
    • Distributed ledger provides an immutable log to ensure data usage is compliant
  22. Do Neural Networks Remember Training Data? Can Attackers Extract Secrets (in Training Data) from (Querying) Learned Models?

    N Carlini, C Liu, J Kos, Ú Erlingsson, and D Song. “The Secret Sharer: Measuring Unintended Neural Network Memorization & Extracting Secrets”. 2018.
    N Carlini, et al., “Extracting Training Data from Large Language Models”
  23. N Carlini, C Liu, J Kos, Ú Erlingsson, D Song.

    The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks 2018 1. Train 2. Predict "What are you" "doing"
  24. Extracting Social Security Number from Language Model

    • Learning task: train a language model on Enron Email dataset
      ◦ Containing actual people’s credit card and social security numbers
    • New attacks: can extract 3 of the 10 secrets completely by querying trained models
    • New measure “Exposure” for memorization
      ◦ Used in Google Smart Compose
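
The “Exposure” measure can be made concrete with a short sketch. It follows the definition in the cited Secret Sharer paper, exposure = log2(|R|) - log2(rank), where R is the space of candidate secrets and rank is the position of the true secret when candidates are sorted by model log-perplexity (lowest first). The candidate strings and scores below are made up for illustration.

```python
import math


def exposure(log_perplexities, secret):
    """Exposure = log2(|R|) - log2(rank of the secret among all candidates)."""
    secret_score = log_perplexities[secret]
    # Rank 1 means the model finds the true secret more likely than any other candidate.
    rank = 1 + sum(1 for s in log_perplexities.values() if s < secret_score)
    return math.log2(len(log_perplexities)) - math.log2(rank)


# Hypothetical model scores for 8 candidate secrets (lower = more likely).
scores = {
    "000-00-0001": 41.2, "000-00-0002": 39.8, "000-00-0003": 12.5,
    "000-00-0004": 40.1, "000-00-0005": 38.7, "000-00-0006": 42.0,
    "000-00-0007": 40.9, "000-00-0008": 39.3,
}
memorized = exposure(scores, "000-00-0003")      # most likely candidate
not_memorized = exposure(scores, "000-00-0006")  # least likely candidate
```

A secret ranked first reaches the maximum exposure of log2(|R|) bits, while a secret the model treats like any random candidate has exposure near zero.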
  25. Training Data Extraction from Large Scale Language Models (GPT-2)

    • Use GPT-2 to minimize harm (model and data are public)
      ◦ Attacks apply to any LM
    • Choose 100 samples from each of 18 different attack configurations -> 1800 samples
    [Diagram labels: Training Data Extraction Attack: Prefixes; LM (GPT-2); 200,000 LM Generations; Sorted Generations (using one of 6 metrics); Deduplicate; Choose Top-100. Evaluation: Internet Search; Check Memorization; Match; No Match]
  26. Preventing Memorization

    • Differential Privacy: a formal notion of privacy to protect sensitive inputs
    • Solution: train a differentially-private neural network
      ◦ Exposure is lower empirically
      ◦ Attack unable to extract secrets
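
The slide does not say how the differentially private network is trained. One widely used approach is gradient clipping plus Gaussian noise (commonly called DP-SGD); the sketch below assumes that approach and shows a single update step in plain Python. The parameter names and toy gradients are hypothetical, and the sketch does not track the privacy budget.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale grad down so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """Clip each example's gradient, sum, add Gaussian noise, average, then step."""
    dim = len(weights)
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, clip_norm)
        for i in range(dim):
            total[i] += clipped[i]
    # Noise is calibrated to clip_norm, the most any one example can contribute.
    sigma = noise_multiplier * clip_norm
    batch = len(per_example_grads)
    noisy_mean = [(total[i] + rng.gauss(0.0, sigma)) / batch for i in range(dim)]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
new_weights = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                          noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much a single training example (such as one containing a secret) can move the model, and the noise masks what remains, which is consistent with the lower exposure reported on the slide.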
  27. Differential Privacy by Program Rewriting

    • Chorus automatically rewrites input SQL queries into intrinsically private queries
      ◦ Embeds a differential privacy mechanism in the query
      ◦ Does not require any modifications to database engine or data
      ◦ Works with any standard SQL database
    [Diagram labels: Analyst; Query; Chorus (Query Analysis & Rewriting Engine); Intrinsically private query; Original database; Differentially private results]
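
The rewriting idea can be sketched in a few lines. This is not Chorus’s actual implementation or API: it is a toy rewriter that handles only `COUNT(*)` and embeds Laplace noise directly in the SQL, run against an in-memory SQLite database. The table, the `laplace_noise` SQL function, and `rewrite_count_query` are all assumptions made for illustration.

```python
import random
import re
import sqlite3


def laplace_noise(scale):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)


def rewrite_count_query(sql, epsilon):
    """Embed Laplace noise in COUNT(*); a count has sensitivity 1, so scale = 1/epsilon."""
    scale = 1.0 / epsilon
    return re.sub(r"COUNT\(\*\)", f"(COUNT(*) + laplace_noise({scale!r}))",
                  sql, flags=re.IGNORECASE)


# The database engine is unmodified; we only register a user-defined function.
conn = sqlite3.connect(":memory:")
conn.create_function("laplace_noise", 1, laplace_noise)
conn.execute("CREATE TABLE patients (age INTEGER)")
conn.executemany("INSERT INTO patients VALUES (?)", [(a,) for a in (25, 41, 58, 63, 70)])

original = "SELECT COUNT(*) FROM patients WHERE age > 40"
private_sql = rewrite_count_query(original, epsilon=0.5)
(noisy_count,) = conn.execute(private_sql).fetchone()
```

The analyst still submits ordinary SQL and receives a number; the privacy mechanism travels inside the rewritten query rather than inside the database engine. A real system must also analyze joins and other aggregates, whose sensitivity is not simply 1.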
  28. Duet: An Expressive Higher-Order Language and Linear Type System for

    Statically Enforcing Differential Privacy [Distinguished Paper Award, OOPSLA 2019] satisfies - differential privacy e.g., Duet automatically proves that:
  29. The Platform for a Responsible Data Economy

    • Differential privacy ensures computation output won’t leak sensitive information about individuals
    • Secure computing (secure hardware, MPC, FHE, etc.) keeps data confidential even while in use by an application
    • Federated learning means data never leaves an owner’s machine, and models are trained in a distributed manner
    • Distributed ledger provides an immutable log to ensure data usage is compliant
    Secure Distributed Computing Fabric
  30. Data Commons for Decentralized Data Science

    • Data owners/producers register datasets in data catalogs with policy specified
    • Data consumers search through data catalogs to find relevant data
    • Data consumers write data analytics and machine learning programs over different datasets and data sources
    • The platform provides distributed secure computing while ensuring the program is compliant with desired policies
    • Reducing friction in data usage; removing data silos; enforcing security and privacy protection
  31. In 10 Years: Data Trusts/Commons will Become Predominant

    • In 10 years, data trusts/commons will become predominant ways of utilizing diverse sources of data, enabling an ownership economy where users benefit from their data as owners & partners
    • In 10 years, data stewards/fiduciaries/trustees will be a new class of entities important in the data ecosystem, managing/protecting users’ data and growing its value
    • In 10 years, huge economic value will be created through these new forms of data trusts/commons, orders of magnitude higher than today’s data marketplace
  32. Creating a new type of asset: Data Assets

    The combination of secure computing and blockchain allows for a new paradigm of data assetization.
    • Blockchain allows for logging and enforcement of usage policies with high integrity and auditability.
    • Secure computing ensures that data remains private during compute and cannot be reused without permission.
    This capsule of data + policies creates an asset that can be consumed along specific guidelines for a specific fee or exchange of value.
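
The “logging with high integrity and auditability” property can be illustrated with a hash-chained usage log. This is a single-machine sketch of tamper evidence only; it omits the consensus and replication that an actual blockchain adds, and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def entry_hash(prev_hash, record):
    """Hash a record together with the hash of the entry before it."""
    payload = json.dumps({"prev": prev_hash, "record": record}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def append(log, record):
    """Append a usage record, chaining it to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": entry_hash(prev_hash, record)})


def verify(log):
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


usage_log = []
append(usage_log, {"dataset": "genome-123", "consumer": "lab-A", "purpose": "research"})
append(usage_log, {"dataset": "genome-123", "consumer": "lab-B", "purpose": "research"})
intact = verify(usage_log)
```

Because each entry commits to the one before it, rewriting an old usage record requires recomputing every later hash, which is what makes after-the-fact audits meaningful.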
  33. With Data Tokenization the Oasis Platform can unlock a new

    responsible data economy where individuals can maintain data rights and earn value from their data assets.
  34. Recent layoffs in DTC genomics indicate a strong need for privacy

    Privacy was cited by both 23andMe and Ancestry as a main reason for the decline in the D2C market.
    https://techcrunch.com/2020/02/05/ancestry-lays-off-6-of-staff-as-consumer-genetic-testing-market-continues-to-decline/
    https://www.theverge.com/2020/1/23/21078911/23andme-layoffs-100-employees-ceo-privacy-dna-testing
  35. Use Case: Beta launch on Oasis Platform to give users control of their genome data

    Problem:
    • Genomic data is incredibly valuable and personal, and is often sold or misused by genome sequencing companies.
    • Nebula Genomics needs a solution that allows their customers to have better control and oversight of their data.
    Solution:
    • Nebula’s users leverage Oasis platform to control how their genome is used and ensure it remains private.
    • Data is kept private through TEEs and a record of actions is stored on the Oasis platform.
  36. A Framework for a Responsible Data Economy

    Technical Solutions; Legal Framework; Incentive Models
    Problem is complex
    • Natural tension between utility & privacy
    • Data is non-rival; cannot simply copy concepts and methods in analog world
    • Requires a combination of technical & non-technical solutions
  37. Need Better Incentive Models: How to determine & distribute value of data?

    “Towards Efficient Data Valuation Based on the Shapley Value.” Jia*, Dao*, Wang, Hubis, Gurel, Li, Zhang, Spanos, Song. AISTATS 2019.
    “Efficient Data Valuation for Nearest Neighbor Algorithms.” Jia, Dao, Wang, Hubis, Gurel, Hynes, Li, Zhang, Spanos, Song. VLDB 2019.
    “An Empirical and Comparative Analysis of Data Valuation with Scalable Algorithms.” Jia, Sun*, Xu*, Zhang, Li, Song. arXiv:1911.07128
    Machine learning as a coalitional game:
      - Data contributors are players in a coalition
      - Usefulness of data is characterized via utility function
    Shapley value:
      - Defines a way of distributing the profit generated by the coalition of all players
      - First proposed by Lloyd Shapley in 1953
      - The only distribution that satisfies a collection of desirable properties
      - Provides a good measure of importance of data points
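
The coalitional-game framing can be made concrete with a brute-force Shapley computation. It averages each contributor’s marginal gain over all orderings, so it is only feasible for a handful of players; the cited papers are about making this efficient. The contributors and the coverage-based `utility` function are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all join orders."""
    values = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        prev = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            cur = utility(coalition)
            values[p] += cur - prev  # marginal gain from adding p
            prev = cur
    n_orders = factorial(len(players))
    return {p: v / n_orders for p, v in values.items()}


# Hypothetical contributors and the label classes their data covers.
contributions = {"alice": {"cat", "dog"}, "bob": {"dog"}, "carol": {"bird"}}


def utility(coalition):
    """Toy utility: number of distinct classes covered by the coalition's data."""
    covered = set()
    for member in coalition:
        covered |= contributions[member]
    return len(covered)


values = shapley_values(list(contributions), utility)
```

In this toy game the value of the “dog” class is split between the two contributors who both supply it, while unique contributions are credited in full, and the values sum to the utility of the full coalition.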
  38. Open Challenge: What are data rights? Who controls data rights?

    Individual property rights are a cornerstone of the modern economy. They helped establish modern economics and fueled centuries of significant growth.
    Ishay, Micheline (2008). The History of Human Rights: From Ancient Times to the Globalized Era. University of California Press. p. 91. ISBN 978-0-520-25641-5
    Besley, Timothy; Ghatak, Maitreesh (2009). Rodrik, Dani; Rosenzweig, Mark R (eds.). "Property Rights and Economic Development". Handbook of Development Economics. V: 4526–28.
  39. Open Challenge: What are data rights? Who controls data rights?

    Today, we lack an adequate framework for data rights. Establishing data rights will:
    • Allow individuals to derive value from their data
    • Propel economic growth and unlock new value
  40. Explore diverse concepts/frameworks

    • Data as labor, where individuals can form unions and collectively bargain for fair compensation, was proposed by researchers Eric Posner and Glen Weyl.
    • Standard minimum wage, where users are guaranteed some base compensation in exchange for useful data.
    • Public data banks (or data trusts) that are regulated by government agencies, floated by journalist Rana Foroohar.
    • Big-tech-led initiatives that offer users tools to manage, download, and even delete their own data.
    https://hbr.org/2020/01/why-companies-make-it-so-hard-for-users-to-control-their-data
  41. Need data-driven, technology-informed regulation

    • How will advancement of responsible data technology influence/impact regulatory frameworks?
    • How can regulation help with faster, broader adoption of responsible data technology?
  42. We Must Build a Responsible Data Economy for the Future of the Internet

    Technical Solutions; Legal Framework; Incentive Models
    Problem is complex
    • Natural tension between utility & privacy
    • Data is non-rival; cannot simply copy concepts and methods in analog world
    • Requires a combination of technical & non-technical solutions