Upgrade to Pro — share decks privately, control downloads, hide ads and more …

Secure Exchange SQL: Building A Privacy-Preserving Data Clean Room Over Trino (Trino Summit 2023)

Taro L. Saito
December 13, 2023

Secure Exchange SQL: Building A Privacy-Preserving Data Clean Room Over Trino (Trino Summit 2023)

Secure Exchange SQL is a production data clean room service deployed at Treasure Data, which leverages Trino and differential privacy technology to enable cross-company data analysis while mitigating the risk of privacy breaches. In this session, we will introduce the concept of differential privacy and discuss the privacy protection methods that need to be implemented during SQL processing. To minimize changes to Trino's codebase, we employed approaches of SQL rewriting and validation at the logical plan level. We will explain these methods and provide practical use cases of our Data Clean Room.

Related Links:
- Trino Summit 2023: https://www.starburst.io/info/trinosummit2023/
- Treasure Data https://www.treasuredata.com/
- Airframe (OSS): https://github.com/wvlet/airframe

Taro L. Saito

December 13, 2023

More Decks by Taro L. Saito

Other Decks in Technology


  1. Secure Exchange SQL Building A Privacy Preserving Data Clean Room

    Over Trino Trino Summit 2023 December 13, 2023 Taro L. Saito (Leo) Senior Principal Engineer at Treasure Data
  2. A Brief History of Treasure Data and Trino • 2011

    ◦ Treasure Data was founded as a Hadoop as a Service company (Mountain View, CA) • 2013 ◦ Presto was open-sourced from Facebook • 2014 ◦ Treasure Data launched Presto as a Service for interactive query processing • 2015 ◦ 20,000 queries processed / day • 2019 ◦ 1,000,000 queries processed / day ◦ Presto Conference Tokyo, Japan (Report in Trino Blog) ▪ Invited Presto Creators (Martin, Dain, David) • 2020 ◦ Presto was rebranded to Trino • 2023 ◦ Processing 2.5+ million queries processed / day with Trino ◦ Launched Data Clean Room (Secure Exchange SQL) service based on Trino 2
  3. Treasure Data: Architecture • Collecting 10+ years of query logs

    in our own system • Pre-production testing with query simulator (DBTest 2022) ◦ Build customer-specific benchmark from query logs ◦ Run query regression tests using real production data Logs Batch Data Partition Indexes (PostgreSQL) Realtime Storage (S3) Archive Storage (S3) Merge periodically Table Catalog Query Logs PlazmaDB Tables Query Logs Simulation Logs Customer-Specific Benchmarks Replay Queries Production Cluster Control Cluster Test Cluster PlazmaDB Query Simulator Simulation Reports Build Benchmarks Extract Workload Pre-production Collect Query Logs Read/Write Partitions Stream Import Bulk Import 3
  4. Trino at Treasure Data (2023) • Users ◦ 700+ accounts

    ◦ 6,000+ users • 2x usage increase since 2022 ◦ Trino: 2.5+ million queries/day, Hive: 100k+ queries/day ◦ 200+ trillion rows processed/day ◦ 10+ billion S3 GET requests/day for reading partition data on AWS S3 • Treasure Data ◦ Provides Customer Data Platform (CDP) for customers ◦ Our customers manage their own customers’ activity logs on CDP • Challenge ◦ How to facilitate data sharing between CDPs while preserving privacy? ▪ Data Clean Room (DCR): Cross-Party Data Sharing 4
  5. New: Secure Exchange SQL (2023) • Enables privacy-preserving cross-party query

    processing through Trino ◦ Queries run inside the provider account, and the consumer will get only anonymized statistics 5 Provider Consumer Secure Exchange SQL Consumer Tables Analytic SQL Provider Tables Distributed Query Engine (Trino) Provider Account Secure join User-Level Sampling Access Delegation Encrypted Data Noise Injection Anonymized Statistics Sensitivity Analysis Privacy Policy Check Privacy Budget Management User Segment Insights Aggregation Thresholding Consumer Account Differential Privacy Query Rewrite
  6. Privacy-Preserving Query Rewrite • Automatically rewrite SQL for protecting privacy

    from query results ◦ Add random noise to aggregation expressions (e.g., COUNT, SUM, AVG, etc.) ◦ Hide aggregation results from a small number of people, e.g., less than 100 users • Adjust the amount of noise with ε parameter 6 Less Noise ε = 1.0 More Noise ε = 0.25
  7. Differential Privacy • A method for quantifying the privacy with

    ε (epsilon) parameter ◦ A higher ε-value adds less noise, but it loses more privacy ◦ A lower ε-value adds more noise, but it protects more privacy • ε parameter = privacy budget ◦ Each time you run a query, you will consume ε-privacy budget for the database ◦ Once the entire budget is used, no more query can be issued to the database 7 Less Noise ε = 1.0 More Noise ε = 0.25
  8. Real-World Application of Differential Privacy • Google Maps ◦ The

    visitor histogram adds noise to hide a contribution from a single person. • See also a list of real-world uses of differential privacy 8 An Example from Privacy on Apache Beam Privacy Leak! Randomized with Noise If no noise is applied …
  9. Secure Exchange SQL: Internals • An SQL compiler written in

    Scala ◦ Using the same ANTLR4 grammar with Trino SQL • Step 1: Parse SQL ◦ Build a LogicalPlan (unresolved) from a given SQL query • Step 2: Resolve column types ◦ With table schema and column tags (e.g., personal identifiers, private columns, etc.) • Tracks the propagation of personal records within the SQL plan • This SQL module is available in OSS: airframe-sql ◦ Note: The other Secure Exchange SQL modules are closed-source (proprietary) 9 Airframe Parse SQL Resolve Column Types
  10. Step 3: Validate SQL Plans • Reject certain types of

    queries and function usages • Examples: ◦ Reject other than SELECT queries ◦ Reject queries that return private columns ◦ Reject any transformation of private columns. • Powerful pattern matching in Scala 10 Reject Private Column Output Only allows SELECT (Relation) query
  11. Step 4: Rewrite SQL • Transform logical plans ◦ Add

    noise ◦ Add thresholding conditions ◦ etc. • Step 5: Post validation ◦ Confirm the rewritten plan satisfies the privacy policies • Step 6: Generate SQL ◦ Produce a Trino-compatible SQL statement from the rewritten LogicalPlan 11 Adding Laplace Noise to Aggregation Expressions Find Aggregation Expressions Add Noise UDF
  12. A More Complex SQL Rewrite Example • e.g., HAVING count(distinct

    PID) > threshold • If personal identifier (PID) column is missing, pull it up from subqueries 12 Add Thresholding Condition Pull-Up PID Columns (if missing) Recursively traverse plans to find PID columns
  13. Testing Query Rewrite • Describe test cases in YAML ◦

    Original query ◦ Expected rewrite ◦ Configuration parameters ◦ Table catalog (schema) • Verify the rewritten query ◦ Check all rules at unit tests • Launching Trino in the unit test, and actually execute the rewritten query to find any syntax errors 13 User Contribution Bounding Noise Injection Aggregation Thresholding Result Cutoff After Noise Original SQL Rewritten SQL PID Propagation
  14. Secure Exchange SQL: Supported Queries • A subset of Trino

    SQL ◦ Only SELECT … GROUP BY queries ◦ No update • Aggregation requirement ◦ The final result of SQL queries must be an aggregation result (e.g., COUNT, SUM, AVG, etc.) ▪ Some aggregation functions like array_agg, window functions are not allowed • No private column exposure ◦ Private columns (e.g,. personal identifiers) cannot be used for GROUP BY keys or the final output columns • Avoid personal record amplification with joins ◦ Only support equi-joins, i.e., inner joins on personal identifiers ▪ JOIN USING (pid) 14
  15. • To avoid directly sharing raw personal records ◦ For

    the ease of taking user consent ▪ Sharing anonymized statistics is OK, but sharing raw data is NG ◦ Compliance with privacy regulation laws (GDPR, CCPA, HIPAA, etc.) • Access control methods are insufficient for privacy protection ◦ e.g., Database, table, column-level permission ◦ Column value masks (e.g., hiding Name, SSN) using Apache Ranger plugin for Trino ◦ Even though private columns are hidden, other non-private columns may work as personal identifiers (quasi identifiers) Why Only Aggregation Queries? 15 Differential Privacy in the Wild. Part 1 (SIGMOD 2017 Tutorial)
  16. Database Reconstruction Attack (DRA) • If rules for producing query

    results are known, the original data can be reconstructed even from aggregated results. ◦ Example rules: ▪ Mean (average) age 38 = (A1 + A2 + … + A7) / 7 ▪ Female count 4 = 7 - (S1 + S2 + … + S7), etc. ◦ 164 constraints in total • Test all possible combinations that can produce the target aggregation values (Brute-force) • Requires only 0.2 seconds with an OSS SAT Solver (Glucose) 16 Aggregated Results Reconstructed Data
  17. Differentially Private SQL Engine • Differential Privacy (DP) ◦ Formalized

    by Cynthia Dwork in 2006 to provide a mathematically rigorous definition of privacy ◦ DP is a property of a mechanism (e.g., SQL engine) to produce almost the same results regardless of the presence or absence of a single person in the database. • Differentially Private SQL Engine ◦ Considers all possible neighbour databases D1 and D2 that differ only records from a single person ◦ Ensures DP by adding sufficient amount of noise (scaled with ε) and thresholds 17 An illustration from “Why differential privacy is awesome” D1 D2
  18. Differentially Private GROUP BY Aggregation • GROUP BY might output

    unique key values from a small number of people ◦ In this case, query results for the neighbor databases Q(D1) and Q(D2) will be consistently different => Not differentially private. • Aggregation thresholding is required to hide small group keys ◦ Complement noise because it can’t be added for non-existent GROUP BY keys 18 Thresholding HAVING COUNT(distinct PID) > (threshold) Q(D1) Q(D2) Q(D1) and Q(D2) will be similar results = Differentially Private An example from “Almost differential privacy” (2019) Q(D1) and Q(D2) differ = Not Differentially Private
  19. User-Contribution Bounding • Sensitivity: How sensitive the query result is

    in the presence of a single person in the database? ◦ If a single user has 1,000 records in the database, the result of count(*) queries can be changed by 1,000 if the user is removed from the database. • If sensitivity is high, more noise is needed to satisfy DP • A practical approach: Take at most k-records from each user via sampling ◦ Implemented as a window function in Trino, using the memory-efficient reservoir sampling algorithm. 19 An example from “Getting more useful results with differential privacy” (2021) User-Level Sampling
  20. 30+ Rules for Validation and SQL Rewrite • Adding various

    rules to make query results differentially private (= almost the same) ◦ Noise, threshold, user-contribution bound, etc. ◦ Configurable with feature flags • Leveraging the powerful pattern matching in Scala ◦ Recursive tree traversal and transformation rules ◦ Spark SQL also uses pattern matching of Scala for query optimization 20
  21. Increased Adoption of Differentially Privacy • The 2020 US Census

    published the data adding noise with DP guarantee • Google ◦ Open sourced google/differential-privacy library for computing noise ◦ ZetaSQL with DP SQL support ◦ DP for COVID-19 tracing data • AWS ◦ Nov 2023: Announced differential privacy support in AWS Clean Rooms 21
  22. Secure Exchange SQL: Summary 22 • A differentially private SQL

    engine ◦ Produces almost the same query results regardless of the presence of an individual in the database • Privacy protection ◦ Tracks personal record propagation inside the SQL ◦ Anonymize aggregation query results via noise and thresholding • Minimized changes to Trino ◦ Rewrites SQL outside Trino ◦ UDFs for adding noise and sampling ◦ Table catalog: Access delegation to another account
  23. Further Readings • Readings in Differential Privacy https://github.com/xerial/dp-readings • Topics

    not addressed in this talk ◦ Sensitivity for other aggregation operators ◦ Privacy budget management ◦ Trade offs with utility and privacy ◦ Gaussian noise ◦ Variants of DP ◦ Machine Learning ◦ Global vs Local DP ◦ Secure joins ◦ User-level aggregations ◦ etc. 23