Slide 1

Slide 1 text

Secure Exchange SQL Building A Privacy Preserving Data Clean Room Over Trino Trino Summit 2023 December 13, 2023 Taro L. Saito (Leo) Senior Principal Engineer at Treasure Data

Slide 2

Slide 2 text

A Brief History of Treasure Data and Trino ● 2011 ○ Treasure Data was founded as a Hadoop as a Service company (Mountain View, CA) ● 2013 ○ Presto was open-sourced from Facebook ● 2014 ○ Treasure Data launched Presto as a Service for interactive query processing ● 2015 ○ 20,000 queries processed / day ● 2019 ○ 1,000,000 queries processed / day ○ Presto Conference Tokyo, Japan (Report in Trino Blog) ■ Invited Presto Creators (Martin, Dain, David) ● 2020 ○ Presto was rebranded to Trino ● 2023 ○ Processing 2.5+ million queries processed / day with Trino ○ Launched Data Clean Room (Secure Exchange SQL) service based on Trino 2

Slide 3

Slide 3 text

Treasure Data: Architecture ● Collecting 10+ years of query logs in our own system ● Pre-production testing with query simulator (DBTest 2022) ○ Build customer-specific benchmark from query logs ○ Run query regression tests using real production data Logs Batch Data Partition Indexes (PostgreSQL) Realtime Storage (S3) Archive Storage (S3) Merge periodically Table Catalog Query Logs PlazmaDB Tables Query Logs Simulation Logs Customer-Specific Benchmarks Replay Queries Production Cluster Control Cluster Test Cluster PlazmaDB Query Simulator Simulation Reports Build Benchmarks Extract Workload Pre-production Collect Query Logs Read/Write Partitions Stream Import Bulk Import 3

Slide 4

Slide 4 text

Trino at Treasure Data (2023) ● Users ○ 700+ accounts ○ 6,000+ users ● 2x usage increase since 2022 ○ Trino: 2.5+ million queries/day, Hive: 100k+ queries/day ○ 200+ trillion rows processed/day ○ 10+ billion S3 GET requests/day for reading partition data on AWS S3 ● Treasure Data ○ Provides Customer Data Platform (CDP) for customers ○ Our customers manage their own customers’ activity logs on CDP ● Challenge ○ How to facilitate data sharing between CDPs while preserving privacy? ■ Data Clean Room (DCR): Cross-Party Data Sharing 4

Slide 5

Slide 5 text

New: Secure Exchange SQL (2023) ● Enables privacy-preserving cross-party query processing through Trino ○ Queries run inside the provider account, and the consumer will get only anonymized statistics 5 Provider Consumer Secure Exchange SQL Consumer Tables Analytic SQL Provider Tables Distributed Query Engine (Trino) Provider Account Secure join User-Level Sampling Access Delegation Encrypted Data Noise Injection Anonymized Statistics Sensitivity Analysis Privacy Policy Check Privacy Budget Management User Segment Insights Aggregation Thresholding Consumer Account Differential Privacy Query Rewrite

Slide 6

Slide 6 text

Privacy-Preserving Query Rewrite ● Automatically rewrite SQL for protecting privacy from query results ○ Add random noise to aggregation expressions (e.g., COUNT, SUM, AVG, etc.) ○ Hide aggregation results from a small number of people, e.g., less than 100 users ● Adjust the amount of noise with ε parameter 6 Less Noise ε = 1.0 More Noise ε = 0.25

Slide 7

Slide 7 text

Differential Privacy ● A method for quantifying the privacy with ε (epsilon) parameter ○ A higher ε-value adds less noise, but it loses more privacy ○ A lower ε-value adds more noise, but it protects more privacy ● ε parameter = privacy budget ○ Each time you run a query, you will consume ε-privacy budget for the database ○ Once the entire budget is used, no more query can be issued to the database 7 Less Noise ε = 1.0 More Noise ε = 0.25

Slide 8

Slide 8 text

Real-World Application of Differential Privacy ● Google Maps ○ The visitor histogram adds noise to hide a contribution from a single person. ● See also a list of real-world uses of differential privacy 8 An Example from Privacy on Apache Beam Privacy Leak! Randomized with Noise If no noise is applied …

Slide 9

Slide 9 text

Secure Exchange SQL: Internals ● An SQL compiler written in Scala ○ Using the same ANTLR4 grammar with Trino SQL ● Step 1: Parse SQL ○ Build a LogicalPlan (unresolved) from a given SQL query ● Step 2: Resolve column types ○ With table schema and column tags (e.g., personal identifiers, private columns, etc.) ● Tracks the propagation of personal records within the SQL plan ● This SQL module is available in OSS: airframe-sql ○ Note: The other Secure Exchange SQL modules are closed-source (proprietary) 9 Airframe Parse SQL Resolve Column Types

Slide 10

Slide 10 text

Step 3: Validate SQL Plans ● Reject certain types of queries and function usages ● Examples: ○ Reject other than SELECT queries ○ Reject queries that return private columns ○ Reject any transformation of private columns. ● Powerful pattern matching in Scala 10 Reject Private Column Output Only allows SELECT (Relation) query

Slide 11

Slide 11 text

Step 4: Rewrite SQL ● Transform logical plans ○ Add noise ○ Add thresholding conditions ○ etc. ● Step 5: Post validation ○ Confirm the rewritten plan satisfies the privacy policies ● Step 6: Generate SQL ○ Produce a Trino-compatible SQL statement from the rewritten LogicalPlan 11 Adding Laplace Noise to Aggregation Expressions Find Aggregation Expressions Add Noise UDF

Slide 12

Slide 12 text

A More Complex SQL Rewrite Example ● e.g., HAVING count(distinct PID) > threshold ● If personal identifier (PID) column is missing, pull it up from subqueries 12 Add Thresholding Condition Pull-Up PID Columns (if missing) Recursively traverse plans to find PID columns

Slide 13

Slide 13 text

Testing Query Rewrite ● Describe test cases in YAML ○ Original query ○ Expected rewrite ○ Configuration parameters ○ Table catalog (schema) ● Verify the rewritten query ○ Check all rules at unit tests ● Launching Trino in the unit test, and actually execute the rewritten query to find any syntax errors 13 User Contribution Bounding Noise Injection Aggregation Thresholding Result Cutoff After Noise Original SQL Rewritten SQL PID Propagation

Slide 14

Slide 14 text

Secure Exchange SQL: Supported Queries ● A subset of Trino SQL ○ Only SELECT … GROUP BY queries ○ No update ● Aggregation requirement ○ The final result of SQL queries must be an aggregation result (e.g., COUNT, SUM, AVG, etc.) ■ Some aggregation functions like array_agg, window functions are not allowed ● No private column exposure ○ Private columns (e.g,. personal identifiers) cannot be used for GROUP BY keys or the final output columns ● Avoid personal record amplification with joins ○ Only support equi-joins, i.e., inner joins on personal identifiers ■ JOIN USING (pid) 14

Slide 15

Slide 15 text

● To avoid directly sharing raw personal records ○ For the ease of taking user consent ■ Sharing anonymized statistics is OK, but sharing raw data is NG ○ Compliance with privacy regulation laws (GDPR, CCPA, HIPAA, etc.) ● Access control methods are insufficient for privacy protection ○ e.g., Database, table, column-level permission ○ Column value masks (e.g., hiding Name, SSN) using Apache Ranger plugin for Trino ○ Even though private columns are hidden, other non-private columns may work as personal identifiers (quasi identifiers) Why Only Aggregation Queries? 15 Differential Privacy in the Wild. Part 1 (SIGMOD 2017 Tutorial)

Slide 16

Slide 16 text

Database Reconstruction Attack (DRA) ● If rules for producing query results are known, the original data can be reconstructed even from aggregated results. ○ Example rules: ■ Mean (average) age 38 = (A1 + A2 + … + A7) / 7 ■ Female count 4 = 7 - (S1 + S2 + … + S7), etc. ○ 164 constraints in total ● Test all possible combinations that can produce the target aggregation values (Brute-force) ● Requires only 0.2 seconds with an OSS SAT Solver (Glucose) 16 Aggregated Results Reconstructed Data

Slide 17

Slide 17 text

Differentially Private SQL Engine ● Differential Privacy (DP) ○ Formalized by Cynthia Dwork in 2006 to provide a mathematically rigorous definition of privacy ○ DP is a property of a mechanism (e.g., SQL engine) to produce almost the same results regardless of the presence or absence of a single person in the database. ● Differentially Private SQL Engine ○ Considers all possible neighbour databases D1 and D2 that differ only records from a single person ○ Ensures DP by adding sufficient amount of noise (scaled with ε) and thresholds 17 An illustration from “Why differential privacy is awesome” D1 D2

Slide 18

Slide 18 text

Differentially Private GROUP BY Aggregation ● GROUP BY might output unique key values from a small number of people ○ In this case, query results for the neighbor databases Q(D1) and Q(D2) will be consistently different => Not differentially private. ● Aggregation thresholding is required to hide small group keys ○ Complement noise because it can’t be added for non-existent GROUP BY keys 18 Thresholding HAVING COUNT(distinct PID) > (threshold) Q(D1) Q(D2) Q(D1) and Q(D2) will be similar results = Differentially Private An example from “Almost differential privacy” (2019) Q(D1) and Q(D2) differ = Not Differentially Private

Slide 19

Slide 19 text

User-Contribution Bounding ● Sensitivity: How sensitive the query result is in the presence of a single person in the database? ○ If a single user has 1,000 records in the database, the result of count(*) queries can be changed by 1,000 if the user is removed from the database. ● If sensitivity is high, more noise is needed to satisfy DP ● A practical approach: Take at most k-records from each user via sampling ○ Implemented as a window function in Trino, using the memory-efficient reservoir sampling algorithm. 19 An example from “Getting more useful results with differential privacy” (2021) User-Level Sampling

Slide 20

Slide 20 text

30+ Rules for Validation and SQL Rewrite ● Adding various rules to make query results differentially private (= almost the same) ○ Noise, threshold, user-contribution bound, etc. ○ Configurable with feature flags ● Leveraging the powerful pattern matching in Scala ○ Recursive tree traversal and transformation rules ○ Spark SQL also uses pattern matching of Scala for query optimization 20

Slide 21

Slide 21 text

Increased Adoption of Differentially Privacy ● The 2020 US Census published the data adding noise with DP guarantee ● Google ○ Open sourced google/differential-privacy library for computing noise ○ ZetaSQL with DP SQL support ○ DP for COVID-19 tracing data ● AWS ○ Nov 2023: Announced differential privacy support in AWS Clean Rooms 21

Slide 22

Slide 22 text

Secure Exchange SQL: Summary 22 ● A differentially private SQL engine ○ Produces almost the same query results regardless of the presence of an individual in the database ● Privacy protection ○ Tracks personal record propagation inside the SQL ○ Anonymize aggregation query results via noise and thresholding ● Minimized changes to Trino ○ Rewrites SQL outside Trino ○ UDFs for adding noise and sampling ○ Table catalog: Access delegation to another account

Slide 23

Slide 23 text

Further Readings ● Readings in Differential Privacy https://github.com/xerial/dp-readings ● Topics not addressed in this talk ○ Sensitivity for other aggregation operators ○ Privacy budget management ○ Trade offs with utility and privacy ○ Gaussian noise ○ Variants of DP ○ Machine Learning ○ Global vs Local DP ○ Secure joins ○ User-level aggregations ○ etc. 23

Slide 24

Slide 24 text

Treasure Data Team: Data Clean Room Project 24