Secure Exchange SQL: Building A Privacy-Preserving Data Clean Room Over Trino (Trino Summit 2023)

Secure Exchange SQL Building A Privacy Preserving Data Clean Room
Over Trino Trino Summit 2023 December 13, 2023 Taro L. Saito (Leo) Senior Principal Engineer at Treasure Data

A Brief History of Treasure Data and Trino • 2011
◦ Treasure Data was founded as a Hadoop as a Service company (Mountain View, CA) • 2013 ◦ Presto was open-sourced from Facebook • 2014 ◦ Treasure Data launched Presto as a Service for interactive query processing • 2015 ◦ 20,000 queries processed / day • 2019 ◦ 1,000,000 queries processed / day ◦ Presto Conference Tokyo, Japan (Report in Trino Blog) ▪ Invited Presto Creators (Martin, Dain, David) • 2020 ◦ Presto was rebranded to Trino • 2023 ◦ Processing 2.5+ million queries processed / day with Trino ◦ Launched Data Clean Room (Secure Exchange SQL) service based on Trino 2

Treasure Data: Architecture • Collecting 10+ years of query logs
in our own system • Pre-production testing with query simulator (DBTest 2022) ◦ Build customer-speciﬁc benchmark from query logs ◦ Run query regression tests using real production data Logs Batch Data Partition Indexes (PostgreSQL) Realtime Storage (S3) Archive Storage (S3) Merge periodically Table Catalog Query Logs PlazmaDB Tables Query Logs Simulation Logs Customer-Speciﬁc Benchmarks Replay Queries Production Cluster Control Cluster Test Cluster PlazmaDB Query Simulator Simulation Reports Build Benchmarks Extract Workload Pre-production Collect Query Logs Read/Write Partitions Stream Import Bulk Import 3

Trino at Treasure Data (2023) • Users ◦ 700+ accounts
◦ 6,000+ users • 2x usage increase since 2022 ◦ Trino: 2.5+ million queries/day, Hive: 100k+ queries/day ◦ 200+ trillion rows processed/day ◦ 10+ billion S3 GET requests/day for reading partition data on AWS S3 • Treasure Data ◦ Provides Customer Data Platform (CDP) for customers ◦ Our customers manage their own customers’ activity logs on CDP • Challenge ◦ How to facilitate data sharing between CDPs while preserving privacy? ▪ Data Clean Room (DCR): Cross-Party Data Sharing 4

New: Secure Exchange SQL (2023) • Enables privacy-preserving cross-party query
processing through Trino ◦ Queries run inside the provider account, and the consumer will get only anonymized statistics 5 Provider Consumer Secure Exchange SQL Consumer Tables Analytic SQL Provider Tables Distributed Query Engine (Trino) Provider Account Secure join User-Level Sampling Access Delegation Encrypted Data Noise Injection Anonymized Statistics Sensitivity Analysis Privacy Policy Check Privacy Budget Management User Segment Insights Aggregation Thresholding Consumer Account Differential Privacy Query Rewrite

Privacy-Preserving Query Rewrite • Automatically rewrite SQL for protecting privacy
from query results ◦ Add random noise to aggregation expressions (e.g., COUNT, SUM, AVG, etc.) ◦ Hide aggregation results from a small number of people, e.g., less than 100 users • Adjust the amount of noise with ε parameter 6 Less Noise ε = 1.0 More Noise ε = 0.25

Differential Privacy • A method for quantifying the privacy with
ε (epsilon) parameter ◦ A higher ε-value adds less noise, but it loses more privacy ◦ A lower ε-value adds more noise, but it protects more privacy • ε parameter = privacy budget ◦ Each time you run a query, you will consume ε-privacy budget for the database ◦ Once the entire budget is used, no more query can be issued to the database 7 Less Noise ε = 1.0 More Noise ε = 0.25

Real-World Application of Differential Privacy • Google Maps ◦ The
visitor histogram adds noise to hide a contribution from a single person. • See also a list of real-world uses of diﬀerential privacy 8 An Example from Privacy on Apache Beam Privacy Leak! Randomized with Noise If no noise is applied …

Secure Exchange SQL: Internals • An SQL compiler written in
Scala ◦ Using the same ANTLR4 grammar with Trino SQL • Step 1: Parse SQL ◦ Build a LogicalPlan (unresolved) from a given SQL query • Step 2: Resolve column types ◦ With table schema and column tags (e.g., personal identiﬁers, private columns, etc.) • Tracks the propagation of personal records within the SQL plan • This SQL module is available in OSS: airframe-sql ◦ Note: The other Secure Exchange SQL modules are closed-source (proprietary) 9 Airframe Parse SQL Resolve Column Types

Step 3: Validate SQL Plans • Reject certain types of
queries and function usages • Examples: ◦ Reject other than SELECT queries ◦ Reject queries that return private columns ◦ Reject any transformation of private columns. • Powerful pattern matching in Scala 10 Reject Private Column Output Only allows SELECT (Relation) query

Step 4: Rewrite SQL • Transform logical plans ◦ Add
noise ◦ Add thresholding conditions ◦ etc. • Step 5: Post validation ◦ Conﬁrm the rewritten plan satisﬁes the privacy policies • Step 6: Generate SQL ◦ Produce a Trino-compatible SQL statement from the rewritten LogicalPlan 11 Adding Laplace Noise to Aggregation Expressions Find Aggregation Expressions Add Noise UDF

A More Complex SQL Rewrite Example • e.g., HAVING count(distinct
PID) > threshold • If personal identiﬁer (PID) column is missing, pull it up from subqueries 12 Add Thresholding Condition Pull-Up PID Columns (if missing) Recursively traverse plans to ﬁnd PID columns

Testing Query Rewrite • Describe test cases in YAML ◦
Original query ◦ Expected rewrite ◦ Conﬁguration parameters ◦ Table catalog (schema) • Verify the rewritten query ◦ Check all rules at unit tests • Launching Trino in the unit test, and actually execute the rewritten query to ﬁnd any syntax errors 13 User Contribution Bounding Noise Injection Aggregation Thresholding Result Cutoff After Noise Original SQL Rewritten SQL PID Propagation

Secure Exchange SQL: Supported Queries • A subset of Trino
SQL ◦ Only SELECT … GROUP BY queries ◦ No update • Aggregation requirement ◦ The final result of SQL queries must be an aggregation result (e.g., COUNT, SUM, AVG, etc.) ▪ Some aggregation functions like array_agg, window functions are not allowed • No private column exposure ◦ Private columns (e.g,. personal identifiers) cannot be used for GROUP BY keys or the final output columns • Avoid personal record amplification with joins ◦ Only support equi-joins, i.e., inner joins on personal identifiers ▪ JOIN USING (pid) 14

• To avoid directly sharing raw personal records ◦ For
the ease of taking user consent ▪ Sharing anonymized statistics is OK, but sharing raw data is NG ◦ Compliance with privacy regulation laws (GDPR, CCPA, HIPAA, etc.) • Access control methods are insufficient for privacy protection ◦ e.g., Database, table, column-level permission ◦ Column value masks (e.g., hiding Name, SSN) using Apache Ranger plugin for Trino ◦ Even though private columns are hidden, other non-private columns may work as personal identifiers (quasi identifiers) Why Only Aggregation Queries? 15 Differential Privacy in the Wild. Part 1 (SIGMOD 2017 Tutorial)

Database Reconstruction Attack (DRA) • If rules for producing query
results are known, the original data can be reconstructed even from aggregated results. ◦ Example rules: ▪ Mean (average) age 38 = (A1 + A2 + … + A7) / 7 ▪ Female count 4 = 7 - (S1 + S2 + … + S7), etc. ◦ 164 constraints in total • Test all possible combinations that can produce the target aggregation values (Brute-force) • Requires only 0.2 seconds with an OSS SAT Solver (Glucose) 16 Aggregated Results Reconstructed Data

Differentially Private SQL Engine • Differential Privacy (DP) ◦ Formalized
by Cynthia Dwork in 2006 to provide a mathematically rigorous definition of privacy ◦ DP is a property of a mechanism (e.g., SQL engine) to produce almost the same results regardless of the presence or absence of a single person in the database. • Differentially Private SQL Engine ◦ Considers all possible neighbour databases D1 and D2 that differ only records from a single person ◦ Ensures DP by adding sufficient amount of noise (scaled with ε) and thresholds 17 An illustration from “Why differential privacy is awesome” D1 D2

Differentially Private GROUP BY Aggregation • GROUP BY might output
unique key values from a small number of people ◦ In this case, query results for the neighbor databases Q(D1) and Q(D2) will be consistently different => Not differentially private. • Aggregation thresholding is required to hide small group keys ◦ Complement noise because it can’t be added for non-existent GROUP BY keys 18 Thresholding HAVING COUNT(distinct PID) > (threshold) Q(D1) Q(D2) Q(D1) and Q(D2) will be similar results = Differentially Private An example from “Almost differential privacy” (2019) Q(D1) and Q(D2) differ = Not Differentially Private

User-Contribution Bounding • Sensitivity: How sensitive the query result is
in the presence of a single person in the database? ◦ If a single user has 1,000 records in the database, the result of count(*) queries can be changed by 1,000 if the user is removed from the database. • If sensitivity is high, more noise is needed to satisfy DP • A practical approach: Take at most k-records from each user via sampling ◦ Implemented as a window function in Trino, using the memory-eﬃcient reservoir sampling algorithm. 19 An example from “Getting more useful results with differential privacy” (2021) User-Level Sampling

30+ Rules for Validation and SQL Rewrite • Adding various
rules to make query results differentially private (= almost the same) ◦ Noise, threshold, user-contribution bound, etc. ◦ Configurable with feature flags • Leveraging the powerful pattern matching in Scala ◦ Recursive tree traversal and transformation rules ◦ Spark SQL also uses pattern matching of Scala for query optimization 20

Increased Adoption of Differentially Privacy • The 2020 US Census
published the data adding noise with DP guarantee • Google ◦ Open sourced google/diﬀerential-privacy library for computing noise ◦ ZetaSQL with DP SQL support ◦ DP for COVID-19 tracing data • AWS ◦ Nov 2023: Announced diﬀerential privacy support in AWS Clean Rooms 21

Secure Exchange SQL: Summary 22 • A diﬀerentially private SQL
engine ◦ Produces almost the same query results regardless of the presence of an individual in the database • Privacy protection ◦ Tracks personal record propagation inside the SQL ◦ Anonymize aggregation query results via noise and thresholding • Minimized changes to Trino ◦ Rewrites SQL outside Trino ◦ UDFs for adding noise and sampling ◦ Table catalog: Access delegation to another account

Further Readings • Readings in Diﬀerential Privacy https://github.com/xerial/dp-readings • Topics
not addressed in this talk ◦ Sensitivity for other aggregation operators ◦ Privacy budget management ◦ Trade oﬀs with utility and privacy ◦ Gaussian noise ◦ Variants of DP ◦ Machine Learning ◦ Global vs Local DP ◦ Secure joins ◦ User-level aggregations ◦ etc. 23

Treasure Data Team: Data Clean Room Project 24

Secure Exchange SQL: Building A Privacy-Preserv...

Secure Exchange SQL: Building A Privacy-Preserving Data Clean Room Over Trino (Trino Summit 2023)

Taro L. Saito

More Decks by Taro L. Saito

Other Decks in Technology

Featured

Transcript

Secure Exchange SQL Building A Privacy Preserving Data Clean Room

A Brief History of Treasure Data and Trino • 2011

Treasure Data: Architecture • Collecting 10+ years of query logs

Trino at Treasure Data (2023) • Users ◦ 700+ accounts

New: Secure Exchange SQL (2023) • Enables privacy-preserving cross-party query

Privacy-Preserving Query Rewrite • Automatically rewrite SQL for protecting privacy

Differential Privacy • A method for quantifying the privacy with

Real-World Application of Differential Privacy • Google Maps ◦ The

Secure Exchange SQL: Internals • An SQL compiler written in

Step 3: Validate SQL Plans • Reject certain types of

Step 4: Rewrite SQL • Transform logical plans ◦ Add

A More Complex SQL Rewrite Example • e.g., HAVING count(distinct

Testing Query Rewrite • Describe test cases in YAML ◦

Secure Exchange SQL: Supported Queries • A subset of Trino

• To avoid directly sharing raw personal records ◦ For

Database Reconstruction Attack (DRA) • If rules for producing query

Differentially Private SQL Engine • Diﬀerential Privacy (DP) ◦ Formalized

Differentially Private GROUP BY Aggregation • GROUP BY might output

User-Contribution Bounding • Sensitivity: How sensitive the query result is

30+ Rules for Validation and SQL Rewrite • Adding various

Increased Adoption of Differentially Privacy • The 2020 US Census

Secure Exchange SQL: Summary 22 • A diﬀerentially private SQL

Further Readings • Readings in Diﬀerential Privacy https://github.com/xerial/dp-readings • Topics

Treasure Data Team: Data Clean Room Project 24