HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data

Fumiyuki Kato (1), Tsubasa Takahashi (2), Shun Takagi (1), Yang Cao (1), Seng Pei Liew (2), Masatoshi Yoshikawa (1)
(1) Kyoto University, (2) LINE Corporation
2022.9.7, VLDB 2022

Background

Data Exploration (DE)
§ What is data exploration? An early stage of the data mining (DM) workflow.
§ The data scientist designs the data mining workflow based on the properties of the target dataset.
§ If the data is sensitive, is this data exploration still possible?
[Figure: the data scientist first explores the target dataset with arbitrary queries to understand its basic properties, then designs the main DM pipeline that produces the final output.]

Summary: DP view for data exploration
Q. How can we construct a privacy-preserving view to explore high-dimensional (e.g., 20D) sensitive data?
[Figure: overview of the view, annotated with requirements (1) to (4).]

Requirements for DE under DP (1)(2)
§ Requirement (1): Workload independence
  § The set of issuable queries should be unlimited and not predefined.
  § (×) Workload optimization methods (e.g., HDMM [1])
§ Requirement (2): Analytical reliability
  § The scale of the error should be estimable for any counting query.
  § (×) Generative model approaches (e.g., Privbayes [2], DP-GAN [3])
[Figure: a data explorer issues unlimited queries to the view and receives answers with error information; a generative model synthesizes data with no error guarantee.]
[1] R. McKenna et al. Optimizing error of high-dimensional statistical queries under differential privacy. Proc. VLDB Endow., 11(10):1206–1219, June 2018.
[2] J. Zhang et al. Privbayes: Private data release via bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):25, 2017.
[3] J. Fan et al. Relational data synthesis using generative adversarial networks: A design space exploration. Proc. VLDB Endow., 13(12):1962–1975, July 2020.

Requirements for DE under DP (3)(4)
§ Requirement (3): Noise resistance on high-dimensional data
  § Explore high-dimensional relational data (e.g., 20 columns) with less noise.
  § (×) Existing partitioning-based methods (e.g., Privtree [4], DAWA's partition [5])
§ Requirement (4): Space-efficient view
  § The view to be explored should be space-efficient.
[4] J. Zhang et al. Privtree: A differentially private algorithm for hierarchical decompositions. Proceedings of the 2016 SIGMOD, 2016.
[5] C. Li et al. A data- and workload-aware algorithm for range queries under differential privacy. Proc. VLDB Endow., 7(5):341–352, 2014.

Preliminaries

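Partitioning approach
§ Take the relational input data as an n-dimensional histogram.
§ Partitioning merges histogram cells into blocks, and noise is added once per block instead of once per cell.
§ Perturbation Error (PE) can be reduced: a query such as SELECT COUNT … sums fewer noisy terms on the partitioned histogram, so the standard deviation of the noise in its answer is smaller.
[Figure: a 4×4 histogram over Age (~20, 20~30, 30~40, 40~50) and Salary (~10M, 20M, 30M, 40M). "Original + noise" shows the cell counts 1, 0, 5, 4, 2, 1, 8, 7, 13, 62, 64, 0, 0, 0, 1, 1, each with its own noise term. "Partitioned + noise" shows 6 blocks with counts 4, 13, 126, 0, 2, 24, each with one noise term. The example query answer is shown as 174 plus noise in both cases, with a smaller noise standard deviation after partitioning.]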
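The effect of summing fewer noisy terms can be sketched numerically. The snippet below assumes each released count receives independent Laplace noise with scale 1/ε (that is, sensitivity 1); the function name and the cell and block counts in the example are illustrative and are not the numbers on the slide.

```python
import math


def query_noise_stddev(num_noisy_terms: int, epsilon: float) -> float:
    """Standard deviation of a sum of independent Laplace(1/epsilon) noises.

    A single Laplace(b) noise has variance 2 * b**2, so the sum of k
    independent noises has standard deviation sqrt(2 * k) * b.
    """
    scale = 1.0 / epsilon
    return math.sqrt(2.0 * num_noisy_terms) * scale


# Hypothetical query that covers 16 raw cells but only 4 blocks.
per_cell = query_noise_stddev(16, epsilon=1.0)
per_block = query_noise_stddev(4, epsilon=1.0)
print(per_cell, per_block)
```

Under these assumptions the query answered from 4 blocks carries half the noise standard deviation of the same query answered from 16 individually perturbed cells.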

Partitioning as a View
§ Considering the partitioned blocks as a materialized view satisfies req. (1): the view can answer any range counting query.
§ Each block's noisy count is spread uniformly over the cells of the block, and a query is answered by summing the per-cell values it covers.
§ Example: WHERE 20≦Age≦30 AND Salary=20 => $64.5 + Z_1/4 + Z_3/2 + Z_5/4$
[Figure: the partitioned blocks over Age (~20, 20~30, 30~40, 40~50) and Salary (~10M, 20M, 30M, 40M) with noisy counts $4+Z_1$, $13+Z_2$, $126+Z_3$, $0+Z_4$, $2+Z_5$, $24+Z_6$, and the resulting per-cell values $1+Z_1/4$, $6+Z_6/4$, $13+Z_2$, $63+Z_3/2$, $0+Z_4$, $0.5+Z_5/4$.]
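A minimal sketch of answering a range counting query from such a view follows. The `Block` class, the uniform-spread rule as coded here, and the block layout are assumptions made for illustration: the layout is a hypothetical arrangement of six blocks on a 4×4 grid that reuses the block counts of the example, and noise is left out so the arithmetic is easy to follow.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Block:
    lo: tuple      # inclusive lower corner, one cell index per dimension
    hi: tuple      # inclusive upper corner
    count: float   # released count of the whole block (noise omitted here)

    def volume(self) -> int:
        """Number of cells in the block."""
        return prod(h - l + 1 for l, h in zip(self.lo, self.hi))

    def overlap(self, q_lo: tuple, q_hi: tuple) -> int:
        """Number of cells shared by this block and the query box."""
        return prod(
            max(0, min(h, qh) - max(l, ql) + 1)
            for l, h, ql, qh in zip(self.lo, self.hi, q_lo, q_hi)
        )


def answer_range_count(view, q_lo, q_hi) -> float:
    """Sum each block's count, scaled by the fraction of its cells in the query."""
    return sum(b.count * b.overlap(q_lo, q_hi) / b.volume() for b in view)


# Hypothetical layout: (row, column) cell indices on a 4x4 grid.
view = [
    Block((0, 0), (1, 1), 4.0),
    Block((0, 2), (1, 3), 24.0),
    Block((2, 0), (2, 0), 13.0),
    Block((2, 1), (2, 2), 126.0),
    Block((2, 3), (2, 3), 0.0),
    Block((3, 0), (3, 3), 2.0),
]
answer = answer_range_count(view, q_lo=(1, 1), q_hi=(3, 1))
print(answer)  # 64.5
```

The query box touches one cell of the 4-count block, one cell of the 126-count block and one cell of the 2-count block, which gives 1 + 63 + 0.5. Because the view stores only block boundaries and counts, any axis-aligned range query can be answered this way without knowing the workload in advance.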

Aggregation and Perturbation Error
§ Notation
  § $B_i$: block $i$ with $|B_i|$ elements
  § $S_i$: original summed count in $B_i$
  § $Z_i$: Laplace noise on $B_i$
  § $x$: original count of a cell
§ Every cell of block $B_i$ receives the same estimate: $\hat{x}_{B_i} := \frac{S_i + Z_i}{|B_i|}$
§ Error of a cell $x \in B_i$: $|x - \hat{x}_{B_i}| = \left| \left( x - \frac{S_i}{|B_i|} \right) - \frac{Z_i}{|B_i|} \right|$, where the first term is the Aggregation Error (AE) and the second term is the Perturbation Error (PE).
§ Partitioning causes aggregation error (AE), so after partitioning and perturbation we have AE + PE.
[Figure: the histogram before and after partitioning, with each block $B_i$ released as $S_i + Z_i$.]
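The decomposition can be checked numerically for a single block. This is a sketch under the assumption that the block's summed count receives one Laplace noise with scale 1/ε; the function name `block_errors` and the example block are illustrative.

```python
import numpy as np


def block_errors(cells, epsilon, rng):
    """Return (AE, PE, total absolute error) for one block.

    AE sums |x - mean| over the cells, PE is the magnitude of the single
    noise added to the block sum, and total is the sum over the cells of
    |x - estimate| where estimate = (block sum + noise) / block size.
    """
    cells = np.asarray(cells, dtype=float)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    estimate = (cells.sum() + noise) / cells.size
    ae = float(np.abs(cells - cells.mean()).sum())
    pe = abs(float(noise))
    total = float(np.abs(cells - estimate).sum())
    return ae, pe, total


rng = np.random.default_rng(0)
ae, pe, total = block_errors([62, 64], epsilon=1.0, rng=rng)
print(ae, pe, total)
```

By the triangle inequality the total error of the block never exceeds AE + PE, which is why the two terms can be reasoned about separately.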

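Error Optimization
§ We can formulate multi-dimensional partitioning as an optimization problem. For a partitioning $\pi$ (a set of blocks $B$), the goal is to minimize the expected total error:
  $\mathbb{E}\Big[\sum_{B \in \pi} \sum_{x \in B} |x - \hat{x}_B|\Big] \le \mathbb{E}\Big[\sum_{B \in \pi} \mathrm{AE}(B)\Big] + \mathbb{E}\Big[\sum_{B \in \pi} \mathrm{PE}(B)\Big] \le \sum_{B \in \pi} \mathrm{AE}(B) + |\pi| \cdot \mathbb{E}[|Z|]$
  where $Z$ is the noise added to one block, so the PE term grows linearly with the number of blocks $|\pi|$.
§ This is an instance of the set partitioning problem: NP-hard.
§ The search algorithm for the optimal partitioning must itself satisfy DP.
We need to consider heuristic solutions that obtain a better partitioning, and they must be simple enough to allow DP analysis.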

Proposed method

Heuristics
§ The main heuristic: splitting can reduce AE.
  § One block {64, 50, 1, 1}: mean = (64 + 50 + 1 + 1)/4 = 29, AE = |64 − 29| + |50 − 29| + |1 − 29| + |1 − 29| = 112, with 1 PE.
  § Split into {64, 50} and {1, 1}: AE = 14 and AE = 0, with 1 PE per block.
  § Split into {64}, {50} and {1, 1}: AE = 0 for every block, with 1 PE per block.
§ Difference from previous studies: decreasing the number of blocks is more important in the high-dimensional setting, because the domain of an n-dimensional histogram grows exponentially with n.
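The arithmetic of this example can be reproduced with a few lines. The helper names and the `pe_per_block` values are assumptions for illustration: `pe_per_block` stands for the expected noise magnitude of one block, which the slide only labels as "1 PE".

```python
def aggregation_error(cells):
    """Sum of absolute deviations from the block mean."""
    mean = sum(cells) / len(cells)
    return sum(abs(c - mean) for c in cells)


def total_error(blocks, pe_per_block):
    """Total AE over all blocks plus one PE unit per block."""
    return sum(aggregation_error(b) for b in blocks) + pe_per_block * len(blocks)


one_block = [[64, 50, 1, 1]]
two_blocks = [[64, 50], [1, 1]]
three_blocks = [[64], [50], [1, 1]]

print(aggregation_error([64, 50, 1, 1]))  # 112.0
print(aggregation_error([64, 50]))        # 14.0

# With small noise the finest split wins; with large noise fewer blocks win.
for pe in (1.0, 20.0):
    print(pe, [total_error(b, pe) for b in (one_block, two_blocks, three_blocks)])
```

With `pe_per_block=1.0` the three-block split has the lowest total error, while with `pe_per_block=20.0` the two-block split is better. This is the AE versus PE trade-off that makes the number of blocks matter.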

Existing Approach: Privtree [4]
§ Recursive quadtree-based splitting for efficient and fine partitioning: every block is split in a fixed way, and splitting stops when the block's count is small enough.
§ Problem: the partitioning becomes too fine-grained in high dimensions, which leads to too large PE (req. 3) and a space-inefficient view (req. 4).
§ A more careful splitting strategy is necessary.
[Figure: quadtree-based recursive splitting of a 2D histogram; blocks whose count is small enough are marked "Stop".]

HDPView (Proposed method)
§ Recursive bisection-based differentially private search algorithm
§ Recursive bisection phase: each block runs 2 mechanisms
  1. Random converge: stops the bisection of a block, depending on its AE
  2. Random cut: carefully selects only one cut point
§ Once all blocks have stopped: aggregation, then perturbation, which yields the view.
[Figure: a 2D histogram of counts that is bisected step by step into blocks.]

Random cut, Random converge
§ Random cut: effective splits that lead to small AE
  § Only one cut point is probabilistically selected from all possible points using the exponential mechanism.
  § A cut point with a smaller total AE after bisection is more likely to be sampled.
§ Random converge: saves unnecessary splits
  § Stop the recursive bisection if the Laplace-noised AE of block B is small enough.
  § The threshold is the standard deviation of the noise: if the AE is below it, any further bisection worsens AE + PE.
§ We analyze the DP guarantee of the final view and an approximate error for any query on the view (req. 2); see our paper for details.
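The two mechanisms can be sketched on a one-dimensional block as follows. This is a simplified illustration and not the authors' implementation: the scoring function, the `sensitivity` placeholder value, the `threshold` parameter and the way ε is reused at every recursion level are all assumptions, and the sketch does not perform the privacy budget accounting that the paper's analysis requires.

```python
import numpy as np


def aggregation_error(cells):
    """Sum of absolute deviations from the block mean."""
    return float(np.abs(cells - cells.mean()).sum())


def random_converge(cells, threshold, epsilon, rng, sensitivity=2.0):
    """Stop if the Laplace-noised AE is at most the threshold (assumed rule)."""
    noisy_ae = aggregation_error(cells) + rng.laplace(scale=sensitivity / epsilon)
    return bool(noisy_ae <= threshold)


def random_cut(cells, epsilon, rng, sensitivity=2.0):
    """Sample one cut index with the exponential mechanism.

    The score of cut k is minus the total AE of cells[:k] and cells[k:],
    and a cut is drawn with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).
    """
    cuts = np.arange(1, len(cells))
    scores = np.array(
        [-(aggregation_error(cells[:k]) + aggregation_error(cells[k:])) for k in cuts]
    )
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()  # numerical stability before exponentiating
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(cuts, p=probs))


def recursive_bisection(cells, threshold, epsilon, rng):
    """Bisect until every block converges; returns the list of blocks."""
    if len(cells) == 1 or random_converge(cells, threshold, epsilon, rng):
        return [cells]
    k = random_cut(cells, epsilon, rng)
    return (recursive_bisection(cells[:k], threshold, epsilon, rng)
            + recursive_bisection(cells[k:], threshold, epsilon, rng))


rng = np.random.default_rng(0)
data = np.array([64.0, 50.0, 1.0, 1.0])
# A very large epsilon makes the run almost noise-free, for illustration only.
blocks = recursive_bisection(data, threshold=5.0, epsilon=1000.0, rng=rng)
print([b.tolist() for b in blocks])
```

In this nearly noise-free run the low-count cells stay merged in one block while the two high-count cells are separated. With a realistic ε both the cut choice and the stopping decision become randomized.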

Experiments

Experimental Settings
§ Task: 8 workloads of range counting queries
  § 1-way All range, {2,3}-way Marginal, {2,3}D-Prefix, {2,3,4}D-Random range
§ Metric: RMSE (root mean squared error)
§ Competitors
  § Identity, HDMM, Privtree, Privbayes
§ Dataset: 8 real-world datasets
[Table: the datasets and their dimensions.]
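For reference, the RMSE of a workload can be computed as below. The function name and the toy answers are illustrative and are not taken from the experiments.

```python
import math


def rmse(true_answers, noisy_answers):
    """Root mean squared error between exact and private query answers."""
    if len(true_answers) != len(noisy_answers):
        raise ValueError("both answer lists must have the same length")
    squared = [(t - n) ** 2 for t, n in zip(true_answers, noisy_answers)]
    return math.sqrt(sum(squared) / len(squared))


# Toy workload of three counting queries.
error = rmse([10, 20, 30], [12, 20, 26])
print(error)
```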

§ Average relative RMSE (ARR) over all 8 workloads and 8 datasets: noise resistance on high-dimensional data (req. 3)
  § Identity: 1.94×10^[exponent illegible]
  § Privtree: 7.05
  § HDMM: 35.34
  § Privbayes: 3.79
  § HDPView (ours): [value illegible]
[Figure: relative RMSE on an example Prefix-3D query for a 22D and a 15D dataset.]

Space efficiency (req. 4)
§ Comparison with Privtree: the number of blocks is small
§ Also compared with naive Laplace noise over the whole domain

Conclusion
§ Q: How can we construct a privacy-preserving view to explore a high-dimensional sensitive dataset?
§ We propose HDPView, which creates a high-dimensional differentially private view by properly balancing AE and PE for high-dimensional data.
§ The experiments show HDPView's noise resistance and space efficiency compared to existing work.

Appendix

Differential Privacy (DP)
§ What is differential privacy? A mathematical definition of privacy.
§ An algorithm $\mathcal{M}$ provides $\varepsilon$-differential privacy if, for all neighboring databases $D$ and $D'$ (which differ by only one record) and every set of outputs $S$, it satisfies:
  $\frac{\Pr[\mathcal{M}(D) \in S]}{\Pr[\mathcal{M}(D') \in S]} \le \exp(\varepsilon)$
§ $\varepsilon$ is the privacy parameter.
§ We can analyze the output distribution of multiple algorithms in a composable way (composition theorem).
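As a concrete instance of the definition, the sketch below applies the Laplace mechanism to a counting query and checks the density ratio bound numerically. It assumes that neighboring databases change the count by at most 1 (sensitivity 1); the function names are illustrative.

```python
import math

import numpy as np


def laplace_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1/epsilon (sensitivity 1)."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)


def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution."""
    return np.exp(-np.abs(x - mean) / scale) / (2.0 * scale)


def sequential_composition(epsilons):
    """Basic sequential composition: the privacy parameters add up."""
    return sum(epsilons)


epsilon = 0.5
outputs = np.linspace(-20.0, 40.0, 601)
# Neighboring databases: the true count is 10 on D and 11 on D'.
ratio = laplace_pdf(outputs, 10.0, 1.0 / epsilon) / laplace_pdf(outputs, 11.0, 1.0 / epsilon)
print(ratio.max(), math.exp(epsilon))

rng = np.random.default_rng(0)
print(laplace_count(10, epsilon, rng))
print(sequential_composition([0.1, 0.4]))
```

The largest density ratio over the grid matches exp(ε) up to rounding, which is the bound required by the definition for this mechanism.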

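Contribution (✔ = requirement satisfied)
§ (1) Workload independence: Privbayes [2] / DP-GAN [3] ✔, Identity ✔, Privtree [4] / DAWA [5] ✔, HDPView (ours) ✔
§ (2) Analytical reliability: HDMM [1] ✔, Identity ✔, Privtree [4] / DAWA [5] ✔, HDPView (ours) ✔
§ (3) Noise resistance on high-dimensional data: HDMM [1] ✔, Privbayes [2] / DP-GAN [3] ✔, Privtree [4] / DAWA [5] work only for low dimension (1 or 2), HDPView (ours) ✔
§ (4) Space efficiency: HDMM [1] ✔, Privbayes [2] / DP-GAN [3] ✔, HDPView (ours) ✔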

Characteristics
§ Flexible block partitioning
§ Better convergence
  § By carefully cutting one point at a time, the number of generated blocks can be reduced, which reduces PE and creates a space-efficient view.
  § Histograms of real datasets are very sparse and require few blocks.
§ Dimensional scalability
  § Similar to Mondrian [6] for k-anonymity
  § The algorithm's execution is not affected much by the increase in dimensionality.
[Figure: visualization on 2D data; the left panel is HDPView (ours), the other panel is labeled [4].]
[6] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. 22nd International Conference on Data Engineering (ICDE'06). IEEE, 2006.

Range counting queries (1,2-D)

Range counting queries (3,4-D)

Comparison with round robin Privtree

Effects of dimensionality
§ Changes in performance when adding attributes to Adult one by one, for HDPView, Privbayes, and HDMM
§ HDPView's performance degrades slowly but steadily with increasing dimensionality, while Privbayes' performance rather improves.
[Figure: panels for Privbayes and HDPView.]
