HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data

Slide 1

Slide 1 text

HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data
Fumiyuki Kato¹, Tsubasa Takahashi², Shun Takagi¹, Yang Cao¹, Seng Pei Liew², Masatoshi Yoshikawa¹
¹ Kyoto University, ² LINE Corporation
2022.9.7, VLDB 2022

Slide 2

Slide 2 text

Background

Slide 3

Slide 3 text

Data Exploration (DE)
§ What is data exploration?
– An early stage of the data mining workflow
§ The data scientist designs a data mining workflow based on the properties of the target dataset
§ If the data is sensitive, is this data exploration possible?
[Figure: Data Scientist → Data Exploration (try to understand basic properties of the target dataset with any queries) → Design main DM pipeline → Final output]

Slide 4

Slide 4 text

Summary: DP view for data exploration
Q. How can we construct a privacy-preserving view to explore high-dimensional (e.g., 20D) sensitive data?
[Figure: overview with labels (1), (2), (3), (4)]

Slide 5

Slide 5 text

Requirements for DE under DP (1)(2)
§ Requirement (1): Workload independence
– The set of issuable queries should be unlimited and not pre-defined
– (×) Workload optimization methods (e.g., HDMM [1])
§ Requirement (2): Analytical reliability
– The scale of the error should be estimable for any counting query
– (×) Generative model approaches (e.g., Privbayes [2], DP-GAN [3])
[Figure: a data explorer issues unlimited queries to the view and receives answers with error information; data synthesized by a generative model carries no error guarantee]
[1] R. McKenna et al. Optimizing error of high-dimensional statistical queries under differential privacy. Proc. VLDB Endow., 11(10):1206–1219, June 2018.
[2] J. Zhang et al. Privbayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):25, 2017.
[3] J. Fan et al. Relational data synthesis using generative adversarial networks: A design space exploration. Proc. VLDB Endow., 13(12):1962–1975, July 2020.

Slide 6

Slide 6 text

Requirements for DE under DP (3)(4)
§ Requirement (3): Noise resistance on high-dimensional data
– Explore high-dimensional relational data (e.g., 20 columns) with less noise
– (×) Existing partitioning-based methods (e.g., Privtree [4], DAWA's partition [5])
§ Requirement (4): Space-efficient view
– The view to be explored should be space-efficient
[4] J. Zhang et al. Privtree: A differentially private algorithm for hierarchical decompositions. Proceedings of the 2016 SIGMOD. 2016.
[5] C. Li et al. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.

Slide 7

Slide 7 text

Preliminaries

Slide 8

Slide 8 text

Partitioning approach
§ Take relational input data as an n-dimensional histogram
[Figure: a 2D histogram over Age (~20, 20~30, 30~40, 40~50) and Salary (~10M, 20M, 30M, 40M). Original + noise: the 16 cells (counts 1, 0, 5, 4, 2, 1, 8, 7, 13, 62, 64, 0, 0, 0, 1, 1) each carry their own noise term $z_1, \dots, z_{16}$. Partitioned + noise: after partitioning, the 6 blocks hold $4+z_1$, $13+z_3$, $126+z_4$, $0+z_5$, $2+z_6$, $24+z_2$. For a query SELECT COUNT …, the answer is $174+Z$ on the original histogram and $174+Z'$ on the partitioned one, where $Z'$ has the smaller standard deviation.]
Perturbation Error (PE) can be reduced
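The reduction in perturbation error can be checked with a short calculation. This is a minimal sketch, assuming each cell or block receives independent Laplace noise of scale 1/ε; the scale, the value ε = 1, and the helper name `sum_noise_stddev` are assumptions for illustration and are not taken from the slides.

```python
import math

def sum_noise_stddev(num_noisy_terms: int, epsilon: float) -> float:
    """Stddev of a sum of independent Laplace(1/epsilon) noises.

    A single Laplace(b) noise has variance 2 * b^2, and the variances
    of independent noises add up.
    """
    scale = 1.0 / epsilon
    return math.sqrt(num_noisy_terms * 2.0 * scale ** 2)

# A query covering all 16 noisy cells of the original histogram,
# versus the same query answered from 6 noisy blocks.
original = sum_noise_stddev(16, epsilon=1.0)
partitioned = sum_noise_stddev(6, epsilon=1.0)
print(original, partitioned)
```

Under these assumptions the noise on the answer shrinks with the square root of the number of noisy terms that the query touches.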

Slide 9

Slide 9 text

Partitioning as a View
§ Considering partitioned blocks as a materialized view, which can satisfy req. (1)
[Figure: partitioned blocks over Age (~20, 20~30, 30~40, 40~50) and Salary (~10M, 20M, 30M, 40M) with noisy counts $4+z_1$, $13+z_2$, $126+z_3$, $0+z_4$, $2+z_5$, $24+z_6$. Each block's noisy count is spread uniformly over its cells, e.g., $1+\frac{z_1}{4}$, $6+\frac{z_6}{4}$, $63+\frac{z_3}{2}$, $0.5+\frac{z_5}{4}$.]
WHERE 20≦Age≦30 AND Salary=20 => $64.5 + \frac{z_1}{4} + \frac{z_3}{2} + \frac{z_5}{4}$
req. (1): Can answer any range counting queries
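How a view of blocks answers an arbitrary range counting query can be sketched as follows. This is an illustrative sketch only: the `Block` class, the helper names, the block layout, and the stored counts are all hypothetical, and the only idea taken from the slide is that each block's noisy count is spread uniformly over its cells.

```python
from dataclasses import dataclass

@dataclass
class Block:
    lo: tuple           # inclusive lower corner, one cell index per dimension
    hi: tuple           # inclusive upper corner
    noisy_count: float  # block sum after perturbation (hypothetical values below)

    def volume(self) -> int:
        v = 1
        for l, h in zip(self.lo, self.hi):
            v *= h - l + 1
        return v

def overlap(block: Block, q_lo: tuple, q_hi: tuple) -> int:
    """Number of cells shared by the block and the query range."""
    v = 1
    for l, h, ql, qh in zip(block.lo, block.hi, q_lo, q_hi):
        width = min(h, qh) - max(l, ql) + 1
        if width <= 0:
            return 0
        v *= width
    return v

def range_count(view, q_lo: tuple, q_hi: tuple) -> float:
    """Each block contributes its noisy count in proportion to the overlap."""
    return sum(b.noisy_count * overlap(b, q_lo, q_hi) / b.volume() for b in view)

# A hypothetical 4x4 domain split into four 2x2 blocks.
view = [
    Block((0, 0), (1, 1), 4.0),
    Block((0, 2), (1, 3), 24.0),
    Block((2, 0), (3, 1), 126.0),
    Block((2, 2), (3, 3), 2.0),
]
answer = range_count(view, (1, 1), (2, 2))  # touches one cell of every block
print(answer)
```

Because the view stores only block corners and one noisy count per block, any range query can be answered after the fact without spending more privacy budget.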

Slide 10

Slide 10 text

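Aggregation and Perturbation Error
§ Partitioning causes aggregation error (AE), so after partition and perturbation, we have AE + PE
$B_i$: Block $i$ with $|B_i|$ elements
$S_i$: Original summed count in $B_i$
$z_i$: Laplace noise on $B_i$
$x$: Original count
$\hat{x}_{B_i} := \frac{1}{|B_i|}(S_i + z_i)$
$|x - \hat{x}_{B_i}| = \left| \left( x - \frac{S_i}{|B_i|} \right) - \frac{z_i}{|B_i|} \right|$
The first term is the Aggregation Error (AE) and the second term is the Perturbation Error (PE).
[Figure: Before: cells with original counts $x$ (0, 5, 4, 2, 1, 7, 13, 62, 64, 0, 0, 0, 1, 1). After: blocks with noisy sums $S_i + z_i$ ($24+z_1$, $13+z_2$, $126+z_3$, $13+z_4$, $2+z_5$), where every cell of block $B_i$ is estimated by $\hat{x}_{B_i}$.]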
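The decomposition into AE and PE can be verified numerically on one block. This is a small sketch, assuming a hypothetical block of four cells and a fixed noise value in place of a random Laplace draw; the function name `block_errors` is illustrative.

```python
def block_errors(counts, noise):
    """Return (AE, PE, total error) summed over the cells of one block."""
    size = len(counts)
    block_sum = sum(counts)
    estimate = (block_sum + noise) / size        # per-cell estimate after perturbation
    ae = sum(abs(x - block_sum / size) for x in counts)
    pe = size * abs(noise / size)                # every cell carries noise / |B|
    total = sum(abs(x - estimate) for x in counts)
    return ae, pe, total

# Hypothetical block with counts 0, 5, 4, 2 and a noise value of 1.0.
ae, pe, total = block_errors([0, 5, 4, 2], noise=1.0)
print(ae, pe, total)
```

By the triangle inequality the total error never exceeds AE + PE, which is what makes the sum a usable objective for choosing a partitioning.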

Slide 11

Slide 11 text

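Error Optimization
§ We can formulate multi-dimensional partitioning as an optimization problem
$\mathbb{E}\left[ \sum_{B_i \in \pi} \sum_{x \in B_i} |x - \hat{x}_{B_i}| \right] \le \mathbb{E}\left[ \sum_{B_i \in \pi} AE(B_i) \right] + \mathbb{E}\left[ \sum_{B_i \in \pi} PE(B_i) \right] \le \sum_{B_i \in \pi} AE(B_i) + |\pi| \cdot \frac{1}{\epsilon}$
where $\pi$ denotes the partitioning (the set of blocks).
Goal: Minimization
§ An instance of the set partitioning problem: NP-hard
§ The search algorithm for the optimal partitioning must satisfy DP
We need to consider heuristic solutions to obtain a better partitioning, and they must be simple enough to allow DP analysis.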

Slide 12

Slide 12 text

Proposed method

Slide 13

Slide 13 text

Heuristics
§ One main heuristic is that splitting can reduce AE
[Figure: a block with counts 64, 50, 1, 1 is split step by step. Unsplit: mean = (64 + 50 + 1 + 1)/4 = 29, AE = |64 − 29| + |50 − 29| + |1 − 29| + |1 − 29| = 112. Split into {64, 50} and {1, 1}: AE = 14 and AE = 0. Split further into {64} and {50}: AE = 0 each. Every block incurs one PE.]
Differences from previous studies
§ Decreasing the number of blocks is more important in the high-dimensional setting
§ The domain of an n-dimensional histogram increases exponentially
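The arithmetic of this example can be reproduced in a few lines. The sketch below recomputes the total AE and the number of blocks for the counts 64, 50, 1, 1; the helper names are illustrative.

```python
def aggregation_error(counts):
    """Sum of absolute deviations of each cell from the block mean."""
    mean = sum(counts) / len(counts)
    return sum(abs(x - mean) for x in counts)

def summarize(blocks):
    """Return (number of blocks, total AE). Each block costs one PE."""
    return len(blocks), sum(aggregation_error(b) for b in blocks)

no_cut = summarize([[64, 50, 1, 1]])
one_cut = summarize([[64, 50], [1, 1]])
two_cuts = summarize([[64], [50], [1, 1]])
print(no_cut, one_cut, two_cuts)
```

Each cut lowers the total AE but adds one more noisy block, so the last cut buys 14 units of AE for one extra PE while the block {1, 1} gains nothing from being split.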

Slide 14

Slide 14 text

Existing Approach: Privtree [4]
§ Recursive splitting by Quadtree for efficient and fine partitioning
– Quadtree-based recursive splitting until the count is small enough
– Split in a fixed way; stop if the count is small enough
Problem: too fine-grained in high dimensions → too large PE (3) and space-inefficient (4)
More careful splitting strategy is necessary
[Figure: quadtree splitting of a 2D domain in which each region stops once its count is small enough]

Slide 15

Slide 15 text

HDPView (Proposed method)
§ Recursive bisection-based differentially private search algorithm
Recursive bisection phase: each block runs 2 mechanisms
1. Random converge: stop bisection by the Random converge mechanism, which depends on AE
2. Random cut: carefully select only one cut point by the Random cut mechanism
All blocks stop → Aggregation → Perturbation → Got View
[Figure: a 2D histogram of counts is recursively bisected, one cut at a time, until every block stops]

Slide 16

Slide 16 text

Random cut, Random converge
§ Random cut
– Only one cut point is probabilistically selected from all possible points using the exponential mechanism
– A cut with a smaller total AE after bisection is more likely to be sampled
– Effective split for small AE
§ Random converge
– Stop if the Laplace-noised AE of block B is small enough
– The threshold is the stddev of the noise: if AE is below it, AE + PE will worsen by any bisection, so the recursive bisection stops
– Saves unnecessary splits
§ We analyze a DP guarantee on the final view and an approximated error for any query on the view (req. 2). See our paper for details.
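The two mechanisms can be illustrated on a one-dimensional histogram. This is a simplified sketch and not the authors' implementation: the score function (negative total AE after the cut), the `sensitivity` value, the `threshold`, and the budget parameters `eps_converge` and `eps_cut` are assumptions chosen for illustration, and privacy budget accounting across recursion levels is deliberately left out.

```python
import math
import random

def aggregation_error(counts):
    mean = sum(counts) / len(counts)
    return sum(abs(x - mean) for x in counts)

def laplace(scale, rng):
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def random_converge(counts, threshold, eps, sensitivity, rng):
    """Stop when the Laplace-noised AE is at or below the threshold."""
    if len(counts) == 1:
        return True  # nothing left to cut
    noisy_ae = aggregation_error(counts) + laplace(sensitivity / eps, rng)
    return noisy_ae <= threshold

def random_cut(counts, eps, sensitivity, rng):
    """Exponential mechanism over cut points; score = -(AE left + AE right)."""
    cuts = list(range(1, len(counts)))
    scores = [-(aggregation_error(counts[:c]) + aggregation_error(counts[c:]))
              for c in cuts]
    best = max(scores)  # shift scores so the largest weight is exp(0)
    weights = [math.exp(eps * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(cuts, weights=weights, k=1)[0]

def bisect_blocks(counts, threshold, eps_converge, eps_cut, sensitivity, rng):
    """Recursively bisect until every block converges."""
    if random_converge(counts, threshold, eps_converge, sensitivity, rng):
        return [counts]
    c = random_cut(counts, eps_cut, sensitivity, rng)
    left = bisect_blocks(counts[:c], threshold, eps_converge, eps_cut, sensitivity, rng)
    right = bisect_blocks(counts[c:], threshold, eps_converge, eps_cut, sensitivity, rng)
    return left + right

rng = random.Random(0)
histogram = [64, 50, 1, 1, 0, 0, 2, 1]  # hypothetical 1D counts
blocks = bisect_blocks(histogram, threshold=1.0, eps_converge=1.0,
                       eps_cut=1.0, sensitivity=2.0, rng=rng)
print(blocks)
```

The resulting blocks always tile the input in order; which cuts are chosen depends on the random draws, so a different seed generally yields a different partitioning.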

Slide 17

Slide 17 text

Experiments

Slide 18

Slide 18 text

Experimental Settings
§ Task: 8 workloads of range counting queries
– 1-way All range, {2,3}-way Marginal, {2,3}D-Prefix, {2,3,4}D-Random range
§ Metric: RMSE (root mean squared error)
§ Competitors
– Identity, HDMM, Privtree, Privbayes
§ Dataset: 8 real-world datasets
[Table: datasets and their dimensions]
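For reference, the RMSE metric over a workload of counting queries can be computed as in the sketch below. The true and noisy answers are made-up numbers, not results from the experiments.

```python
import math

def rmse(true_answers, noisy_answers):
    """Root mean squared error between true and noisy query answers."""
    squared = [(t - n) ** 2 for t, n in zip(true_answers, noisy_answers)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical answers to three counting queries.
error = rmse([10, 20, 30], [12, 18, 33])
print(error)
```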

Slide 19

Slide 19 text

Noise resistance on HD data (req. 3)
§ Average relative RMSE (ARR) over all 8 workloads and 8 datasets
Method            ARR
Identity          1.94×10'
Privtree          7.05
HDMM              35.34
Privbayes         3.79
HDPView (ours)    +. ,,
[Figure: relative RMSE for an example Prefix-3D query, with panels labeled (22D) and (15D)]

Slide 20

Slide 20 text

Space-efficient (req. 4)
§ Comparison with Privtree
– #blocks is small
§ Comparison with naïve Laplace over the whole domain

Slide 21

Slide 21 text

Conclusion
§ Q: How can we construct a privacy-preserving view to explore a high-dimensional sensitive dataset?
§ We propose HDPView, which creates a differentially private view by properly balancing AE and PE for high-dimensional data
§ The experiments show HDPView's noise resistance and space efficiency compared to existing works

Slide 22

Slide 22 text

Appendix

Slide 23

Slide 23 text

Differential Privacy (DP)
§ What is Differential Privacy?
– A mathematical privacy definition
§ Algorithm $M$ provides $\epsilon$-differential privacy if, for any neighboring databases $D$ and $D'$ (differing by only one record) and any set of outputs $S$, it satisfies:
$\frac{\Pr[M(D) \in S]}{\Pr[M(D') \in S]} \le \exp(\epsilon)$
where $\epsilon$ is the privacy parameter.
We can analyze the output distribution over multiple algorithms in a composable way (Composition Theorem)
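A standard way to satisfy this definition for a counting query is the Laplace mechanism. The sketch below assumes a sensitivity of 1, meaning that one record changes the count by at most 1. The function names and the neighboring counts 100 and 101 are illustrative. The final lines check numerically that the ratio of the two output densities stays within exp(ε).

```python
import math
import random

def laplace_mechanism(true_count, epsilon, sensitivity=1.0, rng=None):
    """Return the count plus Laplace(sensitivity / epsilon) noise."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

def laplace_density(x, center, scale):
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)

epsilon = 1.0
noisy = laplace_mechanism(100, epsilon, rng=random.Random(42))

# Neighboring databases give counts 100 and 101. Compare the output
# densities on a grid of possible outputs between 90.0 and 109.9.
worst_ratio = max(
    laplace_density(x / 10.0, 100, 1.0 / epsilon)
    / laplace_density(x / 10.0, 101, 1.0 / epsilon)
    for x in range(900, 1100)
)
print(noisy, worst_ratio)
```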

Slide 24

Slide 24 text

Contribution
Method                       (1)   (2)   (3)                                     (4)
HDMM [1]                     ×     ✔     ✔                                       ✔
Privbayes [2], DP-GAN [3]    ✔     ×     ✔                                       ✔
Identity                     ✔     ✔     ×                                       ×
Privtree [4], DAWA [5]       ✔     ✔     Works only for low dimension (1 or 2)   ×
HDPView (Ours)               ✔     ✔     ✔                                       ✔
(1) Workload independence, (2) Analytical reliability, (3) Noise resistance on high-dimensional data, (4) Space efficiency

Slide 25

Slide 25 text

Characteristics
§ Flexible block partitioning
§ Better convergence
– By carefully cutting one point at a time, the number of generated blocks can be reduced, which reduces PE and creates a space-efficient view
– Histograms of real datasets are very sparse and require few blocks
§ Dimensional scalability
– Similar to Mondrian [6] for k-anonymity
– The algorithm execution is not affected much by the increase in dimensionality
[Figure: Visualization on 2D data (left: ours; right: [4])]
[6] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. 22nd International Conference on Data Engineering (ICDE'06). IEEE, 2006.

Slide 26

Slide 26 text

Range counting queries (1,2-D)

Slide 27

Slide 27 text

Range counting queries (3,4-D)

Slide 28

Slide 28 text

Comparison with round robin Privtree

Slide 29

Slide 29 text

Effects of dimensionality
§ Changes in performance when adding attributes to Adult one by one, for HDPView, Privbayes, and HDMM
§ HDPView's performance degrades slowly but surely with increasing dimensionality, while Privbayes' performance rather improves
[Figure: panels for Privbayes and HDPView]