HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data

Fumiyuki Kato (1), Tsubasa Takahashi (2), Shun Takagi (1), Yang Cao (1),
Seng Pei Liew (2), Masatoshi Yoshikawa (1)
(1) Kyoto University, (2) LINE Corporation

2022.9.7 VLDB 2022

LINE Developers

October 05, 2022

Transcript

  1. HDPView: Differentially Private Materialized View for Exploring High Dimensional Relational Data

    Fumiyuki Kato (1), Tsubasa Takahashi (2), Shun Takagi (1), Yang Cao (1), Seng Pei Liew (2), Masatoshi Yoshikawa (1)
    (1) Kyoto University, (2) LINE Corporation
    2022.9.7 VLDB 2022
  2. Data Exploration (DE)

    - What is data exploration? An early stage of the data mining workflow.
    - A data scientist designs the data mining workflow based on the properties of the target dataset.
    - If the data is sensitive, is this data exploration possible?
    [Figure: a data scientist first tries to understand the basic properties of the target dataset with arbitrary queries (data exploration), then designs the main DM pipeline that produces the final output.]
  3. Summary: DP view for data exploration

    Q. How can we construct a privacy-preserving view to explore high-dimensional (e.g., 20D) sensitive data?
    [Figure: overview annotated with requirements (1) to (4).]
  4. Requirements for DE under DP (1)(2)

    - Requirement (1): Workload independence
      - The set of issuable queries should be unlimited and not pre-defined.
      - (×) Workload optimization methods (e.g., HDMM [1])
    - Requirement (2): Analytical reliability
      - The scale of the error should be estimable for any counting query.
      - (×) Generative model approaches (e.g., Privbayes [2], DP-GAN [3])
    [Figure: a data explorer issues unlimited queries to the view and receives answers with error information; a generative model synthesizes data with no error guarantee.]

    [1] R. McKenna, et al. Optimizing error of high-dimensional statistical queries under differential privacy. Proc. VLDB Endow., 11(10):1206–1219, June 2018.
    [2] J. Zhang, et al. Privbayes: Private data release via Bayesian networks. ACM Transactions on Database Systems (TODS), 42(4):25, 2017.
    [3] J. Fan, et al. Relational data synthesis using generative adversarial networks: A design space exploration. Proc. VLDB Endow., 13(12):1962–1975, July 2020.
  5. Requirements for DE under DP (3)(4)

    - Requirement (3): Noise resistance on high-dimensional data
      - Explore high-dimensional relational data with less noise.
      - (×) Existing partitioning-based methods (e.g., Privtree [4], DAWA's partition [5])
    - Requirement (4): Space-efficient view
      - The view to be explored should be space-efficient.
    [Figure: a wide relational table, e.g., 20 columns.]

    [4] J. Zhang, et al. Privtree: A differentially private algorithm for hierarchical decompositions. Proceedings of the 2016 SIGMOD, 2016.
    [5] C. Li, et al. A data- and workload-aware algorithm for range queries under differential privacy. Proceedings of the VLDB Endowment, 7(5):341–352, 2014.
  6. Partitioning approach

    - Treat the relational input data as an n-dimensional histogram.
    [Figure: a 2D histogram over Age (~20, 20~30, 30~40, 40~50) and Salary (~10M, 20M, 30M, 40M). "Original + noise" adds a separate noise term to each of the 16 cell counts; "Partitioned + noise" merges the cells into 6 blocks with counts 4, 24, 13, 126, 0 and 2, each with its own noise term. For a query SELECT COUNT …, the original histogram returns 174 + Z and the partitioned one returns 174 + Z′, where Z′ has the smaller standard deviation.]
    - Perturbation Error (PE) can be reduced by partitioning.
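
The noise reduction on this slide can be illustrated with a small calculation. The sketch below assumes independent Laplace noise with scale 1/ε on every cell or block and a query that covers the whole 4×4 domain; the value of `epsilon` is hypothetical and not taken from the slides.

```python
import math


def laplace_std(scale: float) -> float:
    """Standard deviation of a Laplace(0, scale) variable: sqrt(2) * scale."""
    return math.sqrt(2.0) * scale


def query_noise_std(num_noisy_terms: int, epsilon: float) -> float:
    """Stddev of a sum of independent Laplace(1/epsilon) noise terms.

    Variances of independent terms add, so k terms give variance
    k * 2 / epsilon**2.
    """
    return math.sqrt(num_noisy_terms) * laplace_std(1.0 / epsilon)


epsilon = 1.0  # hypothetical privacy parameter, for illustration only
std_cells = query_noise_std(16, epsilon)   # query sums 16 noisy cells
std_blocks = query_noise_std(6, epsilon)   # same query sums 6 noisy blocks
print(f"stddev over cells:  {std_cells:.3f}")
print(f"stddev over blocks: {std_blocks:.3f}")
```

Fewer noisy terms in the sum give a smaller standard deviation, which is the effect the figure labels as reduced PE.
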
  7. Partitioning as a View

    - Treat the partitioned blocks as a materialized view, which can satisfy req. (1).
    [Figure: the six noisy blocks $4 + z_1$, $13 + z_2$, $126 + z_3$, $0 + z_4$, $2 + z_5$ and $24 + z_6$ over Age × Salary. Spreading each block uniformly over its cells gives per-cell values $1 + z_1/4$, $13 + z_2$, $63 + z_3/2$, $0 + z_4$, $0.5 + z_5/4$ and $6 + z_6/4$.]
    - Example: WHERE 20≦Age≦30 AND Salary=20 => $64.5 + z_1/4 + z_3/2 + z_5/4$
    - req. (1): the view can answer any range counting query.
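
A minimal sketch of how such a view could answer a range counting query, assuming each block's count is spread uniformly over its cells. The block layout below is hypothetical (the exact geometry of the figure is not recoverable from the transcript), and noise-free counts are used so the result is reproducible; `answer_count` and `overlap_fraction` are illustrative names.

```python
from typing import List, Tuple

Range = Tuple[int, int]            # half-open [lo, hi) over cell indices
Block = Tuple[List[Range], float]  # (one range per dimension, block count)


def overlap_fraction(block_ranges: List[Range], query: List[Range]) -> float:
    """Fraction of the block's cells that fall inside the query box."""
    frac = 1.0
    for (b_lo, b_hi), (q_lo, q_hi) in zip(block_ranges, query):
        overlap = max(0, min(b_hi, q_hi) - max(b_lo, q_lo))
        frac *= overlap / (b_hi - b_lo)
    return frac


def answer_count(view: List[Block], query: List[Range]) -> float:
    """Uniformity assumption: each block contributes count * overlap fraction."""
    return sum(count * overlap_fraction(ranges, query) for ranges, count in view)


# Hypothetical layout of six blocks on a 4x4 (Age index, Salary index) grid.
view: List[Block] = [
    ([(0, 2), (0, 2)], 4),    # 4 cells, 1 each
    ([(0, 2), (2, 4)], 24),   # 4 cells, 6 each
    ([(2, 3), (0, 1)], 13),   # 1 cell
    ([(2, 3), (1, 3)], 126),  # 2 cells, 63 each
    ([(2, 3), (3, 4)], 0),    # 1 cell
    ([(3, 4), (0, 4)], 2),    # 4 cells, 0.5 each
]

query = [(1, 4), (1, 2)]  # Age index 1..3, Salary index 1
print(answer_count(view, query))  # 1 + 63 + 0.5 = 64.5
```

Because any axis-aligned query box can be intersected with the blocks this way, the view does not depend on a pre-defined workload.
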
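  8. Aggregation and Perturbation Error

    - Notation:
      - $B_i$: block $i$ with $|B_i|$ elements
      - $S_i$: original summed count in $B_i$
      - $z_i$: Laplace noise on $B_i$
      - $x$: original count of a cell
      - $\hat{x}_{B_i} := \frac{1}{|B_i|}(S_i + z_i)$: value assigned to every cell of $B_i$
    - Error of $x$: $|x - \hat{x}_{B_i}| = \left| \left( x - \frac{S_i}{|B_i|} \right) - \frac{z_i}{|B_i|} \right|$, where the first term is the Aggregation Error (AE) and the second is the Perturbation Error (PE).
    - Partitioning causes aggregation error (AE), so after partitioning and perturbation the error is AE + PE.
    [Figure: "Before" shows the original cell counts; "After" shows the noisy blocks $24 + z_1$, $13 + z_2$, $126 + z_3$, $13 + z_4$ and $2 + z_5$, with every cell of a block replaced by $\hat{x}_{B_i}$.]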
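  9. Error Optimization

    - We can formulate multi-dimensional partitioning as an optimization problem. With $\pi$ denoting the set of blocks and $z$ the Laplace noise added to one block:
      $\mathbb{E}\left[ \sum_{B_i \in \pi} \sum_{x \in B_i} |x - \hat{x}_{B_i}| \right] \le \mathbb{E}\left[ \sum_{B_i \in \pi} \mathrm{AE}(B_i) \right] + \mathbb{E}\left[ \sum_{B_i \in \pi} \mathrm{PE}(B_i) \right] \le \sum_{B_i \in \pi} \mathrm{AE}(B_i) + |\pi| \cdot \mathbb{E}[|z|]$
    - Goal: minimize this bound.
    - This is an instance of the set partitioning problem, which is NP-hard.
    - The search algorithm for the optimal partitioning must itself satisfy DP.
    - We therefore need heuristic solutions that obtain a better partitioning and are simple enough to allow DP analysis.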
  10. Heuristics

    - One main heuristic is that splitting can reduce AE.
    - Differences from previous studies:
      - Decreasing the number of blocks is more important in the high-dimensional setting.
      - The domain of an n-dimensional histogram increases exponentially.
    [Figure: worked example on the cells 64, 50, 1, 1. As a single block, mean = (64 + 50 + 1 + 1)/4 = 29 and AE = |64 − 29| + |50 − 29| + |1 − 29| + |1 − 29| = 112. After splitting, the resulting blocks have AE = 14, 0, 0 and 0. Every block incurs 1 PE.]
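
The arithmetic in this example can be checked with a few lines of Python. The sketch takes AE of a block to be the sum of absolute deviations from the block mean, which matches the single-block calculation on the slide; the function name `aggregation_error` is illustrative.

```python
from typing import Sequence


def aggregation_error(block: Sequence[float]) -> float:
    """Sum of absolute deviations of the cell counts from the block mean."""
    mean = sum(block) / len(block)
    return sum(abs(x - mean) for x in block)


cells = [64, 50, 1, 1]

one_block = aggregation_error(cells)
two_blocks = aggregation_error([64, 50]) + aggregation_error([1, 1])
four_blocks = sum(aggregation_error([c]) for c in cells)

print(one_block)    # 112.0 with 1 noisy block
print(two_blocks)   # 14.0 with 2 noisy blocks
print(four_blocks)  # 0.0 with 4 noisy blocks
```

Each additional block removes some AE but adds one more noise term, which is the trade-off the heuristic has to balance.
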
  11. Existing Approach: Privtree [4]

    - Recursive splitting by quadtree for efficient and fine partitioning.
    - Quadtree-based recursive splitting continues until the count is small enough, and every split is made in a fixed way.
    - Problem: the partitioning becomes too fine-grained in high dimensions, which leads to too large PE (req. 3) and a space-inefficient view (req. 4).
    - A more careful splitting strategy is necessary.
    [Figure: a quadtree over a 2D domain; each branch stops once its count is small enough.]
  12. HDPView (Proposed method)

    - Recursive bisection-based differentially private search algorithm.
    - Recursive bisection phase: each block runs 2 mechanisms.
      1. Random converge: stop the bisection by a random converge mechanism that depends on AE.
      2. Random cut: carefully select only one cut point by a random cut mechanism.
    - Once all blocks stop: aggregation, then perturbation, which yields the view.
    [Figure: a 2D histogram of cell counts that is bisected step by step into blocks.]
  13. Random cut, Random converge

    - Random cut
      - Only one cut point is probabilistically selected from all possible points using the exponential mechanism.
      - A cut point with smaller total AE after bisection is more likely to be sampled.
      - Effect: effective splits for small AE.
    - Random converge
      - Stop the recursive bisection if the Laplace-noised AE of block B is small enough.
      - The threshold is the stddev of the noise: if AE is below it, AE + PE will worsen with any bisection.
      - Effect: saves unnecessary splits.
    - We analyze a DP guarantee on the final view and an approximated error for any query on the view, which addresses req. (2) (see our paper for details).
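
The two mechanisms can be sketched on a 1D histogram as follows. This is an illustrative sketch only, not the paper's algorithm: the score function, the `sensitivity` value, the `threshold`, and the way the privacy budget is spent per recursion level are all assumptions made here, and the sketch does not reproduce the paper's privacy accounting.

```python
import math
import random
from typing import List, Sequence


def aggregation_error(block: Sequence[float]) -> float:
    """Sum of absolute deviations of the cell counts from the block mean."""
    mean = sum(block) / len(block)
    return sum(abs(x - mean) for x in block)


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def random_converge(block: Sequence[float], eps: float, threshold: float,
                    sensitivity: float, rng: random.Random) -> bool:
    """Stop if the Laplace-noised AE of the block is small enough."""
    noisy_ae = aggregation_error(block) + laplace(sensitivity / eps, rng)
    return noisy_ae <= threshold


def random_cut(block: Sequence[float], eps: float, sensitivity: float,
               rng: random.Random) -> int:
    """Exponential mechanism over cut points; score = -(total AE after cut)."""
    cuts = list(range(1, len(block)))
    scores = [-(aggregation_error(block[:c]) + aggregation_error(block[c:]))
              for c in cuts]
    best = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(eps * (s - best) / (2.0 * sensitivity)) for s in scores]
    return rng.choices(cuts, weights=weights, k=1)[0]


def bisect(block: Sequence[float], eps: float, threshold: float,
           rng: random.Random, sensitivity: float = 2.0) -> List[List[float]]:
    """Recursively bisect until every block converges (assumed parameters)."""
    if len(block) == 1 or random_converge(block, eps, threshold, sensitivity, rng):
        return [list(block)]
    c = random_cut(block, eps, sensitivity, rng)
    return (bisect(block[:c], eps, threshold, rng, sensitivity)
            + bisect(block[c:], eps, threshold, rng, sensitivity))


rng = random.Random(0)
histogram = [64, 50, 1, 1, 0, 0, 2, 30]
blocks = bisect(histogram, eps=1.0, threshold=5.0, rng=rng)
print(blocks)
```

Cut points that leave little AE on both sides get exponentially more weight, while blocks whose noisy AE is already small are left alone, so the number of blocks stays low.
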
  14. Experimental Settings

    - Task: 8 workloads of range counting queries
      - 1-way All range, {2,3}-way Marginal, {2,3}D-Prefix, {2,3,4}D-Random range
    - Metric: RMSE (root mean squared error)
    - Competitors: Identity, HDMM, Privtree, Privbayes
    - Datasets: 8 real-world datasets
    [Figure: list of the datasets with their dimensions.]
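
For reference, the RMSE metric over a workload can be computed as below. The query answers in the example are made-up numbers for illustration, not results from the experiments.

```python
import math
from typing import Sequence


def rmse(true_answers: Sequence[float], noisy_answers: Sequence[float]) -> float:
    """Root mean squared error between exact and private query answers."""
    if len(true_answers) != len(noisy_answers) or not true_answers:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(t - n) ** 2 for t, n in zip(true_answers, noisy_answers)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up answers to three counting queries.
exact = [10.0, 20.0, 30.0]
private = [12.0, 18.0, 30.0]
print(rmse(exact, private))  # sqrt((4 + 4 + 0) / 3)
```
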
  15. Average relative RMSE (ARR) over all 8 workloads and 8 datasets

    - ARR per method:
      - Identity: 1.94×10^[illegible]
      - Privtree: 7.05
      - HDMM: 35.34
      - Privbayes: 3.79
      - HDPView (ours): [illegible]
    - Noise resistance on HD data (req. 3).
    [Figure: relative RMSE on an example Prefix-3D query, with panels labeled 22D and 15D.]
  16. Comparison with Privtree and with naïve Laplace

    - Comparison with Privtree: the number of blocks is small.
    - Comparison with naïve Laplace over the whole domain.
    - Space-efficient (req. 4).
  17. Conclusion

    - Q: How can we construct a privacy-preserving view to explore a high-dimensional sensitive dataset?
    - We propose HDPView, which creates a high-dimensional differentially private view by properly balancing AE and PE for high-dimensional data.
    - The experiments show HDPView's noise resistance and space efficiency compared to existing works.
  18. Differential Privacy (DP)

    - What is differential privacy? A mathematical definition of privacy.
    - Algorithm $\mathcal{M}$ provides $\epsilon$-differential privacy if, for any neighboring databases $D$ and $D'$ (differing in only one record) and any set of outputs $S$, it satisfies
      $\frac{\Pr[\mathcal{M}(D) \in S]}{\Pr[\mathcal{M}(D') \in S]} \le \exp(\epsilon)$,
      where $\epsilon$ is the privacy parameter.
    - We can analyze the output distribution over multiple algorithms in a composable way (Composition Theorem).
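
As a concrete instance of this definition, the sketch below applies the standard Laplace mechanism to a counting query (sensitivity 1, noise scale 1/ε) and numerically checks the density ratio for two neighboring counts. The counts 100 and 101 and the value of `epsilon` are hypothetical.

```python
import math
import random


def laplace_pdf(x: float, mu: float, scale: float) -> float:
    """Density of Laplace(mu, scale) at x."""
    return math.exp(-abs(x - mu) / scale) / (2.0 * scale)


def laplace_count(true_count: float, epsilon: float, rng: random.Random) -> float:
    """Laplace mechanism for a counting query with sensitivity 1."""
    scale = 1.0 / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


epsilon = 0.5          # hypothetical privacy parameter
scale = 1.0 / epsilon
outputs = [x / 10.0 for x in range(900, 1101)]  # candidate outputs 90.0 .. 110.0

# Neighboring databases change a count by at most 1 (here 100 vs. 101).
worst_ratio = max(laplace_pdf(o, 100, scale) / laplace_pdf(o, 101, scale)
                  for o in outputs)
print(worst_ratio <= math.exp(epsilon) * (1 + 1e-9))  # True

print(laplace_count(100, epsilon, random.Random(0)))  # one private answer
```

The ratio never exceeds exp(ε) because shifting the mean by 1 changes the exponent by at most 1/scale = ε.
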
  19. Contribution

    Compared methods (columns): HDMM [1]; Privbayes [2], DP-GAN [3]; Identity; Privtree [4], DAWA [5]; HDPView (ours)
    - (1) Workload independence: ✔ ✔ ✔ ✔
    - (2) Analytical reliability: ✔ ✔ ✔ ✔
    - (3) Noise resistance on high-dimensional data: ✔ ✔ "Works only for low dimension (1 or 2)" ✔
    - (4) Space efficiency: ✔ ✔ ✔
  20. Characteristics

    - Flexible block partitioning
    - Better convergence
      - By carefully cutting one point at a time, the number of generated blocks can be reduced, which reduces PE and creates a space-efficient view.
      - Histograms of real datasets are very sparse and require few blocks.
    - Dimensional scalability
      - Similar to Mondrian [6] for k-anonymity.
      - The algorithm's execution is not affected much by the increase in dimensionality.
    [Figure: visualization on 2D data; left is HDPView, right is Privtree [4].]

    [6] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidimensional k-anonymity. 22nd International Conference on Data Engineering (ICDE'06). IEEE, 2006.
  21. Effects of dimensionality

    - Changes in performance when adding attributes to Adult one by one, for HDPView, Privbayes, and HDMM.
    - HDPView's performance degrades slowly but steadily with increasing dimensionality, while Privbayes' performance actually improves.
    [Figure: panels for Privbayes and HDPView.]