
P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model

Shun Takagi (Kyoto University)
Tsubasa Takahashi (LINE Corporation)
Yang Cao (Kyoto University)
Masatoshi Yoshikawa (Kyoto University)
https://arxiv.org/abs/2006.12101

Presentation slides at ICDE 2021 (37th IEEE International Conference on Data Engineering)

LINE Developers

May 24, 2021

Transcript

  1. P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model

    Shun Takagi†* (Kyoto University), Tsubasa Takahashi* (LINE Corporation), Yang Cao (Kyoto University), Masatoshi Yoshikawa (Kyoto University). †: A main part of the author's work was done while staying at LINE Corporation. *: Equal contribution.
  2. Background

    • Privacy issues prevent data from being shared across different organizations or departments
      → How can we mitigate the privacy risk when sharing data?
    • Privacy Preserving Data Synthesis (PPDS): sharing synthesized data with some privacy guarantee
    • We consider constructing a generative model with differential privacy
    (Figure: the Data Mgmt Div. trains a generative model on sensitive data; the Data Science Div. feeds random seeds into the shared model to obtain synthesized data.)
  3. Differential Privacy (DP)

    • DP gives a rigorous privacy guarantee via randomization
    • (ε, δ)-DP: a privacy measure of a randomized mechanism ℳ
    • DP guarantees indistinguishability of neighboring datasets
      → it is difficult for an adversary to infer any single record of D
    • Formally, for all outputs S and all neighboring datasets D, D′ with d_H(D, D′) = 1:
      Pr[ℳ(D) ∈ S] ≤ exp(ε) · Pr[ℳ(D′) ∈ S] + δ
    (Figure: the same randomized algorithm runs on D and on D′ with one record removed ("She is gone."); the two outputs are indistinguishable.)
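
A minimal sketch (not from the slides or the P3GM paper) of the Gaussian mechanism, one standard way to satisfy (ε, δ)-DP by calibrating noise to a query's L2 sensitivity; the function name and the counting-query example are illustrative assumptions.

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP via the classic Gaussian mechanism.

    Standard calibration (valid for eps <= 1):
        sigma = sqrt(2 * ln(1.25 / delta)) * l2_sensitivity / eps
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: releasing a count; adding/removing one record changes it by at most 1.
noisy_count = gaussian_mechanism(42, l2_sensitivity=1.0, eps=1.0, delta=1e-5)
```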
  4. Existing methods cannot synthesize high-dimensional data while preserving the original distribution

    (Figure: samples from the original data next to outputs of PrivBayes [2] (generative model), the naive method [4], and DP-GM [3] (deep generative model, VAE [1]).)
  5. Our Contribution

                                     Bayesian Net.   GANs   VAEs   Ours
    Noise Robustness                       ×           ×      ◦      ◦
    Preserving Data Distribution           ◦           ×      ×      ◦

    All models are built under differential privacy constraints (ε = 1).
    (Figure: synthesized samples from the original data, PrivBayes [2], naive [4], DP-GM [3], and ours.)
  6. Variational AutoEncoder (VAE [1])

    • A generative probabilistic model with embedding and reconstruction
    • VAE synthesizes data by feeding z ~ N(0, I) into the middle layer
    • Embedding: embed data x into the latent space, z ~ N(0, I)
    • Reconstruction: synthesize x̂ from z such that x ≈ x̂
    • Training learns the process x → z → x̂ (a minimal sketch follows)
    (Figure: encoder q_φ embeds the original space into the latent space; decoder p_θ reconstructs; synthesis draws z from the standard normal N(0, I).)
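
A minimal PyTorch sketch of the VAE just described, assuming MNIST-like inputs; the layer sizes and dimensions are illustrative placeholders, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder q_phi(z|x) embeds x, decoder p_theta(x|z) reconstructs."""
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.mu, self.logvar = nn.Linear(h_dim, z_dim), nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
        return self.dec(z), mu, logvar

def elbo_loss(x, x_hat, mu, logvar):
    """Negative ELBO: reconstruction error + KL(q_phi(z|x) || N(0, I))."""
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```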
  7. DP-SGD [4]: Training a Neural Network with DP

    • DP stochastic gradient descent (DP-SGD): by adding noise y to each update Δ, the trained generative model satisfies DP
    • At every iteration: θ₁ ← θ₁ + Δ₁ + y₁, θ₂ ← θ₂ + Δ₂ + y₂ (see the sketch below)
    (Figure: a sensitive table (name, height, disease: A 140 Yes / B 160 No / C 180 No) drives gradient updates that are noised before being applied.)
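
A minimal sketch of one DP-SGD step as in [4] (per-example gradient clipping plus Gaussian noise); the hyperparameter values and the naive per-example loop are illustrative simplifications, not the paper's implementation.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip_norm=1.0, noise_mult=1.1):
    """One DP-SGD update [4]: clip each per-example gradient to an L2 norm of
    `clip_norm`, sum them, add Gaussian noise with std `noise_mult * clip_norm`,
    then apply the averaged noisy gradient."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batch:  # naive per-example loop, kept simple for clarity
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        norm = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
        scale = (clip_norm / (norm + 1e-12)).clamp(max=1.0)  # gradient clipping
        for s, p in zip(summed, model.parameters()):
            s.add_(p.grad * scale)
    with torch.no_grad():
        for s, p in zip(summed, model.parameters()):
            noise = torch.randn_like(s) * noise_mult * clip_norm
            p.sub_(lr * (s + noise) / len(batch))
```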
  8. VAE [1] + DP-SGD [4]

    • Training a VAE with DP-SGD satisfies DP, but the training does not converge well
    (Figure: both the embedding q_φ and the reconstruction p_θ are trained with noise, so the whole path x → z → x̂ is perturbed.)
  9. Difficulty of Learning a DP Generative Model

    • Without noise: the embeddings z and z′ of two inputs x and x′ stay separated, so each reconstructs correctly
    • With noise: the embedding z′ drifts toward z, and its reconstruction x̂′ becomes data that looks like a mixture of x and x′
    (Figure: latent and original spaces side by side, without and with noise.)
  10. DP-GM [3]

    • 1. Cluster the data; 2. train one VAE per cluster (VAE0, VAE1, ..., VAE9)
    • Pros: each VAE can generate plausible data even if the embedding is noisy
    • Cons: each VAE is trained with less data due to the clustering, which requires adding more noise
  11. Overview of P3GM

    • Two-phased training: embedding → reconstruction; this improves tolerance to noise
    • Phase 1: train the embedding
    • Phase 2: train the reconstruction on top of the fixed embedding
    (cf. VAE, where embedding and reconstruction are trained simultaneously)
  12. Design of the Phased Model

    • The assumption "latent variable z = training data x"
      • enables the phased training
      • simplifies the embedding
    • That is, we train the process x → z(= x) → x̂ to achieve x ≈ x̂ (the latent space equals the original space)
  13. Phase 1: Embedding

    • Training x → z(= x) → x̂ requires the prior of z
    • By our assumption, the prior of z is the prior of x, which is intractable
      → we approximate the prior of x by a mixture of Gaussians (MoG)
  14. Phase 1: Embedding

    • To protect privacy with DP, we use DP-PCA [9] and DP-EM [8]
    • PCA performs the dimensionality reduction x → z, and the EM algorithm estimates the MoG in the latent space (see the sketch below)
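
A sketch of the Phase 1 data flow under an explicit assumption: the scikit-learn estimators below are non-private stand-ins for the DP-PCA (Wishart mechanism [9]) and DP-EM [8] steps that the paper actually uses; `fit_phase1` and its defaults are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fit_phase1(x, z_dim=10, n_components=10, seed=0):
    """Phase 1 sketch: embed data by PCA, then estimate a MoG prior in latent space.

    NOTE: non-private stand-ins; the paper uses DP-PCA [9] and DP-EM [8],
    which add calibrated noise to the covariance / EM statistics.
    """
    pca = PCA(n_components=z_dim, random_state=seed).fit(x)
    z = pca.transform(x)  # the (fixed) embedding x -> z
    mog = GaussianMixture(n_components=n_components, random_state=seed).fit(z)
    return pca, mog
```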
  15. Phase 2: Reconstruction

    • The reconstruction is trained with DP-SGD [4] to achieve x ≈ x̂, while the embedding trained in Phase 1 (DP-PCA, MoG) stays fixed
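
A sketch of Phase 2 under the same assumptions: the PCA embedding from Phase 1 is applied as a frozen transform and only the decoder is updated; `dp_sgd_step` refers to the DP-SGD sketch shown earlier, and `decoder`/`data_loader` are assumed placeholders, not the paper's code.

```python
import torch

def train_phase2(decoder, pca, data_loader, epochs=10):
    """Phase 2 sketch: with the Phase 1 embedding frozen, train only the
    reconstruction network so that x_hat ~= x, using DP-SGD [4]."""
    loss_fn = torch.nn.MSELoss()
    components = torch.as_tensor(pca.components_, dtype=torch.float32)
    mean = torch.as_tensor(pca.mean_, dtype=torch.float32)
    for _ in range(epochs):
        for x in data_loader:               # x: float tensor (batch, x_dim)
            z = (x - mean) @ components.T   # fixed embedding, not trained
            dp_sgd_step(decoder, loss_fn, list(zip(z, x)))  # target is x itself
```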
  16. Synthesize

    • We synthesize data by feeding z drawn from the MoG estimated in Phase 1 through the trained reconstruction network
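
A sketch of synthesis with the same placeholder objects: draw z from the MoG fitted in Phase 1 and decode it. Note that, unlike a plain VAE, z is sampled from the MoG rather than from N(0, I).

```python
import torch

def synthesize(decoder, mog, n_samples=1000):
    """Sample z ~ MoG (estimated in Phase 1) and push it through the decoder."""
    z, _ = mog.sample(n_samples)  # scikit-learn returns (samples, component ids)
    with torch.no_grad():
        x_syn = decoder(torch.as_tensor(z, dtype=torch.float32))
    return x_syn.numpy()
```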
  17. Setting

    • The quality of synthesized data = the score of a machine learning model trained on the synthesized data and evaluated on raw data
    • Binary classification by four models:
      - AUROC: how well the model can discriminate the two classes
      - AUPRC: how accurately the model can identify true positives
    • 10-class classification by a neural network: accuracy
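
A sketch of this utility evaluation: train a downstream classifier on synthesized data and score it on raw held-out data. Logistic regression stands in for the four classifier families the paper averages over, and `average_precision_score` is used as the usual AUPRC estimate.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate_utility(x_syn, y_syn, x_raw_test, y_raw_test):
    """Train on synthesized data, test on raw data; report AUROC / AUPRC."""
    clf = LogisticRegression(max_iter=1000).fit(x_syn, y_syn)
    score = clf.predict_proba(x_raw_test)[:, 1]  # positive-class probability
    return {"AUROC": roc_auc_score(y_raw_test, score),
            "AUPRC": average_precision_score(y_raw_test, score)}
```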
  18. Experimental Results

    Table VI: Performance comparison on four real (tabular) datasets. Each score is the average AUROC or AUPRC over the four classifiers listed in Table V.

    AUROC:
    Dataset        PrivBayes  Ryan's  DP-GM   P3GM    original
    Kaggle Credit  0.5520     0.5326  0.8805  0.9232  0.9663
    UCI ESR        0.5377     0.5757  0.4911  0.8243  0.8698
    Adult          0.8530     0.5048  0.7806  0.8321  0.9119
    UCI ISOLET     0.5100     0.5326  0.4695  0.6855  0.9891

    AUPRC:
    Dataset        PrivBayes  Ryan's  DP-GM   P3GM    original
    Kaggle Credit  0.2084     0.2503  0.3301  0.5208  0.8927
    UCI ESR        0.5419     0.4265  0.3311  0.7559  0.8098
    Adult          0.6374     0.2584  0.4502  0.5917  0.7844
    UCI ISOLET     0.2084     0.2099  0.1816  0.3287  0.9623

    Table VII: Classification accuracies on image datasets.
    Dataset  VAE     DP-GM   PrivBayes  Ryan's  P3GM
    MNIST    0.8571  0.4973  0.0970     0.2385  0.7946
    Fashion  0.7854  0.5200  0.0996     0.2408  0.7311

    DP-GM: VAE-based method. PrivBayes: Bayesian-network-based method. Ryan's: the method that won the NIST competition.
    Our model P3GM gets the highest score among the private models on 5 of the 6 datasets. *The privacy guarantee is (1, 10⁻⁵)-DP.
    (Fig. 5: reducing the dimension improves accuracy on MNIST, shown for (a) AUROC and (b) AUPRC. Fig. 6: too small a dimensionality lacks the expressiveness for embedding; from the result, d_p ∈ [10, 100] balances accuracy and dimensionality reduction on the MNIST dataset.)
  19. Learning Efficiency on the MNIST Dataset

    • P3GM's simpler model demonstrates higher learning efficiency than DP-VAE, converging fast
    (Fig. 7: (a) reconstruction loss at each iteration (MNIST), (b) reconstruction loss at each iteration (Kaggle Credit), (c) classification accuracy at each epoch (MNIST); annotations mark P3GM's fast convergence.)
  20. Conclusion

    • We propose P3GM, which satisfies DP and can synthesize data similar to the original data even when the dimensionality is high
    • The experiments show that P3GM outperforms existing approaches on high-dimensional data
    (Figure: samples from the original data, PrivBayes [2] (generative model), the naive method [4], DP-GM [3] (deep generative model, VAE [1]), and P3GM.)
  21. References

    [1] D. P. Kingma and M. Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
    [2] J. Zhang, et al. "PrivBayes: Private data release via Bayesian networks." SIGMOD 2014.
    [3] G. Acs, et al. "Differentially private mixture of generative neural networks." IEEE Transactions on Knowledge and Data Engineering 31.6 (2018): 1109-1121.
    [4] M. Abadi, et al. "Deep learning with differential privacy." CCS 2016.
    [5] I. Goodfellow, et al. "Generative adversarial nets." NIPS 2014.
    [6] L. Xie, et al. "Differentially private generative adversarial network." arXiv preprint arXiv:1802.06739 (2018).
    [7] J. Jordon, et al. "Generating synthetic data with differential privacy guarantees." ICLR 2019.
    [8] M. Park, et al. "DP-EM: Differentially private expectation maximization." AISTATS 2017.
    [9] W. Jiang, et al. "Wishart mechanism for differentially private principal components analysis." AAAI 2016.