Slide 1

Slide 1 text

P3GM: Private High-Dimensional Data Release via Privacy Preserving Phased Generative Model
Shun Takagi†* (Kyoto University), Tsubasa Takahashi* (LINE Corporation), Yang Cao (Kyoto University), Masatoshi Yoshikawa (Kyoto University)
†: A main part of this author's work was done while staying at LINE Corporation. *: Equal contribution.

Slide 2

Slide 2 text

2 Background
• Privacy issues prevent sharing data across different organizations or departments → How can we mitigate privacy risk when sharing data?
• Privacy Preserving Data Synthesis (PPDS): sharing synthesized data with a privacy guarantee
• We construct a generative model with differential privacy
[Figure: the Data Mgmt Div. trains a generative model on sensitive data and shares the model; the Data Science Div. feeds random seeds into the model to obtain synthesized data]

Slide 3

Slide 3 text

3 Differential Privacy (DP)
• DP gives a rigorous privacy guarantee via randomization
• (ε, δ)-DP: a privacy measure of a randomized mechanism ℳ, requiring

  Pr[ℳ(D) ∈ S] ≤ exp(ε) · Pr[ℳ(D′) ∈ S] + δ  for all outputs S and all datasets D, D′ such that d(D, D′) = 1

• DP guarantees indistinguishability of neighboring datasets → it is difficult for an adversary to infer any record of D
[Figure: a randomized algorithm produces indistinguishable outputs on D and on the neighboring D′ with one record removed ("She is gone.")]
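As a concrete illustration of an (ε, δ)-DP mechanism (not part of the slides), here is a minimal Python sketch of the classic Gaussian mechanism; the function name and the count-query example are our own, and the σ calibration is the standard one from Dwork and Roth (valid for ε < 1):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta):
    """Release `value` under (eps, delta)-DP via the Gaussian mechanism.

    Uses the classic calibration sigma = sqrt(2 ln(1.25/delta)) * s / eps,
    where s is the L2 sensitivity of the query (valid for eps < 1).
    """
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * l2_sensitivity / eps
    return value + np.random.normal(0.0, sigma, size=np.shape(value))

# Example: a counting query changes by at most 1 when one record changes,
# so its L2 sensitivity is 1.
noisy_count = gaussian_mechanism(42.0, l2_sensitivity=1.0, eps=0.5, delta=1e-5)
```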

Slide 4

Slide 4 text

4 Existing Problem
Existing methods cannot synthesize high-dimensional data while preserving the original distribution.
[Figure: samples from the original data vs. PrivBayes[2] (a generative model), a naive DP deep generative model[4] (VAE[1]), and DP-GM[3]]
[2] J. Zhang, et al. "PrivBayes: Private data release via Bayesian networks." SIGMOD 2014.
[3] G. Acs, et al. "Differentially private mixture of generative neural networks." IEEE Transactions on Knowledge and Data Engineering 31.6 (2018): 1109-1121.
[4] M. Abadi, et al. "Deep learning with differential privacy." CCS 2016.

Slide 5

Slide 5 text

Our Contribution 5

                                Bayesian Net.   GANs   VAEs   Ours
  Noise Robustness                    ×           ×      ○      ○
  Preserving Data Distribution        ○           ×      ×      ○

[Figure: samples from the original data vs. PrivBayes[2], naive[4], DP-GM[3], and Ours. All models are built under differential privacy constraints (ε = 1).]

Slide 6

Slide 6 text

Preliminaries 6

Slide 7

Slide 7 text

7 Variational AutoEncoder (VAE[1])
• A generative probabilistic model with an embedding step and a reconstruction step
  • embedding: embed data x into the latent space as z ~ N(0, I)
  • reconstruction: synthesize x̂ from z such that x ≈ x̂
• Training learns the process x → z → x̂; afterwards, VAE synthesizes data by feeding z ~ N(0, I) into the middle layer
[Figure: the encoder qφ maps the original space to the latent space and the decoder pθ maps back; synthesis samples z from the standard normal distribution N(0, I)]
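To make the two steps concrete, here is a minimal PyTorch sketch of a VAE (an illustrative toy, not the architecture used in the paper; the layer sizes are arbitrary assumptions). The encoder plays the role of qφ and the decoder the role of pθ:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE: encoder q_phi(z|x) and decoder p_theta(x|z)."""
    def __init__(self, x_dim=784, z_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    """Reconstruction term plus KL(q_phi(z|x) || N(0, I))."""
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```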

Slide 8

Slide 8 text

8 DP-SGD[4]: Training a Neural Network with DP
DP stochastic gradient descent (SGD): by adding noise y to each update Δ, the trained generative model satisfies DP.
At each iteration: θ₁ ← θ₁ + Δ₁ + y₁, θ₂ ← θ₂ + Δ₂ + y₂
[Figure: a table of sensitive records (name, height, disease) used to compute noisy parameter updates]
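A minimal sketch of one DP-SGD update in PyTorch (our own simplification: real implementations vectorize the per-example gradients and track the privacy budget with a moments accountant, both omitted here):

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip=1.0, noise_mult=1.1):
    """One DP-SGD step: clip each per-example gradient to L2 norm `clip`,
    sum the clipped gradients, add Gaussian noise of scale
    `noise_mult * clip`, then take an averaged descent step."""
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batch:  # compute per-example gradients
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip / (float(norm) + 1e-12))  # L2 clipping
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            noise = torch.randn_like(s) * noise_mult * clip
            p -= lr * (s + noise) / len(batch)
```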

Slide 9

Slide 9 text

9 VAE[1] + DP-SGD[4]
Training a VAE with DP-SGD makes it satisfy DP, but the training does not converge well.
[Figure: both the embedding qφ (x → z) and the reconstruction pθ (z → x̂) are trained with noisy gradients]

Slide 10

Slide 10 text

10 Difficulty of Learning a DP Generative Model
• Without noise: x and x′ are embedded to distinct latents z and z′, and each is reconstructed correctly (x̂, x̂′)
• With noise: the embeddings of x and x′ overlap, so the reconstruction of z′ becomes data that looks like a mixture of x and x′
[Figure: embedding and reconstruction paths with and without noise]

Slide 11

Slide 11 text

11 DP-GM[3]
1. Cluster the data; 2. train a separate VAE on each cluster
Pros
• Each VAE can generate plausible data even if the embedding is noisy
Cons
• Each VAE is trained on less data because of the clustering, which forces more noise to be added
[Figure: one VAE per cluster (VAE0, VAE1, ..., VAE9), each embedding and reconstructing its own class]

Slide 12

Slide 12 text

New Generative Model: Privacy Preserving Phased Generative Model (P3GM) 12

Slide 13

Slide 13 text

13 Overview of P3GM
Two-phased training (embedding → reconstruction) improves tolerance to noise.
• Phase 1: train the embedding
• Phase 2: train the reconstruction on top of the trained embedding
(cf. VAE: embedding and reconstruction are trained simultaneously)
[Figure: Phase 1 learns x → z; Phase 2 learns z → x̂]

Slide 14

Slide 14 text

14 Design of the Phased Model
• The assumption "latent variable z = training data x"
  • enables the phased training
  • simplifies the embedding
• That is, we train the process x → z(= x) → x̂ so that x ≈ x̂
[Figure: the latent space equals the original space; z(= x) is reconstructed to x̂]

Slide 15

Slide 15 text

15 Phase 1: Embedding
• Training x → z(= x) → x̂ requires a prior over z
• Under our assumption, the prior of z is the prior of x, which is intractable
→ We approximate the prior of x by a mixture of Gaussians (MoG)
[Figure: Phase 1 fits a MoG in the latent space (= original space)]

Slide 16

Slide 16 text

16 Phase 1: Embedding
• We reduce the dimensionality with PCA and estimate the MoG in the latent space with the EM algorithm
• To protect privacy under DP, we use DP-PCA[9] and DP-EM[8]
[Figure: DP-PCA maps x to z; DP-EM fits the MoG over z]
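A sketch of Phase 1 with non-private stand-ins: the paper uses DP-PCA[9] and DP-EM[8], but plain sklearn PCA and EM are substituted here purely for illustration, and all dimensions are placeholder choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def phase1_embedding(X, latent_dim=10, n_components=10):
    """Phase 1 sketch: project the data with PCA and fit a MoG in the
    latent space with EM. The private versions (DP-PCA, DP-EM) would
    replace these two estimators in the actual method."""
    pca = PCA(n_components=latent_dim).fit(X)
    Z = pca.transform(X)                      # latent representation z
    mog = GaussianMixture(n_components=n_components).fit(Z)
    return pca, mog

# Usage on toy data (shapes are placeholder choices)
X = np.random.rand(1000, 784)
pca, mog = phase1_embedding(X)
```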

Slide 17

Slide 17 text

17 Phase 2: Reconstruction
• The reconstruction is trained with DP-SGD so that x ≈ x̂, while the embedding trained in Phase 1 is kept fixed
[Figure: the frozen DP-PCA embedding (x → z) feeds the decoder being trained (z → x̂), with the MoG prior from Phase 1]
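A sketch of one Phase 2 update under the same assumptions: only the decoder's parameters move, and its per-example gradients are clipped and noised in the spirit of the DP-SGD sketch above (hyperparameters are placeholders; privacy accounting is omitted). Here Z would come from the fixed Phase 1 projection, e.g. pca.transform(X).

```python
import torch

def phase2_step(decoder, Z, X, lr=0.1, clip=1.0, noise_mult=1.1):
    """One Phase-2 update: per-example gradients of the reconstruction
    loss are clipped to L2 norm `clip` and noised; the Phase-1 embedding
    that produced Z is frozen and never updated."""
    summed = [torch.zeros_like(p) for p in decoder.parameters()]
    for z, x in zip(Z, X):
        decoder.zero_grad()
        torch.nn.functional.mse_loss(decoder(z), x).backward()
        grads = [p.grad.detach().clone() for p in decoder.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip / (float(norm) + 1e-12))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(decoder.parameters(), summed):
            p -= lr * (s + torch.randn_like(s) * noise_mult * clip) / len(X)
```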

Slide 18

Slide 18 text

18 Synthesize
• We synthesize data by feeding z sampled from the MoG estimated in Phase 1 into the trained reconstruction network
[Figure: z ~ MoG → decoder → synthesized data x̂]
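A sketch of the synthesis step, assuming `mog` is the fitted sklearn GaussianMixture from the Phase 1 sketch and `decoder` is the torch reconstruction network trained in Phase 2:

```python
import torch

def synthesize(mog, decoder, n_samples=100):
    """Draw latent vectors from the Phase-1 MoG and decode them into
    synthetic records with the Phase-2 reconstruction network."""
    z, _ = mog.sample(n_samples)                  # z ~ MoG
    z = torch.as_tensor(z, dtype=torch.float32)
    with torch.no_grad():
        return decoder(z)                         # synthesized data
```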

Slide 19

Slide 19 text

Experiments 19

Slide 20

Slide 20 text

20 Setting
• The quality of synthesized data = the score of a machine learning model trained on the synthesized data and evaluated on raw (real) data
• Binary classification with four models:
  - AUROC: how well the model discriminates the two classes
  - AUPRC: how accurately the model recovers true positives
• 10-class classification with a neural network:
  - Accuracy
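A minimal sketch of this evaluation protocol (the classifier choice and the random placeholder data are our own): train a model on the synthesized records, then score it on held-out real records.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, average_precision_score

def utility_scores(X_syn, y_syn, X_real, y_real):
    """Train on synthesized data, evaluate on real data; the resulting
    AUROC/AUPRC proxy the quality of the synthesized data."""
    clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    prob = clf.predict_proba(X_real)[:, 1]
    return roc_auc_score(y_real, prob), average_precision_score(y_real, prob)

# Toy usage with random placeholder data
rng = np.random.default_rng(0)
auroc, auprc = utility_scores(rng.random((500, 20)), rng.integers(0, 2, 500),
                              rng.random((200, 20)), rng.integers(0, 2, 200))
```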

Slide 21

Slide 21 text

Experimental Results 21
*The privacy guarantee is (1, 10⁻⁵)-DP.

TABLE VI: Performance comparison on four real (tabular) datasets. Each score is the average AUROC or AUPRC over the four classifiers listed in Table V. P3GM outperforms the other two differentially private models on three datasets.

AUROC
Dataset         PrivBayes   Ryan's   DP-GM    P3GM     original
Kaggle Credit   0.5520      0.5326   0.8805   0.9232   0.9663
UCI ESR         0.5377      0.5757   0.4911   0.8243   0.8698
Adult           0.8530      0.5048   0.7806   0.8321   0.9119
UCI ISOLET      0.5100      0.5326   0.4695   0.6855   0.9891

AUPRC
Dataset         PrivBayes   Ryan's   DP-GM    P3GM     original
Kaggle Credit   0.2084      0.2503   0.3301   0.5208   0.8927
UCI ESR         0.5419      0.4265   0.3311   0.7559   0.8098
Adult           0.6374      0.2584   0.4502   0.5917   0.7844
UCI ISOLET      0.2084      0.2099   0.1816   0.3287   0.9623

TABLE VII: Classification accuracies on image datasets.
Dataset   VAE      DP-GM    PrivBayes   Ryan's   P3GM
MNIST     0.8571   0.4973   0.0970      0.2385   0.7946
Fashion   0.7854   0.5200   0.0996      0.2408   0.7311

DP-GM: VAE-based method. PrivBayes: Bayesian-network-based method. Ryan's: the method that won the NIST competition.
Our model P3GM gets the highest score among the differentially private models on 5 of the 6 datasets.

[Fig. 5: Reducing the dimension improves accuracy (MNIST); panels (a) AUROC and (b) AUPRC. Too small a dimensionality lacks the expressiveness needed for embedding; dimensions in [10, 100] balance accuracy against dimensionality reduction on MNIST.]
[Fig. 6: only P3GM remains accurate as the dimensionality grows]

Slide 22

Slide 22 text

Learning Efficiency on the MNIST Dataset 22
Although the simpler model increases the reconstruction error at each iteration, P3GM converges fast.
[Fig. 7: P3GM demonstrates higher learning efficiency than DP-VAE; panels (a) reconstruction loss (MNIST), (b) reconstruction loss (Kaggle Credit), (c) classification accuracy (MNIST). P3GM shows fast convergence in both reconstruction loss per iteration and classification accuracy per epoch.]

Slide 23

Slide 23 text

23 Conclusion
• We propose P3GM, which satisfies DP and can synthesize data similar to the original even when the dimensionality is high
• The experiments show that P3GM outperforms existing approaches on high-dimensional data
[Figure: samples from the original data vs. PrivBayes[2] (generative model), the naive method[4] and DP-GM[3] (deep generative models, VAE[1]), and P3GM]

Slide 24

Slide 24 text

Thank you 24

Slide 25

Slide 25 text

25 References
[1] D. P. Kingma and M. Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).
[2] J. Zhang, et al. "PrivBayes: Private data release via Bayesian networks." SIGMOD 2014.
[3] G. Acs, et al. "Differentially private mixture of generative neural networks." IEEE Transactions on Knowledge and Data Engineering 31.6 (2018): 1109-1121.
[4] M. Abadi, et al. "Deep learning with differential privacy." CCS 2016.
[5] I. Goodfellow, et al. "Generative adversarial nets." NIPS 2014.
[6] L. Xie, et al. "Differentially private generative adversarial network." arXiv preprint arXiv:1802.06739 (2018).
[7] J. Jordon, et al. "Generating synthetic data with differential privacy guarantees." ICLR 2019.
[8] M. Park, et al. "DP-EM: Differentially private expectation maximization." AISTATS 2017.
[9] W. Jiang, et al. "Wishart mechanism for differentially private principal components analysis." AAAI 2016.