High-Dimensional Data Release via Privacy Preserving Phased Generative Model (ICDE 2021) • Liew, Takahashi, Ueno. PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning (ICLR 2022). Tsubasa Takahashi, Seng Pei Liew, Satoshi Hasegawa. LINE Corporation, Data Science Center, ML Privacy Team
At WWDC 2016, Craig Federighi (Apple) described differential privacy [1] as a “research topic in the area of statistics and data analytics that uses hashing, subsampling and noise injection to enable crowdsourced learning while keeping the data of individual users completely private.” [2]
[1] C. Dwork. Differential privacy. ICALP, 2006.
[2] https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
[4] N. Hoshino. A firm foundation for statistical disclosure control. Japanese Journal of Statistics and Data Science, 3(2):721–746, 2020. (Source: Tables 3 and 4 of [4])
Adjacent databases: a pair of databases that differ in exactly one element. Here only the addition or deletion of a single element is considered: $d_H(D, D') = 1$, where $d_H(\cdot, \cdot)$ is the Hamming distance. Some of the adjacent databases of D = {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes)} (records are (NAME, Cancer) pairs):
• Remove Bob: {(Alice, Yes), (Cynthia, No), (David, Yes)}
• Remove Cynthia: {(Alice, Yes), (Bob, No), (David, Yes)}
• Add (Franc, No): {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes), (Franc, No)}
A mechanism $\mathcal{M}$ satisfies ε-differential privacy [1] if, for every pair of adjacent databases $D, D' \in \mathcal{D}$ and every set of outputs $S \subseteq \mathrm{Range}(\mathcal{M})$,
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].$$
Running $\mathcal{M}$ on adjacent databases (e.g., D, D with Eve added, D with Bob removed) yields outputs that can be distinguished only up to a factor of $e^{\varepsilon}$, so the difference between the inputs is also hard to distinguish.
[1] C. Dwork. Differential privacy. ICALP, 2006.
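To make the definition concrete, here is a minimal Python sketch (my illustration, not from the slides) of randomized response, the textbook ε-DP mechanism for releasing a single bit; answering truthfully with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ makes the ratio of output probabilities exactly $e^{\varepsilon}$.

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float) -> int:
    """Release a sensitive bit under epsilon-DP: answer truthfully with
    probability e^eps / (1 + e^eps), otherwise flip the bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit

# For epsilon = ln 3, p_truth = 3/4, so for any output o,
# Pr[o | bit = 1] / Pr[o | bit = 0] <= (3/4) / (1/4) = 3 = exp(epsilon).
print(randomized_response(1, math.log(3)))
```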
For (ε, δ)-differential privacy, the requirement is relaxed to $\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$.
Bounded differential privacy [5] uses a different notion of adjacency: only the change of a single tuple's value is considered. The adjacent databases of D = {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes)} are obtained by flipping exactly one value, e.g., (Bob, Yes), (Alice, No), (Cynthia, Yes), or (David, No).
[5] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. SIGMOD, 2011.
In federated learning, raw data never leaves the device.
• Low-cost communication
• Tailor-made personalization
• The server orchestrates the FL procedure and learns the global model via aggregation
• Mitigates cold-start issues
• Free from storing huge amounts of client data
• Immediate adaptation
Subsampling and shuffling give better privacy amplification guarantees, provable theoretically. (Source: Feldman et al., 2021, “A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling”)
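The shuffling analysis of Feldman et al. is involved, but the flavor of amplification is easy to see with the standard subsampling bound: running an ε-DP mechanism on a Poisson subsample with sampling rate q satisfies $\log(1 + q(e^{\varepsilon} - 1))$-DP. A minimal sketch (my illustration, assuming this standard bound):

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Privacy amplification by Poisson subsampling: an eps-DP mechanism
    applied to a subsample drawn with rate q satisfies
    log(1 + q * (exp(eps) - 1))-DP."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

# A 1% sampling rate shrinks eps = 1.0 to roughly 0.017;
# for small q the amplified epsilon is roughly linear in q.
print(amplified_epsilon(1.0, 0.01))
```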
Better privacy accounting under composition requires new tools such as Rényi differential privacy (RDP) and the privacy loss distribution (PLD). (Sources: arXiv:2012.12803, arXiv:2106.02848, arXiv:2106.08567)
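As an illustration of why RDP helps under composition, here is a minimal sketch (my own, assuming the standard Gaussian-mechanism RDP bound $\varepsilon(\alpha) = \alpha/(2\sigma^2)$ for sensitivity 1 and Mironov's RDP-to-(ε, δ) conversion; production code should use a vetted accounting library):

```python
import math

def rdp_gaussian(alpha: float, sigma: float) -> float:
    """RDP of the Gaussian mechanism with L2 sensitivity 1:
    eps(alpha) = alpha / (2 * sigma^2)."""
    return alpha / (2.0 * sigma ** 2)

def composed_epsilon(sigma: float, steps: int, delta: float) -> float:
    """Compose `steps` Gaussian mechanisms: RDP adds linearly across
    compositions, then convert to (eps, delta)-DP via
    eps = eps_rdp(alpha) + log(1/delta) / (alpha - 1), minimized over alpha."""
    alphas = [1.0 + x / 10.0 for x in range(1, 1000)]  # grid over alpha > 1
    return min(steps * rdp_gaussian(a, sigma)
               + math.log(1.0 / delta) / (a - 1.0) for a in alphas)

print(composed_epsilon(sigma=5.0, steps=100, delta=1e-5))
```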
Privacy amplification schemes tailored to federated learning have also been considered.
• Problem: sampling of clients by the server is impractical (a client may not respond in time, its battery may run out, etc.).
• The protocol should rely only on randomness generated independently by each client (random check-in).
(Source: Balle et al., 2020, “Privacy Amplification via Random Check-Ins”)
With random check-ins, a privacy amplification guarantee comparable to that of shuffling is obtainable.
• Note: a centralized, trusted orchestrating server is still required to hide the identities of participating clients.
(Source: Balle et al., 2020, “Privacy Amplification via Random Check-Ins”)
Training modern deep learning models with DP.
• The main problem of DP-SGD: utility degrades with the dimensionality of the model (a minimal sketch of one DP-SGD step follows this list).
• A CNN for MNIST usually contains ~1 million parameters or fewer.
• RoBERTa-Large (NLP) contains 354 million parameters.
• We relax the problem by considering fine-tuning (with DP) models pre-trained on public data.
• Models like BERT are usually pre-trained on public data (the Web).
• Image classification models can be pre-trained on, e.g., CIFAR-100.
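A minimal numpy sketch of a single DP-SGD step (my illustration; logistic regression is used only to keep the per-example gradients short). The key point is that the Gaussian noise is added to a d-dimensional vector, so its expected norm grows with the model dimensionality d:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0):
    """One DP-SGD step for logistic regression: clip each per-example
    gradient to L2 norm `clip`, sum, add Gaussian noise with standard
    deviation noise_multiplier * clip, then average and descend."""
    preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid predictions
    grads = (preds - y)[:, None] * X               # per-example gradients, (n, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / clip)
    noisy = clipped.sum(axis=0) + np.random.normal(
        scale=noise_multiplier * clip, size=w.shape)
    return w - lr * noisy / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = (rng.random(64) > 0.5).astype(float)
w = dp_sgd_step(np.zeros(10), X, y)
```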
One can obtain a smaller NN without losing much utility by pruning the NN appropriately.
• Freeze pre-trained weights according to the lottery ticket hypothesis and update only the unfrozen weights (see the sketch below).
(Source: Luo et al., 2021, “Scalable Differential Privacy With Sparse Network Finetuning”)
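A minimal PyTorch sketch of the freeze-and-fine-tune idea (my illustration of the general recipe, not the exact pruning procedure of Luo et al.): only the unfrozen parameters receive gradient updates, shrinking the dimensionality that DP-SGD must noise.

```python
import torch.nn as nn
from torch.optim import SGD

# Hypothetical pre-trained network; in practice e.g. a model pre-trained
# on CIFAR-100 or a public Web corpus.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Freeze everything except the final layer; frozen weights keep their
# pre-trained values and receive no (noisy) updates.
for param in model[:-1].parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = SGD(trainable, lr=0.1)  # DP-SGD would noise only these params
```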
• Marginal-based methods: privately measure a subset of low-order marginals to fit graphical models (e.g., PrivBayes).
• Deep generative models: train deep generative models such as Generative Adversarial Networks (GANs) with DP.
How to measure the “usefulness” of synthetic data?
• Test the performance of machine learning models trained on synthetic data (classification accuracy, F1 score).
• Measure the discrepancy between the real and synthetic data distributions (two-sample tests, maximum mean discrepancy (MMD), etc.; a small MMD sketch follows).
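As an example of the distribution-discrepancy route, here is a small numpy sketch (my illustration) of a plug-in estimate of the squared maximum mean discrepancy with an RBF kernel; values near zero mean the synthetic data look distributionally close to the real data.

```python
import numpy as np

def mmd2_rbf(X, Y, bandwidth=1.0):
    """Plug-in (biased) estimate of squared MMD between samples X and Y
    under an RBF kernel."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 5))
synth = rng.normal(0.1, 1.0, size=(200, 5))   # stand-in for generated data
print(mmd2_rbf(real, synth))
```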
[Figure: PEARL architecture. (1) The private data are embedded once into private embeddings 1…k, with auxiliary information (DP flow, one-shot); (2)–(4) a generator produces synthesized samples 1…k whose embeddings a critic compares against the private embeddings via adversarial reconstruction learning (training flow).]
• One can use DP-SGD to train generative models, but DP-SGD has many problems, as mentioned before.
• An alternative approach: privately compute certain values (embeddings) from the data once (somewhat similar in spirit to marginal-based methods) and train a deep generative model against them; a rough sketch of such a private embedding follows.
(Source: Liew et al., 2022, “PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning”)
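A rough sketch of the spirit of the one-shot DP flow (step (1)): release a noisy empirical characteristic-function embedding of the private data, which the generator can then be trained against without touching the data again. The Gaussian frequency sampling and the worst-case sensitivity bound below are my simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def private_cf_embedding(X, n_freqs=50, sigma=1.0, seed=0):
    """One-shot private embedding (sketch): the empirical characteristic
    function of X at random frequencies, released via the Gaussian
    mechanism. Assumption: bounded (replace-one) adjacency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    freqs = rng.normal(size=(n_freqs, d))       # assumed frequency distribution
    phases = X @ freqs.T                        # (n, n_freqs)
    emb = np.concatenate([np.cos(phases), np.sin(phases)], axis=1).mean(axis=0)
    # Each record contributes a vector of L2 norm sqrt(n_freqs), so
    # replacing one record moves the mean by at most 2 * sqrt(n_freqs) / n.
    sensitivity = 2.0 * np.sqrt(n_freqs) / n
    return emb + rng.normal(scale=sigma * sensitivity, size=emb.shape)
```

The generator would then be trained so that the embedding of its synthesized samples matches this released vector, with no further access to the private data.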
• Training modern deep learning models
• Private synthetic data generation
• Decentralized settings: learning/collecting user information without violating user privacy
• (Not introduced today) Federated analytics: heavy hitters, A/B tests
• Tackling research issues in DP requires a mixture of theoretical and empirical tools and work (which I think is interesting).
• There are still a lot of R&D issues in practice.