R&D Activity
• R&D on Privacy Techs (LINE Data Science Center): Differential Privacy / Federated Learning / …
• R&D on Trustworthy AI
Selected Publications
• Differential Privacy @VLDB22 / SIGMOD22 / ICLR22 / ICDE21
• DP w/ Homomorphic Encryption @BigData22
• Adversarial Attacks @BigData19
• Anomaly/OOD Detection @WWW17 / WACV23
Background
• B.E. / M.E. (CS) and Ph.D. from U. Tsukuba; Visiting Scholar @CMU; Kambayashi Incentive Award
• NEC Central Labs (2010–18): R&D on Data Privacy (2010–15), R&D on AI Security (2016–18)
• LINE (2018.12–): R&D on Privacy Tech (2019–)
Federated Learning w/ Differential Privacy
• This feature is now on your app.
• https://www.youtube.com/watch?v=kTBshg1O7b0
• https://tech-verse.me/ja/sessions/124
Difference Attack
• Publishing statistics before and after a change will reveal the data of specific individuals.
• Example: 30 engineers have avg. salary = 7M JPY; after Alice retired, the remaining 29 engineers have avg. salary = 6.8M JPY.
• Alice's salary can be revealed by simple math: 7M × 30 − 6.8M × 29 = 12.8M JPY (see the sketch below).
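A minimal sketch of the arithmetic behind the attack, using the hypothetical headcounts and averages from the slide:

```python
# Difference attack: subtract the post-retirement total from the original total.
avg_before, n_before = 7.0e6, 30   # avg. salary with Alice, 30 engineers
avg_after,  n_after  = 6.8e6, 29   # avg. salary after Alice retired

alice_salary = avg_before * n_before - avg_after * n_after
print(f"Alice's salary: {alice_salary / 1e6:.1f}M JPY")  # -> 12.8M JPY
```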
A case study of difference attack: Facebook's PII-based Targeting
• Facebook had installed thresholding, rounding, etc. for disclosure control, but these could be bypassed: phone numbers could be estimated from e-mail addresses, and web access histories could be estimated as well.
• The vulnerability has been fixed.
• https://www.youtube.com/watch?v=Lp-IwYvxGpk
• https://www.ftc.gov/system/files/documents/public_events/1223263/p155407privacyconmislove_1.pdf
Database Reconstruction
• Reconstruction works like "sudoku": there are rules, algorithms, and dependencies.
• An example: https://www2.census.gov/about/training-workshops/2021/2021-05-07-das-presentation.pdf
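A toy reconstruction attack in this "sudoku" spirit: brute-force every database consistent with a few published statistics. The statistics below (count, mean, median, youngest, oldest for one block) are hypothetical, chosen so the answer is unique.

```python
from itertools import combinations_with_replacement

stats = dict(n=3, mean=30, median=30, youngest=20, oldest=40)

# Enumerate all multisets of ages and keep those matching every statistic.
candidates = [
    ages for ages in combinations_with_replacement(range(0, 101), stats["n"])
    if (sum(ages) / stats["n"] == stats["mean"]
        and sorted(ages)[1] == stats["median"]
        and min(ages) == stats["youngest"]
        and max(ages) == stats["oldest"])
]
print(candidates)  # -> [(20, 30, 40)] : the block is fully reconstructed
```

Each additional published statistic prunes the candidate set; with enough statistics only the true database remains.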
A case study of reconstruction: US Census 2010
• The Census Bureau's own research found that reconstruction attacks are possible against the published 2010 results.
• https://www2.census.gov/about/training-workshops/2021/2021-05-07-das-presentation.pdf
k-anonymization
• k-anonymity requires that each record is indistinguishable from at least k − 1 other records on its quasi-identifiers.
• Quasi-identifiers: a (predefined) combination of attributes that has a chance to identify individuals.
• https://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity.pdf
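A minimal k-anonymity check, assuming a toy table and the quasi-identifier set {zip, age_band} (both the data and the QI choice are hypothetical):

```python
from collections import Counter

records = [
    {"zip": "130**", "age_band": "20-29", "disease": "flu"},
    {"zip": "130**", "age_band": "20-29", "disease": "cold"},
    {"zip": "130**", "age_band": "30-39", "disease": "flu"},
]
QI = ("zip", "age_band")

def is_k_anonymous(rows, qi, k):
    # Every combination of quasi-identifier values must appear >= k times.
    groups = Counter(tuple(r[a] for a in qi) for r in rows)
    return all(size >= k for size in groups.values())

print(is_k_anonymous(records, QI, k=2))  # False: one QI group has size 1
```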
A case study of de-anonymization: Netflix Prize
• Netflix released pseudonymized ratings data (Pseudo ID, Title, Rating, Review Date); IMDB reviews are public, tied to real identities, and may include, e.g., political interests.
• Unfortunately, the records can be re-identified and linked w/ public data using external knowledge:
  • 8 ratings → identify w/ 99% acc.
  • 2 ratings → identify w/ 68% acc.
• Anonymous Netflix records were thereby identified as real persons (e.g., Mr. A, Ms. B).
On WWDC2016, Craig Federighi (Apple) said, "Differential privacy is a research topic in the area of statistics and data analytics that uses hashing, subsampling and noise injection to enable crowdsourced learning while keeping the data of individual users completely private."
• https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
$\epsilon$-Differential Privacy
• A randomized mechanism $\mathcal{M}$ satisfies $\epsilon$-DP if, for any two neighboring databases $D, D' \in \mathcal{D}$ such that $D'$ differs from $D$ in at most one record, and any subset of outputs $S \subseteq \mathcal{S}$, it holds that
$$\Pr[\mathcal{M}(D) \in S] \le \exp(\epsilon) \cdot \Pr[\mathcal{M}(D') \in S]$$
• $\epsilon$: privacy parameter, privacy budget ($0 \le \epsilon \le \infty$). (Scale: small $\epsilon$ such as 0.5, 1, 2 means strong privacy; large $\epsilon$ such as 4, 8, … means weak privacy.)
• C. Dwork. Differential privacy. ICALP, 2006.
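A minimal sketch of the Laplace mechanism, the classic way to satisfy $\epsilon$-DP for a counting query (sensitivity 1, since neighbors differ in at most one record); the data and predicate are hypothetical:

```python
import numpy as np

def laplace_count(data, predicate, eps, rng=np.random.default_rng()):
    true_count = sum(predicate(x) for x in data)
    # Laplace noise with scale = sensitivity / eps; smaller eps -> more noise.
    return true_count + rng.laplace(loc=0.0, scale=1.0 / eps)

salaries = [10, 20, 5, 3]  # toy data (in M JPY)
print(laplace_count(salaries, lambda s: s >= 10, eps=1.0))
```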
Neighboring Databases
• Differential privacy aims to conceal the difference among the neighbors: $d_H(D, D') = 1$, where $d_H(\cdot,\cdot)$ is the Hamming distance.
• In the most standard case, we assume adding/removing one record.
• Example: for D = {Alice ¥10M, Bob ¥20M, Cynthia ¥5M, David ¥3M}, neighboring databases include D with Eve (¥15M) added, D with Bob removed, D with Cynthia removed, and D with Franc (¥100M) added.
Differential Privacy as Hypothesis Testing
• Assume a game that guesses the input source ($D$ or $D'$) from the randomized output $y = \mathcal{M}(\cdot)$.
• False positive: the true input is $D$ but the guess is $D'$. False negative: the true input is $D'$ but the guess is $D$.
• Empirical differential privacy:
$$\epsilon_{\mathrm{emp}} = \max\left( \log\frac{1 - \delta - \mathrm{FP}}{\mathrm{FN}},\ \log\frac{1 - \delta - \mathrm{FN}}{\mathrm{FP}} \right)$$
• Peter Kairouz, et al. The composition theorem for differential privacy. ICML, 2015.
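A small sketch that evaluates the formula above; the attacker's FP/FN rates are hypothetical, and $\delta = 0$ corresponds to pure $\epsilon$-DP:

```python
import math

def empirical_eps(fp, fn, delta=0.0):
    # Kairouz et al. bound: a weak attacker can only certify a small epsilon.
    return max(math.log((1 - delta - fp) / fn),
               math.log((1 - delta - fn) / fp))

# A near-random attacker (FP = FN = 0.45) certifies only eps ~ 0.20:
print(empirical_eps(fp=0.45, fn=0.45))
```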
Sequential Composition is Loose
• (Plot: $\epsilon$ vs. #compositions — sequential composition grows linearly, far above the ideal curve.)
• Existing composition theorems: Strong Composition, Advanced Composition, Rényi Differential Privacy (RDP), …
• Seeking tighter composition theorems is at the core of DP research.
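A short comparison of sequential composition ($k\epsilon$) against the advanced composition bound of Dwork, Rothblum, and Vadhan, to show how loose the former becomes for many compositions:

```python
import math

def sequential(eps, k):
    return k * eps

def advanced(eps, k, delta_prime=1e-5):
    # Advanced composition: total eps grows ~ sqrt(k) for small eps.
    return (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps
            + k * eps * (math.exp(eps) - 1))

for k in (10, 100, 1000):
    print(k, sequential(0.1, k), round(advanced(0.1, k), 2))
# For large k (e.g., k=1000: 100 vs. ~25.7) the advanced bound is far tighter.
```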
Privacy Preserving Data Synthesis
• Train a generative model on the sensitive dataset, then synthesize data from it.
• Issue: the training process is sensitive to noise since the process is complicated.
• Approach: data embedding that is robust against noise under the DP constraint.
• (Sample synthesized images from the naïve method (VAE w/ DP-SGD), P3GM (ours), and PEARL (ours) at privacy levels ε = 0.2 and ε = 1.0.)

Method                  | Embedding                         | Reconstruction
Naïve (VAE w/ DP-SGD)   | End-to-end w/ DP-SGD              | (end-to-end)
P3GM (ours)             | DP-PCA                            | DP-SGD
PEARL (ours)            | Characteristic function under DP  | Non-private (adversarial)

• High reconstruction performance under a practical privacy level (ε ≤ 1).
• Accepted at ICDE2021 / ICLR2022: https://arxiv.org/abs/2006.12101, https://arxiv.org/abs/2106.04590
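A minimal sketch of one DP-SGD step, the private training primitive referenced in the table above (per-example clipping plus Gaussian noise); the gradients and hyperparameters are toy values:

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1, clip=1.0, sigma=1.0,
                rng=np.random.default_rng()):
    # 1. Clip each example's gradient to L2 norm <= clip.
    clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
               for g in per_example_grads]
    # 2. Sum, then add Gaussian noise scaled to the clipping norm.
    noisy_sum = np.sum(clipped, axis=0) + rng.normal(
        0.0, sigma * clip, size=params.shape)
    # 3. Average and take a gradient step.
    return params - lr * noisy_sum / len(per_example_grads)

params = np.zeros(2)
grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]  # toy per-example grads
print(dp_sgd_step(params, grads))
```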
Privacy-Preserving Mechanism for Collecting Data
• Each client randomizes their own data xᵢ ∈ 𝒳 with mechanism ℳ and sends only the randomized value x̃ᵢ to the server.
• The randomized outputs are indistinguishable, which protects the privacy of individuals.
• No trusted entity is required.
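A minimal sketch of randomized response, the classic instance of such a local mechanism ℳ: report the true bit with probability e^ε/(e^ε + 1), otherwise flip it; the population below is hypothetical.

```python
import math, random

def randomized_response(bit, eps):
    p_true = math.exp(eps) / (math.exp(eps) + 1)
    return bit if random.random() < p_true else 1 - bit

def debias(reports, eps):
    # Unbiased frequency estimate on the server side.
    p = math.exp(eps) / (math.exp(eps) + 1)
    return (sum(reports) / len(reports) + p - 1) / (2 * p - 1)

true_bits = [1] * 300 + [0] * 700                    # true frequency 0.30
reports = [randomized_response(b, eps=1.0) for b in true_bits]
print(debias(reports, eps=1.0))                      # ~= 0.30
```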
Rand. Mech. w/ Probabilistic Data Structure
• These mechanisms estimate frequencies while offering noise resistance and communication efficiency.
• RAPPOR by Google (Bloom filter): https://petsymposium.org/2016/files/papers/Building_a_RAPPOR_with_the_Unknown__Privacy-Preserving_Learning_of_Associations_and_Data_Dictionaries.pdf
• Private Count Mean Sketch by Apple: https://machinelearning.apple.com/research/learning-with-privacy-at-scale
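A toy sketch of the basic RAPPOR idea (heavily simplified, one-shot reporting only): hash the value into a Bloom filter, then apply randomized response to every bit. The parameters M, HASHES, F are illustrative, not Google's production settings.

```python
import hashlib, random

M, HASHES, F = 32, 2, 0.5   # filter size, #hash functions, flip probability

def bloom_bits(value):
    bits = [0] * M
    for i in range(HASHES):
        h = hashlib.sha256(f"{i}:{value}".encode()).hexdigest()
        bits[int(h, 16) % M] = 1
    return bits

def rappor_report(value):
    # Each bit is replaced by a fair coin flip with probability F.
    return [random.randint(0, 1) if random.random() < F else b
            for b in bloom_bits(value)]

print(rappor_report("https://example.com"))
```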
Federated Learning
• Raw data never leaves clients' devices; clients collaboratively train a global model. (Diagram: participating clients update the global model; non-participants of FL stay outside the training loop.)
• First FL paper: https://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf
• Survey paper: https://arxiv.org/abs/1912.04977
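A minimal FedAvg-style sketch of the idea: each client trains locally on its own data, and only model weights, never raw data, reach the server. The linear model and client data are toy assumptions.

```python
import numpy as np

def local_update(weights, data, lr=0.05, epochs=10):
    w = weights.copy()
    for _ in range(epochs):
        for x, y in data:                     # linear model, squared loss
            w -= lr * 2 * (w @ x - y) * x
    return w

def fed_avg(global_w, client_datasets):
    local_ws = [local_update(global_w, d) for d in client_datasets]
    return np.mean(local_ws, axis=0)          # server averages the weights

# 3 clients, each with toy data generated from y = 2x + 1
clients = [[(np.array([1.0, float(x)]), 2.0 * x + 1.0) for x in range(5)]
           for _ in range(3)]
w = np.zeros(2)
for _ in range(30):                           # 30 communication rounds
    w = fed_avg(w, clients)
print(w)                                      # approaches [1, 2]
```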
"Inverting Gradients – How easy is it to break privacy in federated learning?"
• Can we reconstruct an image used in training from a gradient? → Yes.
• https://arxiv.org/abs/2003.14053
Federated Learning under Differential Privacy
• Central model: clients send raw gradients, and the server aggregates and perturbs them w/ noise.
• Local model: clients send randomized gradients, and the server aggregates them.
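A minimal sketch of the local model above: each client clips and randomizes its own gradient before it ever leaves the device, so the server only sees noisy reports. The gradients and noise scale are toy values.

```python
import numpy as np

def randomize_gradient(grad, clip=1.0, sigma=2.0,
                       rng=np.random.default_rng()):
    # Clip on-device, then add Gaussian noise on-device.
    g = grad * min(1.0, clip / (np.linalg.norm(grad) + 1e-12))
    return g + rng.normal(0.0, sigma * clip, size=g.shape)

client_grads = [np.array([0.5, -1.0]), np.array([2.0, 0.3])]
reports = [randomize_gradient(g) for g in client_grads]   # sent to server
print(np.mean(reports, axis=0))   # server only ever sees noisy gradients
```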
Shuffle model – an intermediate privacy model
• Each client randomizes their data w/ ε₀ and encrypts the randomized content w/ the server's public key; the shuffler then only mixes their identities w/o looking at the contents.
• The shuffled batch (x̃₁, x̃₂, …, x̃ₙ) is sent to the server anonymized.
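A toy simulation of the pipeline above: clients randomize with a local ε₀, the shuffler permutes the batch, and the server sees only the anonymized multiset. Encryption is omitted in this sketch, and the population is hypothetical.

```python
import math, random

def randomized_response(bit, eps0):
    p = math.exp(eps0) / (math.exp(eps0) + 1)
    return bit if random.random() < p else 1 - bit

clients = [random.randint(0, 1) for _ in range(1000)]
reports = [randomized_response(b, eps0=2.0) for b in clients]
random.shuffle(reports)      # shuffler: severs the identity <-> report link
print(sum(reports))          # server aggregates the anonymous batch
```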
Network Shuffling
• In each round, every client relays her randomized reports to one of her neighbors (e.g., friends on a social network) via an encrypted channel.
• The larger the graph, the greater the privacy amplification.
• Accepted at SIGMOD2022: https://arxiv.org/abs/2204.03919
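A toy sketch of the relay idea only (not the paper's full protocol): each report hops to a uniformly random neighbor every round, so after enough rounds the final holder reveals little about the originator. The graph is hypothetical.

```python
import random

# Friend lists of a tiny social graph (hypothetical).
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
holder = {client: client for client in graph}   # report i starts at client i

for _ in range(10):                             # 10 relay rounds
    for report in holder:
        holder[report] = random.choice(graph[holder[report]])

print(holder)   # final holders submit the reports; origins are mixed in
```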
Topics in this lecture
• Differential Privacy (Central Model)
  • Query Release via Laplace Mechanism
  • Machine Learning via DP-SGD
• Local Differential Privacy
  • Stats Gathering via Randomized Response
  • Federated Learning via LDP-SGD
• Shuffle Model – an intermediate privacy model