Tsubasa TAKAHASHI, Ph.D.
Senior Research Scientist at LINE Data Science Center

R&D Activity
• R&D on Privacy Techs (LINE Data Science Center)
• Differential Privacy / Federated Learning / …
• R&D on Trustworthy AI

Selected Publications
• Differential Privacy @VLDB22 / SIGMOD22 / ICLR22 / ICDE21
• DP w/ Homomorphic Encryption @BigData22
• Adversarial Attacks @BigData19
• Anomaly/OOD Detection @WWW17 / WACV23

Career
• B.E. / M.E. (CS) from U. Tsukuba; Ph.D. from U. Tsukuba
• Visiting Scholar @CMU
• Kambayashi Award (上林奨励賞)
• NEC Central Labs (2010~18): R&D on Data Privacy (2010~15), R&D on AI Security (2016~18)
• LINE (2018.12~): R&D on Privacy Tech (2019~)
LINE's R&D on Privacy Techs
• Publications at major database and machine learning conferences
• These achievements are based on collaborations w/ academia
https://linecorp.com/ja/pr/news/ja/2022/4269
Federated Learning w/ Differential Privacy
• Released in late September 2022
• The sticker recommendation feature trained with federated learning is now on your app
https://www.youtube.com/watch?v=kTBshg1O7b0
https://tech-verse.me/ja/sessions/124
Privacy Techs are an "Innovation Trigger"
Market trends: the 2021 Gartner Hype Cycle for Privacy
https://infocert.digital/analyst-reports/2021-gartner-hype-cycle-for-privacy/
Difference Attack
• Even when only statistical information is disclosed, the "difference" can reveal the data of specific individuals.
• Example: 30 engineers have an avg. salary of 7M JPY; after Alice retires, the remaining 29 engineers have an avg. salary of 6.8M JPY.
• Alice's salary can be revealed with simple math: 7M x 30 − 6.8M x 29 = 12.8M JPY.
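A minimal sketch of the arithmetic above, using only the values from the example:

```python
# Difference attack: recover Alice's salary from two published averages.
n_before, avg_before = 30, 7_000_000   # 30 engineers, avg. salary 7M JPY
n_after, avg_after = 29, 6_800_000     # after Alice retires

alice_salary = n_before * avg_before - n_after * avg_after
print(alice_salary)                    # 12800000 JPY
```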
A case study of difference attack: Facebook's PII-based Targeting
• From the stats delivered to advertisers, various user info could be estimated (phone number estimation from e-mail address, web access history estimation)
• Facebook had installed thresholding, rounding, etc. for disclosure control, but they could be bypassed
• The vulnerability has been fixed
https://www.youtube.com/watch?v=Lp-IwYvxGpk
https://www.ftc.gov/system/files/documents/public_events/1223263/p155407privacyconmislove_1.pdf
Database Reconstruction
• Recreation of individual-level data from tabular or aggregate data
• Reconstruction works like solving a sudoku: there are rules, algorithms and dependencies
https://www2.census.gov/about/training-workshops/2021/2021-05-07-das-presentation.pdf
A case study of reconstruction: US Census 2010
• The US Census reports various stats for policy-making and academic research
• For the results of 2010, the Census Bureau found that reconstruction attacks are possible
https://www2.census.gov/about/training-workshops/2021/2021-05-07-das-presentation.pdf
k-anonymization
• k-anonymization: every record has at least k−1 other records sharing the same quasi-identifiers
• Quasi-identifiers: a (predefined) combination of attributes that has a chance to identify individuals
https://dataprivacylab.org/dataprivacy/projects/kanonymity/kanonymity.pdf
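A minimal sketch of checking k-anonymity with pandas; the column names and the toy records are illustrative assumptions:

```python
import pandas as pd

def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list, k: int) -> bool:
    """True iff every combination of quasi-identifier values appears in at least k records."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())

# Hypothetical toy table: zip, age bucket and gender act as quasi-identifiers.
records = pd.DataFrame({
    "zip":    ["130**", "130**", "130**", "141**", "141**", "141**"],
    "age":    ["20-29", "20-29", "20-29", "30-39", "30-39", "30-39"],
    "gender": ["F", "F", "F", "M", "M", "M"],
    "salary": [5, 7, 6, 10, 20, 3],   # sensitive attribute (millions of JPY)
})
print(is_k_anonymous(records, ["zip", "age", "gender"], k=3))   # True
```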
k-anonymization is vulnerable
• k-anonymization is effective only under the assumed adversary's knowledge
• By linking external knowledge, an adversary can achieve re-identification
A case study of de-anonymization: Netflix Prize
• Data analytics competition that published a dataset with identifiers removed
• Unfortunately, the records can be re-identified and linked w/ public data
• Linking the pseudonymized Netflix data (pseudo ID, title, rating, review date) with public IMDB reviews (title, rating, comments, which sometimes include political interests) identified anonymous users as real persons
• 8 ratings → identify w/ 99% acc.; 2 ratings → identify w/ 68% acc.
What is Differential Privacy?
At WWDC 2016, Craig Federighi (Apple) said,
"Differential privacy is a research topic in the area of statistics and data analytics that uses hashing, subsampling and noise injection to enable crowdsourced learning while keeping the data of individual users completely private."
https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
ε-Differential Privacy
A randomized mechanism ℳ: 𝒟 → 𝒮 satisfies ε-DP if, for any two neighboring databases D, D′ ∈ 𝒟 such that D′ differs from D in at most one record, and for any subset of outputs S ⊆ 𝒮, it holds that

Pr[ℳ(D) ∈ S] ≤ exp(ε) · Pr[ℳ(D′) ∈ S]

ε: privacy parameter, privacy budget (0 ≤ ε ≤ ∞); smaller ε (e.g., 0.5, 1) means stronger privacy, larger ε (e.g., 4, 8, …) weaker privacy.
C. Dwork. Differential privacy. ICALP, 2006.
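To make the definition concrete, here is a minimal sketch of the Laplace mechanism for a counting query (covered later under "Query Release via Laplace Mechanism"); the data and query are illustrative:

```python
import numpy as np

def laplace_count(data: np.ndarray, predicate, epsilon: float) -> float:
    """ε-DP counting query via the Laplace mechanism.

    Adding or removing one record changes the count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/ε yields ε-differential privacy."""
    true_count = float(np.sum(predicate(data)))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

salaries = np.array([10, 20, 5, 3, 15])                         # in millions of JPY
print(laplace_count(salaries, lambda x: x >= 10, epsilon=1.0))  # noisy count of high earners
```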
Neighboring Databases
• Any pair of databases that differ in only one record: d_H(D, D′) = 1, where d_H(·,·) is the Hamming distance
• Differential privacy aims to conceal the difference among the neighbors
• In the most standard case, we assume adding/removing one record (e.g., a salary table {Alice, Bob, Cynthia, David} becomes a neighbor by adding Eve or Franc, or by removing Bob or Cynthia)
Differential Privacy as Hypothesis Testing
• An interpretation of DP in view of statistical testing
• Assume a game that guesses the input source (D or D′) from the randomized output y = ℳ(·)
• False positive (FP): the true input is D but the guess is D′; false negative (FN): the true input is D′ but the guess is D

Empirical differential privacy:
ε_emp = max( log((1 − δ − FP) / FN), log((1 − δ − FN) / FP) )

Peter Kairouz, et al. The composition theorem for differential privacy. ICML 2015.
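A small sketch that plugs observed attack error rates into the formula above; the example rates are made up:

```python
import math

def empirical_epsilon(fp: float, fn: float, delta: float = 0.0) -> float:
    """Empirical ε implied by an attacker's false-positive / false-negative rates
    in the D-vs-D′ guessing game (hypothesis-testing view of (ε, δ)-DP)."""
    return max(
        math.log((1.0 - delta - fp) / fn),
        math.log((1.0 - delta - fn) / fp),
    )

# Example: a distinguishing attack with 10% false positives and 30% false negatives.
print(empirical_epsilon(fp=0.10, fn=0.30))   # ≈ 1.95
```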
Sequential Composition is Loose
• Sequential composition is the most conservative upper bound → very loose
• Seeking tighter composition theorems is a core topic of DP research
• Existing composition theorems: Strong Composition, Advanced Composition, Rényi Differential Privacy (RDP), …
(Figure: total ε vs. #compositions; sequential composition grows far above the ideal curve)
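A minimal comparison of the basic (sequential) bound and the advanced composition bound of Dwork & Roth for k adaptive runs of an (ε, δ)-DP mechanism; the parameter values are illustrative:

```python
import math

def sequential_composition(eps: float, k: int) -> float:
    """Basic composition: the total budget grows linearly, k·ε."""
    return k * eps

def advanced_composition(eps: float, k: int, delta_prime: float) -> float:
    """Advanced composition: the k-fold composition is (ε', kδ + δ')-DP with
    ε' = ε·sqrt(2k·ln(1/δ')) + k·ε·(e^ε − 1)."""
    return eps * math.sqrt(2 * k * math.log(1 / delta_prime)) + k * eps * (math.exp(eps) - 1)

eps, k = 0.01, 1000
print(sequential_composition(eps, k))                   # 10.0
print(advanced_composition(eps, k, delta_prime=1e-5))   # ≈ 1.6, a much tighter total ε
```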
HDPView: A Differentially Private View
• Construction of an intermediate privatized "view" (P-view) towards actualizing arbitrary query responses with smaller noise
• Properties: Noise Resistance, Space Efficiency, Query Agnostic, Analytical Reliability
Accepted at VLDB 2022
https://arxiv.org/abs/2203.06791
DP-SGD: Differentially Private Stochastic Gradient Descent
• Get randomized model parameters by randomizing the gradient
• Employ gradient clipping, since the sensitivity of raw gradients is intractable

Non-private SGD over a sensitive database D (starting from θ₀, repeated until convergence): sample a batch, compute the gradient g_t = ∇_θ ℒ(x; θ_t), and update the parameters θ_{t+1} = θ_t − η·g_t.

https://arxiv.org/abs/1607.00133
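A minimal numpy sketch of the DP-SGD recipe (per-example clipping plus Gaussian noise) for a linear model with squared loss; the model, loss, and hyperparameters are illustrative assumptions, not the exact algorithm of the paper:

```python
import numpy as np

def dp_sgd_step(theta, X_batch, y_batch, lr=0.1, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD step: clip each per-example gradient to L2 norm <= clip_norm
    (bounding the sensitivity), sum, add Gaussian noise, average, then update."""
    grads = []
    for x, y in zip(X_batch, y_batch):
        g = 2 * (x @ theta - y) * x                       # per-example gradient (squared loss)
        g = g / max(1.0, np.linalg.norm(g) / clip_norm)   # gradient clipping
        grads.append(g)
    noise = np.random.normal(0.0, noise_multiplier * clip_norm, size=theta.shape)
    noisy_mean = (np.sum(grads, axis=0) + noise) / len(X_batch)
    return theta - lr * noisy_mean

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
theta = np.zeros(5)
theta = dp_sgd_step(theta, X, y)
print(theta)
```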
Privacy Preserving Data Synthesis
• Training a data synthesis model that imitates the original sensitive dataset
• Issue: the training process is sensitive to noise since the process is complicated
• Approach: data embedding that is robust against noise under the DP constraint
• Naïve method (VAE w/ DP-SGD): end-to-end embedding w/ DP-SGD, reconstruction w/ DP-SGD
• P3GM (ours): embedding via DP-PCA, reconstruction w/ DP-SGD
• PEARL (ours): embedding via characteristic functions under DP, non-private (adversarial) reconstruction
• High reconstruction performance under a practical privacy level (ε ≤ 1)
Accepted at ICDE 2021 / ICLR 2022
https://arxiv.org/abs/2006.12101
https://arxiv.org/abs/2106.04590
Privacy-Preserving Mechanism for Collecting Data
• A privacy-preserving mechanism allows inferring statistics about populations while preserving the privacy of individuals
• No trusted entity is required
(Figure: each client randomizes its value x_i ∈ 𝒳 with ℳ and sends the randomized report to the server; original and randomized values are indistinguishable)
(Central) DP vs Local DP
• Central DP: clients send their raw values x_i to a trusted server, which applies ℳ; neighboring DB: add/remove one record
• Local DP: each client randomizes x_i with ℳ before sending, so the server is not required to be trusted; neighboring DB: replacement of one value
Stats Gathering w/ Privacy at Scale
• Examples on synthetic data (N randomized reports over a domain of 100 items)
• Errors are significantly reduced when gathering more randomized reports (N = 10,000 vs. N = 10,000,000)
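A minimal sketch of stats gathering with k-ary randomized response and the standard unbiased frequency estimator; the domain size, ε, and N are illustrative:

```python
import numpy as np

def k_rr(x: int, k: int, epsilon: float, rng) -> int:
    """k-ary randomized response: keep the true item with prob e^ε/(e^ε+k−1),
    otherwise report one of the other k−1 items uniformly at random."""
    if rng.random() < np.exp(epsilon) / (np.exp(epsilon) + k - 1):
        return x
    return int(rng.choice([v for v in range(k) if v != x]))

def estimate_frequencies(reports: np.ndarray, k: int, epsilon: float) -> np.ndarray:
    """Unbiased estimate: invert the known randomization probabilities."""
    n = len(reports)
    p = np.exp(epsilon) / (np.exp(epsilon) + k - 1)   # P[report = true item]
    q = 1.0 / (np.exp(epsilon) + k - 1)               # P[report = a specific other item]
    return (np.bincount(reports, minlength=k) - n * q) / (p - q)

rng = np.random.default_rng(0)
k, epsilon, n = 100, 2.0, 100_000
true_items = rng.integers(0, k, size=n)
reports = np.array([k_rr(x, k, epsilon, rng) for x in true_items])
print(estimate_frequencies(reports, k, epsilon)[:5])   # each entry ≈ 1,000
```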
Rand. Mech. w/ Probabilistic Data Structure
• Probabilistic data structures are very useful for frequency estimation, offering both noise resistance and communication efficiency
• RAPPOR by Google (Bloom filter)
• Private Count Mean Sketch by Apple
https://petsymposium.org/2016/files/papers/Building_a_RAPPOR_with_the_Unknown__Privacy-Preserving_Learning_of_Associations_and_Data_Dictionaries.pdf
https://machinelearning.apple.com/research/learning-with-privacy-at-scale
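To show why such sketches help, here is a simplified, non-private count-mean-sketch-style toy (the real RAPPOR / Apple mechanisms apply randomized response on top of the hashed encodings); the hash construction, sizes, and item names are all illustrative:

```python
import hashlib
import numpy as np

M, W = 16, 1024   # number of hash functions (rows) and sketch width (columns)

def h(item: str, row: int) -> int:
    """Row-specific hash of an item into [0, W)."""
    return int.from_bytes(hashlib.sha256(f"{row}:{item}".encode()).digest()[:4], "big") % W

def update(sketch, row_counts, item: str) -> None:
    """Each client sends only one bucket of one randomly chosen row (cheap to transmit)."""
    row = np.random.randint(M)
    sketch[row, h(item, row)] += 1
    row_counts[row] += 1

def estimate(sketch, row_counts, item: str) -> float:
    """Count-mean-sketch style estimator: debias hash collisions, average over rows."""
    est = [(W / (W - 1)) * (sketch[r, h(item, r)] - max(row_counts[r], 1) / W) for r in range(M)]
    return M * float(np.mean(est))

sketch, row_counts = np.zeros((M, W)), np.zeros(M)
for _ in range(5000):
    update(sketch, row_counts, "sticker_A")
for _ in range(1000):
    update(sketch, row_counts, "sticker_B")
print(estimate(sketch, row_counts, "sticker_A"))   # ≈ 5000
```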
Federated Learning
• Collaborative learning w/ a server and clients (some users remain non-participants of FL)
• Raw data never leaves clients' devices; only model updates are sent to the server's global model
First FL paper: https://proceedings.mlr.press/v54/mcmahan17a/mcmahan17a.pdf
Survey paper: https://arxiv.org/abs/1912.04977
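A minimal numpy sketch of the federated averaging (FedAvg) idea from the first FL paper; the model (linear, squared loss), client data, and hyperparameters are illustrative:

```python
import numpy as np

def client_update(theta_global, X, y, lr=0.1, local_steps=5):
    """Local training on one client's private data; only the updated model leaves the device."""
    theta = theta_global.copy()
    for _ in range(local_steps):
        grad = 2 * X.T @ (X @ theta - y) / len(y)    # squared-loss gradient, computed locally
        theta -= lr * grad
    return theta

def server_aggregate(client_models, client_sizes):
    """FedAvg: weighted average of client models by local dataset size."""
    weights = np.array(client_sizes) / np.sum(client_sizes)
    return np.sum([w * m for w, m in zip(weights, client_models)], axis=0)

rng = np.random.default_rng(0)
theta = np.zeros(3)                                                    # global model
clients = [(rng.normal(size=(50, 3)), rng.normal(size=50)) for _ in range(4)]

for _ in range(10):                                                    # communication rounds
    local_models = [client_update(theta, X, y) for X, y in clients]
    theta = server_aggregate(local_models, [len(y) for _, y in clients])
print(theta)
```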
Gradient Inversion - Privacy Issues in FL
• Can we reconstruct an image used in training from a gradient? → Yes.
(Source) "Inverting Gradients - How easy is it to break privacy in federated learning?" https://arxiv.org/abs/2003.14053
Federated Learning under Differential Privacy
• Central model: clients send raw gradients and the server aggregates them w/ noise injection
• Local model: clients send randomized gradients and the server aggregates them
LDP-SGD
• Randomized response via randomizing the gradient's direction
• Randomly select the green zone or the white zone, and then uniformly pick a vector from the selected zone
https://arxiv.org/abs/2001.03618
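A minimal sketch of the direction-randomization idea, reading the "green zone" as the hemisphere aligned with the true gradient and the "white zone" as the opposite hemisphere; this is an illustrative simplification, not the paper's exact mechanism (which also debiases and rescales the reports):

```python
import numpy as np

def randomize_direction(grad, epsilon, rng):
    """Report a uniform random unit vector: from the hemisphere aligned with the gradient
    with prob e^ε/(e^ε+1), otherwise from the opposite hemisphere."""
    g = grad / (np.linalg.norm(grad) + 1e-12)      # keep only the direction
    v = rng.normal(size=g.shape)
    v /= np.linalg.norm(v)                         # uniform point on the unit sphere
    keep_aligned = rng.random() < np.exp(epsilon) / (np.exp(epsilon) + 1.0)
    if (v @ g >= 0) != keep_aligned:
        v = -v                                     # reflect into the selected hemisphere
    return v

rng = np.random.default_rng(0)
grad = np.array([3.0, -1.0, 2.0])
reports = np.array([randomize_direction(grad, epsilon=1.0, rng=rng) for _ in range(10_000)])
print(reports.mean(axis=0))   # points roughly along grad's direction, but each report is noisy
```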
Empirical Privacy Measurement in LDP-SGD
• Empirical measurement with instantiated adversaries for LDP-SGD
• The worst case, flipping the gradient direction, reaches the theoretical bound
https://arxiv.org/abs/2206.09122
Issues in Local DP
• LDP enables us to collect users' data in a privatized way, but the amount of noise tends to be prohibitive
Shuffle model – an intermediate privacy model
• An intermediate trusted entity, the "shuffler", anonymizes local users' identities
• Each client randomizes its report with ε₀ and encrypts it w/ the server's public key; the shuffler only mixes the identities w/o looking at the contents, then sends the shuffled, anonymized batch to the server
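A minimal simulation of the data flow (local randomization, then an identity-stripping shuffle); the binary randomized response and the omission of real encryption are simplifying assumptions:

```python
import math
import random

def local_randomize(bit: int, epsilon0: float) -> int:
    """Binary randomized response with local budget ε0."""
    p_keep = math.exp(epsilon0) / (math.exp(epsilon0) + 1.0)
    return bit if random.random() < p_keep else 1 - bit

def shuffler(reports_with_ids):
    """Strip client identities and randomly permute the batch.
    (In a real deployment the contents are encrypted to the server, so the
    shuffler mixes identities without ever seeing the values.)"""
    reports = [report for _, report in reports_with_ids]
    random.shuffle(reports)
    return reports

clients = {f"client_{i}": random.randint(0, 1) for i in range(1000)}
randomized = [(cid, local_randomize(v, epsilon0=2.0)) for cid, v in clients.items()]
anonymized_batch = shuffler(randomized)          # the server sees values only, in random order
print(sum(anonymized_batch) / len(anonymized_batch))
```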
Privacy Amplification via Shuffling
• The shuffler can amplify differential privacy → possibility to decrease the local noise
• The amplification on the shuffler translates LDP on the clients into CDP: reports randomized w/ ε₀ (LDP) yield a shuffled batch satisfying ε < ε₀ (CDP)
• Example with k-randomized response (k = 10, ε₀ = 8 under LDP), by "Hiding Among the Clones"
https://arxiv.org/abs/2012.12803
Shuffle Model in Federated Learning
• Using a shuffler and sub-sampling, FL can also employ privacy amplification
• Clients randomly check in to federated learning at each iteration
• Sub-sampling & shuffling in FL: e.g., ε_ldp = 8 on the clients amplifies to ε_cdp = 1
• Higher accuracy at a strong privacy level (smaller ε)
https://arxiv.org/abs/2206.03151
Network Shuffling
• Decentralized shuffling via multi-round random walks on a graph
• In each round, every client relays her randomized reports to one of her neighbors (e.g., friends on a social network) via an encrypted channel
• The larger the graph, the more privacy is amplified
Accepted at SIGMOD 2022
https://arxiv.org/abs/2204.03919
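A minimal sketch of the relay step on a toy graph (plain adjacency dict, no encryption); the graph, number of rounds, and report labels are illustrative:

```python
import random

# Toy social graph: every node relays the (already randomized) reports it currently holds.
graph = {
    "A": ["B", "C"], "B": ["A", "C", "D"],
    "C": ["A", "B", "E"], "D": ["B", "E"], "E": ["C", "D"],
}

def network_shuffle(initial_reports: dict, rounds: int) -> dict:
    """Multi-round random walk: after enough rounds, which node holds which report
    is nearly independent of its origin, giving the shuffling effect."""
    held = {node: list(reports) for node, reports in initial_reports.items()}
    for _ in range(rounds):
        next_held = {node: [] for node in graph}
        for node, reports in held.items():
            for r in reports:
                next_held[random.choice(graph[node])].append(r)  # relay to a random neighbor
        held = next_held
    return held

initial = {node: [f"report_from_{node}"] for node in graph}
print(network_shuffle(initial, rounds=10))   # reports scattered, decoupled from their origin
```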
Topics in this lecture
• Privacy Risks, Issues, and Case-studies
• Differential Privacy (Central Model)
• Query Release via Laplace Mechanism
• Machine Learning via DP-SGD
• Local Differential Privacy
• Stats Gathering via Randomized Response
• Federated Learning via LDP-SGD
• Shuffle Model – an intermediate privacy model