High-Dimensional Data Release via Privacy Preserving Phased Generative Model (ICDE 2021) • Liew, Takahashi, Ueno. PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning (ICLR 2022). Tsubasa Takahashi, Seng Pei Liew, Satoshi Hasegawa. LINE Corporation, Data Science Center, ML Privacy Team
At WWDC 2016, Craig Federighi (Apple) described differential privacy [1] as a “research topic in the area of statistics and data analytics that uses hashing, subsampling and noise injection to enable crowdsourced learning while keeping the data of individual users completely private.” [2]
[1] C. Dwork. Differential privacy. ICALP, 2006.
[2] https://www.wired.com/2016/06/apples-differential-privacy-collecting-data/
[4] N. Hoshino. A firm foundation for statistical disclosure control. Japanese Journal of Statistics and Data Science, 3(2):721–746, 2020. (Source: Tables 3 and 4 of [4])
Adjacent databases: a pair of databases that differ in exactly one element. Here only the addition or deletion of a single element is considered: $d_H(D, D') = 1$, where $d_H(\cdot, \cdot)$ is the Hamming distance. Some of the adjacent databases of D = {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes)} (records are (NAME, Cancer) pairs):
• Remove Bob: {(Alice, Yes), (Cynthia, No), (David, Yes)}
• Remove Cynthia: {(Alice, Yes), (Bob, No), (David, Yes)}
• Add (Franc, No): {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes), (Franc, No)}
A mechanism $\mathcal{M}$ satisfies ε-differential privacy [1] if, for every pair of adjacent databases $D, D' \in \mathcal{D}$ and every set of outputs $S \subseteq \mathrm{Range}(\mathcal{M})$,
$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].$$
Running $\mathcal{M}$ on adjacent databases (e.g., D, D with Eve added, D with Bob removed) yields outputs that can be distinguished only up to a factor of $e^{\varepsilon}$, so the difference between the inputs is also hard to distinguish.
[1] C. Dwork. Differential privacy. ICALP, 2006.
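To make the definition concrete, here is a minimal Python sketch (my illustration, not from the slides) of randomized response, the textbook ε-DP mechanism for releasing a single bit; answering truthfully with probability $e^{\varepsilon}/(1+e^{\varepsilon})$ makes the ratio of output probabilities exactly $e^{\varepsilon}$.

```python
import math
import random

def randomized_response(true_bit: int, epsilon: float) -> int:
    """Release a sensitive bit under epsilon-DP: answer truthfully with
    probability e^eps / (1 + e^eps), otherwise flip the bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_bit if random.random() < p_truth else 1 - true_bit

# For epsilon = ln 3, p_truth = 3/4, so for any output o,
# Pr[o | bit = 1] / Pr[o | bit = 0] <= (3/4) / (1/4) = 3 = exp(epsilon).
print(randomized_response(1, math.log(3)))
```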
For (ε, δ)-differential privacy, the requirement is relaxed to $\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S] + \delta$.
Bounded differential privacy [5] uses a different notion of adjacency: only the change of a single tuple's value is considered. The adjacent databases of D = {(Alice, Yes), (Bob, No), (Cynthia, No), (David, Yes)} are obtained by flipping exactly one value, e.g., (Bob, Yes), (Alice, No), (Cynthia, Yes), or (David, No).
[5] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. SIGMOD, 2011.
In federated learning, raw data never leaves the device.
• Low-cost communication
• Tailor-made personalization
• The server orchestrates the FL procedure and learns the global model via aggregation
• Mitigates cold-start issues
• Free from storing huge amounts of client data
• Immediate adaptation
Subsampling and shuffling give better privacy amplification guarantees, provable theoretically. (Source: Feldman et al., 2021, “A Simple and Nearly Optimal Analysis of Privacy Amplification by Shuffling”)
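The shuffling analysis of Feldman et al. is involved, but the flavor of amplification is easy to see with the standard subsampling bound: running an ε-DP mechanism on a Poisson subsample with sampling rate q satisfies $\log(1 + q(e^{\varepsilon} - 1))$-DP. A minimal sketch (my illustration, assuming this standard bound):

```python
import math

def amplified_epsilon(eps: float, q: float) -> float:
    """Privacy amplification by Poisson subsampling: an eps-DP mechanism
    applied to a subsample drawn with rate q satisfies
    log(1 + q * (exp(eps) - 1))-DP."""
    return math.log(1.0 + q * (math.exp(eps) - 1.0))

# A 1% sampling rate shrinks eps = 1.0 to roughly 0.017;
# for small q the amplified epsilon is roughly linear in q.
print(amplified_epsilon(1.0, 0.01))
```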
Better privacy accounting under composition requires new tools such as Rényi differential privacy (RDP) and the privacy loss distribution (PLD). (Sources: arXiv:2012.12803, arXiv:2106.02848, arXiv:2106.08567)
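As an illustration of why RDP helps under composition, here is a minimal sketch (my own, assuming the standard Gaussian-mechanism RDP bound $\varepsilon(\alpha) = \alpha/(2\sigma^2)$ for sensitivity 1 and Mironov's RDP-to-(ε, δ) conversion; production code should use a vetted accounting library):

```python
import math

def rdp_gaussian(alpha: float, sigma: float) -> float:
    """RDP of the Gaussian mechanism with L2 sensitivity 1:
    eps(alpha) = alpha / (2 * sigma^2)."""
    return alpha / (2.0 * sigma ** 2)

def composed_epsilon(sigma: float, steps: int, delta: float) -> float:
    """Compose `steps` Gaussian mechanisms: RDP adds linearly across
    compositions, then convert to (eps, delta)-DP via
    eps = eps_rdp(alpha) + log(1/delta) / (alpha - 1), minimized over alpha."""
    alphas = [1.0 + x / 10.0 for x in range(1, 1000)]  # grid over alpha > 1
    return min(steps * rdp_gaussian(a, sigma)
               + math.log(1.0 / delta) / (a - 1.0) for a in alphas)

print(composed_epsilon(sigma=5.0, steps=100, delta=1e-5))
```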
Privacy amplification schemes tailored to federated learning have also been considered.
• Problem: sampling of clients by the server is impractical (a client may not respond in time, its battery may run out, etc.).
• The protocol should rely only on randomness generated independently by each client (random check-in).
(Source: Balle et al., 2020, “Privacy Amplification via Random Check-Ins”)
With random check-ins, a privacy amplification guarantee comparable to that of shuffling is obtainable.
• Note: a centralized, trusted orchestrating server is still required to hide the identities of participating clients.
(Source: Balle et al., 2020, “Privacy Amplification via Random Check-Ins”)
Training modern deep learning models with DP.
• The main problem of DP-SGD: utility degrades with the dimensionality of the model (a minimal sketch of one DP-SGD step follows this list).
• A CNN for MNIST usually contains ~1 million parameters or fewer.
• RoBERTa-Large (NLP) contains 354 million parameters.
• We relax the problem by considering fine-tuning (with DP) models pre-trained on public data.
• Models like BERT are usually pre-trained on public data (the Web).
• Image classification models can be pre-trained on, e.g., CIFAR-100.
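A minimal numpy sketch of a single DP-SGD step (my illustration; logistic regression is used only to keep the per-example gradients short). The key point is that the Gaussian noise is added to a d-dimensional vector, so its expected norm grows with the model dimensionality d:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_multiplier=1.0):
    """One DP-SGD step for logistic regression: clip each per-example
    gradient to L2 norm `clip`, sum, add Gaussian noise with standard
    deviation noise_multiplier * clip, then average and descend."""
    preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid predictions
    grads = (preds - y)[:, None] * X               # per-example gradients, (n, d)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads / np.maximum(1.0, norms / clip)
    noisy = clipped.sum(axis=0) + np.random.normal(
        scale=noise_multiplier * clip, size=w.shape)
    return w - lr * noisy / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
y = (rng.random(64) > 0.5).astype(float)
w = dp_sgd_step(np.zeros(10), X, y)
```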
One can obtain a smaller NN without losing much utility by pruning the NN appropriately.
• Freeze pre-trained weights according to the lottery ticket hypothesis and update only the unfrozen weights (see the sketch below).
(Source: Luo et al., 2021, “Scalable Differential Privacy With Sparse Network Finetuning”)
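A minimal PyTorch sketch of the freeze-and-fine-tune idea (my illustration of the general recipe, not the exact pruning procedure of Luo et al.): only the unfrozen parameters receive gradient updates, shrinking the dimensionality that DP-SGD must noise.

```python
import torch.nn as nn
from torch.optim import SGD

# Hypothetical pre-trained network; in practice e.g. a model pre-trained
# on CIFAR-100 or a public Web corpus.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# Freeze everything except the final layer; frozen weights keep their
# pre-trained values and receive no (noisy) updates.
for param in model[:-1].parameters():
    param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = SGD(trainable, lr=0.1)  # DP-SGD would noise only these params
```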
• Marginal-based methods: privately measure a subset of low-order marginals to fit graphical models (e.g., PrivBayes).
• Deep generative models: train deep generative models such as Generative Adversarial Networks (GANs) with DP.
How to measure the “usefulness” of synthetic data?
• Test the performance of machine learning models trained on synthetic data (classification accuracy, F1 score).
• Measure the discrepancy between the real and synthetic data distributions (two-sample tests, maximum mean discrepancy (MMD), etc.; a small MMD sketch follows).
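As an example of the distribution-discrepancy route, here is a small numpy sketch (my illustration) of a plug-in estimate of the squared maximum mean discrepancy with an RBF kernel; values near zero mean the synthetic data look distributionally close to the real data.

```python
import numpy as np

def mmd2_rbf(X, Y, bandwidth=1.0):
    """Plug-in (biased) estimate of squared MMD between samples X and Y
    under an RBF kernel."""
    def gram(A, B):
        sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return gram(X, X).mean() + gram(Y, Y).mean() - 2.0 * gram(X, Y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 5))
synth = rng.normal(0.1, 1.0, size=(200, 5))   # stand-in for generated data
print(mmd2_rbf(real, synth))
```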
[Figure: PEARL architecture. (1) The private data are embedded once into private embeddings 1…k, with auxiliary information (DP flow, one-shot); (2)–(4) a generator produces synthesized samples 1…k whose embeddings a critic compares against the private embeddings via adversarial reconstruction learning (training flow).]
• One can use DP-SGD to train generative models, but DP-SGD has many problems, as mentioned before.
• An alternative approach: privately compute certain values (embeddings) from the data once (somewhat similar in spirit to marginal-based methods) and train a deep generative model against them; a rough sketch of such a private embedding follows.
(Source: Liew et al., 2022, “PEARL: Data Synthesis via Private Embeddings and Adversarial Reconstruction Learning”)
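A rough sketch of the spirit of the one-shot DP flow (step (1)): release a noisy empirical characteristic-function embedding of the private data, which the generator can then be trained against without touching the data again. The Gaussian frequency sampling and the worst-case sensitivity bound below are my simplifying assumptions, not the paper's exact construction.

```python
import numpy as np

def private_cf_embedding(X, n_freqs=50, sigma=1.0, seed=0):
    """One-shot private embedding (sketch): the empirical characteristic
    function of X at random frequencies, released via the Gaussian
    mechanism. Assumption: bounded (replace-one) adjacency."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    freqs = rng.normal(size=(n_freqs, d))       # assumed frequency distribution
    phases = X @ freqs.T                        # (n, n_freqs)
    emb = np.concatenate([np.cos(phases), np.sin(phases)], axis=1).mean(axis=0)
    # Each record contributes a vector of L2 norm sqrt(n_freqs), so
    # replacing one record moves the mean by at most 2 * sqrt(n_freqs) / n.
    sensitivity = 2.0 * np.sqrt(n_freqs) / n
    return emb + rng.normal(scale=sigma * sensitivity, size=emb.shape)
```

The generator would then be trained so that the embedding of its synthesized samples matches this released vector, with no further access to the private data.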
• Training modern deep learning models
• Private synthetic data generation
• Decentralized settings: learning/collecting user information without violating user privacy
• (Not introduced today) Federated analytics: heavy hitters, A/B tests
• Tackling research issues in DP requires a mixture of theoretical and empirical tools and work (which I think is interesting).
• There are still a lot of R&D issues in practice.