
Pycon 2024 - Is Your Model Private?

As the popularity of Machine Learning models continues to soar, concerns about the risks associated with black box models have become more prominent. While much attention has been given to the development of unfair models that may discriminate against certain minorities, there exists another concern often overlooked: the privacy risks posed by ML models.

Research has shown that ML models are susceptible to various attacks; a notable example is the Membership Inference attack, which lets an adversary predict whether a specific sample was used during training.

Join me in this talk, where I will explain the privacy risks inherent in Machine Learning models. Beyond exploring potential attacks, I will elucidate how techniques such as Differential Privacy and tools like Opacus (https://github.com/pytorch/opacus) can play crucial roles in training more robust and secure models.

Luca Corbucci

May 24, 2024

Transcript

  1. What’s the color of the cat?
     In a Membership Inference Attack, an attacker wants to know if a sample
     was used to train the model.
  2. What’s the color of the cat?
     [Bar charts comparing the model’s output confidence on a training image
     and on an unseen image] The model will be more confident when we query it
     with the image that was in the training dataset.
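     To make the slide’s intuition concrete, here is a minimal sketch of a
     confidence-threshold membership inference test (not from the slides; the
     model, the query point and the threshold value are hypothetical, and real
     attacks calibrate the threshold on known member/non-member samples):

         import torch
         import torch.nn.functional as F

         def top_confidence(model, x):
             """Top-class softmax confidence of the model on a single input."""
             model.eval()
             with torch.no_grad():
                 probs = F.softmax(model(x.unsqueeze(0)), dim=1)
             return probs.max().item()

         def guess_membership(model, x, threshold=0.95):
             """Guess that x was a training sample when the model is unusually
             confident on it (the threshold here is only for illustration)."""
             return top_confidence(model, x) > threshold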
  3. Differential Privacy (An intuition using databases)
     Suppose you have two databases that differ in a single instance.
  4. Differential Privacy (An intuition using databases)
     You query both of them and you get two different results: you can infer
     something about the missing instance.
  5. Differential Privacy (An intuition using databases)
     Differential Privacy allows you to query the databases adding some
     randomisation to the answer. You will have (more or less) the same output
     regardless of the presence of one sample.
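     A minimal sketch of this intuition, assuming a simple counting query
     answered with the Laplace mechanism (the databases, the query and the ε
     value are illustrative, not from the slides):

         import numpy as np

         rng = np.random.default_rng(0)

         def noisy_count(db, epsilon):
             """Counting query with Laplace noise. Adding or removing one
             record changes a count by at most 1 (sensitivity = 1), so noise
             with scale 1/epsilon gives epsilon-differential privacy."""
             return len(db) + rng.laplace(loc=0.0, scale=1.0 / epsilon)

         db = list(range(1000))   # a database with 1000 records
         db_minus_one = db[:-1]   # the same database missing one record

         print(noisy_count(db, epsilon=0.5))            # roughly 1000, plus or minus a few
         print(noisy_count(db_minus_one, epsilon=0.5))  # statistically hard to tell apart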
  6. Differential Privacy (A slightly more advanced definition)
     Given two databases D and D′ which differ in only one instance:
     P[A(D) = O] ≤ e^ε · P[A(D′) = O]
  7. How to interpret the ε
     Given two databases D and D′ which differ in only one instance:
     P[A(D) = O] ≤ e^ε · P[A(D′) = O]
     e^ε tells us how similar these two probabilities are. ε is called the
     “privacy budget” and represents an upper bound on how much information we
     can leak.
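     As a rough worked example of the bound (not from the slides), e^ε is the
     maximum factor by which the two probabilities may differ:

         import math

         for eps in (0.1, 1.0, 10.0):
             # epsilon-DP: P[A(D) = O] <= exp(eps) * P[A(D') = O]
             print(f"epsilon = {eps:>4}: probabilities differ by at most a factor of {math.exp(eps):.2f}")

     So ε = 0.1 keeps the two probabilities within about 11% of each other,
     while ε = 10 allows them to differ by a factor of roughly 22,000.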
  8. Differential Privacy (A more relaxed definition)
     Given two databases D and D′ which differ in only one instance:
     P[A(D) = O] ≤ e^ε · P[A(D′) = O] + δ
     The parameter δ quantifies the probability that something goes wrong: the
     algorithm will be differentially private with probability 1 - δ.
  9. This has a cost: the more private we want our query to be, the more noise
     we need to add and the less useful the query becomes.
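     A small numerical sketch of that cost, reusing the Laplace counting query
     from the earlier example (the true count and the ε values are
     illustrative):

         import numpy as np

         rng = np.random.default_rng(0)
         true_count = 1000

         # For a sensitivity-1 query the Laplace scale is 1/epsilon, so the
         # expected error of the noisy answer grows as epsilon shrinks.
         for eps in (10.0, 1.0, 0.1):
             noisy = true_count + rng.laplace(scale=1.0 / eps, size=100_000)
             print(f"epsilon = {eps:>4}: mean absolute error ~ {np.abs(noisy - true_count).mean():.1f}")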
  10. Differential Privacy
      Given two training datasets D and D′ which differ in only one sample:
      P[A(D) = M] ≤ e^ε · P[A(D′) = M]
      The outputs of the two neural networks will be similar regardless of the
      presence of that one sample in the dataset.
  11. SGD

      def sgd():
          for each batch L_t:
              for each sample x_i in the batch:
                  g_t(x_i) = compute_gradient(M, x_i)
              g_t = average of gradients
              M = M - lr * g_t
          return M

  12. SGD vs DP-SGD
      (The same SGD pseudocode shown side by side, as the starting point for
      the DP-SGD modifications.)

  13. SGD vs DP-SGD

      def dp_sgd():
          for each batch L_t:
              for each sample x_i in the batch:
                  g_t(x_i) = compute_gradient(M, x_i)
                  g_t(x_i) = clip_gradient(g_t(x_i))
              g_t = average of clipped gradients + noise
              M = M - lr * g_t
          return M
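      A minimal PyTorch sketch of the DP-SGD step outlined above, with
      per-sample gradient clipping and Gaussian noise (the hyperparameters and
      the naive per-sample loop are illustrative; in practice a library such as
      Opacus computes per-sample gradients efficiently):

          import torch

          def dp_sgd_step(model, loss_fn, batch_x, batch_y,
                          lr=0.1, max_grad_norm=1.0, noise_multiplier=1.0):
              """One DP-SGD update: clip each per-sample gradient, sum them,
              add Gaussian noise, average, and take a gradient step."""
              params = [p for p in model.parameters() if p.requires_grad]
              summed = [torch.zeros_like(p) for p in params]

              # Naive per-sample gradient computation.
              for x, y in zip(batch_x, batch_y):
                  loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
                  grads = torch.autograd.grad(loss, params)

                  # Clip the per-sample gradient to L2 norm <= max_grad_norm.
                  total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
                  clip_coef = torch.clamp(max_grad_norm / (total_norm + 1e-6), max=1.0)
                  for s, g in zip(summed, grads):
                      s.add_(g * clip_coef)

              batch_size = len(batch_x)
              with torch.no_grad():
                  for p, s in zip(params, summed):
                      # Noise is scaled to the clipping norm, as in DP-SGD.
                      noise = torch.normal(0.0, noise_multiplier * max_grad_norm,
                                           size=p.shape)
                      p.add_(-(lr / batch_size) * (s + noise))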
  14. Differentially Private NNs are just a wrapper away*
      (* if you carefully choose your privacy parameters)

      model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
          module=model,                # the model you want to train with DP
          optimizer=optimizer,
          data_loader=train_loader,
          epochs=EPOCHS,
          target_epsilon=EPSILON,      # privacy budget
          target_delta=DELTA,
          max_grad_norm=MAX_GRAD_NORM, # clipping value
      )
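      For context, a sketch of how this call typically fits into an Opacus
      training setup (the model, data and hyperparameter values are
      placeholders; see the Opacus documentation for the authoritative API):

          import torch
          from opacus import PrivacyEngine

          EPOCHS, EPSILON, DELTA, MAX_GRAD_NORM = 10, 5.0, 1e-5, 1.0

          model = torch.nn.Linear(20, 2)   # placeholder model
          optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
          train_loader = torch.utils.data.DataLoader(
              torch.utils.data.TensorDataset(torch.randn(256, 20),
                                             torch.randint(0, 2, (256,))),
              batch_size=32,
          )

          privacy_engine = PrivacyEngine()
          model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
              module=model,
              optimizer=optimizer,
              data_loader=train_loader,
              epochs=EPOCHS,
              target_epsilon=EPSILON,
              target_delta=DELTA,
              max_grad_norm=MAX_GRAD_NORM,
          )

          criterion = torch.nn.CrossEntropyLoss()
          for _ in range(EPOCHS):
              for x, y in train_loader:
                  optimizer.zero_grad()
                  criterion(model(x), y).backward()
                  optimizer.step()

          # Privacy budget spent so far, for the chosen delta.
          print(f"epsilon spent: {privacy_engine.get_epsilon(DELTA):.2f}")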
  15. A few notes on the privacy parameters
      Choosing the ε is a tradeoff between the utility of the model and the
      privacy we want to guarantee.

  16. A few notes on the privacy parameters
      If we set a low ε, we will need to introduce a lot of noise during
      training.

  17. A few notes on the privacy parameters
      This will degrade the model’s performance!
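      To see that tradeoff numerically, a small sketch assuming the helper
      get_noise_multiplier from opacus.accountants.utils (the sample rate,
      delta and epoch count are illustrative values; check the Opacus
      documentation for the exact signature):

          from opacus.accountants.utils import get_noise_multiplier

          # A smaller target epsilon forces a larger noise multiplier,
          # i.e. noisier gradients and usually a less accurate model.
          for target_eps in (10.0, 1.0, 0.1):
              sigma = get_noise_multiplier(
                  target_epsilon=target_eps,
                  target_delta=1e-5,
                  sample_rate=0.01,   # batch_size / dataset_size
                  epochs=10,
              )
              print(f"target epsilon = {target_eps:>4}: noise multiplier ~ {sigma:.2f}")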
  18. References
      1) Evaluating and Testing Unintended Memorization in Neural Networks
         https://bair.berkeley.edu/blog/2019/08/13/memorization/
      2) Scalable Extraction of Training Data from (Production) Language Models
         https://arxiv.org/pdf/2311.17035
      3) Membership Inference Attacks against Machine Learning Models
         https://arxiv.org/abs/1610.05820
      4) A friendly, non-technical introduction to differential privacy
         https://desfontain.es/blog/friendly-intro-to-differential-privacy.html
      5) Deep Learning with Differential Privacy
         https://arxiv.org/abs/1607.00133
      6) Opacus
         https://opacus.ai/
      7) Tensorflow Privacy
         https://github.com/tensorflow/privacy

  19. References
      8) A list of real-world uses of differential privacy
         https://desfontain.es/blog/real-world-differential-privacy.html
      9) Improving Gboard language models via private federated analytics
         https://research.google/blog/improving-gboard-language-models-via-private-federated-analytics/
      10) Learning with Privacy at Scale
          https://docs-assets.developer.apple.com/ml-research/papers/learning-with-privacy-at-scale.pdf