
Symbolic Distillation of Neural Networks

I describe a general framework for distilling symbolic models from neural networks.

Miles Cranmer

March 01, 2023

Transcript

  1. Kepler's third law (empirical fit): P² ∝ a³. Newton's law of gravitation, to explain it. Problem:
  2. Kepler's third law (empirical fit): P² ∝ a³. Newton's law of gravitation, to explain it. Planck's law (empirical fit): B = 2hν³/c² · (exp(hν/(k_B T)) − 1)⁻¹. Problem:
  3. Kepler's third law (empirical fit): P² ∝ a³. Newton's law of gravitation, to explain it. Planck's law (empirical fit): B = 2hν³/c² · (exp(hν/(k_B T)) − 1)⁻¹. Quantum mechanics, to explain it (partially). Problem:
  4. Kepler's third law (empirical fit): P² ∝ a³. Newton's law of gravitation, to explain it. Planck's law (empirical fit): B = 2hν³/c² · (exp(hν/(k_B T)) − 1)⁻¹. Quantum mechanics, to explain it (partially). Neural network weights (empirical fit). Problem:
  5. Kepler's third law (empirical fit): P² ∝ a³. Newton's law of gravitation, to explain it. Planck's law (empirical fit): B = 2hν³/c² · (exp(hν/(k_B T)) − 1)⁻¹. Quantum mechanics, to explain it (partially). Neural network weights (empirical fit). ???, to explain it. Problem:
  6. What I want: I want ML to create models in a language* I can understand** → Insights into existing models → Understand biases, learned shortcuts → Can place stronger priors on learned functions
  7. Industry version of interpretability • All revolves around saliency or feature importance • Consider the “saliency map” Omeiza et al., 2019
  8. Industry version of interpretability • All revolves around saliency or feature importance • Consider the “saliency map” Omeiza et al., 2019
  9. Symbolic regression • Symbolic regression finds analytic expressions to fit a dataset. • (~another name for “program synthesis”) • Pioneering work by Langley et al., 1980s; Koza et al., 1990s; Lipson et al., 2000s
  10. How can I try this? • Open-source • Extensible Python API compatible with scikit-learn • Can be distributed over 1000s of cores (w/ slurm, PBS, LSF, or Kubernetes)
  11. How can I try this? • Open-source • Extensible Python API compatible with scikit-learn • Can be distributed over 1000s of cores (w/ slurm, PBS, LSF, or Kubernetes) • Custom operators, losses, constraints
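
A minimal sketch of that scikit-learn-compatible API in use (this is the PySR library named on later slides; the toy data, operator set, and hyperparameters below are purely illustrative):

    import numpy as np
    from pysr import PySRRegressor

    # Toy dataset: 100 samples, 4 input features, and a known target function.
    X = np.random.randn(100, 4)
    y = 2.5 * np.cos(X[:, 3]) + X[:, 0] ** 2 - 0.5

    model = PySRRegressor(
        niterations=40,                          # how long to run the evolutionary search
        binary_operators=["+", "-", "*", "/"],
        unary_operators=["cos", "exp"],          # custom operators can also be supplied
    )
    model.fit(X, y)                              # standard scikit-learn fit interface
    print(model)                                 # shows the Pareto front of discovered equations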
  12. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity.
  13. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity. • One must search over:
  14. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity. • One must search over: • (permutations of operators) x (permutations of variables + possible constants)
  15. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity. • One must search over: • (permutations of operators) x (permutations of variables + possible constants)
  16. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity. • One must search over: • (permutations of operators) x (permutations of variables + possible constants) • But, we know that neural networks can efficiently find very complex functions!
  17. Great! But, there’s a problem: • Genetic algorithms, like PySR, scale terribly with expression complexity. • One must search over: • (permutations of operators) x (permutations of variables + possible constants) • But, we know that neural networks can efficiently find very complex functions! • Can we exploit this?
  18. Symbolic Distillation: Neural network → Approximation in my domain-specific language. Miles Cranmer, Rui Xu, Peter Battaglia and Shirley Ho, ML4Physics Workshop @ NeurIPS 2019; Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho, NeurIPS, 2020
  19. How this works: 1. Train NN normally, and freeze parameters. 2. Record input/outputs of network over training set.
  20. How this works: 1. Train NN normally, and freeze parameters. 2. Record input/outputs of network over training set. 3. Fit the input/outputs of the neural network with PySR.
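
A hedged sketch of those three steps in code (PyTorch is used here only as a stand-in for whatever framework trained the network; the architecture and data are made up):

    import torch
    from pysr import PySRRegressor

    # 1. Train the network normally, then freeze its parameters.
    net = torch.nn.Sequential(
        torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
    )
    # ... usual training loop omitted ...
    for p in net.parameters():
        p.requires_grad_(False)

    # 2. Record the network's inputs and outputs over the training set.
    X_train = torch.randn(1000, 4)               # stand-in for the real training inputs
    with torch.no_grad():
        y_net = net(X_train).squeeze(-1)         # the frozen network's predictions

    # 3. Fit those recorded input/output pairs with PySR.
    sr = PySRRegressor(niterations=40, binary_operators=["+", "-", "*", "/"])
    sr.fit(X_train.numpy(), y_net.numpy())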
  21. Full Symbolic Distillation: (g ∘ f)(x1, x2, x3, x4) = Fully-interpretable approximation of the original neural network!
  22. Full Symbolic Distillation: (g ∘ f)(x1, x2, x3, x4) = Fully-interpretable approximation of the original neural network!
  23. Full Symbolic Distillation: (g ∘ f)(x1, x2, x3, x4) = Fully-interpretable approximation of the original neural network! • Easier to interpret and compare with existing models in the domain-specific language
  24. Full Symbolic Distillation: (g ∘ f)(x1, x2, x3, x4) = Fully-interpretable approximation of the original neural network! • Easier to interpret and compare with existing models in the domain-specific language • Easier to impose symbolic priors (can potentially get better generalization!)
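
To make the (g ∘ f) composition concrete, here is a small illustration of substituting one distilled expression into the other with SymPy (both expressions are invented placeholders, not outputs of any real run):

    import sympy as sp

    x1, x2, x3, x4, z = sp.symbols("x1 x2 x3 x4 z")

    f_expr = x1 * x2 + sp.cos(x3) - x4           # stand-in for the distilled inner expression f
    g_expr = 2 * z**2 - 1                        # stand-in for the distilled outer expression g

    composed = sp.simplify(g_expr.subs(z, f_expr))   # (g ∘ f)(x1, ..., x4)
    print(composed)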
  25. vs. Instead of having to find this complex expression, I have reduced it to finding multiple, simple expressions.
  26. Searching over expressions: n² → 2n. Instead of having to find this complex expression, I have reduced it to finding multiple, simple expressions.
  27. Inductive bias • Introducing some form of inductive bias is needed to eliminate the functional degeneracy. For example:
  28. Inductive bias • Introducing some form of inductive bias is needed to eliminate the functional degeneracy. For example: • the latent space between f and g could have some aggregation over a set (∑ᵢ)!
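
One concrete way to realize that aggregation is the structure below: f is applied to every element of the input set, the latents are summed, and g maps the pooled latent to the output (a deep-sets-style sketch; dimensions and layer sizes are illustrative):

    import torch

    class FSumG(torch.nn.Module):
        def __init__(self, in_dim=4, latent_dim=2):
            super().__init__()
            # f: maps each element of the set to a small latent vector
            self.f = torch.nn.Sequential(
                torch.nn.Linear(in_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, latent_dim)
            )
            # g: maps the aggregated latent vector to the prediction y
            self.g = torch.nn.Sequential(
                torch.nn.Linear(latent_dim, 64), torch.nn.ReLU(), torch.nn.Linear(64, 1)
            )

        def forward(self, X):                    # X: (batch, set_size, in_dim)
            latents = self.f(X)                  # f applied elementwise over the set
            pooled = latents.sum(dim=1)          # the ∑ᵢ aggregation between f and g
            return self.g(pooled)                # y = g(∑ᵢ f(Xᵢ))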
  29. Inductive bias • Other inductive biases to eliminate the degeneracy: • Sparsity on latent space (⇒ fewer equations, fewer variables)
  30. Inductive bias • Other inductive biases to eliminate the degeneracy: • Sparsity on latent space (⇒ fewer equations, fewer variables) • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!)
  31. Inductive bias • Other inductive biases to eliminate the degeneracy: • Sparsity on latent space (⇒ fewer equations, fewer variables) • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!) • Smoothness penalty (try to encourage expression-like behavior)
  32. Inductive bias • Other inductive biases to eliminate the degeneracy: • Sparsity on latent space (⇒ fewer equations, fewer variables) • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!) • Smoothness penalty (try to encourage expression-like behavior) • “Disentangled sparsity”
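
As a toy example of the sparsity idea, an L1 penalty on the latent vector between f and g pushes most latent features toward zero, so fewer equations with fewer variables need to be distilled afterwards (the penalty weight is illustrative):

    import torch

    def latent_sparsity_penalty(latents: torch.Tensor, weight: float = 1e-2) -> torch.Tensor:
        # L1 regularization on the f-to-g latent space; added to the prediction loss.
        return weight * latents.abs().mean()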
  33. Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI,” Workshop on Sparse Neural Networks, 2021, pp. 7. https://astroautomata.com/data/sjnn_paper.pdf
  34. • Disentangled Sparsity: Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI,” Workshop on Sparse Neural Networks, 2021, pp. 7. https://astroautomata.com/data/sjnn_paper.pdf
  35. • Disentangled Sparsity: • Want few latent features AND want each latent feature to have few dependencies. Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI,” Workshop on Sparse Neural Networks, 2021, pp. 7. https://astroautomata.com/data/sjnn_paper.pdf
  36. • Disentangled Sparsity: • Want few latent features AND want each latent feature to have few dependencies. • This makes things much easier for the genetic algorithm! Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI,” Workshop on Sparse Neural Networks, 2021, pp. 7. https://astroautomata.com/data/sjnn_paper.pdf
  37. • Disentangled Sparsity: • Want few latent features AND want each latent feature to have few dependencies. • This makes things much easier for the genetic algorithm! Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI,” Workshop on Sparse Neural Networks, 2021, pp. 7. https://astroautomata.com/data/sjnn_paper.pdf
  38. Example: Graph neural network activations = forces, under a sparsity regularization. Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho, NeurIPS, 2020
  39. Example: Discovering Orbital Mechanics. Can we learn Newton’s law of gravity simply by observing the solar system? Unknown masses, and unknown dynamical model.
  40. “Rediscovering orbital mechanics with machine learning” (2022), Pablo Lemos, Niall Jeffrey, Miles Cranmer, Shirley Ho, Peter Battaglia. Example: Discovering Orbital Mechanics. Can we learn Newton’s law of gravity simply by observing the solar system? Unknown masses, and unknown dynamical model.
  41. Simplification: • At some time t: • Known position xᵢ ∈ ℝ³ for each planet • Known acceleration ẍᵢ ∈ ℝ³ for each planet • Unknown parameter vᵢ ∈ ℝ for each planet • Unknown force f(xᵢ − xⱼ, vᵢ, vⱼ)
  42. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
  43. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Here ẍᵢ is the known acceleration.
  44. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Here ẍᵢ is the known acceleration; the (1/vᵢ) ∑ⱼ form is Newton’s laws of motion (assumed).
  45. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Here ẍᵢ is the known acceleration; the (1/vᵢ) ∑ⱼ form is Newton’s laws of motion (assumed); f is the learned force law.
  46. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Here ẍᵢ is the known acceleration; the (1/vᵢ) ∑ⱼ form is Newton’s laws of motion (assumed); f is the learned force law; vᵢ, vⱼ are learned parameters for planets i, j.
  47. Simplification: • Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Here ẍᵢ is the known acceleration; the (1/vᵢ) ∑ⱼ form is Newton’s laws of motion (assumed); f is the learned force law; vᵢ, vⱼ are learned parameters for planets i, j. Learn via gradient descent. This allows us to find both f and vᵢ simultaneously.
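
A hedged sketch of that gradient-descent setup in PyTorch: a small network f for the pairwise interaction and one free parameter vᵢ per body, optimized together against the known accelerations (shapes, sizes, and learning rates are illustrative, not the paper’s):

    import torch

    n_bodies, dim = 31, 3

    # f takes (x_i - x_j, v_i, v_j) and returns a 3-vector contribution to body i's acceleration.
    f = torch.nn.Sequential(
        torch.nn.Linear(dim + 2, 128), torch.nn.ReLU(), torch.nn.Linear(128, dim)
    )
    log_v = torch.nn.Parameter(torch.zeros(n_bodies))    # learn v_i > 0 via its logarithm

    opt = torch.optim.Adam(list(f.parameters()) + [log_v], lr=1e-3)

    def predicted_acceleration(x):                       # x: (n_bodies, 3) positions at one time
        v = log_v.exp()
        accs = []
        for i in range(n_bodies):
            total = torch.zeros(dim)
            for j in range(n_bodies):
                if i == j:
                    continue
                inp = torch.cat([x[i] - x[j], v[i:i + 1], v[j:j + 1]])
                total = total + f(inp)                   # ∑_{j≠i} f(x_i − x_j, v_i, v_j)
            accs.append(total / v[i])                    # divide by v_i (Newton's second law)
        return torch.stack(accs)

    # One training step against the known accelerations xddot from the ephemeris data:
    #   loss = ((predicted_acceleration(x) - xddot) ** 2).mean()
    #   opt.zero_grad(); loss.backward(); opt.step()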
  48. Training: • NASA’s HORIZONS ephemeris data • 31 bodies: • Sun • Planets • Moons with mass > 1e18 kg • (Therefore: 465 connections) • 30 years, 1980-2010, for training • 2010-2013 for validation
  49. Next: interpretation. ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ). Approximate input/output of f with symbolic regression.
  50. Interpretation: results for f, by complexity. Accuracy/complexity tradeoff*: score = −d(log(error))/d(complexity). (*From Cranmer+2020; similar to Schmidt & Lipson, 2009.)
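
That tradeoff score can be computed directly along the Pareto front of candidate expressions; a tiny worked example with made-up numbers:

    import numpy as np

    # Candidate expressions along the accuracy/complexity Pareto front (values invented).
    complexity = np.array([1, 3, 5, 8, 12])
    error = np.array([1.0, 0.5, 0.2, 0.19, 0.18])

    # score = -d(log(error)) / d(complexity), evaluated between neighboring candidates.
    score = -np.diff(np.log(error)) / np.diff(complexity)

    best = np.argmax(score) + 1          # the expression just after the sharpest drop in error
    print(score, complexity[best])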
  51. Why isn’t this working well? • Let’s look at the mass values in comparison with the true masses:
  52. Why isn’t this working well? • Let’s look at the mass values in comparison with the true masses:
  53. Solution: re-optimize vᵢ! • The vᵢ were optimized for the neural network. • The symbolic formula is not a *perfect* approximation of the network.
  54. Solution: re-optimize vᵢ! • The vᵢ were optimized for the neural network. • The symbolic formula is not a *perfect* approximation of the network. • Thus: we need to re-optimize vᵢ for the symbolic function f!
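
A hedged sketch of that re-optimization step: the distilled force law is now held fixed (the expression below is a placeholder, not the actual recovered law) and only the per-body parameters vᵢ are re-fit by gradient descent:

    import torch

    def symbolic_f(dx, vi, vj):
        # Placeholder standing in for the PySR-discovered force law.
        r = dx.norm() + 1e-8
        return -vi * vj * dx / r**3

    n_bodies = 31
    log_v = torch.nn.Parameter(torch.zeros(n_bodies))    # re-initialized (or warm-started) v_i
    opt = torch.optim.Adam([log_v], lr=1e-2)             # note: only v_i is optimized now

    def predicted_acceleration(x):                       # x: (n_bodies, 3) positions
        v = log_v.exp()
        return torch.stack([
            sum(symbolic_f(x[i] - x[j], v[i], v[j]) for j in range(n_bodies) if j != i) / v[i]
            for i in range(n_bodies)
        ])

    # Then minimize ((predicted_acceleration(x) - xddot) ** 2).mean() exactly as before.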
  55. Example: Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022). Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez
  56. Example: Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022). Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez. Trained to reproduce turbulence simulations at lower resolution:
  57. Example: Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022). Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez. Trained to reproduce turbulence simulations at lower resolution: 1000x speedup.
  58. Example: Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022). Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez. Trained to reproduce turbulence simulations at lower resolution: 1000x speedup. How did the model actually achieve this?
  59. Summary • Symbolic distillation is a technique for translating ML models into a domain-specific language • Can do this for expressions/programs using PySR • Exciting future applications in understanding turbulence, and other physical systems