
Symbolic Distillation of Neural Networks

I describe a general framework for distilling symbolic models from neural networks.

Miles Cranmer

March 01, 2023


Transcript

  1. Symbolic Distillation

    of Neural Networks
    Miles Cranmer
    Flatiron Institute


    University of Cambridge


    Princeton University



  2. Empirical fit: Kepler’s third law (P² ∝ a³)
    Problem:

  3. Empirical fit: Kepler’s third law (P² ∝ a³)
    Problem: Newton’s law of gravitation, to explain it

  4. Empirical fit: Kepler’s third law (P² ∝ a³); Planck’s law (B = (2hν³/c²)(exp(hν/(kB T)) − 1)⁻¹)
    Problem: Newton’s law of gravitation, to explain it

  5. Empirical fit: Kepler’s third law (P² ∝ a³); Planck’s law (B = (2hν³/c²)(exp(hν/(kB T)) − 1)⁻¹)
    Problem: Newton’s law of gravitation, to explain it; quantum mechanics, to (partially) explain it

  6. Empirical fit: Kepler’s third law (P² ∝ a³); Planck’s law (B = (2hν³/c²)(exp(hν/(kB T)) − 1)⁻¹); neural network weights
    Problem: Newton’s law of gravitation, to explain it; quantum mechanics, to (partially) explain it

  7. Empirical fit: Kepler’s third law (P² ∝ a³); Planck’s law (B = (2hν³/c²)(exp(hν/(kB T)) − 1)⁻¹); neural network weights
    Problem: Newton’s law of gravitation, to explain it; quantum mechanics, to (partially) explain it; ???

  8. What I want:
    I want ML to create models in a language* I can understand**


    → Insights into existing models


    → Understand biases, learned shortcuts


    → Can place stronger priors on learned functions


  9. Industry version of interpretability
    • All revolves around saliency or feature
    importance


    • Consider the “saliency map”
    Omeiza et al., 2019



  11. (- Feynman’s blackboard)


  12. Science already has a modeling language


  13. Science already has a modeling language
    Computer Vision


  14. Science already has a modeling language
    Computer Vision
    ???


  15. Science already has a modeling language
    Computer Vision Science
    ???



  17. (image-only slide)

  18. We should build interpretations in this existing
    language: mathematical expressions!


  19. Symbolic regression
    • Symbolic regression finds analytic expressions to fit a dataset.
    • (~another name for “program synthesis”)
    • Pioneering work by Langley et al., 1980s; Koza et al., 1990s; Lipson et al., 2000s
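As a toy illustration of “finding analytic expressions to fit a dataset” (a deliberately tiny, hand-picked hypothesis space; real tools like PySR search vastly larger spaces with genetic algorithms):

```python
import math

# Toy dataset generated by an unknown law y = x**1.5 (Kepler-like: P ∝ a^(3/2)).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [x ** 1.5 for x in xs]

# Tiny hypothesis space: candidate analytic expressions (name, function).
candidates = [
    ("x", lambda x: x),
    ("x^2", lambda x: x ** 2),
    ("x^(3/2)", lambda x: x ** 1.5),
    ("sqrt(x)", lambda x: math.sqrt(x)),
    ("log(x+1)", lambda x: math.log(x + 1)),
]

def mse(f):
    # Mean squared error of a candidate expression on the dataset.
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Symbolic regression, brute-force style: pick the expression minimizing error.
best_name, best_f = min(candidates, key=lambda c: mse(c[1]))
print(best_name)  # → x^(3/2)
```

The real search problem is enumerating the candidates themselves, which is what genetic algorithms are for.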

  20. How can I try this?


  21. How can I try this?
    • Open-source


  22. How can I try this?
    • Open-source
    • Extensible Python API compatible with scikit-learn


  23. How can I try this?
    • Open-source
    • Extensible Python API compatible with scikit-learn
    • Can be distributed over 1000s of cores

    (w/ slurm, PBS, LSF, or Kubernetes)


  24. How can I try this?
    • Open-source
    • Extensible Python API compatible with scikit-learn
    • Can be distributed over 1000s of cores

    (w/ slurm, PBS, LSF, or Kubernetes)
    • Custom operators, losses, constraints


  25. (image-only slide)

  26. (image-only slide)

  27. Great!


    But, there’s a problem:


  28. Great!


    But, there’s a problem:
    • Genetic algorithms, like PySR, scale terribly with expression complexity.


  29. Great!


    But, there’s a problem:
    • Genetic algorithms, like PySR, scale terribly with expression complexity.
    • One must search over:


  30. Great!


    But, there’s a problem:
    • Genetic algorithms, like PySR, scale terribly with expression complexity.
    • One must search over:
    • (permutations of operators) x (permutations of variables + possible constants)



  32. Great!

    But, there’s a problem:
    • Genetic algorithms, like PySR, scale terribly with expression complexity.
    • One must search over:
    • (permutations of operators) x (permutations of variables + possible constants)
    • But, we know that neural networks can efficiently find very complex functions!

  33. Great!

    But, there’s a problem:
    • Genetic algorithms, like PySR, scale terribly with expression complexity.
    • One must search over:
    • (permutations of operators) x (permutations of variables + possible constants)
    • But, we know that neural networks can efficiently find very complex functions!
    • Can we exploit this?

  34. Symbolic Distillation
    Neural network


  35. Symbolic Distillation
    Neural network
    Approximation in my domain-specific language
    Miles Cranmer, Rui Xu, Peter Battaglia and Shirley Ho,
    ML4Physics Workshop @ NeurIPS 2019
    Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho,
    NeurIPS, 2020

  36. How this works:
    1. Train NN normally, 

    and freeze parameters.


  37. How this works:
    1. Train NN normally, 

    and freeze parameters.
    2. Record input/outputs of

    network over training set.


  38. How this works:
    1. Train NN normally, 

    and freeze parameters.
    2. Record input/outputs of

    network over training set.
    PySR
    3. Fit the input/outputs of the
    neural network with PySR

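The three steps above can be sketched end-to-end. Everything here is an illustrative stand-in: a fixed callable plays the trained-and-frozen network, and a tiny candidate search plays the role of PySR:

```python
import math

# 1. A "trained and frozen" network: here just a black-box callable.
def frozen_net(x):
    return 3.0 * math.sin(x) + 0.01 * x  # pretend these weights came from training

# 2. Record input/output pairs of the network over the training inputs.
inputs = [i * 0.1 for i in range(100)]
records = [(x, frozen_net(x)) for x in inputs]

# 3. Fit the recorded pairs with a symbolic search (stand-in for PySR).
candidates = {
    "3*sin(x)": lambda x: 3.0 * math.sin(x),
    "x": lambda x: x,
    "3*cos(x)": lambda x: 3.0 * math.cos(x),
}

def mse(f):
    return sum((f(x) - y) ** 2 for x, y in records) / len(records)

best = min(candidates, key=lambda name: mse(candidates[name]))
print(best)  # the symbolic surrogate closest to the network's behavior
```

The key point is that the symbolic search only ever sees (input, output) pairs of the frozen network, never its weights.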

  39. Analogy
    “Taylor expanding the Neural Network”



  41. Full Symbolic Distillation


  42. Full Symbolic Distillation
    Learns features?
    Uses features for calculation?


  44. Full Symbolic Distillation


  45. Full Symbolic Distillation
    Re-train g, to pick up any errors in the approximation of f 🔄

  46. Full Symbolic Distillation

  50. Full Symbolic Distillation
    (g ∘ f)(x₁, x₂, x₃, x₄) =

  51. Full Symbolic Distillation
    (g ∘ f)(x₁, x₂, x₃, x₄) =
    Fully-interpretable approximation of the original neural network!


  53. Full Symbolic Distillation
    (g ∘ f)(x₁, x₂, x₃, x₄) =
    Fully-interpretable approximation of the original neural network!
    • Easier to interpret and compare with existing models in the domain-specific language

  54. Full Symbolic Distillation
    (g ∘ f)(x₁, x₂, x₃, x₄) =
    Fully-interpretable approximation of the original neural network!
    • Easier to interpret and compare with existing models in the domain-specific language
    • Easier to impose symbolic priors (can potentially get better generalization!)

  55. vs
    Instead of having to find this complex expression, I have reduced it to finding multiple, simple expressions.

  56. Searching over expressions: n² → 2n
    Instead of having to find this complex expression, I have reduced it to finding multiple, simple expressions.
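One way to see the gain (an illustrative back-of-envelope count, not the deck's exact scaling): count candidate binary expression trees of a given size, and compare one big search against two half-sized ones.

```python
from math import comb

def num_expressions(n_internal, n_operators, n_leaf_choices):
    """Rough count of binary expression trees with n_internal operator
    nodes: Catalan(n) tree shapes, each internal node labeled with one of
    n_operators, each of the n+1 leaves with a variable or constant."""
    catalan = comb(2 * n_internal, n_internal) // (n_internal + 1)
    return catalan * n_operators ** n_internal * n_leaf_choices ** (n_internal + 1)

ops, leaves = 4, 5
big = num_expressions(10, ops, leaves)       # one complex expression
split = 2 * num_expressions(5, ops, leaves)  # two simpler expressions instead
print(f"search space shrinks by ~{big / split:,.0f}x")
```

Because the candidate count grows exponentially with expression size, distilling a composition g ∘ f into two smaller searches wins by many orders of magnitude.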

  57. What about the functional degeneracy?
    Any over-complicated functional form that f learns, g could invert!

  58. Inductive bias
    • Introducing some form of inductive bias is needed to eliminate the functional degeneracy. For example:

  59. Inductive bias
    • Introducing some form of inductive bias is needed to eliminate the functional degeneracy. For example:
    • the latent space between f and g could have some aggregation over a set (e.g. a sum over i)!

  60. Inductive bias


  61. Inductive bias
    • Other inductive biases to eliminate the degeneracy:


  62. Inductive bias
    • Other inductive biases to eliminate the degeneracy:
    • Sparsity on latent space (fewer equations, fewer variables)

  63. Inductive bias
    • Other inductive biases to eliminate the degeneracy:
    • Sparsity on latent space (fewer equations, fewer variables)
    • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!)

  64. Inductive bias
    • Other inductive biases to eliminate the degeneracy:
    • Sparsity on latent space (fewer equations, fewer variables)
    • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!)
    • Smoothness penalty (try to encourage expression-like behavior)

  65. Inductive bias
    • Other inductive biases to eliminate the degeneracy:
    • Sparsity on latent space (fewer equations, fewer variables)
    • (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!)
    • Smoothness penalty (try to encourage expression-like behavior)
    • “Disentangled sparsity”

  66. Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI”
    Workshop on Sparse Neural Networks, 2021, p. 7
    https://astroautomata.com/data/sjnn_paper.pdf

  67. • Disentangled Sparsity:
    Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI”
    Workshop on Sparse Neural Networks, 2021, p. 7
    https://astroautomata.com/data/sjnn_paper.pdf

  68. • Disentangled Sparsity:
    • Want few latent features AND want each latent feature to have few dependencies
    Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI”
    Workshop on Sparse Neural Networks, 2021, p. 7
    https://astroautomata.com/data/sjnn_paper.pdf

  69. • Disentangled Sparsity:
    • Want few latent features AND want each latent feature to have few dependencies
    • This makes things much easier for the genetic algorithm!
    Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI”
    Workshop on Sparse Neural Networks, 2021, p. 7
    https://astroautomata.com/data/sjnn_paper.pdf

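A sketch of how such a penalty could be scored (my own minimal formulation for illustration, not the paper's exact loss): penalize both the number of active latent features and the number of inputs each feature depends on, estimated from a Jacobian-like sensitivity matrix.

```python
# sensitivity[k][j] ≈ |∂(latent feature k)/∂(input j)|, e.g. from autodiff.
sensitivity = [
    [2.1, 0.0, 0.0, 0.0],  # feature 0 depends on one input
    [0.0, 1.3, 0.4, 0.0],  # feature 1 depends on two inputs
    [0.0, 0.0, 0.0, 0.0],  # feature 2 is inactive
]

def disentangled_sparsity_penalty(S, lam_feat=1.0, lam_dep=0.1, eps=1e-6):
    # Few latent features AND few input dependencies per feature.
    active_features = sum(1 for row in S if sum(row) > eps)
    dependencies = sum(1 for row in S for s in row if s > eps)
    return lam_feat * active_features + lam_dep * dependencies

penalty = disentangled_sparsity_penalty(sensitivity)
print(penalty)
```

In practice one would use a differentiable relaxation (e.g. L1 norms) so the penalty can be minimized by gradient descent alongside the task loss.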

  71. Example: Graph neural network activations = forces, under a sparsity regularization
    Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho,
    NeurIPS, 2020

  72. Example: Discovering Orbital Mechanics

  73. Example: Discovering Orbital Mechanics
    Can we learn Newton’s law of gravity simply by observing the solar system?
    Unknown masses, and unknown dynamical model.

  74. Example: Discovering Orbital Mechanics
    Can we learn Newton’s law of gravity simply by observing the solar system?
    Unknown masses, and unknown dynamical model.
    “Rediscovering orbital mechanics with machine learning” (2022)
    Pablo Lemos, Niall Jeffrey, Miles Cranmer, Shirley Ho, Peter Battaglia

  75. Simplification:

  76. Simplification:
    • At some time t:
    • Known position for each planet: xᵢ ∈ ℝ³
    • Known acceleration for each planet: ẍᵢ ∈ ℝ³
    • Unknown parameter for each planet: vᵢ ∈ ℝ
    • Unknown force: f(xᵢ − xⱼ, vᵢ, vⱼ)

  77. Simplification:

  78. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)

  79. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    (ẍᵢ: known acceleration)

  80. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    (ẍᵢ: known acceleration; ≈: Newton’s laws of motion, assumed)

  81. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    (ẍᵢ: known acceleration; ≈: Newton’s laws of motion, assumed; f: learned force law)

  82. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    (ẍᵢ: known acceleration; ≈: Newton’s laws of motion, assumed; f: learned force law; vᵢ, vⱼ: learned parameters for planets i, j)

  83. Simplification:
    • Optimize:
      ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    (ẍᵢ: known acceleration; ≈: Newton’s laws of motion, assumed; f: learned force law; vᵢ, vⱼ: learned parameters for planets i, j)
    Learn via gradient descent.
    This allows us to find both f and vᵢ simultaneously.
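The joint optimization above can be sketched in a 1-D toy version. Everything here is illustrative: a hand-written parametric force law stands in for the neural network f, and finite differences stand in for autodiff.

```python
# Toy 1-D version: fit a force-law parameter g and per-body parameters v_i
# jointly by gradient descent on the acceleration residuals.
positions = [0.0, 1.0, 2.5]
true_v = [1.0, 0.5, 2.0]  # stand-ins for the masses (unknown to the fit)

def accel(i, xs, v, g):
    # ẍ_i ≈ (1/v_i) Σ_{j≠i} f(x_i − x_j, v_i, v_j), with f = g v_i v_j Δx / |Δx|³
    total = 0.0
    for j in range(len(xs)):
        if j == i:
            continue
        dx = xs[j] - xs[i]
        total += g * v[i] * v[j] * dx / abs(dx) ** 3
    return total / v[i]

# "Observed" accelerations, generated from the true parameters (g = 1).
targets = [accel(i, positions, true_v, 1.0) for i in range(3)]

def loss(params):
    g, v = params[0], params[1:]
    return sum((accel(i, positions, v, g) - targets[i]) ** 2 for i in range(3))

params = [0.5, 1.0, 1.0, 1.0]  # initial guess for (g, v_1, v_2, v_3)
initial = loss(params)
for _ in range(500):  # plain gradient descent with finite-difference gradients
    grad = []
    for k in range(len(params)):
        bumped = params.copy()
        bumped[k] += 1e-6
        grad.append((loss(bumped) - loss(params)) / 1e-6)
    params = [p - 0.05 * gr for p, gr in zip(params, grad)]
print(initial, "→", loss(params))
```

Note the g ↔ vᵢ rescaling degeneracy: many (g, v) combinations fit equally well, which is exactly why the mass-like parameters need care later.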

  84. Training:
    • NASA’s HORIZONS ephemeris data
    • 31 bodies:
    • Sun
    • Planets
    • Moons with mass > 10¹⁸ kg
    • (Therefore: 465 connections)
    • 30 years, 1980-2010 for training
    • 2010-2013 for validation
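A quick sanity check of the connection count: 31 bodies interacting pairwise give C(31, 2) edges.

```python
from math import comb

# Number of unordered pairs among 31 bodies.
print(comb(31, 2))  # → 465
```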

  85. (image-only slide)

  86. (image-only slide)

  87. Next: interpretation
    ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
    Approximate input/output of f with symbolic regression.

  88. Interpretation Results for f
    Accuracy/Complexity Tradeoff* (x-axis: complexity)
    *from Cranmer+2020; similar to Schmidt & Lipson, 2009


  95. Interpretation Results for f
    Accuracy/Complexity Tradeoff* = −d(log(error))/d(complexity)
    *from Cranmer+2020; similar to Schmidt & Lipson, 2009
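The tradeoff criterion above can be applied directly to a Pareto front of (complexity, error) pairs; the numbers below are made up for illustration.

```python
import math

# Hypothetical Pareto front from a symbolic regression run: (complexity, error).
front = [(1, 10.0), (3, 5.0), (5, 0.05), (7, 0.04), (9, 0.039)]

# Score each step by −d(log(error))/d(complexity); pick the largest drop.
scores = []
for (c0, e0), (c1, e1) in zip(front, front[1:]):
    scores.append((-(math.log(e1) - math.log(e0)) / (c1 - c0), c1))

best_score, best_complexity = max(scores)
print(best_complexity)  # → 5: the big accuracy jump at modest complexity
```

This selects the expression where accuracy improves fastest per unit of added complexity, rather than simply the most accurate (and most complex) one.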

  96. Test the symbolic model:


  97. (image-only slide)

  98. (image-only slide)

  99. Why isn’t this working well?
    • Let’s look at the mass values in comparison with the true masses:



  101. Solution: re-optimize vᵢ!

  102. Solution: re-optimize vᵢ!
    • The vᵢ were optimized for the neural network.

  103. Solution: re-optimize vᵢ!
    • The vᵢ were optimized for the neural network.
    • The symbolic formula is not a *perfect* approximation of the network.

  104. Solution: re-optimize vᵢ!
    • The vᵢ were optimized for the neural network.
    • The symbolic formula is not a *perfect* approximation of the network.
    • Thus: we need to re-optimize vᵢ for the symbolic function f!
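A minimal sketch of this re-optimization step, under toy assumptions (made-up data and an illustrative frozen formula): hold the distilled symbolic f fixed and re-fit only the scalar parameter by gradient descent.

```python
# (x_i, observed ẍ_i) pairs; values made up for illustration.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

def symbolic_f(x, v):
    return 2.0 * v * x  # the frozen, distilled formula (illustrative)

def loss(v):
    return sum((symbolic_f(x, v) - a) ** 2 for x, a in data)

v = 0.5  # value inherited from the neural-network stage
for _ in range(200):  # gradient descent with a finite-difference gradient
    grad = (loss(v + 1e-6) - loss(v)) / 1e-6
    v -= 0.01 * grad
print(round(v, 2))  # v re-optimized for the symbolic f, not the network
```

The same idea applies to the planetary vᵢ: the formula's small approximation errors shift the best-fit parameter values away from those found for the network.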

  105. (image-only slide)

  106. (image-only slide)

  107. (image-only slide)

  108. V. Ongoing Work: Turbulence
    Work includes: Dmitrii Kochkov, Keaton Burns, Drummond Fielding,
    and others


  109. Example:
    Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022)
    Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez

  110. Example:
    Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022)
    Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez
    Trained to reproduce turbulence simulations at lower resolution:

  111. Example:
    Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022)
    Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez
    Trained to reproduce turbulence simulations at lower resolution:
    1000x speedup:

  112. Example:
    Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022)
    Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez
    Trained to reproduce turbulence simulations at lower resolution:
    1000x speedup:
    How did the model actually achieve this?

  113. (Preliminary results)
    τxx, τxy, τyx, τyy
    (Non-symmetric, since offset!)

  114. Summary
    • Symbolic distillation is a technique for translating ML models into a domain-specific language
    • Can do this for expressions/programs using PySR
    • Exciting future applications in understanding turbulence, and other physical systems