
# Symbolic Distillation of Neural Networks

I describe a general framework for distilling symbolic models from neural networks.

March 01, 2023

## Transcript

1. Symbolic Distillation of Neural Networks
Miles Cranmer
Flatiron Institute
University of Cambridge
Princeton University

2–7. Problem:
Empirical fit → theory to explain it
• Kepler’s third law: P² ∝ a³ → Newton’s law of gravitation, to explain it
• Planck’s law: B = (2hν³/c²) (exp(hν/(k_B T)) − 1)⁻¹ → quantum mechanics, to explain it (partially)
• Neural network weights → ???

8. What I want:
I want ML to create models in a language* I can understand**

→ Insights into existing models

→ Understand biases, learned shortcuts

→ Can place stronger priors on learned functions

9–10. Industry version of interpretability
• All revolves around saliency or feature importance
• Consider the “saliency map”
Omeiza et al., 2019

11. (- Feynman’s blackboard)

12–16. Science already has a modeling language
(Computer Vision vs. Science: ???)

17. We should build interpretations in this existing language: mathematical expressions!

18. Symbolic regression
• Symbolic regression finds analytic expressions to fit a dataset.
• (~another name for “program synthesis”)
• Pioneering work by Langley et al., 1980s; Koza et al., 1990s; Lipson et al., 2000s
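To make the definition concrete, here is a deliberately tiny, PySR-free sketch of symbolic regression: exhaustively enumerate depth-2 expression trees over a few assumed operators and leaves, and keep whichever best fits the data. The target law x0² + x1 and the operator/leaf sets are made-up illustrations, not anything from the talk.

```python
import itertools

# Target data: y = x0**2 + x1 (the "unknown" law to rediscover).
data = [((x0, x1), x0**2 + x1) for x0 in range(-3, 4) for x1 in range(-3, 4)]

# Candidate building blocks (illustrative assumptions).
leaves = [lambda v: v[0], lambda v: v[1], lambda v: 1.0]
leaf_names = ["x0", "x1", "1"]
ops = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def mse(expr):
    return sum((expr(v) - y) ** 2 for v, y in data) / len(data)

best = None
# Exhaustive search over depth-2 trees of the form (li o2 lj) o1 lk.
for (o1, f1), (o2, f2) in itertools.product(ops.items(), repeat=2):
    for i, j, k in itertools.product(range(len(leaves)), repeat=3):
        expr = lambda v, f1=f1, f2=f2, i=i, j=j, k=k: f1(f2(leaves[i](v), leaves[j](v)), leaves[k](v))
        err = mse(expr)
        if best is None or err < best[0]:
            best = (err, f"({leaf_names[i]} {o2} {leaf_names[j]}) {o1} {leaf_names[k]}")

print(best)  # zero error at "(x0 * x0) + x1"
```

Real systems replace this brute-force loop with genetic search, but the objective, fitting data with a small analytic expression, is the same.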

19–23. How can I try this?
• Open-source
• Extensible Python API compatible with scikit-learn
• Can be distributed over 1000s of cores (w/ slurm, PBS, LSF, or Kubernetes)
• Custom operators, losses, constraints

24–30. Great! But, there’s a problem:
• Genetic algorithms, like PySR, scale terribly with expression complexity.
• One must search over: (permutations of operators) × (permutations of variables + possible constants)
• But, we know that neural networks can efficiently find very complex functions!
• Can we exploit this?
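The scaling complaint above can be quantified with a back-of-the-envelope count of distinct expression trees; the operator and leaf counts below are illustrative assumptions, not PySR defaults.

```python
from math import comb

def catalan(n):
    # Number of binary-tree shapes with n internal nodes.
    return comb(2 * n, n) // (n + 1)

def num_expressions(n_internal, n_ops=4, n_leaves=5):
    # shapes x operator labelings x leaf labelings
    return catalan(n_internal) * n_ops**n_internal * n_leaves**(n_internal + 1)

for n in (1, 3, 5, 10):
    print(n, num_expressions(n))  # grows explosively with n
```

Even with only 4 operators and 5 leaves, the space passes a hundred million expressions by 5 internal nodes, which is why searching directly for one large expression is hopeless.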

31–32. Symbolic Distillation
Neural network → approximation in my domain-specific language
Miles Cranmer, Rui Xu, Peter Battaglia and Shirley Ho, ML4Physics Workshop @ NeurIPS 2019
Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho, NeurIPS, 2020

33–35. How this works:
1. Train NN normally, and freeze parameters.
2. Record input/outputs of network over training set.
3. Fit the input/outputs of the neural network with PySR.
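The three steps can be sketched end-to-end in Python. To keep the sketch self-contained, PySR itself is swapped out for a toy least-squares search over a few hand-picked candidate formulas; the dataset, network size, and candidate list are all assumptions for the illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = 2 * X[:, 0] + X[:, 1] ** 2  # the "unknown" law (assumed for the demo)

# 1. Train the NN normally, then freeze it (no further fitting).
nn = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0).fit(X, y)

# 2. Record inputs/outputs of the frozen network over the training set.
y_nn = nn.predict(X)

# 3. Fit the recorded pairs symbolically (stand-in for PySR:
#    least-squares over hand-picked candidate formulas).
candidates = {
    "a*x0 + b*x1": np.stack([X[:, 0], X[:, 1]], axis=1),
    "a*x0 + b*x1**2": np.stack([X[:, 0], X[:, 1] ** 2], axis=1),
    "a*x0**2 + b*x1": np.stack([X[:, 0] ** 2, X[:, 1]], axis=1),
}
scores = {}
for name, B in candidates.items():
    coef, *_ = np.linalg.lstsq(B, y_nn, rcond=None)
    scores[name] = np.mean((B @ coef - y_nn) ** 2)

best = min(scores, key=scores.get)
print(best)
```

Note the symbolic fit targets the network's outputs `y_nn`, not the raw labels `y`: the formula is an approximation of the trained model, which is what makes this distillation rather than plain regression.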

36–37. Analogy: “Taylor expanding the Neural Network”

38–51. Full Symbolic Distillation
(f: learns features? g: uses features for calculation?)
Re-train g, to pick up any errors in the approximation of f 🔄
(g ∘ f)(x₁, x₂, x₃, x₄) = …
Fully-interpretable approximation of the original neural network!
• Easier to interpret and compare with existing models in the domain-specific language
• Easier to impose symbolic priors (can potentially get better generalization!)

52–53. Searching over expressions: n² → 2n
Instead of finding one complex expression, I have reduced it to finding multiple, simple expressions.

54. What about the functional degeneracy?
Any over-complicated functional form that f learns, g could invert!

55–56. Inductive bias
• Introducing some form of inductive bias is needed to eliminate the functional degeneracy. For example:
• the latent space between f and g could have some aggregation over a set! (e.g., y = g(∑ᵢ f(xᵢ)))

57–62. Inductive bias
• Other inductive biases to eliminate the degeneracy:
• Sparsity on latent space (fewer equations, fewer variables)
• (Also see related work of Sebastian Wetzel & Roger Melko; and Steve Brunton & Nathan Kutz!)
• Smoothness penalty (try to encourage expression-like behavior)
• “Disentangled sparsity”

63–67. Disentangled Sparsity
• Want few latent features AND want each latent feature to have few dependencies
• This makes things much easier for the genetic algorithm!
Miles Cranmer, Can Cui, et al. “Disentangled Sparsity Networks for Explainable AI”, Workshop on Sparse Neural Networks, 2021, pp. 7
https://astroautomata.com/data/sjnn_paper.pdf
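One way to write the two requirements down as a penalty is sketched below; this is only an illustrative loss under assumed shapes, not the exact objective from the paper. An L1 term on latent activations favors few active features, and an L1 term on each latent's incoming weights favors few input dependencies per feature.

```python
import numpy as np

def disentangled_sparsity_penalty(W, latents, alpha=1e-2, beta=1e-2):
    """W: (n_latent, n_input) first-layer weights; latents: (batch, n_latent)."""
    few_features = np.abs(latents).mean(axis=0).sum()   # L1: few active latent features
    few_dependencies = np.abs(W).sum(axis=1).mean()     # L1: few inputs per feature
    return alpha * few_features + beta * few_dependencies

# Toy values: latent 0 depends on one input; latent 1 is nearly inactive.
W = np.array([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])
latents = np.array([[0.2, 0.0], [0.1, 0.0]])
p = disentangled_sparsity_penalty(W, latents)
print(p)  # 0.01 * 0.15 + 0.01 * 1.0 = 0.0115
```

Adding a penalty like this to the training loss pushes the latent space toward a handful of simple features, exactly what downstream genetic search needs.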

68. Example: Graph neural network activations = forces, under a sparsity regularization
Miles Cranmer, Alvaro Sanchez-Gonzalez, Peter Battaglia, Rui Xu, Kyle Cranmer, David Spergel and Shirley Ho, NeurIPS, 2020

69–71. Example: Discovering Orbital Mechanics
Can we learn Newton’s law of gravity simply by observing the solar system? Unknown masses, and unknown dynamical model.
“Rediscovering orbital mechanics with machine learning” (2022), Pablo Lemos, Niall Jeffrey, Miles Cranmer, Shirley Ho, Peter Battaglia

72–80. Simplification:
• At some time t:
• Known position for each planet: xᵢ ∈ ℝ³
• Known acceleration for each planet: ẍᵢ ∈ ℝ³
• Unknown parameter for each planet: vᵢ ∈ ℝ
• Unknown force: f(xᵢ − xⱼ, vᵢ, vⱼ)
• Optimize: ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
(left side: known acceleration; right side: learned force law with learned parameters for planets i, j; Newton’s laws of motion assumed)
This allows us to find both f and vᵢ simultaneously.
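The optimization target above can be sketched directly in numpy, with an assumed Newtonian pairwise force standing in for the learned f (units and the two-body configuration are toy choices):

```python
import numpy as np

G = 1.0  # illustrative units

def f(dx, vi, vj):
    # Assumed stand-in for the learned pairwise force on body i from body j.
    r = np.linalg.norm(dx)
    return -G * vi * vj * dx / r**3

def predicted_accelerations(x, v):
    # xdd_i = (1/v_i) * sum_{j != i} f(x_i - x_j, v_i, v_j)
    n = len(x)
    acc = np.zeros_like(x)
    for i in range(n):
        total = sum(f(x[i] - x[j], v[i], v[j]) for j in range(n) if j != i)
        acc[i] = total / v[i]  # Newton's second law (assumed)
    return acc

# Two bodies at unit separation: |a_0| should be G * v_1 / r^2 = 3.
x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
v = np.array([1.0, 3.0])
acc = predicted_accelerations(x, v)
print(acc[0])  # -> approximately [3., 0., 0.]
```

With this parameterization the vᵢ play the role of masses: dividing the summed pairwise force by vᵢ is exactly the assumed Newton's-second-law step, so fitting f and the vᵢ jointly against observed accelerations is well-posed.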

81. Training:
• NASA’s HORIZONS ephemeris data
• 31 bodies: Sun, planets, moons with mass > 10¹⁸ kg
• (Therefore: 465 connections)
• 30 years, 1980–2010 for training
• 2010–2013 for validation
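A quick sanity check on the connection count: 465 is simply the number of unordered pairs among 31 bodies.

```python
from math import comb

# 31 bodies, pairwise interactions: C(31, 2) = 31 * 30 / 2
print(comb(31, 2))  # 465
```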

82. Next: interpretation
ẍᵢ ≈ (1/vᵢ) ∑_{j≠i} f(xᵢ − xⱼ, vᵢ, vⱼ)
Approximate input/output of f with symbolic regression.

83–90. Interpretation Results for f
(Plot: accuracy vs. complexity across candidate expressions for f.)
Selection score* = − d(log(error)) / d(complexity)
*from Cranmer+2020; similar to Schmidt & Lipson, 2009
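The score above can be computed by finite differences along the accuracy–complexity curve; the complexity and error values below are toy numbers, not the paper's results.

```python
import numpy as np

complexity = np.array([1, 3, 5, 7, 9])
error = np.array([1.0, 0.5, 0.05, 0.04, 0.039])  # toy losses along the Pareto front

# score = -d(log(error)) / d(complexity), by finite differences
score = -np.diff(np.log(error)) / np.diff(complexity)
best = complexity[1:][np.argmax(score)]
print(best)  # the complexity at which error drops fastest
```

The log makes the criterion favor multiplicative drops in error, so it picks the elbow of the Pareto front rather than the most complex (lowest-error) expression.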

91. Test the symbolic model:

92–93. Why isn’t this working well?
• Let’s look at the mass values in comparison with the true masses:

94–97. Solution: re-optimize vᵢ!
• The vᵢ were optimized for the neural network.
• The symbolic formula is not a *perfect* approximation of the network.
• Thus: we need to re-optimize vᵢ for the symbolic function f!
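For a force law like Newtonian gravity, the predicted accelerations are linear in the vⱼ (after the 1/vᵢ cancellation), so this re-optimization step can be sketched as a least-squares solve. The geometry and "true" masses below are toy assumptions for the illustration.

```python
import numpy as np

def design_matrix(x, G=1.0):
    # M[(3i..3i+2), j] = contribution of v_j to body i's acceleration,
    # under the assumed Newtonian pairwise force.
    n = len(x)
    M = np.zeros((n * 3, n))
    for i in range(n):
        for j in range(n):
            if j == i:
                continue
            dx = x[i] - x[j]
            M[3 * i : 3 * i + 3, j] = -G * dx / np.linalg.norm(dx) ** 3
    return M

x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
v_true = np.array([10.0, 1.0, 2.0])     # "true" masses (assumed)
acc_obs = design_matrix(x) @ v_true     # synthetic observed accelerations

# Re-fit the v_i with the (symbolic) force law held fixed.
v_refit, *_ = np.linalg.lstsq(design_matrix(x), acc_obs, rcond=None)
print(v_refit)  # -> approximately [10., 1., 2.]
```

In the real pipeline the symbolic f is only approximately the network, so the refit vᵢ differ from the network's; the point of this step is exactly to absorb that approximation error into the parameters.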

98. V. Ongoing Work: Turbulence
Work includes: Dmitrii Kochkov, Keaton Burns, Drummond Fielding, and others

99–102. Example: Learned Coarse Models for Efficient Turbulence Simulation (ICLR 2022)
Kimberly Stachenfeld, Drummond B. Fielding, Dmitrii Kochkov, Miles Cranmer, Tobias Pfaff, Jonathan Godwin, Can Cui, Shirley Ho, Peter Battaglia, Alvaro Sanchez-Gonzalez
Trained to reproduce turbulence simulations at lower resolution: 1000x speedup.
How did the model actually achieve this?

103. (Preliminary results)
(Plot: tensor components τxx, τxy, τyx, τyy — non-symmetric, since offset!)

104. Summary
• Symbolic distillation is a technique for translating ML models into a domain-specific language
• Can do this for expressions/programs using PySR
• Exciting future applications in understanding turbulence, and other physical systems