
Repurpose, Reuse, Recycle the building blocks of Machine Learning

Keynote at the Machine Learning Day @KTH, 17/5/23.

Transcript

  1. Today's Plan: Vapnik-Chervonenkis (VC) dimension. From: statistical learning theory and model selection. To: approximate frequent subgraph mining. 4
  2. Today's Plan: Vapnik-Chervonenkis (VC) dimension. From: statistical learning theory and model selection. To: approximate frequent subgraph mining. Automatic differentiation. From: backpropagation for deep learning. To: learning agent-based models. 4
  3. 5 reasons to like the VC dimension: first approximation algorithm for frequent subgraph mining; sampling-based algorithm; approximation guarantees on frequency; no false negatives, perfect recall; 100x faster than the exact algorithm. 6
  4. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. HARD! 8
  5. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. HARD! 8
  6. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. HARD! 8
  7. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|. HARD! 8
  8. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|. The VC dimension d of (𝒟, ℛ) is the largest cardinality of a set that is shattered by ℛ. HARD! 8
  9. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. 9
  10. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. 9
  11. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e. 9
  12. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e: there is no range R ∈ ℛ s.t. R ∩ {c, d, e} = {c, e}. 9
  13. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e: there is no range R ∈ ℛ s.t. R ∩ {c, d, e} = {c, e}. The VC dimension of this (𝒟, ℛ) = 2. 9
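A minimal brute-force sketch of the shattering argument above (illustrative only, not part of the deck): it checks that discrete intervals shatter a 2-point set but not a 3-point set.

```python
def projection(points, intervals):
    """Subsets of `points` obtainable by intersecting with each interval [lo, hi]."""
    return {frozenset(p for p in points if lo <= p <= hi) for lo, hi in intervals}

def is_shattered(points, intervals):
    """A set is shattered if its projection contains all 2^|points| subsets."""
    return len(projection(points, intervals)) == 2 ** len(points)

ground = range(10)                         # a small window of Z
intervals = [(a, b) for a in ground for b in ground if a <= b]

print(is_shattered({2, 7}, intervals))     # True: any 2-point set can be shattered
print(is_shattered({2, 5, 7}, intervals))  # False: no interval gives {2, 7} without 5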
  14. VC dimension in ML: Pr[ test error ≤ training error + √( (1/N)·( d·(log(2N/d) + 1) − log(δ/4) ) ) ] = 1 − δ. 10
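To get a numerical feel for the bound on slide 14, here is a quick evaluation of its gap term (an editorial sketch; the values of N, d, and δ are arbitrary examples):

```python
import math

def vc_gap(N, d, delta):
    """sqrt((d * (log(2N/d) + 1) - log(delta/4)) / N), the gap term in the VC bound."""
    return math.sqrt((d * (math.log(2 * N / d) + 1) - math.log(delta / 4)) / N)

print(vc_gap(N=10_000, d=10, delta=0.05))   # ~0.095
print(vc_gap(N=100_000, d=10, delta=0.05))  # ~0.034: more data, tighter bound
```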
  16. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? 11
  17. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? "When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results." 11
  18. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? "When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results." Trade-off among sample size, accuracy, and complexity of the task. 11
  19. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. 12
  20. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. 12
  21. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. Take a random sample of size N = 𝒪( (1/ε²)·(d + log(1/δ)) ). 12
  22. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. Take a random sample of size N = 𝒪( (1/ε²)·(d + log(1/δ)) ). It is an ε-sample for (𝒟, ℛ) with probability 1 − δ. 12
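The 𝒪(·) on slide 21 hides a universal constant; the sketch below (editorial, with an assumed constant c = 0.5) shows how the sufficient sample size behaves as a function of ε, δ, and d.

```python
import math

def eps_sample_size(eps, delta, d, c=0.5):
    """Sufficient size of a random sample to be an eps-sample, up to the constant c."""
    return math.ceil((c / eps**2) * (d + math.log(1 / delta)))

# e.g. VC dimension 2 (the intervals example), 1% error, 99% confidence
print(eps_sample_size(eps=0.01, delta=0.01, d=2))  # ~33,000 points, independent of |D|
```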
  23. Patterns and orbits. Pattern: connected labeled graph. HARD! 15
  24. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. HARD! 15
  25. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. Automorphism: isomorphism to itself. HARD! 15
  26. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. Automorphism: isomorphism to itself. Orbit: subset of a pattern's vertices mapped to each other by automorphisms; for a pattern P = (V_P, E_P) and a vertex v ∈ V_P, the orbit of v is {u ∈ V_P : ∃μ ∈ Aut(P) s.t. μ(u) = v}. The orbits of P partition V_P, and vertices in the same orbit have the same label. [Figure: examples of patterns and orbits; colors represent vertex labels; in the pattern on the left, v1 and v2 belong to the same orbit O1, while on the right each vertex is its own orbit.] HARD! 15
  27. Minimum Node-based Image (MNI). [Figure: a graph with vertices V1..V5, a pattern, and a table with columns Frequency and Image.] Image of the first orbit: {V1}. 17
  28. Minimum Node-based Image (MNI). [Figure as above.] Image of the first orbit: {V1}; frequency 1. 17
  30. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequency of the first orbit: 1. 17
  31. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4. 17
  32. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4; MNI frequency: min(1, 4) = 1. 17
  33. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4; MNI frequency: min(1, 4) = 1. Anti-monotone! 17
  34. Relative MNI frequency. ZV(q) = image set of orbit q of pattern P on V. Relative MNI frequency of pattern P in graph G = (V, E): fV(P) = min over q ∈ P of { |ZV(q)| / |V| }. 18
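Slide 34's definition in a few lines of code (editorial sketch; the toy image sets mirror the example of slides 27-33):

```python
def relative_mni_frequency(orbit_images, num_vertices):
    """fV(P) = min over orbits q of P of |ZV(q)| / |V|."""
    return min(len(z) for z in orbit_images) / num_vertices

orbit_images = [{"V1"}, {"V2", "V3", "V4", "V5"}]  # ZV(q) for the two orbits
print(relative_mni_frequency(orbit_images, 5))     # min(1, 4) / 5 = 0.2
```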
  35. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ... 19
  36. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ, find (P, εP) s.t. fV(P) − fS(P) = |ZV(q)|/|V| − |ZS(q)|/|S| ≤ εP. 19
  37. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ, find (P, εP) s.t. fV(P) − fS(P) = |ZV(q)|/|V| − |ZS(q)|/|S| ≤ εP. Compare with the ε-sample guarantee: | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε. 19
  41. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. 20
  42. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. δ ∈ (0,1): acceptable failure probability. S: uniform sample of V of size s. d: upper bound to the VC dimension. 20
  43. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. δ ∈ (0,1): acceptable failure probability. S: uniform sample of V of size s. d: upper bound to the VC dimension. With high probability, S is an ε-sample for (V, Ri) for ε = √( (d + log(1/δ)) / (2s) ). 20
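The ε on slide 43 is easy to evaluate; the sketch below plugs in illustrative values (editorial; d, s, and δ are made-up numbers):

```python
import math

def eps_from_sample(d, s, delta):
    """eps = sqrt((d + log(1/delta)) / (2s)) for a uniform sample of size s."""
    return math.sqrt((d + math.log(1 / delta)) / (2 * s))

print(eps_from_sample(d=4, s=2_000, delta=0.1))  # ~0.04
```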
  44. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum. 21
  45. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum: fV(Pi) − fS(Pi) ≤ εi ⟹ fS(Pi) ≥ fV(Pi) − εi ≥ τ − εi. 21
  46. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum: fV(Pi) − fS(Pi) ≤ εi ⟹ fS(Pi) ≥ fV(Pi) − εi ≥ τ − εi. This is a lower bound on the frequency of a frequent pattern in the sample. 21
  47. MaNIACS: 1) find the image sets ZS(q) of the orbits of unpruned patterns with i vertices; 2) use them to compute an upper bound to the VC dimension of (V, Ri); 3) compute εi such that S is an εi-sample for (V, Ri); 4) prune patterns that cannot be frequent, using the lower bound fS(Pi) ≥ τ − εi; 5) extend unpruned patterns to get candidate patterns with i + 1 vertices. 23
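A compact sketch of the pruning test used in steps 3-4 of the MaNIACS outline above (editorial illustration; the patterns, orbit image sizes, and parameter values are made up, and the real algorithm computes the image sets and the VC-dimension bound itself):

```python
import math

def prune(orbit_image_sizes, d, tau, delta, sample_size):
    """Keep only patterns whose sample MNI frequency can still reach tau."""
    eps = math.sqrt((d + math.log(1 / delta)) / (2 * sample_size))
    kept = {}
    for pattern, sizes in orbit_image_sizes.items():
        f_s = min(sizes) / sample_size  # sample frequency fS(P)
        if f_s >= tau - eps:            # lower-bound test from slide 46
            kept[pattern] = f_s
    return kept, eps

# Two hypothetical patterns with their |ZS(q)| on a 2,000-vertex sample
sizes = {"P1": [120, 300], "P2": [10, 40]}
print(prune(sizes, d=3, tau=0.05, delta=0.1, sample_size=2_000))  # P1 kept, P2 pruned
```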
  48. Results: first sampling-based algorithm; approximation guarantees on the computed frequency; no false negatives. [Plots: running time (s) vs. minimum frequency threshold τ for α=1, α=0.8, and the exact algorithm; MaxAE and its bound vs. sample size.] 24
  49. Results: first sampling-based algorithm; approximation guarantees on the computed frequency; no false negatives. [Plot: running time (s) vs. minimum frequency threshold τ for α=1, α=0.8, and the exact algorithm.] 24
  51. Autodiff. Set of techniques to evaluate the partial derivatives of a computer program. Chain rule to break up complex expressions: ∂f(g(x))/∂x = (∂f/∂g)·(∂g/∂x). Originally created for neural networks and deep learning (backpropagation). Different from numerical and symbolic differentiation. 26
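A tiny forward-mode illustration of the chain rule above, via dual numbers (editorial sketch; this is one of several ways to implement autodiff, not the mechanism of any specific library):

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    """A value together with its derivative; arithmetic applies the chain rule."""
    val: float
    dot: float  # derivative with respect to the chosen input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def exp(d):
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

# d/dx of x * exp(x) at x = 2, expected exp(2) * (1 + 2)
x = Dual(2.0, 1.0)             # seed dx/dx = 1
y = x * exp(x)
print(y.dot, math.exp(2) * 3)  # both ~22.17
```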
  52. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. 27
  53. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. 27
  54. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. Example: Mathematica. 27
  55. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. Example: Mathematica. Slow (search and apply rules) and large intermediate state. 27
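A quick demonstration of the rounding problem mentioned for the numerical alternative (editorial sketch): the forward-difference error first shrinks with h, then grows again once floating-point cancellation dominates.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # derivative of sin at x = 1
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    err = abs(forward_diff(math.sin, 1.0, h) - exact)
    print(f"h={h:.0e}  error={err:.2e}")
# best around h ~ 1e-6; for tiny h the error grows instead of vanishing
```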
  56. Example: Automatic Differentiation (autodiff). Create the computation graph for gradient computation. [Computation graph for y = 1/(1 + e^(−(w0·x0 + w1·x1 + w2))): multiply, add, multiply by −1, exp, add 1, take the reciprocal.] 30
  57. Example: Automatic Differentiation (autodiff). [Same computation graph for y = 1/(1 + e^(−(w0·x0 + w1·x1 + w2))).] Local derivative at the reciprocal node: f(x) = 1/x → ∂f/∂x = −1/x². 31
  58. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at the add-1 node: f(x) = x + 1 → ∂f/∂x = 1. 32
  59. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at the exp node: f(x) = eˣ → ∂f/∂x = eˣ. 33
  60. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at a product node: f(x, w) = x·w → ∂f/∂w = x. 34
  61. Example: Automatic Differentiation (autodiff). [Same computation graph.] Multiply the local derivatives backwards along the graph (chain rule) to obtain ∂y/∂w for each input weight. 35
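The same sigmoid example evaluated with reverse-mode autodiff in PyTorch (editorial sketch; assumes PyTorch is installed, and the input values are arbitrary):

```python
import torch

# y = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), as in the computation graph above
w = torch.tensor([2.0, -3.0, -3.0], requires_grad=True)
x = torch.tensor([-1.0, -2.0])

y = 1.0 / (1.0 + torch.exp(-(w[0] * x[0] + w[1] * x[1] + w[2])))
y.backward()     # one reverse sweep over the recorded computation graph

print(y.item())  # ~0.73
print(w.grad)    # [dy/dw0, dy/dw1, dy/dw2] = [x0*s, x1*s, s] with s = y*(1-y)
```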
  62. A few highlights. Machine Learning (TensorFlow and PyTorch are AD libraries specialized for ML). Learning protein structure (e.g., AlphaFold). Many-body Schrödinger equation (e.g., FermiNet). Stellarator coil design. Differentiable ray tracing. Model uncertainty & sensitivity. Optimization of fluid simulations. Example applications: neural networks, optimization, ray tracing, fluid simulations, many more... 37
  64. Agent-based model. Evolution over time of a system of autonomous agents. Mechanistic and causal model of behavior. Encodes sociological assumptions. Agents interact according to predefined rules. Agents are simulated to draw conclusions. 38
  65. Example: Schelling's segregation. 2 types of agents: R and B. Satisfaction Si: number of neighbors of the same color. Homophily parameter τ. If Si < τ → relocate. 39
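A toy version of the Schelling dynamics just described (editorial sketch; grid size, density, and τ are made-up values, and unsatisfied agents simply move to a random empty cell):

```python
import random

def neighbors(grid, r, c):
    n = len(grid)
    return [grid[r + dr][c + dc]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < n and 0 <= c + dc < n]

def step(grid, tau):
    """Relocate every agent with fewer than tau same-type neighbors."""
    n = len(grid)
    empty = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    for r in range(n):
        for c in range(n):
            agent = grid[r][c]
            if agent is None:
                continue
            satisfaction = sum(1 for x in neighbors(grid, r, c) if x == agent)
            if satisfaction < tau and empty:
                nr, nc = empty.pop(random.randrange(len(empty)))
                grid[nr][nc], grid[r][c] = agent, None
                empty.append((r, c))

random.seed(0)
n = 20
grid = [[random.choice(["R", "B", None]) for _ in range(n)] for _ in range(n)]
for _ in range(30):
    step(grid, tau=3)   # clusters of same-colored agents emerge over time
```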
  67. What about data? The ABM is a "theory development tool"; some people use it as a forecasting tool. Calibration of parameters: run simulations with different parameters until the model is able to reproduce summary statistics of the data. A manual, expensive, and error-prone process. 40
  68. Can we do better? Yes! Rewrite the ABM as a probabilistic generative model. Write the likelihood of the parameters given the data, ℒ(Θ|X). 41
  69. Can we do better? Yes! Rewrite the ABM as a probabilistic generative model. Write the likelihood of the parameters given the data, ℒ(Θ|X). Maximize it via automatic differentiation: Θ̂ = arg maxΘ ℒ(Θ|X). 41
  70. Bounded Confidence Model. Opinion xu ∈ [−1, 1]. Each time agents interact, they get closer if they are closer than ϵ+ (positive interaction). 43
  72. Repulsive behavior. Can interactions backfire? Each time agents interact, they get further away if they were further than ϵ− (negative interaction). 44
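The two interaction regimes on slides 70-72 as a single update rule (editorial sketch; the step size μ and the clipping of opinions to [−1, 1] are assumptions for illustration):

```python
def interact(x_u, x_v, eps_plus, eps_minus, mu=0.1):
    """One pairwise interaction: attract if opinions are close, repel if far."""
    d = abs(x_u - x_v)
    if d < eps_plus:        # positive interaction: opinions move closer
        x_u, x_v = x_u + mu * (x_v - x_u), x_v + mu * (x_u - x_v)
    elif d > eps_minus:     # negative interaction: opinions move apart
        x_u, x_v = x_u - mu * (x_v - x_u), x_v - mu * (x_u - x_v)
    clip = lambda x: max(-1.0, min(1.0, x))
    return clip(x_u), clip(x_v)

print(interact(0.1, 0.3, eps_plus=0.4, eps_minus=1.2))   # pulled together
print(interact(-0.8, 0.9, eps_plus=0.4, eps_minus=1.2))  # pushed apart (and clipped)
```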
  74. Opinion trajectories. [Figure: examples of synthetic data traces generated in each scenario; plots show the opinion trajectories over time for different (ϵ+, ϵ−) settings, e.g., (0.6, 1.2), (0.4, 0.6), (1.2, 1.6), (0.2, 1.6).] Parameter values encode different assumptions and determine significantly different latent trajectories. 45
  75. Rewrite as a probabilistic model. Replace the step function with a smooth version (sigmoid). Step rule: |xu − xv| > ϵ− ⟹ S(u, v) = −1. Smooth likelihood: P((u, v) ∈ E ∣ S(u, v) = −1) ∝ σ(|xu − xv| − ϵ−), a sigmoid of the opinion distance. 46
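Slide 75's relaxation in code (editorial sketch; the sharpness factor k is an added assumption, used only to show how the sigmoid approaches the original step rule):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_negative_interaction(x_u, x_v, eps_minus, k=10.0):
    """Smooth stand-in for the step rule |x_u - x_v| > eps_minus => S(u, v) = -1."""
    return sigmoid(k * (abs(x_u - x_v) - eps_minus))

print(p_negative_interaction(-0.9, 0.8, eps_minus=1.2))  # far apart: probability ~1
print(p_negative_interaction(0.1, 0.3, eps_minus=1.2))   # close together: ~0
```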
  76. Learning from data. Assume we observe the presence of interactions, but the signs are latent, and the opinions of the users are latent. Can we learn the dynamics and parameters of the system? 47
  77. Learning problem. [Figure: plate-notation diagram of the model, with latent opinions xt (x0 is the initial condition), latent edge signs s, and the observed interactions at each time step t.] Given the observable interactions G = (V, E), find: opinions for the nodes over time, x: V × {0,…, T} → [−1, 1], and the sign of each edge, s: E → {−, +}, with maximum likelihood. Use EM and gradient descent via automatic differentiation. 48
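A bare-bones version of that recipe, sketched by the editor: simulate a differentiable surrogate of the dynamics, score the observed trace with a likelihood, and let autodiff drive gradient-based fitting of a parameter. The three-agent trace, the squared-error loss, and the single parameter eps are all illustrative assumptions, not the model of the paper.

```python
import torch

torch.manual_seed(0)

def simulate(eps, steps=60, mu=0.3):
    """Differentiable two-agent bounded-confidence dynamics (third agent is inert)."""
    x = torch.tensor([-0.6, 0.2, 0.7])
    out = []
    for _ in range(steps):
        d = x[1] - x[0]
        gate = torch.sigmoid(10 * (eps - d.abs()))  # smooth "closer than eps" test
        x = x + mu * gate * torch.stack([d, -d, torch.tensor(0.0)])
        out.append(x)
    return torch.stack(out)

observed = simulate(torch.tensor(0.5)).detach()     # trace generated with eps = 0.5

eps = torch.tensor(1.0, requires_grad=True)         # start far from the truth
opt = torch.optim.Adam([eps], lr=0.05)
for _ in range(300):
    loss = ((simulate(eps) - observed) ** 2).sum()  # negative log-likelihood up to a constant
    opt.zero_grad()
    loss.backward()
    opt.step()

print(eps.item())  # should move close to the true value 0.5
```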
  78. Recovering parameters. [Figure: examples of synthetic data traces generated in each scenario; plots show the opinion trajectories over time for the four (ϵ+, ϵ−) settings.] 50
  79. Real data: Reddit. Comments, score = upvotes. Estimate the position of users and subreddits in opinion space. Larger estimated distance of a user from a subreddit → lower score of the user on that subreddit. 52
  81. Call to Action. Machine Learning is a treasure trove of interesting building blocks: the VC dimension for approximation algorithms, automatic differentiation for agent-based models. Repurpose them for your own goals. Be curious, be bold: hack and invent! 53
  82. G. Preti, G. De Francisci Morales, M. Riondato, “MaNIACS: Approximate Mining of Frequent Subgraph Patterns through Sampling”, KDD 2021 + ACM TIST 2023. C. Monti, G. De Francisci Morales, F. Bonchi, “Learning Opinion Dynamics From Social Traces”, KDD 2020. C. Monti, M. Pangallo, G. De Francisci Morales, F. Bonchi, “On Learning Agent-Based Models from Data”, SciRep 2022 (accepted) + arXiv:2205.05052. 54 [email protected] https://gdfm.me @gdfm7