Slide 1

Slide 1 text

Repurpose, Reuse, Recycle: the Building Blocks of Machine Learning
Gianmarco De Francisci Morales
Principal Researcher
gdfm@acm.org

Slide 2

Slide 2 text

Machine Learning

Slide 4

Slide 4 text

LEGO

Slide 6

Slide 6 text

Today's Plan

Slide 8

Slide 8 text

Today's Plan

Vapnik-Chervonenkis (VC) dimension
  From: statistical learning theory and model selection
  To: approximate frequent subgraph mining

Automatic differentiation
  From: backpropagation for deep learning
  To: learning agent-based models

Slide 9

Slide 9 text

VC dimension

Slide 10

Slide 10 text

5 reasons to like the VC dimension:
1. First approximation algorithm for frequent subgraph mining
2. Sampling-based algorithm
3. Approximation guarantees on frequency
4. No false negatives, perfect recall
5. 100x faster than the exact algorithm

Slide 11

Slide 11 text

Linear model in 2D: can shatter 3 points, cannot shatter 4 points.

Slide 17

Slide 17 text

VC dimension definition (HARD!)

Concept from statistical learning theory.
Informally: a measure of model capacity.

  𝒟: a set of elements called points
  ℛ ⊆ 2^𝒟: a family of subsets of 𝒟 called ranges
  (𝒟, ℛ) is a range space

The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}.

D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|.

The VC dimension d of (𝒟, ℛ) is the largest cardinality of a set that is shattered by ℛ.

Slide 24

Slide 24 text

Example: Intervals

Let 𝒟 be the elements of ℤ.
Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟.
Shattering a set of two elements of 𝒟 is easy.
It is impossible to shatter a set of three elements {c, d, e} with c < d < e:
there is no range R ∈ ℛ s.t. R ∩ {c, d, e} = {c, e}.
The VC dimension of this (𝒟, ℛ) is 2.

Slide 25

Slide 25 text

VC dimension in ML

Pr[ test error ≤ training error + √( (1/N) · ( d·(log(2N/d) + 1) − log(δ/4) ) ) ] = 1 − δ

Slide 31

Slide 31 text

VC dimension for data analysis

Dataset = Sample
How good an approximation can we get from a sample?
"When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results."
Trade-off among sample size, accuracy, and complexity of the task.

Slide 36

Slide 36 text

ε-sample and VC dimension

For ε ∈ (0,1), a subset A ⊆ 𝒟 is an ε-sample for (𝒟, ℛ) if

  | |R ∩ 𝒟| / |𝒟| − |R ∩ A| / |A| | ≤ ε, for every R ∈ ℛ

Given a range space (𝒟, ℛ) with VC dimension d, a random sample of size

  N = 𝒪( (1/ε²) (d + log(1/δ)) )

is an ε-sample for (𝒟, ℛ) with probability 1 − δ.

Slide 37

Slide 37 text

Example applications: Betweenness Centrality, Clustering Coefficient, Set Cover, Frequent Itemset Mining

Slide 38

Slide 38 text

Graph Pattern Mining

Slide 44

Slide 44 text

Patterns and orbits (HARD!)

Pattern: connected labeled graph
Pattern equality: isomorphism
Automorphism: isomorphism of a pattern to itself
Orbit: subset of the pattern's vertices mapped to each other by its automorphisms

[Figure: two example patterns with their orbits; colors represent vertex labels. In the left pattern, v1 and v2 belong to the same orbit; on the right, each vertex is in its own orbit.]

Slide 50

Slide 50 text

Frequency of a pattern

[Figure: an example graph and two patterns, with frequencies 1 and 4.]

Not anti-monotone!

Slide 59

Slide 59 text

Minimum Node-based Image (MNI)

[Figure: a pattern with two orbits and their images in the example graph: {V1} and {V2, V3, V4, V5}.]

Frequency: min(1, 4) = 1

Anti-monotone!

Slide 60

Slide 60 text

Relative MNI frequency

Z_V(q) = image set of orbit q of pattern P on graph G = (V, E)

Relative MNI frequency of pattern P in G:

  f_V(P) = min_{q ∈ P} { |Z_V(q)| / |V| }
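
The definition reduces to one line of code once the image sets are known. A sketch using the star example from the previous slide (the orbit image sets are taken from that figure):

```python
def relative_mni_frequency(image_sets, n_vertices):
    """f_V(P) = min over orbits q of |Z_V(q)| / |V|."""
    return min(len(z) for z in image_sets) / n_vertices

# Image sets of the two orbits of the star pattern on the example graph:
# the center orbit hits only V1, the leaf orbit hits V2..V5.
images = [{"V1"}, {"V2", "V3", "V4", "V5"}]
print(relative_mni_frequency(images, n_vertices=5))  # → 0.2
```

Taking the minimum over orbits is what makes the measure anti-monotone: extending a pattern can only shrink each orbit's image.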

Slide 66

Slide 66 text

Approx. Frequent Subgraph Mining

Given threshold τ and a sample S of vertices, with probability at least 1 − δ, for every pattern P with f_V(P) ≥ τ, find (P, ε_P) s.t.

  | f_V(P) − f_S(P) | = | |Z_V(q)|/|V| − |Z_S(q)|/|S| | ≤ ε_P

This is exactly the ε-sample guarantee:  | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε

Slide 73

Slide 73 text

Empirical VC dimension for FSG

Use the range space (V, R_i), where R_i = {Z_V(q) : q is an orbit of P with f_V(P) ≥ τ} (the orbits of frequent patterns).

Let δ ∈ (0,1) be the acceptable failure probability, S a uniform sample of V of size s, and d an upper bound to the VC dimension.

With high probability, S is an ε-sample for (V, R_i) for

  ε = √( (d + log(1/δ)) / (2s) )

Slide 78

Slide 78 text

Pruning

ε-sample guarantee:  | |R_i ∩ V|/|V| − |R_i ∩ S|/|S| | ≤ ε_i

Given that we can bound the error on every orbit, we can bound the error on its minimum:

  f_V(P_i) − f_S(P_i) ≤ ε_i  ⟹  f_S(P_i) ≥ f_V(P_i) − ε_i ≥ τ − ε_i

A lower bound on the frequency of a frequent pattern in the sample.

Slide 79

Slide 79 text

Search space

Slide 85

Slide 85 text

MaNIACS

1) Find the image sets Z_S(q) of the orbits of unpruned patterns with i vertices
2) Use them to compute an upper bound to the VC dimension of (V, R_i)
3) Compute ε_i such that S is an ε_i-sample for (V, R_i)
4) Prune patterns that cannot be frequent, with lower bound f_S(P_i) ≥ τ − ε_i
5) Extend unpruned patterns to get candidate patterns with i + 1 vertices
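
Steps 3 and 4 can be sketched in a few lines. The pattern names and frequencies below are toy stand-ins, and the ε formula is the one from the empirical VC dimension slide; the real algorithm computes the sample frequencies from the orbit image sets:

```python
import math

def maniacs_level(sample_freqs, tau, d, s, delta):
    """One level of the loop: compute eps_i from the VC upper bound d,
    then keep only patterns whose sample frequency clears tau - eps_i."""
    eps_i = math.sqrt((d + math.log(1 / delta)) / (2 * s))
    survivors = {p: f for p, f in sample_freqs.items() if f >= tau - eps_i}
    return survivors, eps_i

# Hypothetical sample frequencies for three single-edge patterns.
freqs = {"edge(A-B)": 0.42, "edge(A-A)": 0.18, "edge(B-B)": 0.05}
kept, eps = maniacs_level(freqs, tau=0.2, d=3, s=5000, delta=0.1)
```

With these numbers, ε_i ≈ 0.023, so "edge(A-A)" survives even though its sample frequency is below τ: the slack τ − ε_i is what guarantees no false negatives.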

Slide 86

Slide 86 text

Results

First sampling-based algorithm
Approximation guarantees on computed frequency
No false negatives

[Plots: running time (s) vs. minimum frequency threshold τ for α=1, α=0.8, and the exact algorithm; MaxAE and its bound vs. sample size.]

Slide 89

Slide 89 text

Automatic Differentiation

Slide 90

Slide 90 text

Autodiff

Set of techniques to evaluate the partial derivatives of a computer program.
Uses the chain rule to break down complex expressions:

  ∂f(g(x))/∂x = (∂f/∂g) · (∂g/∂x)

Originally created for neural networks and deep learning (backpropagation).
Different from both numerical and symbolic differentiation.

Slide 96

Slide 96 text

Alternatives

Numerical:  ∂f(x)/∂x_i ≈ ( f(x + h·e_i) − f(x) ) / h, for a small finite h
Slow (needs separate evaluations for each dimension) and suffers from rounding errors.

Symbolic: input = computation graph, output = symbolic derivative (example: Mathematica).
Slow (search and apply rules) and produces large intermediate expressions.
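
The per-dimension cost of the numerical approach is easy to see in code. A minimal sketch (central differences are used here for better accuracy than the one-sided formula on the slide):

```python
def numerical_grad(f, x, h=1e-6):
    """Central finite differences: one pair of evaluations per dimension."""
    grad = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        grad.append((f(xp) - f(xm)) / (2 * h))
    return grad

f = lambda v: v[0] ** 2 + 3 * v[1]    # toy function, true gradient (2x, 3)
print(numerical_grad(f, [2.0, 5.0]))  # close to [4.0, 3.0], but not exact
```

For a function of a million parameters this needs two million evaluations per gradient; autodiff gets the whole gradient in a small constant multiple of one evaluation.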

Slide 97

Slide 97 text

Computational graph

Slide 98

Slide 98 text

Forward/Reverse mode
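
Forward mode is small enough to sketch with dual numbers: every operation carries a (value, derivative) pair and applies the chain rule locally. The `Dual` class below is an illustration, not any library's API:

```python
class Dual:
    """Forward-mode autodiff via dual numbers: each operation propagates
    both the value and its derivative using the local chain rule."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)
    __rmul__ = __mul__

# d/dx of x*x + 3x at x = 2 is 2x + 3 = 7
x = Dual(2.0, 1.0)   # seed the input's derivative with 1
y = x * x + 3 * x
print(y.val, y.dot)  # → 10.0 7.0
```

Reverse mode (backpropagation) instead records the graph on the forward pass and sweeps it backward, which is why it wins when there are many inputs and one output, as in neural network training.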

Slide 99

Slide 99 text

Example: Automatic Differentiation (autodiff)

Create the computation graph for the gradient of

  z = 1 / (1 + e^(−(w1·x1 + w2·x2 + w3)))

and walk it backward, applying the local derivative rule at each node:

  f(y) = 1/y     →  ∂f/∂y = −1/y²
  f(y) = y + 1   →  ∂f/∂y = 1
  f(y) = e^y     →  ∂f/∂y = e^y
  f(y, w) = y·w  →  ∂f/∂y = w, ∂f/∂w = y

Multiplying the local derivatives along each path yields ∂z/∂w_i.

Slide 105

Slide 105 text

Libraries

Slide 106

Slide 106 text

A few highlights

Machine Learning (TensorFlow and PyTorch are AD libraries specialized for ML)
Learning protein structure (e.g., AlphaFold)
Many-body Schrödinger equation (e.g., FermiNet)
Stellarator coil design
Differentiable ray tracing
Model uncertainty & sensitivity
Optimization of fluid simulations

Example applications: neural networks, optimization, ray tracing, fluid simulations, and many more...

Slide 108

Slide 108 text

Agent-based model

Evolution over time of a system of autonomous agents
Mechanistic and causal model of behavior
Encodes sociological assumptions
Agents interact according to predefined rules
Agents are simulated to draw conclusions

Slide 109

Slide 109 text

Example: Schelling's segregation

2 types of agents: R and B
Satisfaction S_i: number of neighbors of the same color
Homophily parameter τ
If S_i < τ → relocate
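
The rule fits in a few lines. A minimal 1-D sketch: the ring topology, the sweep order, and the random relocation target are all illustrative assumptions, not part of the slide:

```python
import random

def schelling_step(cells, tau, rng):
    """One sweep of a minimal 1-D Schelling model: an agent whose count
    S_i of same-type neighbors falls below tau relocates to a random
    empty cell."""
    n = len(cells)
    for i in range(n):
        agent = cells[i]
        if agent is None:
            continue
        neighbors = [cells[(i - 1) % n], cells[(i + 1) % n]]
        satisfaction = sum(1 for b in neighbors if b == agent)
        if satisfaction < tau:
            empties = [j for j in range(n) if cells[j] is None]
            if empties:
                j = rng.choice(empties)
                cells[i], cells[j] = None, agent
    return cells

rng = random.Random(0)
cells = ["R", "B", "R", "B", None, "R", "B", None]
schelling_step(cells, tau=1, rng=rng)
```

Note the dynamics are a hard threshold on S_i: this discontinuity is exactly what must be smoothed later to make such models differentiable.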

Slide 111

Slide 111 text

What about data?

ABM is a "theory development tool"
Some people use it as a forecasting tool
Calibration of parameters: run simulations with different parameters until the model reproduces summary statistics of the data
A manual, expensive, and error-prone process

Slide 116

Slide 116 text

Can we do better? Yes!

Rewrite the ABM as a probabilistic generative model
Write the likelihood ℒ(Θ|X) of the parameters given the data
Maximize via automatic differentiation:  Θ̂ = arg max_Θ ℒ(Θ|X)

Slide 117

Slide 117 text

Opinion dynamics

How people's beliefs evolve
Polarization, radicalization, echo chambers
Data from social media

Slide 119

Slide 119 text

Bounded Confidence Model

Opinion x_u ∈ [−1, 1]
Each time agents interact, they get closer if their opinions are closer than ε⁺ (positive interaction)

Slide 121

Slide 121 text

Repulsive behavior

Can interactions backfire?
Each time agents interact, they get further apart if their opinions were further than ε⁻ (negative interaction)
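
Both interaction rules can be sketched in one function. The convergence rate `mu` and the simultaneous symmetric update are assumptions, not values from the slides:

```python
def interact(xu, xv, eps_pos, eps_neg, mu=0.1):
    """One bounded-confidence interaction with repulsion: attract when
    opinions are closer than eps_pos, repel when further than eps_neg."""
    d = abs(xu - xv)
    step = mu * (xv - xu)
    if d < eps_pos:        # positive interaction: move closer
        xu, xv = xu + step, xv - step
    elif d > eps_neg:      # negative interaction: move apart
        xu, xv = xu - step, xv + step
    # clamp to the opinion space [-1, 1]
    clip = lambda x: max(-1.0, min(1.0, x))
    return clip(xu), clip(xv)
```

When the distance lies between ε⁺ and ε⁻ the interaction has no effect, which is the source of the model's multiple stable clusters.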

Slide 123

Slide 123 text

Opinion Trajectories

[Figure: synthetic opinion trajectories over time for four (ε⁺, ε⁻) scenarios.]

Parameter values encode different assumptions and determine significantly different latent trajectories.

Slide 124

Slide 124 text

Rewrite as a probabilistic model

Replace the step function with a smooth version (sigmoid):

  Opinion distance:  |x_u − x_v| > ε⁻  ⟹  S(u, v) = −1

  Likelihood:  P((u, v) ∈ E ∣ S(u, v) = −1) ∝ σ( |x_u − x_v| − ε⁻ )

Slide 125

Slide 125 text

Learning from data

Assume we observe the presence of interactions
But the signs are latent
And the opinions of the users are latent
Can we learn the dynamics and parameters of the system?

Slide 126

Slide 126 text

Learning problem

Given observable interactions G = (V, E), find:
  opinions x_t for the nodes, V × {0, …, T} → [−1, 1]
  sign s of each edge, E → {−, +}
with maximum likelihood.

Use EM and gradient descent via automatic differentiation.

Slide 127

Slide 127 text

Reconstructing synthetic data

[Figure: estimated vs. true x_0 and estimated vs. true x_t.]

Slide 128

Slide 128 text

Recovering parameters

[Figure: examples of synthetic data traces generated in each (ε⁺, ε⁻) scenario; plots show the opinion trajectories over time.]

Slide 130

Slide 130 text

Real data: Reddit

Comment score = upvotes
Estimate positions of users and subreddits in opinion space
Larger estimated distance of a user from a subreddit → lower score of that user on that subreddit

Slide 132

Slide 132 text

Call to Action

Machine Learning is a treasure trove of interesting building blocks:
  VC dimension for approximation algorithms
  Automatic differentiation for agent-based models
Repurpose it for your own goals.
Be curious, be bold: hack and invent!

Slide 133

Slide 133 text

G. Preti, G. De Francisci Morales, M. Riondato. "MaNIACS: Approximate Mining of Frequent Subgraph Patterns through Sampling." KDD 2021 + ACM TIST 2023.

C. Monti, G. De Francisci Morales, F. Bonchi. "Learning Opinion Dynamics From Social Traces." KDD 2020.

C. Monti, M. Pangallo, G. De Francisci Morales, F. Bonchi. "On Learning Agent-Based Models from Data." SciRep 2022 (accepted) + arXiv:2205.05052.

gdfm@acm.org · https://gdfm.me · @gdfm7