
Repurpose, Reuse, Recycle the building blocks of Machine Learning

Keynote at the Machine Learning Day @KTH, 17/5/23.

Transcript

  1. Today's Plan: Vapnik-Chervonenkis (VC) dimension. From: statistical learning theory and model selection. To: approximate frequent subgraph mining. 4
  2. Today's Plan: Vapnik-Chervonenkis (VC) dimension. From: statistical learning theory and model selection. To: approximate frequent subgraph mining. Automatic differentiation. From: backpropagation for deep learning. To: learning agent-based models. 4
  3. 5 reasons to like the VC dimension: first approximation algorithm for frequent subgraph mining; sampling-based algorithm; approximation guarantees on frequency; no false negatives, perfect recall; 100x faster than the exact algorithm. 6
  4. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. HARD! 8
  5. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. HARD! 8
  6. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. HARD! 8
  7. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|. HARD! 8
  8. VC dimension definition. Concept from statistical learning theory. Informally: a measure of model capacity. A range space (𝒟, ℛ) consists of a set 𝒟 of elements called points and a family ℛ ⊆ 2^𝒟 of subsets of 𝒟 called ranges. The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D ∣ h ∈ ℛ}. D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|. The VC dimension d of (𝒟, ℛ) is the largest cardinality of a set that is shattered by ℛ. HARD! 8
  9. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. 9
  10. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. 9
  11. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e. 9
  12. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e: there is no range R ∈ ℛ s.t. R ∩ {c, d, e} = {c, e}. 9
  13. Example: intervals. Let 𝒟 be the elements of ℤ. Let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟. Shattering a set of two elements of 𝒟 is easy. It is impossible to shatter a set of three elements {c, d, e} with c < d < e: there is no range R ∈ ℛ s.t. R ∩ {c, d, e} = {c, e}. The VC dimension of this (𝒟, ℛ) = 2. 9
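A minimal brute-force sketch of the shattering argument above (illustrative only, not part of the deck): it checks that discrete intervals shatter a 2-point set but not a 3-point set.

```python
def projection(points, intervals):
    """Subsets of `points` obtainable by intersecting with each interval [lo, hi]."""
    return {frozenset(p for p in points if lo <= p <= hi) for lo, hi in intervals}

def is_shattered(points, intervals):
    """A set is shattered if its projection contains all 2^|points| subsets."""
    return len(projection(points, intervals)) == 2 ** len(points)

ground = range(10)                         # a small window of Z
intervals = [(a, b) for a in ground for b in ground if a <= b]

print(is_shattered({2, 7}, intervals))     # True: any 2-point set can be shattered
print(is_shattered({2, 5, 7}, intervals))  # False: no interval gives {2, 7} without 5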
  14. VC dimension in ML: Pr[ test error ≤ training error + √( (1/N)·( d·(log(2N/d) + 1) − log(δ/4) ) ) ] = 1 − δ. 10
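To get a numerical feel for the bound on slide 14, here is a quick evaluation of its gap term (an editorial sketch; the values of N, d, and δ are arbitrary examples):

```python
import math

def vc_gap(N, d, delta):
    """sqrt((d * (log(2N/d) + 1) - log(delta/4)) / N), the gap term in the VC bound."""
    return math.sqrt((d * (math.log(2 * N / d) + 1) - math.log(delta / 4)) / N)

print(vc_gap(N=10_000, d=10, delta=0.05))   # ~0.095
print(vc_gap(N=100_000, d=10, delta=0.05))  # ~0.034: more data, tighter bound
```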
  16. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? 11
  17. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? "When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results." 11
  18. VC dimension for data analysis. Dataset = sample. How good an approximation can we get from a sample? "When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results." Trade-off among sample size, accuracy, and complexity of the task. 11
  19. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. 12
  20. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. 12
  21. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. Take a random sample of size N = 𝒪( (1/ε²)·(d + log(1/δ)) ). 12
  22. ε-sample and VC dimension. ε-sample for (𝒟, ℛ): for ε ∈ (0,1), a subset A ⊆ 𝒟 s.t. | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε, for every R ∈ ℛ. Let (𝒟, ℛ) be a range space with VC dimension d. Take a random sample of size N = 𝒪( (1/ε²)·(d + log(1/δ)) ). It is an ε-sample for (𝒟, ℛ) with probability 1 − δ. 12
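The 𝒪(·) on slide 21 hides a universal constant; the sketch below (editorial, with an assumed constant c = 0.5) shows how the sufficient sample size behaves as a function of ε, δ, and d.

```python
import math

def eps_sample_size(eps, delta, d, c=0.5):
    """Sufficient size of a random sample to be an eps-sample, up to the constant c."""
    return math.ceil((c / eps**2) * (d + math.log(1 / delta)))

# e.g. VC dimension 2 (the intervals example), 1% error, 99% confidence
print(eps_sample_size(eps=0.01, delta=0.01, d=2))  # ~33,000 points, independent of |D|
```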
  23. Patterns and orbits. Pattern: connected labeled graph. HARD! 15
  24. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. HARD! 15
  25. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. Automorphism: isomorphism to itself. HARD! 15
  26. Patterns and orbits. Pattern: connected labeled graph. Pattern equality: isomorphism. Automorphism: isomorphism to itself. Orbit: subset of a pattern's vertices mapped to each other by automorphisms; for a pattern P = (V_P, E_P) and a vertex v ∈ V_P, the orbit of v is {u ∈ V_P : ∃μ ∈ Aut(P) s.t. μ(u) = v}. The orbits of P partition V_P, and vertices in the same orbit have the same label. [Figure: examples of patterns and orbits; colors represent vertex labels; in the pattern on the left, v1 and v2 belong to the same orbit O1, while on the right each vertex is its own orbit.] HARD! 15
  27. Minimum Node-based Image (MNI). [Figure: a graph with vertices V1..V5, a pattern, and a table with columns Frequency and Image.] Image of the first orbit: {V1}. 17
  28. Minimum Node-based Image (MNI). [Figure as above.] Image of the first orbit: {V1}; frequency 1. 17
  30. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequency of the first orbit: 1. 17
  31. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4. 17
  32. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4; MNI frequency: min(1, 4) = 1. 17
  33. Minimum Node-based Image (MNI). [Figure as above.] Orbit images: {V1} and {V2, V3, V4, V5}; frequencies 1 and 4; MNI frequency: min(1, 4) = 1. Anti-monotone! 17
  34. Relative MNI frequency. ZV(q) = image set of orbit q of pattern P on V. Relative MNI frequency of pattern P in graph G = (V, E): fV(P) = min over q ∈ P of { |ZV(q)| / |V| }. 18
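Slide 34's definition in a few lines of code (editorial sketch; the toy image sets mirror the example of slides 27-33):

```python
def relative_mni_frequency(orbit_images, num_vertices):
    """fV(P) = min over orbits q of P of |ZV(q)| / |V|."""
    return min(len(z) for z in orbit_images) / num_vertices

orbit_images = [{"V1"}, {"V2", "V3", "V4", "V5"}]  # ZV(q) for the two orbits
print(relative_mni_frequency(orbit_images, 5))     # min(1, 4) / 5 = 0.2
```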
  35. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ... 19
  36. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ, find (P, εP) s.t. fV(P) − fS(P) = |ZV(q)|/|V| − |ZS(q)|/|S| ≤ εP. 19
  37. Approx. Frequent Subgraph Mining. Given a threshold τ and a sample of vertices S: with probability at least 1 − δ, for every pattern P with fV(P) ≥ τ, find (P, εP) s.t. fV(P) − fS(P) = |ZV(q)|/|V| − |ZS(q)|/|S| ≤ εP. Compare with the ε-sample guarantee: | |R ∩ 𝒟|/|𝒟| − |R ∩ A|/|A| | ≤ ε. 19
  41. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. 20
  42. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. δ ∈ (0,1): acceptable failure probability. S: uniform sample of V of size s. d: upper bound to the VC dimension. 20
  43. Empirical VC dimension for FSG. Orbits of frequent patterns: use the range space (V, Ri), with Ri = {ZV(q) : q is an orbit of P with fV(P) ≥ τ}. δ ∈ (0,1): acceptable failure probability. S: uniform sample of V of size s. d: upper bound to the VC dimension. With high probability, S is an ε-sample for (V, Ri) for ε = √( (d + log(1/δ)) / (2s) ). 20
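The ε on slide 43 is easy to evaluate; the sketch below plugs in illustrative values (editorial; d, s, and δ are made-up numbers):

```python
import math

def eps_from_sample(d, s, delta):
    """eps = sqrt((d + log(1/delta)) / (2s)) for a uniform sample of size s."""
    return math.sqrt((d + math.log(1 / delta)) / (2 * s))

print(eps_from_sample(d=4, s=2_000, delta=0.1))  # ~0.04
```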
  44. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum. 21
  45. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum: fV(Pi) − fS(Pi) ≤ εi ⟹ fS(Pi) ≥ fV(Pi) − εi ≥ τ − εi. 21
  46. Pruning. ε-sample guarantee: | |Ri ∩ V|/|V| − |Ri ∩ S|/|S| | ≤ εi. Given that we can bound the error on every orbit, we can bound the error on its minimum: fV(Pi) − fS(Pi) ≤ εi ⟹ fS(Pi) ≥ fV(Pi) − εi ≥ τ − εi. This is a lower bound on the frequency of a frequent pattern in the sample. 21
  47. MaNIACS: 1) find the image sets ZS(q) of the orbits of unpruned patterns with i vertices; 2) use them to compute an upper bound to the VC dimension of (V, Ri); 3) compute εi such that S is an εi-sample for (V, Ri); 4) prune patterns that cannot be frequent, using the lower bound fS(Pi) ≥ τ − εi; 5) extend unpruned patterns to get candidate patterns with i + 1 vertices. 23
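A compact sketch of the pruning test used in steps 3-4 of the MaNIACS outline above (editorial illustration; the patterns, orbit image sizes, and parameter values are made up, and the real algorithm computes the image sets and the VC-dimension bound itself):

```python
import math

def prune(orbit_image_sizes, d, tau, delta, sample_size):
    """Keep only patterns whose sample MNI frequency can still reach tau."""
    eps = math.sqrt((d + math.log(1 / delta)) / (2 * sample_size))
    kept = {}
    for pattern, sizes in orbit_image_sizes.items():
        f_s = min(sizes) / sample_size  # sample frequency fS(P)
        if f_s >= tau - eps:            # lower-bound test from slide 46
            kept[pattern] = f_s
    return kept, eps

# Two hypothetical patterns with their |ZS(q)| on a 2,000-vertex sample
sizes = {"P1": [120, 300], "P2": [10, 40]}
print(prune(sizes, d=3, tau=0.05, delta=0.1, sample_size=2_000))  # P1 kept, P2 pruned
```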
  48. Results: first sampling-based algorithm; approximation guarantees on the computed frequency; no false negatives. [Plots: running time (s) vs. minimum frequency threshold τ for α=1, α=0.8, and the exact algorithm; MaxAE and its bound vs. sample size.] 24
  49. Results: first sampling-based algorithm; approximation guarantees on the computed frequency; no false negatives. [Plot: running time (s) vs. minimum frequency threshold τ for α=1, α=0.8, and the exact algorithm.] 24
  51. Autodiff. Set of techniques to evaluate the partial derivatives of a computer program. Chain rule to break up complex expressions: ∂f(g(x))/∂x = (∂f/∂g)·(∂g/∂x). Originally created for neural networks and deep learning (backpropagation). Different from numerical and symbolic differentiation. 26
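A tiny forward-mode illustration of the chain rule above, via dual numbers (editorial sketch; this is one of several ways to implement autodiff, not the mechanism of any specific library):

```python
from dataclasses import dataclass
import math

@dataclass
class Dual:
    """A value together with its derivative; arithmetic applies the chain rule."""
    val: float
    dot: float  # derivative with respect to the chosen input

    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

def exp(d):
    return Dual(math.exp(d.val), math.exp(d.val) * d.dot)

# d/dx of x * exp(x) at x = 2, expected exp(2) * (1 + 2)
x = Dual(2.0, 1.0)             # seed dx/dx = 1
y = x * exp(x)
print(y.dot, math.exp(2) * 3)  # both ~22.17
```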
  52. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. 27
  53. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. 27
  54. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. Example: Mathematica. 27
  55. Alternatives. Numerical: ∂f(x)/∂xi ≈ (f(x + h·ei) − f(x)) / h for small h. Slow (need to evaluate each dimension) and errors due to rounding. Symbolic: input = computation graph, output = symbolic derivative. Example: Mathematica. Slow (search and apply rules) and large intermediate state. 27
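A quick demonstration of the rounding problem mentioned for the numerical alternative (editorial sketch): the forward-difference error first shrinks with h, then grows again once floating-point cancellation dominates.

```python
import math

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

exact = math.cos(1.0)  # derivative of sin at x = 1
for h in (1e-2, 1e-6, 1e-10, 1e-14):
    err = abs(forward_diff(math.sin, 1.0, h) - exact)
    print(f"h={h:.0e}  error={err:.2e}")
# best around h ~ 1e-6; for tiny h the error grows instead of vanishing
```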
  56. Example: Automatic Differentiation (autodiff). Create the computation graph for gradient computation. [Computation graph for y = 1/(1 + e^(−(w0·x0 + w1·x1 + w2))): multiply, add, multiply by −1, exp, add 1, take the reciprocal.] 30
  57. Example: Automatic Differentiation (autodiff). [Same computation graph for y = 1/(1 + e^(−(w0·x0 + w1·x1 + w2))).] Local derivative at the reciprocal node: f(x) = 1/x → ∂f/∂x = −1/x². 31
  58. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at the add-1 node: f(x) = x + 1 → ∂f/∂x = 1. 32
  59. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at the exp node: f(x) = eˣ → ∂f/∂x = eˣ. 33
  60. Example: Automatic Differentiation (autodiff). [Same computation graph.] Local derivative at a product node: f(x, w) = x·w → ∂f/∂w = x. 34
  61. Example: Automatic Differentiation (autodiff). [Same computation graph.] Multiply the local derivatives backwards along the graph (chain rule) to obtain ∂y/∂w for each input weight. 35
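The same sigmoid example evaluated with reverse-mode autodiff in PyTorch (editorial sketch; assumes PyTorch is installed, and the input values are arbitrary):

```python
import torch

# y = 1 / (1 + exp(-(w0*x0 + w1*x1 + w2))), as in the computation graph above
w = torch.tensor([2.0, -3.0, -3.0], requires_grad=True)
x = torch.tensor([-1.0, -2.0])

y = 1.0 / (1.0 + torch.exp(-(w[0] * x[0] + w[1] * x[1] + w[2])))
y.backward()     # one reverse sweep over the recorded computation graph

print(y.item())  # ~0.73
print(w.grad)    # [dy/dw0, dy/dw1, dy/dw2] = [x0*s, x1*s, s] with s = y*(1-y)
```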
  62. A few highlights. Machine Learning (TensorFlow and PyTorch are AD libraries specialized for ML). Learning protein structure (e.g., AlphaFold). Many-body Schrödinger equation (e.g., FermiNet). Stellarator coil design. Differentiable ray tracing. Model uncertainty & sensitivity. Optimization of fluid simulations. Example applications: neural networks, optimization, ray tracing, fluid simulations, many more... 37
  64. Agent-based model. Evolution over time of a system of autonomous agents. Mechanistic and causal model of behavior. Encodes sociological assumptions. Agents interact according to predefined rules. Agents are simulated to draw conclusions. 38
  65. Example: Schelling's segregation. 2 types of agents: R and B. Satisfaction Si: number of neighbors of the same color. Homophily parameter τ. If Si < τ → relocate. 39
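A toy version of the Schelling dynamics just described (editorial sketch; grid size, density, and τ are made-up values, and unsatisfied agents simply move to a random empty cell):

```python
import random

def neighbors(grid, r, c):
    n = len(grid)
    return [grid[r + dr][c + dc]
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < n and 0 <= c + dc < n]

def step(grid, tau):
    """Relocate every agent with fewer than tau same-type neighbors."""
    n = len(grid)
    empty = [(r, c) for r in range(n) for c in range(n) if grid[r][c] is None]
    for r in range(n):
        for c in range(n):
            agent = grid[r][c]
            if agent is None:
                continue
            satisfaction = sum(1 for x in neighbors(grid, r, c) if x == agent)
            if satisfaction < tau and empty:
                nr, nc = empty.pop(random.randrange(len(empty)))
                grid[nr][nc], grid[r][c] = agent, None
                empty.append((r, c))

random.seed(0)
n = 20
grid = [[random.choice(["R", "B", None]) for _ in range(n)] for _ in range(n)]
for _ in range(30):
    step(grid, tau=3)   # clusters of same-colored agents emerge over time
```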
  67. What about data? The ABM is a "theory development tool"; some people use it as a forecasting tool. Calibration of parameters: run simulations with different parameters until the model is able to reproduce summary statistics of the data. A manual, expensive, and error-prone process. 40
  68. Can we do better? Yes! Rewrite the ABM as a probabilistic generative model. Write the likelihood of the parameters given the data, ℒ(Θ|X). 41
  69. Can we do better? Yes! Rewrite the ABM as a probabilistic generative model. Write the likelihood of the parameters given the data, ℒ(Θ|X). Maximize it via automatic differentiation: Θ̂ = arg maxΘ ℒ(Θ|X). 41
  70. Bounded Confidence Model. Opinion xu ∈ [−1, 1]. Each time agents interact, they get closer if they are closer than ϵ+ (positive interaction). 43
  72. Repulsive behavior. Can interactions backfire? Each time agents interact, they get further away if they were further than ϵ− (negative interaction). 44
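The two interaction regimes on slides 70-72 as a single update rule (editorial sketch; the step size μ and the clipping of opinions to [−1, 1] are assumptions for illustration):

```python
def interact(x_u, x_v, eps_plus, eps_minus, mu=0.1):
    """One pairwise interaction: attract if opinions are close, repel if far."""
    d = abs(x_u - x_v)
    if d < eps_plus:        # positive interaction: opinions move closer
        x_u, x_v = x_u + mu * (x_v - x_u), x_v + mu * (x_u - x_v)
    elif d > eps_minus:     # negative interaction: opinions move apart
        x_u, x_v = x_u - mu * (x_v - x_u), x_v - mu * (x_u - x_v)
    clip = lambda x: max(-1.0, min(1.0, x))
    return clip(x_u), clip(x_v)

print(interact(0.1, 0.3, eps_plus=0.4, eps_minus=1.2))   # pulled together
print(interact(-0.8, 0.9, eps_plus=0.4, eps_minus=1.2))  # pushed apart (and clipped)
```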
  74. Opinion trajectories. [Figure: examples of synthetic data traces generated in each scenario; plots show the opinion trajectories over time for different (ϵ+, ϵ−) settings, e.g., (0.6, 1.2), (0.4, 0.6), (1.2, 1.6), (0.2, 1.6).] Parameter values encode different assumptions and determine significantly different latent trajectories. 45
  75. Rewrite as a probabilistic model. Replace the step function with a smooth version (sigmoid). Step rule: |xu − xv| > ϵ− ⟹ S(u, v) = −1. Smooth likelihood: P((u, v) ∈ E ∣ S(u, v) = −1) ∝ σ(|xu − xv| − ϵ−), a sigmoid of the opinion distance. 46
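Slide 75's relaxation in code (editorial sketch; the sharpness factor k is an added assumption, used only to show how the sigmoid approaches the original step rule):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_negative_interaction(x_u, x_v, eps_minus, k=10.0):
    """Smooth stand-in for the step rule |x_u - x_v| > eps_minus => S(u, v) = -1."""
    return sigmoid(k * (abs(x_u - x_v) - eps_minus))

print(p_negative_interaction(-0.9, 0.8, eps_minus=1.2))  # far apart: probability ~1
print(p_negative_interaction(0.1, 0.3, eps_minus=1.2))   # close together: ~0
```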
  76. Learning from data. Assume we observe the presence of interactions, but the signs are latent, and the opinions of the users are latent. Can we learn the dynamics and parameters of the system? 47
  77. Learning problem. [Figure: plate-notation diagram of the model, with latent opinions xt (x0 is the initial condition), latent edge signs s, and the observed interactions at each time step t.] Given the observable interactions G = (V, E), find: opinions for the nodes over time, x: V × {0,…, T} → [−1, 1], and the sign of each edge, s: E → {−, +}, with maximum likelihood. Use EM and gradient descent via automatic differentiation. 48
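A bare-bones version of that recipe, sketched by the editor: simulate a differentiable surrogate of the dynamics, score the observed trace with a likelihood, and let autodiff drive gradient-based fitting of a parameter. The three-agent trace, the squared-error loss, and the single parameter eps are all illustrative assumptions, not the model of the paper.

```python
import torch

torch.manual_seed(0)

def simulate(eps, steps=60, mu=0.3):
    """Differentiable two-agent bounded-confidence dynamics (third agent is inert)."""
    x = torch.tensor([-0.6, 0.2, 0.7])
    out = []
    for _ in range(steps):
        d = x[1] - x[0]
        gate = torch.sigmoid(10 * (eps - d.abs()))  # smooth "closer than eps" test
        x = x + mu * gate * torch.stack([d, -d, torch.tensor(0.0)])
        out.append(x)
    return torch.stack(out)

observed = simulate(torch.tensor(0.5)).detach()     # trace generated with eps = 0.5

eps = torch.tensor(1.0, requires_grad=True)         # start far from the truth
opt = torch.optim.Adam([eps], lr=0.05)
for _ in range(300):
    loss = ((simulate(eps) - observed) ** 2).sum()  # negative log-likelihood up to a constant
    opt.zero_grad()
    loss.backward()
    opt.step()

print(eps.item())  # should move close to the true value 0.5
```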
  78. Recovering parameters. [Figure: examples of synthetic data traces generated in each scenario; plots show the opinion trajectories over time for the four (ϵ+, ϵ−) settings.] 50
  79. Real data: Reddit. Comments, score = upvotes. Estimate the position of users and subreddits in opinion space. Larger estimated distance of a user from a subreddit → lower score of the user on that subreddit. 52
  81. Call to Action. Machine Learning is a treasure trove of interesting building blocks: the VC dimension for approximation algorithms, automatic differentiation for agent-based models. Repurpose them for your own goals. Be curious, be bold: hack and invent! 53
  82. G. Preti, G. De Francisci Morales, M. Riondato, “MaNIACS: Approximate Mining of Frequent Subgraph Patterns through Sampling”, KDD 2021 + ACM TIST 2023. C. Monti, G. De Francisci Morales, F. Bonchi, “Learning Opinion Dynamics From Social Traces”, KDD 2020. C. Monti, M. Pangallo, G. De Francisci Morales, F. Bonchi, “On Learning Agent-Based Models from Data”, SciRep 2022 (accepted) + arXiv:2205.05052. 54 [email protected] https://gdfm.me @gdfm7