VC dimension: definition

Concept from statistical learning theory. Informally: a measure of model capacity.

(𝒟, ℛ) is a range space if
- 𝒟 is a set of elements, called points, and
- ℛ ⊆ 2^𝒟 is a family of subsets of 𝒟, called ranges.

The projection of ℛ on D ⊆ 𝒟 is the set of subsets ℛ ∩ D := {h ∩ D : h ∈ ℛ}.

D is shattered by ℛ if its projection contains all the subsets of D: |ℛ ∩ D| = 2^|D|.

The VC dimension d of (𝒟, ℛ) is the largest cardinality of a set that is shattered by ℛ. Computing it exactly: HARD!

Example: intervals

Let 𝒟 be the elements of ℤ, and let ℛ = {[a, b] ∩ ℤ : a ≤ b} be the set of discrete intervals in 𝒟.

Shattering a set of two elements of 𝒟 is easy.
It is impossible to shatter a set of three elements {c, d, e} with c < d < e: no range R ∈ ℛ satisfies R ∩ {c, d, e} = {c, e}.

The VC dimension of this (𝒟, ℛ) is 2.
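The interval example can be checked by brute force. A minimal sketch in plain Python, using a small finite window of ℤ to stand in for the full integer line:

```python
def projection(points, ranges):
    """Set of subsets of `points` induced by intersecting with each range."""
    return {frozenset(r & points) for r in ranges}

def is_shattered(points, ranges):
    """True iff the projection contains all 2^|points| subsets of `points`."""
    return len(projection(points, ranges)) == 2 ** len(points)

# All discrete intervals [a, b] inside a small universe standing in for Z.
universe = range(0, 10)
ranges = [frozenset(range(a, b + 1)) for a in universe for b in universe if a <= b]

print(is_shattered(frozenset({2, 7}), ranges))     # two points: shattered
print(is_shattered(frozenset({2, 5, 7}), ranges))  # three points: {2, 7} unreachable
```

The three-point check fails exactly because no interval contains 2 and 7 without also containing 5, matching the slide's argument.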

VC dimension for data analysis

Dataset = sample. How good an approximation can we get from a sample?

"When analyzing a random sample of size N, with probability 1 − δ, the results are within an ε factor of the true results."

Trade-off among sample size, accuracy, and complexity of the task.

ε-sample and VC dimension

ε-sample for (𝒟, ℛ): for ε ∈ (0, 1), a subset A ⊆ 𝒟 s.t.

| |R ∩ 𝒟| / |𝒟| − |R ∩ A| / |A| | ≤ ε,  for every R ∈ ℛ

Let (𝒟, ℛ) be a range space with VC dimension d. A random sample of size

N = 𝒪((1/ε²)(d + log(1/δ)))

is an ε-sample for (𝒟, ℛ) with probability 1 − δ.
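Plugging numbers into the bound makes the trade-off concrete. A sketch, with the caveat that the 𝒪(·) hides a universal constant; c = 0.5 below is an illustrative choice, not the value from the bound:

```python
from math import ceil, log

def eps_sample_size(d, eps, delta, c=0.5):
    """Sufficient random-sample size to be an eps-sample w.p. >= 1 - delta.

    N = ceil(c / eps^2 * (d + log(1/delta))). The constant c is not fixed
    by the O(.) statement; c = 0.5 here is only illustrative.
    """
    return ceil(c / eps**2 * (d + log(1.0 / delta)))

# Intervals over Z have VC dimension d = 2 (previous slide).
print(eps_sample_size(d=2, eps=0.05, delta=0.1))  # 861
print(eps_sample_size(d=2, eps=0.10, delta=0.1))  # 216: 2x coarser eps, ~4x fewer samples
```

Note the 1/ε² dependence dominates: halving the error quadruples the sample size, while the failure probability δ only enters logarithmically.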

Patterns and orbits

Pattern: connected labeled graph. Pattern equality: isomorphism. Automorphism: isomorphism of a pattern to itself; the set of automorphisms of a pattern is denoted Aut(P).

Given a pattern P = (V_P, E_P) and a vertex v ∈ V_P, the orbit B_P(v) is the subset of V_P that is mapped to v by any automorphism of P, i.e.,

B_P(v) ≡ {u ∈ V_P : ∃μ ∈ Aut(P) s.t. μ(u) = v}.

The orbits of P form a partitioning of V_P: for each u ∈ B_P(v), it holds that B_P(u) = B_P(v), and all vertices in B_P(v) have the same label. Computing orbits: HARD!

[Fig. 1: examples of patterns and orbits, with vertices v1, v2, v3 and orbits O1, O2, O3. Colors represent vertex labels. In the pattern on the left, v1 and v2 belong to the same orbit O1; on the right, each vertex is in its own orbit.]

Approx. Frequent Subgraph Mining

Given a frequency threshold τ and a sample S of vertices, with probability at least 1 − δ: for every pattern P with f_V(P) ≥ τ, find (P, ε_P) s.t.

| f_V(P) − f_S(P) | = | |Z_V(P)| / |V| − |Z_S(P)| / |S| | ≤ ε_P

Pruning

ε-sample guarantee: | |R_i ∩ V| / |V| − |R_i ∩ S| / |S| | ≤ ε_i

Given that we can bound the error on every orbit, we can bound the error on their minimum:

| f_V(P_i) − f_S(P_i) | ≤ ε_i ⟹ f_S(P_i) ≥ f_V(P_i) − ε_i ≥ τ − ε_i

Lower bound on the frequency of a frequent pattern in the sample.
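The implication above yields a one-line pruning test: a pattern whose sample frequency falls below τ − ε_i cannot be frequent in the full graph. A minimal sketch with hypothetical frequencies (not from any real dataset):

```python
def may_be_frequent(f_sample, tau, eps):
    """Pruning test from the eps-sample guarantee: any pattern with true
    frequency f_V >= tau must have f_S >= tau - eps in the sample, so a
    pattern with f_S < tau - eps can be pruned with no false negatives."""
    return f_sample >= tau - eps

tau, eps = 0.30, 0.05
print(may_be_frequent(0.27, tau, eps))  # 0.27 >= 0.25: keep (may be frequent)
print(may_be_frequent(0.20, tau, eps))  # 0.20 <  0.25: safely pruned
```

Note the guarantee is one-sided: surviving the test does not prove the pattern is frequent, it only rules out safe pruning.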

Alternatives

Numerical differentiation:

∂f(x)/∂x_i ≈ (f(x + h·e_i) − f(x)) / h,  for small h > 0

Slow (needs one evaluation per input dimension) and suffers errors due to rounding.

Symbolic differentiation: input = computation graph, output = symbolic derivative. Example: Mathematica.
Slow (search and apply rewrite rules) and produces large intermediate expressions.
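The rounding problem of numerical differentiation is easy to see empirically. A small sketch using the forward difference on sin, whose derivative cos is known exactly:

```python
import math

def forward_diff(f, x, h):
    """One-sided finite difference (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

# d/dx sin(x) = cos(x); exact value at x = 1 for comparison.
x, exact = 1.0, math.cos(1.0)

# Truncation error shrinks with h, but for very small h floating-point
# cancellation in f(x + h) - f(x) dominates and the estimate degrades again.
for h in (1e-1, 1e-5, 1e-13):
    est = forward_diff(math.sin, x, h)
    print(f"h={h:.0e}  error={abs(est - exact):.2e}")
```

This is the "errors due to rounding" point: no single h is safe for all functions, whereas AD computes derivatives to machine precision with no step size at all.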

A few highlights

Machine learning (TensorFlow and PyTorch are AD libraries specialized for ML)
Learning protein structure (e.g., AlphaFold)
Many-body Schrödinger equation (e.g., FermiNet)
Stellarator coil design
Differentiable ray tracing
Model uncertainty & sensitivity
Optimization of fluid simulations

Can we do better? Yes!

Rewrite the ABM as a probabilistic generative model.
Write the likelihood of the parameters given the data, ℒ(Θ | X).
Maximize it via automatic differentiation:

Θ̂ = arg max_Θ ℒ(Θ | X)
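The "maximize via AD" step can be sketched end to end on a toy problem. The Dual class below is a minimal forward-mode AD (not the machinery used in the work), and the Gaussian log-likelihood is an illustrative stand-in for the ABM likelihood, maximized by gradient ascent:

```python
class Dual:
    """Minimal forward-mode AD value: a + b*eps with eps^2 = 0."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __sub__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val - o.val, self.dot - o.dot)
    def __rsub__(self, o):
        return Dual(o) - self
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def grad(f):
    """Derivative of a scalar function via one forward-mode pass."""
    return lambda x: f(Dual(x, 1.0)).dot

# Toy likelihood: Gaussian with unit variance; up to constants,
# log L(theta | X) = -0.5 * sum((x - theta)^2), maximized at the sample mean.
data = [0.9, 1.4, 1.1, 0.6]
loglik = lambda theta: sum(-0.5 * (x - theta) * (x - theta) for x in data)

theta, lr = 0.0, 0.1
for _ in range(200):              # gradient *ascent* on the log-likelihood
    theta += lr * grad(loglik)(theta)

print(round(theta, 6))            # converges to mean(data) = 1.0
```

Libraries like PyTorch or JAX apply the same idea (usually in reverse mode) to likelihoods with thousands of parameters; the point here is only that the optimizer never needs a hand-derived gradient.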

Opinion Trajectories

[Figure 4: examples of synthetic data traces generated in each scenario; plots show the opinion trajectories over time for four parameter settings, e.g. (n+ = 0.6, n− = 1.2), (n+ = 0.4, n− = 0.6), (n+ = 1.2, n− = 1.6), (n+ = 0.2, n− = 1.6).]

Parameter values encode different assumptions and determine significantly different latent trajectories.

Part B2 ALBEDO

[Figure 2: plate notation for the model, with variables x_0, x_t, x_{t+1}, α_t, s, (u, v), and time plate t = 1, …, T.]

We leverage recent advances in probabilistic programming to express our models. These frameworks combine imperative programming languages with primitives that represent stochastic constructs, such as sampling from a distribution. They provide a naturally rich environment for transforming ABMs into their PGABM counterparts. Once a model is written in this form, different algorithms can be used to solve the latent-variable problem. We have shown a proof-of-concept of how a traditional opinion dynamics model based on bounded confidence [16] can be translated into this form [46]. Figure 2 shows the plate notation for such a model (adapted from our work [46]), where we represent the latent opinions of the users at time t with x_t (x_0 is the initial condition), and the observed interactions from the data.

Learning problem

Given observable interactions G = (V, E), find:
- the opinions of the nodes over time, x : V × {0, …, T} → [−1, 1], and
- the sign of each edge, s : E → {−, +},
with maximum likelihood.

Use EM and gradient descent via automatic differentiation.