
Causal: Week 4

Will Lowe
February 28, 2021


Transcript

  1. M: Plan
     → Types of machine learning
     → Going further than regression
     → More flexibility with polynomials
     → Overfitting and regularization
     → Bias-variance decomposition
     → The good part about bias
     → Example ML models: Ridge, lasso, trees, neural networks
     → Back to causal inference in the style of Frisch, Waugh, and Lovell
     → Double/debiased ML, the very idea
  2. M: Inference:
     → Supervised: Learn P(Y | X, Z, ...), or often just its expected value
     → Unsupervised: Learn P(X, Z)
  3. M: Inference:
     → Supervised: Learn P(Y | X, Z, ...), or often just its expected value
     → Unsupervised: Learn P(X, Z)
     Action (embeds an inference problem):
     → Reinforcement: Learn a policy P(Action | State) such that the expected future discounted reward for the policy's actions is maximized
  4. M: Inference:
     → Supervised: Learn P(Y | X, Z, ...), or often just its expected value
     → Unsupervised: Learn P(X, Z)
     Action (embeds an inference problem):
     → Reinforcement: Learn a policy P(Action | State) such that the expected future discounted reward for the policy's actions is maximized
     We'll be interested in supervised learning, traditionally separated into
     → Regression: usually implicitly assumes symmetric constant ε (or doesn't have an opinion...)
     → Classification: ambiguous between choosing one of K classes and estimating P(Y = k | X, Z, ...)
     In any case, both go for E[Y | X, Z, ...]
  5. F: You could, if you like, think of linear regression and logistic regression in each of these categories
     → It's illuminating to do so (see the first few chapters of Bishop, 2006)
     So what's the difference?
     → More flexible forms for E[Y | X, Z, ...]
     → Higher-dimensional predictors, i.e. lots more X, Z, ...
     Many ML regression models will embed a more familiar model, e.g. neural networks. Others will start from scratch and build E[Y | X, Z, ...] differently, e.g. classification trees
  6. I: As an engineering tool, ML models will seldom care about what X, Z etc. actually are, or about distinguishing one parameter among the others. Indeed most are non-parametric
     → Reminder: 'non-parametric' does not mean 'does not have parameters', it means 'has so many parameters that I do not care to know them by name'
     Unsurprisingly, this part of ML came late to causal inference
  7. E: What happens when there are more variables than cases?
     → Regular regression breaks
     What happens when you add all the squares and cubes and interactions as predictors?
     → Standard errors explode; same amount of data, but more parameters to learn from it
     → Generalization to new data gets worse; now that we can fit everything better, we fit noise better
     These are the same problem in different degrees
  8. A: [Figure: the example data, t plotted against x]
     For consistency with Bishop ch. 1, let's call
     → the outcome t_n
     → the regression coefficients w_j ∈ w ('weights')
     → our estimate of the expected value of t_n:  t̂_n = y(x, w)
  9. A: Consider polynomial models of t. We'll fit / make predictions like this:
     y(x, w) = w_0
     y(x, w) = w_0 + w_1 x
     y(x, w) = w_0 + w_1 x + w_2 x^2
     ⋮
     y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
  10. A: Consider polynomial models of t. We'll fit / make predictions like this:
     y(x, w) = w_0
     y(x, w) = w_0 + w_1 x
     y(x, w) = w_0 + w_1 x + w_2 x^2
     ⋮
     y(x, w) = w_0 + w_1 x + w_2 x^2 + ⋯ + w_M x^M = Σ_{j=0}^{M} w_j x^j
     The flexibility of this model is driven by M, which we can think of as determining the model class
     → Roughly: the set of functions that can be represented
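
A minimal numpy sketch (illustrative settings, not from the slides) of fitting these polynomial models by least squares, on Bishop-style toy data where t is sin(2πx) plus noise; it just shows training error falling as M grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Bishop-style toy data: t = sin(2*pi*x) + Gaussian noise
N = 10
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, N)

def fit_poly(x, t, M):
    """Least-squares fit of y(x, w) = sum_j w_j x^j for j = 0..M."""
    X = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ..., x^M
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

for M in (0, 1, 3, 9):
    w = fit_poly(x, t, M)
    resid = np.vander(x, M + 1, increasing=True) @ w - t
    print(f"M = {M}: training SSE = {np.sum(resid**2):.4f}")
```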
  11. A: [Figure: polynomial fits with M = 0, 1, 3, and 9 plotted over the example data]
  12. A: [figure-only slide]

  13. O: Things are not so bad when there is more data (here M = 9)
     [Figure: the M = 9 fit with N = 15 and with N = 100 data points]
     But there isn't always going to be more data...
  14. O: Things are not so bad when there is more data (here M = 9)
     [Figure: the M = 9 fit with N = 15 and with N = 100 data points]
     But there isn't always going to be more data...
     However, we can keep all M, i.e. maintain the flexibility in the model class, if we can constrain the size of the weights
     → This calls for a hyperparameter, a parameter that controls other parameters
  15. O: [Figure: the M = 9 fit with N = 15 and with N = 100 data points]
     When there's lots of persuasive data:
     → override the hyperparameter and make use of the model flexibility
     When there isn't:
     → keep the weights small, and therefore the function smooth
  16. R Here, we’re tting the model (maximizing the likelihood) using

    OLS, which minimises the sum of squared errors EOLS = N n (y(xn, w) − tn ) Note: minimising error rather than maximizing the likelihood is the way ML people think about things (hence, no minus sign) e / is there to hint that this is the log likelihood for a Normal distribution (with constant error variance, so it doesn’t matter to E)
  17. R Here, we’re tting the model (maximizing the likelihood) using

    OLS, which minimises the sum of squared errors EOLS = N n (y(xn, w) − tn ) Note: minimising error rather than maximizing the likelihood is the way ML people think about things (hence, no minus sign) Let’s keep that plan, but add an extra term to control the weights Eλ = N n (y(xn, w) − tn ) + λ M m wm and a hyperparameter λ to say how seriously we should take it as an error component e / is there to hint that this is the log likelihood for a Normal distribution (with constant error variance, so it doesn’t matter to E)
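
A small sketch of the regularized fit in closed form, w = (XᵀX + λI)⁻¹ Xᵀt (using the conventional λ/2 scaling of the penalty), on illustrative toy data. For simplicity it penalizes every weight including w_0, which real implementations usually leave unpenalized.

```python
import numpy as np

def ridge_poly_fit(x, t, M, lam):
    """Minimise 0.5*sum((Xw - t)^2) + 0.5*lam*||w||^2 in closed form."""
    X = np.vander(x, M + 1, increasing=True)
    A = X.T @ X + lam * np.eye(M + 1)
    return np.linalg.solve(A, X.T @ t)

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)

for lam in (0.0, np.exp(-18), 1.0):   # roughly: no penalty, ln(lambda) = -18, ln(lambda) = 0
    w = ridge_poly_fit(x, t, M=9, lam=lam)
    print(f"lambda = {lam:.3g}: max |w| = {np.max(np.abs(w)):.3g}")
```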
  18. C: [Figure: the M = 9 fit with ln λ = −18 and with ln λ = 0]
  19. T: [Figure: training and test E_RMS as a function of ln λ]
     → The left extreme is λ = 0 (no regularization)
     → The right extreme is all zero weights (predict 0 for every point)
     → With fixed data, decreasing λ allows more of the model class's inherent flexibility to show
  20. C We can’t t λ by minimising the sum of

    squares → at would just set it to zero (why?)
  21. C We can’t t λ by minimising the sum of

    squares → at would just set it to zero (why?) One reliable option is crossvalidation (CV) → Make a grid of hyperparameter values → Randomly divide the data set into (or some other value > ) → For each hyperparameter value, train a model on white and test on red → Choose the hyperparameter value that minimizes the average error on reds run 1 run 2 run 3 run 4
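
A sketch of this recipe with scikit-learn: a degree-9 polynomial ridge model, a grid of λ values (scikit-learn calls the penalty weight alpha), and 4-fold cross-validation; the data are again illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, (30, 1))
t = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 30)

# Grid of hyperparameter values, 4-fold CV, pick the lambda (alpha) with
# the smallest average held-out squared error
model = make_pipeline(PolynomialFeatures(degree=9), Ridge())
grid = {"ridge__alpha": np.exp(np.arange(-20, 1, 2.0))}
search = GridSearchCV(model, grid, cv=4, scoring="neg_mean_squared_error")
search.fit(x, t)
print("best lambda:", search.best_params_["ridge__alpha"])
```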
  22. G: Bias-variance. One important question we can ask is about the expected value of E, averaged over all possible data sets coming from the same mechanisms
     First, let's give E[t | x] (the real regression function) a name
     h(x) = E[t | x]
     and define E_D as an average over all possible data sets
     Then (Bishop, 2006, sec. 1.5.5 and 3.2, has a derivation) we can decompose the expected error into
     (bias)^2:  ∫ (E_D[y(x, w)] − h(x))^2 p(x) dx
     variance:  ∫ E_D[(y(x, w) − E_D[y(x, w)])^2] p(x) dx
     noise:     ∫∫ (h(x) − t)^2 p(x, t) dx dt
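
A Monte Carlo sketch of the decomposition: repeatedly draw data sets from the same mechanism, fit the same ridge-regularized polynomial to each, and estimate (bias)^2 and variance of the fitted curves over a grid of x values (averaging over the grid stands in for the integrals against p(x)); all settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def h(x):
    # the 'true' regression function h(x) = E[t | x]
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0, 1, 25)
lam, M, N, n_datasets = 1e-3, 9, 25, 200

preds = np.empty((n_datasets, x_grid.size))
for d in range(n_datasets):
    x = rng.uniform(0, 1, N)
    t = h(x) + rng.normal(0, 0.3, N)
    X = np.vander(x, M + 1, increasing=True)
    w = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)
    preds[d] = np.vander(x_grid, M + 1, increasing=True) @ w

avg_pred = preds.mean(axis=0)                   # estimate of E_D[y(x, w)]
bias2 = np.mean((avg_pred - h(x_grid)) ** 2)    # averaged over the x grid
variance = np.mean(preds.var(axis=0))
print(f"(bias)^2 = {bias2:.4f}, variance = {variance:.4f}")
```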
  23. Bias and variance: [Figure: (bias)^2, variance, (bias)^2 + variance, and test error as functions of ln λ]
     Error is unavoidable, but bias and variance trade off
  24. M: Informally, the bias of a model class is the set of functions that a model most naturally learns
     → Linear models (M = 1 above): can learn straight lines
     → Quadratic models (M = 2 above): can learn straight lines but also smooth curves
     → etc.
     We can get different sorts of bias by changing the whole model class
     → we'll see an example of this later with trees
     Regularization also offers us some interesting and different forms
  25. M: [Figure: the (w_1, w_2) plane with directions u_1 and u_2, and the locations of w_ML and w_MAP]
     → w_ML minimizes E_{λ=0} = E_OLS
     → The origin minimizes E_{λ=∞}
     → w_MAP minimizes E_λ when we set λ sensibly to balance the two parts of the error function
  26. M: [Figure: the (w_1, w_2) plane with directions u_1 and u_2, and the locations of w_ML and w_MAP]
     → This bias shrinks all the weights, some more than others
     → It is sometimes helpful to define an effective number of parameters, which is less than M, and possibly fractional
  27. A: If we change the regularization term just a little (note the q)
     E_λ = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 + λ Σ_{m=1}^{M} |w_m|^q
  28. A: If we change the regularization term just a little (note the q)
     E_λ = (1/2) Σ_{n=1}^{N} (y(x_n, w) − t_n)^2 + λ Σ_{m=1}^{M} |w_m|^q
     [Figure: contours of constant penalty Σ_m |w_m|^q in the (w_1, w_2) plane for q = 2 and q = 1]
     q = 2: L2 regularization, a.k.a. 'ridge regression'
     q = 1: L1 regularization, a.k.a. 'the lasso'
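
A quick scikit-learn comparison of the two penalties on the same polynomial features: the q = 2 penalty (Ridge) shrinks every weight a little, while the q = 1 penalty (Lasso) sets some weights exactly to zero; the alpha values are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(4)
x = rng.uniform(0, 1, (40, 1))
t = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, 40)
X = PolynomialFeatures(degree=9, include_bias=False).fit_transform(x)

ridge = Ridge(alpha=0.01).fit(X, t)                   # q = 2: shrinks every weight
lasso = Lasso(alpha=0.01, max_iter=50000).fit(X, t)   # q = 1: zeroes some weights
print("nonzero ridge weights:", np.sum(ridge.coef_ != 0))
print("nonzero lasso weights:", np.sum(lasso.coef_ != 0))
```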
  29. T: Alternatively, we can change the model class altogether, e.g. regression trees (from the scikit-learn documentation)
     Blue tree (2 levels):
       if x > 3.2 then
         if x > 3.9 then -0.9 else -0.5
       else
         if x > 0.5 then 0.8 else 0.1
     The green tree allows up to 5 levels, and overfits
  30. R: For regression trees, one hyperparameter is the depth of the tree
     → so constraining that adds bias and reduces variance
     In general we can also prevent overfitting by bagging (Breiman, 1996):
     → bootstrapping the dataset
     → fitting trees to each bootstrap sample
     → averaging the resulting predictions
     or variations on that theme (e.g. random forests; Cutler et al., 2012)
     Like cross-validation, this removes variance but does not much affect bias
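
A hand-rolled sketch of bagging regression trees, mirroring the three bullets above: bootstrap the data, fit a tree per resample, average the predictions; the data and number of resamples are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
x = rng.uniform(0, 5, (80, 1))
t = np.sin(x).ravel() + rng.normal(0, 0.2, 80)
x_new = np.linspace(0, 5, 200).reshape(-1, 1)

# Bagging by hand: bootstrap the data, fit a deep tree to each resample,
# then average the trees' predictions
preds = []
for b in range(200):
    idx = rng.integers(0, len(t), len(t))          # bootstrap sample
    tree = DecisionTreeRegressor().fit(x[idx], t[idx])
    preds.append(tree.predict(x_new))
bagged_prediction = np.mean(preds, axis=0)

single_tree = DecisionTreeRegressor().fit(x, t).predict(x_new)
print("single-tree vs bagged predictions (first 3 points):")
print(single_tree[:3], bagged_prediction[:3])
```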
  31. B: Clearly regularization generates bias. Seems like a bad thing...
     But it's necessary
     → The No Free Lunch theorem (Wolpert, 1996) says that, averaged over all possible problems, no learning algorithm is better than any other
     → Happily we don't deal with all possible problems, so we can and should choose a model bias to fit the problem
  32. B: Clearly regularization generates bias. Seems like a bad thing...
     But it's necessary
     → The No Free Lunch theorem (Wolpert, 1996) says that, averaged over all possible problems, no learning algorithm is better than any other
     → Happily we don't deal with all possible problems, so we can and should choose a model bias to fit the problem
     And helpful
     → It's how we get less variance
  33. B: Clearly regularization generates bias. Seems like a bad thing...
     But it's necessary
     → The No Free Lunch theorem (Wolpert, 1996) says that, averaged over all possible problems, no learning algorithm is better than any other
     → Happily we don't deal with all possible problems, so we can and should choose a model bias to fit the problem
     And helpful
     → It's how we get less variance
     And annoying
     → It slows convergence
     This is better than the alternative, which is not being consistent and not knowing it
  34. B ML: ML insight:
     → It's better to work with a universal function approximator, and figure out how to regularize it, than to work with a model that can't represent much of anything and hope
     → Most of the ML methods we'll work with are universal approximators
     → Linear regression... definitely not.
  35. B ML: ML insight:
     → It's better to work with a universal function approximator, and figure out how to regularize it, than to work with a model that can't represent much of anything and hope
     → Most of the ML methods we'll work with are universal approximators
     → Linear regression... definitely not.
     But wait, how did all that regularization business turn y(x, w) into a universal approximator?
     → It didn't. We just didn't say much about y(x, w), and drew it like it was in a linear regression context
     In real applications, y(x, w) is not even polynomial regression
     → It's kernel regression, or basis function regression, a neural network, a random forest, etc.
  36. N: A multilayer perceptron (MLP) with one hidden layer of J 'units' for D-dimensional input data x is
     y(x, w) = Σ_{j=1}^{J} w_j φ_j(x, w^(j))
     where φ_j is some nonlinear function of the input data, e.g.
     φ_j(x, w^(j)) = 1 / (1 + exp(−Σ_d w^(j)_d x_d))
     That's a universal approximator (Hornik et al., 1989) that needs serious regularization
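
A forward-pass sketch of exactly this architecture in numpy, with random (untrained) weights just to show the shapes; the bias terms that real MLPs usually add are omitted to match the formula above.

```python
import numpy as np

def mlp_forward(x, W_hidden, w_out):
    """One-hidden-layer MLP: y(x, w) = sum_j w_j * phi_j(x, w^(j)),
    with logistic hidden units phi_j(x, w^(j)) = 1 / (1 + exp(-w^(j) . x))."""
    hidden = 1.0 / (1.0 + np.exp(-x @ W_hidden))   # shape (n, J)
    return hidden @ w_out                          # shape (n,)

rng = np.random.default_rng(6)
D, J = 3, 8                           # input dimension and number of hidden units
W_hidden = rng.normal(0, 1, (D, J))   # one weight vector w^(j) per hidden unit
w_out = rng.normal(0, 1, J)
x = rng.normal(0, 1, (5, D))
print(mlp_forward(x, W_hidden, w_out))
```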
  37. ML We’ve some really exible models with interesting di erent

    types of bias (smooth, piecewise linear) and styles of regularization (L , L , depth constraints) Let’s go right back to the beginning → Good old multiple linear regression
  38. ML We’ve some really exible models with interesting di erent

    types of bias (smooth, piecewise linear) and styles of regularization (L , L , depth constraints) Let’s go right back to the beginning → Good old multiple linear regression
  39. E: Frisch and Waugh, back in 1933 in Econometrica, showed, and Lovell generalized in 1963, the following useful fact about regression (Lovell, 2008, has a short accessible proof). Consider three models
     Y = β_0 + X β_X + Z_1 β_{Z_1} + ⋯ + Z_K β_{Z_K} + ε            (Big Model)
     X = β_{X,0} + Z_1 β_{X,Z_1} + ⋯ + Z_K β_{X,Z_K} + ε_X          (X Model)
     Y = β_{Y,0} + Z_1 β_{Y,Z_1} + ⋯ + Z_K β_{Y,Z_K} + ε_Y          (Y Model)
     and also this one, made out of residuals from the Y and X models
     (Y − Ŷ) = β_{FWL,0} + (X − X̂) β_FWL + ε_FWL
     then β_X = β_FWL
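
A numerical check of the FWL fact on simulated data: the coefficient on X in the big regression and the coefficient from the residual-on-residual regression agree to machine precision; the data-generating values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
n, K = 500, 5
Z = rng.normal(0, 1, (n, K))
X = Z @ rng.normal(0, 1, K) + rng.normal(0, 1, n)
Y = 2.0 * X + Z @ rng.normal(0, 1, K) + rng.normal(0, 1, n)

def ols(design, target):
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef

ones = np.ones((n, 1))
big = ols(np.column_stack([ones, X, Z]), Y)          # Big Model
x_res = X - np.column_stack([ones, Z]) @ ols(np.column_stack([ones, Z]), X)  # X Model residuals
y_res = Y - np.column_stack([ones, Z]) @ ols(np.column_stack([ones, Z]), Y)  # Y Model residuals
fwl = ols(np.column_stack([ones, x_res]), y_res)     # residual-on-residual regression

print("beta_X from big model:  ", big[1])
print("beta_FWL from residuals:", fwl[1])            # numerically identical
```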
  40. E: What if K were really big and we had no real idea about all those Zs?
     → Lots of confounders
     → Unknown, possibly non-linear relationships
     → Target effect is β_X, which we'll assume here is linear
  41. E: What if K were really big and we had no real idea about all those Zs?
     → Lots of confounders
     → Unknown, possibly non-linear relationships
     → Target effect is β_X, which we'll assume here is linear
     Double/debiased Machine Learning (Chernozhukov et al., 2018)
     Intuition: do FWL, but with fancier X and Y models
     X = m(Z_1, . . . , Z_K) + ε_X        (Fancy X Model)
     Y = g(Z_1, . . . , Z_K) + ε_Y        (Fancy Y Model)
  42. Double/debiased ML: Chernozhukov et al. show that
     → just learning a fancier big model is a bad idea: the effect of X gets lost, the fancy model might throw it away, bias, etc.
     However, using a fancy m and a fancy g comes with problems:
     → Overfitting, due to the flexibility of the model class
     → Bias, due to the regularization used to combat overfitting
     → Slow convergence: we're used to n^(1/2), but fancy models tend to go n^(1/4)
     They use a mixture of
     → cross-fitting: like cross-validation, but for β_X estimation
     → cunning orthogonal score functions
     to get fancy models that converge (mostly) as if they were simple ones. Cool
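
An illustrative sketch of the cross-fitting part only: random forests stand in for m and g, residuals are formed on held-out folds, and β_X comes from the simple partialling-out regression. It omits the orthogonal-score machinery and standard errors from Chernozhukov et al.; packages such as DoubleML implement the full estimators.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(8)
n, K = 2000, 10
Z = rng.normal(0, 1, (n, K))
X = np.sin(Z[:, 0]) + Z[:, 1] ** 2 + rng.normal(0, 1, n)       # treatment
Y = 1.5 * X + np.cos(Z[:, 0]) + Z[:, 2] + rng.normal(0, 1, n)  # outcome, true effect 1.5

# Cross-fitting: learn m(Z) = E[X|Z] and g(Z) = E[Y|Z] on one fold's complement,
# residualise on the held-out fold, then do the FWL-style final regression
x_res, y_res = np.empty(n), np.empty(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(Z):
    m_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z[train], X[train])
    g_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(Z[train], Y[train])
    x_res[test] = X[test] - m_hat.predict(Z[test])
    y_res[test] = Y[test] - g_hat.predict(Z[test])

beta_hat = np.sum(x_res * y_res) / np.sum(x_res ** 2)   # partialling-out estimate of beta_X
print("estimated effect of X:", beta_hat)
```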
  43. References
     Bishop, C. M. (2006). 'Pattern recognition and machine learning'. Springer.
     Breiman, L. (1996). 'Bagging predictors'. Machine Learning, 24(2), 123–140.
     Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. & Robins, J. (2018). 'Double/debiased machine learning for treatment and structural parameters'. The Econometrics Journal, 21(1), C1–C68.
     Cutler, A., Cutler, D. R. & Stevens, J. R. (2012). Random forests. In C. Zhang & Y. Ma (Eds.), Ensemble machine learning. Springer US.
     Hornik, K., Stinchcombe, M. & White, H. (1989). 'Multilayer feedforward networks are universal approximators'. Neural Networks, 2, 359–366.
     Lovell, M. C. (2008). 'A simple proof of the FWL theorem'. The Journal of Economic Education, 39(1), 88–91.
  44. References (continued)
     Wolpert, D. H. (1996). 'The lack of a priori distinctions between learning algorithms'. Neural Computation, 8, 1341–1390.