Pomegranate: Fast and Flexible Probabilistic Modeling in Python

Data Intelligence
June 28, 2017

Jacob Schreiber, Paul G. Allen School of Computer Science, University of Washington
Audience level: Intermediate
Topic area: Modeling
We will describe the Python package pomegranate, which implements flexible probabilistic modeling. We will highlight several supported models, including mixtures, hidden Markov models, and Bayesian networks. At each step we will show how the supported flexibility allows complex models to be constructed easily. We will also demonstrate the parallel and out-of-core APIs.

Transcript

  1. Fast and flexible probabilistic modelling in Python. Jacob Schreiber, Paul G. Allen School of Computer Science, University of Washington. jmschreiber91 @jmschrei @jmschreiber91

  2. Acknowledgements 2

  3. Overview: pomegranate is more flexible than other packages, faster, intuitive to use, and can do it all in parallel 3

  4. Overview: this talk 4 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  5. Overview: supported models 5 Six Main Models: 1. Probability Distributions 2. General Mixture Models 3. Markov Chains 4. Hidden Markov Models 5. Bayes Classifiers / Naive Bayes 6. Bayesian Networks Two Helper Models: 1. k-means++/kmeans|| 2. Factor Graphs

  6. Overview: model stacking in pomegranate 6 Distributions Bayes Classifiers Markov Chains General Mixture Models Hidden Markov Models Bayesian Networks D BC MC GMM HMM BN

  7. Overview: model stacking in pomegranate 7 Distributions Bayes Classifiers Markov Chains General Mixture Models Hidden Markov Models Bayesian Networks D BC MC GMM HMM BN

  8. The API is common to all models 8 All models have these methods! All models composed of distributions (like GMM, HMM...) have these methods too!
    model.log_probability(X) / model.probability(X)
    model.sample()
    model.fit(X, weights, inertia)
    model.summarize(X, weights)
    model.from_summaries(inertia)
    model.predict(X)
    model.predict_proba(X)
    model.predict_log_proba(X)
    Model.from_samples(X, weights)
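
    As a rough illustration of that shared API (not code from the talk; the data below is made up), the same handful of methods works whether the model is a single distribution or a mixture built on top of one:

    import numpy
    from pomegranate import NormalDistribution, GeneralMixtureModel

    # toy data: two univariate Gaussian clusters
    X = numpy.concatenate([numpy.random.normal(0, 1, 500),
                           numpy.random.normal(6, 1, 500)])

    d = NormalDistribution.from_samples(X)           # a single distribution...
    print(d.log_probability(0.5))                    # ...scores points
    print(d.sample())                                # ...and draws samples

    model = GeneralMixtureModel.from_samples(NormalDistribution, 2, X.reshape(-1, 1))
    print(model.log_probability(X[:5].reshape(-1, 1)))   # same method on a mixture
    print(model.predict(X[:5].reshape(-1, 1)))           # plus prediction methods
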
  9. pomegranate supports many models 9
    Univariate Distributions: UniformDistribution, BernoulliDistribution, NormalDistribution, LogNormalDistribution, ExponentialDistribution, BetaDistribution, GammaDistribution, DiscreteDistribution, PoissonDistribution
    Kernel Densities: GaussianKernelDensity, UniformKernelDensity, TriangleKernelDensity
    Multivariate Distributions: IndependentComponentsDistribution, MultivariateGaussianDistribution, DirichletDistribution, ConditionalProbabilityTable, JointProbabilityTable

  10. Models can be created from known values 10
    mu, sig = 0, 2
    a = NormalDistribution(mu, sig)
    X = [0, 1, 1, 2, 1.5, 6, 7, 8, 7]
    a = GaussianKernelDensity(X)

  11. Models can be learned from data 11
    X = numpy.random.normal(0, 1, 100)
    a = NormalDistribution.from_samples(X)
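
    A small sketch of the difference between the two creation routes above (illustrative values, not from the slides): from_samples() builds a new model from data, while fit() updates an existing model's parameters in place:

    from pomegranate import NormalDistribution

    d = NormalDistribution(0, 2)            # start from assumed parameters
    d.fit([0.1, 0.5, 0.8, 1.2, 1.0, 0.7])   # re-estimate them from observed data
    print(d.parameters)                     # roughly the sample mean and std
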
  12. 12 pomegranate can be faster than numpy: fitting a Normal Distribution to 1,000 samples

  13. 13 pomegranate can be faster than numpy: fitting a Multivariate Gaussian to 10,000,000 samples of 10 dimensions

  14. 14 pomegranate uses BLAS internally

  15. 15 pomegranate will soon have GPU support

  16. 16 pomegranate uses additive summarization pomegranate reduces data to sufficient statistics for updates and so only has to go through datasets once (for all models). The slide shows the Normal distribution's sufficient statistics as an example.

  17. 17 pomegranate supports out-of-core learning Batches from a dataset can be reduced to additive summary statistics, enabling exact updates from data that can’t fit in memory.
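
    A minimal out-of-core sketch of that summarize/from_summaries loop; load_batches() is a stand-in for whatever streams your dataset from disk, not a pomegranate function:

    import numpy
    from pomegranate import NormalDistribution

    def load_batches():
        # placeholder generator standing in for chunked reads from disk
        for _ in range(10):
            yield numpy.random.normal(3, 1, 100000)

    d = NormalDistribution(0, 1)
    for batch in load_batches():
        d.summarize(batch)        # accumulate additive sufficient statistics
    d.from_summaries()            # one exact update covering all batches
    print(d.parameters)
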
  18. 18 Parallelization exploits additive summaries (diagram: extract summaries from each data partition, add them together, then compute the new parameters)

  19. 19 pomegranate supports semisupervised learning Summary statistics from supervised models can be added to summary statistics from unsupervised models to train a single model on a mixture of labeled and unlabeled data.

  20. 20 pomegranate supports semisupervised learning Supervised Accuracy: 0.93 Semisupervised Accuracy: 0.96
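
    A hedged sketch of the semisupervised workflow, assuming the convention of marking unlabeled rows with a label of -1; the data and the 10% labeled split are invented for illustration:

    import numpy
    from pomegranate import NaiveBayes, NormalDistribution

    X = numpy.concatenate([numpy.random.normal(0, 1, (100, 1)),
                           numpy.random.normal(3, 1, (100, 1))])
    y = numpy.array([0] * 10 + [-1] * 90 + [1] * 10 + [-1] * 90)   # -1 = unlabeled

    model = NaiveBayes.from_samples(NormalDistribution, X, y)
    print((model.predict(X[:100]) == 0).mean())   # accuracy on the first cluster
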
  21. 21 pomegranate can be faster than scipy

  22. 22 pomegranate uses aggressive caching

  23. 23

  24. 24 Example ‘blast’ from Gossip Girl Spotted: Lonely Boy. Can't believe the love of his life has returned. If only she knew who he was. But everyone knows Serena. And everyone is talking. Wonder what Blair Waldorf thinks. Sure, they're BFF's, but we always thought Blair's boyfriend Nate had a thing for Serena.

  25. 25 Example ‘blast’ from Gossip Girl Why'd she leave? Why'd she return? Send me all the deets. And who am I? That's the secret I'll never tell. The only one. —XOXO. Gossip Girl.

  26. 26 How do we encode these ‘blasts’? Better lock it down with Nate, B. Clock's ticking. +1 Nate -1 Blair

  27. 27 How do we encode these ‘blasts’? This just in: S and B committing a crime of fashion. Who doesn't love a five-finger discount. Especially if it's the middle one. -1 Blair -1 Serena

  28. 28 Simple summations don’t work well

  29. 29 Beta distributions can model uncertainty

  30. 30 Beta distributions can model uncertainty

  31. 31 Beta distributions can model uncertainty
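
    A worked example of the idea behind these slides, using invented sighting counts rather than the talk's: a character's +1/-1 mentions become a Beta(successes + 1, failures + 1) posterior, so characters with few sightings keep wide uncertainty:

    def beta_from_counts(plus, minus):
        # conjugate update of a uniform Beta(1, 1) prior with +1/-1 sightings
        alpha, beta = plus + 1, minus + 1
        mean = alpha / (alpha + beta)
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
        return mean, var

    print(beta_from_counts(2, 1))     # few sightings: mean 0.60, variance 0.04
    print(beta_from_counts(20, 10))   # same 2:1 ratio: mean ~0.66, variance ~0.007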

  32. Overview: this talk 32 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  33. GMMs can model complex distributions 33

  34. GMMs can model complex distributions 34 model = GeneralMixtureModel.from_samples(NormalDistribution, 2, X)

  35. GMMs can model complex distributions 35

  36. An exponential distribution is not right 36 model = ExponentialDistribution.from_samples(X)

  37. A mixture of exponentials is better 37 model = GeneralMixtureModel.from_samples(ExponentialDistribution, 2, X)

  38. Heterogeneous mixtures natively supported 38 model = GeneralMixtureModel.from_samples([ExponentialDistribution, UniformDistribution], 2, X)
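
    For example, a sketch along the lines of that slide with synthetic data (an exponential signal near zero plus a flat background), showing that the fitted heterogeneous mixture is used like any other model:

    import numpy
    from pomegranate import GeneralMixtureModel, ExponentialDistribution, UniformDistribution

    X = numpy.concatenate([numpy.random.exponential(1.0, 5000),
                           numpy.random.uniform(0, 10, 2000)]).reshape(-1, 1)

    model = GeneralMixtureModel.from_samples(
        [ExponentialDistribution, UniformDistribution], 2, X)

    print(model.log_probability(X[:3]))   # density under the fitted mixture
    print(model.predict_proba(X[:3]))     # posterior component memberships
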
  39. GMMs faster than sklearn 39

  40. Overview: this talk 40 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  41. CG enrichment detection HMM 41 GACTACGACTCGCGCTCGCACGTCGCTCGACATCATCGACA

  42. CG enrichment detection HMM GACTACGACTCGCGCTCGCACGTCGCTCGACATCATCGACA 42
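
    A hedged sketch of a CG-enrichment detector like the one on these slides; the state names, transition values, and emission probabilities are illustrative, not the ones from the talk:

    from pomegranate import HiddenMarkovModel, State, DiscreteDistribution

    background = State(DiscreteDistribution({'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}),
                       name="background")
    cg_island = State(DiscreteDistribution({'A': 0.10, 'C': 0.40, 'G': 0.40, 'T': 0.10}),
                      name="CG island")

    model = HiddenMarkovModel("CG-detector")
    model.add_states(background, cg_island)
    model.add_transition(model.start, background, 0.5)
    model.add_transition(model.start, cg_island, 0.5)
    model.add_transition(background, background, 0.9)
    model.add_transition(background, cg_island, 0.1)
    model.add_transition(cg_island, cg_island, 0.9)
    model.add_transition(cg_island, background, 0.1)
    model.bake()

    seq = list("GACTACGACTCGCGCTCGCACGTCGCTCGACATCATCGACA")
    print(model.predict(seq))   # most likely hidden state for each position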

  43. pomegranate HMMs are feature rich 43

  44. GMM-HMM easy to define 44

  45. HMMs are faster than hmmlearn 45

  46. Overview: this talk 46 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  47. Bayesian networks 47 Bayesian networks are powerful inference tools which define a dependency structure between variables. (Example network: Rain, Sprinkler, Wet Grass.)

  48. Bayesian networks 48 Two main difficult tasks: (1) inference given incomplete information; (2) learning the dependency structure from data. (Example network: Rain, Sprinkler, Wet Grass.)

  49. Bayesian network structure learning 49 ??? Three primary ways: “Search and score” / Exact; “Constraint Learning” / PC; Heuristics

  50. Bayesian network structure learning 50 ??? pomegranate supports: “Search and score” / Exact; “Constraint Learning” / PC; Heuristics
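
    A small sketch of structure learning on synthetic sprinkler-style data, assuming the algorithm argument naming ('exact' for search-and-score, 'greedy', 'chow-liu'):

    import numpy
    from pomegranate import BayesianNetwork

    # synthetic rain / sprinkler / wet-grass observations, purely illustrative
    rain = numpy.random.randint(0, 2, 1000)
    sprinkler = (numpy.random.rand(1000) < numpy.where(rain == 1, 0.1, 0.5)).astype(int)
    wet = (((rain == 1) | (sprinkler == 1)) & (numpy.random.rand(1000) < 0.9)).astype(int)
    X = numpy.column_stack([rain, sprinkler, wet])

    model = BayesianNetwork.from_samples(X, algorithm='exact')
    print(model.structure)   # tuple of parent-index sets, one per column
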
  51. Exact structure learning is intractable ??? 51

  52. pomegranate supports four algorithms 52

  53. Constraint graphs merge data + knowledge 53 (Example nodes: BRCA1, BRCA2, LCT, BLOAT, LE, LOA, VOM, AC, PREG, LI, OC, grouped into genetic conditions, diseases, and symptoms.)

  54. Constraint graphs merge data + knowledge 54 (Layers: genetic conditions, diseases, symptoms.)
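
    A hedged sketch of the constraint-graph idea, assuming a networkx DiGraph whose nodes are tuples of column indices and whose edges give the allowed parent-to-child directions between layers; the column grouping and data below are invented:

    import numpy
    import networkx
    from pomegranate import BayesianNetwork

    genetic  = (0, 1)       # e.g. columns for BRCA1, BRCA2
    diseases = (2, 3)       # e.g. columns for LCT, OC
    symptoms = (4, 5, 6)    # e.g. columns for BLOAT, VOM, AC

    cg = networkx.DiGraph()
    cg.add_edge(genetic, diseases)    # genetic conditions may parent diseases
    cg.add_edge(diseases, symptoms)   # diseases may parent symptoms

    X = numpy.random.randint(0, 2, (1000, 7))   # placeholder binary observations
    model = BayesianNetwork.from_samples(X, algorithm='exact', constraint_graph=cg)
    print(model.structure)
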
  55. Modeling the global stock market 55

  56. Constraint graph published in PeerJ CS 56

  57. Overview: this talk 57 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  58. Bayes classifiers rely on Bayes’ rule 58

  59. Naive Bayes assumes independent features 59

  60. Naive Bayes produces ellipsoid boundaries 60 model = NaiveBayes.from_samples(NormalDistribution, X, y)

  61. Naive Bayes can be heterogeneous 61

  62. Data can fall under different distributions 62

  63. Using appropriate distributions is better 63
    model = NaiveBayes.from_samples(NormalDistribution, X_train, y_train)
    print("Gaussian Naive Bayes: ", (model.predict(X_test) == y_test).mean())
    clf = GaussianNB().fit(X_train, y_train)
    print("sklearn Gaussian Naive Bayes: ", (clf.predict(X_test) == y_test).mean())
    model = NaiveBayes.from_samples([NormalDistribution, LogNormalDistribution, ExponentialDistribution], X_train, y_train)
    print("Heterogeneous Naive Bayes: ", (model.predict(X_test) == y_test).mean())
    Output:
    Gaussian Naive Bayes: 0.798
    sklearn Gaussian Naive Bayes: 0.798
    Heterogeneous Naive Bayes: 0.844

  64. This additional flexibility is just as fast 64

  65. Bayes classifiers don’t require independence 65 naive accuracy: 0.929, Bayes classifier accuracy: 0.966

  66. Gaussian mixture model Bayes classifier 66

  67. Creating complex Bayes classifiers is easy 67
    gmm_a = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X[y == 0])
    gmm_b = GeneralMixtureModel.from_samples(MultivariateGaussianDistribution, 2, X[y == 1])
    model_b = BayesClassifier([gmm_a, gmm_b], weights=numpy.array([1-y.mean(), y.mean()]))

  68. Creating complex Bayes classifiers is easy 68
    mc_a = MarkovChain.from_samples(X[y == 0])
    mc_b = MarkovChain.from_samples(X[y == 1])
    model_b = BayesClassifier([mc_a, mc_b], weights=numpy.array([1-y.mean(), y.mean()]))

    hmm_a = HiddenMarkovModel.from_samples(X[y == 0])
    hmm_b = HiddenMarkovModel.from_samples(X[y == 1])
    model_b = BayesClassifier([hmm_a, hmm_b], weights=numpy.array([1-y.mean(), y.mean()]))

    bn_a = BayesianNetwork.from_samples(X[y == 0])
    bn_b = BayesianNetwork.from_samples(X[y == 1])
    model_b = BayesClassifier([bn_a, bn_b], weights=numpy.array([1-y.mean(), y.mean()]))

  69. Overview: this talk 69 Overview Major Models/Model Stacks 1. General Mixture Models 2. Hidden Markov Models 3. Bayesian Networks 4. Bayes Classifiers Finale: Train a mixture of HMMs in parallel

  70. Training a mixture of HMMs in parallel 70 Creating a mixture of HMMs is just as simple as passing the HMMs into a GMM as if they were any other distribution

  71. Training a mixture of HMMs in parallel 71 fit(model, X, n_jobs=n)
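
    A hedged sketch of that finale on toy sequences: two HMMs are initialised from rough clusters, dropped into a GeneralMixtureModel like any other distribution, and refined with fit() using parallel workers via n_jobs; the sequences and component counts are invented:

    import random
    from pomegranate import HiddenMarkovModel, GeneralMixtureModel, DiscreteDistribution

    # toy sequences: one CG-biased group and one AT-biased group
    seqs_a = [random.choices('ACGT', weights=[1, 4, 4, 1], k=25) for _ in range(50)]
    seqs_b = [random.choices('ACGT', weights=[4, 1, 1, 4], k=25) for _ in range(50)]

    hmm_a = HiddenMarkovModel.from_samples(DiscreteDistribution, n_components=2, X=seqs_a)
    hmm_b = HiddenMarkovModel.from_samples(DiscreteDistribution, n_components=2, X=seqs_b)

    model = GeneralMixtureModel([hmm_a, hmm_b])
    model.fit(seqs_a + seqs_b, n_jobs=2)     # parallel EM over both component HMMs
    print(model.predict(seqs_a[:3]))         # which HMM each sequence is assigned to
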
  72. Overview: pomegranate is more flexible than other packages, faster, intuitive to use, and can do it all in parallel 72

  73. Documentation available at Readthedocs 73

  74. Tutorials available on github 74 https://github.com/jmschrei/pomegranate/tree/master/tutorials

  75. Thank you for your time. 75