
Lessons learned PyMC3

I discuss the lessons learned from a few years as a core developer of an open source project.

springcoil

May 29, 2019

Transcript

  1. PyMC3 lessons learned

    Peadar Coyle - PyMC3 committer, blogger and data scientist
    PyData Montreal
    @springcoil
    www.probabilisticprogrammingprimer.com
  2. Why OSS matters - it democratises innovation @patio11

  3. New release - PyMC3 3.7

    PyMC3 3.7 is released! Highlights:
    - Python 3 only
    - @arviz_devs for plotting
    - Data class for handling changing data between inference and posterior predictive
    - Big under-the-hood improvements, especially to prior predictive sampling and shape handling
  4. Funding models for OSS are broken https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/

  5. GitHub Sponsors - reasons for optimism https://github.com/sponsors

  6. Our metrics

  7. Isn’t everything a machine learning problem?

    Lots of problems are small data or heterogeneous data problems. Traditional ML models such as XGBoost or Random Forests DON’T incorporate domain expertise or work well with small data.
  8. What are the applications? How do I make money?

    Basically anywhere you need to understand uncertainty, handle domain-specific knowledge or handle small heterogeneous data. Marketing is a good use case, as are A/B testing, survey data, pricing modelling and many kinds of risk modelling. What all of these problems have in common is that uncertainty quantification matters.
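As a concrete illustration of uncertainty quantification in the A/B testing case, here is a minimal sketch using only the Python standard library. It is not taken from the talk and does not use PyMC3: it assumes independent Beta(1, 1) priors on two conversion rates, uses the conjugate Beta posterior, and estimates the probability that variant B converts better than variant A. The function name `prob_b_beats_a` and the conversion counts are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A).

    Assumes independent Beta(1, 1) priors, so each posterior is
    Beta(1 + conversions, 1 + non-conversions).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Draw one plausible conversion rate per variant from its posterior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical counts: 12/200 conversions for A, 21/200 for B.
p = prob_b_beats_a(conv_a=12, n_a=200, conv_b=21, n_b=200)
print(f"P(B beats A) = {p:.2f}")
```

The result is a probability rather than a bare point estimate, which is the kind of output a decision about shipping variant B can be based on.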
  9. What is a PPL?

    A PPL is a Probabilistic Programming Language that treats random variables as first-class citizens.
  10. Who uses Stan?

  11. Who uses PyMC3

  12. Box loop

  13. Community is important - invest in it

    Community is extremely important. GSoC / Leadership.
    Negative: Our gender split isn’t great - it’s a problem for all of OSS. Would love to know some solutions to this.
  14. Tooling

    • Great work by the likes of Ravin Kumar and Austin Rochford on the tooling side. CI/CD helps in research software too.
    • Good test cases and Docker work helped a lot too.
  15. Docs

    • Value docs. Make it really easy to contribute to them too.
    • Evangelism at conferences helped improve adoption.
  16. Importance of research

    • We publish (occasionally) and regularly read the literature.
    • We do a journal club.
    • We have also allowed ‘low risk’ merges to pymc3, which lowers the bar for researchers.
  17. None
  18. Meet in person

    • Useful to meet in person. It helps build up relationships.
  19. Profile

    https://discourse.pymc.io/t/multiple-linear-regression/3139
    https://docs.pymc.io/notebooks/profiling.html
    I had a use case like this with a client. The model went from 20 hours of running time to 3 minutes.
    • Reducing the number of iterations helped. NUTS is a very powerful tool.
    • Vectorization caused a lot of the speed improvements.
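To make the vectorization point concrete, here is a rough sketch in plain NumPy. It is not the client model and not PyMC3 code: it assumes a multiple linear regression with Gaussian noise (the topic of the linked forum thread) and compares an element-by-element log-likelihood with a vectorized one. The function names and the simulated data are hypothetical.

```python
import math

import numpy as np


def loglik_loop(y, X, beta, sigma):
    """Gaussian log-likelihood computed one observation at a time."""
    total = 0.0
    for i in range(len(y)):
        mu = 0.0
        for j in range(len(beta)):
            mu += X[i][j] * beta[j]
        total += (-0.5 * math.log(2 * math.pi * sigma**2)
                  - (y[i] - mu) ** 2 / (2 * sigma**2))
    return total


def loglik_vectorized(y, X, beta, sigma):
    """Same quantity, using one matrix-vector product and one dot product."""
    resid = y - X @ beta
    n = len(y)
    return float(-0.5 * n * np.log(2 * np.pi * sigma**2)
                 - resid @ resid / (2 * sigma**2))


# Simulated data (hypothetical): 500 observations, 3 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=500)

# Both versions agree up to floating-point error.
assert np.isclose(loglik_loop(y, X, beta, 0.3),
                  loglik_vectorized(y, X, beta, 0.3))
```

A sampler evaluates the log-density (and its gradient) many times per draw, so replacing Python-level loops with array operations in the model definition can account for a large share of a speed-up like the one described on the slide.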
  20. What’s coming next?

    We just had a PyMC4 summit in Montreal. Keep a lookout for our updates. Great support by the TensorFlow Probability team at Google. https://github.com/pymc-devs/pymc4
  21. Thank you - You may want to check out www.probabilisticprogrammingprimer.com

  22. Your job is to inform better decisions

    You might think that your job is to understand the truth about reality or whatever. All science is about making better decisions. If your inference is wrong, then your decisions will be wrong.
  23. More applications

  24. What does this mean practically?

    To handle large-scale problems or ‘big data’ problems in a Bayesian inference framework, we need to use Hamiltonian samplers. Hamiltonian samplers work well under certain conditions. These conditions are often swept under the carpet.
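For readers who have not seen a Hamiltonian sampler, the following is a minimal sketch of plain Hamiltonian Monte Carlo with a leapfrog integrator, written in standard-library Python for a one-dimensional target. It is not the PyMC3 implementation and it is not NUTS, which additionally adapts the trajectory length; the step size, number of leapfrog steps and the standard normal target are assumptions chosen for illustration. One of the conditions mentioned above is visible directly in the signature: the sampler needs the gradient of the log-density, so the parameters must be continuous and the log-density differentiable.

```python
import math
import random


def hmc_sample(logp, grad_logp, x0, n_samples=5000, step=0.2,
               n_leapfrog=10, seed=0):
    """Plain HMC for a 1-D target with a unit-mass Gaussian momentum."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)  # resample the momentum
        x_new, p = x, p0

        # Leapfrog integration: half step in momentum, alternating full
        # steps in position and momentum, final half step in momentum.
        p += 0.5 * step * grad_logp(x_new)
        for i in range(n_leapfrog):
            x_new += step * p
            if i < n_leapfrog - 1:
                p += step * grad_logp(x_new)
        p += 0.5 * step * grad_logp(x_new)

        # Metropolis correction using the Hamiltonian H = -logp + p^2 / 2.
        h_old = -logp(x) + 0.5 * p0 * p0
        h_new = -logp(x_new) + 0.5 * p * p
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            x = x_new
        samples.append(x)
    return samples


# Assumed target: standard normal, log-density known up to a constant.
draws = hmc_sample(lambda x: -0.5 * x * x, lambda x: -x, x0=0.0)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(f"mean = {mean:.2f}, variance = {var:.2f}")
```

On this well-behaved target the draws should have mean near 0 and variance near 1. Targets with strongly varying curvature or discrete parameters are where the hidden conditions start to bite, and where a fixed step size like the one assumed here tends to fail.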
  25. What about regulation?

    Increasingly, models will be deployed in regulated industries, and in a post-GDPR world interpretability will matter more. If you work with healthcare, finance or insurance data, you should add Bayesian statistics to your toolkit. We’ll discuss how to debug Bayesian models using modern techniques such as NUTS. This is PyMC3-specific, but the techniques apply to Rainier, Stan and BUGS.