Lessons learned PyMC3 - Speaker Deck

Slide 1

Slide 1 text

PyMC3 lessons learned Peadar Coyle - PyMC3 committer, Blogger and Data Scientist PyData Montreal @springcoil www.probabilisticprogrammingprimer.com

Slide 2

Slide 2 text

Why OSS matters - it democratises innovation @patio11

Slide 3

Slide 3 text

New release - PyMC3 3.7 PyMC3 3.7 is released! Highlights: - Python 3 only - @arviz_devs for plotting - Data class for handling changing data between inference and posterior predictive - Big under the hood improvements, especially to prior predictive sampling and shape handling

Slide 4

Slide 4 text

Funding models for OSS are broken https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/

Slide 5

Slide 5 text

Github sponsors - reasons for optimism https://github.com/sponsors

Slide 6

Slide 6 text

Our metrics

Slide 7

Slide 7 text

Isn’t everything a machine learning problem? Lots of problems are small data or heterogeneous data problems. Traditional ML models such as XGBoost or Random Forests DON’T incorporate domain expertise or work well with small data.

Slide 8

Slide 8 text

What are the applications? How do I make money? Basically anywhere you need to understand uncertainty, handle domain-specific knowledge or handle small heterogeneous data. Marketing is a good use case, as are A/B testing, survey data, pricing models and many kinds of risk modelling. What all of these problems have in common is that uncertainty quantification matters.
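To make "uncertainty quantification" concrete for the A/B testing case, here is a minimal sketch using only the Python standard library. It assumes a Beta(1, 1) prior on each conversion rate and uses made-up counts (40/1000 and 55/1000); the function names are illustrative and are not part of PyMC3.

```python
import random


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Conjugate update: Beta prior + binomial data gives a Beta posterior."""
    return prior_a + successes, prior_b + (trials - successes)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under the two posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(*post_a)
        rate_b = rng.betavariate(*post_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: variant A converts 40 of 1000 visitors, variant B 55 of 1000.
post_a = beta_posterior(40, 1000)
post_b = beta_posterior(55, 1000)
p_b_better = prob_b_beats_a(post_a, post_b)
print(f"P(B beats A) is roughly {p_b_better:.2f}")
```

The output is a probability that B is better rather than a bare point estimate, which is the kind of statement a pricing or marketing decision can act on.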

Slide 9

Slide 9 text

What is a PPL? A PPL is a Probabilistic Programming Language: one that treats random variables as first-class citizens.

Slide 10

Slide 10 text

Who uses Stan?

Slide 11

Slide 11 text

Who uses PyMC3?

Slide 12

Slide 12 text

Box loop

Slide 13

Slide 13 text

Community is important: invest in it. Community is extremely important. GSoC / Leadership. Negative: our gender split isn’t great - it’s a problem for all of OSS. Would love to know some solutions to this.

Slide 14

Slide 14 text

Tooling ● Great work by the likes of Ravin Kumar and Austin Rochford on the tooling side. CI/CD helps in research software too. ● Good test cases and Docker work helped a lot too.

Slide 15

Slide 15 text

Docs ● Value docs, and make it really easy to contribute to them. ● Evangelism at conferences helped improve adoption.

Slide 16

Slide 16 text

Importance of research ● We publish (occasionally) and regularly read the literature. ● We do a journal club. ● We have also allowed ‘low risk’ merges to PyMC3, which lowers the bar for researchers.

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Meet in person ● Useful to meet in person. It helps build up relationships.

Slide 19

Slide 19 text

Profile https://discourse.pymc.io/t/multiple-linear-regression/3139 https://docs.pymc.io/notebooks/profiling.html I had a use case like this with a client: the model went from 20 hours of running time to 3 minutes. ● Reducing the number of iterations helped; NUTS is a very powerful tool. ● Vectorization accounted for a lot of the speed improvement.
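As a rough illustration of the vectorization point, the sketch below computes the same Gaussian log-likelihood for a multiple linear regression twice: once with explicit Python loops and once with array operations. It uses NumPy rather than PyMC3, and the data, coefficients and function names are invented for the example; the assumption is that the same restructuring (one array-shaped expression instead of per-observation loops) is what speeds up a PyMC3 model.

```python
import numpy as np


def loglike_loop(y, X, beta, sigma):
    """Gaussian log-likelihood computed one observation at a time."""
    total = 0.0
    for i in range(len(y)):
        mu = 0.0
        for j in range(len(beta)):
            mu += X[i][j] * beta[j]
        total += -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y[i] - mu) / sigma) ** 2
    return total


def loglike_vectorized(y, X, beta, sigma):
    """Same quantity, expressed as a matrix product and a single sum."""
    resid = y - X @ beta
    n = y.shape[0]
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - 0.5 * np.sum((resid / sigma) ** 2)


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
beta = np.array([1.0, -2.0, 0.5])
y = X @ beta + rng.normal(scale=0.3, size=200)

slow = loglike_loop(y, X, beta, 0.3)
fast = loglike_vectorized(y, X, beta, 0.3)
print(np.isclose(slow, fast))
```

Both functions return the same value; the vectorized form pushes the arithmetic into compiled array routines, and a sampler evaluates this kind of expression many times per run.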

Slide 20

Slide 20 text

What’s coming next? We just had a PyMC4 summit in Montreal. Keep a lookout for our updates. Great support from the TensorFlow Probability team at Google. https://github.com/pymc-devs/pymc4

Slide 21

Slide 21 text

Thank you - You may want to check out www.probabilisticprogrammingprimer.com

Slide 22

Slide 22 text

Your job is to inform better decisions You might think that your job is to understand the truth about reality or whatever. All science is about making better decisions. If your inference is wrong - then your decisions will be wrong.

Slide 23

Slide 23 text

More applications

Slide 24

Slide 24 text

What does this mean practically? To handle large scale problems or ‘big data’ problems in a Bayesian Inference framework - we need to use Hamiltonian samplers. Hamiltonian samplers work well under certain conditions. These conditions are often swept under the carpet.
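For readers who have not seen a Hamiltonian sampler, here is a minimal, standard-library-only sketch of plain Hamiltonian Monte Carlo on a one-dimensional standard normal target. It is not PyMC3's NUTS implementation: the step size and number of leapfrog steps are fixed by hand here, whereas NUTS chooses the trajectory length adaptively. All names and defaults are illustrative.

```python
import math
import random


def neg_log_density(q):
    """Potential energy for a standard normal target (up to a constant)."""
    return 0.5 * q * q


def grad_neg_log_density(q):
    """Gradient of the potential energy with respect to q."""
    return q


def leapfrog(q, p, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p -= 0.5 * step_size * grad_neg_log_density(q)  # initial half step for momentum
    for i in range(n_steps):
        q += step_size * p  # full step for position
        if i < n_steps - 1:
            p -= step_size * grad_neg_log_density(q)  # full step for momentum
    p -= 0.5 * step_size * grad_neg_log_density(q)  # final half step for momentum
    return q, p


def hmc(n_samples, step_size=0.2, n_steps=10, seed=0):
    """Draw samples using HMC with a Metropolis accept/reject correction."""
    rng = random.Random(seed)
    q = 0.0
    samples = []
    for _ in range(n_samples):
        p = rng.gauss(0.0, 1.0)  # resample momentum each iteration
        h_old = neg_log_density(q) + 0.5 * p * p
        q_new, p_new = leapfrog(q, p, step_size, n_steps)
        h_new = neg_log_density(q_new) + 0.5 * p_new * p_new
        # Accept with probability min(1, exp(h_old - h_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        samples.append(q)
    return samples


samples = hmc(5000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.2f}, var={var:.2f}")
```

The sampler needs the gradient of the log density and a step size suited to the target's curvature. If the step size is too large for the geometry, the integrator becomes inaccurate and proposals are rejected, which is one example of a condition that is easy to overlook.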

Slide 25

Slide 25 text

What about regulation? Increasingly models will be deployed in regulated industries - and in a post-GDPR world interpretability will matter more. If you work with healthcare, finance or insurance data you should add Bayesian Statistics to your toolkit. We’ll discuss how to debug Bayesian models, using modern techniques such as NUTS. This is PyMC3-specific but the techniques apply to Rainier, Stan and BUGS.