Slide 1

Probabilistic Programming: Why, What, How, When
Beau Cronin (@beaucronin)

Slide 2

40 Action-Packed Minutes
‣ Why you should care - what’s wrong with what we’ve got?
‣ What probabilistic programming is, and what programs look like
‣ How you can get started today
‣ When will all of this be ready for production use?

Slide 3

Why?

Slide 4

We use data to learn about the world

                         Traditional Machine Learning   Hierarchical Bayesian Modeling
Scale                    Large                          Small
Tools & frameworks       Mature & Robust                Immature & Spotty
Structure & knowledge    Discard                        Keep & Leverage
Data types               Homogeneous                    Heterogeneous
Philosophical approach   Toolkit, Theory-light          Modeling, Theory-heavy

Slide 5

G = {V, E}
What order were these links added in?
What messages flow over this link?
What do we know about this user?

Slide 6

[Example data table: rows 1-6 over columns x1, x2, lat1, long1, t1-t4, and address1, mixing positive numbers, categorical values, locations, time series, and street addresses, with missing values scattered throughout.]

Slide 7

Diverse Data
Most real datasets contain compositions of these and more, but we routinely homogenize in preprocessing:
‣ Trees & Graphs
‣ Time Series
‣ Relations
‣ Locations & Addresses
‣ Images & Movies
‣ Audio
‣ Sets & Partitions
‣ Text

Slide 8

Business Data Is Heterogeneous and Structured

Profile: id: “abcdef”, gender: “Male”, dob: 1978-12-09, twitter_id: 9458201

Page Views:
2014-01-21 18:41:04, “https://devcenter.heroku.com/articles/quickstart”, …
2014-01-20 12:35:56, “https://devcenter.heroku.com/categories/java”, …
2014-01-20 09:12:52, “https://devcenter.heroku.com/articles/ssl-endpoint”, …

Transactions (Order Date | Order ID | Title | Category | ASIN/ISBN | Release Date | Condition | Seller | Per Unit Price):
1/5/13 | 002-1139353-0278652 | Under Armour Men's Resistor No Show Socks, pack of 6 | Socks Apparel | B003RYQJJW | | new | The Sock Company, Inc. | $21.99
1/5/13 | 002-1139353-0278652 | Under Armour Men's Resistor No Show Socks, pack of 6 | Socks Apparel | B004UONNXI | | new | The Sock Company, Inc. | $21.99
1/8/13 | 002-2593752-8837806 | CivilWarLand in Bad Decline | Paperback | 1573225797 | 1/31/97 | new | Amazon.com LLC | $8.4
1/8/13 | 109-0985451-2187421 | Nothing to Envy: Ordinary Lives in North Korea | Paperback | 385523912 | 9/20/10 | new | Amazon.com LLC | $10.88
1/12/13 | 109-8581642-2322617 | Excession | Mass Market Paperback | 553575376 | 2/1/98 | new | Amazon.com LLC | $7.99

Social Posts:
[ { text: “key to compelling VR is…”, retweet_count: 3, favorites_count: 5, urls: [ ], hashtags: [ ], in_reply_to: 39823792801012, … },
  { text: “@John4man really liked your piece”, retweets: 0, favorites: 0, … } ]

Followers: [ 657693, 7588892, 9019482, … ]

Relationship: blocked: False, want_retweets: True, marked_spam: False, since: 2013-09-13

Slide 9

Every Domain Is Heterogeneous
‣ Health data: doctor notes, lab results, imaging, family history, prescriptions
‣ Quantified self: motion sensors, heart rate, GPS tracks, self-reporting, sleep patterns
‣ Autonomous vehicles: LIDAR, cameras, maps, audio, gyros, telemetry, GPS

Slide 10

Mostly, no one even tries to jointly model these different kinds of data

Slide 11

A probabilistic programming system is…
a language + {compiler, interpreter}, or a {library, framework} for an existing language, that:
- includes random choices as native elements
- provides a clean separation between probabilistic modeling and inference
- may provide automated generation of inference solutions for a given program

Slide 12

Probabilistic Programming Systems Model the World
‣ Programs directly represent the data generation process
‣ Measurement processes can be modeled directly, including their imperfections and the uncertainty that comes with them
‣ Philosophy
  ‣ DO: capture the essential aspects of real-world processes in a model
  ‣ DON’T: torture the data into the right form for an algorithm

Slide 13

A Probability Model (plate notation, observations repeated ✕ N)
‣ Fixed: constant values and structural assumptions
‣ Unknown: variables that discriminate between hypotheses
‣ Observable: data and potential data

Slide 14

Obligatory Bayes’ Rule

Pr(H | D, A) ∝ Pr(D | H, A) Pr(H | A)

where H are the hypotheses, D is the data, and A are the assumptions; with the assumptions left implicit:

Pr(H | D) ∝ Pr(D | H) Pr(H)

Slide 15

First example: deciding if a coin is fair based on flips

fair-prior = 0.999                              # assumption
fair-coin? = flip(fair-prior)                   # unknown
if fair-coin?:
    weight = 0.5
else:
    weight = 0.9
observe(repeat(flip(weight), 10),               # observables
        [H, H, H, H, H, H, H, H, H, H])
query(fair-coin?)
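The slide's pseudocode is not directly runnable, but because there are only two hypotheses the posterior can be checked exactly in a few lines of plain Python (a sketch, not any particular PPS; the prior, weights, and data are taken from the slide):

```python
# Exact posterior for the coin example: P(fair | ten heads in a row).
# Values from the slide: P(fair) = 0.999, fair weight 0.5, biased weight 0.9.
prior_fair = 0.999
n_heads = 10

lik_fair = 0.5 ** n_heads        # P(10 heads | fair coin)
lik_biased = 0.9 ** n_heads      # P(10 heads | biased coin)

# Bayes' rule: Pr(H | D) proportional to Pr(D | H) Pr(H)
post_fair = (prior_fair * lik_fair) / (
    prior_fair * lik_fair + (1 - prior_fair) * lik_biased)
print(round(post_fair, 3))  # -> 0.737
```

Even after ten straight heads, the strong prior keeps the fair hypothesis ahead (about 74%); a PPS arrives at the same answer, but with the model and the inference written separately.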

Slide 16

Probabilistic Programming Systems Are Diverse
‣ Library vs. stand-alone language
‣ Base language: Scala, Lisp, Python
‣ Manual, semi-, or fully-automated inference
‣ Modeling domain: directed/undirected graphical models, relational data, all programs
‣ Home field: cognitive science, programming languages, databases, Bayesian statistics, artificial intelligence

Slide 17

PPSs Compared

System        Type          Language           Inference
BLOG          Stand-alone   Custom             Fully auto
BUGS / JAGS   Stand-alone   Custom             Fully auto
STAN          Hybrid        R, Python          Fully auto
PyMC          Library       Python             Manual
Infer.net     Library       C#                 Semi-auto
Church        Stand-alone   Lisp               Fully auto
Venture       Stand-alone   Javascript, Lisp   Semi-auto
Figaro        Library       Scala              Semi-auto
factorie      Library       Scala              Semi-auto

Slide 18

infer.net
‣ A C# framework (also F#)
‣ Developed at MSR
‣ Under active development, with good tutorials and many well-documented examples

Slide 19

Infer.net example: Is a new treatment effective?

// Observations
VariableArray<bool> controlGroup =
    Variable.Observed(new bool[] { false, false, true, false, false });
VariableArray<bool> treatedGroup =
    Variable.Observed(new bool[] { true, false, true, true, true });
Range i = controlGroup.Range;
Range j = treatedGroup.Range;

// Unknown
Variable<bool> isEffective = Variable.Bernoulli(0.5);

// Assumptions & unknowns
Variable<double> probIfTreated, probIfControl;
using (Variable.If(isEffective))
{
    // Model if treatment is effective
    probIfControl = Variable.Beta(1, 1);
    controlGroup[i] = Variable.Bernoulli(probIfControl).ForEach(i);
    probIfTreated = Variable.Beta(1, 1);
    treatedGroup[j] = Variable.Bernoulli(probIfTreated).ForEach(j);
}

using (Variable.IfNot(isEffective))
{
    // Model if treatment is not effective
    Variable<double> probAll = Variable.Beta(1, 1);
    controlGroup[i] = Variable.Bernoulli(probAll).ForEach(i);
    treatedGroup[j] = Variable.Bernoulli(probAll).ForEach(j);
}

// Query
InferenceEngine ie = new InferenceEngine();
Console.WriteLine("Probability treatment has an effect = " + ie.Infer(isEffective));

http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Clinical%20trial%20tutorial.aspx
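For intuition about what the inference engine computes here, the same two-hypothesis model can be solved in closed form with Beta-Binomial marginal likelihoods. This is a plain-Python sketch under the tutorial's Beta(1, 1) priors, not Infer.net itself:

```python
from math import factorial

def seq_marginal(k, n):
    # Marginal probability of one specific length-n binary sequence with k
    # successes, integrating the success probability p out against a
    # Beta(1, 1) prior: integral of p^k (1-p)^(n-k) dp = k! (n-k)! / (n+1)!
    return factorial(k) * factorial(n - k) / factorial(n + 1)

control = [False, False, True, False, False]   # 1 recovery out of 5
treated = [True, False, True, True, True]      # 4 recoveries out of 5
kc, nc = sum(control), len(control)
kt, nt = sum(treated), len(treated)

# isEffective = true: each group gets its own recovery probability
lik_eff = seq_marginal(kc, nc) * seq_marginal(kt, nt)
# isEffective = false: one shared recovery probability for all ten patients
lik_not = seq_marginal(kc + kt, nc + nt)

# Prior on isEffective is Bernoulli(0.5), as in the model above
post_eff = 0.5 * lik_eff / (0.5 * lik_eff + 0.5 * lik_not)
print(round(post_eff, 3))  # -> 0.755
```

Infer.net's message-passing inference should agree closely with this exact answer on such a small model.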

Slide 20

PyMC
‣ Python (duh)
‣ Go watch Thomas Wiecki’s talk from PyData NY
  ‣ http://twiecki.github.io/blog/2013/12/12/bayesian-data-analysis-pymc3/
‣ And read Bayesian Methods for Hackers by Cam Davidson-Pilon et al.

Slide 21

Church
‣ A Lisp
‣ Originally created to model cognitive development and human reasoning
‣ Active inference research, several implementations
‣ Connection between functional purity / independence and stochastic memoization / exchangeability
‣ Hypothesis space is the set of possible program executions
‣ “Probabilistic Models of Cognition”

Slide 22

Church example: Infinite Gaussian Mixture Model

;; stochastic memoization generator for class assignments
;; sometimes return a previous symbol, sometimes create a new one
(define class-distribution (DP-stochastic-mem 1.0 gensym))

;; associate a class with an object via memoization
(define object->class
  (mem (lambda (object) (class-distribution))))

;; associate gaussian parameters with a class via memoization
(define class->gaussian-parameters
  (mem (lambda (class)
         (list (gaussian 65 10) (gaussian 0 8)))))

;; generate observed values for an object
(define (observe object)
  (apply gaussian (class->gaussian-parameters (object->class object))))

;; generate observations for some objects
(map observe '(tom dick harry bill fred))

modified from https://probmods.org/non-parametric-models.html
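Church's DP-stochastic-mem can be mimicked in ordinary Python with a Chinese-restaurant-process sampler plus memo tables. This is a sketch of the same generative process, not Church: the name crp_sampler is mine, and the class standard deviation is clamped positive with abs() where the slide draws it from gaussian(0, 8):

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

def crp_sampler(alpha=1.0):
    """Chinese restaurant process: reuse an existing class with probability
    proportional to its count, or open a new class with weight alpha."""
    counts = []  # customers seated at each table (objects per class)
    def sample():
        r = random.uniform(0, sum(counts) + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                return k
        counts.append(1)           # open a new table
        return len(counts) - 1
    return sample

class_distribution = crp_sampler(1.0)
object_to_class = {}   # memoization: each object keeps the class it first drew
class_params = {}      # memoization: each class keeps its Gaussian parameters

def observe(obj):
    if obj not in object_to_class:
        object_to_class[obj] = class_distribution()
    k = object_to_class[obj]
    if k not in class_params:
        class_params[k] = (random.gauss(65, 10), abs(random.gauss(0, 8)))
    mu, sigma = class_params[k]
    return random.gauss(mu, sigma)  # fresh observation from the object's class

print([round(observe(o)) for o in ["tom", "dick", "harry", "bill", "fred"]])
```

The memo tables play the role of Church's mem: repeated observations of the same object reuse its class and parameters, which is exactly the exchangeability the slide deck highlights.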

Slide 23

Church example: Cross-categorization (BayesDB)

(define kind-distribution (DPmem 1.0 gensym))

(define feature->kind
  (mem (lambda (feature) (kind-distribution))))

(define kind->class-distribution
  (mem (lambda (kind) (DPmem 1.0 gensym))))

(define feature-kind/object->class
  (mem (lambda (kind object)
         (sample (kind->class-distribution kind)))))

(define class->parameters
  (mem (lambda (object-class) (first (beta 1 1)))))

(define (observe object feature)
  (flip (class->parameters
          (feature-kind/object->class (feature->kind feature) object))))

(observe 'eggs 'breakfast)

https://probmods.org/non-parametric-models.html
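The same trick extends to cross-categorization: features are grouped into kinds, each kind carries its own clustering of objects, and each (kind, class) cell gets a coin weight. Again a plain-Python sketch, not Church; memo and crp_sampler are my stand-ins for mem and DPmem:

```python
import random

random.seed(0)

def crp_sampler(alpha=1.0):
    # Chinese restaurant process over integer labels (stands in for DPmem)
    counts = []
    def sample():
        r = random.uniform(0, sum(counts) + alpha)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                counts[k] += 1
                return k
        counts.append(1)
        return len(counts) - 1
    return sample

def memo(table, key, thunk):
    # Stand-in for Church's mem: draw once per key, then reuse forever
    if key not in table:
        table[key] = thunk()
    return table[key]

kind_distribution = crp_sampler(1.0)
kind_of_feature = {}   # feature -> kind (features in a kind share a clustering)
classes_of_kind = {}   # kind -> that kind's own CRP over object classes
class_of = {}          # (kind, object) -> class within that kind
weight_of = {}         # (kind, class) -> Beta(1, 1) coin weight

def observe(obj, feature):
    kind = memo(kind_of_feature, feature, kind_distribution)
    sampler = memo(classes_of_kind, kind, crp_sampler)
    cls = memo(class_of, (kind, obj), sampler)
    w = memo(weight_of, (kind, cls), lambda: random.betavariate(1, 1))
    return random.random() < w  # flip(w): a fresh boolean observation

print(observe("eggs", "breakfast"))
```

Note that memo must test membership before calling the thunk; a dict.setdefault would draw from the CRP even on cache hits and corrupt its counts.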

Slide 24

Churj?

Jurch?

Slide 25

So Far
‣ Why
‣ What
‣ How
‣ When

Slide 26

What We Still Need
1. Basic CS: improved compilers and run-times for more efficient automatic inference
2. Tooling: debuggers, optimizers, IDEs, visualization
3. Tribal knowledge: idioms, patterns, best practices

Slide 27

When?

Slide 28

The Probabilistic Programming Revolution

Traditional Programming stack: Application → Code Libraries → Programming Language → Compiler → Hardware
Probabilistic Programming stack: Model → Model Libraries → Probabilistic Programming Language → Inference Engine → Hardware

‣ Models capture how the data was generated, using random variables to represent uncertainty
‣ Libraries contain common model components: Markov chains, deep belief networks, etc.
‣ PPL provides probabilistic primitives & traditional PL constructs so users can express model, queries, and data
‣ Inference engine analyzes the probabilistic program and chooses appropriate solver(s) for the available hardware
‣ Hardware can include multi-core, GPU, cloud-based resources, GraphLab, UPSIDE/Analog Logic results, etc.

High-level programming languages facilitate building complex systems; probabilistic programming languages facilitate building rich ML applications.

Approved for Public Release; Distribution Unlimited

Slide 29

The Promise of Probabilistic Programming Languages

• Shorter: Reduce LOC by 100x for machine learning applications
  • Seismic monitoring: 28K LOC in C vs. 25 LOC in BLOG
  • Microsoft MatchBox: 15K LOC in C# vs. 300 LOC in Fun
• Faster: Reduce development time by 100x
  • Seismic monitoring: several years vs. 1 hour
  • Microsoft TrueSkill: six months for a competent developer vs. 2 hours with Infer.Net
  • Enable quick exploration of many models
• More informative: Develop models that are 10x more sophisticated
  • Enable surprising, new applications
  • Incorporate rich domain knowledge
  • Produce more accurate answers
  • Require less data
  • Increase robustness with respect to noise
  • Increase ability to cope with contradiction
• With less expertise: Enable 100x more programmers
  • Separate the model (the program) from the solvers (the compiler), enabling domain experts without machine learning PhDs to write applications

Probabilistic programming could empower domain experts and ML experts

Sources:
• Bayesian Data Analysis, Gelman, 2003
• Pattern Recognition and Machine Learning, Bishop, 2007
• Science, Tenenbaum et al., 2011

DISTRIBUTION STATEMENT F. Further dissemination only as directed by DARPA (February 20, 2013) or higher DoD authority.

Slide 30

Optimizer “What is happening when I run this?”

Slide 31

Profiler “Where is the time and memory being used?”

Slide 32

Debugger “What is the exact state of my program at each point in time?”

Slide 33

Visualization “What is the hidden structure of my data, and how certain should I be?” http://www.icg.tugraz.at/project/caleydo/

Slide 34

Probabilistic Programming Workflows?

Definition: a data workflow runs data sources → ETL → data prep → predictive model → end uses.

For example, Cascading and related projects implement the following components, based on 100% open source (cascading.org):
‣ Lingual: DW → ANSI SQL
‣ Pattern: SAS, R, etc. → PMML
‣ business logic in Java, Clojure, Scala, etc.
‣ source taps for Cassandra, JDBC, Splunk, etc.
‣ sink taps for Memcached, HBase, MongoDB, etc.

adapted from Paco Nathan: Data Workflows for Machine Learning

Slide 35

Evolution of PPSs

Slide 36

Bottom Line
‣ Go experiment and learn! - there are several good options
‣ But be realistic about the current state of the art
‣ And keep your ear to the ground - this area is moving fast

Slide 37

Parting Questions
‣ Which projects are good fits for probabilistic programming today?
‣ Exploration and prototyping vs. scaled production deployment?
‣ How long before we have the Python, Ruby, and even PHP of PPSs?
‣ Is there a unification with the log-centric view of big data processing?
‣ Can natively stochastic hardware provide compelling performance gains?

Slide 38

Resources
‣ probabilistic-programming.org
‣ Probabilistic Programming and Bayesian Methods for Hackers
‣ Probabilistic Models of Cognition
‣ Mathematica Journal article
‣ Thomas Wiecki’s PyData talk on PyMC

Slide 39

People To Watch
‣ Vikash Mansinghka (MIT)
‣ Noah Goodman (Stanford)
‣ David Wingate (Lyric Labs)
‣ Avi Pfeffer (CRA)
‣ Rob Zinkov (USC)
‣ Andrew Gordon (MSR)
‣ John Winn (MSR)
‣ Dan Roy (Cambridge)

Slide 40

Languages and Systems
‣ PyMC
‣ infer.net
‣ STAN
‣ Figaro
‣ BLOG
‣ Church
‣ factor.ie
‣ BUGS / JAGS

Slide 41

@beaucronin