Probabilistic Programming - Strata Santa Clara 2014

Beau Cronin
February 13, 2014


  1. Probabilistic Programming: Why, What, How, When Beau Cronin @beaucronin

  2. 40 Action-Packed Minutes
     ‣ Why you should care - what’s wrong with what we’ve got?
     ‣ What probabilistic programming is, and what programs look like
     ‣ How you can get started today
     ‣ When will all of this be ready for production use?
  3. Why?

  4. We use data to learn about the world

     Traditional Machine Learning vs. Hierarchical Bayesian Modeling:
     ‣ Scale: Large vs. Small
     ‣ Tools & frameworks: Mature & Robust vs. Immature & Spotty
     ‣ Structure & Knowledge: Discard vs. Keep & Leverage
     ‣ Data Types: Homogeneous vs. Heterogeneous
     ‣ Philosophical Approach: Toolkit, Theory-light vs. Modeling, Theory-heavy
     Why?
  5. G = {V, E}
     ‣ What order were these links added in?
     ‣ What messages flow over this link?
     ‣ What do we know about this user?
     Why?
  6. [Example table of a heterogeneous dataset: columns mix positive
     numbers, categorical values, locations (latitude/longitude), time
     series, addresses, and missing values.]
     Why?
  7. Diverse Data

     Most real datasets contain compositions of these and more, but we
     routinely homogenize them in preprocessing: Trees & Graphs, Time
     Series, Relations, Locations & Addresses, Images & Movies, Audio,
     Sets & Partitions, Text.
     Why?
  8. Business Data Is Heterogeneous and Structured

     [Example: a single customer record spans a profile (id, gender,
     date of birth, twitter_id), timestamped page views, an order-history
     table (order date, order ID, title, category, ASIN/ISBN, release
     date, condition, seller, per-unit price), social posts with
     retweet and favorite counts, a follower list, and relationship
     flags such as blocked, want_retweets, and marked_spam.]
     Why?
  9. Every Domain Is Heterogeneous
     ‣ Health data: doctor notes, lab results, imaging, family history, prescriptions
     ‣ Quantified self: motion sensors, heart rate, GPS tracks, self-reporting, sleep patterns
     ‣ Autonomous vehicles: LIDAR, cameras, maps, audio, gyros, telemetry, GPS
     Why?
  10. Mostly, no one even tries to jointly model these different

    kinds of data Why?
  11. A probabilistic programming system is… a language + {compiler,
      interpreter}, or a {library, framework} for an existing language, that
      - includes random choices as native elements
      - provides a clean separation between probabilistic modeling and inference
      - and may provide automated generation of inference solutions for a given program
      What?
  12. Probabilistic Programming Systems Model the World
      ‣ Programs directly represent the data generation process
      ‣ Measurement processes can be modeled directly, including their imperfections and the uncertainty that comes with them
      ‣ Philosophy
      ‣ DO: capture the essential aspects of real-world processes in a model
      ‣ DON’T: torture the data into the right form for an algorithm
      What?
  13. A Probability Model
      ‣ Fixed: constant values and structural assumptions
      ‣ Unknown: variables that discriminate between hypotheses
      ‣ Observable: data and potential data
      What?
  14. Obligatory Bayes’ Rule

      Pr(H | D, A) ∝ Pr(D | H, A) · Pr(H | A)

      where H = hypotheses, D = data, A = assumptions. With the
      assumptions left implicit:

      Pr(H | D) ∝ Pr(D | H) · Pr(H)
      What?
  15. First example: deciding if a coin is fair based on flips

      # Assumptions
      fair-prior = 0.999
      # Unknowns
      fair-coin? = flip(fair-prior)
      if fair-coin?:
          weight = 0.5
      else:
          weight = 0.9
      # Observables
      observe(repeat(flip(weight), 10), [H, H, H, H, H, H, H, H, H, H])
      query(fair-coin?)
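The program above is small enough that the query can be answered by exact enumeration of the two hypotheses. A minimal Python sketch (plain arithmetic, not one of the PPSs discussed), assuming the pseudocode's numbers: a 0.999 prior on fairness, a 0.9-weighted biased coin, and ten observed heads:

```python
# Exact enumeration of the coin program: prior 0.999 on fairness,
# biased weight 0.9, ten heads observed.
fair_prior = 0.999
n_heads = 10

# Likelihood of ten heads in a row under each hypothesis
p_data_fair = 0.5 ** n_heads
p_data_biased = 0.9 ** n_heads

# Bayes' rule: Pr(fair | D) ∝ Pr(D | fair) Pr(fair)
unnorm_fair = p_data_fair * fair_prior
unnorm_biased = p_data_biased * (1 - fair_prior)
posterior_fair = unnorm_fair / (unnorm_fair + unnorm_biased)

print(f"Pr(fair-coin? | ten heads) = {posterior_fair:.3f}")
```

With these numbers the posterior probability of fairness comes out around 0.74: ten straight heads is strong evidence, but the 0.999 prior is stronger.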
  16. Probabilistic Programming Systems Are Diverse
      ‣ Library vs. stand-alone language
      ‣ Base language: Scala, Lisp, Python
      ‣ Manual, semi-, or fully-automated inference
      ‣ Modeling domain: directed/undirected graphical models, relational data, all programs
      ‣ Home field: cognitive science, programming languages, databases, Bayesian statistics, artificial intelligence
      What?
  17. PPSs Compared

      System      | Type        | Language         | Inference
      BLOG        | Stand-alone | Custom           | Fully Auto
      BUGS / JAGS | Stand-alone | Custom           | Fully Auto
      STAN        | Hybrid      | R, Python        | Fully Auto
      PyMC        | Library     | Python           | Manual
      Infer.NET   | Library     | C#               | Semi-auto
      Church      | Stand-alone | Lisp             | Fully Auto
      Venture     | Stand-alone | Javascript, Lisp | Semi-auto
      Figaro      | Library     | Scala            | Semi-auto
      factorie    | Library     | Scala            | Semi-auto
      What?
  18. Infer.NET
      ‣ A C# framework (also F#)
      ‣ Developed at MSR
      ‣ Under active development, with good tutorials and many well-documented examples
      How?
  19. Infer.NET example: Is a new treatment effective?

      // Observations
      VariableArray<bool> controlGroup =
          Variable.Observed(new bool[] { false, false, true, false, false });
      VariableArray<bool> treatedGroup =
          Variable.Observed(new bool[] { true, false, true, true, true });
      Range i = controlGroup.Range;
      Range j = treatedGroup.Range;

      // Unknown
      Variable<bool> isEffective = Variable.Bernoulli(0.5);

      // Assumptions & Unknowns
      Variable<double> probIfTreated, probIfControl;
      using (Variable.If(isEffective))
      {
          // Model if treatment is effective
          probIfControl = Variable.Beta(1, 1);
          controlGroup[i] = Variable.Bernoulli(probIfControl).ForEach(i);
          probIfTreated = Variable.Beta(1, 1);
          treatedGroup[j] = Variable.Bernoulli(probIfTreated).ForEach(j);
      }
      using (Variable.IfNot(isEffective))
      {
          // Model if treatment is not effective
          Variable<double> probAll = Variable.Beta(1, 1);
          controlGroup[i] = Variable.Bernoulli(probAll).ForEach(i);
          treatedGroup[j] = Variable.Bernoulli(probAll).ForEach(j);
      }

      // Query
      InferenceEngine ie = new InferenceEngine();
      Console.WriteLine("Probability treatment has an effect = " + ie.Infer(isEffective));
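Because the priors are Beta(1, 1), this tiny model can also be checked by hand: the success rates integrate out analytically. The Python sketch below computes the exact posterior on isEffective for the five observations per group; it is a verification of the model, not Infer.NET's message-passing machinery, and exact enumeration like this would not scale to larger models.

```python
from math import factorial

def seq_marginal(k, n):
    # Probability of one particular binary sequence with k successes in
    # n trials, with the success rate integrated out under Beta(1, 1):
    # ∫ p^k (1-p)^(n-k) dp = k! (n-k)! / (n+1)!
    return factorial(k) * factorial(n - k) / factorial(n + 1)

control = [False, False, True, False, False]   # 1 success of 5
treated = [True, False, True, True, True]      # 4 successes of 5

kc, nc = sum(control), len(control)
kt, nt = sum(treated), len(treated)

# isEffective = True: each group gets its own success rate
m_effective = seq_marginal(kc, nc) * seq_marginal(kt, nt)
# isEffective = False: one shared rate across all ten trials
m_not_effective = seq_marginal(kc + kt, nc + nt)

# Uniform 0.5 prior on isEffective, as in the model above, so the
# prior cancels in the posterior ratio
p_effective = m_effective / (m_effective + m_not_effective)
print(f"Probability treatment has an effect = {p_effective:.3f}")
```

The data here only weakly favor an effect (posterior around 0.75), which matches the intuition that 1/5 vs. 4/5 successes is suggestive but far from conclusive.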
  20. PyMC
      ‣ Python (duh)
      ‣ Go watch Thomas Wiecki’s talk from PyData NY
      ‣ And read Bayesian Methods for Hackers by Cam Davidson-Pilon et al.
      How?
  21. Church
      ‣ A Lisp
      ‣ Originally created to model cognitive development and human reasoning
      ‣ Active inference research, several implementations
      ‣ Connection between functional purity / independence and stochastic memoization / exchangeability
      ‣ Hypothesis space is possible program executions
      ‣ “Probabilistic Models of Cognition”
      How?
  22. Church example: Infinite Gaussian Mixture Model

      ;stochastic memoization generator for class assignments
      ;sometimes return a previous symbol, sometimes create a new one
      (define class-distribution (DP-stochastic-mem 1.0 gensym))

      ;associate a class with an object via memoization
      (define object->class
        (mem (lambda (object) (class-distribution))))

      ;associate gaussian parameters with a class via memoization
      (define class->gaussian-parameters
        (mem (lambda (class)
               (list (gaussian 65 10) (gaussian 0 8)))))

      ;generate observed values for an object
      (define (observe object)
        (apply gaussian (class->gaussian-parameters (object->class object))))

      ;generate observations for some objects
      (map observe '(tom dick harry bill fred))

      (modified from a Church example)
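The DP-stochastic-mem idiom can be imitated in plain Python: a Chinese-restaurant-process draw that reuses old values or mints new ones, plus ordinary memoization. This is a forward-sampling sketch only (no inference); `dp_stochastic_mem`, `mem`, and `gensym` are hypothetical stand-ins for the Church primitives above, and the second Gaussian parameter is passed through abs() so it is a valid standard deviation, which the Church snippet leaves implicit.

```python
import random
import itertools

random.seed(0)

_counter = itertools.count()
def gensym():
    # Stand-in for Church's gensym: a fresh symbol on every call
    return f"class-{next(_counter)}"

def dp_stochastic_mem(alpha, base):
    # Chinese-restaurant-process draw: return a previously generated
    # value with probability proportional to its count, or call `base`
    # for a new value with probability proportional to alpha
    values, counts = [], []
    def draw():
        r = random.uniform(0, sum(counts) + alpha)
        for idx, c in enumerate(counts):
            r -= c
            if r < 0:
                counts[idx] += 1
                return values[idx]
        values.append(base())
        counts.append(1)
        return values[-1]
    return draw

def mem(f):
    # Church-style memoization: same arguments, same (random) result
    cache = {}
    def memoized(*args):
        if args not in cache:
            cache[args] = f(*args)
        return cache[args]
    return memoized

class_distribution = dp_stochastic_mem(1.0, gensym)
object_to_class = mem(lambda obj: class_distribution())
class_to_gaussian_parameters = mem(
    lambda cls: (random.gauss(65, 10), abs(random.gauss(0, 8))))

def observe(obj):
    mu, sigma = class_to_gaussian_parameters(object_to_class(obj))
    return random.gauss(mu, sigma)

samples = [observe(name) for name in ['tom', 'dick', 'harry', 'bill', 'fred']]
print(samples)
```

Because class assignments are memoized, repeated observations of the same object stay in the same mixture component, while new objects can open new components forever: an infinite mixture.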
  23. Church example: Cross-categorization (BayesDB)

      (define kind-distribution (DPmem 1.0 gensym))

      (define feature->kind
        (mem (lambda (feature) (kind-distribution))))

      (define kind->class-distribution
        (mem (lambda (kind) (DPmem 1.0 gensym))))

      (define feature-kind/object->class
        (mem (lambda (kind object)
               (sample (kind->class-distribution kind)))))

      (define class->parameters
        (mem (lambda (object-class) (first (beta 1 1)))))

      (define (observe object feature)
        (flip (class->parameters
                (feature-kind/object->class (feature->kind feature) object))))

      (observe 'eggs 'breakfast)
  24. Churj? Jurch? How?

  25. So Far ‣ Why ‣ What ‣ How ‣ When

  26. What We Still Need
      1. Basic CS: improved compilers and run-times for more efficient automatic inference
      2. Tooling: debuggers, optimizers, IDEs, visualization
      3. Tribal knowledge: idioms, patterns, best practices
      When?
  27. When?

  28. The Probabilistic Programming Revolution

      Traditional Programming stack: Application • Code Libraries •
      Programming Language • Compiler • Hardware

      Probabilistic Programming stack: Model • Model Libraries •
      Probabilistic Programming Language • Inference Engine • Hardware

      ‣ Models capture how the data was generated, using random variables to represent uncertainty
      ‣ Libraries contain common model components: Markov chains, deep belief networks, etc.
      ‣ The PPL provides probabilistic primitives & traditional PL constructs so users can express model, queries, and data
      ‣ The inference engine analyzes the probabilistic program and chooses appropriate solver(s) for the available hardware
      ‣ Hardware can include multi-core, GPU, cloud-based resources, GraphLab, UPSIDE/Analog Logic results, etc.

      High-level programming languages facilitate building complex systems;
      probabilistic programming languages facilitate building rich ML applications.
      (Approved for Public Release; Distribution Unlimited)
  29. The Promise of Probabilistic Programming Languages
      ‣ Shorter: reduce LOC by 100x for machine learning applications
        - Seismic monitoring: 28K LOC in C vs. 25 LOC in BLOG
        - Microsoft MatchBox: 15K LOC in C# vs. 300 LOC in Fun
      ‣ Faster: reduce development time by 100x
        - Seismic monitoring: several years vs. 1 hour
        - Microsoft TrueSkill: six months for a competent developer vs. 2 hours with Infer.Net
        - Enable quick exploration of many models
      ‣ More informative: develop models that are 10x more sophisticated
        - Enable surprising, new applications
        - Incorporate rich domain knowledge
        - Produce more accurate answers
        - Require less data
        - Increase robustness with respect to noise
        - Increase ability to cope with contradiction
      ‣ With less expertise: enable 100x more programmers
        - Separate the model (the program) from the solvers (the compiler), enabling domain experts without machine learning PhDs to write applications

      Probabilistic programming could empower domain experts and ML experts.
      Sources: Bayesian Data Analysis, Gelman, 2003; Pattern Recognition
      and Machine Learning, Bishop, 2007; Science, Tenenbaum et al., 2011
      (DISTRIBUTION STATEMENT F. Further dissemination only as directed by DARPA, February 20, 2013, or higher DoD authority.)
  30. Optimizer “What is happening when I run this?”

  31. Profiler “Where is the time and memory being used?”

  32. Debugger “What is the exact state of my program at

    each point in time?”
  33. Visualization “What is the hidden structure of my data, and

    how certain should I be?”
  34. Probabilistic Programming Workflows?

      Definition: data workflows span data sources, ETL, data prep, a
      predictive model, and end uses. For example, Cascading and related
      projects implement the following components, based on 100% open source:
      ‣ Lingual: DW → ANSI SQL
      ‣ Pattern: SAS, R, etc. → PMML
      ‣ Business logic in Java, Clojure, Scala, etc.
      ‣ Source taps for Cassandra, JDBC, Splunk, etc.
      ‣ Sink taps for Memcached, HBase, MongoDB, etc.
      (adapted from Paco Nathan: Data Workflows for Machine Learning)
  35. Evolution of PPSs When?

  36. Bottom Line
      ‣ Go experiment and learn! There are several good options
      ‣ But be realistic about the current state of the art
      ‣ And keep your ear to the ground - this area is moving fast
  37. Parting Questions
      ‣ Which projects are good fits for probabilistic programming today?
      ‣ Exploration and prototyping vs. scaled production deployment?
      ‣ How long before we have the Python, Ruby, and even PHP of PPSs?
      ‣ Is there a unification with the log-centric view of big data processing?
      ‣ Can natively stochastic hardware provide compelling performance gains?
      When?
  38. Resources
      ‣ Probabilistic Programming and Bayesian Methods for Hackers
      ‣ Probabilistic Models of Cognition
      ‣ Mathematica Journal article
      ‣ Thomas Wiecki’s PyData talk on PyMC
  39. People To Watch
      ‣ Vikash Mansinghka (MIT)
      ‣ Noah Goodman (Stanford)
      ‣ David Wingate (Lyric Labs)
      ‣ Avi Pfeffer (CRA)
      ‣ Rob Zinkov (USC)
      ‣ Andrew Gordon (MSR)
      ‣ John Winn (MSR)
      ‣ Dan Roy (Cambridge)
  40. Languages and Systems
      ‣ PyMC
      ‣ STAN
      ‣ Figaro
      ‣ BLOG
      ‣ Church
      ‣ BUGS / JAGS
  41. @beaucronin