
Probabilistic Programming - Strata Santa Clara 2014

Beau Cronin
February 13, 2014



Transcript

  1. Probabilistic Programming:
    Why, What, How, When
    Beau Cronin
    @beaucronin


  2. 40 Action-Packed Minutes
    ‣ Why you should care - what’s wrong with what we’ve got?
    ‣ What probabilistic programming is, and what programs look like
    ‣ How you can get started today
    ‣ When will all of this be ready for production use?


  3. Why?


  4. We use data to learn about the world

                                Traditional              Hierarchical
                                Machine Learning         Bayesian Modeling
     Scale                      Large                    Small
     Tools & frameworks         Mature & robust          Immature & spotty
     Structure & knowledge      Discard                  Keep & leverage
     Data types                 Homogeneous              Heterogeneous
     Philosophical approach     Toolkit, theory-light    Modeling, theory-heavy

    Why?


  5. G = {V, E}
    What order were these links added in?
    What messages flow over this link?
    What do we know about this user?
    Why?


  6.      x1    x2   lat1    long1    t1    t2    t3    t4   address1
     1   1.2    2    34.0    118.2   2.3   3.4   1.9  10.4   516 61st St,
     2   0.1    1    40.7     73.9  -1.5   4.5   8.9         2305 Tustin
     3  10.5    0    37.9    122.3   4.7  -2.5  -3.4         1 Market St.
     4   8.3   -1   -22.9     43.2   4.2   5.6   1.6   9.5
     5   4.9    5   -37.8   -145.0                           1600 Pennsyl
     6   1.5    1                    3.4   4.0   4.6   5.2   650 7th St., S
    Column types: positive numbers, categorical values, locations, time series,
    addresses — with missing values throughout
    Why?


  7. Diverse Data
    Most real datasets contain compositions of these and
    more, but we routinely homogenize in preprocessing
    Lorem Ipsum
    Trees &
    Graphs
    Time
    Series
    Relations
    Locations &
    Addresses
    Images &
    Movies
    Audio
    Sets &
    Partitions
    Text
    Why?


  8. Business Data Is Heterogeneous and
    Structured
    id: “abcdef”
    gender: “Male”
    dob: 1978-12-09
    twitter_id: 9458201
    Profile
    2014-01-21 18:41:04, “https://devcenter.heroku.com/articles/quickstart”, …
    2014-01-20 12:35:56, “https://devcenter.heroku.com/categories/java”, …
    2014-01-20 09:12:52, “https://devcenter.heroku.com/articles/ssl-endpoint”, …
    Page Views
    Order Date | Order ID | Title | Category | ASIN/ISBN | Release Date | Condition | Seller | Per Unit Price
    1/5/13 | 002-1139353-0278652 | Under Armour Men's Resistor No Show Socks, pack of 6 | Socks, Apparel | B003RYQJJW | | new | The Sock Company, Inc. | $21.99
    1/5/13 | 002-1139353-0278652 | Under Armour Men's Resistor No Show Socks, pack of 6 | Socks, Apparel | B004UONNXI | | new | The Sock Company, Inc. | $21.99
    1/8/13 | 002-2593752-8837806 | CivilWarLand in Bad Decline | Paperback | 1573225797 | 1/31/97 | new | Amazon.com LLC | $8.4
    1/8/13 | 109-0985451-2187421 | Nothing to Envy: Ordinary Lives in North Korea | Paperback | 385523912 | 9/20/10 | new | Amazon.com LLC | $10.88
    1/12/13 | 109-8581642-2322617 | Excession | Mass Market Paperback | 553575376 | 2/1/98 | new | Amazon.com LLC | $7.99
    Transactions
    [
      {
        text: “key to compelling VR is…”,
        retweet_count: 3,
        favorites_count: 5,
        urls: [ ],
        hashtags: [ ],
        in_reply_to: 39823792801012
      },
      {
        text: “@John4man really liked your piece”,
        retweets: 0,
        favorites: 0
      }
    ]
    Social Posts
    [ 657693, 7588892, 9019482, …]
    Followers
    blocked: False
    want_retweets: True
    marked_spam: False
    since: 2013-09-13
    Relationship


  9. Every Domain Is Heterogeneous
    ‣ Health data: doctor notes, lab results, imaging, family history,
    prescriptions
    ‣ Quantified self: motion sensors, heart rate, GPS tracks,
    self-reporting, sleep patterns
    ‣ Autonomous vehicles: LIDAR, cameras, maps, audio, gyros,
    telemetry, GPS
    Why?


  10. Mostly, no one even tries
    to jointly model these
    different kinds of data
    Why?


  11. A probabilistic programming system is…
    a language + {compiler, interpreter}, or
    a {library, framework} for an existing language, that
    - includes random choices as native elements
    - and provides a clean separation between probabilistic modeling
    and inference
    - and may provide automated generation of inference solutions for a
    given program
    What?


  12. Probabilistic Programming
    Systems Model the World
    ‣ Programs directly represent the data generation process
    ‣ Measurement processes can be modeled directly, including their
    imperfections and the uncertainty that comes with them
    ‣ Philosophy
    ‣ DO: capture the essential aspects of real-world processes in a model
    ‣ DON’T: torture the data into the right form for an algorithm
    What?


  13. A Probability Model
    Fixed: constant values and structural assumptions
    Unknown: variables that discriminate between hypotheses
    Observable: data and potential data
    (✕ N: the observable/unknown portion is replicated once per data point)
    What?


  14. Obligatory Bayes’ Rule
    Pr(H | D, A) ∝ Pr(D | H, A) Pr(H | A)
    where H = hypotheses, D = data, A = assumptions.
    With the assumptions left implicit:
    Pr(H | D) ∝ Pr(D | H) Pr(H)
    What?
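The proportionality on this slide is easy to make concrete: enumerate a discrete hypothesis space, multiply each prior by its likelihood, and divide by the sum. A minimal sketch, not from the deck — the diagnostic-test numbers (1% base rate, 95% sensitivity, 5% false-positive rate) are invented for illustration:

```python
def posterior(priors, likelihoods):
    """Discrete Bayes' rule: Pr(H | D) is proportional to Pr(D | H) Pr(H)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnormalized.values())          # Pr(D), the normalizing constant
    return {h: w / z for h, w in unnormalized.items()}

# Hypothetical diagnostic test, invented for illustration
priors = {"disease": 0.01, "healthy": 0.99}
likelihoods = {"disease": 0.95, "healthy": 0.05}   # Pr(positive test | H)
print(posterior(priors, likelihoods))
```

Even with a positive test, the low base rate keeps the posterior probability of disease well under one in five.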


  15. First example: deciding if a coin is fair based on flips

    fair-prior = 0.999                       # assumption

    fair-coin? = flip(fair-prior)            # unknown

    if fair-coin?:
        weight = 0.5
    else:
        weight = 0.9

    observe(repeat(flip(weight), 10),        # observables
            [H, H, H, H, H, H, H, H, H, H])

    query(fair-coin?)
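Because this model has only two hypotheses, the query can be answered by exact enumeration, with no probabilistic-programming runtime at all. A plain-Python sketch of the same posterior, not from the deck:

```python
# Exact enumeration of the coin model above (a sketch, not a real PPS runtime)
fair_prior = 0.999
p_data_given_fair = 0.5 ** 10        # ten heads from a fair coin
p_data_given_biased = 0.9 ** 10      # ten heads from the 0.9-weighted coin

joint_fair = fair_prior * p_data_given_fair
joint_biased = (1 - fair_prior) * p_data_given_biased
p_fair_given_data = joint_fair / (joint_fair + joint_biased)
print(round(p_fair_given_data, 4))   # ≈ 0.7367
```

Ten heads in a row still leave the coin roughly 74% likely to be fair, because the 0.999 prior takes a lot of evidence to overturn.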


  16. Probabilistic Programming
    Systems Are Diverse
    ‣ Library vs. stand-alone language
    ‣ Base language: Scala, Lisp, Python
    ‣ Manual, semi-, or fully-automated inference
    ‣ Modeling domain: directed/undirected graphical models, relational
    data, all programs
    ‣ Home field: cognitive science, programming languages, databases,
    Bayesian statistics, artificial intelligence
    What?


  17. PPSs Compared

                  Type         Language          Inference
    BLOG          Stand-alone  Custom            Fully auto
    BUGS / JAGS   Stand-alone  Custom            Fully auto
    STAN          Hybrid       R, Python         Fully auto
    PyMC          Library      Python            Manual
    Infer.net     Library      C#                Semi-auto
    Church        Stand-alone  Lisp              Fully auto
    Venture       Stand-alone  Javascript, Lisp  Semi-auto
    Figaro        Library      Scala             Semi-auto
    factorie      Library      Scala             Semi-auto

    What?


  18. infer.net
    ‣ A C# framework (also F#)
    ‣ Developed at MSR
    ‣ Under active development, with good tutorials and many
    well-documented examples
    How?


  19. VariableArray<bool> controlGroup =
        Variable.Observed(new bool[] { false, false, true, false, false });
    VariableArray<bool> treatedGroup =
        Variable.Observed(new bool[] { true, false, true, true, true });
    Range i = controlGroup.Range; Range j = treatedGroup.Range;

    Variable<bool> isEffective = Variable.Bernoulli(0.5);

    Variable<double> probIfTreated, probIfControl;
    using (Variable.If(isEffective))
    {
        // Model if treatment is effective
        probIfControl = Variable.Beta(1, 1);
        controlGroup[i] = Variable.Bernoulli(probIfControl).ForEach(i);
        probIfTreated = Variable.Beta(1, 1);
        treatedGroup[j] = Variable.Bernoulli(probIfTreated).ForEach(j);
    }

    using (Variable.IfNot(isEffective))
    {
        // Model if treatment is not effective
        Variable<double> probAll = Variable.Beta(1, 1);
        controlGroup[i] = Variable.Bernoulli(probAll).ForEach(i);
        treatedGroup[j] = Variable.Bernoulli(probAll).ForEach(j);
    }

    InferenceEngine ie = new InferenceEngine();
    Console.WriteLine("Probability treatment has an effect = " + ie.Infer(isEffective));

    Infer.net example: Is a new treatment effective?
    http://research.microsoft.com/en-us/um/cambridge/projects/infernet/docs/Clinical%20trial%20tutorial.aspx
    Annotations: observations (the two groups), unknown (isEffective),
    assumptions & unknowns (the Beta priors), query (ie.Infer)
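Because both branches put Beta(1, 1) priors on the recovery rates, this particular model also has a closed-form answer, which makes a useful sanity check on an inference engine. A plain-Python sketch (not using Infer.NET) that integrates the rates out analytically:

```python
from math import factorial

def seq_marginal(k, n):
    """Marginal probability of one specific 0/1 sequence with k ones out of n,
    integrating the Bernoulli rate over a uniform Beta(1,1) prior:
    k! (n-k)! / (n+1)!."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)

control = [False, False, True, False, False]   # 1 recovery out of 5
treated = [True, False, True, True, True]      # 4 recoveries out of 5

# Model 1 (effective): separate recovery rates for the two groups
m_effective = seq_marginal(sum(control), len(control)) * \
              seq_marginal(sum(treated), len(treated))
# Model 0 (not effective): one shared rate for all 10 patients
m_not = seq_marginal(sum(control) + sum(treated), len(control) + len(treated))

# 0.5 prior on each model, as in the C# program above
p_effective = 0.5 * m_effective / (0.5 * m_effective + 0.5 * m_not)
print(round(p_effective, 4))   # ≈ 0.7549
```

The treated group looks different enough from the control group that the "effective" model gets about three-quarters of the posterior mass.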


  20. PyMC
    ‣ Python (duh)
    ‣ Go watch Thomas Wiecki’s talk from PyData NY
    ‣ http://twiecki.github.io/blog/2013/12/12/bayesian-data-analysis-pymc3/
    ‣ And read Bayesian Methods for Hackers by Cam Davidson-Pilon et al.
    How?


  21. Church
    ‣ A Lisp
    ‣ Originally created to model cognitive development and human reasoning
    ‣ Active inference research, several implementations
    ‣ Connection between functional purity / independence and stochastic
    memoization / exchangeability
    ‣ Hypothesis space is possible program executions
    ‣ “Probabilistic Models of Cognition”
    How?


  22. ; stochastic memoization generator for class assignments:
    ; sometimes return a previous symbol, sometimes create a new one
    (define class-distribution (DP-stochastic-mem 1.0 gensym))

    ; associate a class with an object via memoization
    (define object->class
      (mem (lambda (object) (class-distribution))))

    ; associate gaussian parameters with a class via memoization
    (define class->gaussian-parameters
      (mem (lambda (class) (list (gaussian 65 10) (gaussian 0 8)))))

    ; generate observed values for an object
    (define (observe object)
      (apply gaussian (class->gaussian-parameters (object->class object))))

    ; generate observations for some objects
    (map observe '(tom dick harry bill fred))

    modified from https://probmods.org/non-parametric-models.html
    Church example: Infinite Gaussian Mixture Model
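For readers who don't speak Lisp, the generative process above can be sketched in Python: a Chinese-restaurant-process sampler plays the role of DP-stochastic-mem, and plain dict-based caching plays the role of mem. This is a forward-sampling sketch only (no inference), with invented helper names:

```python
import random

def dp_stochastic_mem(alpha):
    """Chinese-restaurant-process stand-in for DP-stochastic-mem: sometimes
    return a previously created class label, sometimes create a new one."""
    counts = {}
    def draw():
        total = sum(counts.values())
        if not counts or random.random() < alpha / (alpha + total):
            label = f"class-{len(counts)}"   # fresh label, like gensym
        else:                                # reuse, proportional to past use
            labels = list(counts)
            label = random.choices(labels, weights=[counts[l] for l in labels])[0]
        counts[label] = counts.get(label, 0) + 1
        return label
    return draw

random.seed(0)
class_distribution = dp_stochastic_mem(1.0)

object_to_class = {}                 # `mem`: one class per object
def get_class(obj):
    if obj not in object_to_class:
        object_to_class[obj] = class_distribution()
    return object_to_class[obj]

class_params = {}                    # `mem`: one (mu, sigma) per class
def get_params(cls):
    if cls not in class_params:
        # the slide draws the second parameter from gaussian(0, 8), which can
        # be negative; taking abs() for sigma is our assumption, not the slide's
        class_params[cls] = (random.gauss(65, 10), abs(random.gauss(0, 8)))
    return class_params[cls]

def observe(obj):
    mu, sigma = get_params(get_class(obj))
    return random.gauss(mu, sigma)

samples = [observe(o) for o in ["tom", "dick", "harry", "bill", "fred"]]
```

Each object is permanently bound to a class, each class to its own Gaussian, and the CRP lets the number of classes grow with the data — which is exactly what makes the mixture "infinite".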


  23. (define kind-distribution (DPmem 1.0 gensym))

    (define feature->kind
      (mem (lambda (feature) (kind-distribution))))

    (define kind->class-distribution
      (mem (lambda (kind) (DPmem 1.0 gensym))))

    (define feature-kind/object->class
      (mem (lambda (kind object)
        (sample (kind->class-distribution kind)))))

    (define class->parameters
      (mem (lambda (object-class) (first (beta 1 1)))))

    (define (observe object feature)
      (flip (class->parameters (feature-kind/object->class
                                 (feature->kind feature) object))))

    (observe 'eggs 'breakfast)

    https://probmods.org/non-parametric-models.html
    Church example: Cross-categorization (BayesDB)
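The same translation works for this two-level model: a CRP closure stands in for DPmem, and functools.lru_cache stands in for mem. Again a forward-sampling sketch only, with invented helper names:

```python
import random
from functools import lru_cache

def crp(alpha=1.0):
    """A Chinese-restaurant-process sampler standing in for DPmem(1.0, gensym)."""
    counts = []
    def draw():
        total = sum(counts)
        if not counts or random.random() < alpha / (alpha + total):
            counts.append(1)
            return len(counts) - 1           # a fresh integer "gensym"
        k = random.choices(range(len(counts)), weights=counts)[0]
        counts[k] += 1
        return k
    return draw

random.seed(1)
kind_distribution = crp()                    # groups features into "kinds"

@lru_cache(maxsize=None)                     # `mem`: one kind per feature
def feature_to_kind(feature):
    return kind_distribution()

@lru_cache(maxsize=None)                     # one class distribution per kind
def kind_to_class_distribution(kind):
    return crp()

@lru_cache(maxsize=None)                     # one class per (kind, object)
def kind_object_to_class(kind, obj):
    return kind_to_class_distribution(kind)()

@lru_cache(maxsize=None)                     # one coin weight per class;
def class_to_param(kind, cls):               # keyed by kind too, since class
    return random.betavariate(1, 1)          # labels are only unique per kind

def observe(obj, feature):
    kind = feature_to_kind(feature)
    cls = kind_object_to_class(kind, obj)
    return random.random() < class_to_param(kind, cls)

print(observe("eggs", "breakfast"))
```

Features sharing a kind see the same partition of objects into classes, so the model can discover several independent categorizations of the same objects — the idea behind cross-categorization.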


  24. Churj?
    Jurch?
    How?


  25. So Far
    ‣ Why
    ‣ What
    ‣ How
    ‣ When


  26. What We Still Need
    1. Basic CS: Improved compilers and run-times for more efficient
    automatic inference
    2. Tooling: Debuggers, optimizers, IDEs, visualization
    3. Tribal knowledge: idioms, patterns, best practices
    When?


  27. When?


  28. The Probabilistic Programming Revolution

    Traditional Programming         Probabilistic Programming
    • Application                   • Model
    • Code Libraries                • Model Libraries
    • Programming Language          • Probabilistic Programming Language
    • Compiler                      • Inference Engine
    • Hardware                      • Hardware

    Models capture how the data was generated, using random variables to
    represent uncertainty.
    Libraries contain common model components: Markov chains, deep belief
    networks, etc.
    The PPL provides probabilistic primitives & traditional PL constructs so
    users can express model, queries, and data.
    The inference engine analyzes the probabilistic program and chooses
    appropriate solver(s) for the available hardware.
    Hardware can include multi-core, GPU, cloud-based resources, GraphLab,
    UPSIDE/Analog Logic results, etc.

    High-level programming languages facilitate building complex systems;
    probabilistic programming languages facilitate building rich ML applications.
    Approved for Public Release; Distribution Unlimited


  29. The Promise of Probabilistic Programming Languages

    • Shorter: reduce LOC by 100x for machine learning applications
      • Seismic monitoring: 28K LOC in C vs. 25 LOC in BLOG
      • Microsoft MatchBox: 15K LOC in C# vs. 300 LOC in Fun
    • Faster: reduce development time by 100x
      • Seismic monitoring: several years vs. 1 hour
      • Microsoft TrueSkill: six months for a competent developer vs. 2 hours with Infer.Net
      • Enable quick exploration of many models
    • More informative: develop models that are 10x more sophisticated
      • Enable surprising new applications
      • Incorporate rich domain knowledge
      • Produce more accurate answers
      • Require less data
      • Increase robustness with respect to noise
      • Increase ability to cope with contradiction
    • With less expertise: enable 100x more programmers
      • Separate the model (the program) from the solvers (the compiler),
        enabling domain experts without machine learning PhDs to write applications

    Probabilistic programming could empower domain experts and ML experts
    Sources:
    • Bayesian Data Analysis, Gelman, 2003
    • Pattern Recognition and Machine Learning, Bishop, 2007
    • Science, Tenenbaum et al., 2011
    DISTRIBUTION STATEMENT F. Further dissemination only as directed by DARPA (February 20, 2013) or higher DoD authority.


  30. Optimizer
    “What is happening
    when I run this?”


  31. Profiler
    “Where is the
    time and memory
    being used?”


  32. Debugger
    “What is the exact
    state of my program at
    each point in time?”


  33. Visualization
    “What is the hidden
    structure of my data,
    and how certain
    should I be?”
    http://www.icg.tugraz.at/project/caleydo/


  34. Probabilistic Programming Workflows?

    Definition: Data Workflows
    data sources → ETL → data prep → predictive model → end uses

    For example, Cascading and related projects implement the following
    components, based on 100% open source:
    ‣ Lingual: DW → ANSI SQL
    ‣ Pattern: SAS, R, etc. → PMML
    ‣ business logic in Java, Clojure, Scala, etc.
    ‣ source taps for Cassandra, JDBC, Splunk, etc.
    ‣ sink taps for Memcached, HBase, MongoDB, etc.
    cascading.org

    adapted from Paco Nathan: Data Workflows for Machine Learning


  35. Evolution of PPSs
    When?


  36. Bottom Line
    ‣ Go experiment and learn! - there are several good options
    ‣ But be realistic about the current state of the art
    ‣ And keep your ear to the ground - this area is moving fast


  37. Parting Questions
    ‣ Which projects are good fits for probabilistic programming today?
    ‣ Exploration and prototyping vs. scaled production deployment?
    ‣ How long before we have the Python, Ruby, and even PHP of PPSs?
    ‣ Is there a unification with the log-centric view of big data processing?
    ‣ Can natively stochastic hardware provide compelling performance
    gains?
    When?


  38. Resources
    ‣ probabilistic-programming.org
    ‣ Probabilistic Programming and Bayesian Methods for Hackers
    ‣ Probabilistic Models of Cognition
    ‣ Mathematica Journal article
    ‣ Thomas Wiecki’s PyData talk on PyMC


  39. People To Watch
    Vikash Mansinghka (MIT)
    Noah Goodman (Stanford)
    David Wingate (Lyric Labs)
    Avi Pfeffer (CRA)
    Rob Zinkov (USC)
    Andrew Gordon (MSR)
    John Winn (MSR)
    Dan Roy (Cambridge)


  40. Languages and Systems
    ‣ PyMC
    ‣ infer.net
    ‣ STAN
    ‣ Figaro
    ‣ BLOG
    ‣ Church
    ‣ factor.ie
    ‣ BUGS / JAGS


  41. @beaucronin
