
From square to round wheels... Moving from batch to real-time machine learning


Talk by Michael Cutler, CTO @Tumra, at Data Science London, 6/09/12

Data Science London

September 07, 2012


  1. From square to round wheels...
    Michael Cutler - 6th Sept 2012
    TUMRA LTD, Building 3, Chiswick Park,
    566 Chiswick High Road, W4 5YA
    ...moving from batch to real-time machine learning


  2. Batch


  3. Credit: http://bit.ly/Q71u4W


  4. In Manufacturing...
    Batch processing brought advantages :-

    Increased scale of production

    Reduced manufacturing cost

    Economies of scale (reusable parts)
    However :-

    Machinery is complex & expensive

    Each product requires some bespoke parts


  5. In Technology...
    Has been around since the 1950s on mainframes
    Hadoop (Map/Reduce) advantages :-

    Increased scale of processing

    Reduced processing cost **

    Economies of scale (reusable code)
    However :-

    Complex & expensive **

    Most jobs require some bespoke code


  6. Map/Reduce != FUN
    Sure, it's "just Java" but...

    Requires a certain mindset

    Multi-stage algorithm complexity

    If you get stuck, R.T.F.S.
    Alleviated to an extent by tools like :-

    Pig, Hive, Cascading, Crunch
    Typically requires bespoke code / algorithms
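
The "mindset" point can be illustrated with a minimal in-memory sketch of the three stages a Map/Reduce job is split into. This is plain Python rather than Hadoop or Java, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run the three stages in sequence over an in-memory 'dataset'."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Even a word count has to be expressed as key/value pairs flowing through separate stages; algorithms that need several such passes chain multiple jobs, which is where the multi-stage complexity comes from.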


  7. Continuous


  8. Credit: http://bit.ly/NOslqf


  9. In manufacturing...
    Described as:
    "a method used to manufacture, produce, or
    process materials without interruption"
    Key features :-

    Materials are processed in flows & streams

    Can run continuously (excl. maintenance)

    End-to-end latency can be from seconds to hours
    Credit: Wikipedia


  10. In Technology...
    We have a problem... most Hadoop-related
    technologies are inherently batch!!
    The trend towards real-time continuous
    computation requires :-

    New tools (Storm?)

    Better algorithms
    So what's the solution?


  11. Credit: Scott Simmerman


  12. It's a hybrid of both!


  13. Batch does have its place...
    Map/Reduce is great for 'boil the ocean' jobs;

    tasks that take hours or days

    typically non-interactive with users

    works well for pattern mining, clustering etc.
    However, the 'perfect' answer is useless if it
    arrives so late it's irrelevant...


  14. Real-time machine learning
    Quite simply "data is never at rest"...

    processed in streams not batches

    best for 'supervised learning' models

    end-to-end latency can be in seconds
    Key criteria :-

    model always has a 'best answer' available

    feedback used to train the model
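
The two key criteria can be sketched with an online supervised learner: it can be asked for a prediction at any moment, and each piece of feedback updates it immediately. The example below is a minimal sketch using logistic regression trained by stochastic gradient descent, which is an assumption for illustration and not a model named in the talk; the class name `OnlineLogisticRegression` and its parameters are hypothetical.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier updated one example at a time (illustrative only)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """The 'best answer' available right now: P(label == 1 | x)."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-35.0, min(35.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, label):
        """Feedback step: one gradient update from a single (x, label) pair."""
        error = self.predict_proba(x) - label
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
```

A consumer reading from a stream would call `predict_proba` for each incoming event and `learn` whenever the true outcome (a click, a purchase, a label) arrives, so no batch retraining window is needed.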



  16. So what works well in real-time?
    Classification :-

    Easiest to implement
    Clustering :-

    Periodically batch recompute clusters

    Add new data points to the nearest centroid

    Rinse, repeat
    Collaborative filtering :-

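
The clustering recipe above (batch recompute, add new points to the nearest centroid, repeat) can be sketched as follows. This is a minimal sketch under stated assumptions: the batch step is plain Lloyd's k-means, and the real-time step nudges the chosen centroid with a running-mean update, which is one possible reading of "add new data points to the nearest centroid". All function names are hypothetical.

```python
import math


def nearest(centroids, point):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(centroids[i], point))


def batch_recompute(points, centroids, iterations=10):
    """Periodic batch step: Lloyd's k-means over all stored points."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        # An empty group keeps its previous centroid.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def assign_stream_point(centroids, counts, point):
    """Real-time step: attach the point to its nearest centroid and move
    that centroid towards it (running mean, an assumption of this sketch)."""
    i = nearest(centroids, point)
    counts[i] += 1
    centroids[i] = [c + (x - c) / counts[i]
                    for c, x in zip(centroids[i], point)]
    return i
```

In use, `assign_stream_point` would run on every incoming event, while `batch_recompute` would run on a schedule over the accumulated points and replace the centroids wholesale.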

  17. The machine learning gap...
    Academic vs. Practical


  18. Machine learning gap...
    Academia are 'way out there' with new
    approaches and algorithms almost every day :-

    Many hard to implement in a parallel way
    We need more focus on :-

    Inherently distributed algorithms

    Practical implementations

    Speed over marginal accuracy improvements


  19. Mathematical navel gazing
    We need practical solutions to real-world
    Recommendations Rant!?!?!?!?!

    Most recommenders are 2D matrices

    Humans are not very 2D

    Is there an N-dimensional solution?


  20. Hybrid approach


  21. Hybrid approach


  22. Example Use-cases

    eCommerce optimisation

    Targeted advertising

    Financial services (risk modeling)

    Detecting anomalies in M2M data

    Automated metadata generation
    ... many more!


  23. Almost finished!


  24. Introducing TUMRA Labs
    API access to some of our real-time models :-

    Probabilistic Demographics

    Language detection **

    Sentiment analysis **

    Metadata Generation (entity extraction and
    disambiguation) **
    Free to sign up and easy to get started!


  25. Questions?
