From square to round wheels... Moving from batch to real-time machine learning

Talk by Michael Cutler, CTO @Tumra, at Data Science London, 6 September 2012

Data Science London

September 07, 2012

Transcript

  1. From square to round wheels... moving from batch to real-time machine learning
    Michael Cutler - 6th Sept 2012
    tumra.com @tumra
    TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA
  2. In Manufacturing... Batch processing brought advantages :-
    • Increased scale of production
    • Reduced manufacturing cost
    • Economies of scale (reusable parts)
    However :-
    • Machinery is complex & expensive
    • Each product requires some bespoke parts
  3. In Technology... Been around since the 50's in Mainframes
    Hadoop (Map/Reduce) advantages :-
    • Increased scale of processing
    • Reduced processing cost **
    • Economies of scale (reusable code)
    However :-
    • Complex & expensive **
    • Most jobs require some bespoke code
  4. Map/Reduce != FUN
    Sure it's "just Java" but...
    • Requires a certain mindset
    • Multi-stage algorithm complexity
    • If you get stuck, R.T.F.S.
    Alleviated to an extent by tools like :-
    • Pig, Hive, Cascading, Crunch
    Typically requires bespoke code / algorithms
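To make the "multi-stage algorithm complexity" point on slide 4 concrete, here is a minimal in-memory sketch of the map, shuffle and reduce phases in plain Python. It is not Hadoop code, and the names `run_stage`, `count_mapper`, `count_reducer`, `top_mapper`, `top_reducer` and `top_words` are hypothetical. Even a simple question ("which words are most frequent?") needs two chained stages:

```python
from collections import defaultdict
from itertools import chain


def run_stage(records, mapper, reducer):
    """Run one map -> shuffle -> reduce pass over an in-memory list."""
    # Map: each input record yields zero or more (key, value) pairs.
    pairs = chain.from_iterable(mapper(record) for record in records)
    # Shuffle: group values by key (a real framework does this between phases).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: collapse each key's values into zero or more output records.
    return [out for key, values in groups.items() for out in reducer(key, values)]


def count_mapper(line):
    for word in line.lower().split():
        yield word, 1


def count_reducer(word, ones):
    yield word, sum(ones)


def top_mapper(word_count):
    # Route every (word, count) pair to one key so a single reducer sees them all.
    yield "all", word_count


def top_reducer(_key, word_counts):
    # Sort by descending count, then alphabetically, and keep the top two.
    yield from sorted(word_counts, key=lambda wc: (-wc[1], wc[0]))[:2]


def top_words(lines):
    counts = run_stage(lines, count_mapper, count_reducer)  # stage 1: word count
    return run_stage(counts, top_mapper, top_reducer)       # stage 2: top two
```

For example, `top_words(["batch batch stream", "stream batch real"])` returns `[("batch", 3), ("stream", 2)]`. The second stage only exists to reshape the first stage's output, which is the kind of bespoke plumbing the slide complains about.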
  5. In manufacturing... Described as: "a method used to manufacture, produce, or process materials without interruption"
    Key features :-
    • Materials are processed in flows & streams
    • Can run continuously (exc. maintenance)
    • Latency e2e can be from seconds to hours
    Credit: Wikipedia
  6. In Technology... We have a problem... most Hadoop related technologies are inherently batch!!
    The trend towards real-time continuous computation requires :-
    • New tools (Storm?)
    • Better algorithms
    So what's the solution?
  7. Batch does have its place...
    Map/Reduce is great for 'boil the ocean' jobs;
    • tasks that take hours or days
    • typically non-interactive with users
    • works well for pattern mining, clustering etc.
    However, the 'perfect' answer is useless if it arrives so late it's irrelevant...
  8. Real-time machine learning
    Quite simply "data is never at rest"...
    • processed in streams not batches
    • best for 'supervised learning' models
    • end-to-end latency can be in seconds
    Key criteria :-
    • model always has a 'best answer' available
    • feedback used to train the model
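The two key criteria on slide 8 (a 'best answer' is always available, and feedback trains the model) can be sketched with an online learner. The talk does not name an algorithm, so this is a minimal illustration that assumes logistic regression updated by stochastic gradient descent, one event at a time; `OnlineLogisticRegression` and `process_stream` are hypothetical names, not TUMRA code.

```python
import math


class OnlineLogisticRegression:
    """Binary classifier updated one example at a time (assumed technique)."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # The current 'best answer': usable at any moment, even before training.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp so math.exp cannot overflow
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, y):
        # Feedback step: one gradient-descent update on the log loss.
        error = self.predict_proba(x) - y
        self.bias -= self.learning_rate * error
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]


def process_stream(model, events):
    """Answer each event first, then train on its label once feedback arrives."""
    predictions = []
    for x, y in events:
        predictions.append(model.predict_proba(x))
        model.learn(x, y)
    return predictions
```

Each event costs one prediction and one constant-time update, so nothing is ever held back for a batch job. In practice the label usually arrives later than the features, so the two calls would be triggered by separate messages.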
  9. So what works well in real-time?
    Classification :-
    • Easiest to implement
    Clustering :-
    • Periodically batch recompute clusters
    • Add new data points to the nearest centroid
    • Rinse, repeat
    Collaborative filtering :-
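The clustering recipe on slide 9 (assign each new point to the nearest centroid, periodically recompute the clusters in batch) might look like the sketch below. It assumes Euclidean distance and plain Lloyd's k-means as the batch step, neither of which the slide specifies; `StreamingClusters`, `nearest` and `batch_recompute` are hypothetical names.

```python
import math


def nearest(centroids, point):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def batch_recompute(points, centroids, iterations=10):
    """Plain Lloyd's k-means, standing in for the periodic batch job."""
    for _ in range(iterations):
        members = [[] for _ in centroids]
        for p in points:
            members[nearest(centroids, p)].append(p)
        # Move each centroid to the mean of its members; keep it if it has none.
        centroids = [
            tuple(sum(axis) / len(group) for axis in zip(*group)) if group else c
            for group, c in zip(members, centroids)
        ]
    return centroids


class StreamingClusters:
    def __init__(self, centroids, recompute_every=100):
        self.centroids = [tuple(c) for c in centroids]
        self.recompute_every = recompute_every
        self.history = []  # simplification: a real system would bound or sample this

    def add(self, point):
        """Label the point immediately; trigger a batch recompute every N points."""
        point = tuple(point)
        label = nearest(self.centroids, point)
        self.history.append(point)
        if len(self.history) % self.recompute_every == 0:
            self.centroids = batch_recompute(self.history, self.centroids)
        return label
```

The real-time path (`add`) only does a nearest-centroid lookup, so its latency stays low. The expensive recompute runs inline here for brevity, but it could equally be a scheduled batch job that swaps in new centroids when it finishes.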
  10. Machine learning gap...
    Academia are 'way out there' with new approaches and algorithms almost every day :-
    • Many hard to implement in a parallel way
    We need more focus on :-
    • Inherently distributed algorithms
    • Practical implementations
    • Speed over marginal accuracy improvements
  11. Mathematical navel gazing
    We need practical solutions to real-world problems...
    Recommendations Rant!?!?!?!?!
    • Most recommenders are 2D matrices
    • Humans are not very 2D
    • Is there an N-dimensional solution?
  12. Example Use-cases
    Examples;
    • eCommerce optimisation
    • Targeted advertising
    • Financial services (risk modeling)
    • Detecting anomalies in M2M data
    • Automated metadata generation
    ... many more!
  13. Introducing TUMRA Labs
    API access to some of our real-time models :-
    • Probabilistic Demographics
    • Language detection **
    • Sentiment analysis **
    • Metadata Generation (entity extraction and disambiguation) **
    Free to signup and easy to get started! http://labs.tumra.com/