From square to round wheels.... Moving from batch to real-time machine learning

Michael Cutler CTO @Tumra talk at Data Science London 6/09/12

Data Science London

September 07, 2012

Transcript

  1. From square to round wheels... Michael Cutler - 6th Sept 2012 tumra.com @tumra TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA ...moving from batch to real-time machine learning
  2. In Manufacturing... Batch processing brought advantages :- • Increased scale of production • Reduced manufacturing cost • Economies of scale (reusable parts) However :- • Machinery is complex & expensive • Each product requires some bespoke parts
  3. In Technology... Been around since the 1950s in mainframes. Hadoop (Map/Reduce) advantages :- • Increased scale of processing • Reduced processing cost ** • Economies of scale (reusable code) However :- • Complex & expensive ** • Most jobs require some bespoke code
  4. Map/Reduce != FUN Sure it's "just Java" but... • Requires a certain mindset • Multi-stage algorithm complexity • If you get stuck, R.T.F.S. Alleviated to an extent by tools like :- • Pig, Hive, Cascading, Crunch Typically requires bespoke code / algorithms
  5. In manufacturing... Described as: "a method used to manufacture, produce, or process materials without interruption" Key features :- • Materials are processed in flows & streams • Can run continuously (exc. maintenance) • Latency e2e can be from seconds to hours Credit: Wikipedia
  6. In Technology... We have a problem... most Hadoop-related technologies are inherently batch!! The trend towards real-time continuous computation requires :- • New tools (Storm?) • Better algorithms So what's the solution?
  7. Batch does have its place... Map/Reduce is great for 'boil the ocean' jobs; • tasks that take hours or days • typically non-interactive with users • works well for pattern mining, clustering etc. However, the 'perfect' answer is useless if it arrives so late it's irrelevant...
  8. Real-time machine learning Quite simply "data is never at rest"... • processed in streams not batches • best for 'supervised learning' models • end-to-end latency can be in seconds Key criteria :- • model always has a 'best answer' available • feedback used to train the model
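The two key criteria on slide 8 (a 'best answer' is always available, and feedback trains the model) can be illustrated with a small online learner. The talk does not name an algorithm, so the sketch below swaps in plain online logistic regression as an assumption; the class name `OnlineLogisticModel`, the `clicked_ad` feature, and the learning rate are all hypothetical.

```python
import math


class OnlineLogisticModel:
    """Hypothetical streaming binary classifier, trained one event at a time."""

    def __init__(self, learning_rate=0.1):
        self.learning_rate = learning_rate  # assumed value, not from the talk
        self.weights = {}                   # feature name -> weight
        self.bias = 0.0

    def predict(self, features):
        """Return P(label == 1). Answers even before any training (0.5)."""
        score = self.bias + sum(self.weights.get(name, 0.0) * value
                                for name, value in features.items())
        score = max(-30.0, min(30.0, score))  # clamp so exp() cannot overflow
        return 1.0 / (1.0 + math.exp(-score))

    def learn(self, features, label):
        """Apply one stochastic gradient step from a single feedback event."""
        error = label - self.predict(features)
        self.bias += self.learning_rate * error
        for name, value in features.items():
            old = self.weights.get(name, 0.0)
            self.weights[name] = old + self.learning_rate * error * value


# Usage: predict first, then fold the observed outcome back into the model.
model = OnlineLogisticModel()
stream = [({"clicked_ad": 1.0}, 1), ({"clicked_ad": 0.0}, 0)] * 50
for features, label in stream:
    best_answer = model.predict(features)  # available at every step
    model.learn(features, label)           # feedback trains the model
```

Each event costs one prediction and one update, so nothing waits on a batch job; the trade-off is that a single noisy event can nudge the model in the wrong direction.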
  9. So what works well in real-time? Classification :- • Easiest to implement Clustering :- • Periodically batch recompute clusters • Add new data points to the nearest centroid • Rinse, repeat Collaborative filtering :-
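The clustering recipe on slide 9 (assign each new point to the nearest centroid, periodically recompute in batch, repeat) might look roughly like the sketch below. It is a minimal illustration under stated assumptions: Euclidean distance, a recompute triggered every `recompute_every` points, and a batch step that is a single mean update rather than a full k-means rerun. The names `StreamingClusterer`, `nearest_centroid` and `recompute_centroids` are hypothetical and do not come from the talk.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))


def recompute_centroids(points, assignments, centroids):
    """Batch step: move each centroid to the mean of its assigned points."""
    updated = []
    for index, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignments) if a == index]
        if not members:
            updated.append(old)  # an empty cluster keeps its old centroid
            continue
        updated.append(tuple(sum(axis) / len(members)
                             for axis in zip(*members)))
    return updated


class StreamingClusterer:
    def __init__(self, initial_centroids, recompute_every=100):
        self.centroids = [tuple(c) for c in initial_centroids]
        self.recompute_every = recompute_every  # assumed trigger, in points
        self.points = []       # kept in memory for simplicity (unbounded)
        self.assignments = []

    def add(self, point):
        """Real-time step: assign now; recompute in batch every N points."""
        point = tuple(point)
        cluster = nearest_centroid(point, self.centroids)
        self.points.append(point)
        self.assignments.append(cluster)
        if len(self.points) % self.recompute_every == 0:
            self.centroids = recompute_centroids(
                self.points, self.assignments, self.centroids)
        return cluster


# Usage: two seed centroids, recomputing after every second point.
clusterer = StreamingClusterer([(0.0, 0.0), (10.0, 10.0)], recompute_every=2)
first = clusterer.add((1.0, 1.0))
second = clusterer.add((9.0, 9.0))
```

In a real deployment the batch step would more plausibly run as a separate job over stored data, with the streaming side only reading the latest centroids; the in-process trigger here just keeps the example self-contained.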
  10. Machine learning gap... Academia are 'way out there' with new approaches and algorithms almost every day :- • Many hard to implement in a parallel way We need more focus on :- • Inherently distributed algorithms • Practical implementations • Speed over marginal accuracy improvements
  11. Mathematical navel gazing We need practical solutions to real-world problems... Recommendations Rant!?!?!?!?! • Most recommenders are 2D matrices • Humans are not very 2D • Is there an N-dimensional solution?
  12. Example Use-cases Examples; • eCommerce optimisation • Targeted advertising • Financial services (risk modeling) • Detecting anomalies in M2M data • Automated metadata generation ... many more!
  13. Introducing TUMRA Labs API access to some of our real-time models :- • Probabilistic Demographics • Language detection ** • Sentiment analysis ** • Metadata Generation (entity extraction and disambiguation) ** Free to sign up and easy to get started! http://labs.tumra.com/