From square to round wheels...
...moving from batch to real-time machine learning
Michael Cutler - 6th Sept 2012
tumra.com @tumra
TUMRA LTD, Building 3, Chiswick Park, 566 Chiswick High Road, W4 5YA

In Manufacturing...
Batch processing brought advantages :-
● Increased scale of production
● Reduced manufacturing cost
● Economies of scale (reusable parts)
However :-
● Machinery is complex & expensive
● Each product requires some bespoke parts

In Technology...
Batch processing has been around since the 1950s on mainframes
Hadoop (Map/Reduce) advantages :-
● Increased scale of processing
● Reduced processing cost **
● Economies of scale (reusable code)
However :-
● Complex & expensive **
● Most jobs require some bespoke code

Map/Reduce != FUN
Sure, it's "just Java" but...
● Requires a certain mindset
● Multi-stage algorithm complexity
● If you get stuck, R.T.F.S.
Alleviated to an extent by tools like :-
● Pig, Hive, Cascading, Crunch
Typically still requires bespoke code / algorithms

In Manufacturing...
Continuous production, described as: "a method used to manufacture, produce, or process materials without interruption"
Key features :-
● Materials are processed in flows & streams
● Can run continuously (except for maintenance)
● End-to-end latency can be from seconds to hours
Credit: Wikipedia

In Technology...
We have a problem... most Hadoop-related technologies are inherently batch!!
The trend towards real-time continuous computation requires :-
● New tools (Storm?)
● Better algorithms
So what's the solution?

Batch does have its place...
Map/Reduce is great for 'boil the ocean' jobs :-
● tasks that take hours or days
● typically non-interactive with users
● works well for pattern mining, clustering etc.
However, the 'perfect' answer is useless if it arrives so late it's irrelevant...

Real-time machine learning
Quite simply "data is never at rest"...
● processed in streams not batches
● best for 'supervised learning' models
● end-to-end latency can be in seconds
Key criteria :-
● model always has a 'best answer' available
● feedback used to train the model

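The two key criteria above (a 'best answer' is always available, and feedback trains the model) can be sketched with an online learner. This is a minimal illustration only, assuming a binary logistic regression updated by stochastic gradient descent; the class name, the learning rate, and the toy stream are hypothetical and not taken from the talk.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical binary classifier trained one example at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # The model can answer at any time, even before it has seen feedback.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, label):
        # One SGD step on the log loss, using a single piece of feedback.
        error = self.predict_proba(x) - label
        for i, xi in enumerate(x):
            self.weights[i] -= self.learning_rate * error * xi
        self.bias -= self.learning_rate * error


model = OnlineLogisticRegression(n_features=2)

# Toy stream (made up for illustration): feature 0 signals label 1,
# feature 1 signals label 0.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 200

for features, label in stream:
    prediction = model.predict_proba(features)  # best answer right now
    model.learn(features, label)                # feedback arrives, model updates
```

Each event is handled once and then discarded, so nothing has to wait for a batch job to finish before a prediction can be served.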
So what works well in real-time?
Classification :-
● Easiest to implement
Clustering :-
● Periodically batch recompute clusters
● Add new data points to the nearest centroid
● Rinse, repeat
Collaborative filtering :-

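The real-time half of the clustering approach on this slide (add each new data point to the nearest centroid) might look like the sketch below. It assumes Euclidean distance and a running-mean centroid update, and it treats each seed centroid as if it already represented one point; the function names and sample data are hypothetical. The periodic batch recompute would replace `centroids` and `counts` wholesale and is not shown.

```python
import math


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to the point (Euclidean)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def add_point(point, centroids, counts):
    """Assign a new point to its nearest centroid and nudge that centroid.

    The centroid moves towards the point by 1/count, which keeps it equal
    to the running mean of everything assigned to it so far.
    """
    idx = nearest_centroid(point, centroids)
    counts[idx] += 1
    step = 1.0 / counts[idx]
    centroids[idx] = [c + step * (p - c) for c, p in zip(centroids[idx], point)]
    return idx


# Centroids as they might come out of the last batch run (made-up values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]  # assumption: each seed centroid counts as one point

new_points = [[1.0, 1.0], [9.0, 11.0], [0.0, 2.0]]
assignments = [add_point(p, centroids, counts) for p in new_points]
```

Between batch runs the centroids drift with the incoming data; the batch recompute then corrects any accumulated error, which is the "rinse, repeat" step.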
Machine learning gap...
Academia are 'way out there' with new approaches and algorithms almost every day :-
● Many hard to implement in a parallel way
We need more focus on :-
● Inherently distributed algorithms
● Practical implementations
● Speed over marginal accuracy improvements

Mathematical navel gazing
We need practical solutions to real-world problems...
Recommendations Rant!?!?!?!?!
● Most recommenders are 2D matrices
● Humans are not very 2D
● Is there an N-dimensional solution?

Introducing TUMRA Labs
API access to some of our real-time models :-
● Probabilistic Demographics
● Language detection **
● Sentiment analysis **
● Metadata Generation (entity extraction and disambiguation) **
Free to sign up and easy to get started!
http://labs.tumra.com/