of production
• Reduced manufacturing cost
• Economies of scale (reusable parts)

However :-
• Machinery is complex & expensive
• Each product requires some bespoke parts
certain mindset
• Multi-stage algorithm complexity
• If you get stuck, R.T.F.S.

Alleviated to an extent by tools like :-
• Pig, Hive, Cascading, Crunch

Typically requires bespoke code / algorithms
or process materials without interruption"

Key features :-
• Materials are processed in flows & streams
• Can run continuously (except for maintenance)
• End-to-end latency can range from seconds to hours

Credit: Wikipedia
the ocean' jobs;
• tasks that take hours or days
• typically non-interactive with users
• works well for pattern mining, clustering etc.

However, the 'perfect' answer is useless if it arrives so late it's irrelevant...
• processed in streams not batches
• best for 'supervised learning' models
• end-to-end latency can be in seconds

Key criteria :-
• model always has a 'best answer' available
• feedback used to train the model
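
The two key criteria above can be illustrated with a minimal sketch of an online model. This is not the implementation from the talk: the class name `OnlineLogisticModel`, its methods, and the learning rate are hypothetical, and logistic regression trained by stochastic gradient descent is simply one convenient choice of supervised model that can answer at any moment and learn from each piece of feedback as it arrives.

```python
import math


class OnlineLogisticModel:
    """Hypothetical online classifier: always answerable, trained one event at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        # A 'best answer' is available at any time, even before any feedback.
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-z))

    def learn(self, x, label):
        # Feedback (the true label, 0 or 1) nudges the model: one SGD step.
        error = self.predict_proba(x) - label
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Usage: events arrive as a stream; predict first, then learn from feedback.
model = OnlineLogisticModel(n_features=1)
stream = [([1.0], 1), ([-1.0], 0)] * 200  # made-up toy data
for features, label in stream:
    model.predict_proba(features)  # answer available immediately
    model.learn(features, label)   # feedback trains the model
```

Each event costs a constant amount of work, which is what keeps end-to-end latency in the range of seconds rather than hours.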
to implement

Clustering :-
• Periodically batch recompute clusters
• Add new data points to the nearest centroid
• Rinse, repeat

Collaborative filtering :-
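
The clustering recipe on this slide (batch recompute, nearest-centroid assignment, repeat) can be sketched roughly as follows. This is an illustration under assumptions, not the talk's code: plain k-means stands in for the batch step, and the names `batch_kmeans`, `nearest_centroid`, `StreamingClusters`, and the `recompute_every` parameter are all hypothetical.

```python
import math
import random


def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def batch_kmeans(points, k, iterations=10, seed=0):
    """Batch step: plain k-means over all points seen so far."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest_centroid(p, centroids)].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


class StreamingClusters:
    """Assign new points to the nearest centroid; recompute clusters periodically."""

    def __init__(self, k, recompute_every=100):
        self.k = k
        self.recompute_every = recompute_every
        self.points = []
        self.centroids = []

    def add(self, point):
        self.points.append(point)
        enough = len(self.points) >= self.k
        due = len(self.points) % self.recompute_every == 0
        if enough and (not self.centroids or due):
            self.centroids = batch_kmeans(self.points, self.k)  # periodic batch recompute
        if not self.centroids:
            return None  # too few points to form k clusters yet
        return nearest_centroid(point, self.centroids)  # cheap per-event step


# Usage with made-up 2-D points from two well-separated groups.
clusters = StreamingClusters(k=2, recompute_every=4)
for p in [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10), (1, 1), (11, 11)]:
    clusters.add(p)
label_a = clusters.add((0.5, 0.5))
label_b = clusters.add((10.5, 10.5))
```

Between recomputes each new point costs only k distance calculations, so an answer is always available; the expensive batch pass can run as rarely as the data drift allows.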
approaches and algorithms almost every day :-
• Many hard to implement in a parallel way

We need more focus on :-
• Inherently distributed algorithms
• Practical implementations
• Speed over marginal accuracy improvements
models :-
• Probabilistic Demographics
• Language detection **
• Sentiment analysis **
• Metadata Generation (entity extraction and disambiguation) **

Free to sign up and easy to get started!
http://labs.tumra.com/