
History, patterns and future of Scalding

Given at the WibiData Kiji meetup, 2014-01-22, in San Francisco, CA

P. Oscar Boykin

January 22, 2014

Transcript

  1. @argyris and @posco join Twitter in May, 2011. Work with Avi on ads analytics and @scalding.
     Thursday, January 23, 14
  2. Data is modeled as streams of named Tuples (of objects). Word Count.
  3. Common Items
     • Time-separated events require a join.
     • The aggregation phase (.group.sum) just adds up some (record of) values.
     • This aggregation is associative, so we don’t need to look at history to produce today’s results.
  4. Common Items
     • This aggregation is NOT associative, so we need to look at history to produce today’s results.
     • Models are joined with events with a custom cogroup.
     • The update logic lives outside of the job (in the model class?)
  5. • Algebird (github.com/twitter/algebird) includes many approximation algorithms.
     • MinHash gives approximate set similarity, useful for LSH.
     • HyperLogLog / CountMinSketch for scalable approximate set size, event counts.
  6. Release 0.9.0 (~ 2 weeks):
     • REPL contributed (thanks Wibi!)
     • Typed-API improvements (joining, implementation, combinators)
     • optimizing Matrix API
     • improved function serialization
     • some API warts removed
  7. Explore Spark support:
     • Preferred option: Cascading backend for Spark.
     • Does this speed up ETL (extract, transform, load) jobs significantly?
     • Can Spark OOM issues be handled for large multi-tenant use-cases?
  8. Easier integration into larger tools/libs:
     • Summingbird uses Scalding as a library: we learned a lot about what is easy and what is not. Some patterns can be added to Scalding.
     • Would love to make it easier to build and distribute ML/Linear Algebra libraries. How to compose?
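
The associativity point on slides 3 and 4 can be illustrated with a minimal sketch in plain Python (not Scalding code, and not taken from the talk). The event data, field layout, and the helper names `aggregate`, `merge`, and `daily_aggregates` are all assumptions made for illustration. The idea is that each day is aggregated on its own and the partial results are then merged; because the merge is associative, the result matches a single pass over the full history.

```python
from functools import reduce

# Hypothetical events as (day, key, value) tuples; illustrative data only.
EVENTS = [
    ("2014-01-20", "ad_1", 3),
    ("2014-01-20", "ad_2", 1),
    ("2014-01-21", "ad_1", 2),
    ("2014-01-22", "ad_2", 5),
    ("2014-01-22", "ad_1", 1),
]


def aggregate(events):
    """Sum values per key: a stand-in for a group-then-sum step."""
    totals = {}
    for _day, key, value in events:
        totals[key] = totals.get(key, 0) + value
    return totals


def merge(left, right):
    """Associative merge of two partial aggregates (per-key addition)."""
    merged = dict(left)
    for key, value in right.items():
        merged[key] = merged.get(key, 0) + value
    return merged


def daily_aggregates(events):
    """Aggregate each day independently, without reading other days."""
    by_day = {}
    for event in events:
        by_day.setdefault(event[0], []).append(event)
    return {day: aggregate(by_day[day]) for day in sorted(by_day)}


daily = daily_aggregates(EVENTS)

# Today's running total is the stored total merged with today's aggregate.
running_total = reduce(merge, daily.values(), {})

# One pass over the whole history, for comparison.
full_pass = aggregate(EVENTS)

print(running_total == full_pass)
```

When the update is not associative, as on slide 4, this shortcut is not available: the previous state (the model) has to be joined with the new events before the update can be applied.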