History, patterns and future of Scalding

Given at the WibiData Kiji meetup, 2014-01-22 in San Francisco, CA

P. Oscar Boykin

January 22, 2014

Transcript

  1. @Scalding history, patterns and future Thursday, January 23, 14

  2. Oscar Boykin @posco Twitter

  3. 1) History

  4. Avi Bryant hates Pig

  5. Avi Bryant hates Pig and @squarecog

  8. Java can be verbose

  11. @argyris and @posco join Twitter in May, 2011.

  12. @argyris and @posco join Twitter in May, 2011. Work with Avi on ads analytics and @scalding

  14. Started as a DSL for Cascading

  15. Word Count: Logic is in the constructor

  16. Word Count: Functions can be called or defined inline

  17. Word Count: Read and write data through Source objects

  18. Word Count: Data is modeled as streams of named Tuples (of objects)
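
The Word Count slides describe a job along these lines. This is a sketch in the fields-based API, not a copy of the code shown on the slides: it assumes the Scalding library is on the classpath, and the job name, the `tokenize` helper and the `input`/`output` argument names are chosen here for illustration.

```scala
import com.twitter.scalding._

// The whole pipeline is built when the Job subclass is constructed.
class WordCountJob(args: Args) extends Job(args) {

  // A named helper; the same logic could be written inline in flatMap.
  def tokenize(line: String): Array[String] =
    line.toLowerCase.split("""\s+""").filter(_.nonEmpty)

  TextLine(args("input"))                        // Source: each line arrives in the field 'line
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }                   // adds a 'size field holding the count per word
    .write(Tsv(args("output")))                  // Source used as a sink: one (word, size) row per word
}
```

Each step reads and produces named fields (`'line`, `'word`, `'size`), which is the "streams of named Tuples" model from slide 18.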
  19. Added type-safe distributed collections (and Matrix) API.

  22. 2) Patterns

  23. Click Rates

  25. Common Items
      • Time-separated events require a join.
      • The aggregation phase (.group.sum) just adds up some (record of) values.
      • This aggregation is associative, so we don’t need to look at history to produce today’s results.
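
A minimal sketch of why associativity matters here, in plain Scala collections rather than Scalding. The `Clicks` record, its field names and the `ClickRates` helpers are hypothetical; `sumByKey` only stands in for the role `.group.sum` plays in a job.

```scala
// Hypothetical record of values; the field names are illustrative only.
case class Clicks(impressions: Long, clicks: Long) {
  // Field-wise addition is associative: (a + b) + c == a + (b + c).
  def +(that: Clicks): Clicks =
    Clicks(impressions + that.impressions, clicks + that.clicks)

  def rate: Double =
    if (impressions == 0L) 0.0 else clicks.toDouble / impressions
}

object ClickRates {
  val zero: Clicks = Clicks(0L, 0L)

  // Sum the records for each key.
  def sumByKey(events: Seq[(String, Clicks)]): Map[String, Clicks] =
    events.groupBy(_._1).map { case (k, vs) =>
      k -> vs.map(_._2).foldLeft(zero)(_ + _)
    }

  // Merge two already-aggregated maps without revisiting the raw events.
  def merge(a: Map[String, Clicks], b: Map[String, Clicks]): Map[String, Clicks] =
    (a.keySet ++ b.keySet).map { k =>
      k -> (a.getOrElse(k, zero) + b.getOrElse(k, zero))
    }.toMap
}
```

Summing each day separately and merging the per-day totals gives the same result as summing all raw events at once, so today's totals can be computed from today's events alone.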
  26. Folding (training/updating)

  28. Common Items
      • This aggregation is NOT associative, so we need to look at history to produce today’s results.
      • Models are joined with events with a custom cogroup.
      • The update logic lives outside of the job (in the model class?)
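
A rough sketch of the folding pattern in plain Scala, assuming a toy model (an exponentially weighted mean). `Model`, `update`, the learning rate and `Folding.step` are all hypothetical names; `step` only imitates what a cogroup of models with events would do inside a job.

```scala
// Hypothetical model: an exponentially weighted mean.
case class Model(mean: Double) {
  // The update logic lives on the model class, not in the job.
  def update(x: Double, rate: Double = 0.5): Model =
    Model(mean + rate * (x - mean))
}

object Folding {
  // Stand-in for the cogroup: for each key, the previous model is folded
  // over that key's new events, in order.
  def step(models: Map[String, Model],
           events: Map[String, Seq[Double]]): Map[String, Model] =
    (models.keySet ++ events.keySet).map { k =>
      val previous = models.getOrElse(k, Model(0.0))
      k -> events.getOrElse(k, Seq.empty).foldLeft(previous)((m, x) => m.update(x))
    }.toMap
}
```

Because `update` depends on the current model and on event order, today's output cannot be computed from today's events alone: yesterday's model has to be read back in.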
  29. 3) Clustering

  31. • Algebird (github.com/twitter/algebird) includes many approximation algorithms.
      • MinHash gives approximate set similarity, useful for LSH.
      • HyperLogLog / CountMinSketch for scalable approximate set size, event counts.
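
To make the MinHash idea concrete, here is a hand-rolled sketch using only the Scala standard library. It is not Algebird's implementation or API; `MinHashSketch`, `signature` and `similarity` are names made up for this example, and using seeded MurmurHash3 as the hash family is a simplifying assumption.

```scala
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  // One signature entry per seed: the minimum seeded hash over the set's elements.
  def signature(items: Set[String], numHashes: Int): Vector[Int] = {
    require(items.nonEmpty, "signature of an empty set is undefined")
    Vector.tabulate(numHashes) { seed =>
      items.iterator.map(s => MurmurHash3.stringHash(s, seed)).min
    }
  }

  // The fraction of positions where two signatures agree estimates
  // the Jaccard similarity of the underlying sets.
  def similarity(a: Vector[Int], b: Vector[Int]): Double = {
    require(a.nonEmpty && a.length == b.length, "signatures must have equal, non-zero length")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }
}
```

Signatures are small and fixed-size, so they can be compared, or bucketed for LSH, without shipping the full sets around. More hash functions give a tighter estimate at the cost of a longer signature.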
  32. 3) Future[Scalding]

  33. • Release 0.9.0 (~2 weeks):
        • REPL contributed (thanks Wibi!)
        • Typed-API improvements (joining, implementation, combinators)
        • optimizing Matrix API
        • improved function serialization
        • some API warts removed
  34. • Explore Spark support:
        • Preferred option: Cascading backend for Spark.
        • Does this speed up ETL (extract, transform, load) jobs significantly?
        • Can Spark OOM issues be handled for large multi-tenant use cases?
  35. • Easier integration into larger tools/libs:
        • Summingbird uses Scalding as a library: learned a lot about what is easy and what is not. Some patterns can be added to Scalding.
        • Would love to make it easier to build and distribute ML/Linear Algebra libraries. How to compose?
  36. Thank you to @WibiData @MacysDotCom