Slide 1

Slide 1 text

@Scalding history, patterns and future Thursday, January 23, 14

Slide 2

Slide 2 text

Oscar Boykin @posco Twitter

Slide 3

Slide 3 text

1) History

Slide 4

Slide 4 text

Avi Bryant hates Pig

Slide 5

Slide 5 text

Avi Bryant hates Pig and @squarecog

Slide 6

Slide 6 text


Slide 7

Slide 7 text


Slide 8

Slide 8 text

Java can be verbose

Slide 9

Slide 9 text


Slide 10

Slide 10 text


Slide 11

Slide 11 text

@argyris and @posco join Twitter in May, 2011.

Slide 12

Slide 12 text

@argyris and @posco join Twitter in May, 2011. Work with Avi on ads analytics and @scalding

Slide 13

Slide 13 text


Slide 14

Slide 14 text

Started as a DSL for Cascading

Slide 15

Slide 15 text

Word Count: Logic is in the constructor

Slide 16

Slide 16 text

Word Count: Functions can be called or defined inline

Slide 17

Slide 17 text

Word Count: Read and write data through Source objects

Slide 18

Slide 18 text

Word Count: Data is modeled as streams of named Tuples (of objects)
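
The Word Count code that these four slides annotate was an image and did not survive in this transcript. As a rough sketch only, a fields-API word count in the style of the Scalding README is shown here. The job name and the tokenize helper are illustrative assumptions and may differ from what the slide showed. It needs the Scalding library on the classpath.

```scala
import com.twitter.scalding._

// The job's logic lives in the class body, i.e. in the constructor.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))                  // read through a Source object
    .flatMap('line -> 'word) { line: String => tokenize(line) } // named tuple fields
    .groupBy('word) { _.size }             // count occurrences of each word
    .write(Tsv(args("output")))            // write through a Source object

  // Functions can be called (as here) or defined inline in the pipeline.
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
```

The symbols 'line and 'word are the field names of the tuples flowing through the pipe, which is what "streams of named Tuples" refers to.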

Slide 19

Slide 19 text

Added type-safe distributed collections (and Matrix) API.
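
For comparison, a word count written against the type-safe API might look roughly like this. It is a sketch in the style of the Scalding README, not code from the talk. The job name and tokenize helper are assumptions, and exact sink names can vary between Scalding versions.

```scala
import com.twitter.scalding._

class TypedWordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))  // a TypedPipe[String] of lines
    .flatMap { line => tokenize(line) }    // element types are checked at compile time
    .groupBy { word => word }              // key each word by itself
    .size                                  // number of values per key
    .write(TypedTsv[(String, Long)](args("output")))

  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}
```

Here the pipeline carries Scala types rather than named fields, so a mismatch between stages is a compile error instead of a runtime failure.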

Slide 20

Slide 20 text


Slide 21

Slide 21 text


Slide 22

Slide 22 text

2) Patterns

Slide 23

Slide 23 text

Click Rates

Slide 24

Slide 24 text


Slide 25

Slide 25 text

Common Items
• Time separated events require a join
• Aggregation phase (.group.sum) just adds up some (record of) values.
• This aggregation is associative, so we don’t need to look at history to produce today’s results.
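
To make the associativity point concrete, here is a minimal local sketch using plain Scala collections rather than Scalding. The Stats record, the ClickRates object and its method names are hypothetical and chosen only for illustration. sumByKey plays the role of .group.sum, and merge shows that per-day aggregates can be combined without revisiting the raw events.

```scala
// Hypothetical record of values to add up; not taken from the talk.
final case class Stats(impressions: Long, clicks: Long) {
  def +(that: Stats): Stats =
    Stats(impressions + that.impressions, clicks + that.clicks)
  def clickRate: Double =
    if (impressions == 0L) 0.0 else clicks.toDouble / impressions
}

object ClickRates {
  // Local stand-in for `.group.sum`: add up the records for each key.
  def sumByKey(events: Seq[(String, Stats)]): Map[String, Stats] =
    events.groupBy(_._1).map { case (key, kvs) =>
      key -> kvs.map(_._2).reduce(_ + _)
    }

  // Because + is associative, two aggregates can be merged directly.
  def merge(a: Map[String, Stats], b: Map[String, Stats]): Map[String, Stats] =
    (a.keySet ++ b.keySet).map { key =>
      key -> (a.getOrElse(key, Stats(0L, 0L)) + b.getOrElse(key, Stats(0L, 0L)))
    }.toMap
}
```

Aggregating each day separately and merging gives the same totals as aggregating all events at once, which is why today's job only needs today's events plus the previous totals.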

Slide 26

Slide 26 text

Folding (training/updating)

Slide 27

Slide 27 text


Slide 28

Slide 28 text

Common Items
• This aggregation is NOT associative, so we need to look at history to produce today’s results.
• Models are joined with events with a custom cogroup.
• The update logic lives outside of the job (in the model class?)
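
A minimal local sketch of the folding pattern, again in plain Scala rather than Scalding. The Model class, its moving-average update rule and the Folding object are assumptions made up for illustration. The update rule lives in the model class, and the job-like code only pairs each key's previous model with its new events and folds over them in order.

```scala
// Hypothetical model: a single weight nudged toward each observation.
final case class Model(weight: Double) {
  // The update logic lives here, outside the job.
  def update(x: Double, rate: Double = 0.5): Model =
    Model(weight + rate * (x - weight))
}

object Folding {
  // Fold new events into the previous model; event order matters.
  def train(previous: Model, events: Seq[Double]): Model =
    events.foldLeft(previous)((model, x) => model.update(x))

  // Local stand-in for the cogroup: match each key's model with its events.
  def updateAll(models: Map[String, Model],
                events: Seq[(String, Double)]): Map[String, Model] = {
    val byKey = events.groupBy(_._1)
    (models.keySet ++ byKey.keySet).map { key =>
      val previous = models.getOrElse(key, Model(0.0)) // assumed default model
      val xs = byKey.getOrElse(key, Seq.empty[(String, Double)]).map(_._2)
      key -> train(previous, xs)
    }.toMap
  }
}
```

Unlike the sum in the click-rate pattern, the result depends on the starting model and on event order, so yesterday's model has to be an input to today's run.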

Slide 29

Slide 29 text

3) Clustering

Slide 30

Slide 30 text


Slide 31

Slide 31 text

• Algebird (github.com/twitter/algebird) includes many approximation algorithms.
• MinHash gives approximate set similarity, useful for LSH.
• HyperLogLog / CountMinSketch for scalable approximate set size, event counts.
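
To illustrate the idea behind MinHash, here is a from-scratch sketch in plain Scala. It is not Algebird's implementation or API. The MinHashSketch object and its method names are hypothetical, and it assumes non-empty input sets. Each signature slot keeps the minimum hash of the set under a different seed, and the fraction of matching slots estimates the Jaccard similarity of the two sets.

```scala
import scala.util.hashing.MurmurHash3

object MinHashSketch {
  // One minimum per seeded hash function; assumes `items` is non-empty.
  def signature(items: Set[String], numHashes: Int): Vector[Int] =
    Vector.tabulate(numHashes) { seed =>
      items.iterator.map(MurmurHash3.stringHash(_, seed)).min
    }

  // Signature of a union is the slot-wise minimum, so signatures
  // can be combined with an associative merge.
  def merge(a: Vector[Int], b: Vector[Int]): Vector[Int] =
    a.zip(b).map { case (x, y) => math.min(x, y) }

  // Fraction of matching slots approximates Jaccard similarity.
  def similarity(a: Vector[Int], b: Vector[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.size
}
```

Because merge is associative, signatures fit the same sum-style aggregation as the click-rate pattern: each partition builds its own signature and the results are merged.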

Slide 32

Slide 32 text

3) Future[Scalding]

Slide 33

Slide 33 text

• Release 0.9.0 (~ 2 weeks):
  • REPL contributed (thanks Wibi!)
  • Typed-API improvements (joining, implementation, combinators)
  • optimizing Matrix API
  • improved function serialization
  • some API warts removed

Slide 34

Slide 34 text

• Explore Spark support:
  • Preferred option: Cascading backend for Spark.
  • Does this speed up ETL (extract, transform, load) jobs significantly?
  • Can Spark OOM issues be handled for large multi-tenant use-cases?

Slide 35

Slide 35 text

• Easier integration into larger tools/libs:
  • Summingbird uses Scalding as a library: we learned a lot about what is easy and what is not. Some patterns can be added to Scalding.
  • Would love to make it easier to build and distribute ML/Linear Algebra libraries. How to compose?

Slide 36

Slide 36 text

Thank you to @WibiData @MacysDotCom