Slide 1

Slide 1 text

Getting Started with Scalding, Storm & Summingbird
Yoshimasa Niwa
Twitter, Inc.
Sep 6, 2014 in Tokyo

Slide 2

Slide 2 text

Scala Matsuri 2014: this is the Scala Matsuri schedule.

Slide 3

Slide 3 text

@niw Yoshimasa Niwa

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Located in beautiful San Francisco.

Slide 6

Slide 6 text

Actually, I’m not a Scala professional.

Slide 7

Slide 7 text

Actually, I’m not even a data-processing professional.

Slide 8

Slide 8 text

[[iOSApp alloc] init]; Lately I’ve been building iOS apps.

Slide 9

Slide 9 text

Swift ⇔ Scala: a Swift talk at Scala Matsuri

Slide 10

Slide 10 text

Swift ⇔ Scala: a Swift talk at Scala Matsuri
Swift:
let point = (3, 2)
switch point { case let (x, y): println(x) }
Scala:
val point = (3, 2)
point match { case (x, y) => println(x) }

Slide 11

Slide 11 text

I’m a user of the data infrastructure. I’ll present this from the viewpoint of one of its users.

Slide 12

Slide 12 text

Twitter and Scala: Twitter is heavily invested in Scala.

Slide 13

Slide 13 text

In 2009… Going back five years…

Slide 14

Slide 14 text

Twitter was (probably) the biggest Rails app in the world.

Slide 15

Slide 15 text

What we’ve got (as a result).
Heavily patched Ruby and Rails.
Development with grep, no IDE support: grep was all we could rely on, and we couldn’t expect much help from an IDE.

Slide 16

Slide 16 text

And more…

Slide 17

Slide 17 text

No content

Slide 18

Slide 18 text

Nov 20, 2009

Slide 19

Slide 19 text

We can’t survive with Rails.

Slide 20

Slide 20 text

Ruby → Scala: rewrite the Ruby functionality little by little and migrate to Scala.

Slide 21

Slide 21 text

So long, and thanks for all the whales. Goodbye, whale.

Slide 22

Slide 22 text

Aug 2, 2013: 143,199 TPS

Slide 23

Slide 23 text

Ummm… Yeah… TPS? What’s TPS?

Slide 24

Slide 24 text

Tweets Per Second

Slide 25

Slide 25 text

What about the analytics infrastructure? How is data processing handled?

Slide 26

Slide 26 text

In 2009… Going back five years…

Slide 27

Slide 27 text

Hadoop + Pig

Slide 28

Slide 28 text

What we’ve got (as a result).
Giant UDFs and heavily patched Pig.
Less development consistency: only the Pig jobs needed special operational handling…
Lots of copy and paste, no IDE support: code was hard to reuse, and there was no IDE support either.

Slide 29

Slide 29 text

Pig → Scalding: just as we migrated Ruby to Scala, we migrated from Pig to Scalding.

Slide 30

Slide 30 text

Scalding: let’s try Scalding

Slide 31

Slide 31 text

Scalding is…

Slide 32

Slide 32 text

Scalding is a library for building Hadoop jobs.
Based on Cascading, a Java library.
Abstracts Hadoop’s low-level APIs: they are hidden from you.
Tightly integrated with Scala: it is a fully Scala environment.
A REPL was introduced recently.

Slide 33

Slide 33 text

Getting Started: let’s get our hands on Scalding

Slide 34

Slide 34 text

No content

Slide 35

Slide 35 text

TPS Report: look up the TPS at the moment the “Balse” (バルス) burst arrived.
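
As a rough illustration of what such a TPS report computes, here is a minimal sketch in plain Scala. It uses standard collections as a stand-in for a Scalding pipe, so it shows only the map / group / count shape of the job, not Scalding's API. The `Tweet` type, its fields, and the function names are assumptions made for this example, not taken from the talk.

```scala
// Plain Scala collections stand in for a Scalding pipe here.
// Tweet, timestampMs and the function names are hypothetical.
object TpsReport {
  final case class Tweet(timestampMs: Long, text: String)

  // Bucket tweets by whole second, then count each bucket.
  def tweetsPerSecond(tweets: Seq[Tweet]): Map[Long, Long] =
    tweets
      .groupBy(_.timestampMs / 1000)
      .map { case (second, group) => second -> group.size.toLong }

  // The busiest second and its tweet count, if there is any data.
  def peak(tps: Map[Long, Long]): Option[(Long, Long)] =
    if (tps.isEmpty) None else Some(tps.maxBy(_._2))
}
```

A batch job of this shape can only answer the question after the fact, once the log for that period has been collected.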

Slide 36

Slide 36 text

No content

Slide 37

Slide 37 text

Scalding APIs
Fields-based API and Type-safe API: there are two kinds of API.
The Type-safe API is type-safe: type mistakes are caught at compile time.
The APIs use a lot of Scala magic: with the storm of implicits, IDEs seem to have a hard time.

Slide 38

Slide 38 text

No one knows the future. In practice, you never know when something like “Balse” will hit.

Slide 39

Slide 39 text

Storm: let’s try Storm

Slide 40

Slide 40 text

Storm is…

Slide 41

Slide 41 text

Storm is a platform for realtime data processing.
Similar to Hadoop, but for realtime: Storm is an environment much like Hadoop that provides realtime processing.
No data loss: Storm guarantees that data gets processed.
Fault-tolerant: it should hold up well against failures.

Slide 42

Slide 42 text

Getting Started: let’s get our hands on Storm

Slide 43

Slide 43 text

No content

Slide 44

Slide 44 text

Realtime TPS report: for example, check the TPS when “Balse” looks like it is about to arrive.

Slide 45

Slide 45 text

No content

Slide 46

Slide 46 text

Storm terms: Nimbus, Supervisor, Worker

Slide 47

Slide 47 text

Storm terms: Spout, Bolt, Topology
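
In Storm, a spout is a source of tuples, a bolt consumes a stream and emits a new one, and a topology is the graph that wires spouts and bolts together. The sketch below models those three ideas with plain Scala iterators to compute a running per-second tweet count. It does not use Storm's API: the `Spout` and `Bolt` type aliases and all names here are hypothetical stand-ins chosen for illustration.

```scala
// Not Storm's API: iterators and functions stand in for its concepts.
object MiniTopology {
  // A spout is a source of tuples; a bolt maps one stream to another.
  type Spout[A] = Iterator[A]
  type Bolt[A, B] = Iterator[A] => Iterator[B]

  // Bolt 1: turn a tweet timestamp (ms) into its one-second bucket.
  val toSecond: Bolt[Long, Long] = stream => stream.map(_ / 1000)

  // Bolt 2: keep a running count per second and emit every update.
  val countPerSecond: Bolt[Long, (Long, Long)] = seconds => {
    val counts = scala.collection.mutable.Map.empty[Long, Long]
    seconds.map { s =>
      val n = counts.getOrElse(s, 0L) + 1
      counts(s) = n
      (s, n)
    }
  }

  // The "topology" wires the spout through both bolts.
  def run(spout: Spout[Long]): Iterator[(Long, Long)] =
    countPerSecond(toSecond(spout))
}
```

Unlike a batch job, each incoming tuple produces an updated count immediately, which is what makes a realtime TPS report possible.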

Slide 48

Slide 48 text

Hmm, this looks familiar somehow.

Slide 49

Slide 49 text

Summingbird: let’s try Summingbird

Slide 50

Slide 50 text

Summingbird is…

Slide 51

Slide 51 text

Summingbird is a library that abstracts MapReduce processing.
Summingbird abstracts the platform.
Summingbird works with Scalding and Storm.
Summingbird can aggregate results from both batch and realtime jobs.

Slide 52

Slide 52 text

No content

Slide 53

Slide 53 text

Summingbird Abstraction
Producer: filter, map, flatMap, …
Platform: Scalding, Storm, …
Plan: Pipe, Topology, …
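
The Producer / Platform / Plan split can be sketched in plain Scala: a producer only describes the per-record logic, and each platform turns that same description into its own kind of plan. This is a toy model under stated assumptions, not Summingbird's actual API. `MiniSummingbird`, `Batch`, `Streaming`, and every signature below are hypothetical.

```scala
// A toy model of the abstraction; none of this is Summingbird's API.
object MiniSummingbird {
  // A platform-neutral description of per-record logic.
  final case class Producer[A, B](step: A => List[B]) {
    def map[C](f: B => C): Producer[A, C] =
      Producer((a: A) => step(a).map(f))
    def filter(p: B => Boolean): Producer[A, B] =
      Producer((a: A) => step(a).filter(p))
    def flatMap[C](f: B => List[C]): Producer[A, C] =
      Producer((a: A) => step(a).flatMap(f))
  }
  def source[A]: Producer[A, A] = Producer((a: A) => List(a))

  // Each platform compiles a Producer into its own Plan type.
  trait Platform {
    type Plan[A, B]
    def plan[A, B](p: Producer[A, B]): Plan[A, B]
  }

  // Batch stand-in: the plan runs over a complete list.
  object Batch extends Platform {
    type Plan[A, B] = List[A] => List[B]
    def plan[A, B](p: Producer[A, B]): Plan[A, B] =
      (in: List[A]) => in.flatMap(p.step)
  }

  // Streaming stand-in: the plan runs lazily over an iterator.
  object Streaming extends Platform {
    type Plan[A, B] = Iterator[A] => Iterator[B]
    def plan[A, B](p: Producer[A, B]): Plan[A, B] =
      (in: Iterator[A]) => in.flatMap(p.step)
  }
}
```

With this model, `MiniSummingbird.source[Long].filter(_ >= 0).map(_ / 1000)` is written once and can be handed to either `Batch.plan` or `Streaming.plan`.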

Slide 54

Slide 54 text

Merge results from Scalding + Storm: the same data-processing logic can be reused, and results can be fetched from both.
K→V

Slide 55

Slide 55 text

Merge results from Scalding + Storm
Timeline: Batch 1, Batch 2, Batch 3, Batch 4, now
Batch jobs: K→(V1, Batch 2)
Realtime jobs: (K, Batch 3)→V2, (K, Batch 4)→V3

Slide 56

Slide 56 text

Merge results from Scalding + Storm
K→(V1, Batch 2)
(K, Batch 3)→V2
(K, Batch 4)→V3
V1+V2+V3 = Vnow!
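
The merge pictured on these slides can be sketched as a lookup over two stores: the batch store holds, per key, a value aggregated through some batch, and the realtime store holds per-batch partial values for the batches after it. The sketch below follows that picture using `Long` sums. The store shapes, the `lookup` signature, and the choice of `Int` batch ids are assumptions for illustration, not Summingbird's real client API.

```scala
// A sketch of the slide's picture; names and types are hypothetical.
object MergedLookup {
  type BatchId = Int

  // batchStore:    key -> (value aggregated through a batch, that batch id)
  // realtimeStore: (key, batch id) -> partial value for that batch only
  def lookup(
      key: String,
      batchStore: Map[String, (Long, BatchId)],
      realtimeStore: Map[(String, BatchId), Long],
      currentBatch: BatchId
  ): Long = {
    // With no batch result yet, start from zero before the first batch.
    val (batchValue, lastBatch) = batchStore.getOrElse(key, (0L, 0))
    // Add only the realtime partials the batch result does not cover.
    val realtimeValues = ((lastBatch + 1) to currentBatch)
      .map(b => realtimeStore.getOrElse((key, b), 0L))
    batchValue + realtimeValues.sum
  }
}
```

Realtime entries for batches the batch store already covers are ignored, so a value is never counted twice.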

Slide 57

Slide 57 text

Summary

Slide 58

Slide 58 text

Summary
Use Scalding to implement daily batch jobs in Scala.
Use Storm to implement streaming (realtime) processing jobs in Scala, too.
Use Summingbird to bring the entire data pipeline together, written in Scala.

Slide 59

Slide 59 text

Enjoy.Scala

Slide 60

Slide 60 text

#thankyou Questions? Tweet to @niw :) That’s all.