Getting Started with
Scalding, Storm & Summingbird
Yoshimasa Niwa
Twitter, Inc.
Sep 6, 2014 in Tokyo
Let's try out Scalding, Storm, and Summingbird
Slide 2
Slide 2 text
Scala Matsuri 2014
This is the Scala Matsuri schedule.
Slide 3
Slide 3 text
@niw
Yoshimasa Niwa
Slide 4
Slide 4 text
No content
Slide 5
Slide 5 text
Beautiful San Francisco
Located in San Francisco.
Slide 6
Slide 6 text
Actually, I'm not a Scala professional.
Slide 7
Slide 7 text
I'm not even a data processing professional.
Slide 8
Slide 8 text
[[iOSApp alloc] init];
Lately I've been building iOS apps.
Slide 9
Slide 9 text
Swift 㲗 Scala
Swift at Scala Matsuri
Slide 10
Slide 10 text
Swift 㲗 Scala
// Swift (2014 syntax; println was renamed print in Swift 2)
let point = (3, 2)
switch point {
case let (x, y):
    println(x)
}
// Scala
val point = (3, 2)
point match {
  case (x, y) =>
    println(x)
}
Slide 11
Slide 11 text
I'm a user of the data infrastructure, and I'll present from that viewpoint.
Slide 12
Slide 12 text
Twitter — Scala
Twitter is heavily invested in Scala.
Slide 13
Slide 13 text
In 2009…
Slide 14
Slide 14 text
Twitter was (probably) the biggest Rails app in the world.
Slide 15
Slide 15 text
What we got as a result.
A giant, heavily patched Ruby and Rails.
Development with grep alone, with little help from an IDE.
Slide 16
Slide 16 text
And more…
Slide 17
Slide 17 text
No content
Slide 18
Slide 18 text
Nov 20, 2009
Slide 19
Slide 19 text
We can't survive with Rails.
Slide 20
Slide 20 text
Ruby → Scala
Rewrite the Ruby functionality bit by bit and migrate to Scala.
Slide 21
Slide 21 text
So long, and thanks for all the whales.
Slide 22
Slide 22 text
Aug 2, 2013
143,199 TPS
Slide 23
Slide 23 text
Ummm… Yeah… TPS?
Slide 24
Slide 24 text
Tweets per second
Slide 25
Slide 25 text
How about the analytics infrastructure for data processing?
Slide 26
Slide 26 text
In 2009…
Slide 27
Slide 27 text
Hadoop + Pig
Slide 28
Slide 28 text
What we got as a result.
Giant UDFs and a hugely patched Pig.
Less development consistency: only the Pig jobs needed special operational handling.
Lots of copy and paste: code was hard to reuse, and there was no IDE support.
Scalding is a library for building Hadoop jobs.
Based on Cascading, a Java library.
Abstracts Hadoop's low-level APIs.
Tightly integrated with Scala: it is a fully Scala environment.
A REPL was recently introduced.
Slide 33
Slide 33 text
Getting Started
Let's try out Scalding.
Slide 34
Slide 34 text
No content
Slide 35
Slide 35 text
TPS Report
Let's look up the TPS at the moment "Balse" hit.
Slide 36
Slide 36 text
No content
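The TPS report above boils down to counting tweets per one-second bucket and finding the busiest bucket. The following is a minimal plain-Scala sketch of that computation on an in-memory collection, not the Scalding job from the talk; the `Tweet` record, its field names, and the `TpsReport` object are assumptions made for illustration.

```scala
// Hypothetical record type: a tweet id plus its creation time in epoch milliseconds.
final case class Tweet(id: Long, createdAtMillis: Long)

object TpsReport {
  // Bucket tweets by epoch second and count each bucket.
  def tweetsPerSecond(tweets: Seq[Tweet]): Map[Long, Int] =
    tweets.groupBy(_.createdAtMillis / 1000L).map { case (second, ts) => second -> ts.size }

  // The busiest second with its count, or None when there are no tweets.
  def peak(tweets: Seq[Tweet]): Option[(Long, Int)] = {
    val counts = tweetsPerSecond(tweets)
    if (counts.isEmpty) None else Some(counts.maxBy(_._2))
  }
}
```

A batch framework would express the same group-and-count shape over files on Hadoop rather than over a `Seq` in memory.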
Slide 37
Slide 37 text
Scalding APIs
Fields-based API and Type-safe API: there are two kinds of API.
The Type-safe API is type-safe: type mistakes are caught at compile time.
The API uses a lot of Scala magic: the storm of implicits seems hard on IDEs.
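To make the contrast between the two API styles concrete, here is a plain-Scala sketch of name-based versus typed field access. It does not use Scalding itself; `ApiStyles`, `Row`, and `Tweet` are hypothetical names chosen for illustration.

```scala
object ApiStyles {
  // Fields style: a row is a bag of named values.
  // A misspelled field name or a wrong cast is only discovered at runtime.
  type Row = Map[String, Any]
  def secondOfRow(row: Row): Long =
    row("createdAtMillis").asInstanceOf[Long] / 1000L

  // Typed style: the record shape is known to the compiler,
  // so a misspelled field or a wrong type fails to compile.
  final case class Tweet(id: Long, createdAtMillis: Long)
  def secondOfTweet(tweet: Tweet): Long =
    tweet.createdAtMillis / 1000L
}
```

With the typed version, writing `tweet.createdAt` instead of `tweet.createdAtMillis` is rejected by the compiler, while the same slip in the fields version surfaces only when the job runs.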
Slide 38
Slide 38 text
No one knows the future.
In practice, you never know when something like "Balse" will come.
Slide 39
Slide 39 text
Storm
Let's try Storm.
Slide 40
Slide 40 text
What is Storm?
Slide 41
Slide 41 text
Storm is a platform for realtime data processing.
Storm is similar to Hadoop, but for realtime processing.
No data loss: Storm guarantees that data gets processed.
Fault-tolerant.
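The "no data loss" guarantee above rests on acknowledging each message and replaying the ones that were not acknowledged. Below is a much simplified plain-Scala sketch of that at-least-once idea. It is not Storm's API; `AtLeastOnce`, `run`, and the `maxAttempts` cap are assumptions made for illustration.

```scala
object AtLeastOnce {
  // Run `process` over every message. `process` returns true to acknowledge.
  // Unacknowledged messages are replayed, up to `maxAttempts` passes.
  // Returns the messages that were never acknowledged.
  def run[A](messages: Seq[A], maxAttempts: Int)(process: A => Boolean): Seq[A] = {
    var pending = messages
    var attempt = 0
    while (pending.nonEmpty && attempt < maxAttempts) {
      pending = pending.filterNot(process) // keep only the failures for replay
      attempt += 1
    }
    pending
  }
}
```

Because a failed message is processed again, the processing step may see the same message more than once, so it should tolerate duplicates.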
Summingbird is a library that abstracts MapReduce processing.
Summingbird abstracts the platform.
Summingbird works with Scalding and Storm.
Summingbird can aggregate results from both batch and realtime jobs.
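Combining batch and realtime results works when the aggregation is an associative "plus", so partial results can be added together in any grouping. The sketch below shows that idea for simple counts in plain Scala. It is not Summingbird's API; `MergeCounts`, `plus`, and `lookup` are hypothetical names, and it assumes the two views cover disjoint time ranges so nothing is counted twice.

```scala
object MergeCounts {
  // Add two count tables key by key. The operation is associative,
  // so partial results can be combined in any grouping.
  def plus(a: Map[String, Long], b: Map[String, Long]): Map[String, Long] =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + value)
    }

  // Answer a query from the batch view (older data) plus the realtime view (recent data).
  def lookup(batch: Map[String, Long], realtime: Map[String, Long], key: String): Long =
    plus(batch, realtime).getOrElse(key, 0L)
}
```

For example, a batch count of 10 and a realtime count of 2 for the same key combine to 12, and a key present in neither view yields 0.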
Summary
Scalding to implement daily batch jobs in Scala.
Storm to implement realtime stream-processing jobs in Scala, too.
Summingbird to aggregate entire data pipelines, written in Scala.
Slide 59
Slide 59 text
Enjoy.Scala
Slide 60
Slide 60 text
#thankyou
Questions? Tweet to @niw :)
The end.