Getting Started with Scalding, Storm and Summingbird

Yoshimasa Niwa
September 06, 2014

Getting started with data processing in Scala using Scalding, Storm and Summingbird

Sep 6, 2014 at Scala Matsuri 2014 in Tokyo, Japan.

Transcript

  1. Getting Started with Scalding, Storm & Summingbird
     (Let's try out Scalding, Storm and Summingbird)
     Yoshimasa Niwa, Twitter, Inc.
     Sep 6, 2014 in Tokyo
  2. Swift and Scala (a Swift talk at Scala Matsuri)
     Swift:
         let point = (3, 2)
         switch point { case let (x, y): println(x) }
     Scala:
         val point = (3, 2)
         point match { case (x, y) => println(x) }
  3. Twitter was (probably) the biggest Rails app in the world.
  4. What we got as a result.
     A giant, heavily patched Ruby and Rails.
     Development with grep: grep was all we could rely on, and little help could be expected from an IDE.
  5. What we got as a result.
     A giant UDF plus a hugely patched Pig.
     Less development consistency: only the Pig jobs needed special operational handling.
     Much copy and paste: code was hard to reuse, and there was no IDE support.
  6. Scalding is a library for building Hadoop jobs.
     Based on Cascading, a Java library.
     Abstracts Hadoop's low-level APIs, which stay hidden.
     Tightly integrated with Scala: it is a pure Scala environment.
     A REPL was introduced recently.
  7. Scalding APIs
     There are two kinds of API: the Fields-based API and the Type-safe API.
     The Type-safe API is type-safe: type mistakes are caught at compile time.
     The API uses a lot of Scala magic: the storm of implicits looks hard on IDEs.
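
To illustrate the compile-time checking mentioned on this slide, here is a minimal sketch that uses only Scala standard-library collections, not Scalding itself. Scalding's typed API offers similarly named operations (for example `flatMap` and `map` on a `TypedPipe`), but the `TypedWordCount` object and its sample input below are hypothetical names chosen for illustration.

```scala
// Sketch: word count over plain Scala collections, standing in for a typed pipeline.
// TypedWordCount is a hypothetical name; this is not Scalding code.
object TypedWordCount {
  // Split lines into words and count occurrences; every stage has a checked type.
  def count(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("\\s+"))   // Seq[String]
      .filter(_.nonEmpty)
      .map(word => (word, 1L))    // Seq[(String, Long)]
      .groupBy(_._1)              // Map[String, Seq[(String, Long)]]
      .map { case (word, pairs) => word -> pairs.map(_._2).sum }

  def main(args: Array[String]): Unit = {
    val counts = count(Seq("hello scala", "hello storm"))
    println(counts)
    // A type mistake is rejected by the compiler instead of failing at run time:
    // val wrong: Map[String, String] = count(Seq("a")) // does not compile
  }
}
```

Because each stage carries its element type, treating a count as a `String` is a compile error rather than a failure deep inside a long-running job.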
  8. Storm is a platform for realtime data processing.
     Storm is similar to Hadoop, but for realtime processing.
     No data loss: Storm guarantees that data gets processed.
     Fault-tolerant: it should hold up well against failures.
  9. Summingbird is a library that abstracts MapReduce processing.
     Summingbird abstracts the platform.
     Summingbird works with Scalding and Storm.
     Summingbird can aggregate results from both batch and realtime jobs.
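
The idea of abstracting the platform can be sketched in plain Scala: the job is described once and then handed to either a batch runner or a streaming runner. This is not Summingbird's API. `PlatformSketch`, `CountJob`, `runBatch` and `runStreaming` are hypothetical names, and the two "platforms" are ordinary in-memory collections.

```scala
// Hypothetical sketch of "write once, run on batch or realtime"; not Summingbird's API.
object PlatformSketch {
  // The job: turn each input into key/value pairs; values are summed per key.
  final case class CountJob[T](extract: T => Seq[(String, Long)])

  // "Batch" runner: sees the whole data set at once.
  def runBatch[T](job: CountJob[T], data: Seq[T]): Map[String, Long] =
    data
      .flatMap(job.extract)
      .groupBy(_._1)
      .map { case (key, pairs) => key -> pairs.map(_._2).sum }

  // "Realtime" runner: folds events one at a time into a running state.
  def runStreaming[T](job: CountJob[T], events: Iterator[T]): Map[String, Long] =
    events.foldLeft(Map.empty[String, Long]) { (state, event) =>
      job.extract(event).foldLeft(state) { case (acc, (key, value)) =>
        acc.updated(key, acc.getOrElse(key, 0L) + value)
      }
    }

  // One job definition, reused unchanged by both runners.
  val wordCount: CountJob[String] =
    CountJob[String](line => line.split("\\s+").filter(_.nonEmpty).map(_ -> 1L).toSeq)
}
```

Both runners produce the same totals for the same input because summing counts per key does not depend on the order in which events arrive.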
  10. Merge results from Scalding + Storm
      The same data processing logic can be reused, and results can be retrieved from both.
      [Diagram: a timeline of Batch 1, Batch 2, Batch 3, Batch 4 up to "now".
       Batch jobs: K→(V1, Batch 2). Realtime jobs: (K, Batch 3)→V2 and (K, Batch 4)→V3.]
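
The merge in the diagram can be sketched in plain Scala. This is one reading of the slide, not Summingbird's client API: it assumes the batch side stores, per key, an aggregate together with the last batch that aggregate covers, while the realtime side stores one partial value per key and batch. `MergeSketch`, `lookup` and the use of `Long` sums are hypothetical choices for illustration.

```scala
// Hypothetical sketch of merging batch and realtime results; not Summingbird's client API.
object MergeSketch {
  type BatchId = Int

  // batchStore:    key -> (aggregate, last batch included in that aggregate)
  // realtimeStore: (key, batch) -> partial aggregate for that single batch
  def lookup(
      key: String,
      batchStore: Map[String, (Long, BatchId)],
      realtimeStore: Map[(String, BatchId), Long]
  ): Long = {
    val (batchValue, lastBatch) = batchStore.getOrElse(key, (0L, 0))
    // Add only the realtime partials for batches the batch job has not covered yet.
    val realtimeValue = realtimeStore.collect {
      case ((k, batch), value) if k == key && batch > lastBatch => value
    }.sum
    batchValue + realtimeValue
  }
}
```

With K→(V1, Batch 2), (K, Batch 3)→V2 and (K, Batch 4)→V3, the lookup returns V1 + V2 + V3. A realtime partial for Batch 2 or earlier is ignored, since under this assumption the batch aggregate already covers it.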
  11. Summary
      Scalding: implement daily batch jobs in Scala.
      Storm: implement daily stream-processing (realtime) jobs in Scala as well.
      Summingbird: aggregate across entire data pipelines, with the data processing written in Scala.