Getting Started with Scalding, Storm and Summingbird

Yoshimasa Niwa
September 06, 2014

Getting started with data processing in Scala using Scalding, Storm, and Summingbird

Sep 6, 2014 at Scala Matsuri 2014 in Tokyo, Japan.

Transcript

  1. Getting Started with

    Scalding, Storm & Summingbird
    Yoshimasa Niwa

    Twitter, Inc.

    Sep 6, 2014 in Tokyo
    Let's try Scalding, Storm and Summingbird.

  2. Scala Matsuri 2014
    This is the Scala Matsuri schedule.

  3. @niw
    Yoshimasa Niwa

  4. Beautiful San Francisco
    Located in San Francisco.

  5. I’m not a Scala professional.

  6. I’m not even a data professional.

  7. [[iOSApp alloc] init];
    Lately I've been building iOS apps.

  8. Swift ⇔ Scala
    A Swift talk at Scala Matsuri

  9. Swift ⇔ Scala
    Swift:
        let point = (3, 2)

        switch point {
        case let (x, y):
            println(x)  // Swift 1.x; Swift 2 and later use print(x)
        }
    Scala:
        val point = (3, 2)

        point match {
          case (x, y) =>
            println(x)
        }

  10. I’m a user of the data infrastructure.
    This talk introduces these tools from one user's point of view.

  11. Twitter — Scala
    Twitter is heavily invested in Scala.

  12. In 2009…
    Going back five years…

  13. Twitter was (probably)
    the biggest Rails app in the world.

  14. What we’ve got.
    Heavily patched Ruby and Rails.
    Development relied on grep alone, with little help to expect from an IDE.

  15. and more.

  16. Nov 20, 2009

  17. We can’t survive with Rails.

  18. Ruby → Scala
    Rewrite the Ruby features little by little and migrate to Scala.

  19. So long, and thanks for all the whales.
    Goodbye, whale.

  20. Aug 2, 2013
    143,199 TPS

  21. Ummm… Yeah… TPS?

  22. Tweets per second

  23. How about the analytics infrastructure?
    How is data processing handled?

  24. In 2009…
    Going back five years…

  25. Hadoop + Pig

  26. What we’ve got.
    Giant UDFs and a heavily patched Pig.
    Less development consistency: only the Pig jobs needed special operational handling.
    Lots of copy and paste: code reuse was hard, and there was no IDE support.

  27. Pig → Scalding
    Just as we migrated Ruby to Scala, we migrated from Pig to Scalding.

  28. Scalding
    Let's try Scalding.

  29. Scalding is

  30. Scalding is a library for building Hadoop jobs.
    Based on Cascading, a Java library.
    Abstracts Hadoop's low-level APIs, which stay hidden.
    Tightly integrated with Scala: a fully Scala environment.
    A REPL was introduced recently.

  31. Getting Started

  32. TPS Report
    Look up the TPS at the moment "Balse" hit.

  33. Scalding APIs
    There are two APIs: the Fields-based API and the Type-safe API.
    The Type-safe API catches type mistakes at compile time.
    The APIs use a lot of Scala magic; the storm of implicits looks hard on IDEs.
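As a rough illustration of the TPS report, the sketch below counts tweets per second with plain Scala collections. It is not Scalding code: `TpsReport`, `Tweet`, and `createdAtMillis` are hypothetical names chosen for this example, and the input is assumed to fit in memory. Scalding's Type-safe API offers similar collection-style operations on its `TypedPipe`, so a real job would likely have the same shape.

```scala
object TpsReport {
  // Hypothetical record: one tweet with its creation time in epoch milliseconds.
  final case class Tweet(id: Long, createdAtMillis: Long)

  // Count tweets in each one-second bucket; the key is the epoch second.
  def tweetsPerSecond(tweets: Seq[Tweet]): Map[Long, Int] =
    tweets
      .groupBy(t => t.createdAtMillis / 1000L)
      .map { case (second, ts) => second -> ts.size }

  // The busiest second and its tweet count, or None for empty input.
  def peak(tweets: Seq[Tweet]): Option[(Long, Int)] = {
    val counts = tweetsPerSecond(tweets)
    if (counts.isEmpty) None else Some(counts.maxBy(_._2))
  }
}
```

Because the bucket key and the count are both typed, passing the wrong field to `groupBy` fails at compile time, which is the kind of check the Type-safe API slide refers to.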

  34. No one knows the future.
    In practice, nobody knows when the next Balse-like spike will come.

  35. Storm
    Let's try Storm.

  36. Storm is

  37. Storm is a platform for realtime data processing.
    Storm is similar to Hadoop, but for realtime.
    No data loss: Storm guarantees that data gets processed.
    Fault-tolerant: it should hold up well against failures.

  38. Getting Started

  39. Realtime TPS report
    For example, check the TPS when Balse looks like it is about to hit.

  40. Storm terms
    Nimbus, Supervisor, Worker

  41. Storm terms
    Topology, Spout, Bolt

  42. Hmm, looks similar.

  43. Summingbird
    Let's try Summingbird.

  44. Summingbird is

  45. Summingbird is a library that abstracts MapReduce processing.
    Summingbird abstracts the platform.
    Summingbird works with Scalding and Storm.
    Summingbird aggregates results from both batch and realtime jobs.

  46. Summingbird Abstraction
    Producer: filter, map, flatMap, …
    Platform: Scalding, Storm, …
    Plan: Pipe, Topology, …
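The Producer / Platform / Plan split can be sketched in plain Scala as follows. This is a simplified illustration and not Summingbird's actual API: `SummingbirdSketch`, `BatchPlatform`, and `StreamPlatform` are hypothetical names, the signatures are assumptions, and a `List` and an `Iterator` stand in for a Scalding pipe and a Storm topology.

```scala
object SummingbirdSketch {
  // Hypothetical, simplified Producer: a description of how one input
  // event turns into zero or more output events. Nothing runs here.
  final case class Producer[In, Out](run: In => List[Out]) {
    def map[B](f: Out => B): Producer[In, B] =
      Producer[In, B](in => run(in).map(f))
    def filter(p: Out => Boolean): Producer[In, Out] =
      Producer[In, Out](in => run(in).filter(p))
    def flatMap[B](f: Out => List[B]): Producer[In, B] =
      Producer[In, B](in => run(in).flatMap(f))
  }

  object Producer {
    // Starting point: passes every input event through unchanged.
    def source[A]: Producer[A, A] = Producer[A, A](a => List(a))
  }

  // A Platform turns the description into something executable (a Plan).
  trait Platform {
    type Plan[In, Out]
    def plan[In, Out](p: Producer[In, Out]): Plan[In, Out]
  }

  // Batch-style stand-in: the plan processes a whole List at once.
  object BatchPlatform extends Platform {
    type Plan[In, Out] = List[In] => List[Out]
    def plan[In, Out](p: Producer[In, Out]): Plan[In, Out] =
      inputs => inputs.flatMap(p.run)
  }

  // Streaming-style stand-in: the plan processes events lazily, one by one.
  object StreamPlatform extends Platform {
    type Plan[In, Out] = Iterator[In] => Iterator[Out]
    def plan[In, Out](p: Producer[In, Out]): Plan[In, Out] =
      inputs => inputs.flatMap(p.run)
  }

  // One description of the logic, usable on both platforms.
  val words: Producer[String, String] =
    Producer.source[String]
      .flatMap(_.split(" ").toList)
      .filter(_.nonEmpty)
      .map(_.toLowerCase)
}
```

The point of the design is that `words` is written once, and each platform decides how to execute it.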

  47. Merge results from Scalding + Storm
    The same data processing logic can be reused, and results can be fetched from both.
    K→V

  48. Merge results from Scalding + Storm
    Timeline: Batch 1, Batch 2, Batch 3, Batch 4 (now)
    Batch jobs: K→(V1, Batch 2)
    Realtime jobs: (K, Batch 3)→V2, (K, Batch 4)→V3

  49. Merge results from Scalding + Storm
    K→(V1, Batch 2)
    (K, Batch 3)→V2
    (K, Batch 4)→V3
    V1+V2+V3 = Vnow!
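The merge on this slide can be sketched as a lookup over two stores. This is a minimal sketch under stated assumptions, not Summingbird's implementation: `MergeSketch`, `lookup`, and the two `Map` stores are hypothetical, and values are plain `Long` sums, whereas Summingbird generalizes the `+` through Algebird's algebraic abstractions.

```scala
object MergeSketch {
  type BatchId = Int

  // batchStore:    key -> (value aggregated so far, last batch it covers)
  // realtimeStore: (key, batch) -> partial value for that batch only
  def lookup[K](
      key: K,
      batchStore: Map[K, (Long, BatchId)],
      realtimeStore: Map[(K, BatchId), Long]
  ): Long = {
    // V1 and the last batch the batch job has finished; nothing yet if absent.
    val (batchValue, lastBatch) = batchStore.getOrElse(key, (0L, -1))
    // V2, V3, ...: realtime partials for batches the batch job has not covered.
    val realtimeValue = realtimeStore.collect {
      case ((k, batch), v) if k == key && batch > lastBatch => v
    }.sum
    batchValue + realtimeValue
  }
}
```

Realtime partials for batches already covered by the batch store are skipped, so no batch is counted twice.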

  50. Summary

  51. Summary
    Scalding: implement daily batch jobs in Scala.
    Storm: implement streaming (realtime) processing jobs in Scala, too.
    Summingbird: aggregate across entire data pipelines, with the processing written in Scala.

  52. Enjoy.Scala

  53. #thankyou
    Questions? Tweet to @niw :)
    The end.