Getting Started with Scalding, Storm and Summingbird

Yoshimasa Niwa
September 06, 2014

Getting started with data processing in Scala using Scalding, Storm, and Summingbird

Sep 6, 2014 at Scala Matsuri 2014 in Tokyo, Japan.

Transcript

  1. Getting Started with Scalding, Storm & Summingbird
    Let's try out Scalding, Storm, and Summingbird.
    Yoshimasa Niwa
    Twitter, Inc.
    Sep 6, 2014 in Tokyo

  2. Scala Matsuri 2014
    This is the Scala Matsuri schedule.

  3. @niw
    Yoshimasa Niwa

  4. [No text on this slide]

  5. Beautiful San Francisco
    Located in San Francisco.

  6. I’m not a Scala professional.

  7. I’m not even a data professional.

  8. [[iOSApp alloc] init];
    Lately I've been building iOS apps.

  9. Swift 㲗 Scala
    A Swift talk at Scala Matsuri

  10. Swift 㲗 Scala
    Swift:
      let point = (3, 2)

      switch point {
      case let (x, y):
          println(x)
      }

    Scala:
      val point = (3, 2)

      point match {
        case (x, y) =>
          println(x)
      }

  11. I’m a user of data infrastructure.
    I'll introduce these tools from the viewpoint of an ordinary data infrastructure user.

  12. Twitter — Scala
    Twitter is heavily invested in Scala.

  13. In 2009…

  14. Twitter was (probably) the biggest Rails app in the world.

  15. What we’ve got.
    Heavily patched Ruby and Rails.
    Development with grep, no IDE support.

  16. and more.

  17. [No text on this slide]

  18. Nov 20, 2009

  19. We can’t survive with Rails.

  20. Ruby → Scala
    Rewrite the Ruby features bit by bit and migrate to Scala.

  21. So long, and thanks for all the whales.

  22. Aug 2, 2013
    143,199 TPS

  23. Ummm… Yeah… TPS?

  24. Tweets per second

  25. How about the analytics infrastructure?

  26. In 2009…

  27. Hadoop + Pig

  28. What we’ve got.
    Giant UDFs plus a heavily patched Pig.
    Less development consistency.
    Only the Pig jobs needed special operational handling.
    Lots of copy and paste, no IDE support.
    Code was hard to reuse.

  29. Pig → Scalding
    Just as we migrated Ruby to Scala, we migrated from Pig to Scalding.

  30. Scalding
    Let's try Scalding.

  31. Scalding is

  32. Scalding is a library
    Scalding is a library for building Hadoop jobs.
    Based on Cascading, a Java library.
    Abstracts Hadoop's low-level APIs.
    Tightly integrated with Scala: it is a pure Scala environment.
    Recently gained a REPL.

  33. Getting Started
    Let's get hands-on with Scalding.

  34. [No text on this slide]

  35. TPS Report
    Let's look up the TPS at the moment "Balse" hit.

  36. [No text on this slide]

  37. Scalding APIs
    Fields-based API and Type-safe API
    There are two kinds of API.
    The Type-safe API is type-safe.
    Type mistakes are caught at compile time.
    The API uses a lot of Scala magic.
    The storm of implicits looks hard on IDEs.
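
The TPS report from the slides above can be pictured with plain Scala collections. This is a minimal sketch, not the Scalding API: `Tweet`, `tweetsPerSecond`, and `peak` are hypothetical names, and the input is an in-memory `Seq` rather than logs on Hadoop. It does illustrate the compile-time checking mentioned for the Type-safe API, since passing a `String` where the `Long` timestamp is expected would not compile.

```scala
// Toy model only: plain Scala collections, not the Scalding API.
object TpsSketch {
  // Hypothetical record type; real tweet logs carry many more fields.
  final case class Tweet(id: Long, timestampMs: Long)

  // Count tweets per whole second (epoch second -> count).
  def tweetsPerSecond(tweets: Seq[Tweet]): Map[Long, Int] =
    tweets
      .groupBy(tweet => tweet.timestampMs / 1000)
      .map { case (second, group) => second -> group.size }

  // The busiest second and its count, or None for empty input.
  def peak(tweets: Seq[Tweet]): Option[(Long, Int)] = {
    val counts = tweetsPerSecond(tweets)
    if (counts.isEmpty) None else Some(counts.maxBy(_._2))
  }
}
```

Calling `TpsSketch.peak` on a day of tweets would return the second with the highest count, which is the number a TPS report is after.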

  38. No one knows the future.
    In practice, you never know when something like "Balse" will come.

  39. Storm
    Let's try Storm.

  40. Storm is

  41. Storm is a platform.
    Storm is a platform for realtime data processing.
    Storm is similar to Hadoop, but for realtime.
    No data loss: Storm guarantees that data gets processed.
    Fault-tolerant: it should hold up well against failures.

  42. Getting Started
    Let's get hands-on with Storm.

  43. [No text on this slide]

  44. Realtime TPS report
    For example, look up the TPS when "Balse" is about to come.

  45. [No text on this slide]

  46. Storm terms
    Nimbus, Supervisor, Worker

  47. Storm terms
    Topology, Spout, Bolt
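
The terms on this slide can be modeled in a few lines: a spout is a source of tuples, a bolt consumes a stream and emits a new one, and a topology wires them together. The sketch below is a toy model in plain Scala, not the Storm API; `Spout`, `Bolt`, `toSecond`, `runningCount`, and `topology` are hypothetical names, and everything runs in one thread instead of on a cluster.

```scala
// Toy model only: plain Scala iterators, not the Storm API.
object TopologySketch {
  // A spout is a source of tuples; here an iterator of tweet timestamps (ms).
  type Spout[A] = Iterator[A]
  // A bolt consumes one stream and emits another.
  type Bolt[A, B] = Iterator[A] => Iterator[B]

  // Bolt 1: convert each timestamp to its whole second.
  val toSecond: Bolt[Long, Long] = _.map(_ / 1000)

  // Bolt 2: emit a running count per second as (second, countSoFar).
  val runningCount: Bolt[Long, (Long, Int)] = { seconds =>
    val counts = scala.collection.mutable.Map.empty[Long, Int].withDefaultValue(0)
    seconds.map { s =>
      counts(s) += 1
      (s, counts(s))
    }
  }

  // A topology wires the spout through the bolts.
  def topology(spout: Spout[Long]): Iterator[(Long, Int)] =
    runningCount(toSecond(spout))
}
```

Because the result is a lazy iterator, counts are produced as each timestamp arrives, which is the property that makes a realtime TPS report possible.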

  48. Hmm, looks similar.

  49. Summingbird
    Let's try Summingbird.

  50. Summingbird is

  51. Summingbird is a library.
    Summingbird is a library that abstracts MapReduce processing.
    Summingbird abstracts the platform.
    Summingbird works with Scalding and Storm.
    Summingbird aggregates results from both batch and realtime jobs.

  52. [No text on this slide]

  53. Summingbird Abstraction
    Producer: filter, map, flatMap, …
    Platform: Scalding, Storm, …
    Plan: Pipe, Topology, …
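
One way to picture the Producer / Platform / Plan split is the sketch below: the job logic is written once as a producer, and each platform turns it into its own kind of plan. The names mirror the slide, but the signatures are assumptions made for illustration and are not Summingbird's real API; `BatchPlatform` and `StreamPlatform` are hypothetical stand-ins for Scalding and Storm.

```scala
// Toy model only: names follow the slide, signatures are illustrative.
object AbstractionSketch {
  // A Producer describes per-event logic without saying where it runs.
  final case class Producer[A, B](run: A => Seq[B]) {
    def map[C](f: B => C): Producer[A, C] =
      Producer[A, C](a => run(a).map(f))
    def filter(p: B => Boolean): Producer[A, B] =
      Producer[A, B](a => run(a).filter(p))
    def flatMap[C](f: B => Seq[C]): Producer[A, C] =
      Producer[A, C](a => run(a).flatMap(f))
  }

  // Starting point: pass every event through unchanged.
  def source[A]: Producer[A, A] = Producer[A, A](a => Seq(a))

  // A Platform turns a Producer into its own kind of executable Plan.
  trait Platform {
    type Plan[A, B]
    def plan[A, B](producer: Producer[A, B]): Plan[A, B]
  }

  // Batch-style platform: the plan processes a whole dataset at once.
  object BatchPlatform extends Platform {
    type Plan[A, B] = Seq[A] => Seq[B]
    def plan[A, B](producer: Producer[A, B]): Plan[A, B] =
      _.flatMap(producer.run)
  }

  // Streaming-style platform: the plan handles one event at a time.
  object StreamPlatform extends Platform {
    type Plan[A, B] = A => Seq[B]
    def plan[A, B](producer: Producer[A, B]): Plan[A, B] = producer.run
  }
}
```

A job such as `source[Long].filter(_ >= 1000).map(_ / 1000)` can then be handed to either platform unchanged, which is the reuse the abstraction is meant to provide.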

  54. Merge results from Scalding + Storm
    The same data processing logic can be reused, and results can be fetched from both.
    K→V

  55. Merge results from Scalding + Storm
    [Diagram: timeline of Batch 1, Batch 2, Batch 3, Batch 4, with a "now" marker]
    Batch jobs: K→(V1, Batch 2)
    Realtime jobs: (K, Batch 3)→V2, (K, Batch 4)→V3

  56. Merge results from Scalding + Storm
    K→(V1, Batch 2)
    (K, Batch 3)→V2
    (K, Batch 4)→V3
    V1+V2+V3 = Vnow!
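
The merge on this slide can be sketched as a lookup over two stores. This is a minimal illustration under stated assumptions, not Summingbird's client code: plain `Map`s stand in for the batch and realtime stores, values are `Long` counts merged with `+`, batches are numbered from 1, and `MergeSketch`, `BatchId`, and `lookup` are hypothetical names.

```scala
// Toy model only: plain Scala maps stand in for the two stores.
object MergeSketch {
  type BatchId = Int

  // batchStore:    K -> (value aggregated by batch jobs, last batch it covers)
  // realtimeStore: (K, batch) -> partial value seen by the realtime job in that batch
  def lookup[K](
      batchStore: Map[K, (Long, BatchId)],
      realtimeStore: Map[(K, BatchId), Long],
      currentBatch: BatchId
  )(key: K): Long = {
    // Missing keys start from zero, before the first batch.
    val (batchValue, lastBatch) = batchStore.getOrElse(key, (0L, 0))
    // Add only the realtime partials for batches the batch job has not covered yet.
    val realtimeParts =
      ((lastBatch + 1) to currentBatch).map(b => realtimeStore.getOrElse((key, b), 0L))
    batchValue + realtimeParts.sum
  }
}
```

With a batch value of V1 covering up to Batch 2 and realtime partials V2 and V3 for Batches 3 and 4, `lookup` returns V1 + V2 + V3, matching the slide. Realtime entries for batches already covered by the batch store are ignored, so nothing is counted twice.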

  57. Summary

  58. Summary
    Scalding to implement daily batch jobs in Scala.
    Storm to implement realtime stream processing jobs in Scala.
    Summingbird to aggregate entire data pipelines, also written in Scala.

  59. Enjoy.Scala

  60. #thankyou
    Questions? Tweet to @niw :)
    The end.