Getting Started with Scalding, Storm and Summingbird

Yoshimasa Niwa
September 06, 2014

Getting started with data processing in Scala using Scalding, Storm and Summingbird.

Sep 6, 2014 at Scala Matsuri 2014 in Tokyo, Japan.


Transcript

  1. Getting Started with Scalding, Storm & Summingbird. Yoshimasa Niwa, Twitter, Inc. Sep 6, 2014 in Tokyo. Let's try out Scalding, Storm and Summingbird.
  2. Scala Matsuri 2014. This is the Scala Matsuri schedule.

  3. @niw Yoshimasa Niwa

  4. None
  5. Located in beautiful San Francisco.

  6. I'm not a Scala professional.

  7. I'm not even a data professional.

  8. [[iOSApp alloc] init]; These days I build iOS apps.

  9. Swift and Scala. A Swift talk at Scala Matsuri.

  10. Swift and Scala. A Swift talk at Scala Matsuri.
    Swift: let point = (3, 2)
           switch point { case let (x, y): println(x) }
    Scala: val point = (3, 2)
           point match { case (x, y) => println(x) }
  11. I'm a user of data infrastructure, and I will present from that point of view.

  12. Twitter and Scala. Twitter is heavily invested in Scala.

  13. In 2009… Looking back several years…

  14. Twitter was (probably) the biggest Rails app in the world.
  15. What we've got. A giant, heavily patched Ruby and Rails. Development with grep as the only tool to rely on, and little help to expect from an IDE.
  16. And more…

  17. None
  18. Nov 20, 2009

  19. We can't survive with Rails.

  20. Ruby → Scala. Rewrite the Ruby features little by little and migrate to Scala.

  21. So long, and thanks for all the whales.

  22. Aug 2, 2013: 143,199 TPS

  23. Ummm… Yeah… TPS?

  24. Tweets Per Second

  25. How about the analytics infra? How is data processing done?

  26. In 2009… Looking back several years…

  27. Hadoop + Pig

  28. What we've got. Giant UDFs plus a hugely patched Pig. Less development consistency: only the Pig jobs needed special operational handling. Lots of copy and paste and no IDE support, so code was hard to reuse.
  29. Pig → Scalding. Just as Ruby was migrated to Scala, Pig was migrated to Scalding.

  30. Scalding. Let's try out Scalding.

  31. Scalding is

  32. Scalding is a library for building Hadoop jobs. It is based on Cascading, a Java library. It abstracts away Hadoop's low-level APIs. It is tightly integrated with Scala. A REPL was introduced recently.
  33. Getting Started. Let's get hands-on with Scalding.

  34. None
  35. TPS Report. Let's look up the TPS at the moment "Balse" hit.

  36. None
  37. Scalding APIs. There are two kinds of API: the Fields-based API and the Type-safe API. The Type-safe API is type-safe, so type mistakes are caught at compile time. The API uses a lot of Scala magic, and the storm of implicits looks hard on IDEs.
  38. No one knows the future. In practice, you never know when the next "Balse"-like event will come.

  39. Storm. Let's try out Storm.

  40. Storm is

  41. Storm is a platform for realtime data processing. Storm is similar to Hadoop, but for realtime. No data loss: Storm guarantees that data gets processed. Fault-tolerant: it should be resilient to failures.
  42. Getting Started. Let's get hands-on with Storm.

  43. None
  44. Realtime TPS report. For example, check the TPS when "Balse" looks like it is about to hit.

  45. None
  46. Storm terms: Nimbus, Supervisor, Worker

  47. Storm terms: Spout, Bolt, Topology

  48. Hmm, looks similar. It feels like something we have seen before.

  49. Summingbird. Let's try out Summingbird.

  50. Summingbird is

  51. Summingbird is a library that abstracts MapReduce processing. Summingbird abstracts the platform. Summingbird works with Scalding and Storm. Summingbird aggregates results from both batch and realtime jobs.
  52. None
  53. Summingbird Abstraction. Producer: filter, map, flatMap, … Platform: Scalding, Storm, … Plan: Pipe, Topology, …
  54. Merge results from Scalding + Storm. The same data processing logic can be reused, and results can be fetched from both. K→V

  55. Merge results from Scalding + Storm. Timeline: Batch 1, Batch 2, Batch 3, Batch 4 (now). Batch jobs: K→(V1, Batch 2). Realtime jobs: (K, Batch 3)→V2 and (K, Batch 4)→V3.
  56. Merge results from Scalding + Storm. K→(V1, Batch 2), (K, Batch 3)→V2, (K, Batch 4)→V3. V1+V2+V3 = Vnow!
  57. Summary

  58. Summary. Scalding to implement daily batch jobs in Scala. Storm to implement daily stream processing jobs in Scala, so realtime processing is in Scala too. Summingbird to aggregate entire data pipelines, so the data processing as a whole can be written in Scala.
  59. Enjoy.Scala

  60. #thankyou Questions? Tweet to @niw :)
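
The merge shown on slides 54 to 56 can be sketched in plain Scala, without Summingbird itself. This is only an illustration under stated assumptions, not Summingbird's API: the names `BatchId`, `batchStore`, `realtimeStore` and `mergedValue`, and the sample counts, are hypothetical. It assumes values are counts combined with `+`, that the batch job has produced a total per key up to and including some batch, and that the realtime job holds one partial value per (key, batch) for the later batches.

```scala
object MergeSketch {
  // Hypothetical batch identifier; a higher id is a later time window.
  type BatchId = Int

  // Batch side (assumed sample data): K -> (V1, last batch included in V1).
  val batchStore: Map[String, (Long, BatchId)] =
    Map("tweets" -> (1000L, 2))

  // Realtime side (assumed sample data): (K, batch) -> partial value for that batch only.
  val realtimeStore: Map[(String, BatchId), Long] =
    Map(("tweets", 3) -> 40L, ("tweets", 4) -> 2L)

  // Vnow = V1 + every realtime partial for batches after the batch total.
  def mergedValue(key: String, currentBatch: BatchId): Long = {
    // With no batch result yet, start from zero and read all batches from realtime.
    val (batchTotal, lastBatch) = batchStore.getOrElse(key, (0L, 0))
    val realtimeTotal = ((lastBatch + 1) to currentBatch)
      .flatMap(b => realtimeStore.get((key, b)))
      .sum
    batchTotal + realtimeTotal
  }

  def main(args: Array[String]): Unit =
    println(mergedValue("tweets", currentBatch = 4))
}
```

With the sample data, `mergedValue("tweets", 4)` adds 1000 + 40 + 2, which corresponds to V1+V2+V3 = Vnow on slide 56. The sketch hard-codes addition for brevity; any associative way of combining values would fit the same shape.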