Getting Started with Scalding, Storm and Summingbird

Yoshimasa Niwa
September 06, 2014

Getting started with data processing in Scala using Scalding, Storm, and Summingbird.

Sep 6, 2014 at Scala Matsuri 2014 in Tokyo, Japan.

Transcript

  1. Getting Started with Scalding, Storm & Summingbird
    Yoshimasa Niwa, Twitter, Inc.
    Sep 6, 2014 in Tokyo
    (Let’s try out Scalding, Storm, and Summingbird.)
  2. Scala Matsuri 2014 (This is the Scala Matsuri schedule.)

  3. @niw Yoshimasa Niwa

  4. None
  5. Beautiful San Francisco (It’s in San Francisco.)

  6. I’m not a Scala professional.

  7. I’m not even a data professional.

  8. [[iOSApp alloc] init]; (Lately I’ve been building iOS apps.)

  9. Swift ≈ Scala (A Swift talk at Scala Matsuri.)

  10. Swift ≈ Scala (A Swift talk at Scala Matsuri.)
    Swift:
      let point = (3, 2)
      switch point { case let (x, y): println(x) }
    Scala:
      val point = (3, 2)
      point match { case (x, y) => println(x) }
  11. I’m a user of the data infrastructure. (I’ll present this from the viewpoint of one user of the data infrastructure.)

  12. Twitter and Scala (Twitter is heavily invested in Scala.)

  13. In 2009… (Looking back some years…)

  14. Twitter was (probably) the biggest Rails app in the world.

  15. What we’ve got (as a result):
    Giant patched Ruby and Rails. (Ruby and Rails with patches piled on.)
    Development with grep, no IDE support. (Grep was all we could rely on; we couldn’t expect much help from an IDE.)
  16. And more…

  17. None
  18. Nov 20, 2009

  19. We can’t survive with Rails.

  20. Ruby → Scala (Let’s rewrite the Ruby features bit by bit and migrate to Scala.)

  21. So long, and thanks for all the whales. (Goodbye, whale.)

  22. Aug 2, 2013: 143,199 TPS

  23. Ummm… Yeah… TPS? (What’s TPS?)

  24. Tweets per second

  25. How about the analytics infra? (So what about data processing?)

  26. In 2009… (Looking back some years…)

  27. Hadoop + Pig

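  28. What we’ve got (as a result):
    Giant UDFs + hugely patched Pig. (Huge UDFs and a Pig with patches piled on.)
    Less development consistency. (Only the Pig jobs needed special operational handling…)
    Lots of copy and paste, no IDE support. (Code reuse is hard, and there is no IDE support either.)
  29. Pig → Scalding (Just as we migrated Ruby to Scala, we migrated from Pig to Scalding.)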

  30. Scalding (Let’s try Scalding.)

  31. Scalding is…

  32. Scalding is a library for building Hadoop jobs.
    Based on Cascading, a Java library.
    Abstracts Hadoop’s low-level APIs. (They are hidden away.)
    Tightly integrated with Scala. (It is a full Scala environment.)
    A REPL was introduced recently.
  33. Getting Started (Let’s get hands-on with Scalding.)

  34. None
  35. TPS Report (Let’s look up the TPS at the moment “Balse” (バルス) hit.)

  36. None
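
A TPS report boils down to bucketing tweets by second and counting each bucket. The following is a minimal sketch of that computation using plain Scala collections only; it is not Scalding code, and the `Tweet` record, its fields, and the sample data are hypothetical names made up for illustration.

```scala
// Plain Scala collections only; this is NOT the Scalding API.
object TpsReportSketch {
  // Hypothetical record: a tweet with its creation time in epoch milliseconds.
  final case class Tweet(id: Long, createdAtMillis: Long)

  // Count tweets per one-second bucket (key = epoch second).
  def tweetsPerSecond(tweets: Seq[Tweet]): Map[Long, Int] =
    tweets
      .groupBy(_.createdAtMillis / 1000L)
      .map { case (second, ts) => second -> ts.size }

  // The busiest second and its tweet count, or None for empty input.
  def peak(tweets: Seq[Tweet]): Option[(Long, Int)] = {
    val counts = tweetsPerSecond(tweets)
    if (counts.isEmpty) None else Some(counts.maxBy(_._2))
  }

  // Made-up sample data: three tweets in second 1, one tweet in second 2.
  val sample: Seq[Tweet] = Seq(
    Tweet(1L, 1000L), Tweet(2L, 1500L), Tweet(3L, 1999L), Tweet(4L, 2000L)
  )
}
```

With the sample above, `TpsReportSketch.peak(TpsReportSketch.sample)` yields `Some((1, 3))`. A batch job would presumably express the same group-and-count step over a much larger input.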
  37. Scalding APIs
    Two kinds of API: the Fields-based API and the Type-safe API.
    The Type-safe API is type-safe. (Type mix-ups are caught at compile time.)
    The API uses a lot of Scala magic. (With the storm of implicits, IDEs seem to struggle.)
  38. No one knows the future. (In practice, you never know when something like “Balse” will hit.)

  39. Storm (Let’s try Storm.)

  40. Storm is…

  41. Storm is a platform for realtime data processing.
    Storm is similar to Hadoop, but for realtime.
    No data loss. (Storm guarantees that data gets processed.)
    Fault-tolerant. (It should hold up well against failures.)
  42. Getting Started (Let’s get hands-on with Storm.)

  43. None
  44. Realtime TPS report (For example, check the TPS when “Balse” looks like it is about to hit.)

  45. None
  46. Storm terms: Nimbus, Supervisor, Worker

  47. Storm terms: Spout, Bolt, Topology

  48. Hmm, looks similar. (Feels like I’ve seen this somewhere before.)

  49. Summingbird (Let’s try Summingbird.)

  50. Summingbird is…

  51. Summingbird is a library that abstracts MapReduce processing.
    Summingbird abstracts the platform.
    Summingbird works with Scalding and Storm.
    Summingbird aggregates results from both batch and realtime jobs.
  52. None
  53. Summingbird Abstraction
    Producer: filter, map, flatMap, …
    Platform: Scalding, Storm, …
    Plan: Pipe, Topology, …
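
The Producer / Platform / Plan split can be illustrated with a small toy model: the logic is written once as a producer, and each platform turns it into its own kind of plan. This is a sketch in plain Scala under stated assumptions, not Summingbird's real API; `Producer`, `source`, `batchPlan`, and `streamPlan` here are hypothetical names defined in the block itself.

```scala
// Toy model of the idea in plain Scala; NOT Summingbird's real API.
object AbstractionSketch {
  // A producer describes per-event logic without saying where it runs.
  final case class Producer[A, B](run: A => List[B]) {
    def map[C](f: B => C): Producer[A, C] =
      Producer((a: A) => run(a).map(f))
    def filter(p: B => Boolean): Producer[A, B] =
      Producer((a: A) => run(a).filter(p))
    def flatMap[C](f: B => List[C]): Producer[A, C] =
      Producer((a: A) => run(a).flatMap(f))
  }

  // Starting point: every input event is passed through unchanged.
  def source[A]: Producer[A, A] = Producer((a: A) => List(a))

  // Batch-style platform: the plan is a function over a complete data set.
  def batchPlan[A, B](p: Producer[A, B]): List[A] => List[B] =
    input => input.flatMap(p.run)

  // Streaming-style platform: the plan is a per-event handler that
  // pushes each result to a sink as soon as the event arrives.
  def streamPlan[A, B](p: Producer[A, B])(sink: B => Unit): A => Unit =
    event => p.run(event).foreach(sink)

  // Logic written once: split lines into words, keep hashtags, lowercase them.
  val hashtags: Producer[String, String] =
    source[String]
      .flatMap(_.split(" ").toList)
      .filter(_.startsWith("#"))
      .map(_.toLowerCase)
}
```

Here `batchPlan(hashtags)` plays the role of a batch pipe and `streamPlan(hashtags)(sink)` the role of a streaming topology; both run the same `hashtags` definition.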
  54. Merge results from Scalding + Storm (The same data processing logic can be reused, and results can be fetched from both.)
    K→V

  55. Merge results from Scalding + Storm
    Timeline: Batch 1, Batch 2, Batch 3, Batch 4, with a “now” marker
    Batch jobs: K→(V1, Batch 2)
    Realtime jobs: (K, Batch 3)→V2, (K, Batch 4)→V3
  56. Merge results from Scalding + Storm
    K→(V1, Batch 2)
    (K, Batch 3)→V2
    (K, Batch 4)→V3
    V1+V2+V3 = Vnow!
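
The V1+V2+V3 = Vnow merge can be sketched as follows. This is a minimal illustration in plain Scala, assuming values are counts combined by addition, and assuming realtime entries for batches the batch job has already covered are ignored; the store shapes, names, and numbers are hypothetical and do not come from the Summingbird API.

```scala
// Plain Scala sketch of merging batch and realtime results; NOT Summingbird's API.
object MergeSketch {
  type Key = String
  type BatchId = Int

  // batchStore:    K -> (V, last batch covered by the batch job)
  // realtimeStore: (K, batch) -> partial V computed by the realtime job
  def vNow(
      key: Key,
      batchStore: Map[Key, (Long, BatchId)],
      realtimeStore: Map[(Key, BatchId), Long]
  ): Long = {
    // If the batch job has no value for the key yet, start from zero.
    val (batchValue, lastBatch) = batchStore.getOrElse(key, (0L, Int.MinValue))
    // Assumption: only realtime partials newer than the batch result count.
    val realtimeValue = realtimeStore.iterator.collect {
      case ((k, batch), partial) if k == key && batch > lastBatch => partial
    }.sum
    batchValue + realtimeValue
  }

  // Made-up data: V1 = 100 through Batch 2, V2 = 20 (Batch 3), V3 = 5 (Batch 4).
  val batchStore: Map[Key, (Long, BatchId)] = Map("tweets" -> ((100L, 2)))
  val realtimeStore: Map[(Key, BatchId), Long] =
    Map(("tweets", 2) -> 7L, ("tweets", 3) -> 20L, ("tweets", 4) -> 5L)
}
```

With this data, `MergeSketch.vNow("tweets", MergeSketch.batchStore, MergeSketch.realtimeStore)` is 100 + 20 + 5 = 125; the realtime partial for Batch 2 is skipped because the batch result already covers it.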
  57. Summary

  58. Summary
    Scalding to implement daily batch jobs in Scala.
    Storm to implement daily stream processing jobs in Scala. (With Storm, realtime processing is in Scala too.)
    Summingbird to aggregate entire data pipelines. (With Summingbird, the data processing can be written in Scala.)
  59. Enjoy.Scala

  60. #thankyou Questions? Tweet to @niw :) (If you have questions, please tweet to @niw. That’s all.)