What we've got. A giant UDF plus a heavily patched Pig. Less development consistency: only the Pig jobs need special operational handling. Lots of copy and paste and no IDE support, so code is hard to reuse.
Scalding is a library for writing Hadoop jobs. It is based on Cascading, a Java library. It abstracts Hadoop's low-level APIs. It is tightly integrated with Scala. A REPL was introduced recently.
Scalding APIs. There are two kinds of API: the Fields-based API and the Type-safe API. The Type-safe API catches type mistakes at compile time. The API uses a lot of Scala magic, and the heavy use of implicits can make IDEs struggle.
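The difference between the two API styles can be illustrated in plain Scala, without Scalding itself. This is only an analogy under stated assumptions: `ApiStyles`, `Row`, `PageView`, `untypedTotal` and `typedTotals` are hypothetical names invented for this sketch and are not part of Scalding. The name-based version compiles even when the wrong field is summed and only fails at run time; the typed version would be rejected by the compiler for the same mistake.

```scala
import scala.util.Try

object ApiStyles {
  // Fields-style row: values are looked up by name and cast at run time.
  type Row = Map[String, Any]

  // Typed row: the compiler knows the type of every column.
  final case class PageView(url: String, millis: Long)

  val untypedRows: List[Row] = List(
    Map("url" -> "/home", "millis" -> 120L),
    Map("url" -> "/home", "millis" -> 80L),
    Map("url" -> "/about", "millis" -> 40L)
  )

  val typedRows: List[PageView] = List(
    PageView("/home", 120L),
    PageView("/home", 80L),
    PageView("/about", 40L)
  )

  // Compiles for any field name; summing a String field fails only when run.
  def untypedTotal(rows: List[Row], field: String): Try[Long] =
    Try(rows.map(row => row(field).asInstanceOf[Long]).sum)

  // Replacing `_.millis` with `_.url` here would be a compile error.
  def typedTotals(rows: List[PageView]): Map[String, Long] =
    rows.groupBy(_.url).map { case (url, views) =>
      url -> views.map(_.millis).sum
    }
}
```

Calling `ApiStyles.untypedTotal(ApiStyles.untypedRows, "url")` returns a `Failure`, which is the kind of mistake the Type-safe API moves to compile time.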
Storm is a platform for realtime data processing. Storm is similar to Hadoop, but for realtime processing. No data loss: Storm guarantees that data gets processed. Fault-tolerant.
Summingbird is a library that abstracts MapReduce processing. Summingbird abstracts the platform. Summingbird works with Scalding and Storm. Summingbird can aggregate results from both batch and realtime jobs.
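The idea of combining batch and realtime results can be sketched in plain Scala. This is a minimal illustration, not Summingbird's API: `MergeSketch`, the small `Monoid` trait, `merge` and `countWords` are hypothetical names made up for this example. It assumes each job produces a per-key count and that the counts are combined with an associative addition, so the same logic can run over a batch dataset and over recent events, and the two views can be merged afterwards.

```scala
object MergeSketch {
  // Minimal stand-in for an associative "plus" with a zero element.
  trait Monoid[A] {
    def zero: A
    def plus(x: A, y: A): A
  }

  object Monoid {
    // Addition on Long, placed in the companion so it is always found.
    implicit val longSum: Monoid[Long] = new Monoid[Long] {
      val zero: Long = 0L
      def plus(x: Long, y: Long): Long = x + y
    }
  }

  // Combine two per-key views, adding the values of keys present in both.
  def merge[K, V](batch: Map[K, V], realtime: Map[K, V])(implicit m: Monoid[V]): Map[K, V] =
    realtime.foldLeft(batch) { case (acc, (k, v)) =>
      acc.updated(k, m.plus(acc.getOrElse(k, m.zero), v))
    }

  // The job logic is written once, independent of where the lines come from.
  def countWords(lines: Iterable[String]): Map[String, Long] =
    lines
      .flatMap(line => line.split("\\s+").toSeq)
      .filter(_.nonEmpty)
      .foldLeft(Map.empty[String, Long])((acc, word) => merge(acc, Map(word -> 1L)))
}
```

For example, `MergeSketch.merge(MergeSketch.countWords(List("a b a")), MergeSketch.countWords(List("a c")))` combines a count computed over older data with a count computed over newer data into one view.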
Summary. Scalding to implement daily batch jobs in Scala. Storm to implement realtime stream processing jobs in Scala. Summingbird to aggregate entire data pipelines, also written in Scala.