“Running Apache Samza on Kubernetes” Recap : KubeCon2019@NA

Kubernetes Meetup Tokyo #26@Yahoo @yosshi_

Self-introduction
• 吉村翔太
• NTT Communications
• Data science team
• Infrastructure engineer / data engineering
• Kubernetes, Prometheus, etc.
• Hobby: board games
• Community activity: “Cloud Native Developers JP”
@yosshi_

A brief introduction to Samza

The session covered here
Reference: <https://sched.co/Uacc>

Today's agenda
• A brief introduction to Samza
• Recent trends in Hadoop
• A closer look at Samza

About Apache Samza
• Apache Samza 1.0.0 was released in November 2018

Samza on Kubernetes
• In addition to YARN (Hadoop's resource manager), Samza now also supports Kubernetes

Other stream processing engines that run on Kubernetes
• Spark on Kubernetes
• Flink on Kubernetes

Reference: Spark and Flink material
• Spark
  – Documentation
    • https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
  – KubeCon2019@NA session
    • Kubernetizing Big Data and ML Workloads at Uber - Mayank Bansal & Min Cai, Uber https://sched.co/Uaad
• Flink
  – Documentation
    • https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/kubernetes.html
  – KubeCon2019@NA session
    • Managing Apache Flink on Kubernetes - FlinkK8sOperator - Anand Swaminathan, Lyft https://sched.co/UabA

Recent trends in Hadoop

Hadoop trends (through September 2018)
• When running Hadoop on-premises, you generally use one of the distributions shown on the slide
• Going without a distribution is possible, but it is tough

Hadoop trends (the world from October 2018 onward)
• The merger of Cloudera and Hortonworks was announced in October 2018
• The merger was completed in January 2019

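Consolidation of the distributions (Unity1.0, Unity2.0)
• The existing distributions will reportedly be maintained for three years after the merger (until 2022?)
• They will reportedly be consolidated step by step
• All features unified around 2021?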

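Cloudera Data Platform (CDP)
• Currently supported on AWS; support on Azure and GCP is scheduled to start soon
• Is this what “Unity1.0” actually is?
• The architecture of the AWS version is based on EKS (Elastic Kubernetes Service) and S3
• What will they do about Kubernetes on-premises?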

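Reference: skill sets a data analytics organization needs
• Infrastructure, business knowledge, data science, data engineering
• IPA's “ITSS+” is a useful reference
• Statistics, R, Python, SQL, Hadoop, Spark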

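A data analytics organization that assumes Kubernetes
• Ideal: separate teams for the Kubernetes cluster, the Hadoop cluster, data engineering, data science, and the business side
• Reality: the data engineer also runs the Hadoop cluster and the Kubernetes cluster (the same person does it all), alongside the business side and data science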

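Choices for on-premises Hadoop environments going forward
• People using Hadoop today: hold out until the end of the maintenance period
• Trust CDP and wait
• A third way, e.g. do it ourselves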

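On-premises data pipelines going forward
• Even if you buy a distribution, if k8s is a prerequisite you end up with:
  – Batch/streaming processing, Pub/Sub, persistence
  – Kafka, Kubernetes, storage or DB
• Rather than waiting two or three years, it may be worth evaluating this ourselves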

A closer look at Samza

About Apache Samza (part 2)
• A remark by Chris Riccomini of LinkedIn, one of the Samza developers
  – “If Kafka is HDFS, then Samza is the equivalent of MapReduce”
• Where the name Samza comes from
  – Gregor Samsa, the protagonist of Franz Kafka's novel “The Metamorphosis”

About Apache Kafka (1/2)
• Developed at LinkedIn and open-sourced in 2011
• Key roles
  – Message queue
  – Message hub
• Figure: a world without Kafka vs. a world with Kafka
Reference: <https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying/>

About Apache Kafka (2/2)
• Key features
  – Partition
  – Offset
Reference: <https://kafka.apache.org/intro>
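To make the two terms concrete, here is a minimal in-memory sketch of a partitioned, append-only log. It is not Kafka's implementation: the class name `PartitionedLog`, its methods, and the CRC32-based key hashing are all assumptions made for illustration.

```python
from zlib import crc32


class PartitionedLog:
    """Toy stand-in for a topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Hash the key to choose a partition, so records that share a key
        # land in the same partition and keep their relative order.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        # The offset is simply the record's position within its partition.
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset):
        # Replay every record at or after `offset`; records are never modified.
        return self.partitions[partition][offset:]
```

A consumer that remembers the pair (partition, offset) can come back later and call `read` with the same values to continue where it left off, which is the property the fault-tolerance slide relies on.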

Typical Use Cases of Samza

Samza Features
• Duplicate processing is tolerated

Samza Concept Overview
• Samza processes streams. A stream is composed of immutable messages of a similar type or category. In Kafka a stream is a topic.

Advanced Concept Overview (1/2)
• Partition: each stream is broken into one or more partitions, each of which is an ordered, replayable sequence of records.
• Task: the unit of parallelism of the job, just as the partition is to the stream.
• Adding containers scales the job out (though the maximum degree of parallelism depends on the number of partitions)
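The scaling limit can be illustrated with a small sketch. It assumes one task per partition and a plain round-robin grouping of tasks onto containers; the function names are hypothetical and this is not Samza's actual grouper.

```python
def assign_tasks(num_partitions, num_containers):
    """Spread one task per partition across containers, round-robin."""
    assignment = {container: [] for container in range(num_containers)}
    for task in range(num_partitions):
        assignment[task % num_containers].append(task)
    return assignment


def busy_containers(assignment):
    # Count the containers that received at least one task.
    return sum(1 for tasks in assignment.values() if tasks)
```

With 4 partitions, `assign_tasks(4, 2)` gives each container two tasks, while `assign_tasks(4, 8)` leaves four containers idle: past the partition count, extra containers add no parallelism.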

Advanced Concept Overview (2/2)
• Job Coordinator
  – manage the assignment of tasks across the individual containers
  – monitor the liveness of individual containers
  – redistribute the tasks among the remaining ones during a failure
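The third responsibility, redistributing tasks after a failure, might look roughly like the sketch below. The least-loaded-survivor policy and the function name `redistribute` are assumptions for illustration, not the strategy Samza's Job Coordinator actually uses.

```python
def redistribute(assignment, failed):
    """Return a new assignment with the failed container's tasks moved to survivors."""
    survivors = {c: list(tasks) for c, tasks in assignment.items() if c != failed}
    if not survivors:
        raise RuntimeError("no live containers left")
    for task in assignment.get(failed, []):
        # Hand each orphaned task to the least-loaded survivor
        # (ties broken by the lower container id).
        target = min(survivors, key=lambda c: (len(survivors[c]), c))
        survivors[target].append(task)
    return survivors
```

For example, if container 0 held tasks 0 and 3 and fails, the two tasks are split between the remaining containers rather than piled onto one of them.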

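Fault tolerance
• Samza records the offset up to which it has processed, so after a failure it resumes from that record
• However, if a failure occurs after a message has been processed but before the offset is recorded, that message is processed again on recovery
Reference: <https://samza.apache.org/learn/documentation/latest/architecture/architecture-overview.html>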
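The duplicate-on-recovery case can be reproduced with a few lines. This is a simplified sketch that records the offset after every message and simulates a crash with a parameter; the function `run_task` and its arguments are hypothetical, and a real deployment would typically checkpoint less often.

```python
def run_task(messages, checkpoint, processed, crash_before_commit_at=None):
    """Process messages from the checkpointed offset; return False on a simulated crash."""
    offset = checkpoint.get("offset", 0)
    while offset < len(messages):
        processed.append(messages[offset])   # step 1: process the message
        if offset == crash_before_commit_at:
            return False                     # crash: the new offset is never recorded
        offset += 1
        checkpoint["offset"] = offset        # step 2: record how far we got
    return True


messages = ["a", "b", "c"]
checkpoint, processed = {}, []
run_task(messages, checkpoint, processed, crash_before_commit_at=1)  # fails after processing "b"
run_task(messages, checkpoint, processed)                            # restart from the checkpoint
print(processed)  # ['a', 'b', 'b', 'c']
```

Message "b" shows up twice because the restart begins at the last recorded offset, which is why downstream consumers need to tolerate duplicates.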

Samza & Kubernetes: Working Together

Workflow

Proposed Changes
• The Samza Operator, similar to the Samza AM in YARN, is the control hub for Samza applications running on Kubernetes. It is responsible for requesting Pods from Kubernetes and coordinating work assignment across Pods.
• The graph below describes the lifecycle of a Samza application running on Kubernetes.
Reference: <https://cwiki.apache.org/confluence/display/SAMZA/SEP-20%3A+Samza+on+Kubernetes>

Overview - Samza on Kubernetes

Node – Zoom In

Samza on YARN
• Samza leverages YARN for scheduling, resource-management, and deployment.

Kubernetes vs Apache YARN
• (no k8s version)

The future of Samza
• Adding support for other languages, like Python
• Hot-standby containers to support applications with strict downtime requirements
• Making it easy to auto-scale and auto-tune Samza applications
• Supporting machine learning related use cases
• Enabling end-to-end exactly once processing
Reference: <https://engineering.linkedin.com/blog/2018/11/samza-1-0--stream-processing-at-massive-scale>
