Developing real-time ML system with Apache Kafka (ko)

Lee Dongjin
November 12, 2019

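Real-time machine learning with Kafka: what to consider when developing a real-time prediction system with Kafka, the available alternatives, their respective pros and cons, and how they relate to one another.
Presented at Cloudera Sessions Seoul 2019, November 12, 2019.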

Kafka and friends: About the tools for monitoring, operating, and testing Kafka.
Presented at Cloudera Openday 2019, November 12, 2019.

Slides: Korean. Presentation: Korean.

Transcript

  1. Real-Time Machine Learning with Kafka. Lee Dongjin | dongjin@apache.org

  2. © 2019 Cloudera, Inc. All rights reserved. Before we begin: the problem we are trying to solve

    • Real-time Prediction • Diagram: Input Events -> Prediction System -> Predictions
  3. ...and the problems that come with it

    • Algorithm • Time-related Operations • Delivery Semantics • Latency • Language • Model
  4. 1. Latency

    • The most important factor • And also the one most easily overlooked. • However accurate a prediction is, it is useless if it arrives too late. • What affects it • The technology used • The model serving approach
  5. 1. Latency - The technology used

    • Related concerns • Parallel processing • Support for a variety of storage systems • Range of available algorithms • Alternatives • Spark (Structured) Streaming • Flink • Kafka Streams
  6. 1. Latency - Model serving approach (1)

    • Embedded Model • Pros • No Network Latency • Security • No Lock-in • Offline Inference • Cons • Integration • Model Size • Diagram: Input Event -> Application w/ model -> Prediction
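To make the embedded approach on this slide concrete, here is a minimal sketch in which the application scores each event in-process, so no network hop sits between the event and the prediction. The `LinearModel` class, its weights, and the in-memory `events` list are hypothetical stand-ins: in practice the events would be polled from a Kafka topic and the model loaded from a serialized file.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class LinearModel:
    """Hypothetical in-process model: a weighted sum plus a bias."""
    weights: Dict[str, float]
    bias: float

    def predict(self, features: Dict[str, float]) -> float:
        # Missing features are treated as 0.0.
        return self.bias + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )


def score_stream(model: LinearModel,
                 events: Iterable[Dict[str, float]]) -> Iterator[float]:
    """Score every event locally; the model lives in the same process."""
    for event in events:
        yield model.predict(event)


# Stand-in for records polled from an input topic (assumption).
events = [{"clicks": 2.0, "views": 10.0}, {"clicks": 0.0, "views": 4.0}]
model = LinearModel(weights={"clicks": 0.5, "views": 0.25}, bias=1.0)
predictions = list(score_stream(model, events))
```

The trade-off listed under "Cons" shows up here as well: the model object is part of the application, so updating it means redeploying or reloading the application.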
  7. 1. Latency - Model serving approach (2)

    • Dedicated Model Serving • Build your own • TensorFlow Serving • Cloud: AzureML, SageMaker, ... • Diagram: Input Event -> Application w/o model -> Prediction, with request/response between the application and a Model Server
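For contrast with the embedded approach, the sketch below shows what the request/response leg of the diagram might look like against TensorFlow Serving's REST predict endpoint (`/v1/models/<name>:predict` with an `instances` payload). The host, port, and model name `ads_ctr` are assumptions made up for illustration, and the request is only built, not sent, so the example runs without a server.

```python
import json
import urllib.request
from typing import List


def build_predict_request(base_url: str, model_name: str,
                          instances: List[List[float]]) -> urllib.request.Request:
    """Build (but do not send) a TensorFlow Serving REST predict request."""
    url = f"{base_url}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_predictions(response_body: bytes) -> list:
    """Extract the "predictions" field from a predict response body."""
    return json.loads(response_body)["predictions"]


# Hypothetical host and model name, for illustration only.
request = build_predict_request("http://localhost:8501", "ads_ctr", [[2.0, 10.0]])
# Sending it would look like: urllib.request.urlopen(request, timeout=0.05)
```

Every prediction now pays a network round trip, which is why the timeout passed to the HTTP call effectively becomes part of the latency budget.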
  8. 2. Model

    • "How should the training result be stored?" • It affects many other things • Latency • How training is done • Cloud or not? • How the team collaborates • Language?
  9. 2. Model

    • Technology-Specific Format • Spark ML format, Pickle, SavedModel ... • Standard Format • PMML • PFA
  10. 2. Model

    • Generated Code • H2O • Model as a Service (MaaS) • by Cloud Platforms • Dedicated Model Server
  11. 3. Algorithm

    • Tied not only to the task at hand but also to the technology, the model, and the model serving approach • Latency constraints • May matter less than you would expect • Retraining • Hyperparameter Setting • Caution: hidden issues
  12. 3. Algorithm

    • Log Cleanup • Retention, Compaction • Default: 'enabled, and deletion starts once 7 days have passed' • log.cleanup.policy • log.retention.{ms,minutes,hours} • log.retention.bytes • Schema management
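The retention settings on this slide matter for retraining because data older than the retention period may no longer be in the topic. The following sketch resolves the effective time-based retention from a dictionary of broker properties, using the documented precedence (`log.retention.ms` over `log.retention.minutes` over `log.retention.hours`, default 168 hours). The helper names and the idea of checking a "training window" are illustrative assumptions, and real deletion operates on whole log segments, so treat the result as an approximation.

```python
from typing import Dict, Optional

MINUTE_MS = 60 * 1000
HOUR_MS = 60 * MINUTE_MS


def effective_retention_ms(broker_config: Dict[str, str]) -> Optional[int]:
    """Resolve time-based retention in ms; None means no time limit.

    Precedence: log.retention.ms, then log.retention.minutes,
    then log.retention.hours (default 168 hours, i.e. 7 days).
    """
    if "log.retention.ms" in broker_config:
        millis = int(broker_config["log.retention.ms"])
    elif "log.retention.minutes" in broker_config:
        millis = int(broker_config["log.retention.minutes"]) * MINUTE_MS
    else:
        millis = int(broker_config.get("log.retention.hours", "168")) * HOUR_MS
    # A negative value (e.g. -1) disables time-based deletion.
    return None if millis < 0 else millis


def covers_training_window(broker_config: Dict[str, str], training_days: int) -> bool:
    """Hypothetical check: is the retention long enough for a retraining job?"""
    retention = effective_retention_ms(broker_config)
    return retention is None or retention >= training_days * 24 * HOUR_MS
```

With an empty configuration the function returns the 7-day default, so a job that expects 30 days of history in the topic would fail this check unless retention is raised or the data is archived elsewhere.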
  13. 4. etc

    • Language • Processing speed, collaboration • There is no need to insist on Java or Scala! • Go (sarama, confluent-kafka-go) • Python (kafka-python, confluent-kafka-python) • .NET (confluent-kafka-dotnet)
  14. 4. etc

    • Delivery Semantics • At Least Once • Exactly Once
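The difference between the two semantics can be illustrated with a small simulation. Under at-least-once, a consumer that processes a record and then crashes before committing its offset will process that record again after restart. The sketch below is plain Python with no Kafka client; the `applied` set stands in for state stored alongside the output and assumed to survive the crash. Kafka's own exactly-once support relies on idempotent producers and transactions, which are not shown here.

```python
from typing import List, Set, Tuple

Record = Tuple[int, str]  # (offset, payload); offsets equal list indices here


def consume(log: List[Record], crash_after_offset: int, idempotent: bool) -> List[str]:
    """Process-then-commit loop with one simulated crash before a commit."""
    outputs: List[str] = []
    applied: Set[int] = set()  # offsets already applied; assumed to survive the crash
    committed = 0              # next offset to read after a restart
    crashed = False
    while committed < len(log):
        offset, payload = log[committed]
        # The side effect happens before the offset is committed.
        if not (idempotent and offset in applied):
            outputs.append(payload)
            applied.add(offset)
        if offset == crash_after_offset and not crashed:
            crashed = True     # crash: the commit below never happens
            continue           # restart from the last committed offset
        committed = offset + 1
    return outputs


log = [(0, "a"), (1, "b"), (2, "c")]
at_least_once = consume(log, crash_after_offset=1, idempotent=False)
deduplicated = consume(log, crash_after_offset=1, idempotent=True)
```

For a prediction system, the duplicate in the first run means the same prediction may be emitted twice, which downstream consumers either have to tolerate or deduplicate.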
  15. 4. etc

    • Time-related Operations • Which ‘time’? • Window Type • Tumbling, Hopping, Sliding, Session, ... • Window Size • Can affect processing speed as well as accuracy • Case: Pinterest Ads Platform (Kafka Summit 2018)
  16. 4. etc - ‘Time’

    • Event Time • Created Time • Append Time • Processing Time • Wall-clock time • Stream Time
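Of the notions listed here, stream time is probably the least familiar. A common definition, used for example in Kafka Streams, is the highest record timestamp observed so far: it advances only when records arrive and never moves backwards on out-of-order data. The sketch below tracks it over a list of event timestamps; the function name and sample values are made up for illustration.

```python
from typing import Iterable, List


def stream_times(event_timestamps: Iterable[int]) -> List[int]:
    """Return the stream time observed after each record is processed."""
    observed: List[int] = []
    current = None
    for ts in event_timestamps:
        # Stream time is the running maximum of record timestamps.
        current = ts if current is None else max(current, ts)
        observed.append(current)
    return observed


timestamps = [100, 105, 103, 110]  # the third record arrives out of order
times = stream_times(timestamps)
# How far each record lags behind stream time when it is processed.
lateness = [st - ts for st, ts in zip(times, timestamps)]
```

Unlike wall-clock time, this value stands still when no records arrive, which affects when time-based windows are considered closed.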
  17. 4. etc - tumbling window

    • source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
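Since the slide itself is only a figure, here is a small sketch of the idea: tumbling windows are fixed-size, non-overlapping, and each event belongs to exactly one of them. The code assumes windows aligned to multiples of the window size and half-open intervals `[start, end)`, which is the convention Kafka Streams uses; other engines may treat the boundaries differently.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Window = Tuple[int, int]  # [start, end) in milliseconds


def tumbling_window(timestamp_ms: int, size_ms: int) -> Window:
    """Return the single tumbling window that contains the timestamp."""
    start = timestamp_ms - (timestamp_ms % size_ms)
    return (start, start + size_ms)


def count_per_window(timestamps: Iterable[int], size_ms: int) -> Dict[Window, int]:
    """Count events per tumbling window."""
    return dict(Counter(tumbling_window(ts, size_ms) for ts in timestamps))


counts = count_per_window([1_000, 4_999, 5_000, 12_500], size_ms=5_000)
```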
  18. 4. etc - hopping window

    • source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
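Hopping windows have a fixed size but start every `advance` interval, so they overlap and one event can fall into several of them. The sketch below lists every window containing a given timestamp, assuming window starts aligned to non-negative multiples of the advance and half-open intervals `[start, end)`; the function name and parameters are illustrative.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in milliseconds


def hopping_windows(timestamp_ms: int, size_ms: int, advance_ms: int) -> List[Window]:
    """Return all hopping windows that contain the timestamp, earliest first."""
    windows: List[Window] = []
    # Latest window start that is not after the timestamp.
    start = timestamp_ms - (timestamp_ms % advance_ms)
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp_ms - size_ms and start >= 0:
        windows.append((start, start + size_ms))
        start -= advance_ms
    return sorted(windows)


overlapping = hopping_windows(12_000, size_ms=10_000, advance_ms=5_000)
```

When the advance equals the size, the result collapses to a single window, which is the tumbling case.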
  19. 4. etc - sliding window

    • source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
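Sliding windows are defined relative to the events rather than to fixed boundaries. As a simplified sketch, the code below evaluates a window of fixed length ending at each event, counting the events in `(t - size, t]`. This is only one reading of the concept: engines differ in the details (some also emit a result when an event leaves the window), and this version assumes the timestamps are already sorted.

```python
from bisect import bisect_right
from typing import List


def sliding_counts(sorted_timestamps: List[int], size_ms: int) -> List[int]:
    """For each event at time t, count events with timestamps in (t - size_ms, t]."""
    counts: List[int] = []
    for t in sorted_timestamps:
        first = bisect_right(sorted_timestamps, t - size_ms)  # first index with ts > t - size_ms
        last = bisect_right(sorted_timestamps, t)             # one past the last index with ts <= t
        counts.append(last - first)
    return counts


window_counts = sliding_counts([1_000, 3_000, 9_000, 11_000, 12_000], size_ms=10_000)
```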
  20. 4. etc - session window

    • source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
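Session windows group events separated by periods of inactivity, so their size depends on the data. The sketch below starts a new session whenever the gap to the previous event exceeds `gap_ms`. It assumes sorted timestamps and ignores refinements such as a maximum session duration, which some engines add.

```python
from typing import List, Tuple

Session = Tuple[int, int]  # (first event time, last event time)


def session_windows(sorted_timestamps: List[int], gap_ms: int) -> List[Session]:
    """Group sorted timestamps into sessions split by gaps longer than gap_ms."""
    sessions: List[Session] = []
    for ts in sorted_timestamps:
        if sessions and ts - sessions[-1][1] <= gap_ms:
            # Still within the inactivity gap: extend the current session.
            sessions[-1] = (sessions[-1][0], ts)
        else:
            # Gap exceeded (or first event): open a new session.
            sessions.append((ts, ts))
    return sessions


sessions = session_windows([1_000, 2_000, 2_500, 9_000, 9_500], gap_ms=3_000)
```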
  21. 4. etc - technology (Spark Structured Streaming)

    • Event Time is the default • Processing Time: current_timestamp() • Session Window is not supported • https://joeyfaherty.github.io/2018/04/14/spark-ss-custom-session-windows/ • http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-14/
  22. 4. etc - technology (Flink)

    • Supports both Event Time and Processing Time • Does not define a separate Hopping Window • 'Sliding Window based on Event Time'
  23. 4. etc - technology (Kafka Streams)

    • Sliding Window can be used only for joins
  24. Summary

    • LATENCY • Algorithm • Hidden issues: Log Cleanup, Schema management, ... • Language, Technology & Model • Keep in mind that you have to plan for collaboration. • etc • Delivery Semantics • Time-related operations
  25. THANK YOU