Developing real-time ML system with Apache Kafka (ko)

Lee Dongjin
November 12, 2019

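Real-time machine learning with Kafka: what to consider when developing a real-time prediction system with Kafka, the available alternatives, and the trade-offs of each and how they relate to one another.
Presented at Cloudera Sessions Seoul 2019 on November 12, 2019.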

Kafka and friends: About the tools for monitoring, operating, and testing Kafka.
Presented at Cloudera Openday 2019, November 12, 2019.

Slides: Korean. Presentation: Korean.

Transcript

  1. Real-time Machine Learning
    with KAFKA
    Lee Dongjin | [email protected]

  2. © 2019 Cloudera, Inc. All rights reserved. 2
    Before we begin: the problem we are trying to solve
    • Real-time Prediction
    Diagram: Input Events -> Prediction System -> Predictions
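
As a rough illustration of the problem statement, the sketch below models the prediction system as a function from a stream of input events to a stream of predictions. The `Event` type and the threshold "model" are hypothetical stand-ins; a real system would read events from and write predictions to Kafka topics.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Event:
    key: str
    value: float


def predict_stream(events: Iterable[Event],
                   model: Callable[[float], int]) -> Iterator[Tuple[str, int]]:
    """Yield one (key, prediction) pair per input event, in arrival order."""
    for event in events:
        yield event.key, model(event.value)


def threshold_model(x: float) -> int:
    """Hypothetical stand-in for a trained model: a fixed threshold."""
    return 1 if x >= 0.5 else 0


predictions = list(predict_stream(
    [Event("a", 0.9), Event("b", 0.1)], threshold_model))
```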

  3. ...and the problems that come with it
    • Algorithm
    • Time-related Operations
    • Delivery Semantics
    • Latency
    • Language
    • Model

  4. 1. Latency
    • The most important factor
    ● Also the one most easily overlooked.
    ● However accurate a prediction is, it is useless if it arrives too late.
    • What affects it
    ● The technology used
    ● The model serving approach

  5. 1. Latency - The technology used
    • Related concerns
    ● Parallel processing
    ● Support for a variety of storage systems
    ● Range of available algorithms
    • Alternatives
    ● Spark (Structured) Streaming
    ● Flink
    ● Kafka Streams

  6. 1. Latency - Model serving approach (1)
    • Embedded Model
    ● Pros
    • No Network Latency
    • Security
    • No Lock-in
    • Offline Inference
    ● Cons
    • Integration
    • Model Size
    Diagram: Input Event -> Application (w/ model) -> Prediction

  7. 1. Latency - Model serving approach (2)
    • Dedicated Model Serving
    ● Built in-house
    ● TensorFlow Serving
    ● Cloud: AzureML, SageMaker, ...
    Diagram: Input Event -> Application (w/o model) -> Prediction; the application sends a request to the Model Server and receives a response
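
The latency difference between the two serving approaches can be sketched as follows. This is a simulation only: the "model server" is an in-process function with an artificial delay standing in for the network round trip, and all names here are hypothetical.

```python
import time
from typing import Callable, Tuple

Model = Callable[[float], int]


def threshold_model(x: float) -> int:
    """Hypothetical stand-in for a trained model."""
    return 1 if x >= 0.5 else 0


def embedded_predictor(model: Model) -> Model:
    """Embedded model: the application calls the model in-process."""
    return model


def remote_predictor(model: Model, round_trip_seconds: float) -> Model:
    """Dedicated model server, simulated: each call pays a round trip."""
    def call(x: float) -> int:
        time.sleep(round_trip_seconds)  # stand-in for request/response time
        return model(x)
    return call


def timed(predict: Model, x: float) -> Tuple[int, float]:
    """Return the prediction and the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = predict(x)
    return result, time.perf_counter() - start


embedded = embedded_predictor(threshold_model)
remote = remote_predictor(threshold_model, round_trip_seconds=0.01)
embedded_result, embedded_latency = timed(embedded, 0.9)
remote_result, remote_latency = timed(remote, 0.9)
```

Both paths return the same prediction; only the per-event latency differs, which is the trade-off the two slides above describe.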

  8. 2. Model
    • "How should the training result be stored?"
    • It affects many other things
    ● Latency
    ● Training approach
    • cloud or not?
    ● Collaboration
    • language?

  9. 2. Model
    • Technology Specific Format
    ● Spark ML format, Pickle, SavedModel ...
    • Standard Format
    ● PMML
    ● PFA
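
As a minimal sketch of a technology-specific format, the example below stores and restores a training result with Python's `pickle`. The parameter dictionary and the `predict` helper are hypothetical. Note that a pickle can only be loaded from Python and should only be loaded from a trusted source, which is part of why the format choice affects language and collaboration.

```python
import pickle
from typing import Dict, List

# Hypothetical "training result": the parameters of a linear model.
trained: Dict[str, object] = {"weights": [0.4, -1.2], "bias": 0.1}

blob = pickle.dumps(trained)    # what a training job would store
restored = pickle.loads(blob)   # what a serving application would load


def predict(params: Dict[str, object], features: List[float]) -> float:
    """Apply the restored linear model to one feature vector."""
    weights = params["weights"]
    return sum(w * x for w, x in zip(weights, features)) + params["bias"]


score = predict(restored, [1.0, 0.0])
```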

  10. 2. Model
    • Generated Code
    ● H2O
    • Model as a Service (MaaS)
    ● by Cloud Platforms
    ● Dedicated Model Server

  11. 3. Algorithm
    • Tied not only to the task at hand but also to the technology, the model, and the model serving approach
    ● Latency constraints
    • May matter less than you expect
    ● Retraining
    ● Hyperparameter Setting
    • Caution: hidden problems

  12. 3. Algorithm
    • Log Cleanup
    ● Retention, Compaction
    ● Default: 'enabled, and deletion begins once 7 days have passed'
    ● log.cleanup.policy
    ● log.retention.{ms,minutes,hours}
    ● log.retention.bytes
    • Schema management
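
To see why log cleanup is a hidden problem for retraining, the sketch below resolves the effective time-based retention from the broker settings listed above and checks whether a training window still fits inside it. It assumes the documented precedence (`log.retention.ms` over `log.retention.minutes` over `log.retention.hours`) and the 168-hour default. The helper names are hypothetical, and size-based retention (`log.retention.bytes`) is ignored here.

```python
from typing import Mapping

MINUTE_MS = 60 * 1000
HOUR_MS = 60 * MINUTE_MS
DAY_MS = 24 * HOUR_MS


def effective_retention_ms(config: Mapping[str, str]) -> int:
    """Resolve time-based retention: ms wins over minutes, minutes over hours."""
    if "log.retention.ms" in config:
        return int(config["log.retention.ms"])
    if "log.retention.minutes" in config:
        return int(config["log.retention.minutes"]) * MINUTE_MS
    # Broker default: 168 hours, i.e. 7 days.
    return int(config.get("log.retention.hours", "168")) * HOUR_MS


def covers_training_window(config: Mapping[str, str], days: int) -> bool:
    """True if records from the last `days` days are still retained."""
    return effective_retention_ms(config) >= days * DAY_MS


default_ms = effective_retention_ms({})
```

With the defaults, a job that retrains on the last 30 days of a topic would find most of its input already deleted.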

  13. 4. etc
    • Language
    ● Processing speed, collaboration
    • There is no need to insist on Java or Scala!
    ● Go (sarama, confluent-kafka-go)
    ● Python (kafka-python, confluent-kafka-python)
    ● .NET (confluent-kafka-dotnet)

  14. 4. etc
    • Delivery Semantics
    ● At Least Once
    ● Exactly Once
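
The practical difference between the two semantics can be illustrated with a redelivered record. This sketch does not use Kafka's transactional machinery. It only shows the effect: under at-least-once delivery a duplicate reaches the output, while an idempotent consumer that remembers processed offsets hides it. The record shape and function names are assumptions made for the example.

```python
from typing import Iterable, List, Set, Tuple

Record = Tuple[int, str]  # (offset, payload) within a single partition


def process_at_least_once(records: Iterable[Record]) -> List[str]:
    """Redelivered records are processed again, so duplicates appear."""
    return [payload for _, payload in records]


def process_idempotently(records: Iterable[Record]) -> List[str]:
    """Skip offsets already seen, so a redelivery has no visible effect."""
    seen: Set[int] = set()
    out: List[str] = []
    for offset, payload in records:
        if offset in seen:
            continue
        seen.add(offset)
        out.append(payload)
    return out


# Offset 1 arrives twice, e.g. after a restart before the offset commit.
delivered = [(0, "a"), (1, "b"), (1, "b"), (2, "c")]
with_duplicates = process_at_least_once(delivered)
deduplicated = process_idempotently(delivered)
```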

  15. 4. etc
    • Time-related Operations
    ● What ‘time’?
    ● Window Type
    • Tumbling, Hopping, Sliding, Session, ...
    ● Window Size
    • Can affect processing speed as well as accuracy
    ● Case: Pinterest Ads Platform (Kafka Summit 2018)
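
A small sketch of how window type and size determine which windows an event belongs to. Windows are half-open intervals `[start, end)` over integer millisecond timestamps. Restricting window starts to non-negative multiples of the hop is an assumption made here for simplicity, and the function names are hypothetical.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # [start, end) in milliseconds


def tumbling_window(ts: int, size: int) -> Window:
    """The single fixed-size, non-overlapping window that contains ts."""
    start = ts - ts % size
    return (start, start + size)


def hopping_windows(ts: int, size: int, hop: int) -> List[Window]:
    """All windows of length `size`, starting every `hop` ms, containing ts."""
    windows: List[Window] = []
    start = ts - ts % hop  # latest window start that is <= ts
    # A window contains ts exactly when start <= ts < start + size.
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)


one = tumbling_window(7, size=10)
many = hopping_windows(7, size=10, hop=5)
```

With `size=10` and `hop=5`, each event lands in two windows, so the same input produces twice as many window updates as the tumbling case. This is one way window settings affect processing speed as well as accuracy.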

  16. 4. etc - ‘Time’
    • Event Time
    ● Created Time
    ● Append Time
    • Processing Time
    ● Wall-clock time
    ● Stream Time

  17. 4. etc - tumbling window

    source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

  18. 4. etc - hopping window

    source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

  19. 4. etc - sliding window

    source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

  20. 4. etc - session window

    source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
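
Session windows differ from the fixed-size types in that their boundaries come from the data. A minimal sketch, assuming a session is closed once the gap since the previous event exceeds a timeout. A gap exactly equal to the timeout is treated as the same session, which is a choice made here, and no maximum session duration is applied.

```python
from typing import List


def session_windows(timestamps: List[int], gap: int) -> List[List[int]]:
    """Group timestamps into sessions separated by more than `gap`."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # gap exceeded: start a new session
    return sessions


sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```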

  21. 4. etc - technology (Spark Structured Streaming)
    • Event Time is the default
    ● Processing Time: current_timestamp()
    • Session Window is not supported
    ● https://joeyfaherty.github.io/2018/04/14/spark-ss-custom-session-windows/
    ● http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-14/

  22. 4. etc - technology (Flink)
    • Supports both Event Time and Processing Time
    • Does not define a separate Hopping Window
    ● 'Sliding Window based on Event Time'

  23. 4. etc - technology (Kafka Streams)
    • Sliding Window can only be used for joins

  24. Summary
    • LATENCY
    • Algorithm
    ● Hidden problems: Log Cleanup, Schema management, ...
    • Language, Technology & Model
    ● Keep in mind that you must plan for collaboration.
    • etc
    ● Delivery Semantics
    ● Time-related operations

  25. THANK YOU
