Slide 1

Slide 1 text

Real-Time Machine Learning with Kafka
Lee Dongjin | [email protected]

Slide 2

Slide 2 text

© 2019 Cloudera, Inc. All rights reserved.
Before we begin: the problem we are trying to solve
• Real-time Prediction
[Diagram: Input Events → Prediction System → Predictions]

Slide 3

Slide 3 text

...and the issues that come with it
• Algorithm
• Time-related Operations
• Delivery Semantics
• Latency
• Language
• Model

Slide 4

Slide 4 text

1. Latency
• The most important factor
  ● Also the one most easily overlooked.
  ● However accurate a prediction is, it is useless if it arrives too late.
• What affects it
  ● The technology used
  ● The Model Serving approach

Slide 5

Slide 5 text

1. Latency - Technology used
• Related concerns
  ● Parallel processing
  ● Support for a variety of storage systems
  ● Range of available algorithms
• Alternatives
  ● Spark (Structured) Streaming
  ● Flink
  ● Kafka Streams

Slide 6

Slide 6 text

1. Latency - Model Serving approach (1)
• Embedded Model
  ● Pros
    • No Network Latency
    • Security
    • No Lock-in
    • Offline Inference
  ● Cons
    • Integration
    • Model Size
[Diagram: Input Event → Application w/ model → Prediction]
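A minimal sketch of the Embedded Model approach, assuming a hypothetical linear model whose weights ship inside the application. `EmbeddedModel`, `handle_events`, and the feature names are illustrative and not from the talk. In a real deployment the events would come from a Kafka consumer rather than a Python list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class EmbeddedModel:
    """Hypothetical linear model bundled with the application."""
    weights: Dict[str, float]
    bias: float = 0.0

    def predict(self, features: Dict[str, float]) -> float:
        # Scoring is a local function call: no network round trip.
        return self.bias + sum(
            w * features.get(name, 0.0) for name, w in self.weights.items()
        )


def handle_events(model: EmbeddedModel,
                  events: Iterable[Dict[str, float]]) -> List[float]:
    # Stand-in for a consumer loop: each input event yields one prediction.
    return [model.predict(event) for event in events]


model = EmbeddedModel(weights={"clicks": 0.5, "views": 0.25}, bias=1.0)
predictions = handle_events(
    model, [{"clicks": 2.0, "views": 4.0}, {"clicks": 0.0}]
)
```

The trade-off listed under Cons shows up here too: the weights live in the application process, so updating the model means redeploying or reloading the application.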

Slide 7

Slide 7 text

1. Latency - Model Serving approach (2)
• Dedicated Model Serving
  ● Built in-house
  ● TensorFlow Serving
  ● Cloud: AzureML, SageMaker, ...
[Diagram: Input Event → Application w/o model → Prediction; the application sends a request to the Model Server and receives a response]

Slide 8

Slide 8 text

2. Model
• "How should the training result be stored?"
• It affects many other decisions
  ● Latency
  ● How training is done
    • cloud or not?
  ● How teams collaborate
    • language?

Slide 9

Slide 9 text

2. Model
• Technology Specific Format
  ● Spark ML format, Pickle, SavedModel ...
• Standard Format
  ● PMML
  ● PFA
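To illustrate what "technology-specific" means in practice, here is a small sketch using Python's standard `pickle` module. The `trained` dictionary is a hypothetical stand-in for a fitted model, not a real training result.

```python
import pickle

# Hypothetical "training result": plain Python data standing in for a fitted model.
trained = {"weights": [0.5, 0.25], "bias": 1.0}

# Serialize to bytes. Only a Python process can load this blob back.
blob = pickle.dumps(trained)

# Deserialize. Only do this with data from a trusted source.
restored = pickle.loads(blob)
```

The blob round-trips cleanly within Python, but a Java or Go service cannot read it directly. That is the collaboration cost the standard formats (PMML, PFA) aim to avoid.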

Slide 10

Slide 10 text

2. Model
• Generated Code
  ● H2O
• Model as a Service (MaaS)
  ● by Cloud Platforms
  ● Dedicated Model Server

Slide 11

Slide 11 text

3. Algorithm
• Tied not only to the task at hand but also to the technology, the Model, and the Model Serving approach
  ● Latency constraints
• May matter less than expected
  ● Retraining
  ● Hyperparameter Setting
• Caution: hidden problems

Slide 12

Slide 12 text

3. Algorithm
• Log Cleanup
  ● Retention, Compaction
  ● Default: 'enabled, and deletion starts once 7 days have passed'
  ● log.cleanup.policy
  ● log.retention.{ms,minutes,hours}
  ● log.retention.bytes
• Schema management
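The three time-based retention settings interact: on the broker, `log.retention.ms` takes precedence over `log.retention.minutes`, which takes precedence over `log.retention.hours` (default 168, i.e. 7 days). The sketch below resolves that precedence. `effective_retention_ms` is a hypothetical helper written for illustration and is not part of any Kafka client. It also ignores size-based retention (`log.retention.bytes`) and topic-level overrides.

```python
from typing import Mapping, Optional

DEFAULT_RETENTION_HOURS = 168  # Kafka broker default: 7 days


def effective_retention_ms(broker_config: Mapping[str, int]) -> Optional[int]:
    """Resolve time-based retention in milliseconds.

    Precedence: ms over minutes over hours.
    Returns None when log.retention.ms is -1 (no time limit).
    """
    if "log.retention.ms" in broker_config:
        ms = broker_config["log.retention.ms"]
        return None if ms == -1 else ms
    if "log.retention.minutes" in broker_config:
        return broker_config["log.retention.minutes"] * 60_000
    hours = broker_config.get("log.retention.hours", DEFAULT_RETENTION_HOURS)
    return hours * 3_600_000


default_ms = effective_retention_ms({})
```

This is why log cleanup is a hidden problem for machine learning: if training reads directly from a topic, events older than the effective retention may already be gone when retraining runs.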

Slide 13

Slide 13 text

4. etc
• Language
  ● Processing speed, collaboration
• There is no need to insist on Java or Scala!
  ● go (sarama, confluent-kafka-go)
  ● python (kafka-python, confluent-kafka-python)
  ● .NET (confluent-kafka-dotnet)

Slide 14

Slide 14 text

4. etc
• Delivery Semantics
  ● At Least Once
  ● Exactly Once
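The difference between the two semantics can be sketched without a broker. The simulation below assumes a consumer that crashes after processing some events but before committing its offset, so those events are delivered again on restart. All names and the `(event_id, amount)` event shape are hypothetical. Deduplicating by event ID is one way to get exactly-once results on top of at-least-once delivery. It does not model Kafka's own exactly-once support, which relies on idempotent producers and transactions.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (event_id, amount); both fields are hypothetical


def consume_at_least_once(log: List[Event], crash_after: int) -> List[Event]:
    """Return every delivery seen when the consumer crashes before committing.

    The first run processes `crash_after` events but commits nothing, so the
    restart re-reads the log from offset 0 and those events arrive twice.
    """
    first_run = log[:crash_after]
    second_run = log  # committed offset is still 0
    return first_run + second_run


def total_naive(deliveries: Iterable[Event]) -> int:
    # Counts duplicates: the result is inflated under at-least-once delivery.
    return sum(amount for _, amount in deliveries)


def total_idempotent(deliveries: Iterable[Event]) -> int:
    # Skips event IDs that were already applied, so redelivery is harmless.
    seen = set()
    total = 0
    for event_id, amount in deliveries:
        if event_id in seen:
            continue
        seen.add(event_id)
        total += amount
    return total


log = [("a", 10), ("b", 20), ("c", 30)]
deliveries = consume_at_least_once(log, crash_after=2)
```

For a prediction pipeline the same concern applies to features: a duplicated event can silently skew an aggregate that feeds the model.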

Slide 15

Slide 15 text

4. etc
• Time-related Operations
  ● What ‘time’?
  ● Window Type
    • Tumbling, Hopping, Sliding, Session, ...
  ● Window Size
    • Can affect processing speed as well as accuracy
  ● Case: Pinterest Ads Platform (Kafka Summit 2018)

Slide 16

Slide 16 text

4. etc - ‘Time’
• Event Time
  ● Created Time
  ● Append Time
• Processing Time
  ● Wall-clock time
  ● Stream Time

Slide 17

Slide 17 text

4. etc - tumbling window
• source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
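A tumbling window splits time into fixed-size, non-overlapping intervals, so every event belongs to exactly one window. A minimal sketch, assuming millisecond timestamps and half-open `[start, end)` boundaries (engines differ on which boundary is inclusive). `tumbling_window` is an illustrative helper, not an API of any streaming engine.

```python
from typing import Tuple


def tumbling_window(timestamp_ms: int, size_ms: int) -> Tuple[int, int]:
    """Return the single [start, end) window that contains the timestamp."""
    start = timestamp_ms - (timestamp_ms % size_ms)
    return start, start + size_ms


window = tumbling_window(12_345, size_ms=5_000)
```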

Slide 18

Slide 18 text

4. etc - hopping window
• source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
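A hopping window has a fixed size but advances by a smaller hop, so windows overlap and one event can fall into several of them. A minimal sketch under the same assumptions as before: millisecond timestamps, half-open `[start, end)` boundaries, and windows aligned to multiples of the hop starting at 0. `hopping_windows` is an illustrative helper, not an engine API.

```python
from typing import List, Tuple


def hopping_windows(timestamp_ms: int, size_ms: int,
                    hop_ms: int) -> List[Tuple[int, int]]:
    """Return every [start, end) window containing the timestamp, earliest first."""
    windows = []
    # Latest window start at or before the timestamp.
    start = timestamp_ms - (timestamp_ms % hop_ms)
    # Walk backwards while the window still covers the timestamp.
    while start > timestamp_ms - size_ms and start >= 0:
        windows.append((start, start + size_ms))
        start -= hop_ms
    return windows[::-1]


windows = hopping_windows(12_000, size_ms=10_000, hop_ms=5_000)
```

When the hop equals the size, each event lands in exactly one window, which is the tumbling case.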

Slide 19

Slide 19 text

4. etc - sliding window
• source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions

Slide 20

Slide 20 text

4. etc - session window
• source: https://docs.microsoft.com/en-us/azure/stream-analytics/stream-analytics-window-functions
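A session window has no fixed size: it groups events that arrive close together and closes after a period of inactivity. A minimal sketch, assuming millisecond timestamps for a single key and treating a pause of exactly `gap_ms` as still within the session (an assumption; engines differ). Some engines also cap the maximum session duration, which this sketch does not model. `session_windows` is an illustrative helper, not an engine API.

```python
from typing import Iterable, List


def session_windows(timestamps_ms: Iterable[int], gap_ms: int) -> List[List[int]]:
    """Group timestamps into sessions separated by more than gap_ms of inactivity."""
    sessions: List[List[int]] = []
    for ts in sorted(timestamps_ms):
        if sessions and ts - sessions[-1][-1] <= gap_ms:
            sessions[-1].append(ts)  # still within the inactivity gap
        else:
            sessions.append([ts])    # gap exceeded: start a new session
    return sessions


sessions = session_windows([1_000, 1_500, 9_000, 9_200, 20_000], gap_ms=2_000)
```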

Slide 21

Slide 21 text

4. etc - technology (Spark Structured Streaming)
• Event Time is the default
  ● Processing Time: current_timestamp()
• Session Window is not supported
  ● https://joeyfaherty.github.io/2018/04/14/spark-ss-custom-session-windows/
  ● http://blog.madhukaraphatak.com/introduction-to-spark-structured-streaming-part-14/

Slide 22

Slide 22 text

4. etc - technology (Flink)
• Supports both Event Time and Processing Time
• Does not define a separate Hopping Window
  ● 'Sliding Window based on Event Time'

Slide 23

Slide 23 text

4. etc - technology (Kafka Streams)
• Sliding Window can only be used for joins

Slide 24

Slide 24 text

Summary
• LATENCY
• Algorithm
  ● Hidden problems: Log Cleanup, Schema management, ...
• Language, Technology & Model
  ● Keep in mind that you must plan for collaboration.
• etc
  ● Delivery Semantics
  ● Time-related operations

Slide 25

Slide 25 text

THANK YOU