Building a Data Analytics System for a Recommendation System

Agenda
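- Background knowledge for data analysis
  - Data structure
  - Data temperature spectrum
  - Data pipeline
- Data needed to build a recommendation system
  - User behavior logs
  - Recommendation performance metrics
- Data analytics architecture for collecting user behavior logs
- Extending the data analytics system for recommendation performance analysis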
- Lesson Learned

Sungmin Kim

April 26, 2022
Transcript

  1. Building a Data Analytics System for a Recommendation Service. Sungmin Kim, AWS Solutions Architect. © 2020, Amazon Web Services, Inc. or its Affiliates.
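  2. Agenda
     • Background knowledge for data analysis
       • Data structure
       • Data temperature spectrum
       • Data pipeline
     • Data needed to build a recommendation system
       • User behavior logs
       • Recommendation performance metrics
     • Data analytics architecture for collecting user behavior logs
     • Extending the data analytics system for recommendation performance analysis
     • Lesson Learned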
  3. Data Temperature Spectrum
     • Temperature classes: Hot data, Warm data, Cold data
     • Axes (each running between Low and High): Structure, Request rate, Cost / GB, Latency, Data Volume
     • Stores shown: In-Memory, SQL, NoSQL, Search, Graph, Object Storage, Archive Storage
  4. Simplify Big Data Processing
     • Pipeline: Data → Collect → Store → Process/Analyze (ETL) → Consume → Answers & Insights
     • Considerations: Time to answer (Latency), Throughput, Cost
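  5. Data needed for a recommendation system
     • User behavior logs for computing recommendations
       • Viewing a product detail page
       • Adding to cart
       • Purchasing a product
       • …
     • Data for measuring recommendation performance
       • Number of impressions of recommended items
       • Number of clicks on recommended items
       • Display position of recommended items
       • …
     • Events: visit, view, cart, buy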
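A minimal sketch of how one such log record might be represented, assuming a flat JSON object per event. The event names (visit, view, cart, buy, impression, click) come from the slides; the field names `user_id`, `item_id`, `position`, and `timestamp` are hypothetical and not taken from the deck.

```python
import json
import time
from dataclasses import dataclass, asdict
from typing import Optional

# Event names from the slides; everything else here is an illustrative assumption.
BEHAVIOR_EVENTS = {"visit", "view", "cart", "buy"}
PERFORMANCE_EVENTS = {"impression", "click"}


@dataclass
class Event:
    user_id: str
    event_type: str
    item_id: Optional[str] = None    # a plain visit has no item
    position: Optional[int] = None   # display slot of a recommended item
    timestamp: int = 0               # Unix epoch seconds; 0 means "fill in now"

    def __post_init__(self):
        if self.event_type not in BEHAVIOR_EVENTS | PERFORMANCE_EVENTS:
            raise ValueError(f"unknown event type: {self.event_type}")
        if not self.timestamp:
            self.timestamp = int(time.time())

    def to_json(self) -> str:
        # One JSON object per record is easy to append to a log or a stream.
        return json.dumps(asdict(self), ensure_ascii=False)
```

Keeping behavior events and performance events in one record shape is a design choice of this sketch; it lets the same stream feed both the recommender and the performance dashboards.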
  6. Collecting user behavior logs: Amazon Personalize
     • Dataset Group / Datasets: Users, Items, and Interactions
     • Event Tracker: Real-Time Events
     • Solution (Recipes): model selection, training, tuning, and verification
     • Campaign: model hosting and inference
  7. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Users, Items, Interactions; Amazon Personalize
  8. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Users, Items, Interactions; Amazon Personalize (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API
  9. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Users, Items, Interactions; AWS Step Functions workflow (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API
  10. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Users, Items, Interactions; S3; AWS Step Functions workflow (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API
  11. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Users, Items, Interactions; S3 (Users, Items, Interactions); AWS Step Functions workflow (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API
  12. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
      • How to deliver? ✓ Quickly ✓ Without loss
  13. Key Components of Real-time Analytics
      • Data Source: devices and/or applications that produce real-time data at high velocity
      • Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
      • Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
      • Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
      • Data Sink: data lake (most common), database (least common)
  14. Collecting user behavior logs (diagram): Web Server as Data Source, events (visit, view, cart, buy); S3 as Data Sink (Users, Items, Interactions); Step Functions; Personalize Runtime API
  15. Collecting user behavior logs (diagram): Web Server as Data Source, events (visit, view, cart, buy); Stream Storage; S3 as Data Sink (Users, Items, Interactions); Step Functions; Personalize Runtime API
  16. Collecting user behavior logs (diagram): Web Server as Data Source, events (visit, view, cart, buy); Stream Storage; Stream Delivery; S3 as Data Sink (Users, Items, Interactions); Step Functions; Personalize Runtime API
  17. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Stream Storage (Kinesis Data Streams, Managed Streaming for Kafka); Stream Delivery (Kinesis Data Firehose); S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
  18. Why Stream Storage?
      • Decouple producers & consumers
      • Persistent buffer
      • Collect multiple streams
      • Preserve client ordering
      • Parallel consumption
      • Streaming MapReduce
  19. Anatomy of … (Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka)
      • Producers send records tagged with a PK; a Hash Function assigns each record to a shard/partition
      • shard/partition-1, shard/partition-2, shard/partition-3: records kept in numbered order, from oldest data to newest data
      • Consumer Group: several Consumers, each with a marker for its next consumer offset
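The routing and offset tracking shown in the anatomy diagram can be sketched as follows, assuming PK stands for the partition key. Kinesis Data Streams hashes the partition key with MD5 into a 128-bit integer and matches it against per-shard hash key ranges; splitting that range evenly across three shards, and the `Stream` class with its `put` and `poll` methods, are assumptions of this in-memory sketch, not the service API.

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 3  # the slide shows three shards/partitions


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard index via an MD5-based 128-bit hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = (1 << 128) // num_shards   # evenly split hash key ranges (assumption)
    return min(hash_value // range_size, num_shards - 1)


class Stream:
    """In-memory stand-in for a sharded, append-only stream."""

    def __init__(self, num_shards: int = NUM_SHARDS):
        self.shards = [[] for _ in range(num_shards)]
        # (consumer group, shard index) -> next offset to read
        self.offsets = defaultdict(int)

    def put(self, partition_key: str, data):
        """Append a record; records with the same key land in the same shard, in order."""
        shard = shard_for(partition_key, len(self.shards))
        self.shards[shard].append(data)
        return shard, len(self.shards[shard]) - 1

    def poll(self, group: str, shard: int, max_records: int = 10):
        """Read from the group's next offset and advance it; data stays in the shard."""
        start = self.offsets[(group, shard)]
        records = self.shards[shard][start:start + max_records]
        self.offsets[(group, shard)] = start + len(records)
        return records
```

Because reading only moves an offset, a second consumer group can replay the same records independently, which is the decoupling and parallel consumption the previous slide lists.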
  20. Comparing Amazon Kinesis Data Streams to MSK
      • Amazon Kinesis Data Streams
        • Streams and shards
        • AWS API experience
        • Throughput provisioning model
        • Seamless scaling
        • Typically lower costs
        • Deep AWS integrations
      • Amazon MSK
        • Topics and partitions
        • Open-source compatibility
        • Strong third-party tooling
        • Cluster provisioning model
        • Apache Kafka scaling isn’t seamless to clients
        • Raw performance
  21. Stream Ingestion (into Amazon Kinesis Data Streams)
      • AWS SDKs
        • Publish directly from application code via APIs
        • AWS Mobile SDK
      • Kinesis Agent
        • Monitors log files and forwards lines as messages to Kinesis Data Streams
      • Kinesis Producer Library (KPL)
        • Background process aggregates and batches messages
      • 3rd party and open source
        • Kafka Connect (kinesis-kafka-connector)
        • fluentd (aws-fluent-plugin-kinesis)
        • Log4J Appender (kinesis-log4j-appender)
        • and more …
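A minimal sketch of the batching idea behind publishing from application code: group records so that each request carries many of them. The Kinesis `PutRecords` API accepts at most 500 records per request, which is the batch size used here. The `send` callback is hypothetical, and this shows plain batching only, not the KPL's record aggregation format.

```python
from typing import Callable, Iterable, Iterator, List

MAX_BATCH = 500  # PutRecords accepts at most 500 records per request


def batches(records: Iterable[dict], size: int = MAX_BATCH) -> Iterator[List[dict]]:
    """Yield consecutive batches of at most `size` records, preserving order."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:          # flush the final, partially filled batch
        yield batch


def send_all(records: Iterable[dict], send: Callable[[List[dict]], None]) -> int:
    """Send every record through `send`, one call per batch; return the call count."""
    calls = 0
    for batch in batches(records):
        send(batch)
        calls += 1
    return calls
```

With boto3, `send` could wrap `put_records(StreamName=..., Records=batch)` on a Kinesis client, where each record is a dict with `Data` and `PartitionKey`; a production sender would also need to retry the entries that the response reports as failed.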
  22. Stream Delivery: Kinesis Data Firehose
      • Sources
        • Kinesis Agent
        • CloudWatch Logs
        • CloudWatch Events
        • AWS IoT
        • Direct PUT using APIs
        • Kinesis Data Streams
        • MSK(Kafka) using Kafka Connect
      • Other services shown: S3, Redshift, Elasticsearch Service, Kinesis Data Analytics
      • Pipeline stages: Data Source, Stream Ingestion, Stream Storage, Stream Process, Stream Delivery, Data Sink
  23. Amazon Kinesis Data Firehose
      • Zero administration and seamless elasticity
      • Direct-to-data store integration
      • Serverless continuous data transformations
      • Near real-time
  24. Kinesis Firehose: Filter, Enrich, Convert
      • Flow (steps 1 to 3): Data Source (apache log), Kinesis Data Firehose, Lambda function (geo-ip; apache log to json), Data Sink (json)
      • Input records:
        [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
        [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
      • Delivered record:
        { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }
      • Lambda function output:
        { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } },
        { "recordId": "2", "result": "Dropped" }
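A minimal sketch of a Firehose transformation Lambda that filters, enriches, and converts the Apache log lines from this slide. Firehose passes records as base64-encoded `data` with a `recordId` and expects one entry per record back with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The `GEO_IP` table is a hypothetical stand-in for a real geo-IP lookup, and the rule "keep only error lines" is an assumption read off the slide's example.

```python
import base64
import json
import re
from datetime import datetime

# Matches lines such as: [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
LOG_LINE = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)

# Hypothetical placeholder for a geo-IP service; values copied from the slide.
GEO_IP = {"192.34.86.178": {"city": "Boston", "state": "MA"}}


def transform(line: str):
    """Return an enriched dict for error lines, or None when the line is filtered out."""
    match = LOG_LINE.search(line)
    if match is None or match.group("status") != "error":
        return None
    parsed = datetime.strptime(match.group("date"), "%a %b %d %H:%M:%S %Y")
    doc = {
        "date": parsed.strftime("%Y/%m/%d %H:%M:%S"),
        "status": match.group("status"),
        "source": match.group("source"),
    }
    doc.update(GEO_IP.get(doc["source"], {}))  # enrich when the address is known
    return doc


def lambda_handler(event, context):
    """Firehose data-transformation handler: one output entry per input record."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        doc = transform(line)
        if doc is None:
            # Filtered out: Firehose does not deliver dropped records.
            output.append({"recordId": record["recordId"], "result": "Dropped",
                           "data": record["data"]})
            continue
        payload = (json.dumps(doc) + "\n").encode("utf-8")
        output.append({"recordId": record["recordId"], "result": "Ok",
                       "data": base64.b64encode(payload).decode("utf-8")})
    return {"records": output}
```

In this sketch unparseable lines are dropped as well; returning `ProcessingFailed` for them instead would let Firehose route them to its error output for later inspection.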
  25. Pre-built Data Conversion
      • Flow: Data Source (JSON data), Kinesis Data Firehose (schema from AWS Glue Data Catalog; convert to columnar format), Amazon S3 (failed conversions under /failed)
      • Convert the format of your input data from JSON to a columnar data format, Apache Parquet or Apache ORC, before storing the data in Amazon S3
      • Works in conjunction with the transform features, which convert other formats to JSON before the data conversion
  26. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Stream Storage (Kinesis Data Streams, Managed Streaming for Kafka); Stream Delivery (Kinesis Data Firehose); S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
  27. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Kinesis Data Streams; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
  28. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Kinesis Data Streams; Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
  29. Collecting user behavior logs (diagram): Web Server events (visit, view, cart, buy); Kinesis Data Streams; Kinesis Data Firehose; Lambda; Event Tracker; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API
  30. Recommendation performance analysis
      • Key e-commerce metrics
        • Retention Rate (time on site)
        • Churn Rate
        • Conversion Rate
      • Recommendation algorithm metrics (A/B test)
        • Coverage
        • CTR (Click-through Rate) ≈ Relevance + Serendipity
      • "If you can’t measure it, you can’t improve it." (Peter Drucker)
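A minimal sketch of two of these metrics computed from raw event records, assuming CTR is clicks divided by impressions per A/B variant and conversion rate is the share of visiting users who also bought. The field names `event_type`, `variant`, and `user_id` are hypothetical; the deck does not define a record schema or exact formulas.

```python
from collections import Counter
from typing import Dict, List


def ctr_by_variant(events: List[dict]) -> Dict[str, float]:
    """Click-through rate per A/B test variant: clicks / impressions."""
    impressions, clicks = Counter(), Counter()
    for event in events:
        if event["event_type"] == "impression":
            impressions[event["variant"]] += 1
        elif event["event_type"] == "click":
            clicks[event["variant"]] += 1
    # Variants without impressions are skipped to avoid dividing by zero.
    return {v: clicks[v] / n for v, n in impressions.items() if n > 0}


def conversion_rate(events: List[dict]) -> float:
    """Share of visiting users who bought at least once (one possible definition)."""
    visitors = {e["user_id"] for e in events if e["event_type"] == "visit"}
    buyers = {e["user_id"] for e in events if e["event_type"] == "buy"}
    return len(visitors & buyers) / len(visitors) if visitors else 0.0
```

Comparing `ctr_by_variant` output between the control and the recommender variant is the simplest form of the A/B test the slide refers to; a real comparison would also need a significance test.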
  31. Recommendation performance analysis (diagram): Web Server events (visit, view, cart, buy); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions; Personalize Runtime API
  32. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize
  33. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy) plus recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize
  34. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy) plus recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; consumers: Marketer, Data Scientist, Business Analyst
  35. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy) plus recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; QuickSight; consumers: Marketer, Data Scientist, Business Analyst
  36. Amazon QuickSight: Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone
      • Data sources: Relational Databases, Flat Files, And Many Others!
      • Data sets: Retail Data, Ops Data, Marketing Data
      • Analyses
      • Dashboards & Stories
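  37. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy) plus recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; QuickSight; open question (?) between S3 and QuickSight with candidates Glue, Athena, EMR, Redshift (Batch, Interactive)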
  38. Comparison of SQL Processing engines (engines named: AWS Glue, Amazon Athena, Amazon Redshift; four value columns)
      • Data Structure: Semi | Semi | Semi | Full
      • Languages: API/SQL | SQL | SQL | SQL
      • Data Store: S3 (Glue), S3/HDFS (Spark) | S3/HDFS | S3 | Local
      • Use case: Transformation | SQL Queries for S3/HDFS | Serverless SQL Queries for S3 | Fully Featured SQL Database
      • Performance
  39. Recommendation performance analysis (diagram): Web Server, user behavior logs (visit, view, cart, buy) plus recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight
  40. Recommendation performance analysis (diagram): Web Server events (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight
      • How to analyze data in real-time?
  41. Real-time analytics options (diagram, paths 1 to 3), by stage: Collect, Store, Process/Analyze (ETL), Consume
      • Services shown: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, EMR, Amazon Elasticsearch Service, Kibana (real-time dashboard), ElastiCache, Amazon RDS, DynamoDB, QuickSight
  42. Real-time analytics options (diagram, paths 1 to 3), by stage: Collect, Store, Process/Analyze (ETL), Consume
      • Services shown: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, EMR, Amazon Elasticsearch Service, Kibana (real-time dashboard), ElastiCache, Amazon RDS, DynamoDB, QuickSight
  43. Amazon EMR: Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks
      • Layers: Applications, Framework, Process Layer, Data Layer, Infrastructure
      • Components shown: EMRFS, Amazon S3, Instances, Spot Instances
  44. Amazon Kinesis Data Analytics: a managed Apache Flink solution that enables building of sophisticated streaming applications
      • Interact with streaming data in real-time using SQL or integrated Apache Flink applications
      • Build fully managed and elastic stream processing applications
  45. Kinesis Data Analytics for SQL
      • Pipeline stages: Data Source, Stream Ingestion, Stream Storage, Data Sink
      • Input: “It's raining cats and dogs!”
      • Mapped pairs: [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
      • Output counts: It’s 1, raining 1, cats 1, and 1, dogs! 1
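The word-count example on this slide can be sketched in plain Python as a map step that emits `(word, 1)` pairs followed by a running reduce. This only illustrates the logic; it is not Kinesis Data Analytics SQL or Flink code, and the function names are made up for the sketch.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def to_pairs(line: str) -> List[Tuple[str, int]]:
    """Map step: split on whitespace and emit one (word, 1) pair per token."""
    return [(word, 1) for word in line.split()]


def count_words(lines: Iterable[str]) -> Counter:
    """Reduce step: keep running totals as lines arrive from the stream."""
    totals: Counter = Counter()
    for line in lines:
        for word, n in to_pairs(line):
            totals[word] += n
    return totals
```

Because `count_words` consumes any iterable, it can be fed a generator that yields lines as they arrive, which is the streaming flavour of the same computation.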
  46. Real-time analytics options (diagram, paths 1 to 3), by stage: Collect, Store, Process/Analyze (ETL), Consume
      • Services shown: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, EMR, Amazon Elasticsearch Service, Kibana (real-time dashboard), ElastiCache, Amazon RDS, DynamoDB, QuickSight
  47. 3 ways to build Real-time Analytics (diagram, paths 1 to 3)
      • Services shown: Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, EMR, Amazon Elasticsearch Service, Kibana (real-time dashboard), ElastiCache, Amazon RDS, DynamoDB, QuickSight
  48. Recommendation performance analysis (diagram): Web Server events (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight
      • How to analyze data in real-time?
  49. Recommendation performance analysis (diagram): Web Server events (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; Kinesis Data Firehose; Amazon ES; Kibana
  50. Data analytics system (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; Kinesis Firehose; Amazon ES; Kibana
  51. Data analytics system (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose; Data Lake (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; Kinesis Firehose; Amazon ES; Kibana
  52. Use Amazon S3 as your Data lake
      • Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
      • Decouple storage and compute
        • No need to run compute clusters for storage (unlike HDFS)
        • Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
        • Multiple & heterogeneous analysis clusters and services can use the same data
      • Designed for 99.999999999% durability
      • No need to pay for data replication within a region
      • Secure: SSL, client/server-side encryption at rest
      • Low cost
  53. Data analytics system (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose; Data Lake (Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; Kinesis Firehose; Amazon ES; Kibana
  54. Data analytics system (diagram): Web Server; Stream Storage (Kinesis Data Streams); Stream Delivery (Kinesis Data Firehose, Kinesis Data Firehose); Data Lake (S3: Users, Items, Interactions); Amazon Personalize; Athena; QuickSight; Amazon ES; Kibana
  55. Data analytics system (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose; Data Lake (S3: Users, Items, Interactions); Amazon Personalize; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
      • Layers marked: Batch Layer, Speed Layer, Serving Layer
  56. Lambda Architecture
      • Batch Layer: Raw Data, Batch Process, Batch View
      • Speed Layer: Streaming Data, Stream Process, Real-time View
      • Serving Layer: Query against the Batch View and the Real-time View
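A minimal in-memory sketch of how the three layers fit together, assuming the views are simple per-item counts. The class and method names are hypothetical; in the architecture from the deck these roles are played by separate services rather than one object.

```python
from collections import Counter


class LambdaViews:
    """Toy Lambda Architecture: batch view + real-time view, merged at query time."""

    def __init__(self):
        self.raw = []                    # append-only raw data (batch layer input)
        self.batch_view = Counter()      # recomputed from raw data by the batch job
        self.realtime_view = Counter()   # incremental counts since the last batch run

    def ingest(self, item_id: str) -> None:
        """Every event is stored raw and also applied to the speed layer."""
        self.raw.append(item_id)
        self.realtime_view[item_id] += 1

    def run_batch(self) -> None:
        """Batch layer: rebuild the batch view from all raw data.

        The real-time view is reset because the new batch view now covers
        every event ingested so far.
        """
        self.batch_view = Counter(self.raw)
        self.realtime_view = Counter()

    def query(self, item_id: str) -> int:
        """Serving layer: merge the batch view and the real-time view."""
        return self.batch_view[item_id] + self.realtime_view[item_id]
```

Queries stay correct between batch runs because events not yet covered by the batch view are counted in the real-time view, and the periodic recomputation from raw data corrects any drift in the incremental counts.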
  57. Recommendation + data analytics system (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
  58. Recommendation system (diagram, Amazon Personalize part): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
  59. Data analytics system (diagram, Analytics part): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
  60. Recommendation + data analytics (diagram, Amazon Personalize and Analytics parts): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
  61. Recommendation + data analytics (diagram, Amazon Personalize and Analytics parts): Web Server; Kinesis Data Streams; Kinesis Firehose; S3 (Users, Items, Interactions); Lambda; Event Tracker; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API; Kinesis Firehose; Amazon ES; Kibana; Athena; QuickSight
  62. From Batch to Real-time: Lambda Architecture
      • Ingest: Data Source, Stream Ingestion, Stream Storage (Streaming Data)
      • Batch Layer: Stream Delivery, Raw Data Storage, Batch Process, Batch View
      • Speed Layer: Stream Process, Real-time View
      • Service Layer: Consumer, Query & Merge Results
  63. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
  64. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
      • Callout: Stream Storage
  65. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
      • Callout: Stream Delivery
  66. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
      • Callout: Data Lake
  67. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
      • Callout: Stream/Batch Process (Batch, Speed Layer)
  68. Collect, Store, Process / Analyze (ETL), Consume: from Data to Answers & Insights
      • Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
      • Callout: Serving Layer
  69. Lessons Learned: Architectural Principles
      • Build decoupled systems
        - Data → Store → Process → Store → Analyze → Answers
      • Use the right tool for the job
        - Data structure, latency, throughput, access patterns
      • Leverage managed and serverless services
        - Scalable/elastic, available, reliable, secure, no/low admin
      • Use log-centric design patterns
        - Immutable logs (data lake), materialized views
      • Be cost-conscious
        - Big data ≠ Big cost
      • Work backwards
        - Design from consume to collect
  70. Reference
      • Build BI System From Scratch
        • Hands-on-Lab: https://serverless-bi-system-from-scratch.workshop.aws/ko/
        • Sample Code: https://tinyurl.com/37d9kd76
        • Video: https://tinyurl.com/y2r6kljp
      • Real-time Analytics on AWS
        • Slide: https://tinyurl.com/tpacz9w3
        • Video: https://tinyurl.com/s5d83982
      • Choose Right Stream Storage: Amazon Kinesis Data Streams vs. MSK(Kafka)
        • Slide: https://tinyurl.com/3eetzek5
        • Video: https://tinyurl.com/yfzttdbm
      • AWS Resource Hub: Databases and Data Analytics AI
        • https://kr-resources.awscloud.com/data-databases-and-analytics