Building a Data Analytics System for Recommendation Systems

Building a Data Analytics System for a Recommendation Service
Sungmin Kim, AWS Solutions Architect
© 2020, Amazon Web Services, Inc. or its Affiliates.

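Agenda
• Background knowledge for data analysis
  - Data structure
  - Data temperature spectrum
  - Data pipeline
• Data needed to build a recommendation system
  - User behavior logs
  - Recommendation performance metrics
• Data analytics architecture for collecting user behavior logs
• Extending the data analytics system for recommendation performance analysis
• Lessons Learned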

Three Concepts Needed for Data Analysis

Structured, Unstructured, and Semi-Structured

Data Temperature Spectrum (figure): hot, warm, and cold data compared on structure, request rate, cost per GB, latency, and data volume, each on a Low to High scale. Storage options shown: In-Memory, SQL, NoSQL, Graph, Search, Object Storage, Archive Storage.

Simplify Big Data Processing
Data → Collect → Store → Process/Analyze (ETL) → Consume → Answers & Insights
Trade-offs: time to answer (latency), throughput, cost

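Data Needed for a Recommendation System
• User behavior logs for computing recommendations
  - Viewing a product detail page
  - Adding an item to the cart
  - Purchasing a product
  - …
• Data for measuring recommendation performance
  - Number of impressions of recommended items
  - Number of clicks on recommended items
  - Display position of recommended items
  - …
User journey: visit → view → cart → buy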
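The four behavior events on the slide (visit, view, cart, buy) could be captured as one small record type. The following is a minimal sketch, not the deck's actual schema: the field names `user_id`, `session_id`, `item_id`, and `timestamp` are assumptions chosen for illustration.

```python
import json
import time
from dataclasses import asdict, dataclass

# Event types come from the slide; every field name below is an assumption.
EVENT_TYPES = ("visit", "view", "cart", "buy")


@dataclass
class BehaviorEvent:
    user_id: str
    session_id: str
    event_type: str      # one of EVENT_TYPES
    item_id: str = ""    # left empty for a plain site visit
    timestamp: int = 0   # Unix epoch seconds; filled in when omitted

    def __post_init__(self) -> None:
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type!r}")
        if self.timestamp == 0:
            self.timestamp = int(time.time())

    def to_json(self) -> str:
        """Serialize the event as one JSON document, ready to be logged or streamed."""
        return json.dumps(asdict(self))


event = BehaviorEvent("user-1", "session-9", "view", "item-42", 1600000000)
print(event.to_json())
```

Validating the event type at construction time keeps malformed records out of the pipeline before they reach storage.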

Collecting User Behavior Logs: Amazon Personalize
• Dataset Group - Datasets: Users, Items, and Interactions
• Solution (Recipes) - Model selection, training, tuning, and verification
• Campaign - Model hosting and inference
• Event Tracker - Real-time events

Collecting User Behavior Logs (diagram): Web Server (visit, view, cart, buy); Amazon Personalize (Users, Items, Interactions)

Collecting User Behavior Logs (diagram): Amazon Personalize: Dataset Group (Users, Items, Interactions), Dataset Import, Recipe & Solution, Campaign; Personalize Runtime API
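Once a Campaign is deployed, an application reads recommendations through the Personalize Runtime API. A minimal sketch follows, assuming a boto3 `personalize-runtime` client is passed in; the campaign ARN and user ID are placeholders, and `FakeRuntimeClient` is a hypothetical stand-in so the example runs without AWS credentials.

```python
def recommend_items(runtime_client, campaign_arn, user_id, num_results=10):
    """Return the recommended item IDs for one user from a Personalize campaign."""
    response = runtime_client.get_recommendations(
        campaignArn=campaign_arn,
        userId=user_id,
        numResults=num_results,
    )
    return [item["itemId"] for item in response["itemList"]]


class FakeRuntimeClient:
    """Hypothetical stand-in that mimics the shape of the real response."""

    def get_recommendations(self, campaignArn, userId, numResults):
        items = [{"itemId": f"item-{n}"} for n in range(1, 100)]
        return {"itemList": items[:numResults]}


# With real AWS access the client would be created as:
#   import boto3
#   runtime_client = boto3.client("personalize-runtime")
recommended = recommend_items(
    FakeRuntimeClient(),
    "arn:aws:personalize:region:account-id:campaign/example",  # placeholder ARN
    "user-1",
    num_results=3,
)
print(recommended)
```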

Collecting User Behavior Logs (diagram): Dataset Group (Users, Items, Interactions), Dataset Import, Recipe & Solution, and Campaign orchestrated by an AWS Step Functions workflow; Personalize Runtime API

Collecting User Behavior Logs (diagram): the same AWS Step Functions workflow with S3 added as the data store

Collecting User Behavior Logs (diagram): S3 holds the Users, Items, and Interactions data that the AWS Step Functions workflow imports; Personalize Runtime API

Collecting User Behavior Logs (diagram): S3 (Users, Items, Interactions), Step Functions, Personalize Runtime API
How to deliver?
✓ Fast
✓ Without loss

Key Components of Real-time Analytics
• Data Source - Devices and/or applications that produce real-time data at high velocity
• Stream Ingestion - Data from tens of thousands of data sources can be written to a single stream
• Stream Storage - Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Stream Process - Records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink - Data lake (most common), Database (least common)

Collecting User Behavior Logs (diagram): Data Source: Web Server (visit, view, cart, buy); Data Sink: S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API

Collecting User Behavior Logs (diagram): Stream Storage added between the Data Source and the Data Sink (S3)

Collecting User Behavior Logs (diagram): Stream Delivery added between Stream Storage and the Data Sink (S3)

Collecting User Behavior Logs (diagram): Stream Storage: Kinesis Data Streams or Managed Streaming for Kafka; Stream Delivery: Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API

Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

Anatomy of … (figure, shown for Amazon Kinesis Data Streams and Amazon Managed Streaming for Kafka): Producers send records with a partition key (PK); a Hash Function maps each PK to one of shard/partition-1, shard/partition-2, or shard/partition-3. Within each shard/partition, records are held in order from oldest data to newest data, and each Consumer in the Consumer Group tracks its next consumer offset.
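The figure above can be approximated in a few lines. This is a simplified in-memory sketch, not how either service is implemented: it assumes the hash key space is split evenly across shards, and the `Stream` class and its methods are hypothetical. Kinesis Data Streams does map partition keys to 128-bit integers with MD5, which is why MD5 is used here.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 yields a 128-bit integer


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    return min(hash_value // range_size, shard_count - 1)


class Stream:
    """Hypothetical in-memory stream: one append-only list per shard."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [[] for _ in range(shard_count)]

    def put(self, partition_key: str, data: str):
        """Append a record; returns (shard index, offset within that shard)."""
        shard_id = shard_for_key(partition_key, len(self.shards))
        self.shards[shard_id].append(data)
        return shard_id, len(self.shards[shard_id]) - 1

    def read(self, shard_id: int, offset: int, limit: int = 10):
        """Read from a consumer's offset; returns (records, next offset)."""
        records = self.shards[shard_id][offset:offset + limit]
        return records, offset + len(records)


stream = Stream(shard_count=3)
for action in ("visit", "view", "cart", "buy"):
    stream.put("user-1", action)  # same key, same shard, order preserved
```

Because every record for one key lands in the same shard, a consumer that reads that shard from its saved offset sees the key's events in the order they were produced.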

Comparing Amazon Kinesis Data Streams to MSK

Amazon Kinesis Data Streams
• Streams and shards
• AWS API experience
• Throughput provisioning model
• Seamless scaling
• Typically lower costs
• Deep AWS integrations

Amazon MSK
• Topics and partitions
• Open-source compatibility
• Strong third-party tooling
• Cluster provisioning model
• Apache Kafka scaling isn’t seamless to clients
• Raw performance

Stream Ingestion (into Amazon Kinesis Data Streams)
• AWS SDKs
  - Publish directly from application code via APIs
  - AWS Mobile SDK
• Kinesis Agent
  - Monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL)
  - Background process aggregates and batches messages
• 3rd party and open source
  - Kafka Connect (kinesis-kafka-connector)
  - fluentd (aws-fluent-plugin-kinesis)
  - Log4J Appender (kinesis-log4j-appender)
  - and more …
Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
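Publishing directly from application code with the AWS SDK might look like the sketch below. It assumes a boto3 `kinesis` client is passed in; the stream name and the event fields are made up for illustration, and `FakeKinesisClient` is a hypothetical stand-in so the example runs offline.

```python
import json


def publish_event(kinesis_client, stream_name, event):
    """Send one behavior event to a stream, partitioned by user ID."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Same user, same shard: keeps each user's events in order.
        PartitionKey=str(event["user_id"]),
    )


class FakeKinesisClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_record(self, StreamName, Data, PartitionKey):
        self.calls.append((StreamName, Data, PartitionKey))
        return {"ShardId": "shardId-000000000000", "SequenceNumber": str(len(self.calls))}


# With real AWS access the client would be created as:
#   import boto3
#   kinesis_client = boto3.client("kinesis")
client = FakeKinesisClient()
response = publish_event(
    client,
    "user-behavior-log",  # assumed stream name
    {"user_id": "user-1", "event_type": "cart", "item_id": "item-42"},
)
print(response["SequenceNumber"])
```

For high-volume producers the Kinesis Producer Library or Kinesis Agent listed above handles batching and retries, which this one-record-per-call sketch does not.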

Stream Delivery: Kinesis Data Firehose
Pipeline: Data Source → Stream Ingestion → Stream Storage → Stream Process / Stream Delivery → Data Sink
Sources for Kinesis Data Firehose:
• Kinesis Agent
• CloudWatch Logs
• CloudWatch Events
• AWS IoT
• Direct PUT using APIs
• Kinesis Data Streams
• MSK (Kafka) using Kafka Connect
Services shown downstream: S3, Redshift, Elasticsearch Service, Kinesis Data Analytics

Amazon Kinesis Data Firehose
• Zero administration and seamless elasticity
• Direct-to-data store integration
• Serverless continuous data transformations
• Near real-time

Kinesis Firehose: Filter, Enrich, Convert

1. The Data Source sends Apache log lines to Kinesis Data Firehose:
   [Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]
   [Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]

2. Firehose invokes a Lambda function, which filters the records, enriches them with a geo-ip lookup, and returns JSON:
   { "recordId": "1", "result": "Ok", "data": { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" } },
   { "recordId": "2", "result": "Dropped" }

3. Firehose delivers the surviving JSON record to the Data Sink:
   { "date": "2017/10/11 14:32:52", "status": "error", "source": "192.34.86.178", "city": "Boston", "state": "MA" }
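A transformation Lambda function for the filter, enrich, convert flow above could be sketched as follows. Firehose passes each record's payload base64-encoded and expects `recordId`, `result`, and base64 `data` back. The `lookup_geo` helper and its one-entry table are hypothetical placeholders for a real geo-ip database, and keeping only `error` lines is an assumption taken from the slide's example.

```python
import base64
import json
import re
from datetime import datetime

LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)


def lookup_geo(ip):
    """Hypothetical geo-ip lookup; a real one would query a geo-ip database."""
    table = {"192.34.86.178": {"city": "Boston", "state": "MA"}}
    return table.get(ip, {})


def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        match = LOG_PATTERN.match(line)
        if match is None:
            # Unparseable line: hand it back so Firehose can route it as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed", "data": record["data"]})
            continue
        if match.group("status") != "error":
            # Filter: keep only error lines (assumption based on the slide).
            output.append({"recordId": record["recordId"],
                           "result": "Dropped", "data": record["data"]})
            continue
        parsed = datetime.strptime(match.group("date"), "%a %b %d %H:%M:%S %Y")
        doc = {
            "date": parsed.strftime("%Y/%m/%d %H:%M:%S"),
            "status": match.group("status"),
            "source": match.group("source"),
        }
        doc.update(lookup_geo(doc["source"]))  # enrich
        encoded = base64.b64encode((json.dumps(doc) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok", "data": encoded.decode("utf-8")})
    return {"records": output}


lines = [
    "[Wed Oct 11 14:32:52 2017] [error] [client 192.34.86.178]",
    "[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]",
]
sample_event = {"records": [
    {"recordId": str(i + 1), "data": base64.b64encode(l.encode()).decode()}
    for i, l in enumerate(lines)
]}
result = lambda_handler(sample_event, None)
```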

Pre-built Data Conversion
• Convert the format of your input data from JSON to a columnar data format (Apache Parquet or Apache ORC) before storing the data in Amazon S3
• Works in conjunction with the transform features to convert other formats to JSON before the data conversion
Diagram: Data Source → JSON data → Kinesis Data Firehose (converts to a columnar format using the schema from the AWS Glue Data Catalog) → Amazon S3; records that fail conversion go to /failed

Collecting User Behavior Logs (diagram): Stream Storage: Kinesis Data Streams or Managed Streaming for Kafka; Stream Delivery: Kinesis Data Firehose; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API

Collecting User Behavior Logs (diagram): Kinesis Data Streams chosen as Stream Storage; S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API

Collecting User Behavior Logs (diagram): Kinesis Data Streams, Kinesis Data Firehose, S3 (Users, Items, Interactions); Step Functions; Personalize Runtime API

Collecting User Behavior Logs (diagram): Kinesis Data Streams, Kinesis Data Firehose, S3 (Users, Items, Interactions); Event Tracker Lambda; Step Functions; Personalize Runtime API
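The Event Tracker Lambda in the last diagram reads records from Kinesis Data Streams and forwards them to Amazon Personalize. A minimal sketch under stated assumptions: the payload field names (`user_id`, `session_id`, `event_type`, `item_id`, `timestamp`) are hypothetical, the tracking ID is a placeholder, and `FakeEventsClient` stands in for a boto3 `personalize-events` client so the example runs offline.

```python
import base64
import json
from datetime import datetime, timezone


def to_put_events_request(record, tracking_id):
    """Convert one Kinesis record from a Lambda event into PutEvents arguments."""
    payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
    return {
        "trackingId": tracking_id,
        "userId": payload["user_id"],
        "sessionId": payload["session_id"],
        "eventList": [{
            "eventType": payload["event_type"],
            "itemId": payload["item_id"],
            "sentAt": datetime.fromtimestamp(payload["timestamp"], tz=timezone.utc),
        }],
    }


def forward_events(event, events_client, tracking_id):
    """Forward every record in the batch; returns the number of events sent."""
    sent = 0
    for record in event["Records"]:
        events_client.put_events(**to_put_events_request(record, tracking_id))
        sent += 1
    return sent


class FakeEventsClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_events(self, **kwargs):
        self.calls.append(kwargs)


# With real AWS access the client would be created as:
#   import boto3
#   events_client = boto3.client("personalize-events")
payload = {"user_id": "user-1", "session_id": "session-9",
           "event_type": "view", "item_id": "item-42", "timestamp": 1600000000}
kinesis_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(json.dumps(payload).encode()).decode()}}
]}
events_client = FakeEventsClient()
sent_count = forward_events(kinesis_event, events_client, "example-tracking-id")
```

A production handler would also need error handling and batching per user session, which this sketch leaves out.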

Recommendation Performance Analysis
• Key e-commerce metrics
  - Retention Rate
  - Churn Rate
  - Conversion Rate
• Recommendation algorithm metrics - A/B Test
  - Coverage
  - CTR (Click-through Rate) ≈ Relevance + Serendipity
"If you can’t measure it, you can’t improve it." – Peter Drucker
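Several of the metrics above reduce to simple ratios. The sketch below uses common definitions, which are assumptions rather than definitions given in the deck: CTR as clicks over impressions, conversion rate as purchases over visits, and coverage as the share of the catalog that was recommended at least once.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Clicks on recommended items divided by impressions of recommended items."""
    return clicks / impressions if impressions else 0.0


def conversion_rate(purchases: int, visits: int) -> float:
    """Share of visits that ended in a purchase."""
    return purchases / visits if visits else 0.0


def coverage(recommended_item_ids, catalog_size: int) -> float:
    """Share of the catalog that appeared in at least one recommendation."""
    return len(set(recommended_item_ids)) / catalog_size if catalog_size else 0.0


# Comparing two variants of an A/B test on CTR (made-up numbers).
ctr_a = click_through_rate(clicks=50, impressions=1000)
ctr_b = click_through_rate(clicks=80, impressions=1000)
print(f"A: {ctr_a:.3f}  B: {ctr_b:.3f}")
```

These ratios need click and impression counts per recommendation slot, which is why those fields have to be logged alongside the behavior events.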

Recommendation Performance Analysis (diagram): Kinesis Data Streams, Kinesis Firehose, S3 (Users, Items, Interactions); Event Tracker Lambda; Step Functions; Personalize Runtime API

Recommendation Performance Analysis (diagram): Kinesis Data Streams, Kinesis Firehose, S3 (Users, Items, Interactions); Amazon Personalize; data collected: user behavior logs

Recommendation Performance Analysis (diagram): Web Server sends user behavior logs (visit, view, cart, buy) and recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams, Kinesis Firehose, S3 (Users, Items, Interactions); Amazon Personalize

Recommendation Performance Analysis (diagram): the same pipeline (✓ impression, ✓ channel), with the data consumers added: Marketer, Data Scientist, Business Analyst

Recommendation Performance Analysis (diagram): the same pipeline (✓ impression, ✓ channel), with QuickSight serving the Marketer, Data Scientist, and Business Analyst

Amazon QuickSight: Fast BI Service with Pay-per-Session Pricing and ML Insights for everyone
DATA SOURCES (Relational Databases, Flat Files, And Many Others!) → DATA SETS (Retail Data, Ops Data, Marketing Data) → ANALYSES → DASHBOARDS & STORIES

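Recommendation Performance Analysis (diagram): Web Server sends user behavior logs (visit, view, cart, buy) and recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams, Kinesis Firehose, S3 (Users, Items, Interactions); Amazon Personalize. Which engine sits between S3 and QuickSight? Candidates: Glue, Athena, EMR, Redshift (Batch vs. Interactive).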

Comparison of SQL Processing Engines
Engines labeled on the slide: AWS Glue, Amazon Athena, Amazon Redshift
Values per column, left to right:
• Data Structure: Semi | Semi | Semi | Full
• Languages: API/SQL | SQL | SQL | SQL
• Data Store: S3 (Glue), S3/HDFS (Spark) | S3/HDFS | S3 | Local
• Use case: Transformation | SQL Queries for S3/HDFS | Serverless SQL Queries for S3 | Fully Featured SQL Database
• Performance

Recommendation Performance Analysis (diagram): Web Server sends user behavior logs (visit, view, cart, buy) and recommendation performance metrics (✓ click, ✓ impression, ✓ channel); Kinesis Data Streams, Kinesis Firehose, S3 (Users, Items, Interactions); Athena; QuickSight; Amazon Personalize

Recommendation Performance Analysis: How to analyze data in real-time?
Diagram: Kinesis Data Streams (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Athena; Personalize

Real-time analytics options (diagram), laid out across Collect, Store, Process/Analyze (ETL), and Consume. Three numbered paths (1, 2, 3) start from Kinesis Data Streams. Components shown: Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, EMR, Amazon Elasticsearch Service with Kibana (real-time dashboard), ElastiCache, DynamoDB, Amazon RDS, QuickSight.

• Amazon Kinesis Data Analytics for Flink - Serverless
• AWS Glue - Serverless
• Amazon EMR - Fully Managed

Amazon EMR: Easily run and scale Apache Spark, Hive, Presto, and other big data frameworks
Layers (diagram): Applications; Framework; Process Layer; Data Layer (EMRFS, Amazon S3); Infrastructure (Instances, Spot Instances)

Amazon Kinesis Data Analytics: A managed Apache Flink solution that enables building of sophisticated streaming applications
• Interact with streaming data in real-time using SQL or integrated Apache Flink applications
• Build fully managed and elastic stream processing applications

Kinesis Data Analytics for SQL
Pipeline: Data Source → Stream Ingestion → Stream Storage → Data Sink
Word count example:
• Input: “It's raining cats and dogs!”
• Tokenized: [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]
• Output counts: It's 1, raining 1, cats 1, and 1, dogs! 1
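The word count on the slide can be mirrored in plain Python to show the shape of the computation. This is an illustrative stand-in, not Kinesis Data Analytics SQL: each incoming line is split into (word, 1) pairs, and the pairs are summed per word.

```python
from collections import Counter


def tokenize(line: str):
    """Split a line on whitespace and emit a (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]


def count_words(lines):
    """Sum the (word, 1) pairs over a sequence of lines."""
    counts = Counter()
    for line in lines:
        for word, one in tokenize(line):
            counts[word] += one
    return counts


pairs = tokenize("It's raining cats and dogs!")
counts = count_words(["It's raining cats and dogs!"])
print(pairs)
```

In a streaming engine the same aggregation would typically run over a time window rather than over a finite list.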

Real-time analytics options (diagram), laid out across Collect, Store, Process/Analyze (ETL), and Consume. Three numbered paths (1, 2, 3): Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, ElastiCache, DynamoDB, Amazon RDS, QuickSight

Amazon Elasticsearch Service: Fully managed, scalable, and secure Elasticsearch service

3 ways to build Real-time Analytics (diagram): Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics, Lambda function, ElastiCache, DynamoDB, Amazon RDS, QuickSight

Recommendation Performance Analysis: How to analyze data in real-time?
Diagram: Kinesis Data Streams (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Athena; Personalize

Recommendation Performance Analysis (diagram): Web Server (visit, view, cart, buy, ✓ click, ✓ impression, ✓ channel); Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions), Athena, QuickSight; Kinesis Data Firehose to Amazon ES, Kibana; Amazon Personalize

Data Analytics System (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions), Athena, QuickSight; Kinesis Firehose to Amazon ES, Kibana; Amazon Personalize

Data Analytics System (diagram): Kinesis Data Streams, Kinesis Firehose, Data Lake, Amazon ES, Amazon Personalize

Use Amazon S3 as Your Data Lake
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• Decouple storage and compute
  - No need to run compute clusters for storage (unlike HDFS)
  - Can run transient Amazon EMR clusters with Amazon EC2 Spot Instances
  - Multiple & heterogeneous analysis clusters and services can use the same data
• Designed for 99.999999999% durability
• No need to pay for data replication within a region
• Secure: SSL, client/server-side encryption at rest
• Low cost

Data Analytics System (diagram): Kinesis Data Streams, Kinesis Firehose, Data Lake, Amazon ES, Amazon Personalize

Data Analytics System (diagram): Web Server; Stream Storage: Kinesis Data Streams; Stream Delivery: Kinesis Data Firehose (two delivery streams); Data Lake: S3 (Users, Items, Interactions); Athena; QuickSight; Amazon ES; Kibana; Amazon Personalize

Data Analytics System (diagram): the same components grouped into Batch Layer, Speed Layer, and Serving Layer: Web Server, Kinesis Data Streams, Kinesis Firehose, S3 Data Lake (Users, Items, Interactions), Athena, QuickSight, Amazon ES, Kibana, Amazon Personalize

Lambda Architecture
• Batch Layer - Raw Data → Batch Process → Batch View
• Speed Layer - Streaming Data → Stream Process → Real-time View
• Serving Layer - Queries run against both the Batch View and the Real-time View
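The way the serving layer combines the two views can be sketched with in-memory counters. This is a toy illustration of the pattern, not the deck's implementation: the per-item click counts, the event field names, and the function names are all assumptions.

```python
from collections import Counter


def build_view(events):
    """Count clicks per item; used here for both the batch and the real-time view."""
    return Counter(e["item_id"] for e in events if e["event_type"] == "click")


def query(item_id, batch_view, realtime_view):
    """Serving layer: merge the batch result with the events seen since the last batch run."""
    return batch_view.get(item_id, 0) + realtime_view.get(item_id, 0)


# Batch layer: raw data already processed by the last batch job.
raw_data = [
    {"item_id": "item-1", "event_type": "click"},
    {"item_id": "item-1", "event_type": "click"},
    {"item_id": "item-2", "event_type": "impression"},
]
# Speed layer: events that arrived after the last batch job ran.
streaming_data = [
    {"item_id": "item-1", "event_type": "click"},
    {"item_id": "item-2", "event_type": "click"},
]

batch_view = build_view(raw_data)
realtime_view = build_view(streaming_data)
print(query("item-1", batch_view, realtime_view))
```

The merge is only correct if the real-time view covers exactly the events the batch view has not yet seen, so the real-time view is usually reset each time a batch run completes.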

Recommendation + Data Analytics System (diagram): Web Server; Kinesis Data Streams; Kinesis Firehose to S3 (Users, Items, Interactions), Athena, QuickSight; Kinesis Firehose to Amazon ES, Kibana; Event Tracker Lambda; Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign); Personalize Runtime API

Recommendation System (diagram): the same architecture with the Amazon Personalize part highlighted: Event Tracker Lambda, Step Functions (Dataset Group, Dataset Import, Recipe & Solution, Campaign), Personalize Runtime API

Data Analytics System (diagram): the same architecture with the Analytics part highlighted: Kinesis Data Streams, Kinesis Firehose, Amazon ES, QuickSight

Recommendation + Data Analytics (diagram): the same architecture with both the Amazon Personalize part and the Analytics part highlighted

Recommendation + Data Analytics (diagram): Kinesis Data Streams, Kinesis Firehose, and S3 shared between Amazon Personalize (Event Tracker Lambda, Step Functions, Dataset Group, Dataset Import, Recipe & Solution, Campaign, Personalize Runtime API) and Analytics (Amazon ES)

From Batch to Real-time: Lambda Architecture
• Data Source → Stream Ingestion → Stream Storage (Streaming Data)
• Speed Layer - Stream Process → Real-time View
• Batch Layer - Stream Delivery → Raw Data Storage → Batch Process → Batch View
• Serving Layer - Query & Merge Results → Consumer

Data → Collect → Store → Process / Analyze (ETL) → Consume → Answers & Insights
Services shown: Amazon Kinesis Data Firehose, Amazon Kinesis Data Streams, Amazon Managed Streaming for Kafka, Amazon S3, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR, Amazon Athena, Amazon QuickSight, Amazon Redshift, Amazon Elasticsearch Service, Amazon Machine Learning, AWS Lambda
Stages highlighted in turn on this diagram:
• Stream Storage
• Stream Delivery
• Data Lake
• Stream/Batch Process (Batch, Speed Layer)
• Serving Layer

Lessons Learned: Architectural Principles
• Build decoupled systems - Data → Store → Process → Store → Analyze → Answers
• Use the right tool for the job - Data structure, latency, throughput, access patterns
• Leverage managed and serverless services - Scalable/elastic, available, reliable, secure, no/low admin
• Use log-centric design patterns - Immutable logs (data lake), materialized views
• Be cost-conscious - Big data ≠ Big cost
• Work backwards - Design from consume to collect

Reference
• Build BI System From Scratch
  - Hands-on-Lab: https://serverless-bi-system-from-scratch.workshop.aws/ko/
  - Sample Code: https://tinyurl.com/37d9kd76
  - Video: https://tinyurl.com/y2r6kljp
• Real-time Analytics on AWS
  - Slide: https://tinyurl.com/tpacz9w3
  - Video: https://tinyurl.com/s5d83982
• Choose Right Stream Storage: Amazon Kinesis Data Streams vs. MSK (Kafka)
  - Slide: https://tinyurl.com/3eetzek5
  - Video: https://tinyurl.com/yfzttdbm
• AWS Resource Hub – Databases and Data Analytics
  - https://kr-resources.awscloud.com/data-databases-and-analytics
