Slide 1

Slide 1 text

Serverless Streaming Log Architecture ~ Theory & Practice ~ Kenju Wagatsuma

Slide 2

Slide 2 text

Agenda Part I … Theory - What is “streaming” ? - “Batch” vs “Streaming” - “Event Time” vs “Processing Time” - Lambda Architecture - Kappa Architecture - Apache Hadoop/Storm/Spark/Kafka/Flink - Late Logs - Discarding, Watermark, Trigger, Accumulation Part II … Practice - Overall Data-flow - Watermark Implementation - Aggregation - Kinesis -> Lambda -> DynamoDB - DynamoDB Streams -> Lambda -> DynamoDB - Monitoring - “GetRecords.IteratorAgeMilliseconds” - DynamoDB Streams -> Lambda -> Slack - Misc (Cognito, Golang, Serverless Framework)

Slide 3

Slide 3 text

Who are you? Kenju Wagatsuma - Server-side Engineer at Cookpad Inc. - Ruby, Golang, AWS - https://github.com/kenju/ - “Header Bidding 導入によるネットワーク広告改善の開発事情” (“Development story of improving network ads by introducing Header Bidding”)

Slide 4

Slide 4 text

Theory - Streaming Part I

Slide 5

Slide 5 text

Definition & Glossary

Slide 6

Slide 6 text

What is “streaming”? Definition: - “a type of data processing engine that is designed with infinite data sets in mind.” (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)

Slide 7

Slide 7 text

“Batch” vs “Streaming” - Batch Processing - Group logs and process each group at once, at occasional intervals - Streaming Processing - Micro-batch - ex) AWS Lambda with “Batch Size = 2 ~ n” (n is a natural number that is not too large) - Real Streaming - ex) AWS Lambda with “Batch Size = 1”

Slide 8

Slide 8 text

“Event Time” vs “Processing Time” - Event time, which is the time at which events actually occurred. - Processing time, which is the time at which events are observed in the system. Figure: Example time domain mapping from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Slide 9

Slide 9 text

Architecture

Slide 10

Slide 10 text

Lambda Architecture - introduced by Nathan Marz, the creator of Apache Storm - “How to beat the CAP theorem” http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html - Batch Layer + Serving Layer + Speed Layer - Batch Layer … re-computable, can ensure consistency - Serving Layer … merges views from the Batch Layer with real-time logs from the Speed Layer - Speed Layer … low latency

Slide 11

Slide 11 text

Figure: lambda architecture from http://lambda-architecture.net/

Slide 12

Slide 12 text

Presentation: https://speakerdeck.com/wata/realtime-ad-log-aggregation-and-utilization

Slide 13

Slide 13 text

Presentation: https://speakerdeck.com/wata/realtime-ad-log-aggregation-and-utilization

Slide 14

Slide 14 text

Kappa Architecture - introduced by Jay Kreps, co-founder and CEO of Confluent, who co-created Apache Kafka while at LinkedIn - “Questioning the Lambda Architecture” https://www.oreilly.com/ideas/questioning-the-lambda-architecture - Streaming Layer + Serving Layer

Slide 15

Slide 15 text

Figure: kappa architecture from https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry

Slide 16

Slide 16 text

Lambda vs Kappa Architecture - Lambda Architecture - Pros: robust data consistency - Cons: harder to maintain multiple layers - Kappa Architecture - Pros: simple implementation - Cons: needs extra work to guarantee data consistency

Slide 17

Slide 17 text

Apache Hadoop/Storm/Spark/Kafka/Flink (Name … Speciality; Batch Processing / Stream Processing, O = yes, X = no)
- Hadoop … HDFS (File System), YARN, MapReduce; Batch: O / Stream: X
- Storm … Topology (Spout + Bolt), Tuple, Task; Batch: X / Stream: O
- Kafka … Broker, Producer/Consumer (“Un-Managed Kinesis”?); Batch: X / Stream: O
- Spark … inspired by Hadoop’s MapReduce engine; Batch: O / Stream: O (Spark Streaming)
- Flink … Batch and Streaming in One System, ML Support, DataStream API; Batch: O / Stream: O

Slide 18

Slide 18 text

Common Problems

Slide 19

Slide 19 text

Late Logs Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Slide 20

Slide 20 text

How to tackle “Late Logs”? - Discarding - simply discard late logs - Watermarks - “all input data with event times less than X have been observed.” - Triggers - declare when the output for a window should be materialized - Accumulation - accumulate multiple results that are observed for the same window Read https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 for more details :)

Slide 21

Slide 21 text

What is “Watermark” Definition: a special mark contained in electronic documents, pictures, music etc that is used to stop people from copying them Example: audio watermark detection, photo watermark, copyright watermark

Slide 22

Slide 22 text

How to “Watermark” - A. Save state in external storage - e.g. RDS, DynamoDB - B. Calculate in memory - formula … watermark = max(event time) - late_threshold - e.g. late threshold = 10 minutes - use `median(event time)` instead of `max(event time)` to handle logs timestamped too far in the future - e.g. users sometimes set their mobile devices’ system clocks incorrectly
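The in-memory formula above can be sketched in Go roughly as follows. This is a minimal illustration, not the storeTV implementation: the `EventTime` type (assumed here to be Unix seconds), the `Watermark` and `IsLate` names, and the sample values are all hypothetical.

```go
package main

import "fmt"

// EventTime is assumed to be a Unix timestamp in seconds (hypothetical type).
type EventTime int64

// Watermark returns max(event time) - lateThreshold for one batch.
// ok is false when the batch is empty and no watermark can be derived.
func Watermark(eventTimes []EventTime, lateThreshold EventTime) (wm EventTime, ok bool) {
	if len(eventTimes) == 0 {
		return 0, false
	}
	latest := eventTimes[0]
	for _, t := range eventTimes[1:] {
		if t > latest {
			latest = t
		}
	}
	return latest - lateThreshold, true
}

// IsLate reports whether an event time is behind the watermark.
func IsLate(t, watermark EventTime) bool {
	return t < watermark
}

func main() {
	const lateThreshold = 10 * 60 // 10 minutes, in seconds
	batch := []EventTime{1000, 1600, 2000}
	wm, ok := Watermark(batch, lateThreshold)
	if !ok {
		return
	}
	for _, t := range batch {
		fmt.Println(t, IsLate(t, wm))
	}
}
```

Because the watermark follows the largest event time, a single device with a clock set far in the future would push every other log behind the watermark, which is the motivation for the median variant mentioned on this slide.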

Slide 23

Slide 23 text

Figure: https://cdn.oreillystatic.com/en/assets/1/event/155/Watermarks_%20Time%20and%20progress%20in%20streaming%2

Slide 24

Slide 24 text

Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Slide 25

Slide 25 text

Practice - storeTV Part II

Slide 26

Slide 26 text

Image Area

Slide 27

Slide 27 text

Overall Data-flow - Kinesis Stream receives logs from Android clients directly via kinesis:PutRecord - once per 90 sec, from at most 15,000 devices - Lambda polls Kinesis and aggregates the logs into impression counts by incrementing - DynamoDB stores the incremented records with the UpdateItem (ADD) operation - other Lambda function(s) poll DynamoDB Streams and aggregate hourly/daily

Slide 28

Slide 28 text

Aggregation

Slide 29

Slide 29 text

Aggregation

Slide 30

Slide 30 text

Aggregation - Aggregate gradually: per minute -> per 10 min -> hourly -> daily - Partition on-time logs and late logs, and write them to separate tables - NOT discarding (for now), to see how many logs would be discarded - determine whether each log’s timestamp is behind the watermark or not - watermark … the median of all timestamps - because users can set their system clocks to the future
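The partition-then-roll-up flow above might look like the following Go sketch. It keeps everything in memory for illustration only; in the talk the counts live in DynamoDB tables. The `EventTime` type (Unix seconds, assumed non-negative), the `Bucket`, `Partition`, and `RollUp` helpers, and the sample data are all hypothetical names introduced here.

```go
package main

import "fmt"

// EventTime is assumed to be a non-negative Unix timestamp in seconds.
type EventTime int64

// Window widths in seconds.
const (
	Minute    EventTime = 60
	TenMinute EventTime = 600
	Hour      EventTime = 3600
	Day       EventTime = 86400
)

// Bucket floors t to the start of the window it falls into.
func Bucket(t, width EventTime) EventTime { return t - t%width }

// Partition counts logs per minute window, keeping logs behind the
// watermark in a separate map instead of discarding them.
func Partition(eventTimes []EventTime, watermark EventTime) (onTime, late map[EventTime]int) {
	onTime, late = map[EventTime]int{}, map[EventTime]int{}
	for _, t := range eventTimes {
		if t < watermark {
			late[Bucket(t, Minute)]++
		} else {
			onTime[Bucket(t, Minute)]++
		}
	}
	return onTime, late
}

// RollUp merges counts keyed by a narrow window into a wider window.
func RollUp(counts map[EventTime]int, width EventTime) map[EventTime]int {
	out := map[EventTime]int{}
	for start, n := range counts {
		out[Bucket(start, width)] += n
	}
	return out
}

func main() {
	onTime, late := Partition([]EventTime{30, 65, 70, 610, 3700}, 60)
	tenMin := RollUp(onTime, TenMinute)
	hourly := RollUp(tenMin, Hour)
	daily := RollUp(hourly, Day)
	fmt.Println(onTime, late, tenMin, hourly, daily)
}
```

Rolling up from the previous level rather than from raw logs mirrors the gradual aggregation on the slide: each stage only reads the already-aggregated counts of the stage before it.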

Slide 31

Slide 31 text

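Watermark Implementation
// watermark returns the median event time of the batch.
// Assumes EventTime is an integer type and Mean is a helper defined elsewhere.
func (sr *StreamRecords) watermark(eventTimes []EventTime) (median EventTime) {
	l := len(eventTimes)
	if l == 0 {
		return median // empty batch: avoid indexing an empty slice
	}
	// 1. sort (sort.Ints compiles only if EventTime is an alias of int; sort.Slice works for any integer type)
	sort.Slice(eventTimes, func(i, j int) bool { return eventTimes[i] < eventTimes[j] })
	// 2. get the median
	if l%2 == 0 { // when even
		median = Mean(eventTimes[l/2-1 : l/2+1])
	} else { // when odd
		median = EventTime(eventTimes[l/2])
	}
	return median
}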

Slide 32

Slide 32 text

Monitoring

Slide 33

Slide 33 text

Monitoring

Slide 34

Slide 34 text

Monitoring - The Four Golden Signals - What to “monitor”? Google’s famous “SRE” book lists them in Chapter 6, in the section “The Four Golden Signals”: - Latency - Traffic - Errors - Saturation

Slide 35

Slide 35 text

Monitoring - CloudWatch Dashboard - Latency - How long does it take? - Traffic - How many Get/PutRecords? - Errors - Availability? - How many errors occur? - Saturation - Any delayed data? - IteratorAge?

Slide 36

Slide 36 text

Monitoring - Custom Metrics & Alarm - Create custom metrics - via cloudwatch:PutMetricData - Flexible alarm setting - Period - Evaluation Period - Datapoints to Alarm Figure: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

Slide 37

Slide 37 text

Monitoring - Implementation Example
// Publish the number of late logs as a custom CloudWatch metric.
dataInput := &cloudwatch.PutMetricDataInput{
	Namespace: aws.String("StoreTvAdLambdaMetrics"),
	MetricData: []*cloudwatch.MetricDatum{
		{
			Dimensions: []*cloudwatch.Dimension{
				{
					Name:  aws.String("Function"),
					Value: aws.String("monitor-late-data"),
				},
			},
			MetricName: aws.String("LateLogCount"),
			Unit:       aws.String("Count"),
			Value:      aws.Float64(float64(lateLogCount)),
		},
	},
}
// PutMetricData returns an error that should not be silently dropped.
if _, err := cloudWatchClient.PutMetricData(dataInput); err != nil {
	log.Printf("failed to put metric data: %v", err)
}

Slide 38

Slide 38 text

Monitoring - Kinesis Streams IteratorAge - “GetRecords.IteratorAgeMilliseconds” metric - shows how far stream processing is lagging behind - https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html - [CloudWatch Alarms -> SNS Topic -> Lambda -> Slack]

Slide 39

Slide 39 text

Monitoring - Stream Saturation - Simply calculate the difference between the updated records’ timestamp and the current time, and compare it with the threshold. - timestamp … taken from DynamoDB Stream event records - threshold … passed via ENV (currently 10 min) - calculation … time.Since(timestamp).Minutes() >= threshold - [DynamoDB Stream -> Custom CloudWatch Metrics -> SNS -> Lambda -> Slack]
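A minimal Go sketch of the saturation check described above, under stated assumptions: the record timestamp is taken to be Unix seconds, the environment variable name `SATURATION_THRESHOLD_MINUTES` is made up for this example, and `ParseThreshold` and `IsSaturated` are hypothetical helpers rather than the actual storeTV code. The current time is passed in explicitly so the comparison can be tested without depending on the wall clock.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// ParseThreshold converts the raw ENV value (in minutes) to a number and
// falls back to the given default when the value is unset or not positive.
func ParseThreshold(raw string, fallback float64) float64 {
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil || !(v > 0) {
		return fallback
	}
	return v
}

// IsSaturated reports whether a record updated at updatedAtUnix is at least
// thresholdMinutes old at nowUnix (both in Unix seconds).
func IsSaturated(updatedAtUnix, nowUnix int64, thresholdMinutes float64) bool {
	age := time.Duration(nowUnix-updatedAtUnix) * time.Second
	return age.Minutes() >= thresholdMinutes
}

func main() {
	// Hypothetical variable name; the slide only says the threshold comes via ENV.
	threshold := ParseThreshold(os.Getenv("SATURATION_THRESHOLD_MINUTES"), 10)

	now := time.Now().Unix()
	updatedAt := now - 11*60 // pretend the record was updated 11 minutes ago
	fmt.Println(IsSaturated(updatedAt, now, threshold))
}
```

When the check returns true, the Lambda would publish a custom CloudWatch metric, which then flows through SNS to Slack as in the pipeline on this slide.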

Slide 40

Slide 40 text

Analysis

Slide 41

Slide 41 text

Analysis

Slide 42

Slide 42 text

Misc

Slide 43

Slide 43 text

Cognito - pass a role to the Android clients - which allows kinesis:PutRecord on the target Kinesis Stream ARN - use unauthenticated (guest) identities in the identity pool - because the Android app does not need any login feature - much more secure than embedding API_KEY/CREDENTIAL_KEY in the Android clients

Slide 44

Slide 44 text

Lambda x Golang 1.x - Officially supported since January 15th, 2018 - https://dev.classmethod.jp/cloud/aws/aws-lambda-supports-go/ - aws/aws-lambda-go - https://github.com/aws/aws-lambda-go - the event types make it easy to grasp what kind of JSON event records will be available - IMHO - One of my favorite languages among the officially supported ones - Type, Runtime Performance, ecosystem, etc. - goroutine/channels have too much overhead for running on Lambda

Slide 45

Slide 45 text

Lambda x Golang 1.x 1. init() function for declaring the global vars > A single instance of your Lambda function will never handle multiple events simultaneously https://docs.aws.amazon.com/lambda/latest/dg/go-programming-model-handler-types.html

Slide 46

Slide 46 text

Lambda x Golang 1.x 2. use -ldflags="-s -w" to reduce the binary size - -s … Omit the symbol table and debug information. - -w … Omit the DWARF symbol table. by https://golang.org/cmd/link/

Slide 47

Slide 47 text

go build without -ldflags="-s -w"
$ ls -lh bin/
total 145704
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-daily-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-hourly-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:27 monitor-late-data
-rwxr-xr-x 1 kenju-wagatsuma staff 14M Aug 9 22:26 put-s3
-rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:27 sns-notification

Slide 48

Slide 48 text

go build -ldflags="-s -w"
$ ls -lh bin/
total 96776
-rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-daily-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-hourly-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 8.2M Aug 9 22:05 aggregate-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:05 monitor-late-data
-rwxr-xr-x 1 kenju-wagatsuma staff 9.3M Aug 9 22:05 put-s3
-rwxr-xr-x 1 kenju-wagatsuma staff 5.4M Aug 9 22:05 sns-notification

Slide 49

Slide 49 text

[NOTE] Golang dependencies
$ dep status
PROJECT CONSTRAINT VERSION REVISION LATEST PKGS USED
github.com/aws/aws-lambda-go ^1.0.0 v1.2.0 4d30d0f e630af3 4
github.com/aws/aws-sdk-go ^1.14.23 v1.15.3 cc03a15 36aaf21 37
github.com/go-ini/ini * v1.38.1 358ee76 358ee76 1
github.com/jmespath/go-jmespath * 0b12d6b 1
github.com/kenju/go-cloudwatch branch master branch master c60ecc3 c60ecc3 1
github.com/kenju/go-nested-counter branch master branch master c6ca0d8 c6ca0d8 1
github.com/kenju/go-slack-webhook branch master branch master 627aa7e 627aa7e 1
github.com/satori/go.uuid ^1.2.0 v1.2.0 f58768c f58768c 1

Slide 50

Slide 50 text

Serverless Framework - Why Serverless Framework? - Development Speed (easy to configure) - Motivation (never used it in production before) - (Might be) easy to migrate to CloudFormation/SAM later - Why not Apex? - Easy to deploy Lambda, but that’s all - Why not SAM? - Writing CloudFormation stacks from scratch might take time - However, sam-local is a great tool, so we might migrate to SAM in the near future

Slide 51

Slide 51 text

Serverless Framework - $ serverless deploy --stage (dev|prod|staging) - change stage via `--stage` option - $ serverless metrics - show simple metrics for functions - $ serverless invoke - Useful Lambda Event fixtures can be found at ... https://github.com/aws/aws-lambda-go/tree/master/events/testdata

Slide 52

Slide 52 text

serverless invoke
$ cat Makefile | tail -n8
run-sns-notification: deploy-dev
	serverless invoke \
		--log \
		--stage dev \
		--function sns-notification \
		--path fixtures/sns-events.json

Slide 53

Slide 53 text

Thank you! ヽ(*・ω・)ノ