Serverless Streaming Log Architecture ~ Theory & Practice ~

Kenju Wagatsuma

Agenda
Part I … Theory
- What is “streaming”?
- “Batch” vs “Streaming”
- “Event Time” vs “Processing Time”
- Lambda Architecture
- Kappa Architecture
- Apache Hadoop/Storm/Spark/Kafka/Flink
- Late Logs
  - Discarding, Watermark, Trigger, Accumulation
Part II … Practice
- Overall Data-flow
- Watermark Implementation
- Aggregation
  - Kinesis -> Lambda -> DynamoDB
  - DynamoDB Streams -> Lambda -> DynamoDB
- Monitoring
  - “GetRecords.IteratorAgeMilliseconds”
  - DynamoDB Streams -> Lambda -> Slack
- Misc (Cognito, Golang, Serverless Framework)

Who are you?
Kenju Wagatsuma
- Server-side Engineer at Cookpad Inc.
- Ruby, Golang, AWS
- https://github.com/kenju/
- “Header Bidding 導入によるネットワーク広告改善の開発事情” (“Development notes on improving network ads by introducing Header Bidding”)

Part I: Theory - Streaming

Definition & Glossary

What is “streaming”?
Definition:
- a type of data processing engine that is designed with infinite data sets in mind. (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)

“Batch” vs “Streaming”
- Batch Processing
  - Process grouped logs at once, at occasional intervals
- Streaming Processing
  - Micro-batch
    - ex) AWS Lambda with “Batch Size = 2 ~ n” (n is a natural number that is not too large)
  - Real Streaming
    - ex) AWS Lambda with “Batch Size = 1”

“Event Time” vs “Processing Time”
- Event time, which is the time at which events actually occurred.
- Processing time, which is the time at which events are observed in the system.
Figure: Example time domain mapping from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

Architecture

Lambda Architecture
- introduced by Nathan Marz, the creator of Apache Storm
  - “How to beat the CAP theorem” http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
- Batch Layer + Serving Layer + Speed Layer
  - Batch Layer … re-computable, can ensure Consistency
  - Serving Layer … merges views from Batch Layer & real-time logs from Speed Layer
  - Speed Layer … low latency

Figure: lambda architecture from http://lambda-architecture.net/

Presentation: https://speakerdeck.com/wata/realtime-ad-log-aggregation-and-utilization

Kappa Architecture
- introduced by Jay Kreps, a co-founder and CEO of Confluent, who previously worked at LinkedIn, where Apache Kafka was created
  - “Questioning the Lambda Architecture” https://www.oreilly.com/ideas/questioning-the-lambda-architecture
- Streaming Layer + Serving Layer

Figure: kappa architecture from https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry

Lambda vs Kappa Architecture

| | Pros | Cons |
|---|---|---|
| Lambda Architecture | Robust data consistency | Harder to maintain multiple layers |
| Kappa Architecture | Simple implementation | Needs extra work to guarantee data consistency |

Apache Hadoop/Storm/Spark/Kafka/Flink

| Name | Speciality | Batch Processing | Stream Processing |
|---|---|---|---|
| Hadoop | HDFS (File System), YARN, MapReduce | O | X |
| Storm | Topology (Spout + Bolt), Tuple, Task | X | O |
| Kafka | Broker, Producer/Consumer (“Un-Managed Kinesis”?) | X | O |
| Spark | inspired by Hadoop’s MapReduce engine | O | O (Spark Streaming) |
| Flink | Batch and Streaming in One System, ML Support, DataStream API | O | O |

Common Problems

Late Logs Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

How to tackle “Late Logs”?
- Discarding
  - simply discard late logs
- Watermarks
  - “all input data with event times less than X have been observed.”
- Triggers
  - declaring when the output for a window should be materialized
- Accumulation
  - accumulate multiple results that are observed for the same window
Read https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 for more details :)

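What is “Watermark”?
Definition: a special mark contained in electronic documents, pictures, music etc. that is used to stop people from copying them
Example: audio watermark detection, photo watermark, copyright watermark
(This is the everyday dictionary sense. In streaming, the term is borrowed for a different idea: a marker of how far event time has progressed.)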

How to “Watermark”
- A. Save state in external storage
  - e.g. RDS, DynamoDB
- B. Calculate in memory
  - formula … watermark = max(event time) - late_threshold
    - e.g. late threshold = 10 minutes
  - use `median(event time)` instead of `max(event time)` to handle logs dated too far in the future
    - e.g. mobile devices’ system clocks are sometimes set incorrectly by users

Figure: https://cdn.oreillystatic.com/en/assets/1/event/155/Watermarks_%20Time%20and%20progress%20in%20streaming%2

Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

Part II: Practice - storeTV


Overall Data-flow
- Kinesis Stream receives logs from Android clients directly via kinesis:PutRecord
  - once per 90 sec, from at most 15,000 devices
- Lambda polls Kinesis and aggregates the logs as impressions by incrementing counters
- DynamoDB stores the incremented records with the UpdateItem (ADD) operation
- other Lambda function(s) poll DynamoDB Streams and aggregate hourly/daily
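
The "aggregate by increment" step can be illustrated without any AWS dependency. In this sketch the `Log` type, its fields, and the in-memory `table` map are all assumptions standing in for the real record format and the DynamoDB table; `applyDeltas` only mimics how ADD increments a counter and creates it when it is missing.

```go
package main

import "fmt"

// Log is a hypothetical impression log; the field names are assumptions.
type Log struct {
	AdID   string
	Minute string // event time truncated to the minute, e.g. "2018-08-09T22:26"
}

// countImpressions folds one batch of logs into per-key deltas, so each key
// needs only one increment per batch instead of one write per log.
func countImpressions(batch []Log) map[Log]int {
	deltas := make(map[Log]int)
	for _, l := range batch {
		deltas[l]++
	}
	return deltas
}

// applyDeltas stands in for DynamoDB UpdateItem with ADD: it increments the
// stored counter and starts from zero when the key does not exist yet.
func applyDeltas(table map[Log]int, deltas map[Log]int) {
	for k, d := range deltas {
		table[k] += d
	}
}

func main() {
	table := make(map[Log]int) // in-memory stand-in for the DynamoDB table
	key := Log{AdID: "ad-1", Minute: "2018-08-09T22:26"}

	applyDeltas(table, countImpressions([]Log{key, key}))
	applyDeltas(table, countImpressions([]Log{key}))

	fmt.Println(table[key])
}
```

Because each batch only adds a delta, two Lambda invocations touching the same key do not need to read the current value first.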

Aggregation

Aggregation
- Aggregate gradually: per minute -> per 10 min -> hourly -> daily
- Partition usual logs/lagged logs, and write them to separate tables
  - NOT discarding (for now) to see how many logs would be discarded
  - determine whether each log’s timestamp is behind the watermark or not
    - watermark … the median of all timestamps
      - because users can change system clocks to the future

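Watermark Implementation

// watermark returns the median event time of the batch.
// Assumes EventTime is an integer type and that Mean is a helper (not shown
// on the slide) which averages a slice of EventTime.
func (sr *StreamRecords) watermark(eventTimes []EventTime) (median EventTime) {
	l := len(eventTimes)
	if l == 0 {
		return median // empty batch: return the zero value instead of panicking
	}
	// 1. sort (sort.Ints accepts only []int, so use sort.Slice for []EventTime)
	sort.Slice(eventTimes, func(i, j int) bool { return eventTimes[i] < eventTimes[j] })
	// 2. get the median
	if l%2 == 0 {
		// when even: mean of the two middle elements
		median = Mean(eventTimes[l/2-1 : l/2+1])
	} else {
		// when odd: the middle element
		median = eventTimes[l/2]
	}
	return median
}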

Monitoring

Monitoring - The Four Golden Signals
What to “monitor”?
Google’s well-known “SRE” book lists them in the “The Four Golden Signals” section of Chapter 6:
- Latency
- Traffic
- Errors
- Saturation

Monitoring - CloudWatch Dashboard
- Latency
  - How long does it take?
- Traffic
  - How many Get/PutRecords?
- Errors
  - Availability?
  - How many errors occur?
- Saturation
  - Any delayed data?
  - IteratorAge?

Monitoring - Custom Metrics & Alarm
- Create custom metrics
  - via cloudwatch:PutMetricData
- Flexible alarm setting
  - Period
  - Evaluation Period
  - Datapoints to Alarm
Figure: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
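
How the three alarm settings interact can be sketched as an "M out of N" check. This is a simplified model for illustration, not CloudWatch's implementation: it assumes each datapoint is already aggregated over one Period, uses a fixed "greater than or equal" comparison, and ignores missing data. The function name is made up.

```go
package main

import "fmt"

// shouldAlarm reports whether at least datapointsToAlarm of the most recent
// evaluationPeriods datapoints are at or above the threshold ("M out of N").
func shouldAlarm(datapoints []float64, threshold float64, evaluationPeriods, datapointsToAlarm int) bool {
	if evaluationPeriods <= 0 || datapointsToAlarm <= 0 {
		return false
	}
	if evaluationPeriods > len(datapoints) {
		evaluationPeriods = len(datapoints)
	}
	recent := datapoints[len(datapoints)-evaluationPeriods:]
	breaching := 0
	for _, v := range recent {
		if v >= threshold {
			breaching++
		}
	}
	return breaching >= datapointsToAlarm
}

func main() {
	lateLogCounts := []float64{0, 12, 3, 15, 20} // one value per Period, oldest first

	// 2 of the last 3 datapoints (3, 15, 20) reach the threshold of 10.
	fmt.Println(shouldAlarm(lateLogCounts, 10, 3, 2))
	// Requiring all 3 of the last 3 does not fire, because of the 3.
	fmt.Println(shouldAlarm(lateLogCounts, 10, 3, 3))
}
```

Setting Datapoints to Alarm lower than Evaluation Period tolerates a single quiet period in a noisy metric, while setting them equal requires a sustained breach.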

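Monitoring - Implementation Example

// Needs "log", "github.com/aws/aws-sdk-go/aws" and
// "github.com/aws/aws-sdk-go/service/cloudwatch".
dataInput := &cloudwatch.PutMetricDataInput{
	Namespace: aws.String("StoreTvAdLambdaMetrics"),
	MetricData: []*cloudwatch.MetricDatum{
		{
			Dimensions: []*cloudwatch.Dimension{
				{
					Name:  aws.String("Function"),
					Value: aws.String("monitor-late-data"),
				},
			},
			MetricName: aws.String("LateLogCount"),
			Unit:       aws.String("Count"),
			Value:      aws.Float64(float64(lateLogCount)),
		},
	},
}
// PutMetricData returns an output and an error; do not drop the error.
if _, err := cloudWatchClient.PutMetricData(dataInput); err != nil {
	log.Printf("failed to put metric data: %v", err)
}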

Monitoring - Kinesis Streams IteratorAge
- “GetRecords.IteratorAgeMilliseconds” Metrics
  - can monitor “how much stream processing is delayed”
  - https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html
- [CloudWatch Alarms -> SNS Topic -> Lambda -> Slack]

Monitoring - Stream Saturation
- Simply calculate the diff between the updated records’ timestamp and the current time, and compare it with the threshold.
  - timestamp … get from DynamoDB Stream event records
  - threshold … pass via ENV (currently 10 min)
  - calculation … time.Since(timestamp).Minutes() >= threshold
- [DynamoDB Stream -> Custom CloudWatch Metrics -> SNS -> Lambda -> Slack]

Analysis

Cognito
- pass a role to the Android clients
  - which can kinesis:PutRecord to the target Kinesis Stream ARN
- use unauthenticated (guest) identities in an identity pool
  - because the Android app does not need any login feature
- much more secure than embedding API_KEY/CREDENTIAL_KEY in the Android clients

Lambda x Golang 1.x
- Officially supported since January 15th, 2018
  - https://dev.classmethod.jp/cloud/aws/aws-lambda-supports-go/
- aws/aws-lambda-go
  - https://github.com/aws/aws-lambda-go
  - the event types make it easy to grasp what kind of JSON event records will be available
- IMHO
  - One of my favorite languages among the officially supported ones
    - Type, Runtime Performance, ecosystem, etc.
  - goroutines/channels have too much overhead for running on Lambda

Lambda x Golang 1.x
1. use the init() function for initializing the global vars
> A single instance of your Lambda function will never handle multiple events simultaneously
https://docs.aws.amazon.com/lambda/latest/dg/go-programming-model-handler-types.html
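
A rough sketch of the init() pattern, using only the standard library. The variable names, the `TABLE_NAME` environment variable, and the `handle` function are placeholders; in a real function, `handle` would be the handler passed to `lambda.Start` from aws-lambda-go, and `main` here only simulates two invocations on the same warm instance.

```go
package main

import (
	"fmt"
	"os"
)

// Package-level state is set up once per Lambda instance (cold start) and
// reused by every later invocation handled by that instance.
var (
	tableName   string // hypothetical setting read from the environment
	invocations int    // no mutex: one instance handles one event at a time
)

func init() {
	tableName = os.Getenv("TABLE_NAME") // TABLE_NAME is an assumed variable name
	if tableName == "" {
		tableName = "impressions" // placeholder default
	}
}

// handle stands in for the real Lambda handler.
func handle(event string) string {
	invocations++
	return fmt.Sprintf("#%d %s -> %s", invocations, event, tableName)
}

func main() {
	// Simulate two sequential invocations on the same warm instance.
	fmt.Println(handle("first"))
	fmt.Println(handle("second"))
}
```

Expensive setup such as SDK clients would go in the same place, so that only cold starts pay for it.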

Lambda x Golang 1.x
2. use -ldflags="-s -w" to reduce the binary size
- -s … Omit the symbol table and debug information.
- -w … Omit the DWARF symbol table.
by https://golang.org/cmd/link/

go build without -ldflags="-s -w"
$ ls -lh bin/
total 145704
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-daily-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-hourly-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:27 monitor-late-data
-rwxr-xr-x 1 kenju-wagatsuma staff 14M Aug 9 22:26 put-s3
-rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:27 sns-notification

go build -ldflags="-s -w"
$ ls -lh bin/
total 96776
-rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-daily-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-hourly-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 8.2M Aug 9 22:05 aggregate-logs
-rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:05 monitor-late-data
-rwxr-xr-x 1 kenju-wagatsuma staff 9.3M Aug 9 22:05 put-s3
-rwxr-xr-x 1 kenju-wagatsuma staff 5.4M Aug 9 22:05 sns-notification

[NOTE] Golang dependencies
$ dep status
PROJECT                             CONSTRAINT     VERSION        REVISION  LATEST   PKGS USED
github.com/aws/aws-lambda-go        ^1.0.0         v1.2.0         4d30d0f   e630af3  4
github.com/aws/aws-sdk-go           ^1.14.23       v1.15.3        cc03a15   36aaf21  37
github.com/go-ini/ini               *              v1.38.1        358ee76   358ee76  1
github.com/jmespath/go-jmespath     *                             0b12d6b            1
github.com/kenju/go-cloudwatch      branch master  branch master  c60ecc3   c60ecc3  1
github.com/kenju/go-nested-counter  branch master  branch master  c6ca0d8   c6ca0d8  1
github.com/kenju/go-slack-webhook   branch master  branch master  627aa7e   627aa7e  1
github.com/satori/go.uuid           ^1.2.0         v1.2.0         f58768c   f58768c  1

Serverless Framework
- Why Serverless Framework?
  - Development Speed (easy to configure)
  - Motivation (never used before in production)
  - (Might be) easy to migrate to CloudFormation/SAM later
- Why not Apex?
  - Easy to deploy Lambda, but that’s all
- Why not SAM?
  - Writing CloudFormation stacks from scratch might take time
  - However, sam-local is a great tool, so we might migrate to SAM in the near future

Serverless Framework
- $ serverless deploy --stage (dev|prod|staging)
  - change stage via `--stage` option
- $ serverless metrics
  - show simple metrics for functions
- $ serverless invoke
  - Useful Lambda Event fixtures can be found at ... https://github.com/aws/aws-lambda-go/tree/master/events/testdata

serverless invoke
$ cat Makefile | tail -n8
run-sns-notification: deploy-dev
	serverless invoke \
		--log \
		--stage dev \
		--function sns-notification \
		--path fixtures/sns-events.json

Thank you! ヽ(*・ω・)ﾉ
