Serverless Streaming Log Architecture ~ Theory & Practice ~

Stream Processing

Ken Wagatsuma

August 12, 2018

Transcript

  1. Serverless Streaming
    Log Architecture
    ~ Theory & Practice ~
    Kenju Wagatsuma

  2. Agenda
    Part I … Theory
    - What is “streaming” ?
    - “Batch” vs “Streaming”
    - “Event Time” vs “Processing Time”
    - Lambda Architecture
    - Kappa Architecture
    - Apache Hadoop/Storm/Spark/Kafka/Flink
    - Late Logs
      - Discarding, Watermark, Trigger, Accumulation
    Part II … Practice
    - Overall Data-flow
    - Watermark Implementation
    - Aggregation
      - Kinesis -> Lambda -> DynamoDB
      - DynamoDB Streams -> Lambda -> DynamoDB
    - Monitoring
      - “GetRecords.IteratorAgeMilliseconds”
      - DynamoDB Streams -> Lambda -> Slack
    - Misc (Cognito, Golang, Serverless Framework)

  3. Who are you?
    Kenju Wagatsuma
    - Serverside Engineer at Cookpad Inc.
    - Ruby, Golang, AWS
    - https://github.com/kenju/
    - “The Development Story Behind Improving Network Ads by Introducing Header Bidding”

  4. Theory - Streaming
    Part I

  5. Definition & Glossary

  6. What is “streaming”?
    Definition:
    - a type of data processing engine that is designed with infinite data sets in mind.
    (https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101)

  7. “Batch” vs “Streaming”
    - Batch Processing
      - Collect logs into groups and process each group at once, at occasional intervals
    - Streaming Processing
      - Micro-batch
        - ex) AWS Lambda with “Batch Size = 2 ~ n” (n is a not-too-large natural number.)
      - Real Streaming
        - ex) AWS Lambda with “Batch Size = 1”
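The difference between the two Batch Size settings can be sketched as follows. This is only an illustration of how a stream gets cut into invocations; `chunk` is a hypothetical helper, not part of the Lambda API.

```go
package main

import "fmt"

// chunk splits records into consecutive batches of at most batchSize items.
// batchSize = 1 corresponds to handling exactly one record per invocation.
func chunk(records []string, batchSize int) [][]string {
	if batchSize < 1 {
		batchSize = 1
	}
	var batches [][]string
	for start := 0; start < len(records); start += batchSize {
		end := start + batchSize
		if end > len(records) {
			end = len(records)
		}
		batches = append(batches, records[start:end])
	}
	return batches
}

func main() {
	logs := []string{"a", "b", "c", "d", "e"}
	fmt.Println(len(chunk(logs, 1))) // 5 invocations, one record each
	fmt.Println(len(chunk(logs, 2))) // 3 invocations: [a b] [c d] [e]
}
```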

  8. “Event Time” vs “Processing Time”
    - Event time, which is the time at which events actually occurred.
    - Processing time, which is the time at which events are observed in the system.
    Figure: Example time domain mapping from https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
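A minimal sketch of the two time domains, assuming each log carries a client-side timestamp and the system stamps its own arrival time. `LogRecord`, `newRecord` and `Skew` are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// LogRecord is a hypothetical log entry carrying both time domains.
type LogRecord struct {
	EventTime      time.Time // when the event actually occurred (set by the client)
	ProcessingTime time.Time // when the system observed the event
}

// newRecord builds a LogRecord from Unix seconds.
func newRecord(eventUnix, processingUnix int64) LogRecord {
	return LogRecord{
		EventTime:      time.Unix(eventUnix, 0),
		ProcessingTime: time.Unix(processingUnix, 0),
	}
}

// Skew returns how far processing time lags behind event time.
func (r LogRecord) Skew() time.Duration {
	return r.ProcessingTime.Sub(r.EventTime)
}

func main() {
	r := newRecord(1534068000, 1534068090)
	fmt.Println(r.Skew()) // 1m30s
}
```

In an ideal system the skew would be zero; late logs are the records for which it grows large.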

  9. Architecture

  10. Lambda Architecture
    - introduced by Nathan Marz, the creator of Apache Storm
      - “How to beat the CAP theorem”
        http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
    - Batch Layer + Serving Layer + Speed Layer
      - Batch Layer … re-computable, can ensure Consistency
      - Serving Layer … merges views from the Batch Layer with real-time logs from the Speed Layer
      - Speed Layer … low latency
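The Serving Layer's merge can be sketched roughly as below, assuming both layers expose simple per-key counters. `mergeViews` and the map-based views are illustrative assumptions, not how any particular framework implements it.

```go
package main

import "fmt"

// mergeViews combines a batch view (recomputed, consistent) with the
// speed layer's real-time increments that the batch has not absorbed yet.
func mergeViews(batchView, speedView map[string]int) map[string]int {
	merged := make(map[string]int, len(batchView))
	for k, v := range batchView {
		merged[k] = v
	}
	for k, v := range speedView {
		merged[k] += v
	}
	return merged
}

func main() {
	batch := map[string]int{"ad-1": 100, "ad-2": 40}
	speed := map[string]int{"ad-1": 3, "ad-3": 1}
	fmt.Println(mergeViews(batch, speed)) // map[ad-1:103 ad-2:40 ad-3:1]
}
```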

  11. Figure: lambda architecture from http://lambda-architecture.net/

  12. Presentation: https://speakerdeck.com/wata/realtime-ad-log-aggregation-and-utilization

  13. Presentation: https://speakerdeck.com/wata/realtime-ad-log-aggregation-and-utilization

  14. Kappa Architecture
    - introduced by Jay Kreps, a co-founder and CEO of Confluent, who co-created Apache Kafka while at LinkedIn
      - “Questioning the Lambda Architecture”
        https://www.oreilly.com/ideas/questioning-the-lambda-architecture
    - Streaming Layer + Serving Layer

  15. Figure: kappa architecture from https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry

  16. Lambda vs Kappa Architecture
    Lambda Architecture
    - Pros: Robust data consistency
    - Cons: Harder to maintain multiple layers
    Kappa Architecture
    - Pros: Simple implementation
    - Cons: Needs extra work to guarantee data consistency

  17. Apache Hadoop/Storm/Spark/Kafka/Flink
    (O = supported, X = not supported)
    Name    Speciality                               Batch Processing  Stream Processing
    Hadoop  HDFS (File System), YARN, MapReduce      O                 X
    Storm   Topology (Spout + Bolt), Tuple, Task     X                 O
    Kafka   Broker, Producer/Consumer                X                 O
            (“Un-Managed Kinesis”?)
    Spark   inspired by Hadoop’s MapReduce engine    O                 O (Spark Streaming)
    Flink   Batch and Streaming in One System,       O                 O
            ML Support, DataStream API

  18. Common Problems

  19. Late Logs
    Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

  20. How to tackle “Late Logs”?
    - Discarding
      - simply discard late logs
    - Watermarks
      - “all input data with event times less than X have been observed.”
    - Triggers
      - declaring when the output for a window should be materialized
    - Accumulation
      - accumulate multiple results that are observed for the same window
    Read https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 for more details :)

  21. What is “Watermark”
    Definition (dictionary sense):
    a special mark contained in electronic documents, pictures, music, etc.
    that is used to stop people from copying them
    Example:
    audio watermark detection, photo watermark, copyright watermark

  22. How to “Watermark”
    - A. Save state in external storage
      - e.g. RDS, DynamoDB
    - B. Calculate in memory
      - formula …
        watermark = max(event time) - late_threshold
      - e.g. late threshold = 10 minutes
      - use `median(event time)` instead of `max(event time)` to cope with logs timestamped too far in the future
        - e.g. when users have set their mobile device’s system clock incorrectly
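The max-based formula, and why a single future-dated log is a problem for it, can be sketched as follows. Event times are assumed to be Unix seconds, and `maxWatermark` is a hypothetical helper.

```go
package main

import "fmt"

// maxWatermark implements watermark = max(event time) - lateThreshold.
// Times are Unix seconds; ok is false when there are no event times.
func maxWatermark(eventTimes []int64, lateThreshold int64) (wm int64, ok bool) {
	if len(eventTimes) == 0 {
		return 0, false
	}
	latest := eventTimes[0]
	for _, t := range eventTimes[1:] {
		if t > latest {
			latest = t
		}
	}
	return latest - lateThreshold, true
}

func main() {
	const tenMinutes = 10 * 60
	// The last value mimics a device whose clock is set far in the future.
	times := []int64{1000, 1060, 1120, 999999}
	wm, _ := maxWatermark(times, tenMinutes)
	fmt.Println(wm) // 999399: one bad clock makes every other log look late
}
```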

  23. Figure:
    https://cdn.oreillystatic.com/en/assets/1/event/155/Watermarks_%20Time%20and%20progress%20in%20streaming%2

  24. Figure: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

  25. Practice - storeTV
    Part II

  26. (image-only slide)

  27. Overall Data-flow
    - Kinesis Stream receives logs from Android clients directly via kinesis:PutRecord
      - once per 90 sec, from at most 15,000 devices
    - Lambda polls Kinesis and aggregates the logs into impression counts by incrementing
    - DynamoDB stores the incremented records with the UpdateItem (ADD) operation
    - other Lambda function(s) poll DynamoDB Streams and aggregate hourly/daily
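One way the "aggregate by increment" step could look, assuming each Kinesis record decodes to an ad ID plus a minute bucket. `Impression`, its fields and `countIncrements` are hypothetical, and the actual DynamoDB call is omitted; the resulting counts are what an ADD-style update would add to each item.

```go
package main

import "fmt"

// Impression is a hypothetical decoded Kinesis record.
type Impression struct {
	AdID   string
	Minute string // e.g. "2018-08-12T10:05"
}

// countIncrements folds a batch into per-key increments so that each key
// needs only one ADD-style update instead of one update per record.
func countIncrements(batch []Impression) map[string]int {
	inc := make(map[string]int)
	for _, imp := range batch {
		inc[imp.AdID+"#"+imp.Minute]++
	}
	return inc
}

func main() {
	batch := []Impression{
		{AdID: "ad-1", Minute: "2018-08-12T10:05"},
		{AdID: "ad-1", Minute: "2018-08-12T10:05"},
		{AdID: "ad-2", Minute: "2018-08-12T10:05"},
	}
	fmt.Println(countIncrements(batch)) // map[ad-1#2018-08-12T10:05:2 ad-2#2018-08-12T10:05:1]
}
```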

  28. Aggregation

  29. Aggregation

  30. Aggregation
    - Aggregate gradually: per minute -> per 10 min -> hourly -> daily
    - Partition logs into on-time logs and late logs, and write them to separate tables
      - NOT discarding late logs (for now), to see how many logs would be discarded
    - determine whether each log’s timestamp is behind the watermark or not
      - watermark … the median of all timestamps
        - because users can set their system clocks to the future
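The gradual roll-up relies on mapping each timestamp to the start of its window. A minimal sketch, assuming Unix-second timestamps and UTC-aligned windows (`bucketStart` is a hypothetical helper):

```go
package main

import (
	"fmt"
	"time"
)

// bucketStart rounds a Unix timestamp (seconds, non-negative) down to the
// start of its window. Windows that divide a day align to UTC boundaries.
func bucketStart(unixSec, windowSec int64) int64 {
	return unixSec - unixSec%windowSec
}

func main() {
	t := time.Date(2018, 8, 12, 10, 37, 42, 0, time.UTC).Unix()
	// per minute -> per 10 min -> hourly -> daily
	for _, w := range []int64{60, 600, 3600, 86400} {
		fmt.Println(time.Unix(bucketStart(t, w), 0).UTC().Format(time.RFC3339))
	}
	// 2018-08-12T10:37:00Z
	// 2018-08-12T10:30:00Z
	// 2018-08-12T10:00:00Z
	// 2018-08-12T00:00:00Z
}
```

Daily buckets in a local time zone would need an offset before rounding; that case is left out here.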

  31. Watermark Implementation
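    // watermark returns the median of the given event times.
    // sort.Ints only accepts []int, so this presumably relies on EventTime being
    // an alias of int; Mean is a helper that is not shown on the slide.
    // Callers must pass a non-empty slice, otherwise the even branch panics.
    func (sr *StreamRecords) watermark(eventTimes []EventTime) (median EventTime) {
        sort.Ints(eventTimes) // 1. sort (in place)
        l := len(eventTimes)  // 2. get the median
        if l%2 == 0 {         // when even: mean of the two middle values
            median = Mean(eventTimes[l/2-1 : l/2+1])
        } else { // when odd
            median = EventTime(eventTimes[l/2])
        }
        return median
    }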
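A self-contained variant for experimentation, under the assumption that event times are Unix seconds and that the late threshold from the theory part is subtracted from the median. `medianWatermark` and `partition` are hypothetical names; this is a sketch rather than the production code.

```go
package main

import (
	"fmt"
	"sort"
)

// medianWatermark returns median(eventTimes) - lateThreshold without
// modifying the caller's slice; ok is false for an empty input.
func medianWatermark(eventTimes []int64, lateThreshold int64) (wm int64, ok bool) {
	n := len(eventTimes)
	if n == 0 {
		return 0, false
	}
	sorted := append([]int64(nil), eventTimes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	median := sorted[n/2]
	if n%2 == 0 {
		median = (sorted[n/2-1] + sorted[n/2]) / 2
	}
	return median - lateThreshold, true
}

// partition splits event times into on-time and late relative to the watermark.
func partition(eventTimes []int64, watermark int64) (onTime, late []int64) {
	for _, t := range eventTimes {
		if t < watermark {
			late = append(late, t)
		} else {
			onTime = append(onTime, t)
		}
	}
	return onTime, late
}

func main() {
	times := []int64{1060, 1000, 999999, 1120, 400}
	wm, _ := medianWatermark(times, 600)
	onTime, late := partition(times, wm)
	fmt.Println(wm, onTime, late) // 460 [1060 1000 999999 1120] [400]
}
```

Unlike the max-based formula, the future-dated value 999999 does not drag the watermark forward.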

  32. Monitoring

  33. Monitoring

  34. Monitoring - The Four Golden Signals
    What to “monitor”? Google’s well-known “SRE” book lists them in Chapter 6, in the section “The Four Golden Signals”:
    - Latency
    - Traffic
    - Errors
    - Saturation

  35. Monitoring - CloudWatch Dashboard
    - Latency
      - How long does it take?
    - Traffic
      - How many Get/PutRecords?
    - Errors
      - Availability?
      - How many errors occur?
    - Saturation
      - Any delayed data?
      - IteratorAge?

  36. Monitoring - Custom Metrics & Alarm
    - Create custom metrics
      - via cloudwatch:PutMetricData
    - Flexible alarm setting
      - Period
      - Evaluation Period
      - Datapoints to Alarm
    Figure: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
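How “Evaluation Period” and “Datapoints to Alarm” interact ("M out of N") can be sketched as below. This is a simplified model that assumes a greater-than comparison and ignores missing-data handling; `shouldAlarm` is a hypothetical helper, not a CloudWatch API.

```go
package main

import "fmt"

// shouldAlarm reports whether at least datapointsToAlarm of the most recent
// evaluationPeriods datapoints exceed threshold ("M out of N").
func shouldAlarm(datapoints []float64, threshold float64, evaluationPeriods, datapointsToAlarm int) bool {
	if evaluationPeriods < 0 {
		evaluationPeriods = 0
	}
	if evaluationPeriods > len(datapoints) {
		evaluationPeriods = len(datapoints)
	}
	recent := datapoints[len(datapoints)-evaluationPeriods:]
	breaching := 0
	for _, v := range recent {
		if v > threshold {
			breaching++
		}
	}
	return breaching >= datapointsToAlarm
}

func main() {
	lateLogCounts := []float64{0, 12, 3, 15, 20} // one datapoint per period
	fmt.Println(shouldAlarm(lateLogCounts, 10, 3, 2)) // true: 15 and 20 breach
	fmt.Println(shouldAlarm(lateLogCounts, 10, 3, 3)) // false: 3 does not breach
}
```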

  37. Monitoring - Implementation Example
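    // needs: "log", github.com/aws/aws-sdk-go/aws, github.com/aws/aws-sdk-go/service/cloudwatch
    dataInput := &cloudwatch.PutMetricDataInput{
        Namespace: aws.String("StoreTvAdLambdaMetrics"),
        MetricData: []*cloudwatch.MetricDatum{
            {
                Dimensions: []*cloudwatch.Dimension{
                    {Name: aws.String("Function"), Value: aws.String("monitor-late-data")},
                },
                MetricName: aws.String("LateLogCount"),
                Unit:       aws.String("Count"),
                Value:      aws.Float64(float64(lateLogCount)),
            },
        },
    }
    // PutMetricData returns (output, error); log the error instead of dropping it.
    if _, err := cloudWatchClient.PutMetricData(dataInput); err != nil {
        log.Printf("failed to put LateLogCount metric: %v", err)
    }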

  38. Monitoring - Kinesis Streams IteratorAge
    - “GetRecords.IteratorAgeMilliseconds” metric
      - shows how far stream processing is lagging behind
      - https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-cloudwatch.html
    - [CloudWatch Alarms -> SNS Topic -> Lambda -> Slack]

  39. Monitoring - Stream Saturation
    - Simply calculate the difference between the updated records’ timestamp and the current time, and compare it with the threshold.
      - timestamp … taken from the DynamoDB Stream event records
      - threshold … passed via ENV (currently 10 min)
      - calculation …
        time.Since(timestamp).Minutes() >= threshold
    - [DynamoDB Stream -> Custom CloudWatch Metrics -> SNS -> Lambda -> Slack]
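The saturation check can be sketched as a small runnable program. The environment variable name `SATURATION_THRESHOLD_MINUTES` and both helper functions are assumptions made for illustration; timestamps are passed as Unix seconds so the check can be exercised with fixed values.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// thresholdMinutes reads the threshold from the environment, falling back to
// 10 minutes. The variable name is hypothetical.
func thresholdMinutes() float64 {
	v, err := strconv.ParseFloat(os.Getenv("SATURATION_THRESHOLD_MINUTES"), 64)
	if err != nil || v <= 0 {
		return 10
	}
	return v
}

// isSaturated reports whether a record updated at updatedUnix (Unix seconds)
// is at least threshold minutes old at nowUnix.
func isSaturated(updatedUnix, nowUnix int64, threshold float64) bool {
	age := time.Unix(nowUnix, 0).Sub(time.Unix(updatedUnix, 0))
	return age.Minutes() >= threshold
}

func main() {
	now := time.Now().Unix()
	threshold := thresholdMinutes()
	fmt.Println(isSaturated(now-15*60, now, threshold)) // 15 minutes old
	fmt.Println(isSaturated(now-60, now, threshold))    // 1 minute old
}
```

With the default 10-minute threshold the first record counts as delayed and the second does not.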

  40. Analysis

  41. Analysis

  42. Misc

  43. Cognito
    - pass a role to the Android clients
      - the role allows kinesis:PutRecord on the target Kinesis Stream ARN
    - use an identity pool with unauthenticated identities
      - because the Android app does not need any login feature
    - much more secure than embedding API_KEY/CREDENTIAL_KEY in the Android clients

  44. Lambda x Golang 1.x
    - Officially supported since January 15th, 2018
      - https://dev.classmethod.jp/cloud/aws/aws-lambda-supports-go/
    - aws/aws-lambda-go
      - https://github.com/aws/aws-lambda-go
      - its typed event structs make it easy to see what each JSON event record contains
    - IMHO
      - One of my favorite languages among the officially supported ones
        - Types, runtime performance, ecosystem, etc.
      - goroutines/channels have too much overhead for running on Lambda

  45. Lambda x Golang 1.x
    1. use the init() function to set up global vars
    - globals can be reused across invocations without locking, because:
    > A single instance of your Lambda function will never handle multiple events
    simultaneously
    https://docs.aws.amazon.com/lambda/latest/dg/go-programming-model-handler-types.html
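A minimal sketch of the init() pattern that runs without the Lambda runtime. `handle` stands in for the function that would be passed to `lambda.Start` from aws-lambda-go; the `TABLE_NAME` variable, its default value and the counter are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// Package-level state: initialised once per execution environment and
// reused by every invocation that environment handles afterwards.
var (
	tableName   string
	invocations int
)

func init() {
	tableName = os.Getenv("TABLE_NAME") // hypothetical variable name
	if tableName == "" {
		tableName = "impressions" // hypothetical default
	}
}

// handle stands in for the handler passed to lambda.Start.
func handle(event string) string {
	invocations++ // no mutex: one event at a time per instance
	return fmt.Sprintf("%s -> %s (#%d)", event, tableName, invocations)
}

func main() {
	fmt.Println(handle("first"))
	fmt.Println(handle("second"))
}
```

Expensive setup such as creating AWS clients would go in init() in the same way, so that warm invocations skip it.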

  46. Lambda x Golang 1.x
    2. use -ldflags="-s -w" to reduce the binary size
    - -s … Omit the symbol table and debug information.
    - -w … Omit the DWARF symbol table.
    by https://golang.org/cmd/link/

  47. go build without -ldflags="-s -w"
    $ ls -lh bin/
    total 145704
    -rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-daily-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-hourly-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:26 aggregate-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 12M Aug 9 22:27 monitor-late-data
    -rwxr-xr-x 1 kenju-wagatsuma staff 14M Aug 9 22:26 put-s3
    -rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:27 sns-notification

  48. go build -ldflags="-s -w"
    $ ls -lh bin/
    total 96776
    -rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-daily-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 8.3M Aug 9 22:05 aggregate-hourly-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 8.2M Aug 9 22:05 aggregate-logs
    -rwxr-xr-x 1 kenju-wagatsuma staff 7.8M Aug 9 22:05 monitor-late-data
    -rwxr-xr-x 1 kenju-wagatsuma staff 9.3M Aug 9 22:05 put-s3
    -rwxr-xr-x 1 kenju-wagatsuma staff 5.4M Aug 9 22:05 sns-notification

  49. [NOTE] Golang dependencies
    $ dep status
    PROJECT CONSTRAINT VERSION REVISION LATEST PKGS USED
    github.com/aws/aws-lambda-go ^1.0.0 v1.2.0 4d30d0f e630af3 4
    github.com/aws/aws-sdk-go ^1.14.23 v1.15.3 cc03a15 36aaf21 37
    github.com/go-ini/ini * v1.38.1 358ee76 358ee76 1
    github.com/jmespath/go-jmespath * 0b12d6b 1
    github.com/kenju/go-cloudwatch branch master branch master c60ecc3 c60ecc3 1
    github.com/kenju/go-nested-counter branch master branch master c6ca0d8 c6ca0d8 1
    github.com/kenju/go-slack-webhook branch master branch master 627aa7e 627aa7e 1
    github.com/satori/go.uuid ^1.2.0 v1.2.0 f58768c f58768c 1

  50. Serverless Framework
    - Why Serverless Framework?
      - Development speed (easy to configure)
      - Motivation (had never used it in production before)
      - (Might be) easy to migrate to CloudFormation/SAM later
    - Why not Apex?
      - Easy to deploy Lambda, but that’s all
    - Why not SAM?
      - Writing CloudFormation stacks from scratch might take time
      - However, sam-local is a great tool, so we might migrate to SAM in the near future

  51. Serverless Framework
    - $ serverless deploy --stage (dev|prod|staging)
      - change stage via `--stage` option
    - $ serverless metrics
      - show simple metrics for functions
    - $ serverless invoke
      - Useful Lambda Event fixtures can be found at ...
        https://github.com/aws/aws-lambda-go/tree/master/events/testdata

    View Slide

  52. serverless invoke
    $ cat Makefile | tail -n8
    run-sns-notification: deploy-dev
    serverless invoke \
    --log \
    --stage dev \
    --function sns-notification \
    --path fixtures/sns-events.json

  53. Thank you!
    ヽ(*・ω・)ノ
