Amit Ramesh, Qui Nguyen - Building Stream Processing Applications

Do you have a stream of data that you would like to process in real time? There are many components with Python APIs that you can put together to build a stream processing application. We will go through some common design patterns, tradeoffs and available components / frameworks for designing such systems. We will solve an example problem during the presentation to make these points concrete. Much of what will be presented is based on experience gained from building production pipelines for the real-time processing of ad streams at Yelp. This talk will cover topics such as consistency, availability, idempotency, scalability, etc.

https://us.pycon.org/2017/schedule/presentation/392/

PyCon 2017

June 05, 2017
Transcript

  1. Building Stream Processing
    Applications
    Amit Ramesh Qui Nguyen

  2. Yelp’s Mission
    Connecting people with great
    local businesses.

  3. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  4. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  6. Data processing
    measurements
    from a sensor
    clicking on ads

  7. Data processing
    measurements
    from a sensor
    clicking on ads
    average value in
    the last minute
    total clicks on a
    day

  8. Batch
    Finite chunk of data
    Operations defined over the entire input
    Data processing: Batch or stream
    8

  9. Batch
    Finite chunk of data
    Operations defined over the entire input
    Stream
    Unbounded stream of events flowing in
    Events are processed continuously
    (possibly with state)
    Data processing: Batch or stream
    9
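The batch/stream distinction above can be sketched in a few lines of plain Python (an editorial illustration, not the talk's code): a batch job sees its whole, finite input at once, while a stream processor folds each arriving event into running state.

```python
# Batch: the entire input is available, so aggregate in one pass.
def batch_total(events):
    return sum(events)

# Stream: events arrive one at a time; keep running state and emit
# an up-to-date result after every event.
def stream_totals(events):
    total = 0
    for event in events:  # `events` could just as well be unbounded
        total += event
        yield total

clicks = [1, 0, 2, 1]
print(batch_total(clicks))          # 4: one answer over the whole input
print(list(stream_totals(clicks)))  # [1, 1, 3, 4]: a result per event
```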

  10. Why stream processing over batch?
    ● Lower latency on results
    ● Most data is unbounded, so streaming model is more
    flexible

  11. Why stream processing over batch?
    ● Lower latency on results
    ● Most data is unbounded, so streaming model is more
    flexible
    Day 12 Day 13

  12. Our evolution

  13. Our evolution

  14. Our evolution

  15. Our evolution
    mrjob

  16. Our evolution

  17. Our evolution

  18. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  19. Example problem: ad campaign metrics
    Ad Yelp

  20. ad {
    id: 1200834,
    campaign_id: 2001,
    user_id: 9zkjacn81m,
    timestamp: 1490732147
    }
    view {
    id: 1200834,
    timestamp: 1490732150
    }
    click {
    id: 1200834,
    timestamp: 1490732168
    }

  21. Metrics (views, clicks) for each
    campaign over time
    Ad Yelp

  22. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  23. Source of
    streaming data
    Stream processing pipelines
    Stream
    processing
    engine
    Storage
    Data sink

  24. Stream processing pipelines
    Stream
    processing
    engine
    Storage
    Data sink
    Source of
    streaming data

  25. Types of operations
    1. Ingestion
    2. Stateless transforms
    3. Stateful transforms
    4. Keyed stateful transforms
    5. Publishing

  26. Operations: 1. Ingestion
    Kafka
    Reader
    Operation
    Source

  27. Operations: 1. Ingestion
    Kafka
    Reader
    Operation
    Source
    from pyspark.streaming.kafka import KafkaUtils

    ad_stream = KafkaUtils.createDirectStream(
        streaming_context,
        topics=['ad_events'],
        kafkaParams={...},
    )

  28. Operations: 2. Stateless transforms
    Operation Transform Operation

  29. Operations: 2a. Stateless transforms
    Filter
    e.g., filtering

  30. Operations: 2a. Stateless transforms
    Filter
    e.g., filtering
    def is_not_from_bot(event):
        return event['ip'] not in bot_ips

    filtered_stream = ad_stream.filter(is_not_from_bot)

  31. Operations: 2b. Stateless transforms
    Project
    e.g., projection

  32. Operations: 2b. Stateless transforms
    Project
    e.g., projection
    desired_fields = ['ad_id', 'campaign_id']

    def trim_event(event):
        return {key: event[key] for key in desired_fields}

    projected_stream = ad_stream.map(trim_event)

  33. Operations: 3. Stateful transforms
    On windows of data
    Transform
    Sliding window

  34. Operations: 3. Stateful transforms
    On windows of data
    Transform
    Sliding window
    Tumbling window
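Framework aside, the two window types can be sketched in plain Python (an illustration, not the talk's code): a tumbling window is just a sliding window whose slide interval equals its length.

```python
def sliding_windows(events, length, slide):
    # Start a new window every `slide` events; each covers `length` events.
    return [events[i:i + length]
            for i in range(0, len(events) - length + 1, slide)]

def tumbling_windows(events, length):
    # Tumbling = sliding with slide == length, so windows never overlap.
    return sliding_windows(events, length, length)

events = [0, 1, 1, 3, 0, 1, 2]
print(sliding_windows(events, length=4, slide=3))  # [[0, 1, 1, 3], [3, 0, 1, 2]]
print(tumbling_windows([0, 1, 1, 3, 0, 1], 3))     # [[0, 1, 1], [3, 0, 1]]
```

Summing the two sliding windows gives 5 and 6, matching the aggregation example on the next slides.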

  35. Operations: 3. Stateful transforms
    e.g., aggregation
    Sum 5 6
    0 1 1 3 0 1 2

  36. Operations: 3. Stateful transforms
    e.g., aggregation
    Sum 5 6
    0 1 1 3 0 1 2
    aggregated_stream = event_stream.reduceByWindow(
        func=operator.add,
        windowLength=4,
        slideInterval=3,
    )

  37. Operations: 4. Keyed stateful transforms
    Shuffle
    Group events by key (shuffle) within each window before
    transform
    Transform

  38. Operations: 4a. Keyed stateful transforms
    c_id: 1
    views: 1
    c_id: 2
    views: 2
    c_id: 1
    views: 1
    c_id: 2
    views: 1
    c_id: 2
    views: 1
    sum
    views
    by c_id
    e.g., aggregate views by campaign_id
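The shuffle-then-reduce step above can be hand-rolled in a few lines (illustrative plain Python, not the talk's code): group the window's events by key, then sum per key.

```python
from collections import defaultdict

def sum_views_by_campaign(window):
    # Group-by-key ("shuffle") and per-key reduce in miniature.
    totals = defaultdict(int)
    for event in window:
        totals[event['c_id']] += event['views']
    return dict(totals)

window = [
    {'c_id': 1, 'views': 1},
    {'c_id': 2, 'views': 1},
    {'c_id': 2, 'views': 1},
]
print(sum_views_by_campaign(window))  # {1: 1, 2: 2}
```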

  39. Operations: 4a. Keyed stateful transforms
    e.g., aggregate views by campaign_id
    aggregated_views = view_stream.reduceByKeyAndWindow(
        func=operator.add,
        windowLength=3,
        slideInterval=3,
    )
    c_id: 1
    views: 1
    c_id: 2
    views: 2
    c_id: 1
    views: 1
    c_id: 2
    views: 1
    c_id: 2
    views: 1
    sum
    views
    by c_id

  40. Operations: 4b. Keyed stateful transforms
    Can also be on more than one stream, e.g., join by id
    Shuffle Join

  41. Operations: 4b. Keyed stateful transforms
    e.g., join by ad_id
    Join by
    ad_id
    Ad
    ad_id: 11
    c_id: 1
    ad_id: 22
    c_id: 2
    ad_id: 22
    time: 5
    ad_id: 11
    time: 7
    ad_id: 11
    ad: {
    c_id: 1
    },
    view: {
    time: 7
    }
    ad_id: 22
    ad: {
    c_id: 2
    },
    view: {
    time: 5
    }

  42. Operations: 4b. Keyed stateful transforms
    windowed_ad_stream = ad_stream.window(
        windowLength=2,
        slideInterval=2,
    )
    windowed_view_stream = view_stream.window(
        windowLength=2,
        slideInterval=2,
    )
    joined_stream = windowed_ad_stream.join(
        windowed_view_stream,
    )
    e.g., join by ad_id

  43. Operations: 5. Publishing
    Sink
    File
    writer
    Operation

  44. Operations: 5. Publishing
    results_stream.saveAsTextFiles('s3://my.bucket/results/')
    File
    writer
    Operation
    Sink

  45. Operations: Summary
    1. Ingestion
    2. Stateless transforms: on single events
    a. Filtering
    b. Projections
    3. Stateful transforms: on windows of events
    4. Keyed stateful transforms
    a. On single streams, transform by key
    b. Join events from several streams by key
    5. Publishing
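The five operation types above can be strung together in a framework-free miniature (an editorial sketch with illustrative names; generators stand in for streams):

```python
bot_ips = {'bot_id'}

def ingest(source):                        # 1. ingestion
    yield from source

def is_not_from_bot(event):                # 2a. stateless filter
    return event['ip'] not in bot_ips

def project(event):                        # 2b. stateless projection
    return {'campaign_id': event['campaign_id'], 'views': event['views']}

def sum_by_campaign(window):               # 3/4. keyed stateful transform
    totals = {}
    for e in window:
        totals[e['campaign_id']] = totals.get(e['campaign_id'], 0) + e['views']
    return totals

events = [
    {'ip': 'bot_id', 'campaign_id': 7, 'views': 1},
    {'ip': 'ok_id', 'campaign_id': 7, 'views': 1},
    {'ip': 'ok_id', 'campaign_id': 7, 'views': 1},
]
window = [project(e) for e in ingest(events) if is_not_from_bot(e)]
result = sum_by_campaign(window)           # 5. publishing would write this out
print(result)  # {7: 2}
```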

  46. Putting it together: campaign metrics
    Ad filter
    read join by
    ad id
    transform
    write
    sum by
    campaign
    project
    transform
    write
    filter
    read project
    filter
    read project

  47. read
    { ip: bot_id, ... }
    { ip: OK_id, ... }

  48. filter
    { ip: bot_id, ... }
    { ip: OK_id, ... }

  49. project
    { ip: OK_id, scoring: { ... }, ... }

  50. project
    { ip: OK_id, scoring: { ... }, ... }

  51. join by ad id
    { ad_id: 1, ad_data: ... }
    { ad_id: 1, view_data: ... }

  52. join by ad id
    { ad_id: 1, ad_data: ..., view_data: ... }

  53. transform
    { ad_id: 1, campaign_id: 7, view: true, click: false }

  54. sum by campaign
    { ad_id: 1, campaign_id: 7, view: true, click: false }
    { ad_id: 23, campaign_id: 7, view: true, click: false }

  55. sum by campaign
    { campaign_id: 7, views: 2, clicks: 0 }

  56. write
    db.write(
        campaign_id=7,
        views=2,
        clicks=0,
    )

  57. Ad campaign metrics pipeline
    Ad filter
    read join by
    ad id
    transform
    write
    sum by
    campaign
    project
    transform
    write
    filter
    read project
    filter
    read project

  58. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  59. Horizontal scalability: Basic idea

  60. Horizontal scalability: Basic idea

  61. Horizontal scalability: Basic idea

  62. Horizontal scalability: Basic idea

  63. Horizontal scalability: Why?

  64. Horizontal scalability: Why?

  65. Horizontal scalability: How?
    Random
    partitioning
    Partitioning

  66. Horizontal scalability: How?
    Ad
    read
    read
    read
    filter
    filter
    filter
    project
    project
    project
    read
    read
    read
    filter
    filter
    filter
    project
    project
    project
    Partitioning
    Random
    partitioning

  67. project
    project
    project
    join by ad id
    Horizontal scalability: How?
    Partitioning

  68. project
    project
    project
    join by ad id
    Horizontal scalability: How?
    Partitioning
    Keyed partitioning

  69. Horizontal scalability: watch out!
    Hot spots / data skew
    transform
    sum by
    campaign
    transform

  70. Horizontal scalability: watch out!
    Hot spots / data skew
    Keyed partitioning
    transform
    sum by
    campaign
    transform

  71. Horizontal scalability: Summary
    ● Random partitioning for stateless transforms
    ● Keyed partitioning for keyed transformations
    ● Watch out for hot spots, and use appropriate
    mitigation strategy
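One common mitigation for hot keys, not detailed on the slides, is key salting: split a hot key across N sub-partitions and merge the partial aggregates downstream. A plain-Python sketch (names are illustrative):

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # number of sub-partitions per hot key (tuning assumption)

def salted_key(campaign_id):
    # Append a random salt so one hot campaign spreads over many workers.
    return (campaign_id, random.randrange(NUM_SALTS))

partials = defaultdict(int)
for _ in range(1000):              # a flood of events for one hot key
    partials[salted_key(7)] += 1   # lands on up to NUM_SALTS partitions

# Downstream, merge the partial sums back to a per-campaign total.
total = sum(v for (cid, _), v in partials.items() if cid == 7)
print(total)  # 1000
```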

  72. I. Why stream processing?
    II. Putting an application together
    Example problem
    Components and data operations
    III. Design principles and tradeoffs
    Horizontal scalability
    Handling failures
    Idempotency
    Consistency versus availability

  73. Idempotency

  74. Idempotency
    An idempotent operation can be
    applied more than once and have
    the same effect.
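The definition can be demonstrated in a few lines of plain Python (an editorial illustration of the idea, not the talk's code): a keyed overwrite survives a retry unchanged, while an increment double-counts.

```python
# Idempotent: overwrite by unique key. Applying the write twice leaves
# the store in the same state as applying it once.
store = {}

def record_views(campaign_id, minute, views):
    store[(campaign_id, minute)] = views

record_views(7, 20, views=2)
after_once = dict(store)
record_views(7, 20, views=2)       # duplicate delivery / retry
assert store == after_once         # same effect: still one row, views=2

# Not idempotent: increment. A retried write changes the result.
counter = {'views': 0}

def bump_views():
    counter['views'] += 1

bump_views()
bump_views()                       # the retry double-counts
print(counter['views'])  # 2
```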

  75. Ad filter
    read join by
    ad id
    transform
    write
    sum by
    campaign
    project
    transform
    write
    filter
    read project
    filter
    read project

  76. Ad filter
    read join by
    ad id
    transform
    write
    sum by
    campaign
    project
    transform
    write
    filter
    read project
    filter
    read project
    project
    write

  77. What operations are idempotent?
    Transforms: filters, projections, etc.
    No side effects!
    Stateful operations

  78. Ad filter
    read join by
    ad id
    transform
    write
    sum by
    campaign
    project
    transform
    write
    filter
    read project
    filter
    read project
    project
    write

  79. Idempotent writes with unique keys
    campaign_id = 7, minute = 20, views = 2
    | campaign_id | minute | views |
    | 7           | 20     | 2     |
    campaign_id = 7, minute = 20, views = 2

  80. Writes that aren’t idempotent
    | campaign_id | hour | views |
    | 7           | 2    | 0     |

  81. Writes that aren’t idempotent
    campaign_id = 7, hour = 2, views += 1
    | campaign_id | hour | views |
    | 7           | 2    | 1     |

  82. Writes that aren’t idempotent
    campaign_id = 7, hour = 2, views += 1
    | campaign_id | hour | views |
    | 7           | 2    | 2     |
    campaign_id = 7, hour = 2, views += 1

  83. Support for idempotency
    campaign_id = 7, hour = 2, views += 1, version = 1
    | campaign_id | hour | views |
    | 7           | 2    | 1     |
    campaign_id = 7, hour = 2, views += 1, version = 1
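A version-guarded write of this kind can be sketched in plain Python (illustrative names, with a dict standing in for the datastore): the store remembers the last version applied per row and drops repeats.

```python
rows = {}  # (campaign_id, hour) -> {'views': int, 'version': int}

def increment_views(campaign_id, hour, version):
    row = rows.setdefault((campaign_id, hour), {'views': 0, 'version': 0})
    if version <= row['version']:
        return                    # duplicate delivery: already applied
    row['views'] += 1
    row['version'] = version

increment_views(7, 2, version=1)
increment_views(7, 2, version=1)  # retried write is ignored
print(rows[(7, 2)]['views'])  # 1, not 2
```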

  84. Idempotency in streaming pipelines
    Both in output to data sink and in local state (joining,
    aggregation)
    Re-processing of events
    - Some frameworks provide exactly once guarantees

  85. Consistency vs. availability

  86. Always a tradeoff between
    consistency and availability
    when handling failures

  87. Consistency
    Every read sees a current view of the data.
    Availability
    Capacity to serve requests.

  88. A = 9 A = 9

  89. A = 3 A = 3
    A = 3
    A = 3

  90. A = 9 A = 9

  91. A = 9 A = 9
    Consistency > availability
    A = 3
    A = 3

  92. A = 9 A = 9
    Consistency > availability A = 3
    Error: write
    unavailable

  93. A = 9 A = 9
    Availability > consistency
    A = 3
    A = 3

  94. A = 9 A = 3
    Availability > consistency
    Not consistent:
    3 != 9

  95. Prioritizing consistency or availability
    Applies to systems for both your data source and data
    sink
    Source
    Stream
    processing
    engine
    Data sink
    Storage

  96. Prioritizing consistency or availability
    Applies to systems for both your data source and data
    sink
    ● Some systems pick one, be aware
    ● Others let you choose
    ○ ex. Cassandra - how many replicas respond to
    write?
    Streaming applications run continuously
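A toy model of that tunable choice (in the spirit of Cassandra's replica-ack setting; the function and its names are illustrative, not a real API): a write succeeds only if enough replicas can acknowledge it, otherwise the system refuses the write rather than let replicas diverge.

```python
def quorum_write(replicas, value, required_acks):
    alive = [r for r in replicas if r['alive']]
    if len(alive) < required_acks:
        # Prioritize consistency: reject the write instead of letting
        # reachable and unreachable replicas disagree.
        raise RuntimeError('write unavailable')
    for replica in alive:
        replica['value'] = value
    return len(alive)

replicas = [
    {'alive': True, 'value': 9},
    {'alive': True, 'value': 9},
    {'alive': False, 'value': 9},
]
quorum_write(replicas, 3, required_acks=2)  # succeeds: 2 of 3 replicas ack
try:
    quorum_write(replicas, 4, required_acks=3)
except RuntimeError:
    print('write unavailable')  # consistency chosen over availability
```

Lowering `required_acks` flips the tradeoff toward availability: the write succeeds, but a read from the dead replica may return stale data.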

  97. Prioritizing consistency or availability
    Depends on the needs of your application
    Metrics (views,
    clicks) for each
    campaign over time

  98. Prioritizing consistency or availability
    More consistency
    Metrics (views,
    clicks) for each
    campaign over time

  99. Prioritizing consistency or availability
    More availability
    Internal graphs
    Metrics (views,
    clicks) for each
    campaign over time

  100. Conclusion
    ● Stream processing: data processing with operations on
    events or windows of events
    ● Horizontal scalability, as data will grow and change over
    time
    ● Handle failures appropriately
    ○ Keep operations idempotent, for retries
    ○ Tradeoff between availability and consistency

  101. www.yelp.com/careers/
    We're Hiring!

  102. @YelpEngineering
    fb.com/YelpEngineers
    engineeringblog.yelp.com
    github.com/yelp
