Amit Ramesh, Qui Nguyen - Building Stream Processing Applications

Do you have a stream of data that you would like to process in real time? There are many components with Python APIs that you can put together to build a stream processing application. We will go through some common design patterns, tradeoffs, and available components/frameworks for designing such systems. We will solve an example problem during the presentation to make these points concrete. Much of what will be presented is based on experience gained from building production pipelines for the real-time processing of ad streams at Yelp. This talk will cover topics such as consistency, availability, idempotency, and scalability.

https://us.pycon.org/2017/schedule/presentation/392/

PyCon 2017

June 05, 2017

Transcript

  1. Building Stream Processing Applications Amit Ramesh Qui Nguyen

  2. Yelp’s Mission Connecting people with great local businesses.

  3. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  4. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  5. None
  6. Data processing measurements from a sensor clicking on ads

  7. Data processing measurements from a sensor clicking on ads average

    value in the last minute total clicks on a day
  8. Batch Finite chunk of data Operations defined over the entire

    input Data processing: Batch or stream 8
  9. Batch Finite chunk of data Operations defined over the entire

    input Stream Unbounded stream of events flowing in Events are processed continuously (possibly with state) Data processing: Batch or stream 9
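The batch/stream distinction above can be sketched in plain Python (hypothetical helper names, not from the talk): a batch operation is defined over the entire finite input, while a streaming operation consumes events one at a time and carries state between them.

```python
# Batch: the operation is defined over the entire finite input.
def batch_average(values):
    return sum(values) / len(values)

# Stream: events are processed continuously, with state carried across events.
def stream_averages(events):
    count, total = 0, 0.0          # state maintained between events
    for value in events:           # 'events' may be unbounded
        count += 1
        total += value
        yield total / count        # a running result after every event
```

The generator never needs to see the whole input; it emits an up-to-date result after each event, which is the essence of the streaming model.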
  10. Why stream processing over batch? • Lower latency on results

    • Most data is unbounded, so streaming model is more flexible
  11. Why stream processing over batch? • Lower latency on results

    • Most data is unbounded, so streaming model is more flexible Day 12 Day 13
  12. Our evolution

  13. Our evolution

  14. Our evolution

  15. Our evolution mrjob

  16. Our evolution

  17. Our evolution

  18. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  19. Example problem: ad campaign metrics Ad Yelp

  20. ad { id: 1200834, campaign_id: 2001, user_id: 9zkjacn81m, timestamp: 1490732147

    } view { id: 1200834, timestamp: 1490732150 } click { id: 1200834, timestamp: 1490732168 }
  21. Metrics (views, clicks) for each campaign over time Ad Yelp

  22. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  23. Source of streaming data Stream processing pipelines Stream processing engine

    Storage Data sink
  24. Stream processing pipelines Stream processing engine Storage Data sink Source

    of streaming data
  25. Types of operations 1. Ingestion 2. Stateless transforms 3. Stateful

    transforms 4. Keyed stateful transforms 5. Publishing
  26. Operations: 1. Ingestion Kafka Reader Operation Source

  27. Operations: 1. Ingestion Kafka Reader Operation Source

    from pyspark.streaming.kafka import KafkaUtils

    ad_stream = KafkaUtils.createDirectStream(
        streaming_context,
        topics=['ad_events'],
        kafkaParams={...},
    )
  28. Operations: 2. Stateless transforms Operation Transform Operation

  29. Operations: 2a. Stateless transforms Filter e.g., filtering

  30. Operations: 2a. Stateless transforms Filter e.g., filtering

    def is_not_from_bot(event):
        return event['ip'] not in bot_ips

    filtered_stream = ad_stream.filter(is_not_from_bot)
  31. Operations: 2b. Stateless transforms Project e.g., projection

  32. Operations: 2b. Stateless transforms Project e.g., projection

    desired_fields = ['ad_id', 'campaign_id']

    def trim_event(event):
        return {key: event[key] for key in desired_fields}

    projected_stream = ad_stream.map(trim_event)
  33. Operations: 3. Stateful transforms On windows of data Transform Sliding

    window
  34. Operations: 3. Stateful transforms On windows of data Transform Sliding

    window Tumbling window
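The two window types on this slide can be sketched in plain Python (illustrative helpers; real engines window over time rather than event counts): tumbling windows partition the stream into non-overlapping chunks, while sliding windows overlap, starting a new window every `slide` events.

```python
def tumbling_windows(events, size):
    # Non-overlapping: each event belongs to exactly one window.
    return [events[i:i + size] for i in range(0, len(events), size)]

def sliding_windows(events, size, slide):
    # Overlapping: a new window of `size` events starts every `slide` events.
    return [events[i:i + size] for i in range(0, len(events) - size + 1, slide)]
```

With `slide == size`, a sliding window degenerates into a tumbling one.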
  35. Operations: 3. Stateful transforms e.g., aggregation Sum 5 6 0

    1 1 3 0 1 2
  36. Operations: 3. Stateful transforms e.g., aggregation Sum 5 6 0

     1 1 3 0 1 2

    aggregated_stream = event_stream.reduceByWindow(
        reduceFunc=operator.add,
        invReduceFunc=None,
        windowDuration=4,
        slideDuration=3,
    )
  37. Operations: 4. Keyed stateful transforms Shuffle Group events by key

    (shuffle) within each window before transform Transform
  38. Operations: 4a. Keyed stateful transforms c_id: 1 views: 1 c_id:

    2 views: 2 c_id: 1 views: 1 c_id: 2 views: 1 c_id: 2 views: 1 sum views by c_id e.g., aggregate views by campaign_id
  39. Operations: 4a. Keyed stateful transforms e.g., aggregate views by campaign_id

    aggregated_views = view_stream.reduceByKeyAndWindow(
        func=operator.add,
        invFunc=None,
        windowDuration=3,
        slideDuration=3,
    )

     c_id: 1 views: 1 c_id: 2 views: 2 c_id: 1 views: 1 c_id: 2 views: 1 c_id: 2 views: 1 sum views by c_id
  40. Operations: 4b. Keyed stateful transforms Can also be on more

    than one stream, e.g., join by id Shuffle Join
  41. Operations: 4b. Keyed stateful transforms e.g., join by ad_id Join

    by ad_id Ad ad_id: 11 c_id: 1 ad_id: 22 c_id: 2 ad_id: 22 time: 5 ad_id: 11 time: 7 ad_id: 11 ad: { c_id: 1 }, view: { time: 7 } ad_id: 22 ad: { c_id: 2 }, view: { time: 5 }
  42. Operations: 4b. Keyed stateful transforms e.g., join by ad_id

    windowed_ad_stream = ad_stream.window(
        windowDuration=2,
        slideDuration=2,
    )
    windowed_view_stream = view_stream.window(
        windowDuration=2,
        slideDuration=2,
    )
    joined_stream = windowed_ad_stream.join(
        windowed_view_stream,
    )
  43. Operations: 5. Publishing Sink File writer Operation

  44. Operations: 5. Publishing results_stream.saveAsTextFiles('s3://my.bucket/results/') File writer Operation Sink

  45. Operations: Summary 1. Ingestion 2. Stateless transforms: on single events

    a. Filtering b. Projections 3. Stateful transforms: on windows of events 4. Keyed stateful transforms a. On single streams, transform by key b. Join events from several streams by key 5. Publishing
  46. Putting it together: campaign metrics Ad filter read join by

    ad id transform write sum by campaign project transform write filter read project filter read project
  47. read Ad filter read join by ad id transform write

    sum by campaign project transform write filter read project filter read project { ip: bot_id, ... } { ip: OK_id, ... }
  48. filter Ad filter read join by ad id transform write

    sum by campaign project transform write filter read project filter read project { ip: bot_id, ... } { ip: OK_id, ... }
  49. project Ad filter read join by ad id transform write

    sum by campaign project transform write filter read project filter read project { ip: OK_id, scoring: { ... }, ... }
  50. project Ad filter read join by ad id transform write

    sum by campaign project transform write filter read project filter read project { ip: OK_id, scoring: { ... }, ... }
  51. join by ad id filter join by ad id transform

     write sum by campaign project transform write filter project filter project { ad_id: 1, ad_data: ... } { ad_id: 1, view_data: ... }
  52. join by ad id filter join by ad id transform

     write sum by campaign project transform write filter project filter project { ad_id: 1, ad_data: ..., view_data: ..., }
  53. transform filter join by ad id transform write sum by

     campaign project transform write filter project filter project { ad_id: 1, campaign_id: 7, view: true, click: false }
  54. sum by campaign join by ad id transform write sum

     by campaign transform write { ad_id: 1, campaign_id: 7, view: true, click: false } { ad_id: 23, campaign_id: 7, view: true, click: false }
  55. sum by campaign join by ad id transform write sum

     by campaign transform write { campaign_id: 7, views: 2, clicks: 0 }
  56. write db.write( campaign_id=7, views=2, clicks=0, ) transform write sum by

     campaign transform write
  57. Ad campaign metrics pipeline Ad filter read join by ad

    id transform write sum by campaign project transform write filter read project filter read project
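The full pipeline on this slide can be sketched for a single window in plain Python (a toy model with hypothetical names, not the talk's PySpark code): filter bot traffic, join views and clicks to ads by ad id, then sum views and clicks by campaign.

```python
from collections import defaultdict

bot_ips = {'bot_id'}  # assumed blocklist, as in the earlier filter example

def process_window(ad_events, view_events, click_events):
    # filter: drop events originating from known bots
    ads = [e for e in ad_events if e['ip'] not in bot_ips]
    # join by ad id: which ads were viewed / clicked in this window
    viewed = {e['id'] for e in view_events}
    clicked = {e['id'] for e in click_events}
    # transform + sum by campaign: one (views, clicks) pair per campaign
    totals = defaultdict(lambda: {'views': 0, 'clicks': 0})
    for ad in ads:
        counts = totals[ad['campaign_id']]
        counts['views'] += ad['id'] in viewed
        counts['clicks'] += ad['id'] in clicked
    return dict(totals)
```

The result per window matches the slide's example: campaign 7 with two viewed, unclicked ads yields `{'views': 2, 'clicks': 0}`.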
  58. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  59. Horizontal scalability: Basic idea

  60. Horizontal scalability: Basic idea

  61. Horizontal scalability: Basic idea

  62. Horizontal scalability: Basic idea

  63. Horizontal scalability: Why?

  64. Horizontal scalability: Why?

  65. Horizontal scalability: How? Random partitioning Partitioning

  66. Horizontal scalability: How? Ad read read read filter filter filter

    project project project read read read filter filter filter project project project Partitioning Random partitioning
  67. project project project join by ad id Horizontal scalability: How?

    Partitioning
  68. project project project join by ad id Horizontal scalability: How?

    Partitioning Keyed partitioning
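The two partitioning schemes can be sketched as routing functions (illustrative only; real frameworks do this internally): random partitioning spreads stateless work evenly, while keyed partitioning hashes the key so the same key always lands on the same worker, keeping per-key state local.

```python
import random

def random_partition(event, num_workers):
    # stateless transforms: any worker can process any event
    return random.randrange(num_workers)

def keyed_partition(event, num_workers, key='campaign_id'):
    # keyed transforms: same key -> same worker, deterministically
    return hash(event[key]) % num_workers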
  69. Horizontal scalability: watch out! Hot spots / data skew transform

    sum by campaign transform
  70. Horizontal scalability: watch out! Hot spots / data skew Keyed

    partitioning transform sum by campaign transform
  71. Horizontal scalability: Summary • Random partitioning for stateless transforms •

    Keyed partitioning for keyed transformations • Watch out for hot spots, and use appropriate mitigation strategy
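One common mitigation for hot keys (not named on the slide, but widely used) is key salting: split a hot key into several sub-keys so its load spreads across workers, then run a second, cheap aggregation to merge the per-salt partial sums.

```python
import random
from collections import defaultdict

NUM_SALTS = 4  # assumed fan-out factor for a hot key

def salted_key(campaign_id):
    # route one hot campaign to NUM_SALTS different sub-keys / workers
    return (campaign_id, random.randrange(NUM_SALTS))

def combine_partials(partials):
    # second-stage aggregation: merge per-salt partial sums per campaign
    totals = defaultdict(int)
    for (campaign_id, _salt), views in partials.items():
        totals[campaign_id] += views
    return dict(totals)
```

The tradeoff is an extra aggregation step in exchange for removing the single-worker bottleneck.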
  72. I. Why stream processing? II. Putting an application together Example

    problem Components and data operations III. Design principles and tradeoffs Horizontal scalability Handling failures Idempotency Consistency versus availability
  73. Idempotency

  74. Idempotency An idempotent operation can be applied more than once

    and have the same effect.
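The definition can be made concrete with a toy example (hypothetical helpers): an absolute write is idempotent because applying it twice leaves the same state, while an increment is not, because each retry changes the state again.

```python
def set_views(table, campaign_id, views):
    # idempotent: applying this twice has the same effect as once
    table[campaign_id] = views

def increment_views(table, campaign_id):
    # NOT idempotent: a retry after a failure double-counts
    table[campaign_id] = table.get(campaign_id, 0) + 1
```

This is exactly why retries after failures are safe only around idempotent operations.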
  75. Ad filter read join by ad id transform write sum

    by campaign project transform write filter read project filter read project
  76. Ad filter read join by ad id transform write sum

    by campaign project transform write filter read project filter read project project write
  77. What operations are idempotent? Transforms: filters, projections, etc. No side

     effects! Stateful operations
  78. Ad filter read join by ad id transform write sum

    by campaign project transform write filter read project filter read project project write
  79. Idempotent writes with unique keys
     campaign_id = 7, minute = 20, views = 2
     [campaign_id | minute | views] → [7 | 20 | 2]
     campaign_id = 7, minute = 20, views = 2

  80. Writes that aren’t idempotent
     [campaign_id | hour | views] → [7 | 2 | 0]

  81. Writes that aren’t idempotent
     campaign_id = 7, hour = 2, views += 1
     [campaign_id | hour | views] → [7 | 2 | 1]

  82. Writes that aren’t idempotent
     campaign_id = 7, hour = 2, views += 1
     [campaign_id | hour | views] → [7 | 2 | 2]
     campaign_id = 7, hour = 2, views += 1

  83. Support for idempotency
     campaign_id = 7, hour = 2, views += 1, version = 1
     [campaign_id | hour | views] → [7 | 2 | 1]
     campaign_id = 7, hour = 2, views += 1, version = 1
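The versioned write from slide 83 can be sketched as a conditional update (a toy model, not a real database API): the increment is applied only if its version has not been seen yet, so a redelivered event has no effect.

```python
def versioned_increment(row, delta, version):
    # apply the increment only for an unseen version; otherwise it is a no-op
    if version <= row['version']:
        return row                      # duplicate delivery: state unchanged
    return {'views': row['views'] + delta, 'version': version}
```

Attaching a monotonically increasing version (or a unique event id) is what turns a non-idempotent `views += 1` into an idempotent operation.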
  84. Idempotency in streaming pipelines Both in output to data sink

     and in local state (joining, aggregation) Re-processing of events - some frameworks provide exactly-once guarantees
  85. Consistency vs. availability

  86. Always a tradeoff between consistency and availability when handling failures

  87. Consistency Every read sees a current view of the data.

     Availability Capacity to serve requests.
  88. A = 9 A = 9

  89. A = 3 A = 3 A = 3 A

    = 3
  90. A = 9 A = 9

  91. A = 9 A = 9 Consistency > availability A

    = 3 A = 3
  92. A = 9 A = 9 Consistency > availability A

    = 3 Error: write unavailable
  93. A = 9 A = 9 Availability > consistency A

    = 3 A = 3
  94. A = 9 A = 3 Availability > consistency Not

    consistent: 3 != 9
  95. Prioritizing consistency or availability Applies to systems for both your

    data source and data sink Source Stream processing engine Data sink Storage
  96. Prioritizing consistency or availability Applies to systems for both your

    data source and data sink • Some systems pick one, be aware • Others let you choose ◦ ex. Cassandra - how many replicas respond to write? Streaming applications run continuously
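The Cassandra-style "how many replicas respond?" knob can be sketched with a toy quorum model (illustrative, not the driver API): a write is acknowledged after W of N replicas accept it, a read consults R replicas, and the read is guaranteed to see the latest value only when R + W > N.

```python
def write(replicas, version, value, w):
    # only the first w replicas accept before the write is acknowledged
    for i in range(w):
        replicas[i] = (version, value)

def read(replicas, r):
    # worst case: read the *other* end of the replica list,
    # then keep the newest (highest-version) value seen
    return max(replicas[-r:])[1]
```

Raising W or R buys consistency at the cost of availability: more replicas must respond before the operation succeeds.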
  97. Prioritizing consistency or availability Depends on the needs of your

    application Metrics (views, clicks) for each campaign over time
  98. Prioritizing consistency or availability More consistency Metrics (views, clicks) for

    each campaign over time
  99. Prioritizing consistency or availability More availability Internal graphs Metrics (views,

    clicks) for each campaign over time
  100. Conclusion • Stream processing: data processing with operations on events

    or windows of events • Horizontal scalability, as data will grow and change over time • Handle failures appropriately ◦ Keep operations idempotent, for retries ◦ Tradeoff between availability and consistency
  101. www.yelp.com/careers/ We're Hiring!

  102. @YelpEngineering fb.com/YelpEngineers engineeringblog.yelp.com github.com/yelp