
Scalable Event Driven Data Pipelines with AWS

This presentation is an adaptation of https://speakerdeck.com/wayoutmind/dwolla-tableau-and-aws-building-scalable-event-driven-data-pipelines-for-payments, focusing specifically on how to capture flexible, semantically atomic events. It also introduces an open source Python library, Arbalest (https://github.com/Dwolla/arbalest), and demonstrates how to compose AWS services to enable interactive querying and analysis at scale. [This presentation was originally given at Iowa Code Camp 16, Dec. 5, 2015.]

Fredrick Galoso

December 05, 2015

Transcript

  1. Building Scalable Event Driven Data Pipelines With AWS
     Fredrick Galoso (@wayoutmind)
  2. 2014: How can we quickly grow our infrastructure to keep up with demand?
  3. DWOLLA • B2C and B2B Payments platform • White label ACH API, MassPay, Next day transfers, Recurring Payments, OAuth + RESTful API • Simple, Integrated, Real-time • Partnerships with Veridian Credit Union, BBVA Compass, Comenity Capital Bank, US Department of Treasury, State of Iowa
  4. 2015: How do we maximize our data and learn from it?
  5. Before Tableau

  6. Tableau at Dwolla • Rich, actionable data visualizations • Immediate success with integration in less than 1 year • ~50% with Tableau Server or Desktop • Hundreds of workbooks and dashboards • Discoverability and measurement of many heterogeneous data sources
  7. None
  8. None
  9. Scaling to Meet Demand: Managed hosting provider → AWS VPC

  10. Scaling to Meet Demand: AWS VPC • Flexibility • Reduce response time • Cost savings • Predictability • Leverage best practices • Reuse puzzle pieces • Complexity
  11. Key Use Cases • Product management • Marketing and sales • Fraud and compliance • Customer insights and demographics
  12. Growing Pains • Blunt tools • Data discovery difficult • Poor performance • Unable to take advantage of all data • No ubiquity or consistency in facts • Manual data massaging
  13. How can we analyze information from different contexts?

  14. Data Capture and Delivery: User activity → Analysis

  15. Data Capture and Delivery: User activity → Analysis. How do we capture this information?
  16. Do we have the right data? Can we adapt? Can we answer, “What if?”
  17. Flexibility

  18. Save enough information to be able to answer future inquiries

  19. Granular, specific data that is flexible enough to adapt

  20. A user has an email address which can be in one of two states: created or verified.
      Typical RDBMS record:
      email_address | status
      jane@doe.com  | created
  21. Typical RDBMS Record
      UPDATE Users SET status = 'verified' WHERE email_address = 'jane@doe.com';
      email_address | status
      jane@doe.com  | verified
  22. What Happened? • Can we answer? • When did the user become verified? • What was the verification velocity?
  23. What Happened? • Even if we changed the schema • Context? • Explosion of individual record size • Tight coupling between storage of value and structure
  24. Atomic Values • Transaction atomicity • Operations that are all or nothing, indivisible or irreducible • Semantic atomicity • Values that have indivisible meaning; cannot be broken down further; a time-based fact
  25. None
  26. Transaction Atomicity

  27. Semantic Atomicity

  28. Can derive values from atoms, but semantically atomic values cannot be derived
  29. Semantically Atomic State: Events • Unique identity • Specific value • Structure • Time-based fact • Immutable, does not change • Separate what the data is from how it is stored
  30. State Transition With Events
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z"}
  31. Context Specific Event Values
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"}
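
      A minimal sketch (not from the deck) of the point on slide 28: derived values, such as a user's
      current status, can be rebuilt by folding the event stream above. The helper function and
      in-memory list are illustrative only.

      # Illustrative only: derive current state from semantically atomic events.
      events = [
          {"type": "user.created", "email_address": "jane@doe.com",
           "timestamp": "2015-08-18T06:36:40Z"},
          {"type": "user.verified", "email_address": "jane@doe.com",
           "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"},
      ]

      def current_state(events):
          """Fold events, oldest first, into a per-user status table."""
          state = {}
          for event in sorted(events, key=lambda e: e["timestamp"]):
              status = event["type"].split(".", 1)[1]  # "created" or "verified"
              state[event["email_address"]] = {"status": status,
                                               "as_of": event["timestamp"]}
          return state

      print(current_state(events))
      # {'jane@doe.com': {'status': 'verified', 'as_of': '2015-08-18T07:38:40Z'}}

      The derived table can be destroyed and recreated from the events at any time; the events
      themselves never change.
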
  32. Maintaining Semantic Atomicity • Additive schema changes • Use a new event type when changes to an existing event's properties would alter its fundamental meaning or identity
  33. Embracing Event Streams • Lowest common denominator for downstream systems • Apps can transform and store event streams specific to their use cases and bounded context
  34. Embracing Event Streams • Eliminate breaking schema changes and side effects • Derived values can be destroyed and recreated from scratch • Extension of the log, big data’s streaming abstraction, but explicitly semi-structured
  35. Data Capture and Delivery: User activity → Analysis. How do we capture this information? How do we manage, transform, and make data available?
  36. Data Structure Spectrum: Scale
      Unstructured: Logs (billions)
      Semi-structured: Events (billions)
      Structured: Application databases, data warehouse (100s of millions+)
  37. Semi-structured Data Infrastructure
      Unstructured: Logs (billions)
      Semi-structured: Amazon S3 (billions)
      Structured: Application databases, data warehouse (100s of millions+)
  38. Event Transport to Amazon S3: User activity → Transport (EC2) → Storage (S3)
  39. Event Payload
      Key pattern: s3://bucket-name/[event name]/[yyyy-MM-dd]/[hh]/eventId
      Example:     s3://bucket-name/user.created/2015-08-18/06/009bd890-cb8f-4896-b9e7-8bb6c9b8b8fb
      Payload:     {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
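
      As a hedged sketch of the key scheme above, this is one way an event could be written to S3
      with boto3. The put_event helper is hypothetical; the bucket name is the placeholder from the
      slide.

      import json
      import uuid
      from datetime import datetime, timezone

      import boto3  # AWS SDK for Python

      def put_event(bucket, event_name, payload):
          """Write one event under s3://bucket/[event name]/[yyyy-MM-dd]/[hh]/eventId."""
          now = datetime.now(timezone.utc)
          key = "{}/{}/{:02d}/{}".format(event_name,
                                         now.strftime("%Y-%m-%d"),
                                         now.hour,
                                         uuid.uuid4())
          boto3.client("s3").put_object(Bucket=bucket,
                                        Key=key,
                                        Body=json.dumps(payload).encode("utf-8"))
          return key

      # Example usage (hypothetical bucket name)
      put_event("bucket-name", "user.created",
                {"email_address": "jane@doe.com",
                 "timestamp": "2015-08-18T06:36:40Z"})
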
  40. None
  41. Write job → Run job → Wait … → Results

  42. Grief: Denial, Anger, Bargaining, Acceptance. map. reduce. all. the. things. again
  43. map. reduce. all. the. things. again

  44. “If I could only query this data…”

  45. Interactive Analysis at Scale • SQL, already a great abstraction • Apache Pig • Apache Hive • Cloudera Impala • Shark on Apache Spark • Amazon Redshift
  46. Structured Data Infrastructure
      Unstructured: Logs (billions)
      Semi-structured: Amazon S3 (billions)
      Structured: SQL Server, MySQL, Amazon Redshift (100s of millions+)
  47. Why Amazon Redshift? • Cost effective and faster than alternatives (Airbnb, Pinterest) • Column store (think Apache Cassandra) • ParAccel C++ backend • dist (sharding and parallelism hint) and sort (order hint) keys, sketched below • Speed up analysis feedback loop (Bit.ly) • Flexibility in data consumption/manipulation, talks PostgreSQL (Kickstarter)
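
      A minimal sketch of the dist and sort key idea: a Redshift table for user.created events,
      created over a standard PostgreSQL connection with psycopg2 (Redshift speaks the PostgreSQL
      protocol, per the last bullet). The table, column names, and connection parameters are
      illustrative, not from the deck.

      import psycopg2  # Redshift accepts PostgreSQL client connections

      DDL = """
      CREATE TABLE IF NOT EXISTS user_created (
          email_address   VARCHAR(255),
          event_timestamp TIMESTAMP
      )
      DISTKEY (email_address)   -- shard rows across nodes by user
      SORTKEY (event_timestamp) -- keep time-range scans fast
      """

      # Connection parameters are placeholders.
      conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                              dbname="analytics", user="etl", password="...")
      with conn, conn.cursor() as cur:
          cur.execute(DDL)
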
  48. AMPLab Big Data Benchmark, UC Berkeley

  49. AMPLab Big Data Benchmark, UC Berkeley

  50. AMPLab Big Data Benchmark, UC Berkeley

  51. $1,000/TB/Year (3 Year Partial Upfront Reserved Instance pricing)

  52. None
  53. None
  54. None
  55. None
  56. None
  57. None
  58. Arbalest • Big data ingestion from S3 to Redshift • Schema creation • Highly available data import strategies • Running data import jobs • Generating and uploading prerequisite artifacts for import • Open source: github.com/Dwolla/arbalest
  59. Configuration as Code • Encapsulate best practices into a lightweight Python library • Handle failures • Strategies for time series or sparse ingestion
  60. Configuration as Code • Validation of event schemas • Transformation is plain-ole-SQL • Idempotent operations
  61. Configuration as Code
      self.bulk_copy(metadata='',
                     source='google_analytics.user.created',
                     schema=JsonObject('google_analytics_user_created',
                                       Property('trackingId', 'VARCHAR(MAX)'),
                                       Property('sessionId', 'VARCHAR(36)'),
                                       Property('userId', 'VARCHAR(36)'),
                                       Property('googleUserId', 'VARCHAR(20)'),
                                       Property('googleUserTimestamp', 'TIMESTAMP')),
                     max_error_count=env('MAXERROR'))
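
      The bulk_copy call above runs inside an Arbalest pipeline. A rough sketch of how such a
      pipeline might be wired up; the module paths, class name, constructor arguments, and
      environment variable names here are assumptions, not taken from the deck, so check
      arbalest.readthedocs.org for the exact API.

      import psycopg2

      # Assumed module paths and class names; verify against the Arbalest docs.
      from arbalest.configuration import env
      from arbalest.redshift import S3CopyPipeline

      pipeline = S3CopyPipeline(
          aws_access_key_id=env('AWS_ACCESS_KEY_ID'),
          aws_secret_access_key=env('AWS_SECRET_ACCESS_KEY'),
          bucket=env('BUCKET_NAME'),
          db_connection=psycopg2.connect(env('REDSHIFT_CONNECTION')))

      # ... bulk_copy steps like the one on the slide above ...

      pipeline.run()
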
  62. Use Cases • post-MapReduce • Expose S3 catch-all data sink for analytics and reporting • Existing complex Python pipelines that could become SQL query-able at scale
  63. Data Archival and Automation • Minimize TCO based on data recency and frequency needs • Hot: Amazon Redshift • Warm: Amazon S3 • Cold: Amazon Glacier • Simple archival of event based data warehouse
  64. DELETE FROM google_analytics_user_created WHERE timestamp < '2015-01-01';
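
      One way to tie this DELETE to the hot/warm/cold tiers on the previous slide is to archive the
      rows to S3 before purging them from Redshift. A minimal sketch, assuming psycopg2, a
      placeholder bucket and IAM role, and the googleUserTimestamp column from the schema on
      slide 61:

      import psycopg2

      ARCHIVE = """
      UNLOAD ('SELECT * FROM google_analytics_user_created
               WHERE googleUserTimestamp < ''2015-01-01''')
      TO 's3://bucket-name/archive/google_analytics_user_created/'
      IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'  -- placeholder role
      GZIP
      """

      PURGE = """
      DELETE FROM google_analytics_user_created
      WHERE googleUserTimestamp < '2015-01-01'
      """

      conn = psycopg2.connect("...")  # Redshift connection string (placeholder)
      conn.autocommit = True
      with conn.cursor() as cur:
          cur.execute(ARCHIVE)  # warm copy lands in S3; lifecycle rules can push it to Glacier
          cur.execute(PURGE)    # then remove the old rows from the hot store
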

  65. Event Driven Data Pipeline Principles • Immutable semantic events: immutability changes everything • Idempotent pipelines: applying any operation twice results in the same value as applying it once (see the sketch below) • Transformations are state machines: small, composable steps with well defined transitions from one state (data value) to another
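
      For example, an idempotent, recreate-from-scratch transformation can be a single transaction
      that rebuilds a derived table from event tables; running it twice yields the same result. A
      minimal sketch with illustrative table and column names (not from the deck):

      import psycopg2

      # Rebuild a derived "current user status" table from atomic event tables.
      REBUILD = [
          "DROP TABLE IF EXISTS user_current_status",
          """
          CREATE TABLE user_current_status AS
          SELECT c.email_address,
                 CASE WHEN v.email_address IS NULL THEN 'created' ELSE 'verified' END AS status
          FROM user_created c
          LEFT JOIN user_verified v ON v.email_address = c.email_address
          """,
      ]

      conn = psycopg2.connect("...")  # Redshift connection string (placeholder)
      with conn, conn.cursor() as cur:  # one transaction: the rebuild is all or nothing
          for statement in REBUILD:
              cur.execute(statement)
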
  66. None
  67. Demo

  68. None
  69. Fredrick Galoso • @wayoutmind • github.com/fredrick • github.com/Dwolla/arbalest • arbalest.readthedocs.org