Scalable Event Driven Data Pipelines with AWS

This presentation is an adaptation of https://speakerdeck.com/wayoutmind/dwolla-tableau-and-aws-building-scalable-event-driven-data-pipelines-for-payments, focusing specifically on how to capture flexible, semantically atomic events. It also introduces an open source Python library, https://github.com/Dwolla/arbalest, and demonstrates how to compose AWS services to enable interactive querying and analysis at scale. [This presentation was originally presented at Iowa Code Camp 16, Dec. 5, 2015.]

Fredrick Galoso

December 05, 2015

Transcript

  1. DWOLLA
     • B2C and B2B payments platform
     • White label ACH API, MassPay, next day transfers, recurring payments, OAuth + RESTful API
     • Simple, integrated, real-time
     • Partnerships with Veridian Credit Union, BBVA Compass, Comenity Capital Bank, US Department of Treasury, State of Iowa

  2. Tableau at Dwolla
     • Rich, actionable data visualizations
     • Immediate success with integration, in less than 1 year
     • ~50% using Tableau Server or Desktop
     • Hundreds of workbooks and dashboards
     • Discoverability and measurement of many heterogeneous data sources

  3. Scaling to Meet Demand: AWS VPC
     • Flexibility
     • Reduce response time
     • Cost savings
     • Predictability
     • Leverage best practices
     • Reuse puzzle pieces
     • Complexity

  4. Key Use Cases
     • Product management
     • Marketing and sales
     • Fraud and compliance
     • Customer insights and demographics

  5. Growing Pains
     • Blunt tools
     • Data discovery difficult
     • Poor performance
     • Unable to take advantage of all data
     • No ubiquity or consistency in facts
     • Manual data massaging

  6. A user has an email address which can be in one of two states: created or verified.

     Typical RDBMS record:

     email_address        status
     [email protected]      created

  7. What Happened?
     • Can we answer:
       • When did the user become verified?
       • What was the verification velocity?

  8. What Happened?
     • Even if we changed the schema, where is the context?
     • Explosion of individual record size
     • Tight coupling between storage of value and structure

  9. Atomic Values
     • Transaction atomicity: operations that are all or nothing, indivisible or irreducible
     • Semantic atomicity: values that have indivisible meaning and cannot be broken down further; a time based fact

 10. Semantically Atomic State: Events
     • Unique identity
     • Specific value
     • Structure
     • Time based fact
     • Immutable, does not change
     • Separates what the data is from how it is stored

 11. Context Specific Event Values

     user.created
     {
       "email_address": "[email protected]",
       "timestamp": "2015-08-18T06:36:40Z"
     }

     user.verified
     {
       "email_address": "[email protected]",
       "timestamp": "2015-08-18T07:38:40Z",
       "workflow": "mobile"
     }

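With events like these, the questions from slide 7 reduce to simple computations over timestamps. A minimal Python sketch (the email address and helper names are illustrative, not from the deck):

    from datetime import datetime, timezone

    # Illustrative events mirroring the slide; the email is a placeholder
    user_created = {
        "email_address": "user@example.com",
        "timestamp": "2015-08-18T06:36:40Z",
    }
    user_verified = {
        "email_address": "user@example.com",
        "timestamp": "2015-08-18T07:38:40Z",
        "workflow": "mobile",
    }

    def parse(ts):
        # Parse the ISO 8601 UTC timestamps used in the events
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    # "When did the user become verified?"
    verified_at = parse(user_verified["timestamp"])

    # "What was the verification velocity?"
    velocity = verified_at - parse(user_created["timestamp"])
    print(velocity)  # 1:02:00
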
 12. Maintaining Semantic Atomicity
     • Additive schema changes
     • Use a new event type if new properties, or changes to existing events, would change the fundamental meaning or identity (see the sketch below)

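As an illustration of that rule (the properties here are hypothetical, not from the deck): an optional new property is additive, while a change in meaning warrants a new event type:

    # Additive: user.verified gains an optional property; consumers
    # that ignore unknown keys keep working, and old events stay valid.
    user_verified = {
        "email_address": "user@example.com",
        "timestamp": "2015-08-18T07:38:40Z",
        "workflow": "mobile",
        "verification_method": "sms",  # hypothetical optional addition
    }

    # Not additive: if "verified" would now mean something fundamentally
    # different (say, bank-level verification), emit a new event type
    # such as user.bank_verified instead of redefining user.verified.
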
 13. Embracing Event Streams
     • Lowest common denominator for downstream systems
     • Apps can transform and store event streams specific to their use cases and bounded context

 14. Embracing Event Streams
     • Eliminate breaking schema changes and side effects
     • Derived values can be destroyed and recreated from scratch (see the sketch below)
     • An extension of the log, big data's streaming abstraction, but explicitly semi-structured

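A sketch of destroying and recreating a derived value, assuming an in-memory event stream (event shapes follow slides 6 and 11; names are illustrative):

    def derive_email_status(events):
        """Rebuild the RDBMS-style status column of slide 6 from the
        event stream. Replaying the same events always yields the same
        derived table, so it can be dropped and recreated at will."""
        status = {}
        for event in events:
            if event["type"] == "user.created":
                status[event["email_address"]] = "created"
            elif event["type"] == "user.verified":
                status[event["email_address"]] = "verified"
        return status

    events = [
        {"type": "user.created", "email_address": "user@example.com",
         "timestamp": "2015-08-18T06:36:40Z"},
        {"type": "user.verified", "email_address": "user@example.com",
         "timestamp": "2015-08-18T07:38:40Z"},
    ]
    assert derive_email_status(events) == {"user@example.com": "verified"}
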
 15. Data Capture and Delivery
     • User activity → Analysis
     • How do we capture this information?
     • How do we manage, transform, and make data available?

 16. Data Structure Spectrum: Scale
     • Unstructured: logs (billions)
     • Semi-structured: events (billions)
     • Structured: application databases, data warehouse (100s of millions+)

 17. Semi-structured Data Infrastructure
     • Unstructured: logs (billions)
     • Semi-structured: Amazon S3 (billions)
     • Structured: application databases, data warehouse (100s of millions+)

 18. Interactive Analysis at Scale
     • SQL, already a great abstraction
     • Apache Pig
     • Apache Hive
     • Cloudera Impala
     • Shark on Apache Spark
     • Amazon Redshift

 19. Structured Data Infrastructure
     • Unstructured: logs (billions)
     • Semi-structured: Amazon S3 (billions)
     • Structured: SQL Server, MySQL, Amazon Redshift (100s of millions+)

 20. Why Amazon Redshift?
     • Cost effective and faster than alternatives (Airbnb, Pinterest)
     • Column store (think Apache Cassandra)
     • ParAccel C++ backend
     • dist (sharding and parallelism hint) and sort (order hint) keys (see the sketch below)
     • Speeds up the analysis feedback loop (Bit.ly)
     • Flexibility in data consumption and manipulation; talks PostgreSQL (Kickstarter)

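To make dist and sort keys concrete, here is a hypothetical Redshift table for user.verified events; Redshift speaks the PostgreSQL wire protocol, so psycopg2 works against it (connection details, table, and column names are placeholders, not from the deck):

    import psycopg2

    conn = psycopg2.connect(host="example.redshift.amazonaws.com",  # placeholder
                            port=5439, dbname="analytics",
                            user="admin", password="secret")
    with conn.cursor() as cursor:
        # DISTKEY shards rows across nodes (sharding and parallelism hint);
        # SORTKEY orders rows on disk (order hint), which speeds up
        # time-range scans over event data.
        cursor.execute("""
            CREATE TABLE user_verified (
                email_address    VARCHAR(256),
                event_timestamp  TIMESTAMP,
                workflow         VARCHAR(36)
            )
            DISTKEY (email_address)
            SORTKEY (event_timestamp);
        """)
    conn.commit()
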
 21. Arbalest
     • Big data ingestion from S3 to Redshift
     • Schema creation
     • Highly available data import strategies
     • Running data import jobs
     • Generating and uploading prerequisite artifacts for import
     • Open source: github.com/Dwolla/arbalest

 22. Configuration as Code
     • Encapsulate best practices into a lightweight Python library
     • Handle failures
     • Strategies for time series or sparse ingestion

 23. Configuration as Code

     self.bulk_copy(metadata='',
                    source='google_analytics.user.created',
                    schema=JsonObject('google_analytics_user_created',
                                      Property('trackingId', 'VARCHAR(MAX)'),
                                      Property('sessionId', 'VARCHAR(36)'),
                                      Property('userId', 'VARCHAR(36)'),
                                      Property('googleUserId', 'VARCHAR(20)'),
                                      Property('googleUserTimestamp', 'TIMESTAMP')),
                    max_error_count=env('MAXERROR'))

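For context, bulk_copy is called on a pipeline object. A minimal end-to-end sketch along the lines of the library's documented usage; the S3CopyPipeline class, module paths, and all bucket/credential values here are assumptions based on the Arbalest README, not code from the deck:

    import psycopg2
    from arbalest.configuration import env
    from arbalest.redshift import S3CopyPipeline
    from arbalest.redshift.schema import JsonObject, Property

    # Copy JSON events from an S3 bucket into a Redshift table
    pipeline = S3CopyPipeline(
        aws_access_key_id=env('AWS_ACCESS_KEY_ID'),
        aws_secret_access_key=env('AWS_SECRET_ACCESS_KEY'),
        bucket=env('BUCKET_NAME'),
        db_connection=psycopg2.connect(env('REDSHIFT_CONNECTION')))

    pipeline.bulk_copy(metadata='user_created_metadata',  # placeholder S3 key path
                       source='user.created',             # placeholder S3 key path
                       schema=JsonObject('user_created',
                                         Property('emailAddress', 'VARCHAR(256)'),
                                         Property('timestamp', 'TIMESTAMP')))

    pipeline.run()
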
 24. Use Cases
     • Post-MapReduce
     • Expose an S3 catch-all data sink for analytics and reporting
     • Existing complex Python pipelines that could become SQL query-able at scale

 25. Data Archival and Automation
     • Minimize TCO based on data recency and frequency needs
     • Hot: Amazon Redshift
     • Warm: Amazon S3
     • Cold: Amazon Glacier
     • Simple archival of an event based data warehouse (see the sketch below)

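One way to automate the warm-to-cold transition is an S3 lifecycle rule. A sketch using boto3 (the bucket name, prefix, and retention thresholds are illustrative):

    import boto3

    s3 = boto3.client('s3')

    # Transition event archives from S3 (warm) to Glacier (cold) after
    # 90 days, and expire them entirely after roughly seven years.
    s3.put_bucket_lifecycle_configuration(
        Bucket='example-event-archive',  # placeholder bucket
        LifecycleConfiguration={
            'Rules': [{
                'ID': 'archive-events',
                'Filter': {'Prefix': 'events/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 90, 'StorageClass': 'GLACIER'}],
                'Expiration': {'Days': 2555},
            }]
        })
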
 26. Event Driven Data Pipeline Principles
     • Immutable semantic events: immutability changes everything
     • Idempotent pipelines: applying any operation twice results in the same value as applying it once (see the sketch below)
     • Transformations are state machines: small composable steps with well defined transitions from one state (data value) to another

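A minimal illustration of the idempotence principle (function and field names are illustrative):

    def normalize_email(event):
        # An idempotent transformation step: applying it twice yields
        # the same value as applying it once, so a failed pipeline run
        # can simply be retried.
        return dict(event, email_address=event["email_address"].strip().lower())

    event = {"email_address": "  User@Example.COM "}
    assert normalize_email(event) == normalize_email(normalize_email(event))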