Slide 1

Building Scalable Event-Driven Data Pipelines with AWS
Fredrick Galoso @wayoutmind

Slide 2

2014: How can we quickly grow our infrastructure to keep up with demand?

Slide 3

DWOLLA
• B2C and B2B payments platform
• White label ACH API, MassPay, next-day transfers, recurring payments, OAuth + RESTful API
• Simple, integrated, real-time
• Partnerships with Veridian Credit Union, BBVA Compass, Comenity Capital Bank, US Department of the Treasury, State of Iowa

Slide 4

2015: How do we maximize our data and learn from it?

Slide 5

Before Tableau

Slide 6

Tableau at Dwolla
• Rich, actionable data visualizations
• Immediate success with the integration, in less than one year
• ~50% using Tableau Server or Desktop
• Hundreds of workbooks and dashboards
• Discoverability and measurement across many heterogeneous data sources

Slide 7

No content

Slide 8

No content

Slide 9

Scaling to Meet Demand
Managed hosting provider → AWS VPC

Slide 10

Scaling to Meet Demand: AWS VPC
• Flexibility
• Reduced response time
• Cost savings
• Predictability
• Leverage best practices
• Reuse puzzle pieces
• Complexity

Slide 11

Key Use Cases
• Product management
• Marketing and sales
• Fraud and compliance
• Customer insights and demographics

Slide 12

Growing Pains
• Blunt tools
• Difficult data discovery
• Poor performance
• Unable to take advantage of all data
• No ubiquity or consistency in facts
• Manual data massaging

Slide 13

How can we analyze information from different contexts?

Slide 14

Data Capture and Delivery
User activity → Analysis

Slide 15

Data Capture and Delivery
User activity → Analysis
How do we capture this information?

Slide 16

Do we have the right data? Can we adapt? Can we answer, “What if?”

Slide 17

Flexibility

Slide 18

Save enough information to be able to answer future inquiries

Slide 19

Granular data: specific, yet flexible enough to adapt

Slide 20

Typical RDBMS Record
A user has an email address that can be in one of two states: created or verified.

email_address        status
[email protected]    created

Slide 21

Typical RDBMS Record

UPDATE Users
SET status = 'verified'
WHERE email_address = '[email protected]';

email_address        status
[email protected]    verified

Slide 22

What Happened?
Can we answer:
• When did the user become verified?
• What was the verification velocity?

Slide 23

What Happened?
• Even if we changed the schema, would we have the context?
• Explosion of individual record size
• Tight coupling between the storage of a value and its structure

Slide 24

Atomic Values
• Transaction atomicity: operations that are all or nothing, indivisible or irreducible
• Semantic atomicity: values that have indivisible meaning and cannot be broken down further; a time-based fact

Slide 25

No content

Slide 26

Transaction Atomicity

Slide 27

Semantic Atomicity

Slide 28

Values can be derived from atoms, but semantically atomic values cannot themselves be derived.

Slide 29

Semantically Atomic State: Events
• Unique identity
• Specific value
• Structure
• Time-based fact
• Immutable; does not change
• Separate what the data is from how it is stored

Slide 30

State Transition With Events

user.created
{"email_address": "[email protected]", "timestamp": "2015-08-18T06:36:40Z"}

user.verified
{"email_address": "[email protected]", "timestamp": "2015-08-18T07:38:40Z"}
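Because each event is an immutable, time-stamped fact, current state can always be re-derived by folding the stream. A minimal sketch of that fold; the `current_state` helper and the event dict shape are illustrative assumptions, not part of the deck's actual pipeline:

```python
def current_state(events):
    """Fold an ordered stream of immutable events into the latest state per email address."""
    state = {}
    for event in events:
        payload = event["payload"]
        email = payload["email_address"]
        if event["type"] == "user.created":
            state[email] = {"status": "created", "since": payload["timestamp"]}
        elif event["type"] == "user.verified":
            state[email] = {"status": "verified", "since": payload["timestamp"]}
    return state

# Hypothetical events mirroring the slide's user.created / user.verified pair
events = [
    {"type": "user.created",
     "payload": {"email_address": "a@example.com", "timestamp": "2015-08-18T06:36:40Z"}},
    {"type": "user.verified",
     "payload": {"email_address": "a@example.com", "timestamp": "2015-08-18T07:38:40Z"}},
]
```

The derived table only shows "verified", but the event log still answers "when did the user become verified?" and "what was the verification velocity?".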

Slide 31

Context Specific Event Values

user.created
{"email_address": "[email protected]", "timestamp": "2015-08-18T06:36:40Z"}

user.verified
{"email_address": "[email protected]", "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"}

Slide 32

Maintaining Semantic Atomicity
• Additive schema changes only
• Use a new event type if added properties or changes to existing events alter the fundamental meaning or identity
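One way to enforce the additive rule is a simple schema diff check. A sketch under the assumption that a schema is a flat name-to-type mapping; the `is_additive_change` helper is hypothetical:

```python
def is_additive_change(old_schema, new_schema):
    """Additive: every existing property keeps its name and type; new properties may appear."""
    return all(new_schema.get(name) == type_ for name, type_ in old_schema.items())

# Hypothetical schema versions for the user.verified event
v1 = {"email_address": "VARCHAR(255)", "timestamp": "TIMESTAMP"}
v2 = dict(v1, workflow="VARCHAR(20)")   # adds a property: additive, safe
v3 = dict(v1, email_address="INTEGER")  # changes a type: breaking, needs a new event type
```

A breaking change like `v3` would instead warrant a new event type so existing consumers keep working.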

Slide 33

Embracing Event Streams
• The lowest common denominator for downstream systems
• Apps can transform and store event streams specific to their own use cases and bounded contexts

Slide 34

Embracing Event Streams
• Eliminate breaking schema changes and side effects
• Derived values can be destroyed and recreated from scratch
• An extension of the log, big data's streaming abstraction, but explicitly semi-structured

Slide 35

Data Capture and Delivery
User activity → Analysis
How do we capture this information? How do we manage, transform, and make data available?

Slide 36

Data Structure Spectrum: Scale
• Unstructured: logs; billions
• Semi-structured: events; billions
• Structured: application databases, data warehouse; 100s of millions+

Slide 37

Semi-structured Data Infrastructure
• Unstructured: logs; billions
• Semi-structured: Amazon S3; billions
• Structured: application databases, data warehouse; 100s of millions+

Slide 38

Event Transport to Amazon S3
User activity → Transport (EC2) → Storage (S3)

Slide 39

Event Payload

s3://bucket-name/[event name]/[yyyy-MM-dd]/[hh]/eventId

s3://bucket-name/user.created/2015-08-18/06/009bd890-cb8f-4896-b9e7-8bb6c9b8b8fb
{"email_address": "[email protected]", "timestamp": "2015-08-18T06:36:40Z"}
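The key layout above is fully determined by an event's name, timestamp, and id, so it can be generated (and later listed by prefix) deterministically. A sketch; the `event_key` helper is an illustrative assumption:

```python
from datetime import datetime

def event_key(bucket, event_name, timestamp, event_id):
    """Build an s3://bucket/[event name]/[yyyy-MM-dd]/[hh]/eventId object path."""
    return "s3://{}/{}/{}/{:02d}/{}".format(
        bucket, event_name, timestamp.strftime("%Y-%m-%d"), timestamp.hour, event_id)

key = event_key("bucket-name", "user.created",
                datetime(2015, 8, 18, 6, 36, 40),
                "009bd890-cb8f-4896-b9e7-8bb6c9b8b8fb")
```

Partitioning by event name, date, and hour keeps each prefix small and lets downstream jobs ingest a time slice without scanning the whole bucket.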

Slide 40

No content

Slide 41

Write job → Run job → Wait… → Results

Slide 42

Grief: Denial → Anger → Bargaining → Acceptance
map. reduce. all. the. things. again

Slide 43

map. reduce. all. the. things. again

Slide 44

“If I could only query this data…”

Slide 45

Interactive Analysis at Scale
• SQL, already a great abstraction
• Apache Pig
• Apache Hive
• Cloudera Impala
• Shark on Apache Spark
• Amazon Redshift

Slide 46

Structured Data Infrastructure
• Unstructured: logs; billions
• Semi-structured: Amazon S3; billions
• Structured: SQL Server, MySQL, Amazon Redshift; 100s of millions+

Slide 47

Why Amazon Redshift?
• Cost effective and faster than alternatives (Airbnb, Pinterest)
• Column store (think Apache Cassandra)
• ParAccel C++ backend
• dist (sharding and parallelism hint) and sort (order hint) keys
• Speeds up the analysis feedback loop (Bit.ly)
• Flexibility in data consumption/manipulation; speaks the PostgreSQL protocol (Kickstarter)

Slide 48

AMPLab Big Data Benchmark, UC Berkeley

Slide 49

AMPLab Big Data Benchmark, UC Berkeley

Slide 50

AMPLab Big Data Benchmark, UC Berkeley

Slide 51

$1,000/TB/Year (3 Year Partial Upfront Reserved Instance pricing)

Slide 52

No content

Slide 53

No content

Slide 54

No content

Slide 55

No content

Slide 56

No content

Slide 57

No content

Slide 58

Arbalest
• Big data ingestion from S3 to Redshift
• Schema creation
• Highly available data import strategies
• Running data import jobs
• Generating and uploading prerequisite artifacts for import
• Open source: github.com/Dwolla/arbalest

Slide 59

Configuration as Code
• Encapsulates best practices in a lightweight Python library
• Handles failures
• Strategies for time-series or sparse ingestion

Slide 60

Configuration as Code
• Validation of event schemas
• Transformation is plain-ole SQL
• Idempotent operations

Slide 61

Configuration as Code

self.bulk_copy(metadata='',
               source='google_analytics.user.created',
               schema=JsonObject('google_analytics_user_created',
                                 Property('trackingId', 'VARCHAR(MAX)'),
                                 Property('sessionId', 'VARCHAR(36)'),
                                 Property('userId', 'VARCHAR(36)'),
                                 Property('googleUserId', 'VARCHAR(20)'),
                                 Property('googleUserTimestamp', 'TIMESTAMP')),
               max_error_count=env('MAXERROR'))

Slide 62

Use Cases
• Post-MapReduce
• Expose an S3 catch-all data sink for analytics and reporting
• Existing complex Python pipelines that can become SQL-queryable at scale

Slide 63

Data Archival and Automation
• Minimize TCO based on data recency and frequency needs
• Hot: Amazon Redshift
• Warm: Amazon S3
• Cold: Amazon Glacier
• Simple archival of an event-based data warehouse
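The hot/warm/cold split can be driven by a simple recency rule. A sketch with illustrative thresholds; the `storage_tier` helper and the 90/365-day cutoffs are assumptions, not values from the deck:

```python
from datetime import date

def storage_tier(event_date, today, hot_days=90, warm_days=365):
    """Pick a storage tier by data recency (hypothetical thresholds)."""
    age = (today - event_date).days
    if age <= hot_days:
        return "redshift"  # hot: queried interactively
    if age <= warm_days:
        return "s3"        # warm: re-ingestable on demand
    return "glacier"       # cold: archived
```

An automation job can run this per partition and move data accordingly, keeping only recent, frequently queried events in Redshift.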

Slide 64

DELETE FROM google_analytics_user_created
WHERE timestamp < '2015-01-01';

Slide 65

Event Driven Data Pipeline Principles
• Immutable semantic events: immutability changes everything
• Idempotent pipelines: applying any operation twice results in the same value as applying it once
• Transformations are state machines: small, composable steps with well-defined transitions from one state (data value) to another
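The idempotency principle can be made concrete: a step that deduplicates a stream by id produces the same result whether it runs once or twice, so a failed-and-retried pipeline stage is harmless. A minimal sketch; the `dedupe_by_id` helper and event shape are illustrative assumptions:

```python
def dedupe_by_id(events):
    """Keep the last event per id; applying this twice equals applying it once."""
    latest = {}
    for event in events:
        latest[event["id"]] = event
    return list(latest.values())

# Hypothetical stream with a duplicate delivery of event id 1
events = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 1, "v": "c"}]
once = dedupe_by_id(events)
twice = dedupe_by_id(once)
```

Building every transformation to this standard is what lets derived values be "destroyed and recreated from scratch" safely.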

Slide 66

No content

Slide 67

Demo

Slide 68

No content

Slide 69

Fredrick Galoso
@wayoutmind
github.com/fredrick
github.com/Dwolla/arbalest
arbalest.readthedocs.org