
Scalable Event Driven Data Pipelines with AWS

This presentation is an adaptation of https://speakerdeck.com/wayoutmind/dwolla-tableau-and-aws-building-scalable-event-driven-data-pipelines-for-payments, focusing specifically on how to capture flexible, semantically atomic events. It also introduces an open source Python library, Arbalest (https://github.com/Dwolla/arbalest), and demonstrates how to compose AWS services to enable interactive querying and analysis at scale. [This presentation was originally given at Iowa Code Camp 16, Dec. 5, 2015.]

Fredrick Galoso

December 05, 2015

Transcript

  1. Building Scalable Event Driven Data Pipelines With AWS
     Fredrick Galoso (@wayoutmind)
  2. 2014: How can we quickly grow our infrastructure to keep up with demand?
  3. DWOLLA • B2C and B2B Payments platform • White label ACH API, MassPay, Next day transfers, Recurring Payments, OAuth + RESTful API • Simple, Integrated, Real-time • Partnerships with Veridian Credit Union, BBVA Compass, Comenity Capital Bank, US Department of Treasury, State of Iowa
  4. 2015: How do we maximize our data and learn from it?
  5. Before Tableau

  6. Tableau at Dwolla • Rich, actionable data visualizations • Immediate success with integration in less than 1 year • ~50% with Tableau Server or Desktop • Hundreds of workbooks and dashboards • Discoverability and measurement of many heterogeneous data sources
  7. None
  8. None
  9. Scaling to Meet Demand: Managed hosting provider → AWS VPC

  10. Scaling to Meet Demand: AWS VPC • Flexibility • Reduce response time • Cost savings • Predictability • Leverage best practices • Reuse puzzle pieces • Complexity
  11. Key Use Cases • Product management • Marketing and sales • Fraud and compliance • Customer insights and demographics
  12. Growing Pains • Blunt tools • Data discovery difficult • Poor performance • Unable to take advantage of all data • No ubiquity or consistency in facts • Manual data massaging
  13. How can we analyze information from different contexts?

  14. Data Capture and Delivery: User activity → Analysis

  15. Data Capture and Delivery: User activity → Analysis. How do we capture this information?
  16. Do we have the right data? Can we adapt? Can we answer, “What if?”
  17. Flexibility

  18. Save enough information to be able to answer future inquiries

  19. Granular, specific data that is flexible enough to adapt

  20. A user has an email address which can be in one of two states: created or verified.
      Typical RDBMS record:
      email_address | status
      jane@doe.com  | created
  21. Typical RDBMS Record
      UPDATE Users SET status = 'verified' WHERE email_address = 'jane@doe.com';
      email_address | status
      jane@doe.com  | verified
  22. What Happened? • Can we answer? • When did the user become verified? • What was the verification velocity?
  23. What Happened? • Even if we changed the schema • Context? • Explosion of individual record size • Tight coupling between storage of value and structure
  24. Atomic Values • Transaction atomicity • Operations that are all or nothing, indivisible or irreducible • Semantic atomicity • Values that have indivisible meaning; cannot be broken down further; a time-based fact
  25. None
  26. Transaction Atomicity

  27. Semantic Atomicity

  28. Can derive values from atoms, but semantically atomic values cannot be derived
  29. Semantically Atomic State: Events • Unique identity • Specific value • Structure • Time-based fact • Immutable, does not change • Separate what the data is from how it is stored
  30. State Transition With Events
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z"}
  31. Context Specific Event Values
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"}
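
      A minimal sketch (not from the deck) of the point on slide 28: derived values, such as a user's
      current status, can be rebuilt by folding the event stream above. The helper function and
      in-memory list are illustrative only.

      # Illustrative only: derive current state from semantically atomic events.
      events = [
          {"type": "user.created", "email_address": "jane@doe.com",
           "timestamp": "2015-08-18T06:36:40Z"},
          {"type": "user.verified", "email_address": "jane@doe.com",
           "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"},
      ]

      def current_state(events):
          """Fold events, oldest first, into a per-user status table."""
          state = {}
          for event in sorted(events, key=lambda e: e["timestamp"]):
              status = event["type"].split(".", 1)[1]  # "created" or "verified"
              state[event["email_address"]] = {"status": status,
                                               "as_of": event["timestamp"]}
          return state

      print(current_state(events))
      # {'jane@doe.com': {'status': 'verified', 'as_of': '2015-08-18T07:38:40Z'}}

      The derived table can be destroyed and recreated from the events at any time; the events
      themselves never change.
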
  32. Maintaining Semantic Atomicity • Additive schema changes • Use a new event type when changes to an existing event's properties would alter its fundamental meaning or identity
  33. Embracing Event Streams • Lowest common denominator for downstream systems • Apps can transform and store event streams specific to their use cases and bounded context
  34. Embracing Event Streams • Eliminate breaking schema changes and side effects • Derived values can be destroyed and recreated from scratch • Extension of the log, big data’s streaming abstraction, but explicitly semi-structured
  35. Data Capture and Delivery: User activity → Analysis. How do we capture this information? How do we manage, transform, and make data available?
  36. Data Structure Spectrum: Scale
      Unstructured: Logs (billions)
      Semi-structured: Events (billions)
      Structured: Application databases, data warehouse (100s of millions+)
  37. Semi-structured Data Infrastructure
      Unstructured: Logs (billions)
      Semi-structured: Amazon S3 (billions)
      Structured: Application databases, data warehouse (100s of millions+)
  38. Event Transport to Amazon S3: User activity → Transport (EC2) → Storage (S3)
  39. Event Payload
      Key pattern: s3://bucket-name/[event name]/[yyyy-MM-dd]/[hh]/eventId
      Example:     s3://bucket-name/user.created/2015-08-18/06/009bd890-cb8f-4896-b9e7-8bb6c9b8b8fb
      Payload:     {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
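
      As a hedged sketch of the key scheme above, this is one way an event could be written to S3
      with boto3. The put_event helper is hypothetical; the bucket name is the placeholder from the
      slide.

      import json
      import uuid
      from datetime import datetime, timezone

      import boto3  # AWS SDK for Python

      def put_event(bucket, event_name, payload):
          """Write one event under s3://bucket/[event name]/[yyyy-MM-dd]/[hh]/eventId."""
          now = datetime.now(timezone.utc)
          key = "{}/{}/{:02d}/{}".format(event_name,
                                         now.strftime("%Y-%m-%d"),
                                         now.hour,
                                         uuid.uuid4())
          boto3.client("s3").put_object(Bucket=bucket,
                                        Key=key,
                                        Body=json.dumps(payload).encode("utf-8"))
          return key

      # Example usage (hypothetical bucket name)
      put_event("bucket-name", "user.created",
                {"email_address": "jane@doe.com",
                 "timestamp": "2015-08-18T06:36:40Z"})
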
  40. None
  41. Write job → Run job → Wait … → Results

  42. Grief: Denial, Anger, Bargaining, Acceptance. map. reduce. all. the. things. again
  43. map. reduce. all. the. things. again

  44. “If I could only query this data…”

  45. Interactive Analysis at Scale • SQL, already a great abstraction • Apache Pig • Apache Hive • Cloudera Impala • Shark on Apache Spark • Amazon Redshift
  46. Structured Data Infrastructure
      Unstructured: Logs (billions)
      Semi-structured: Amazon S3 (billions)
      Structured: SQL Server, MySQL, Amazon Redshift (100s of millions+)
  47. Why Amazon Redshift? • Cost effective and faster than alternatives (Airbnb, Pinterest) • Column store (think Apache Cassandra) • ParAccel C++ backend • dist (sharding and parallelism hint) and sort (order hint) keys, sketched below • Speed up analysis feedback loop (Bit.ly) • Flexibility in data consumption/manipulation, talks PostgreSQL (Kickstarter)
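
      A minimal sketch of the dist and sort key idea: a Redshift table for user.created events,
      created over a standard PostgreSQL connection with psycopg2 (Redshift speaks the PostgreSQL
      protocol, per the last bullet). The table, column names, and connection parameters are
      illustrative, not from the deck.

      import psycopg2  # Redshift accepts PostgreSQL client connections

      DDL = """
      CREATE TABLE IF NOT EXISTS user_created (
          email_address   VARCHAR(255),
          event_timestamp TIMESTAMP
      )
      DISTKEY (email_address)   -- shard rows across nodes by user
      SORTKEY (event_timestamp) -- keep time-range scans fast
      """

      # Connection parameters are placeholders.
      conn = psycopg2.connect(host="redshift-cluster.example.com", port=5439,
                              dbname="analytics", user="etl", password="...")
      with conn, conn.cursor() as cur:
          cur.execute(DDL)
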
  48. AMPLab Big Data Benchmark, UC Berkeley

  49. AMPLab Big Data Benchmark, UC Berkeley

  50. AMPLab Big Data Benchmark, UC Berkeley

  51. $1,000/TB/Year (3 Year Partial Upfront Reserved Instance pricing)

  52. None
  53. None
  54. None
  55. None
  56. None
  57. None
  58. Arbalest • Big data ingestion from S3 to Redshift • Schema creation • Highly available data import strategies • Running data import jobs • Generating and uploading prerequisite artifacts for import • Open source: github.com/Dwolla/arbalest
  59. Configuration as Code • Encapsulate best practices into a lightweight Python library • Handle failures • Strategies for time series or sparse ingestion
  60. Configuration as Code • Validation of event schemas • Transformation is plain-ole-SQL • Idempotent operations
  61. Configuration as Code
      self.bulk_copy(metadata='',
                     source='google_analytics.user.created',
                     schema=JsonObject('google_analytics_user_created',
                                       Property('trackingId', 'VARCHAR(MAX)'),
                                       Property('sessionId', 'VARCHAR(36)'),
                                       Property('userId', 'VARCHAR(36)'),
                                       Property('googleUserId', 'VARCHAR(20)'),
                                       Property('googleUserTimestamp', 'TIMESTAMP')),
                     max_error_count=env('MAXERROR'))
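
      The bulk_copy call above runs inside an Arbalest pipeline. A rough sketch of how such a
      pipeline might be wired up; the module paths, class name, constructor arguments, and
      environment variable names here are assumptions, not taken from the deck, so check
      arbalest.readthedocs.org for the exact API.

      import psycopg2

      # Assumed module paths and class names; verify against the Arbalest docs.
      from arbalest.configuration import env
      from arbalest.redshift import S3CopyPipeline

      pipeline = S3CopyPipeline(
          aws_access_key_id=env('AWS_ACCESS_KEY_ID'),
          aws_secret_access_key=env('AWS_SECRET_ACCESS_KEY'),
          bucket=env('BUCKET_NAME'),
          db_connection=psycopg2.connect(env('REDSHIFT_CONNECTION')))

      # ... bulk_copy steps like the one on the slide above ...

      pipeline.run()
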
  62. Use Cases • post-MapReduce • Expose S3 catch-all data sink for analytics and reporting • Existing complex Python pipelines that could become SQL query-able at scale
  63. Data Archival and Automation • Minimize TCO based on data recency and frequency needs • Hot: Amazon Redshift • Warm: Amazon S3 • Cold: Amazon Glacier • Simple archival of event based data warehouse
  64. DELETE FROM google_analytics_user_created WHERE timestamp < '2015-01-01';
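
      One way to tie this DELETE to the hot/warm/cold tiers on the previous slide is to archive the
      rows to S3 before purging them from Redshift. A minimal sketch, assuming psycopg2, a
      placeholder bucket and IAM role, and the googleUserTimestamp column from the schema on
      slide 61:

      import psycopg2

      ARCHIVE = """
      UNLOAD ('SELECT * FROM google_analytics_user_created
               WHERE googleUserTimestamp < ''2015-01-01''')
      TO 's3://bucket-name/archive/google_analytics_user_created/'
      IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-unload'  -- placeholder role
      GZIP
      """

      PURGE = """
      DELETE FROM google_analytics_user_created
      WHERE googleUserTimestamp < '2015-01-01'
      """

      conn = psycopg2.connect("...")  # Redshift connection string (placeholder)
      conn.autocommit = True
      with conn.cursor() as cur:
          cur.execute(ARCHIVE)  # warm copy lands in S3; lifecycle rules can push it to Glacier
          cur.execute(PURGE)    # then remove the old rows from the hot store
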

  65. Event Driven Data Pipeline Principles • Immutable semantic events: immutability changes everything • Idempotent pipelines: applying any operation twice results in the same value as applying it once (see the sketch below) • Transformations are state machines: small, composable steps with well defined transitions from one state (data value) to another
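
      For example, an idempotent, recreate-from-scratch transformation can be a single transaction
      that rebuilds a derived table from event tables; running it twice yields the same result. A
      minimal sketch with illustrative table and column names (not from the deck):

      import psycopg2

      # Rebuild a derived "current user status" table from atomic event tables.
      REBUILD = [
          "DROP TABLE IF EXISTS user_current_status",
          """
          CREATE TABLE user_current_status AS
          SELECT c.email_address,
                 CASE WHEN v.email_address IS NULL THEN 'created' ELSE 'verified' END AS status
          FROM user_created c
          LEFT JOIN user_verified v ON v.email_address = c.email_address
          """,
      ]

      conn = psycopg2.connect("...")  # Redshift connection string (placeholder)
      with conn, conn.cursor() as cur:  # one transaction: the rebuild is all or nothing
          for statement in REBUILD:
              cur.execute(statement)
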
  66. None
  67. Demo

  68. None
  69. Fredrick Galoso • @wayoutmind • github.com/fredrick • github.com/Dwolla/arbalest • arbalest.readthedocs.org