Dwolla, Tableau, & AWS: Building Scalable Event-Driven Data Pipelines for Payments

Events are the atomic building blocks of data. That is certainly the case at Dwolla, a payments company that allows anyone (or anything) connected to the internet to move money quickly, safely, and at the lowest cost possible. This session is a deep dive into how Dwolla manages events and data of all shapes and sizes using Amazon Web Services (EC2, EMR, RDS, Redshift, S3) and Tableau. It also introduces Dwolla's open source data pipeline orchestration tool, Arbalest (https://github.com/Dwolla/arbalest), which processes all of this data at scale. [Presented at the MGM Grand in Las Vegas at the Tableau Conference (TC15), Oct. 21, 2015.]


Fredrick Galoso

October 21, 2015

Transcript

  1. Fredrick Galoso, Software Developer, Data & Analytics Technical Lead, Dwolla
     #DwollaData
  2. Software Developer, Data & Analytics Technical Lead • fred@dwolla.com • @wayoutmind

  3. DWOLLA

  4. • Launched nationally in 2010 • 70+ employees across 3 offices (DSM, SF, NYC)
     • Direct to Consumer (B2C), Direct to Business (B2B), through financial institutions, through other fintech companies and platforms
     • Partnerships with BBVA Compass, Comenity Capital Bank, US Department of Treasury
  5. (image-only slide)
  6. Scaling infrastructure to meet demand: managed hosting provider → AWS VPC

  7. Scaling infrastructure to meet demand, AWS VPC • Flexibility • Reduce response time • Cost savings
     • Predictability • Leverage best practices • Reuse puzzle pieces • Complexity
  8. 2015: How do we maximize this data and reduce time to insights?
  9. Key Use Cases • Product management • Marketing and sales • Fraud and compliance • Customer insights and demographics
  10. Before Tableau: Bank Secrecy Act risk monitoring

  11. Pain points: Which report has accounts in a HIFCA zip code? Why is this report taking so long to load? How do I manipulate this raw data to answer my specific question?
  12. Rudimentary tools were good enough when data was small and simple
  13. Tools were getting in the way of analyzing larger amounts of data
  14. Data Growing Pains
      Blunt tools • Data discovery difficult • Poor performance • Unable to take advantage of all data
      No ubiquity or consistency in facts • Error prone, labor intensive, manual data massaging
  15. Why Tableau? Reduce time to cognition • Business intelligence, visualization > data sheets • Dashboard discoverability • Reports load in seconds instead of minutes • Support for all of our data sources
  16. Why Tableau? Reduce time to answers • Eliminate BI “chewing gum and duct tape” • Create dashboards in hours instead of days • Free up engineering resources
  17. Tableau at Dwolla: Rich, actionable data visualizations. Immediate success with integration, in less than 1 year • ~30 Server, 5 Desktop users • Hundreds of workbooks and dashboards • Discoverability and measurement of many heterogeneous data sources
  18. (image-only slide)
  19. Data Capture and Delivery
  20. Data Capture and Delivery
  21. Data Capture and Delivery
  22. Do we have the right data? Can we adapt? Can we answer, “What if?”
  23. Building Flexibility • Need to save enough information to be able to answer future inquiries • Data must be granular, specific, and flexible to adaptation
  24. Typical RDBMS Record: A user has an email address which can be in one of two states, created or verified
      email_address    status
      jane@doe.com     created
  25. Typical RDBMS Update: Jane verifies her email address
      UPDATE Users SET status = 'verified' WHERE email_address = 'jane@doe.com';
      email_address    status
      jane@doe.com     verified
  26. What happened? Can we answer the following? • When did the user become verified? • What was the verification velocity?
  27. What happened? Even if we changed the schema • Context? • Explosion of individual record size • Tight coupling between storage of value and structure
  28. Atomic values
      Transaction atomicity: operations that are all or nothing, indivisible or irreducible
      Semantic atomicity: values that have indivisible meaning; cannot be broken down further; a time-based fact
  29. (image-only slide)
  30. Transaction vs. Semantic Atomicity: transaction atomicity
  31. Transaction vs. Semantic Atomicity: semantic atomicity
  32. Semantically atomic: Can derive values from atoms, but semantically atomic values cannot be derived
  33. Semantically Atomic State: Events • Unique identity • Specific value • Structure • Time-based fact • Immutable, does not change • Separate the what of the data from how it is stored
  34. State Transition With Events
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z"}
  35. Context Specific Event Values
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"}
  36. Maintaining Semantic Atomicity • Additive schema changes • Use a new event type if new properties or changes to existing events would change their fundamental meaning or identity
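      A minimal sketch (not from the deck) of what an additive change looks like to a downstream consumer: the optional "workflow" property from slide 35 is read defensively, while a change that alters an event's meaning gets a new event type instead.

      # Hypothetical consumer of user.verified events; the field names follow the
      # event examples on the surrounding slides, everything else is illustrative.
      def handle_user_verified(event):
          email = event["email_address"]               # present since the first version
          verified_at = event["timestamp"]             # present since the first version
          workflow = event.get("workflow", "unknown")  # additive, optional property
          return email, verified_at, workflow

      # A change that redefines what "verified" means would be published as a new
      # event type (for example "user.verification_completed", a made-up name)
      # rather than by mutating the existing user.verified schema.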
  37. Embracing Event Streams • Lowest common denominator for downstream systems • Apps can transform and store event streams specific to their use cases and bounded context
  38. Embracing Event Streams • Eliminate breaking schema changes and side effects • Derived values can be destroyed and recreated from scratch • Extension of the log, big data’s streaming abstraction, but explicitly semi-structured
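      To make "derived values can be destroyed and recreated" concrete, here is an illustrative replay (not Dwolla's actual code; the event dict shape is an assumption) that rebuilds current user state and answers slide 26's verification-velocity question:

      from datetime import datetime

      def parse(ts):
          # Timestamps follow the ISO 8601 format used in the event examples above.
          return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

      def replay(events):
          # events: iterable of {"name": ..., "payload": {...}} dicts (assumed shape).
          users = {}
          for event in sorted(events, key=lambda e: e["payload"]["timestamp"]):
              payload = event["payload"]
              user = users.setdefault(payload["email_address"],
                                      {"created": None, "verified": None})
              if event["name"] == "user.created":
                  user["created"] = parse(payload["timestamp"])
              elif event["name"] == "user.verified":
                  user["verified"] = parse(payload["timestamp"])
          return users

      def verification_velocity(user):
          # Time from account creation to verification; None if not yet verified.
          if user["created"] and user["verified"]:
              return user["verified"] - user["created"]
          return None

      Because the events are immutable facts, this derived state can be thrown away and rebuilt from the stream at any time.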
  39. Data Capture and Delivery

  40. Data Structure Spectrum: Scale
      Unstructured • Logs • 100s of millions+
      Semi-structured • Events • 100s of millions+
      Structured • Application databases • Data warehouse • 100s of millions
  41. Semi-structured Data Infrastructure
      Unstructured • Logs • 100s of millions+
      Semi-structured • Amazon S3 • 100s of millions+
      Structured • SQL Server, MySQL • Amazon Redshift • 100s of millions
  42. Event Transport to Amazon S3

  43. Event Payload
      s3://bucket-name/[event name]/[yyyy-MM-dd]/[hh]/eventId
      s3://bucket-name/user.created/2015-08-18/06/
      {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
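      A small sketch of writing a payload under that key convention (the bucket name and helper below are placeholders, not Arbalest code):

      import json
      import uuid
      from datetime import datetime, timezone

      def s3_key(event_name, occurred_at, event_id):
          # [event name]/[yyyy-MM-dd]/[hh]/eventId, per the convention on slide 43.
          return "{0}/{1}/{2:02d}/{3}".format(event_name,
                                              occurred_at.strftime("%Y-%m-%d"),
                                              occurred_at.hour,
                                              event_id)

      occurred_at = datetime(2015, 8, 18, 6, 36, 40, tzinfo=timezone.utc)
      payload = json.dumps({"email_address": "jane@doe.com",
                            "timestamp": "2015-08-18T06:36:40Z"})
      key = s3_key("user.created", occurred_at, uuid.uuid4())
      # key -> "user.created/2015-08-18/06/<uuid>", stored under s3://bucket-name/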
  44. (image-only slide)
  45. Typical Big Data Batch Analysis 1. Data is now in an easier to consume, semi-structured form, but I need to do something with it 2. Write job 3. Run job 4. Wait 5. Get results
  46. Typical Big Data Batch Analysis 6. Grief • Denial • Anger, how did I miss this?! • Bargaining, maybe I can salvage • Acceptance: map. reduce. all. the. things. again
  47. “If I could only query this data…”

  48. Interactive Analysis at Scale • SQL, already a great abstraction • Apache Pig • Apache Hive • Cloudera Impala • Shark on Apache Spark • Amazon Redshift
  49. Structured Data Infrastructure
      Unstructured • Logs • 100s of millions+
      Semi-structured • Amazon S3 • 100s of millions+
      Structured • SQL Server, MySQL • Amazon Redshift • 100s of millions
  50. Why Amazon Redshift? • Cost effective and faster than alternatives (Airbnb, Pinterest)
      • Column store (think Apache Cassandra) • ParAccel C++ backend
      • dist (sharding and parallelism hint) and sort (order hint) keys
      • Speed up analysis feedback loop (Bit.ly)
      • Flexibility in data consumption/manipulation, talks PostgreSQL (Kickstarter)
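      To make the dist and sort keys concrete, here is a hypothetical table for user.verified events; the columns and key choices are illustrative, and the connection string is a placeholder (Redshift speaks the PostgreSQL protocol, so psycopg2 works):

      import psycopg2

      # DISTKEY spreads rows across nodes for parallelism; SORTKEY orders rows on
      # disk so time-range scans stay fast. Both are hints, as slide 50 notes.
      ddl = """
      CREATE TABLE user_verified (
          event_id       VARCHAR(36)   NOT NULL,
          email_address  VARCHAR(255)  NOT NULL,
          "timestamp"    TIMESTAMP     NOT NULL,
          workflow       VARCHAR(32)
      )
      DISTKEY (email_address)
      SORTKEY ("timestamp");
      """

      connection = psycopg2.connect("host=<cluster-endpoint> port=5439 "
                                    "dbname=analytics user=<user> password=<password>")
      with connection, connection.cursor() as cursor:
          cursor.execute(ddl)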
  51. AMPLab Big Data Benchmark, UC Berkeley
  52. AMPLab Big Data Benchmark, UC Berkeley
  53. AMPLab Big Data Benchmark, UC Berkeley
  54. (image-only slide)
  55. (image-only slide)
  56. (image-only slide)
  57. Arbalest • Big data ingestion from S3 to Redshift • Schema creation • Highly available data import strategies • Running data import jobs • Generating and uploading prerequisite artifacts for import • Open source: github.com/dwolla/arbalest
  58. Configuration as Code • Encapsulate best practices into a lightweight Python library • Handle failures • Strategies for time series or sparse ingestion
  59. Configuration as Code • Validation of event schemas • Transformation is plain-ole SQL • Idempotent operations
  60. Configuration as Code
      self.bulk_copy(metadata=..., source=..., schema=JsonObject(..., Property(...), Property(...), Property(...), Property(...), Property(...)), max_error_count=...)
  61. Data Archival and Automation • Minimize TCO based on data recency and frequency needs
      • Hot: Amazon Redshift • Warm: Amazon S3 • Cold: Amazon Glacier
      • Simple archival of event based data warehouse:
        DELETE FROM google_analytics_user_created WHERE timestamp < '2015-01-01';
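      One way to express the hot-to-warm step is to UNLOAD the old rows to S3 before pruning them from Redshift; this pairing is illustrative (paths, credentials, and the cutoff are placeholders), not a prescribed Arbalest feature:

      import psycopg2

      archive = """
      UNLOAD ('SELECT * FROM google_analytics_user_created WHERE timestamp < ''2015-01-01''')
      TO 's3://bucket-name/archive/google_analytics_user_created/'
      CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      GZIP;
      """

      prune = ("DELETE FROM google_analytics_user_created "
               "WHERE timestamp < '2015-01-01';")

      connection = psycopg2.connect("<redshift connection string>")
      connection.autocommit = True
      cursor = connection.cursor()
      cursor.execute(archive)  # warm copy lands in S3 (and can age into Glacier)
      cursor.execute(prune)    # hot storage in Redshift keeps only recent rows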
  62. (image-only slide)
  63. (image-only slide)
  64. (image-only slide)
  65. (image-only slide)
  66. fred@dwolla.com github.com/dwolla/arbalest

  67. Please complete the session survey from the Session Details screen in your TC15 app
  68. (image-only slide)