Dwolla, Tableau, & AWS: Building Scalable Event-Driven Data Pipelines for Payments

Events are the atomic building blocks of data. That is certainly the case at Dwolla, a payments company that allows anyone (or anything) connected to the internet to move money quickly, safely, and at the lowest cost possible. This session is a deep dive into how Dwolla manages events and data of all shapes and sizes using Amazon Web Services (EC2, EMR, RDS, Redshift, S3) and Tableau. It also introduces Dwolla's open source data pipeline orchestration tool, Arbalest (https://github.com/Dwolla/arbalest), which processes all of this data at scale. [Presented at the MGM Grand in Las Vegas at the Tableau Conference (TC15), Oct. 21, 2015.]


Fredrick Galoso

October 21, 2015

Transcript

  1. Fredrick Galoso, Software Developer, Data & Analytics Technical Lead, Dwolla
     #DwollaData
  2. Software Developer, Data & Analytics Technical Lead • fred@dwolla.com • @wayoutmind

  3. DWOLLA

  4. • Launched nationally in 2010 • 70+ employees across 3 offices (DSM, SF, NYC)
     • Direct to Consumer (B2C), Direct to Business (B2B), through financial institutions, through other fintech companies and platforms
     • Partnerships with BBVA Compass, Comenity Capital Bank, US Department of Treasury
  5. (image-only slide)
  6. Scaling infrastructure to meet demand: managed hosting provider → AWS VPC

  7. Scaling infrastructure to meet demand, AWS VPC • Flexibility • Reduce response time • Cost savings
     • Predictability • Leverage best practices • Reuse puzzle pieces • Complexity
  8. 2015: How do we maximize this data and reduce time to insights?
  9. Key Use Cases • Product management • Marketing and sales • Fraud and compliance • Customer insights and demographics
  10. Before Tableau: Bank Secrecy Act risk monitoring

  11. Pain points: Which report has accounts in a HIFCA zip code? Why is this report taking so long to load? How do I manipulate this raw data to answer my specific question?
  12. Rudimentary tools were good enough when data was small and simple
  13. Tools were getting in the way of analyzing larger amounts of data
  14. Data Growing Pains
      Blunt tools • Data discovery difficult • Poor performance • Unable to take advantage of all data
      No ubiquity or consistency in facts • Error prone, labor intensive, manual data massaging
  15. Why Tableau? Reduce time to cognition • Business intelligence, visualization > data sheets • Dashboard discoverability • Reports load in seconds instead of minutes • Support for all of our data sources
  16. Why Tableau? Reduce time to answers • Eliminate BI “chewing gum and duct tape” • Create dashboards in hours instead of days • Free up engineering resources
  17. Tableau at Dwolla: Rich, actionable data visualizations. Immediate success with integration, in less than 1 year • ~30 Server, 5 Desktop users • Hundreds of workbooks and dashboards • Discoverability and measurement of many heterogeneous data sources
  18. (image-only slide)
  19. Data Capture and Delivery
  20. Data Capture and Delivery
  21. Data Capture and Delivery
  22. Do we have the right data? Can we adapt? Can we answer, “What if?”
  23. Building Flexibility • Need to save enough information to be able to answer future inquiries • Data must be granular, specific, and flexible to adaptation
  24. Typical RDBMS Record: A user has an email address which can be in one of two states, created or verified
      email_address    status
      jane@doe.com     created
  25. Typical RDBMS Update: Jane verifies her email address
      UPDATE Users SET status = 'verified' WHERE email_address = 'jane@doe.com';
      email_address    status
      jane@doe.com     verified
  26. What happened? Can we answer the following? • When did the user become verified? • What was the verification velocity?
  27. What happened? Even if we changed the schema • Context? • Explosion of individual record size • Tight coupling between storage of value and structure
  28. Atomic values
      Transaction atomicity: operations that are all or nothing, indivisible or irreducible
      Semantic atomicity: values that have indivisible meaning; cannot be broken down further; a time-based fact
  29. (image-only slide)
  30. Transaction vs. Semantic Atomicity: transaction atomicity
  31. Transaction vs. Semantic Atomicity: semantic atomicity
  32. Semantically atomic: Can derive values from atoms, but semantically atomic values cannot be derived
  33. Semantically Atomic State: Events • Unique identity • Specific value • Structure • Time-based fact • Immutable, does not change • Separate the what of the data from how it is stored
  34. State Transition With Events
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z"}
  35. Context Specific Event Values
      user.created  {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
      user.verified {"email_address": "jane@doe.com", "timestamp": "2015-08-18T07:38:40Z", "workflow": "mobile"}
  36. Maintaining Semantic Atomicity • Additive schema changes • Use a new event type if new properties or changes to existing events would change their fundamental meaning or identity
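      A minimal sketch (not from the deck) of what an additive change looks like to a downstream consumer: the optional "workflow" property from slide 35 is read defensively, while a change that alters an event's meaning gets a new event type instead.

      # Hypothetical consumer of user.verified events; the field names follow the
      # event examples on the surrounding slides, everything else is illustrative.
      def handle_user_verified(event):
          email = event["email_address"]               # present since the first version
          verified_at = event["timestamp"]             # present since the first version
          workflow = event.get("workflow", "unknown")  # additive, optional property
          return email, verified_at, workflow

      # A change that redefines what "verified" means would be published as a new
      # event type (for example "user.verification_completed", a made-up name)
      # rather than by mutating the existing user.verified schema.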
  37. Embracing Event Streams • Lowest common denominator for downstream systems • Apps can transform and store event streams specific to their use cases and bounded context
  38. Embracing Event Streams • Eliminate breaking schema changes and side effects • Derived values can be destroyed and recreated from scratch • Extension of the log, big data’s streaming abstraction, but explicitly semi-structured
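      To make "derived values can be destroyed and recreated" concrete, here is an illustrative replay (not Dwolla's actual code; the event dict shape is an assumption) that rebuilds current user state and answers slide 26's verification-velocity question:

      from datetime import datetime

      def parse(ts):
          # Timestamps follow the ISO 8601 format used in the event examples above.
          return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

      def replay(events):
          # events: iterable of {"name": ..., "payload": {...}} dicts (assumed shape).
          users = {}
          for event in sorted(events, key=lambda e: e["payload"]["timestamp"]):
              payload = event["payload"]
              user = users.setdefault(payload["email_address"],
                                      {"created": None, "verified": None})
              if event["name"] == "user.created":
                  user["created"] = parse(payload["timestamp"])
              elif event["name"] == "user.verified":
                  user["verified"] = parse(payload["timestamp"])
          return users

      def verification_velocity(user):
          # Time from account creation to verification; None if not yet verified.
          if user["created"] and user["verified"]:
              return user["verified"] - user["created"]
          return None

      Because the events are immutable facts, this derived state can be thrown away and rebuilt from the stream at any time.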
  39. Data Capture and Delivery

  40. Data Structure Spectrum: Scale
      Unstructured • Logs • 100s of millions+
      Semi-structured • Events • 100s of millions+
      Structured • Application databases • Data warehouse • 100s of millions
  41. Semi-structured Data Infrastructure
      Unstructured • Logs • 100s of millions+
      Semi-structured • Amazon S3 • 100s of millions+
      Structured • SQL Server, MySQL • Amazon Redshift • 100s of millions
  42. Event Transport to Amazon S3

  43. Event Payload
      s3://bucket-name/[event name]/[yyyy-MM-dd]/[hh]/eventId
      s3://bucket-name/user.created/2015-08-18/06/
      {"email_address": "jane@doe.com", "timestamp": "2015-08-18T06:36:40Z"}
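      A small sketch of writing a payload under that key convention (the bucket name and helper below are placeholders, not Arbalest code):

      import json
      import uuid
      from datetime import datetime, timezone

      def s3_key(event_name, occurred_at, event_id):
          # [event name]/[yyyy-MM-dd]/[hh]/eventId, per the convention on slide 43.
          return "{0}/{1}/{2:02d}/{3}".format(event_name,
                                              occurred_at.strftime("%Y-%m-%d"),
                                              occurred_at.hour,
                                              event_id)

      occurred_at = datetime(2015, 8, 18, 6, 36, 40, tzinfo=timezone.utc)
      payload = json.dumps({"email_address": "jane@doe.com",
                            "timestamp": "2015-08-18T06:36:40Z"})
      key = s3_key("user.created", occurred_at, uuid.uuid4())
      # key -> "user.created/2015-08-18/06/<uuid>", stored under s3://bucket-name/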
  44. (image-only slide)
  45. Typical Big Data Batch Analysis 1. Data is now in an easier to consume, semi-structured form, but I need to do something with it 2. Write job 3. Run job 4. Wait 5. Get results
  46. Typical Big Data Batch Analysis 6. Grief • Denial • Anger, how did I miss this?! • Bargaining, maybe I can salvage • Acceptance: map. reduce. all. the. things. again
  47. “If I could only query this data…”

  48. Interactive Analysis at Scale • SQL, already a great abstraction • Apache Pig • Apache Hive • Cloudera Impala • Shark on Apache Spark • Amazon Redshift
  49. Structured Data Infrastructure
      Unstructured • Logs • 100s of millions+
      Semi-structured • Amazon S3 • 100s of millions+
      Structured • SQL Server, MySQL • Amazon Redshift • 100s of millions
  50. Why Amazon Redshift? • Cost effective and faster than alternatives (Airbnb, Pinterest)
      • Column store (think Apache Cassandra) • ParAccel C++ backend
      • dist (sharding and parallelism hint) and sort (order hint) keys
      • Speed up analysis feedback loop (Bit.ly)
      • Flexibility in data consumption/manipulation, talks PostgreSQL (Kickstarter)
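      To make the dist and sort keys concrete, here is a hypothetical table for user.verified events; the columns and key choices are illustrative, and the connection string is a placeholder (Redshift speaks the PostgreSQL protocol, so psycopg2 works):

      import psycopg2

      # DISTKEY spreads rows across nodes for parallelism; SORTKEY orders rows on
      # disk so time-range scans stay fast. Both are hints, as slide 50 notes.
      ddl = """
      CREATE TABLE user_verified (
          event_id       VARCHAR(36)   NOT NULL,
          email_address  VARCHAR(255)  NOT NULL,
          "timestamp"    TIMESTAMP     NOT NULL,
          workflow       VARCHAR(32)
      )
      DISTKEY (email_address)
      SORTKEY ("timestamp");
      """

      connection = psycopg2.connect("host=<cluster-endpoint> port=5439 "
                                    "dbname=analytics user=<user> password=<password>")
      with connection, connection.cursor() as cursor:
          cursor.execute(ddl)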
  51. AMPLab Big Data Benchmark, UC Berkeley
  52. AMPLab Big Data Benchmark, UC Berkeley
  53. AMPLab Big Data Benchmark, UC Berkeley
  54. (image-only slide)
  55. (image-only slide)
  56. (image-only slide)
  57. Arbalest • Big data ingestion from S3 to Redshift • Schema creation • Highly available data import strategies • Running data import jobs • Generating and uploading prerequisite artifacts for import • Open source: github.com/dwolla/arbalest
  58. Configuration as Code • Encapsulate best practices into a lightweight Python library • Handle failures • Strategies for time series or sparse ingestion
  59. Configuration as Code • Validation of event schemas • Transformation is plain-ole SQL • Idempotent operations
  60. Configuration as Code
      self.bulk_copy(metadata=..., source=..., schema=JsonObject(..., Property(...), Property(...), Property(...), Property(...), Property(...)), max_error_count=...)
  61. Data Archival and Automation • Minimize TCO based on data recency and frequency needs
      • Hot: Amazon Redshift • Warm: Amazon S3 • Cold: Amazon Glacier
      • Simple archival of event based data warehouse:
        DELETE FROM google_analytics_user_created WHERE timestamp < '2015-01-01';
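      One way to express the hot-to-warm step is to UNLOAD the old rows to S3 before pruning them from Redshift; this pairing is illustrative (paths, credentials, and the cutoff are placeholders), not a prescribed Arbalest feature:

      import psycopg2

      archive = """
      UNLOAD ('SELECT * FROM google_analytics_user_created WHERE timestamp < ''2015-01-01''')
      TO 's3://bucket-name/archive/google_analytics_user_created/'
      CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
      GZIP;
      """

      prune = ("DELETE FROM google_analytics_user_created "
               "WHERE timestamp < '2015-01-01';")

      connection = psycopg2.connect("<redshift connection string>")
      connection.autocommit = True
      cursor = connection.cursor()
      cursor.execute(archive)  # warm copy lands in S3 (and can age into Glacier)
      cursor.execute(prune)    # hot storage in Redshift keeps only recent rows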
  62. (image-only slide)
  63. (image-only slide)
  64. (image-only slide)
  65. (image-only slide)
  66. fred@dwolla.com github.com/dwolla/arbalest

  67. Please complete the session survey from the Session Details screen in your TC15 app
  68. (image-only slide)