
AWS Data Pipeline

A quick overview of what AWS Data Pipeline is, what it can do for you, and how we're using it at Swipely.


Keith Barrette

March 26, 2013


Transcript

  1. So you want to process logs with Hive? Not-so-good way: maintain a Hadoop cluster,
     get data into Hive tables, schedule a Hive job. OK way: use EMR, but you still have
     to schedule, monitor, etc.
  2. So what? Cron jobs living with your app tend to become apps themselves - monitoring,
     deployment, hosting, etc. Ops guy creates a scheme for scheduled jobs - not bad, but
     it can be better.
  3. Enter Data Pipeline. Scheduling? Baked in. Monitoring? SNS on success and/or failure.
     Deployment? Pipeline definitions are declarative - Amazon handles the rest. Hosting?
     Instances are spun up as needed and destroyed when done.
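     For reference, the "Nightly" schedule and "FailureNotify" alarm referenced throughout
     the definitions on the following slides are themselves just declarative pipeline
     objects. A minimal sketch, assuming the standard Schedule and SnsAlarm object types;
     the start time, period, and topic ARN here are placeholders:
     { "id": "Nightly", "type": "Schedule",
       "startDateTime": "2013-03-26T03:00:00", "period": "1 day" },
     { "id": "FailureNotify", "type": "SnsAlarm",
       "topicArn": "arn:aws:sns:us-east-1:...:pipeline-alerts",
       "subject": "Pipeline failure", "message": "A pipeline step failed." },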
  4. The problem lends itself nicely to map reduce. We’re good at writing and testing Ruby
     code, so we’d like to use Ruby for the mapper and reducer. We don’t want to have to
     administer a Hadoop cluster, or find a way to schedule and monitor an EMR flow.
  5. { "id": "CopyTransactionsToS3", "type": "CopyActivity", "schedule": { "ref": "Nightly" },

    "onFail": { "ref": "FailureNotify" }, "input": { "ref": “TransactionsSelect" }, "output": { "ref": "S3Transactions" } }, { "id": "S3Transactions", "type": "S3DataNode", "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" }, "filePath": "s3://.../transactions.csv" }, { "id": "TransactionsSelect", "type": "MySqlDataNode", "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotifyureNotify" }, "table": "transactions", "connectionString": "jdbc:mysql://...", "selectQuery": "SELECT ... FROM #{table} WHERE ...;" },
  6. { "id": "HeatmapEMRCluster", "type": "EmrCluster", "schedule": { "ref": "Nightly" },

    }, { "id": "GenerateHeatmap", "type": "EmrActivity", "onFail": { "ref": "FailureNotify" }, "schedule": { "ref": "Nightly" }, "runsOn": { "ref": "HeatmapEMRCluster" }, "dependsOn": { "ref": "CopyTransactionsToS3" }, "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://.../ transactions.csv,-output,s3://.../heatmap_aggregations,-mapper,s3n://.../map.rb,- reducer,s3n://.../reduce.rb" },
  7. Monitored: get an email (via SNS) when the pipeline completes (or fails). Scalable:
     we can adjust the EMR cluster and scheduling as needed. Deployable: devs can work in
     Ruby, deploy with our usual machinery. Simple: pipeline definition is just JSON
     (albeit 150 lines).
  8. We want to make nightly DB backups. We also want to protect our customers’ data. We
     want production-ish data in our staging DB, and on our dev laptops.
  9. Nightly DB dumps:
     1. Bootstrap a Ruby environment
     2. Create backups, store them on S3
     3. Use Fog to create various RDS clones
     4. Sanitize and truncate data
     5. Create DB dumps for use by devs
     6. Switch DNS
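     A rough sketch of how one of these steps could be expressed, assuming a
     ShellCommandActivity running on an Ec2Resource; the script location, instance type,
     and timeout are illustrative placeholders, not Swipely's actual definition:
     { "id": "BackupInstance", "type": "Ec2Resource",
       "schedule": { "ref": "Nightly" }, "instanceType": "m1.small",
       "terminateAfter": "2 hours" },
     { "id": "CreateBackups", "type": "ShellCommandActivity",
       "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" },
       "runsOn": { "ref": "BackupInstance" },
       "scriptUri": "s3://.../bootstrap_and_backup.sh" },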
  10. Arbitrary shell commands: bootstrap Ruby, then interact with various libraries.
      Concurrency: manipulating data in the “scratch” DB can happen while the new staging
      DB is being set up. Automatic S3 staging to the instance’s local FS: write backups,
      etc. to local directories, and they’ll be moved to S3 automatically.
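      Roughly how the S3 staging works: with the stage flag set on a ShellCommandActivity,
      an attached S3DataNode is exposed as a local staging directory, and anything written
      there is uploaded when the step finishes. A sketch assuming the documented stage
      field and staging-directory variable; IDs and paths are placeholders:
      { "id": "CreateDevDumps", "type": "ShellCommandActivity",
        "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" },
        "runsOn": { "ref": "BackupInstance" }, "stage": "true",
        "output": { "ref": "S3DevDumps" },
        "command": "ruby sanitize_and_dump.rb > ${OUTPUT1_STAGING_DIR}/dev_dump.sql" },
      { "id": "S3DevDumps", "type": "S3DataNode",
        "schedule": { "ref": "Nightly" }, "directoryPath": "s3://.../dev_dumps/" },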
  11. But... you don’t want to see this definition, and don’t even try the GUI: 16 shell
      command nodes, various S3 data nodes, schedules, SNS nodes, etc. add up to 250 lines
      of dense JSON.
  12. Testing and debugging can be a pain - what if the last step of a big pipeline fails?
      Config gets rather big - DSL, anyone? /cc @bigconfig Where are your logs?
  13. Final Thoughts: Got scheduled tasks that move data around AWS? Give it a try. Still
      plenty of room for Amazon and third parties to build tooling.