
AWS Data Pipeline

A quick overview of what AWS Data Pipeline is, what it can do for you, and how we're using it at Swipely.


Keith Barrette

March 26, 2013


Transcript

  1. So you want to process logs with Hive? Not-so-good way: maintain a Hadoop cluster,
     get data into Hive tables, schedule a Hive job. OK way: use EMR, but you still have
     to schedule, monitor, etc.
  2. So what? Cron jobs living with your app tend to become apps themselves - monitoring,
     deployment, hosting, etc. Ops guy creates a scheme for scheduled jobs - not bad, but
     it can be better.
  3. Enter Data Pipeline. Scheduling? Baked in. Monitoring? SNS on success and/or failure.
     Deployment? Pipeline definitions are declarative - Amazon handles the rest. Hosting?
     Instances are spun up as needed and destroyed when done.
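     For reference, the "Nightly" schedule and "FailureNotify" alarm referenced throughout
     the definitions on the following slides are themselves just declarative pipeline
     objects. A minimal sketch, assuming the standard Schedule and SnsAlarm object types;
     the start time, period, and topic ARN here are placeholders:
     { "id": "Nightly", "type": "Schedule",
       "startDateTime": "2013-03-26T03:00:00", "period": "1 day" },
     { "id": "FailureNotify", "type": "SnsAlarm",
       "topicArn": "arn:aws:sns:us-east-1:...:pipeline-alerts",
       "subject": "Pipeline failure", "message": "A pipeline step failed." },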
  4. The problem lends itself nicely to map reduce. We’re good at writing and testing Ruby
     code, so we’d like to use Ruby for the mapper and reducer. We don’t want to have to
     administer a Hadoop cluster, or find a way to schedule and monitor an EMR flow.
  5. { "id": "CopyTransactionsToS3", "type": "CopyActivity", "schedule": { "ref": "Nightly" },

    "onFail": { "ref": "FailureNotify" }, "input": { "ref": “TransactionsSelect" }, "output": { "ref": "S3Transactions" } }, { "id": "S3Transactions", "type": "S3DataNode", "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" }, "filePath": "s3://.../transactions.csv" }, { "id": "TransactionsSelect", "type": "MySqlDataNode", "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotifyureNotify" }, "table": "transactions", "connectionString": "jdbc:mysql://...", "selectQuery": "SELECT ... FROM #{table} WHERE ...;" },
  6. { "id": "HeatmapEMRCluster", "type": "EmrCluster", "schedule": { "ref": "Nightly" },

    }, { "id": "GenerateHeatmap", "type": "EmrActivity", "onFail": { "ref": "FailureNotify" }, "schedule": { "ref": "Nightly" }, "runsOn": { "ref": "HeatmapEMRCluster" }, "dependsOn": { "ref": "CopyTransactionsToS3" }, "step": "/home/hadoop/contrib/streaming/hadoop-streaming.jar,-input,s3n://.../ transactions.csv,-output,s3://.../heatmap_aggregations,-mapper,s3n://.../map.rb,- reducer,s3n://.../reduce.rb" },
  7. Monitored: get an email (via SNS) when the pipeline completes (or fails). Scalable:
     we can adjust the EMR cluster and scheduling as needed. Deployable: devs can work in
     Ruby, deploy with our usual machinery. Simple: pipeline definition is just JSON
     (albeit 150 lines).
  8. We want to make nightly DB backups. We also want to protect our customers’ data. We
     want production-ish data in our staging DB, and on our dev laptops.
  9. Nightly DB dumps:
     1. Bootstrap a Ruby environment
     2. Create backups, store them on S3
     3. Use Fog to create various RDS clones
     4. Sanitize and truncate data
     5. Create DB dumps for use by devs
     6. Switch DNS
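     A rough sketch of how one of these steps could be expressed, assuming a
     ShellCommandActivity running on an Ec2Resource; the script location, instance type,
     and timeout are illustrative placeholders, not Swipely's actual definition:
     { "id": "BackupInstance", "type": "Ec2Resource",
       "schedule": { "ref": "Nightly" }, "instanceType": "m1.small",
       "terminateAfter": "2 hours" },
     { "id": "CreateBackups", "type": "ShellCommandActivity",
       "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" },
       "runsOn": { "ref": "BackupInstance" },
       "scriptUri": "s3://.../bootstrap_and_backup.sh" },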
  10. Arbitrary shell commands: bootstrap Ruby, then interact with various libraries.
      Concurrency: manipulating data in the “scratch” DB can happen while the new staging
      DB is being set up. Automatic S3 staging to the instance’s local FS: write backups,
      etc. to local directories, and they’ll be moved to S3 automatically.
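      Roughly how the S3 staging works: with the stage flag set on a ShellCommandActivity,
      an attached S3DataNode is exposed as a local staging directory, and anything written
      there is uploaded when the step finishes. A sketch assuming the documented stage
      field and staging-directory variable; IDs and paths are placeholders:
      { "id": "CreateDevDumps", "type": "ShellCommandActivity",
        "schedule": { "ref": "Nightly" }, "onFail": { "ref": "FailureNotify" },
        "runsOn": { "ref": "BackupInstance" }, "stage": "true",
        "output": { "ref": "S3DevDumps" },
        "command": "ruby sanitize_and_dump.rb > ${OUTPUT1_STAGING_DIR}/dev_dump.sql" },
      { "id": "S3DevDumps", "type": "S3DataNode",
        "schedule": { "ref": "Nightly" }, "directoryPath": "s3://.../dev_dumps/" },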
  11. But... you don’t want to see this definition, and don’t even try the GUI: 16 shell
      command nodes, various S3 data nodes, schedules, SNS nodes, etc. add up to 250 lines
      of dense JSON.
  12. Testing and debugging can be a pain - what if the last step of a big pipeline fails?
      Config gets rather big - DSL, anyone? /cc @bigconfig Where are your logs?
  13. Final Thoughts: Got scheduled tasks that move data around AWS? Give it a try. Still
      plenty of room for Amazon and third parties to build tooling.