Lessons Learned in AWS (the Hard Way)

Jeremy Katz
November 12, 2012

Transcript

  1. 6+ Years in the Cloud
     • Previously at Red Hat working on a lot of open source software
       ◦ anaconda and yum
       ◦ Original Fedora team
       ◦ Xen
     • Then led Cloud Infrastructure team at HubSpot
     • Now engineer at Stackdriver
  2. Joined Stackdriver to Simplify Management in the Cloud
     • Lots of tools, custom code
     • Still, #monitoringsucks
     • Plenty of data, just not enough information
     • 10 people, based downtown
     • Funded
     • Hiring!
  3. Major Takeaways
     • Don't just move your existing app from DC -> AWS
     • State is a bug
     • Everything fails. Plan on it
     • Learn about all the AWS services
     • Embrace and become expert in those you use
     • Just say no to EBS
  4. Build for AWS
     • You can't just pick up your existing data center application
       ◦ Oracle RAC or similar just isn't practical
       ◦ IO performance is significantly lower
       ◦ Much lower limits to vertical scaling
     • Even off-the-shelf Internet applications aren't great
       ◦ Wordpress
       ◦ Drupal
       ◦ ...
     • Two key things to keep in mind
       ◦ Horizontally scalable
       ◦ No single point of failure
  5. State is a Bug
     • State is the killer with AWS
     • Nodes should be stateless
     • Lack of state lets you spin capacity up (or down) as you need it
     • Avoid sticky sessions if you can
     • Session info in cookies or shared storage (a database) is going to be more reliable long-term (sketch below)
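     A minimal sketch of keeping session data off the web node, assuming Flask with its
     signed-cookie sessions (the app, route, and secret value here are illustrative, not
     from the talk); any node behind the load balancer can then serve the next request:

         from flask import Flask, session

         app = Flask(__name__)
         app.secret_key = "shared-across-all-nodes"  # placeholder; must be identical on every node

         @app.route("/login")
         def login():
             session["user_id"] = 42  # lives in the signed cookie, not on this node
             return "ok"

         @app.route("/me")
         def me():
             # Works no matter which web node the load balancer picks
             return str(session.get("user_id"))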
  6. Everything Fails
     • There is no such thing as a 100% service
     • Some are more fault tolerant than others (S3, Route53)
     • But they all fail in some way
  7. Failure Mitigation
     • Spare capacity for redundancy
     • Failover
     • Ensure your code retries with a backoff (sketch below)
     • Read Release It! (http://bit.ly/oIBSuC) for non-AWS-specific tips
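     A minimal sketch of retrying with exponential backoff plus jitter; the function name,
     attempt count, and delays are illustrative, not from the talk:

         import random
         import time

         def call_with_backoff(operation, max_attempts=5, base_delay=0.5):
             """Retry a flaky call, sleeping base_delay * 2**attempt (plus jitter) between tries."""
             for attempt in range(max_attempts):
                 try:
                     return operation()
                 except Exception:
                     if attempt == max_attempts - 1:
                         raise  # out of retries; let the caller decide what to do
                     time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

         # Usage: wrap any AWS call, e.g. call_with_backoff(lambda: conn.get_all_instances())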
  8. Failure might be okay
     • Avoiding all failures is expensive
       ◦ $$$ to operate across regions
       ◦ $$$ to build to run across regions
     • How much does downtime cost your business?
     • What SLA do you promise your customers?
     • Figure out if the tradeoff is worth it
  9. Where to Start
     • Use platform features (ELB, RDS, S3, CloudFront, etc.) to help mitigate failures
     • Distribute your application across Availability Zones
     • Use IAM
       ◦ Don't share master keys with everyone in your org
     • Don't do things by hand (sketch below)
       ◦ Rich APIs are available for everything, with bindings for every language you might want
       ◦ If you only use the console, you are setting yourself up for future problems
     • Automate setup of instances too
       ◦ chef, puppet, cfengine, salt, ansible, fabric, bash
       ◦ It doesn't matter. Pick one.
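     A minimal sketch of driving EC2 from code rather than the console, assuming boto
     (the Python AWS bindings of the era); the AMI ID, key pair, and zones are placeholders:

         import boto.ec2

         conn = boto.ec2.connect_to_region("us-east-1")  # credentials from env/boto config

         for zone in ["us-east-1a", "us-east-1b"]:  # spread capacity across Availability Zones
             conn.run_instances(
                 "ami-12345678",            # placeholder AMI
                 instance_type="m1.small",
                 key_name="my-key",         # placeholder key pair
                 placement=zone,
             )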
  10. EC2
     • Elastic IPs can help with reliability
       ◦ But they require manual intervention for failover (sketch below)
     • Autoscaling can help, but it lags demand
     • Capacity isn't always available to add more
       ◦ Especially during problem periods
       ◦ Reserved instances guarantee capacity for you
       ◦ But it'll cost you to have that capacity
       ◦ And it doesn't help in the case of API problems
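     A minimal sketch of that "manual intervention", scripted with boto: re-point an Elastic IP
     at a standby instance when the primary fails. The IP and instance ID are placeholders:

         import boto.ec2

         conn = boto.ec2.connect_to_region("us-east-1")
         # Move the public IP from the dead primary to the warm standby
         conn.associate_address(instance_id="i-0standby0", public_ip="203.0.113.10")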
  11. EBS
     • Would you run every server in your data center with no local disks and everything via NAS?
  12. Just Say No to EBS
     • Would you run every server in your data center with no local disks and everything via NAS?
     • That's (basically) what EBS is
     • Failures regularly cascade
     • (Historically) inconsistent performance
       ◦ Provisioned IOPS helps, but the guaranteed performance is low
     • At the very least, avoid EBS-root instances
  13. Queues
     • SQS for many simple cases
     • A "real" queue (rabbitmq, openmq, etc.) for others
     • Queues let you horizontally scale different parts of your system independently (sketch below)
       ◦ Web tier
       ◦ Backend worker tier
       ◦ Database tier
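     A minimal sketch with boto's SQS bindings: the web tier enqueues work, the worker tier
     drains it, so each tier scales on its own. The queue name and message body are placeholders:

         import boto.sqs
         from boto.sqs.message import Message

         conn = boto.sqs.connect_to_region("us-east-1")
         queue = conn.create_queue("thumbnail-jobs")  # returns the existing queue if already created

         # Web tier: hand the work off instead of doing it in-request
         msg = Message()
         msg.set_body("image-id-1234")
         queue.write(msg)

         # Worker tier: poll, process, then delete
         work = queue.read()
         if work is not None:
             print("processing", work.get_body())
             queue.delete_message(work)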
  14. RDS
     • RDS saves a ton of pain managing database clusters
     • Multi-AZ works great 99+% of the time
     • Hard to import large data sets
     • Don't just trust the Amazon backups
       ◦ Have a read-only replica that you mysqldump from
  15. Elastic Load Balancer
     • Great, relatively simple service
     • Can front multiple web nodes to round-robin and help isolate instance failures (sketch below)
     • But during EBS problems, ELBs sometimes get stuck
     • ELB doesn't help with cross-region availability either
     • Can also be used for non-web traffic
     • Note: idle connections are dropped after 60 seconds
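     A minimal sketch with boto: create an ELB spanning two Availability Zones and register
     web nodes behind it. The name, zones, and instance IDs are placeholders:

         import boto.ec2.elb

         conn = boto.ec2.elb.connect_to_region("us-east-1")
         lb = conn.create_load_balancer(
             "web-lb",
             zones=["us-east-1a", "us-east-1b"],
             listeners=[(80, 80, "HTTP")],   # (LB port, instance port, protocol)
         )
         lb.register_instances(["i-0aaaaaaa", "i-0bbbbbbb"])
         print("point DNS at", lb.dns_name)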
  16. Elastic Map Reduce
     • Great for periodic batch jobs
     • Reduces the overhead of running a Hadoop cluster
     • Debugging failures can be difficult
       ◦ The SSH escape hatch will let you look a little more
     • Use the ganglia bootstrap action to analyze performance
       ◦ http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
     • Spot instances can increase job throughput at minimal extra cost (sketch below)
       ◦ Use an on-demand master + data nodes + some number of spot job nodes
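     A minimal sketch with boto's EMR bindings: an on-demand master and core (data) nodes,
     plus spot instances for extra throughput. The job name, instance types, S3 bucket, and
     bid price are placeholders:

         import boto.emr
         from boto.emr.instance_group import InstanceGroup

         conn = boto.emr.connect_to_region("us-east-1")
         groups = [
             InstanceGroup(1, "MASTER", "m1.small", "ON_DEMAND", "master"),
             InstanceGroup(4, "CORE", "m1.small", "ON_DEMAND", "data nodes"),
             InstanceGroup(6, "TASK", "m1.small", "SPOT", "spot workers", bidprice="0.05"),
         ]
         jobflow_id = conn.run_jobflow(
             name="nightly-batch",
             log_uri="s3://my-bucket/emr-logs",   # placeholder bucket
             instance_groups=groups,
         )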
  17. Major Takeaways
     • Don't just move your existing app from DC -> AWS
     • State is a bug
     • Everything fails. Plan on it
     • Learn about all the AWS services
     • Embrace and become expert in those you use
     • Just say no to EBS

     Twitter: @katzj
     [email protected]
     Slides: http://speakerdeck.com/katzj/