Lessons Learned in AWS (the Hard Way)

Jeremy Katz
November 12, 2012

Transcript

  1. 6+ Years in the Cloud
     • Previously at Red Hat working on a lot of open source software
       ◦ anaconda and yum
       ◦ Original Fedora team
       ◦ Xen
     • Then led Cloud Infrastructure team at HubSpot
     • Now engineer at Stackdriver
  2. Joined Stackdriver to Simplify Management in the Cloud
     • Lots of tools, custom code
     • Still, #monitoringsucks
     • Plenty of data, just not enough information
     • 10 people, based downtown
     • Funded
     • Hiring!
  3. Major Takeaways
     • Don't just move your existing app from DC -> AWS
     • State is a bug
     • Everything fails. Plan on it
     • Learn about all the AWS services
     • Embrace and become expert in those you use
     • Just say no to EBS
  4. Build for AWS
     • You can't just pick up your existing data center application
       ◦ Oracle RAC or similar just isn't practical
       ◦ IO performance is significantly lower
       ◦ Much lower limits to vertical scaling
     • Even off-the-shelf Internet applications aren't great
       ◦ Wordpress
       ◦ Drupal
       ◦ ...
     • Two key things to keep in mind
       ◦ Horizontally scalable
       ◦ No single point of failure
  5. State is a Bug
     • State is the killer with AWS
     • Nodes should be stateless
     • Lack of state lets you spin capacity up (or down) as you need it
     • Avoid sticky sessions if you can
     • Session info in cookies or shared storage (a database) is going to be more reliable long-term (sketch below)
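     A minimal sketch of keeping session data off the web node, assuming Flask with its
     signed-cookie sessions (the app, route, and secret value here are illustrative, not
     from the talk); any node behind the load balancer can then serve the next request:

         from flask import Flask, session

         app = Flask(__name__)
         app.secret_key = "shared-across-all-nodes"  # placeholder; must be identical on every node

         @app.route("/login")
         def login():
             session["user_id"] = 42  # lives in the signed cookie, not on this node
             return "ok"

         @app.route("/me")
         def me():
             # Works no matter which web node the load balancer picks
             return str(session.get("user_id"))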
  6. Everything Fails
     • There is no such thing as a 100% service
     • Some are more fault tolerant than others (S3, Route53)
     • But they all fail in some way
  7. Failure Mitigation
     • Spare capacity for redundancy
     • Failover
     • Ensure your code retries with a backoff (sketch below)
     • Read Release It! (http://bit.ly/oIBSuC) for non-AWS-specific tips
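     A minimal sketch of retrying with exponential backoff plus jitter; the function name,
     attempt count, and delays are illustrative, not from the talk:

         import random
         import time

         def call_with_backoff(operation, max_attempts=5, base_delay=0.5):
             """Retry a flaky call, sleeping base_delay * 2**attempt (plus jitter) between tries."""
             for attempt in range(max_attempts):
                 try:
                     return operation()
                 except Exception:
                     if attempt == max_attempts - 1:
                         raise  # out of retries; let the caller decide what to do
                     time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

         # Usage: wrap any AWS call, e.g. call_with_backoff(lambda: conn.get_all_instances())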
  8. Failure might be okay
     • Avoiding all failures is expensive
       ◦ $$$ to operate across regions
       ◦ $$$ to build to run across regions
     • How much does downtime cost your business?
     • What SLA do you promise your customers?
     • Figure out if the tradeoff is worth it
  9. Where to Start
     • Use platform features (ELB, RDS, S3, CloudFront, etc.) to help mitigate failures
     • Distribute your application across Availability Zones
     • Use IAM
       ◦ Don't share master keys with everyone in your org
     • Don't do things by hand (sketch below)
       ◦ Rich APIs are available for everything, with bindings for every language you might want
       ◦ If you only use the console, you are setting yourself up for future problems
     • Automate setup of instances too
       ◦ chef, puppet, cfengine, salt, ansible, fabric, bash
       ◦ It doesn't matter. Pick one.
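     A minimal sketch of driving EC2 from code rather than the console, assuming boto
     (the Python AWS bindings of the era); the AMI ID, key pair, and zones are placeholders:

         import boto.ec2

         conn = boto.ec2.connect_to_region("us-east-1")  # credentials from env/boto config

         for zone in ["us-east-1a", "us-east-1b"]:  # spread capacity across Availability Zones
             conn.run_instances(
                 "ami-12345678",            # placeholder AMI
                 instance_type="m1.small",
                 key_name="my-key",         # placeholder key pair
                 placement=zone,
             )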
  10. EC2
     • Elastic IPs can help with reliability
       ◦ But they require manual intervention for failover (sketch below)
     • Autoscaling can help, but it lags demand
     • Capacity isn't always available to add more
       ◦ Especially during problem periods
       ◦ Reserved instances guarantee capacity for you
       ◦ But it'll cost you to have that capacity
       ◦ And it doesn't help in the case of API problems
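     A minimal sketch of that "manual intervention", scripted with boto: re-point an Elastic IP
     at a standby instance when the primary fails. The IP and instance ID are placeholders:

         import boto.ec2

         conn = boto.ec2.connect_to_region("us-east-1")
         # Move the public IP from the dead primary to the warm standby
         conn.associate_address(instance_id="i-0standby0", public_ip="203.0.113.10")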
  11. EBS
     • Would you run every server in your data center with no local disks and everything via NAS?
  12. Just Say No to EBS
     • Would you run every server in your data center with no local disks and everything via NAS?
     • That's (basically) what EBS is
     • Failures regularly cascade
     • (Historically) inconsistent performance
       ◦ Provisioned IOPS helps, but the guaranteed performance is low
     • At the very least, avoid EBS-root instances
  13. Queues
     • SQS for many simple cases
     • A "real" queue (rabbitmq, openmq, etc.) for others
     • Queues let you horizontally scale different parts of your system independently (sketch below)
       ◦ Web tier
       ◦ Backend worker tier
       ◦ Database tier
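     A minimal sketch with boto's SQS bindings: the web tier enqueues work, the worker tier
     drains it, so each tier scales on its own. The queue name and message body are placeholders:

         import boto.sqs
         from boto.sqs.message import Message

         conn = boto.sqs.connect_to_region("us-east-1")
         queue = conn.create_queue("thumbnail-jobs")  # returns the existing queue if already created

         # Web tier: hand the work off instead of doing it in-request
         msg = Message()
         msg.set_body("image-id-1234")
         queue.write(msg)

         # Worker tier: poll, process, then delete
         work = queue.read()
         if work is not None:
             print("processing", work.get_body())
             queue.delete_message(work)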
  14. RDS
     • RDS saves a ton of pain managing database clusters
     • Multi-AZ works great 99+% of the time
     • Hard to import large data sets
     • Don't just trust the Amazon backups
       ◦ Have a read-only replica that you mysqldump from
  15. Elastic Load Balancer
     • Great, relatively simple service
     • Can front multiple web nodes to round-robin and help isolate instance failures (sketch below)
     • But during EBS problems, ELBs sometimes get stuck
     • ELB doesn't help with cross-region availability either
     • Can also be used for non-web traffic
     • Note: idle connections are dropped after 60 seconds
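     A minimal sketch with boto: create an ELB spanning two Availability Zones and register
     web nodes behind it. The name, zones, and instance IDs are placeholders:

         import boto.ec2.elb

         conn = boto.ec2.elb.connect_to_region("us-east-1")
         lb = conn.create_load_balancer(
             "web-lb",
             zones=["us-east-1a", "us-east-1b"],
             listeners=[(80, 80, "HTTP")],   # (LB port, instance port, protocol)
         )
         lb.register_instances(["i-0aaaaaaa", "i-0bbbbbbb"])
         print("point DNS at", lb.dns_name)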
  16. Elastic Map Reduce
     • Great for periodic batch jobs
     • Reduces the overhead of running a Hadoop cluster
     • Debugging failures can be difficult
       ◦ The SSH escape hatch will let you look a little more
     • Use the ganglia bootstrap action to analyze performance
       ◦ http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
     • Spot instances can increase job throughput at minimal extra cost (sketch below)
       ◦ Use an on-demand master + data nodes + some number of spot job nodes
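     A minimal sketch with boto's EMR bindings: an on-demand master and core (data) nodes,
     plus spot instances for extra throughput. The job name, instance types, S3 bucket, and
     bid price are placeholders:

         import boto.emr
         from boto.emr.instance_group import InstanceGroup

         conn = boto.emr.connect_to_region("us-east-1")
         groups = [
             InstanceGroup(1, "MASTER", "m1.small", "ON_DEMAND", "master"),
             InstanceGroup(4, "CORE", "m1.small", "ON_DEMAND", "data nodes"),
             InstanceGroup(6, "TASK", "m1.small", "SPOT", "spot workers", bidprice="0.05"),
         ]
         jobflow_id = conn.run_jobflow(
             name="nightly-batch",
             log_uri="s3://my-bucket/emr-logs",   # placeholder bucket
             instance_groups=groups,
         )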
  17. Major Takeaways
     • Don't just move your existing app from DC -> AWS
     • State is a bug
     • Everything fails. Plan on it
     • Learn about all the AWS services
     • Embrace and become expert in those you use
     • Just say no to EBS

     Twitter: @katzj
     [email protected]
     Slides: http://speakerdeck.com/katzj/