working on a lot of open source software
  ◦ anaconda and yum
  ◦ Original Fedora team
  ◦ Xen
• Then led Cloud Infrastructure team at HubSpot
• Now engineer at Stackdriver
Major Takeaways
AWS
• State is a bug
• Everything fails. Plan on it
• Learn about all the AWS services
• Embrace and become expert in those you use
• Just say no to EBS
Build for AWS
application
  ◦ Oracle RAC or similar just aren't practical
  ◦ IO performance significantly lower
  ◦ Much lower limits to vertical scaling
• Even off-the-shelf Internet applications aren't great
  ◦ WordPress
  ◦ Drupal
  ◦ ...
• Two key things to keep in mind
  ◦ Horizontally scalable
  ◦ No single point of failure
AWS
• Nodes should be stateless
• Lack of state lets you spin up (or down) capacity as you need
• Avoid sticky sessions if you can
• Session info in cookies or shared storage (db) is going to be more reliable long-term
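One way to keep session info in a cookie, so that any node can serve any request, is to sign the cookie with a key shared by all nodes. The following is only a minimal sketch of that idea using the Python standard library. `SECRET_KEY`, `encode_session`, and `decode_session` are hypothetical names, not from the talk, and a real deployment would also want expiry and probably encryption.

```python
import base64
import hashlib
import hmac
import json

# Hypothetical shared secret; every node must be configured with the same value.
SECRET_KEY = b"replace-with-a-real-secret"


def encode_session(data: dict, key: bytes = SECRET_KEY) -> str:
    """Serialize session data and append an HMAC so nodes can verify it."""
    payload = base64.urlsafe_b64encode(json.dumps(data, sort_keys=True).encode())
    sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig


def decode_session(cookie: str, key: bytes = SECRET_KEY):
    """Return the session dict, or None if the cookie is malformed or tampered with."""
    try:
        payload, sig = cookie.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Because nothing is stored on the node itself, an instance can be terminated or added without losing anyone's session.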
  ◦ $$$ to operate across regions
  ◦ $$$ to build to run across regions
• How much does downtime cost your business?
• What SLA do you promise your customers?
• Figure out if the tradeoff is worth it
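The tradeoff above is mostly arithmetic: how much downtime an SLA allows per year, and whether the expected cost of outages exceeds the extra cost of running across regions. A rough sketch follows. The function names and every number in the usage note are made-up illustrations, not figures from the talk.

```python
HOURS_PER_YEAR = 365 * 24  # ignores leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours of downtime per year that a given availability SLA permits."""
    return (1 - sla_percent / 100) * HOURS_PER_YEAR


def multi_region_worth_it(expected_outage_hours: float,
                          downtime_cost_per_hour: float,
                          extra_yearly_cost: float) -> bool:
    """True if expected yearly outage cost exceeds the added multi-region cost."""
    return expected_outage_hours * downtime_cost_per_hour > extra_yearly_cost
```

For example, a 99.9% SLA allows roughly 8.76 hours of downtime a year. With hypothetical figures of 10 expected outage hours at 5,000 per hour against 30,000 of extra yearly spend, `multi_region_worth_it(10, 5000, 30000)` says the second region pays for itself.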
CloudFront, etc) to help mitigate failures
• Distribute your application across Availability Zones
• Use IAM
  ◦ Don't share master keys with everyone in your org
• Don't do things by hand
  ◦ Rich APIs are available for everything with bindings for every language you might want
  ◦ If using the console, you are setting yourself up for future problems
• Automate setup of instances too
  ◦ chef, puppet, cfengine, salt, ansible, fabric, bash
  ◦ It doesn't matter. Pick one.
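As an illustration of using IAM rather than handing out master keys, and of generating things from code rather than by hand, here is a small sketch that builds a narrowly scoped IAM policy document. The helper name `read_only_bucket_policy` and the bucket name are hypothetical. The JSON structure follows the standard IAM policy format, and attaching it to a user or role is left to whichever API binding you pick.

```python
import json


def read_only_bucket_policy(bucket: str) -> dict:
    """Build an IAM policy document granting read-only access to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # the bucket itself (for ListBucket)
                    f"arn:aws:s3:::{bucket}/*",    # objects in the bucket (for GetObject)
                ],
            }
        ],
    }


if __name__ == "__main__":
    # "example-app-assets" is a placeholder bucket name.
    print(json.dumps(read_only_bucket_policy("example-app-assets"), indent=2))
```

Keeping policies in code like this means they can be reviewed, versioned, and reapplied the same way every time.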
EC2
manual intervention for failover
• Autoscaling can help but lags demand
• Capacity not always available to add more
  ◦ Especially during problem periods
  ◦ Reserved instances guarantee capacity for you
  ◦ But it'll cost you to have that capacity
  ◦ And it doesn't help in the case of API problems
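Since capacity requests and API calls can fail, especially during problem periods, code that launches or manages instances should expect errors and retry. The sketch below shows one common approach, exponential backoff with jitter. `call_with_backoff` is a hypothetical helper, the default delays are arbitrary, and `fn` stands in for whatever API call you are making.

```python
import random
import time


def call_with_backoff(fn, attempts=5, base_delay=0.5, max_delay=30.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff and jitter.

    Re-raises the last exception once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Double the delay ceiling each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, ceiling))
```

Retrying does not create capacity that is not there, so callers still need a fallback such as another instance type or Availability Zone once the retries are exhausted.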
Just Say No to EBS
with no local disks and everything via NAS?
• That's (basically) what EBS is.
• Failures regularly cascade
• (Historically) Inconsistent performance
  ◦ Provisioned IOPS help but low guaranteed perf
• At the very least, avoid EBS root instances
Queues
openmq, etc) for others
• Allows you to horizontally scale different parts of your system independently
  ◦ Web tier
  ◦ Backend worker tier
  ◦ Database tier
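The decoupling a queue provides can be shown in miniature with an in-process queue and a pool of workers. This is only an analogy: the standard library `queue.Queue` stands in for a real message queue service, threads stand in for worker instances, and `run_workers` is a hypothetical helper. The point is that the number of workers can change without the producer knowing or caring.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_workers(jobs, handler, num_workers=4):
    """Push jobs onto a queue and process them with num_workers worker threads.

    Returns the handler results in completion order; a job whose handler
    raised contributes the exception object instead of a result.
    """
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = q.get()
            if job is _STOP:
                q.task_done()
                return
            try:
                out = handler(job)
            except Exception as exc:  # keep the worker alive on bad jobs
                out = exc
            with lock:
                results.append(out)
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for job in jobs:          # the "web tier" only ever talks to the queue
        q.put(job)
    for _ in threads:         # one stop signal per worker
        q.put(_STOP)
    q.join()
    for t in threads:
        t.join()
    return results
```

Changing `num_workers` scales the worker tier without touching the code that enqueues jobs, which is the property the slide is after.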
clusters
• Multi-AZ works great 99+% of the time
• Hard to import large data sets
• Don't just trust the Amazon backups
  ◦ Have a read-only replica that you mysqldump from
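Taking your own dump from a read-only replica can be scripted so it runs on a schedule instead of by hand. The sketch below only builds the `mysqldump` command line. The hostname, user, database, and output path in the usage note are placeholders, and it assumes credentials come from a MySQL option file rather than the command line.

```python
import subprocess


def mysqldump_command(replica_host: str, user: str, database: str, outfile: str) -> list:
    """Build a mysqldump invocation that reads from a replica, not the primary."""
    return [
        "mysqldump",
        f"--host={replica_host}",
        f"--user={user}",
        "--single-transaction",      # consistent snapshot for InnoDB without locking tables
        f"--result-file={outfile}",
        database,
    ]


def run_backup(replica_host: str, user: str, database: str, outfile: str) -> None:
    """Run the dump; raises CalledProcessError if mysqldump exits non-zero."""
    subprocess.run(mysqldump_command(replica_host, user, database, outfile), check=True)
```

A call such as `run_backup("replica.example.internal", "backup", "appdb", "/backups/appdb.sql")` would then be driven by cron or a similar scheduler, with the resulting file copied somewhere outside the database's own failure domain.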
Elastic Load Balancer
nodes to round-robin and help isolate on instance failures
• But during EBS problems, ELBs sometimes get stuck
• ELB doesn't help with cross-region availability either
• Can also use for non-web traffic
• Note: idle connections dropped after 60 seconds
Elastic MapReduce
running a Hadoop cluster
• Debugging failures can be difficult
  ◦ SSH escape hatch will let you look a little more
• Use the ganglia bootstrap action to analyze performance
  ◦ http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
• Spot instances can increase job throughput at minimal cost increase
  ◦ Use on-demand master + data nodes + some number of spot job nodes
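The on-demand master + data nodes + spot job nodes layout can be written down as data. The sketch below builds a list of instance groups in the shape that, as far as I know, boto3's EMR `run_job_flow` accepts under `Instances["InstanceGroups"]`; check the current API reference before relying on it. The helper name, instance type, counts, and bid price are all illustrative assumptions, with "data nodes" mapped to the CORE role and "job nodes" to the TASK role.

```python
def emr_instance_groups(core_count: int, spot_task_count: int,
                        instance_type: str = "m1.large",
                        bid_price: str = "0.10") -> list:
    """Describe an EMR cluster with on-demand master/core and spot task nodes."""
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",   # losing the master loses the cluster
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",   # core nodes hold HDFS data
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if spot_task_count > 0:
        groups.append({
            "Name": "task-spot",
            "InstanceRole": "TASK",
            "Market": "SPOT",        # task nodes only run jobs, so losing them is tolerable
            "BidPrice": bid_price,
            "InstanceType": instance_type,
            "InstanceCount": spot_task_count,
        })
    return groups
```

If the spot nodes are reclaimed, the job slows down but the master and the data survive, which is why only the task group is put on the spot market.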
Major Takeaways
AWS
• State is a bug
• Everything fails. Plan on it
• Learn about all the AWS services
• Embrace and become expert in those you use
• Just say no to EBS
Twitter: @katzj
Slides: http://speakerdeck.com/katzj/