
Server architecture & scaling strategy for a sports website


sport.gr server architecture, availability & scaling strategy

Leonidas Tsementzis

March 09, 2012

Transcript

  1. # who's talking
     * leonidas tsementzis aka @goldstein [CTO, mobile architect]
     * software architect, engineer [all major web/mobile platforms]
     * devOps [enthusiast, not a real sysadmin]
     * entrepreneur [n00b]
  2. # the high-level requirements
     * 2007
     * take sport.gr to the next level...
     * ...make sure it works smoothly...
     * ...and fast enough
  3. # i can see clearly now :)
     * videos [goals, match coverage]
     * comments [the blogging age, remember?]
     * live streaming [ustream does not exist, yet]
     * live coverage of events [cover it live does not exist, yet]
     * user-centric design [personalization, ratings]
     * even more videos [I can haz more LOLCats]
  4. # the problem :(
     * we are planning for 150% traffic growth [planning 6 months ahead], but...
     * video costs [bandwidth cost: 1€/GB]
     * comment costs [DB writes, CPU, disk I/O]
     * live streaming costs [bandwidth cost: 1€/GB]
     * limited iron resources, not happy with our current host [dedicated managed servers in a top GR datacenter]
  5. # S3 to the rescue
     * 87% cost reduction [0.13€/GB vs 1€/GB; rough math below]
     * made the videos section possible...
     * ...and advertisers loved it ($$$+)
     * first GR site to focus on video, a key competitive advantage
     * 6TB of video traffic in the first month
     * hired a video editing team to support the demand
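
A back-of-the-envelope check on that 87% figure (a minimal Python sketch; the 6TB/month volume and the two per-GB prices are the ones quoted on these slides):

```python
# Rough monthly bandwidth cost at the quoted prices:
# 6 TB of video traffic, 1 EUR/GB at the old host vs 0.13 EUR/GB on S3
# (2012-era figures taken from the slides).
monthly_traffic_gb = 6 * 1024

old_cost = monthly_traffic_gb * 1.00     # old host, 1 EUR/GB
s3_cost = monthly_traffic_gb * 0.13      # S3, 0.13 EUR/GB
saving = 1 - s3_cost / old_cost

print(f"old host: {old_cost:>8,.0f} EUR/month")
print(f"S3:       {s3_cost:>8,.0f} EUR/month")
print(f"saving:   {saving:.0%}")         # -> 87%
```
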
  6. # EC2 servers on demand
     * 3x(n) application servers for the main website [Windows 2003, IIS 6]
     * 2x(n) application servers for APIs [Windows 2003, IIS 6]
     * 2x(n) servers for banner managers [CentOS, Apache, OpenX]
     * 1x storage server
     * 2x database servers [MS SQL Server 2008 with failover]
     * 2x reverse proxy cache servers [Squid]
     * 2x load balancers [HAProxy with failover]
     * 1x monitoring server [munin with a lot of custom plugins]
  7. # a typical week
     * peaks at 3k hits/sec once or twice a week
     * normal rate around 300 hits/sec
     * provision for the peak and you can't afford it
     * provision for the normal rate and you can't deliver [rough math below]
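
The rough math behind that trade-off (a minimal sketch; the ~150 hits/sec per application server is an assumed figure for illustration, only the 3k peak and 300 hits/sec baseline come from the slide):

```python
# Static sizing dilemma: provision for the peak and pay for idle iron all
# week, or provision for the baseline and drop requests at the peak.
# per_instance capacity is an assumption, not a measured number.
import math

peak_hits, baseline_hits = 3000, 300
per_instance = 150          # assumed hits/sec one app server can handle

sized_for_peak = math.ceil(peak_hits / per_instance)          # ~20 instances
sized_for_baseline = math.ceil(baseline_hits / per_instance)  # ~2 instances

print(f"sized for the peak:     {sized_for_peak} instances, mostly idle all week")
print(f"sized for the baseline: {sized_for_baseline} instances, melts down once or twice a week")
```
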
  8. # auto-scaling to the rescue
     * if average CPU usage stays over 60% for 2 minutes, add another application server
     * if average CPU usage stays below 30% for 5 minutes, gracefully kill an application server [sketch below]
     * 20 instances on peaks
     * 3 instances (minimum) on normal operations
     * no more “Server is busy” errors
     * pay only for what you (really) need
     * you can now sleep at night
     * 60% overall cost reduction
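
A minimal sketch of those two rules as scaling policies plus CloudWatch alarms, written with today's boto3 for readability (the 2012 setup used the original Auto Scaling tools; the group and alarm names are hypothetical):

```python
# Two scaling rules: +1 app server when avg CPU > 60% for 2 minutes,
# -1 app server when avg CPU < 30% for 5 minutes. The group itself would be
# created with MinSize=3 / MaxSize=20 to match the slide. Names are illustrative.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")
GROUP = "sport-gr-app"      # hypothetical Auto Scaling group name

def cpu_rule(policy_name, adjustment, alarm_name, threshold, comparison, minutes):
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=GROUP,
        PolicyName=policy_name,
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=adjustment,
        Cooldown=300,
    )
    cloudwatch.put_metric_alarm(
        AlarmName=alarm_name,
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": GROUP}],
        Statistic="Average",
        Period=60,                     # 1-minute samples
        EvaluationPeriods=minutes,     # sustained for N minutes
        Threshold=threshold,
        ComparisonOperator=comparison,
        AlarmActions=[policy["PolicyARN"]],
    )

cpu_rule("scale-out-cpu", +1, "app-cpu-high", 60.0, "GreaterThanThreshold", 2)
cpu_rule("scale-in-cpu", -1, "app-cpu-low", 30.0, "LessThanThreshold", 5)
```
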
  9. # wait, there's more!
     * CDN & media streaming with CloudFront
     * use multiple CNAMEs with CloudFront to parallelize HTTP requests [as YSlow recommends; sketch below]
     * CloudFront custom domains are sexy
     * robust DNS with Route 53
     * simple monitoring with CloudWatch [you still need an external monitoring tool]
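
A minimal sketch of that multiple-CNAMEs trick (domain sharding): hash each asset path to one of a few CloudFront aliases so browsers open more parallel connections. The staticN.sport.gr hostnames are hypothetical examples.

```python
# Domain sharding per the old YSlow guidance: spread assets over a few
# CNAMEs pointing at the same CloudFront distribution so the browser can
# fetch more of them in parallel. Hostnames here are illustrative.
import zlib

SHARDS = ["static1.sport.gr", "static2.sport.gr", "static3.sport.gr"]

def asset_url(path: str) -> str:
    """Always map a given path to the same shard so it stays cacheable."""
    shard = SHARDS[zlib.crc32(path.encode("utf-8")) % len(SHARDS)]
    return f"http://{shard}/{path.lstrip('/')}"

print(asset_url("/img/logo.png"))
print(asset_url("/css/site.css"))
```
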
  10. # SUM()
      * S3: photos, videos, static banners
      * EC2: main website, SQL databases, backoffice, APIs, banner managers, cache servers, load balancers
      * CloudWatch: auto-scaling, simple monitoring
      * CloudFront: video streaming, CDN
      * ELB: load balancing
      * RDS: MySQL databases
      * Route 53: DNS resolution
  11. # lessons learned
      * test, iterate, test, iterate
      * reserved instances save you $$
      * EC2 is a hacker playground [prepare for DoS attacks]
      * back up entire AMIs to S3 [instances *WILL* #FAIL; sketch below]
      * EBS disk I/O is slow, but Amazon is working on this [problems with DB writes]
      * spawning new instances is slow [15 minutes of provisioning can be a showstopper for scaling]
      * S3 uploads/downloads are slow
      * sticky sessions are a must [we replaced AWS ELB with HAProxy just for this]
      * SLAs can't guarantee high availability [AWS *WILL* #FAIL]
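
A minimal sketch of that AMI backup habit, again with boto3 for readability (instance ID and naming are hypothetical; EBS-backed AMIs are persisted as snapshots in S3 behind the scenes):

```python
# Bake an AMI of a running application server so a failed instance can be
# replaced quickly from a known-good image. Identifiers are illustrative.
import datetime
import boto3

ec2 = boto3.client("ec2")

def backup_ami(instance_id: str, role: str) -> str:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")
    image = ec2.create_image(
        InstanceId=instance_id,
        Name=f"{role}-backup-{stamp}",
        Description=f"Scheduled AMI backup of {role}",
        NoReboot=True,          # don't bounce a live production box
    )
    return image["ImageId"]

print(backup_ami("i-0123456789abcdef0", "app-server"))
```
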
  12. # more lessons learned
      * devOps people are hard to find [interested? I'm hiring]
      * automate everything [lets you sleep at night]
      * monitor everything [munin is your friend]
      * disaster prevention [*ALWAYS* plan around the worst-case scenario]
      * Windows server administration is a mess [and AWS is not making it prettier]
      * DB scaling is the hardest part [code changes]
      * legacy software *IS* a problem
        ** on scaling
        ** on hiring
        ** on growing (have you tried to use XMPP via ASP?)
  13. # AWS is not perfect
      * Akamai is still faster than CloudFront [especially in Greece]
      * not affordable for large architectures [if you're running 300+ instances, you should consider building your own datacenter]