Server architecture & scaling strategy for a sports website

sport.gr server architecture, availability & scaling strategy

Leonidas Tsementzis

March 09, 2012

Transcript

  1. # who’s talking
     * leonidas tsementzis aka @goldstein [CTO, mobile architect]
     * software architect, engineer [all major web/mobile platforms]
     * devOps [enthusiast, not a real sysadmin]
     * entrepreneur [n00b]
  2. # the high-level requirements
     * 2007
     * take sport.gr to the next level...
     * ...make sure it works smoothly...
     * ...and fast enough
  3. # i can see clearly now :)
     * videos [goals, match coverage]
     * comments [the blogging age, remember?]
     * live streaming [Ustream does not exist, yet]
     * live coverage of events [CoveritLive does not exist, yet]
     * user-centric design [personalization, ratings]
     * even more videos [I can haz more LOLCats]
  4. # the problem :(
     * we are planning for 150% traffic growth [6 months planning ahead], but...
     * video costs [bandwidth cost: 1€/GB]
     * comment costs [DB writes, CPU, disk I/O]
     * live streaming costs [bandwidth cost: 1€/GB]
     * limited iron resources, and we’re not happy with our current host [dedicated managed servers in a top GR datacenter]
  5. # S3 to the rescue
     * 87% cost reduction [0.13€/GB vs 1€/GB; see the arithmetic check after this slide]
     * made the videos section possible...
     * ...and advertisers loved it ($$$+)
     * first GR site to focus on video, a key competitive advantage
     * 6TB of video traffic in the first month
     * hired a video editing team to support the demand
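
A quick back-of-the-envelope check of the numbers on this slide, using only the per-GB prices and the 6TB first-month figure quoted above:

```python
# Back-of-the-envelope bandwidth cost comparison, prices as quoted on the slide.
OLD_PRICE_PER_GB = 1.00        # EUR/GB at the previous dedicated host
S3_PRICE_PER_GB = 0.13         # EUR/GB on Amazon S3 (2012-era figure from the slide)
MONTHLY_TRAFFIC_GB = 6 * 1000  # ~6TB of video traffic in the first month

old_cost = MONTHLY_TRAFFIC_GB * OLD_PRICE_PER_GB   # 6000 EUR
s3_cost = MONTHLY_TRAFFIC_GB * S3_PRICE_PER_GB     # 780 EUR
reduction = 1 - S3_PRICE_PER_GB / OLD_PRICE_PER_GB

print(f"old host: {old_cost:.0f} EUR/month, S3: {s3_cost:.0f} EUR/month")
print(f"per-GB cost reduction: {reduction:.0%}")   # -> 87%
```
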
  6. # EC2 servers on demand [a provisioning sketch follows this slide]
     * 3x(n) application servers for the main website [Windows 2003, IIS 6]
     * 2x(n) application servers for APIs [Windows 2003, IIS 6]
     * 2x(n) servers for banner managers [CentOS, Apache, OpenX]
     * 1x storage server
     * 2x database servers [MS SQL Server 2008 with failover]
     * 2x reverse proxy cache servers [Squid]
     * 2x load balancers [HAProxy with failover]
     * 1x monitoring server [munin with a lot of custom plugins]
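
The deck doesn’t include any provisioning code; purely as an illustration of “servers on demand”, here is a minimal sketch using today’s boto3 SDK (the 2012 setup would have used the EC2 API tools of the time, and the AMI ID, key pair, security group, instance type and region below are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")  # assumed region

# Launch one more application server from a pre-baked AMI (placeholder ID).
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",             # hypothetical app-server AMI
    InstanceType="m1.large",                     # placeholder instance class
    MinCount=1,
    MaxCount=1,
    KeyName="ops-keypair",                       # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],   # placeholder security group
)
print("launched:", response["Instances"][0]["InstanceId"])
```
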
  7. # a typical week
     * peaks at 3k hits/sec, once or twice a week
     * normal rate at 300 hits/sec
     * you can’t afford to size for the 1st
     * you can’t deliver if you size for the 2nd
  8. # auto-scaling to the rescue
     * if average CPU usage grows over 60% for 2 minutes, add another application server
     * if average CPU usage falls below 30% for 5 minutes, gracefully kill an application server [see the policy sketch after this slide]
     * 20 instances on peaks
     * 3 instances (minimum) on normal operations
     * no more “Server is busy” errors
     * pay only for what you (really) need
     * you can now sleep at night
     * 60% overall cost reduction
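
The scaling rules above map directly onto an Auto Scaling policy driven by CloudWatch alarms. A minimal sketch with the modern boto3 SDK (the group, policy and region names are hypothetical, and the original 2012 setup predates this SDK):

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
cloudwatch = boto3.client("cloudwatch", region_name="eu-west-1")

GROUP = "app-servers"  # hypothetical Auto Scaling group name

def cpu_policy(name, adjustment, threshold, minutes, comparison):
    """Attach a +/-1 instance policy triggered by an average-CPU alarm."""
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName=GROUP,
        PolicyName=name,
        AdjustmentType="ChangeInCapacity",
        ScalingAdjustment=adjustment,   # +1 to add, -1 to remove a server
        Cooldown=300,
    )
    cloudwatch.put_metric_alarm(
        AlarmName=f"{GROUP}-{name}",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Statistic="Average",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": GROUP}],
        Period=60,                       # 1-minute samples
        EvaluationPeriods=minutes,       # sustained for N minutes
        Threshold=threshold,
        ComparisonOperator=comparison,
        AlarmActions=[policy["PolicyARN"]],
    )

# >60% average CPU for 2 minutes: add an application server.
cpu_policy("scale-out", +1, 60.0, 2, "GreaterThanThreshold")
# <30% average CPU for 5 minutes: gracefully retire one.
cpu_policy("scale-in", -1, 30.0, 5, "LessThanThreshold")
```
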
  9. # wait, there’s more!
     * CDN & media streaming with CloudFront
     * use multiple CNAMEs with CloudFront to boost parallel HTTP requests [as YSlow recommends; see the sharding sketch after this slide]
     * CloudFront custom domains are sexy
     * robust DNS with Route 53
     * simple monitoring with CloudWatch [you still need an external monitoring tool]
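
The “multiple CNAMEs” bullet is the classic YSlow domain-sharding trick: browsers of that era opened only a handful of parallel connections per hostname, so spreading static assets across a few CloudFront aliases increases parallel downloads. A minimal sketch (the hostnames are hypothetical) that hashes each asset path to a stable shard, so every page emits the same cacheable URL for a given asset:

```python
import hashlib

# Hypothetical CloudFront CNAMEs configured as alternate domain names.
SHARDS = [
    "static1.example.gr",
    "static2.example.gr",
    "static3.example.gr",
]

def asset_url(path: str) -> str:
    """Map an asset path to a stable shard: the same asset always gets the
    same hostname (cache-friendly) while assets spread across hostnames."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    shard = SHARDS[int(digest, 16) % len(SHARDS)]
    return f"http://{shard}{path}"

print(asset_url("/img/logo.png"))
print(asset_url("/css/site.css"))
```
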
  10. # SUM()
      * S3: photos, videos, static banners
      * EC2: main website, SQL databases, backoffice, APIs, banner managers, cache servers, load balancers
      * CloudWatch: auto-scaling, simple monitoring
      * CloudFront: video streaming, CDN
      * ELB: load balancing
      * RDS: MySQL databases
      * Route 53: DNS resolution
  11. # lessons learned
      * test, iterate, test, iterate
      * reserved instances save you $$
      * EC2 is a hacker playground [prepare for DoS attacks]
      * back up entire AMIs to S3 [instances *WILL* #FAIL; see the backup sketch after this slide]
      * EBS disk I/O is slow, but Amazon is working on this [problems with DB writes]
      * spawning new instances is slow [15 minutes of provisioning can be a showstopper for scaling]
      * S3 uploads/downloads are slow
      * sticky sessions are a must [we replaced AWS ELB with HAProxy just for this]
      * SLAs can’t guarantee high availability [AWS *WILL* #FAIL]
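
The backup tooling isn’t shown in the deck; a rough sketch of the “back up entire AMIs” point with the modern boto3 SDK (the instance ID is a placeholder; for EBS-backed instances the resulting snapshots are persisted by AWS with S3 behind the scenes, while 2012-era instance-store AMIs were bundled and uploaded to S3 explicitly):

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2", region_name="eu-west-1")

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder application-server instance

# Snapshot the whole machine as an AMI so a failed instance can be relaunched
# as-is from the image.
stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M")
image = ec2.create_image(
    InstanceId=INSTANCE_ID,
    Name=f"app-server-backup-{stamp}",
    NoReboot=True,   # don't restart the live server to take the image
)
print("backup AMI:", image["ImageId"])
```
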
  12. # more lessons learned
      * devOps engineers are hard to find [interested? I’m hiring]
      * automate everything [lets you sleep at night]
      * monitor everything [munin is your friend]
      * disaster prevention [*ALWAYS* plan around the worst-case scenario]
      * Windows server administration is a mess [and AWS is not making it prettier]
      * DB scaling is the hardest part [code changes]
      * legacy software *IS* a problem
        ** on scaling
        ** on hiring
        ** on growing (have you tried to use XMPP via ASP?)
  13. # AWS is not perfect
      * Akamai is still faster than CloudFront [especially in Greece]
      * not affordable for large architectures [if you’re running 300+ instances, you should consider building your own datacenter]