
Tales from the Ops Side - Celebs vs Ops

VM Farms Inc.
September 16, 2015


A continuing series of talks presented by VM Farms, Tales from the Ops Side serves to educate developers on operational best practices for maintaining application performance, stability, and system security, as well as to share frontline stories from the battlefield.

In this special FITC instalment, Hany Fahim, founder and CEO of VM Farms, discusses the impact celebrity influence can have on web traffic, and the devastating effects this can have on sites and applications. Using real-world examples, Hany discusses the best methods to safeguard against these spikes while keeping online properties performant and available throughout.


Transcript

  1. CELEBRITIES VS. OPS: The great battle for uptime. Hany Fahim (@iHandroid), Founder and CEO of VM Farms (@vmfarms). Tales from the Ops Side.
  2. Introduction • Taking you through an epic historical fight for uptime between: a client's site and the might of celebrities.
  3. Year: 2012 • Our company had just onboarded a new client: a popular annual music award show. • Our mission: help scale their site to stand up to the traffic from an upcoming event. • Each year prior, their site had collapsed under load during the event.
  4. Corner #1 - Event Website • An event website whose traffic spikes 1000X during the annual event. • The event's main marketing website was built with WordPress, with lots of plugins. • WordPress is a great CMS and wildly popular.
  5. Not very scalable • Can be bloated (lots of memory usage). • Can be intensive (rendering pages uses a lot of CPU cycles). • A vast plugin ecosystem means quality varies widely.
  6. Corner #2 - Reigning Champion • Celebrity might. • Millions of fans armed with social media. • Like a flash mob, fans can come and go in the blink of an eye. • Tend to make Ops unhappy.
  7. Training • Our first goal was to analyze their stack and look for "quick wins" (caching, caching, and more caching). • Scaling strategy: load-balance WordPress across several servers. • We were going to need a lot of caching. • Keep an eye on the database, as it's the single point of failure.
  8. The Anti-Load Balancer • User uploads must be stored locally. • WordPress makes a lot of DB queries (59 queries on average[1]). • Plugins complicate matters: many rely on local files to function. • Most guides online tell you to use shared storage. [1] http://wordpress.stackexchange.com/a/154895
  9. Options 1. Don't share files: run everything on one really, really large server. Not very stable, limited flexibility. 2. Sync files around. Many tools available here. 3. Replicate filesystems. Different from shared filesystems: everyone has a copy.
  10. Opted for File Replication • Came across GlusterFS as a possible solution. • While researching options, stumbled on a neat "mode" of GlusterFS: "replicated". • In this mode, files are replicated to all servers. • Gotcha: files are replicated on reads.
  11. Writes are directed to a single server and the request is returned immediately to the user.
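For reference, replicated mode is chosen when the GlusterFS volume is created. A rough provisioning sketch (the volume name, hostnames, and brick paths here are all hypothetical; check the GlusterFS documentation for your version):

```shell
# Create a volume where every brick holds a full copy of every file
# ("replica 3" with three bricks = one complete copy per web server).
gluster volume create wp-uploads replica 3 \
    web1:/data/bricks/wp-uploads \
    web2:/data/bricks/wp-uploads \
    web3:/data/bricks/wp-uploads
gluster volume start wp-uploads

# Each web server then mounts the volume over the WordPress uploads dir.
mount -t glusterfs localhost:/wp-uploads /var/www/event/wp-content/uploads
```

The gotcha mentioned above is that reads through such a mount can trigger replication and self-heal work, which is where the load in this story ultimately came from.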
  12. Nominations - Round 1 • Monday 9am: event starts. • 30 minutes in, servers holding their ground. • Traffic went up, and so did load, but it was manageable. • Spare servers were on standby, ready to tag in.
  13. Nominations - Round 2 • In the blink of an eye, a huge influx of traffic arrives and the cluster collapses. • High load on all web servers. • We threw in the spare servers, but that seemed to make the problem worse. • Where was this traffic coming from? Referral traffic indicated it was coming from Twitter.
  14. This guy • It was from Justin Bieber, with 40M Twitter followers. • Traced it to a tweet along the lines of "Im nominated. Go here <link>". • This single tweet caused 70Mbits of traffic (a unit we coined the Bieberbit).
  15. Cluster down for the count • On closer look, all the load was coming from Gluster itself. • We scrambled to dismantle the cluster and fall back to the sync strategy, but it was too late. • The event was already over. • The client was displeased.
  16. Regroup • The fan army took us by surprise. • Like a flash mob, the fan army came and dissipated in about 10 minutes; however, load persisted for upwards of an hour. • Many questions: Why did everything collapse so badly? Why did load persist? • We had to find a way to reproduce it.
  17. Test Lab • Duplicated the environment in a lab. • Built a series of "fan bot" servers to simulate load. • Grabbed the logs from the event and replayed them against this new cluster. • We could reliably crash the cluster every time.
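The talk doesn't show the fan-bot tooling itself, but the replay step above can be sketched in a few lines: pull request paths out of the saved access logs, then fire them at the lab cluster concurrently. Everything here (names, log format, concurrency level) is an assumption for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

# Matches the request path in Apache/Nginx common or combined log format.
LOG_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/')

def extract_paths(log_lines):
    """Pull request paths out of access-log lines; skip unparseable lines."""
    return [m.group(1) for line in log_lines if (m := LOG_RE.search(line))]

def replay(base_url, paths, workers=50):
    """Replay the recorded requests against the lab cluster concurrently."""
    def hit(path):
        try:
            with urlopen(base_url + path, timeout=10) as resp:
                return resp.status
        except OSError:
            return None  # connection refused/timed out: the cluster is hurting
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(hit, paths))

if __name__ == "__main__":
    sample = ['10.0.0.1 - - [16/Sep/2015:21:00:00 -0400] '
              '"GET /nominees/ HTTP/1.1" 200 512']
    print(extract_paths(sample))  # ['/nominees/']
```

Scaling `workers` (or running several of these "fan bots" at once, as described above) approximates the flash-mob arrival pattern well enough to crash a misconfigured cluster reproducibly.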
  18. No time to spare • We had to rework our strategy. • The main event was less than a month away. • We removed Gluster from the equation. • Opted for syncing (lsyncd) and load balancer tricks to direct updates to a "source" server.
  19. /wp-admin.php • lsyncd detects changes and syncs them across the cluster. • The load balancer directs "admin" traffic to a dedicated "source" server.
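An lsyncd setup of this shape is configured in Lua. A minimal sketch, assuming two replica web servers and made-up hostnames and paths (lsyncd watches the source tree with inotify and pushes changes out via rsync):

```lua
settings {
    logfile    = "/var/log/lsyncd.log",
    statusFile = "/var/log/lsyncd-status.log",
}

-- One sync block per replica; "web2"/"web3" and the paths are hypothetical.
sync {
    default.rsync,
    source = "/var/www/event/wp-content/",
    target = "web2:/var/www/event/wp-content/",
    delay  = 1,  -- batch changes for up to 1s before each rsync run
}
sync {
    default.rsync,
    source = "/var/www/event/wp-content/",
    target = "web3:/var/www/event/wp-content/",
    delay  = 1,
}
```

This runs only on the "source" server; the load-balancer half of the trick is simply a routing rule that pins admin URLs (e.g. /wp-admin.php) to that server, so all writes originate where lsyncd is watching.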
  20. WordPress Sabotage! • Soon after putting the new setup into production, the client reported issues where changes were not propagating. • Investigated; it turned out to be stale cache files from the Supercache plugin. • Cache files are generated on end-user requests. • Each server maintained its own cache, and there was no way to notify servers to invalidate.
  21. Solution! • Resorted to writing a WordPress plugin that would force lsyncd to clear caches on updates. • It would delete, create, then purge cache files on the source server, which triggers syncs to the slaves and invalidates their caches. • The deletes were selective.
  22. function create_and_delete_cache( $post_id ) {
          …
          // Delete and recreate a marker file so lsyncd sees a change to propagate.
          unlink( $cache_path . "new_post" );
          touch( $cache_path . "new_post" );
          // Selectively purge this post's Supercache files on the source server.
          prune_super_cache( $dir . $permalink, $force = true );
          …
      }
      // Hook the purge into every action that changes content.
      add_action( 'publish_post', 'create_and_delete_cache', 0 );
      add_action( 'edit_post', 'create_and_delete_cache', 0 );
      add_action( 'delete_post', 'create_and_delete_cache', 0 );
      add_action( 'publish_phone', 'create_and_delete_cache', 0 );
  23. lsyncd would detect the changes and cause the other servers in the cluster to flush their caches.
  24. The Event - Round 1 • Sunday 7pm. • As the event kicks off, traffic spikes as expected. • Servers held their ground. Load was lower than at the last event. • Syncing was working; cache was being invalidated on updates. • Our solution was working!
  25. The Event - Round 2 • Entering hour 2, alerts go off as a surge of traffic hits. • It was him. "Hopefully I win tonight!" • 2.57 Bieberbits (180Mbits).
  26. Victorious • Bieber ended up winning 3 awards that night. • The biggest spike registered at 3.71 Bieberbits (260Mbits). • The servers didn't break a sweat.
  27. Lessons Learned • Don't use shared storage. • Syncing and decentralization ftw! • lsyncd with some load balancer tricks, plus a lot of caching, was an effective strategy. • Could easily scale to 30+ servers. • The database was not really a concern.