Large Scale Data Service as a Service (RICON East 2013)

Presented by Brian Akins at RICON East 2013

Turner Broadcasting hosts several large sites that need to serve "data" to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.

About Brian

Brian Akins is Senior Principal Architect at Turner Broadcasting, where he focuses primarily on scaling web applications and the organizations that build and support them. He is an old-school C guy who has fallen in love with Lua. He lives with his wife and four children in the suburbs of Atlanta.

Basho Technologies

May 13, 2013

Transcript

  1. Large Scale Data Service as a Service

  2. Disclaimer • All opinions stated are those of the presenter

     and do not necessarily reflect those of Turner or any of its affiliates or partners.
  3. About @bakins • Horrible public speaker • Born and raised

     in Alabama • Husband and father of four • Senior Principal Architect, Turner Broadcasting • “Large” web site and systems operations • C and Lua hacker • Learning Go, Erlang, and MIG welding
  4. About Turner MPTO • Media Platform Technology & Operations •

    Turner’s “web people” – CNN, NCAA, NBA, Adultswim, Cartoon Network, Money, iReport, etc. • Five engineering groups • One operations group
  5. The Problem • Lots of small, frequently updated “data files”

    • XML, JSON, CSV, etc • Every site implemented something slightly different
  6. CNN - Election 2008 • election.cnn.com • Simple reverse proxy-cache

    • One data center devoted to it • “moved” to www.cnn.com after event
  7. data.nba.com • scoreboards, stats, etc • XML, JSON, “.dat” etc

    • Simple Apache cluster • NFS stale filehandle issues • Fronted with HTTP cache
  8. An Opportunity • NCAA March Madness 2011 • Highest planned

    traffic event to date • Chance to build a scalable, generic solution
  9. Large Scale Data • “LSD” • Simple Architecture • GET/PUT

    interface • Publishing system handles multiple data centers
  10. Initial Implementation • nginx • memc-nginx-module • membase • decently

    large boxes - 24 core, 96GB RAM
  11. It Worked! • Extremely successful event • migrated data.nba.com before

    2011 playoffs • Added “fail back” to NFS
  12. “... as a service” • Managed by Chef • “data

    bag” - Simple json file • DNS Entry - CNAME to LSD • 15 minutes to get a service scaled for Elections traffic
  13. Why not websockets? • HTTP polling • Websockets were still

    “bleeding edge” • Browser support • Corporate environments • Proxies and firewalls
  14. Time for a change • membase operational issues • membase

    + couchdb merger
  15. Membase • Extremely fast • Every box is a single

    point of failure • Auto-failover is/was “scary” and error-prone • Data corruption and loss • Thought about professional support, then...
  16. CouchBase • membase front end, couchDB backend • 90% of

    our writes are updates • Reevaluate our data store choices
  17. Alternatives • Data store performance was not our major concern

    • we were blinded by the awesome performance of membase, so we overlooked some warts
  18. Alternatives • CouchDB - append-only not a good fit •

    MongoDB - previous operational issues • Relational Database - Clustering/failover • Redis - roll our own sharding • Homegrown - um, no
  19. Why Riak? • Operational Stability • It just works •

    Operationally “Simple” • Performance is good enough • Map/Reduce, 2i, and Search - future uses
  20. Meanwhile... • LSD got more complex • openresty - nginx

    + Lua • rewrite rules, jsonp, etc • Business logic • More of a Lua app now
  21. Riak Implementation • Tried HTTP first • Key issues -

    our keys were URIs • decent performance
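
The slide does not say what went wrong with URI-shaped keys over HTTP. One plausible reading is that a key such as `/scores/2013/game1.json` contains path separators, so it has to be percent-encoded before it can sit in a single segment of a Riak HTTP object path. The sketch below is in Python rather than the talk's Lua. The function name, bucket name, and key are hypothetical, and the `/buckets/<bucket>/keys/<key>` layout is assumed to be the HTTP API form in use.

```python
from urllib.parse import quote


def riak_object_path(bucket: str, key: str) -> str:
    """Build an HTTP object path with the bucket and key fully percent-encoded.

    Hypothetical helper for illustration; not taken from the LSD code base.
    """
    # safe="" makes quote() encode "/" as well, so a URI-shaped key
    # stays a single path segment instead of being split by the server.
    return "/buckets/{}/keys/{}".format(quote(bucket, safe=""), quote(key, safe=""))


if __name__ == "__main__":
    print(riak_object_path("nba", "/scores/2013/game1.json"))
```

With the protocol buffers client described on the next slide, keys travel as opaque bytes, so this escaping step would not be needed.
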
  22. Riak + PBC + Lua • Twice the performance •

    no support in nginx • we wrote our own and released it • https://github.com/bakins/lua-resty-riak
  23. Current LSD Architecture • nginx • Lua • haproxy for

    PBC loadbalancing • Riak
  24. Riak Infrastructure • Two independent clusters • Five “large-ish” nodes

    per cluster • Bucket Per Site • LevelDB • Chef
  25. 2 second cache • “protects” Riak • consistent performance •

    10+ times the performance • ngx_lua shared dictionaries • spin lock bottleneck • “sharded” shared dictionaries
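
The caching idea on this slide can be sketched as a short-lived cache split into several independently locked shards, so that requests for different keys rarely contend on the same lock. This is a Python stand-in for the ngx_lua shared dictionaries the talk refers to, not the actual implementation. The class name, the shard count of 8, and the CRC32-based shard choice are all assumptions made for illustration; only the 2 second lifetime comes from the slide.

```python
import threading
import time
import zlib


class ShardedTTLCache:
    """Short-TTL cache split into independently locked shards (illustrative sketch)."""

    def __init__(self, shards=8, ttl=2.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        # Each shard has its own dict and lock, analogous to using several
        # shared dictionaries instead of one.
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key):
        # crc32 gives a stable shard choice for a given key.
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def get(self, key, loader):
        data, lock = self._shard(key)
        now = self._clock()
        with lock:
            entry = data.get(key)
            if entry is not None and entry[1] > now:
                return entry[0]  # fresh hit: the backing store is not touched
        # Miss or expired: fetch outside the lock so a slow backend call does
        # not block other keys in the same shard. Concurrent misses for one
        # key may each call the loader.
        value = loader(key)
        with lock:
            data[key] = (value, now + self._ttl)
        return value


if __name__ == "__main__":
    cache = ShardedTTLCache()
    fetch = lambda key: "payload for " + key  # stands in for a Riak read
    print(cache.get("/scores/game1.json", fetch))
```

Under this scheme a hot key reaches the backing store roughly once per TTL window per cache instance, which is the sense in which a short cache "protects" the data store.
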
  26. Haproxy • Works well, no need to replicate inside nginx

    • PBC - TCP load balance. Healthcheck HTTP and port ping • cache made any performance difference negligible here • well instrumented and supported
  27. Recent Events • 2012 Presidential Elections • CNN Homepage Video

    - Breaking News • 2013 March Madness • 2013 NBA All-Star and Playoffs
  28. The Future • Revisit websockets • Testing Riak multi-datacenter replication

    • offer “canned” 2i/MR queries • Redis as cache • Riak for other projects
  29. The Verdict • Riak - it just works (mostly) •

    We “take it for granted” • We’re hiring. Work remote.