Large Scale Data Service as a Service

Turner Broadcasting hosts several large sites that need to serve "data" to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.

Brian Akins

May 13, 2013


Transcript

  1. Disclaimer • All opinions stated are those of the presenter and do not necessarily reflect those of Turner or any of its affiliates or partners. Monday, May 13, 13
  2. About @bakins • Horrible public speaker • Born and raised in Alabama • Husband and father of four • Senior Principal Architect, Turner Broadcasting • “Large” web site and systems operations • C and Lua hacker • Learning Go, Erlang, and MIG welding
  3. About Turner MPTO • Media Platform Technology & Operations • Turner’s “web people” – CNN, NCAA, NBA, Adult Swim, Cartoon Network, Money, iReport, etc • Five engineering groups • One operations group
  4. The Problem • Lots of small, frequently updated “data files” • XML, JSON, CSV, etc • Every site implemented something slightly different
  5. CNN - Election 2008 • election.cnn.com • Simple reverse proxy-cache • One data center devoted to it • “moved” to www.cnn.com after event
  6. data.nba.com • scoreboards, stats, etc • XML, JSON, “.dat” etc • Simple Apache cluster • NFS stale filehandle issues • Fronted with HTTP cache
  7. An Opportunity • NCAA March Madness 2011 • Highest planned traffic event to date • Chance to build a scalable, generic solution
  8. Large Scale Data • “LSD” • Simple Architecture • GET/PUT interface • Publishing system handles multiple data centers
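A minimal Python sketch of the GET/PUT model on this slide, assuming the publishing system simply writes each object to every data center while readers fetch from their local one. The names `DataCenter` and `publish` are hypothetical and the in-memory dictionary stands in for the real store; the talk does not describe the actual publishing code.

```python
class DataCenter:
    """Hypothetical stand-in for one data center's copy of the store."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, path, body):
        # PUT replaces whatever was stored at this path.
        self._objects[path] = body

    def get(self, path):
        # GET returns the stored bytes, or None when the path is unknown.
        return self._objects.get(path)


def publish(data_centers, path, body):
    """Write one object to every data center (assumed fan-out behaviour)."""
    for dc in data_centers:
        dc.put(path, body)
```

Keeping the interface down to two verbs is what lets the storage backend change later without touching clients.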
  9. Initial Implementation • nginx • memc-nginx-module • membase • decently large boxes - 24 core, 96GB RAM
  10. It Worked! • Extremely successful event • migrated data.nba.com before 2011 playoffs • Added “fail back” to NFS
  11. “... as a service” • Managed by Chef • “data bag” - Simple JSON file • DNS Entry - CNAME to LSD • 15 minutes to get a service scaled for Elections traffic
  12. Why not websockets? • HTTP polling • Websockets were still “bleeding edge” • Browser support • Corporate environments • Proxies and firewalls
  13. Membase • Extremely fast • Every box is a single point of failure • Auto-failover is/was “scary” and error-prone • Data corruption and loss • Thought about professional support, then...
  14. Couchbase • membase front end, CouchDB backend • 90% of our writes are updates • Reevaluate our data store choices
  15. Alternatives • Data store performance was not our major concern • we were blinded by the awesome performance of membase, so we overlooked some warts
  16. Alternatives • CouchDB - append-only not a good fit • MongoDB - previous operational issues • Relational Database - Clustering/failover • Redis - roll our own sharding • Homegrown - um, no
  17. Why Riak? • Operational Stability • It just works • Operationally “Simple” • Performance is good enough • Map/Reduce, 2i, and Search - future uses
  18. Meanwhile... • LSD got more complex • openresty - nginx + Lua • rewrite rules, jsonp, etc • Business logic • More of a Lua app now
  19. Riak Implementation • Tried HTTP first • Key issues - our keys were URIs • decent performance
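The slide does not say what the key issues were. One plausible reading, offered here only as an assumption, is that a key which is itself a URI contains slashes and other reserved characters, so it has to be percent-encoded before it can sit in a single segment of an HTTP path. The sketch below uses Python's standard `urllib.parse.quote`; the helper name `riak_object_path` is hypothetical, and the `/buckets/<bucket>/keys/<key>` layout is assumed to match the Riak HTTP API of that era.

```python
from urllib.parse import quote, unquote


def riak_object_path(bucket, key):
    """Build an HTTP path for a key that may itself contain slashes.

    Hypothetical helper; the path layout is an assumption about the
    Riak HTTP interface, not something stated in the talk.
    """
    # safe="" forces "/" to be encoded as %2F so the key stays one path segment.
    return "/buckets/{}/keys/{}".format(quote(bucket, safe=""), quote(key, safe=""))


def key_from_segment(segment):
    """Recover the original URI-style key from an encoded path segment."""
    return unquote(segment)
```

A binary protocol that carries the key as an opaque byte string avoids this encoding step entirely.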
  20. Riak + PBC + Lua • Twice the performance • no support in nginx • we wrote our own and released it • https://github.com/bakins/lua-resty-riak
  21. Current LSD Architecture • nginx • Lua • haproxy for PBC load balancing • Riak
  22. Riak Infrastructure • Two independent clusters • Five “large-ish” nodes per cluster • Bucket Per Site • LevelDB • Chef
  23. 2 second cache • “protects” Riak • consistent performance • 10+ times the performance • ngx_lua shared dictionaries • spin lock bottleneck • “sharded” shared dictionaries
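The idea behind the “sharded” dictionaries can be sketched in Python, assuming the goal is to spread lock contention across several smaller caches rather than one shared one. This is an illustration only: the real service uses ngx_lua shared dictionaries, and the class name, shard count, and CRC32 shard selection below are all assumptions.

```python
import threading
import time
import zlib


class ShardedTTLCache:
    """Short-lived cache split into shards, each guarded by its own lock."""

    def __init__(self, shards=8, ttl=2.0, clock=time.monotonic):
        self._ttl = ttl          # seconds an entry stays valid (2 s on the slide)
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key):
        # CRC32 of the key picks the shard; two keys contend only if they share one.
        return self._shards[zlib.crc32(key.encode("utf-8")) % len(self._shards)]

    def get(self, key):
        data, lock = self._shard(key)
        with lock:
            entry = data.get(key)
            if entry is None:
                return None
            value, expires = entry
            if self._clock() >= expires:
                del data[key]    # expired: drop it and report a miss
                return None
            return value

    def set(self, key, value):
        data, lock = self._shard(key)
        with lock:
            data[key] = (value, self._clock() + self._ttl)


def fetch(cache, key, load):
    """Return the cached value, calling load(key) only on a miss."""
    value = cache.get(key)
    if value is None:
        value = load(key)
        cache.set(key, value)
    return value
```

With a 2 second lifetime, each cache can call the backing store at most about once every two seconds per key, however many clients are polling.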
  24. Haproxy • Works well, no need to replicate inside nginx • PBC - TCP load balance. Healthcheck HTTP and port ping • cache made any performance difference negligible here • well instrumented and supported
  25. Recent Events • 2012 Presidential Elections • CNN Homepage Video - Breaking News • 2013 March Madness • 2013 NBA All-Star and Playoffs
  26. The Future • Revisit websockets • Testing Riak multi-datacenter replication • offer “canned” 2i/MR queries • Redis as cache • Riak for other projects
  27. The Verdict • Riak - it just works (mostly) • We “take it for granted” • We’re hiring. Work remote.