Large Scale Data Service as a Service (RICON East 2013)

Presented by Brian Akins at RICON East 2013

Turner Broadcasting hosts several large sites that need to serve "data" to millions of clients over HTTP. A couple of years ago, we started building a generic service to solve this and to retire several legacy systems. We will discuss the general architecture, the growing pains, and why we decided to use Riak. We will also share some implementation details and the use of the service for a few large internet events.

About Brian

Brian Akins is Senior Principal Architect at Turner Broadcasting, where he focuses primarily on scaling web applications and the organizations that build and support them. He is an old-school C guy who has fallen in love with Lua. He lives with his wife and four children in the suburbs of Atlanta.

Basho Technologies

May 13, 2013

Transcript

  1. Large Scale Data
    Service as a Service

  2. Disclaimer
    • All opinions stated are those of the presenter and do not
    necessarily reflect those of Turner or any of its affiliates or
    partners.

  3. About @bakins
    • Horrible public speaker
    • Born and raised in Alabama
    • Husband and father of four
    • Senior Principal Architect, Turner Broadcasting
    • “Large” web site and systems operations
    • C and Lua hacker
    • Learning Go, Erlang, and MIG welding

  4. About Turner MPTO
    • Media Platform Technology & Operations
    • Turner’s “web people”
    – CNN, NCAA, NBA, Adultswim, Cartoon Network, Money, iReport,
    etc
    • Five engineering groups
    • One operations group

  5. The Problem
    • Lots of small, frequently updated “data files”
    • XML, JSON, CSV, etc
    • Every site implemented something slightly
    different

  6. CNN - Election 2008
    • election.cnn.com
    • Simple reverse proxy-cache
    • One data center devoted to it
    • “moved” to www.cnn.com after event

  7. data.nba.com
    • scoreboards, stats, etc
    • XML, JSON, “.dat” etc
    • Simple Apache cluster
    • NFS stale filehandle issues
    • Fronted with HTTP cache

  8. An Opportunity
    • NCAA March Madness 2011
    • Highest planned traffic event to date
    • Chance to build a scalable, generic solution

  9. Large Scale Data
    • “LSD”
    • Simple Architecture
    • GET/PUT interface
    • Publishing system handles multiple data
    centers
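
The GET/PUT interface and the multi-data-center publishing step can be sketched roughly as follows. This is an illustration only, not the LSD code: `DataCenterStore` and `publish` are hypothetical names, the store is a plain in-memory dictionary, and the idea that the publisher simply writes to every data center in turn is an assumption.

```python
class DataCenterStore:
    """Hypothetical stand-in for one data center's GET/PUT endpoint."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put(self, key, body):
        # PUT: create or overwrite the object stored under this key.
        self._objects[key] = body

    def get(self, key):
        # GET: return the stored body, or None if the key is unknown.
        return self._objects.get(key)


def publish(stores, key, body):
    """Assumed publishing behaviour: write the same object everywhere."""
    for store in stores:
        store.put(key, body)


stores = [DataCenterStore("dc-a"), DataCenterStore("dc-b")]
publish(stores, "/scores/today.json", '{"home": 3}')
```

Clients then only ever issue GETs against whichever data center they reach; the publisher is the only writer.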

  10. Initial Implementation
    • nginx
    • memc-nginx-module
    • membase
    • decently large boxes - 24 core, 96GB RAM

  11. It Worked!
    • Extremely successful event
    • migrated data.nba.com before 2011 playoffs
    • Added “fail back” to NFS
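
One way to read the NFS "fail back" is as a read path that tries the new store first and uses the legacy files only when that fails. The sketch below assumes that reading; `read_with_fallback`, `primary`, and `fallback` are hypothetical names, and the slides do not say how the real fallback was triggered.

```python
def read_with_fallback(key, primary, fallback):
    """Try the primary store; on error or miss, use the fallback reader."""
    try:
        body = primary(key)
    except Exception:
        # Treat any primary failure as a miss (an assumption of this sketch).
        body = None
    if body is not None:
        return body
    return fallback(key)


def broken_primary(key):
    raise ConnectionError("primary store unavailable")


legacy_files = {"/scores/today.json": "from-nfs"}
result = read_with_fallback("/scores/today.json", broken_primary, legacy_files.get)
```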

  12. “... as a service”
    • Managed by Chef
    • “data bag” - Simple json file
    • DNS Entry - CNAME to LSD
    • 15 minutes to get a service scaled for
    Elections traffic

  13. Why not websockets?
    • HTTP polling
    • Websockets were still “bleeding edge”
    • Browser support
    • Corporate environments
    • Proxies and firewalls
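
For contrast with websockets, the HTTP polling model amounts to a loop like the one below. It is a minimal sketch: `poll`, `fetch`, and `on_change` are hypothetical names, the 5 second default interval is arbitrary (the slides do not give one), and `fetch` stands in for an HTTP GET so the example runs without a network.

```python
import time


def poll(fetch, on_change, interval=5.0, rounds=3, sleep=time.sleep):
    """Call fetch() `rounds` times and report each body that differs
    from the previous one."""
    last = None
    for i in range(rounds):
        body = fetch()
        if body != last:
            on_change(body)
            last = body
        if i < rounds - 1:
            sleep(interval)


# Simulated responses instead of real HTTP requests.
responses = iter(["v1", "v1", "v2"])
seen = []
poll(lambda: next(responses), seen.append, interval=0, sleep=lambda s: None)
```

Because every client repeats the same GET on a timer, the server side sees many identical requests, which is the traffic shape a short-lived cache handles well.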

  14. Time for a change
    • membase operational issues
    • membase + couchdb merger

  15. Membase
    • Extremely fast
    • Every box is a single point of failure
    • Auto-failover is/was “scary” and error-
    prone
    • Data corruption and loss
    • Thought about professional support, then...

  16. Couchbase
    • membase front end, CouchDB backend
    • 90% of our writes are updates
    • Reevaluate our data store choices

  17. Alternatives
    • Data store performance was not our major
    concern
    • we were blinded by the awesome
    performance of membase, so we
    overlooked some warts

  18. Alternatives
    • CouchDB - append-only not a good fit
    • MongoDB - previous operational issues
    • Relational Database - Clustering/failover
    • Redis - roll our own sharding
    • Homegrown - um, no

  19. Why Riak?
    • Operational Stability
    • It just works
    • Operationally “Simple”
    • Performance is good enough
    • Map/Reduce, 2i, and Search - future uses

  20. Meanwhile...
    • LSD got more complex
    • openresty - nginx + Lua
    • rewrite rules, jsonp, etc
    • Business logic
    • More of a Lua app now
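
As an example of the kind of logic that moved into the application layer, JSONP support generally means wrapping a stored JSON body in a caller-supplied function name. The sketch below shows the general technique in Python rather than the Lua used by LSD; `wrap_jsonp` is a hypothetical name and the callback validation rule is an assumption, not taken from the talk.

```python
import re

# Allow dotted JavaScript identifiers only, e.g. "cb" or "app.onData".
_CALLBACK_RE = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*")


def wrap_jsonp(json_body, callback):
    """Return the JSON body wrapped as a call to `callback`."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name: %r" % callback)
    return "%s(%s);" % (callback, json_body)


wrapped = wrap_jsonp('{"score": 3}', "app.onData")
```

Rejecting anything that is not a plain identifier keeps a request parameter from injecting arbitrary script into the response.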

  21. Riak Implementation
    • Tried HTTP first
    • Key issues - our keys were URIs
    • decent performance
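
The slides do not say exactly what the key issues were. One plausible source, assumed here, is that a URI used as a key contains slashes, which collide with the path separators in Riak's HTTP object URLs (`/buckets/<bucket>/keys/<key>`) unless the key is percent-encoded. A minimal sketch, with `riak_object_path` as a hypothetical helper name:

```python
from urllib.parse import quote


def riak_object_path(bucket, key):
    """Build an HTTP path for a Riak object, percent-encoding every
    reserved character (including '/') in the bucket and key."""
    return "/buckets/%s/keys/%s" % (quote(bucket, safe=""), quote(key, safe=""))


path = riak_object_path("nba", "/scores/today.json")
```

A binary protocol such as PBC carries the key as a plain byte string, so this escaping step would not be needed there.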

  22. Riak + PBC + Lua
    • Twice the performance
    • no support in nginx
    • we wrote our own and released it
    • https://github.com/bakins/lua-resty-riak

  23. Current LSD
    Architecture
    • nginx
    • Lua
    • haproxy for PBC loadbalancing
    • Riak

  24. Riak Infrastructure
    • Two independent clusters
    • Five “large-ish” nodes per cluster
    • Bucket Per Site
    • LevelDB
    • Chef

  25. 2 second cache
    • “protects” Riak
    • consistent performance
    • 10+ times the performance
    • ngx_lua shared dictionaries
    • spin lock bottleneck
    • “sharded” shared dictionaries
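
The idea behind a short-lived cache split across several dictionaries can be sketched as below. This is a Python illustration of the general technique, not the ngx_lua implementation: `ShardedTTLCache` is a hypothetical name, the shard count is arbitrary, and the choice of a CRC32 hash to pick a shard is an assumption. Only the 2 second lifetime comes from the slide.

```python
import threading
import time
import zlib


class ShardedTTLCache:
    """Cache with a fixed TTL, split into shards with one lock each so
    that requests for different keys rarely contend on the same lock."""

    def __init__(self, shards=8, ttl=2.0, clock=time.monotonic):
        self._ttl = ttl
        self._clock = clock
        self._shards = [({}, threading.Lock()) for _ in range(shards)]

    def _shard(self, key):
        index = zlib.crc32(key.encode("utf-8")) % len(self._shards)
        return self._shards[index]

    def get(self, key, loader):
        entries, lock = self._shard(key)
        now = self._clock()
        with lock:
            hit = entries.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]  # still fresh: the backing store is not touched
        value = loader(key)  # load outside the lock so the shard stays available
        with lock:
            entries[key] = (now + self._ttl, value)
        return value


cache = ShardedTTLCache(shards=4, ttl=2.0)
body = cache.get("/scores/today.json", lambda key: "loaded " + key)
```

With a 2 second lifetime, each cache instance asks the backing store for a given key at most about once every 2 seconds in steady state, however many clients poll it, which is how such a cache "protects" the data store.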

  26. Haproxy
    • Works well, no need to replicate inside nginx
    • PBC - TCP load balance. Healthcheck HTTP and
    port ping
    • cache made any performance difference negligible
    here
    • well instrumented and supported

  27. Recent Events
    • 2012 Presidential Elections
    • CNN Homepage Video - Breaking News
    • 2013 March Madness
    • 2013 NBA All-Star and Playoffs

  28. The Future
    • Revisit websockets
    • Testing Riak multi-datacenter replication
    • offer “canned” 2i/MR queries
    • Redis as cache
    • Riak for other projects

  29. The Verdict
    • Riak - it just works (mostly)
    • We “take it for granted”
    • We’re hiring. Work remote.
