Scaling the Web: Databases & NoSQL

This is an introduction to relational and non-relational databases and how their performance affects scaling a web application.

This is a recording of a guest lecture I gave at the University of Texas School of Information.

In this talk I address the technologies and tools Gowalla (gowalla.com) uses, including Memcache, Redis, and Cassandra.

Find more on my blog:
http://schneems.com

Richard Schneeman

November 10, 2011

Transcript

  1. Scaling the Web:
    Databases &
    NoSQL
    Richard Schneeman
    @schneems works for @Gowalla
    Wed Nov 10
    2011

  2. whoami
    • @Schneems
    • BSME with Honors from Georgia Tech
    • 5+ years of experience with Ruby & Rails
    • Work for @Gowalla
    • Rails 3.1 contributor : )
    • 3+ years of technical teaching

  3. Compounding Traffic
    ex. Wikipedia

  4. Compounding Traffic
    ex. Wikipedia

  5. Gowalla
    • 50 best websites NYTimes 2010
    • Founded 2009 @ SXSW
    • 1 million+ Users
    • Undisclosed Visitors
    • Loves/highlights/comments/stories/guides
    • Facebook/Foursquare/Twitter integration
    • iPhone/Android/web apps
    • public API

  6. Gowalla Backend
    • Ruby on Rails
    • Uses the Ruby Language
    • Rails is the Framework

  7. The Web is Data
    • Username => String
    • Birthday => Int/ Int/ Int
    • Blog Post => Text
    • Image => Binary-file/blob
    Data needs to be stored
    to be useful

  8. Gowalla Database
    • PostgreSQL
    • Relational (RDBMS)
    • Open Source
    • Competitor to MySQL
    • ACID compliant
    • Running on a Dedicated Managed Server

  9. Need for Speed
    • Throughput:
    • The number of operations per minute that
    can be performed
    • Pure Speed:
    • How long an individual operation takes.
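The difference between the two measures can be worked through with a little arithmetic. This is only a sketch: `operations_per_minute` and its `workers` parameter are hypothetical names, and the 1 ms operation time is an illustrative input, not a measurement.

```ruby
# Hypothetical helper: throughput for operations that each take
# `seconds_per_operation`, spread across `workers` running in parallel.
def operations_per_minute(seconds_per_operation, workers: 1)
  (60.0 / seconds_per_operation) * workers
end

one_worker   = operations_per_minute(0.001)              # one operation at a time
four_workers = operations_per_minute(0.001, workers: 4)  # same speed, more parallelism
```

Adding workers raises throughput without making any single operation faster, which is why the two numbers have to be tracked separately.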

  10. Potential Problems
    • Hardware
    • Slow Network
    • Slow hard-drive
    • Insufficient CPU
    • Insufficient RAM
    • Software
    • too many Reads
    • too many Writes

  11. Scaling Up versus Out
    • Scale Up:
    • More CPU, Bigger HD, More RAM etc.
    • Scale Out:
    • More machines
    • More machines
    • More machines
    • ...

  12. Scale Up
    • Bigger faster machine
    • More RAM
    • More CPU
    • Bigger ethernet bus
    • ...
    • Moore's Law
    • Diminishing returns

  13. Scale Out
    • Forget Moore's Law...
    • Add more nodes
    • Master/Slave Database
    • Sharding

  14. Master/Slave
    [Diagram: writes go to the Master DB, which copies them to four Slave DBs; reads are served by the Slave DBs]

  15. Master & Slave +/-
    • Pro
    • Increased read speed
    • Takes read load off of master
    • Allows us to Join across all tables
    • Con
    • Doesn’t buy increased write throughput
    • Single Point of Failure in Master Node
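The read/write split can be sketched with plain Ruby hashes standing in for database servers. `ReplicatedStore` is a hypothetical class, and it copies writes to the replicas immediately for simplicity; real replication usually lags behind the master.

```ruby
# Hypothetical in-memory stand-ins for a master and its read replicas.
class ReplicatedStore
  def initialize(replica_count)
    @master = {}
    @replicas = Array.new(replica_count) { {} }
    @next_replica = 0
  end

  # Every write goes to the single master, which copies it to each replica.
  def write(key, value)
    @master[key] = value
    @replicas.each { |replica| replica[key] = value }
  end

  # Reads rotate across the replicas, so the master serves no reads.
  def read(key)
    replica = @replicas[@next_replica]
    @next_replica = (@next_replica + 1) % @replicas.size
    replica[key]
  end
end

replicated = ReplicatedStore.new(4)
replicated.write("user:3", "schneems")
replicated.read("user:3")
```

Adding replicas spreads out reads, but every write still passes through the one master, which matches the cons listed above.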

  16. Sharding
    [Diagram: users in the USA, Europe, Asia, and Africa each read from and write to their own shard]

  17. Sharding +/-
    • Pro
    • Increased Write & Read throughput
    • No Single Point of failure
    • Individual features can fail
    • Con
    • Cannot Join queries between shards
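A minimal sketch of region-based sharding, assuming one shard per region as in the diagram. `ShardedStore` and the region names are hypothetical, and plain hashes stand in for separate database servers.

```ruby
# Hypothetical sharded store: each region's data lives on its own shard.
class ShardedStore
  REGIONS = %w[usa europe asia africa].freeze

  def initialize
    @shards = REGIONS.each_with_object({}) { |region, shards| shards[region] = {} }
  end

  # Writes only touch the shard that owns the region.
  def write(region, key, value)
    @shards.fetch(region)[key] = value
  end

  # Reads only see data on the shard that is asked.
  def read(region, key)
    @shards.fetch(region)[key]
  end
end

sharded = ShardedStore.new
sharded.write("europe", "user:3", "schneems")
on_europe = sharded.read("europe", "user:3") # found on the Europe shard
on_asia   = sharded.read("asia", "user:3")   # nil: the Asia shard never saw this write
```

Because each shard only holds its own slice of the data, a query that needs rows from two shards has to be stitched together in application code.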

  18. What is a Database?
    • Relational Database Management System
    (RDBMS)
    • Stores Data Using Schema
    • A.C.I.D. compliant
    • Atomic
    • Consistent
    • Isolated
    • Durable

  19. RDBMS
    • Relational
    • Matches data on common characteristics
    in data
    • Enables “Join” & “Union” queries
    • Makes data modular

  20. Relational +/-
    • Pros
    • Data is modular
    • Highly flexible data layout
    • Cons
    • Getting desired data can be tricky
    • Over modularization leads to many join
    queries
    • Trade off performance for search-ability

  21. Schema Storage
    • Blueprint for data storage
    • Break data into tables/columns/rows
    • Give data types to your data
    • Integer
    • String
    • Text
    • Boolean
    • ...

  22. Schema +/-
    • Pros
    • Regularize our data
    • Helps keep data consistent
    • Converts to programming “types” easily
    • Cons
    • Must separately manage schema
    • Adding columns & indexes to existing
    large tables can be painful & slow

  23. ACID
    • Properties that guarantee database
    transactions are processed reliably
    • Atomic
    • Consistent
    • Isolated
    • Durable

  24. ACID
    • Atomic
    • Any database Transaction is all or nothing.
    • If one part of the transaction fails it all fails
    “An Incomplete Transaction Cannot Exist”
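The all-or-nothing behavior can be sketched in plain Ruby. This is an analogy, not how a database implements transactions: the `transaction` helper and the account names are hypothetical, and changes are made on a copy that is only applied if the whole block succeeds.

```ruby
# Hypothetical helper: apply every change in the block, or none of them.
def transaction(state)
  working_copy = state.dup    # changes happen on a copy first
  yield working_copy
  state.replace(working_copy) # "commit": reached only if nothing failed
  true
rescue StandardError
  false                       # "rollback": the original state is untouched
end

accounts = { "alice" => 100, "bob" => 0 }

succeeded = transaction(accounts) do |draft|
  draft["alice"] -= 150
  raise "insufficient funds" if draft["alice"] < 0
  draft["bob"] += 150
end
# succeeded is false, and accounts still holds the original balances
```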

  25. ACID
    • Consistent
    • Any transaction will take the database
    from one consistent state to another
    “Only Consistent data is allowed to be
    written”

  26. ACID
    • Isolated
    • No transaction should be able to interfere
    with another transaction
    “the same field cannot be updated by two
    sources at the exact same time”
    a = 0
    a += 1
    a += 2
    a = ??
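The `a = ??` question can be worked through in plain Ruby. This is a thread analogy rather than database code: the first half replays, step by step, what happens when both updates read `a` before either writes; the second half uses a `Mutex` as a stand-in for whatever isolation mechanism the database provides.

```ruby
# Without isolation: both updates read `a` before either one writes it back.
a = 0
seen_by_first  = a
seen_by_second = a
a = seen_by_first + 1
a = seen_by_second + 2
without_isolation = a # 2: the "+= 1" update was lost

# With isolation: a lock lets only one update touch `a` at a time.
a = 0
lock = Mutex.new
[1, 2].map { |amount|
  Thread.new { lock.synchronize { a += amount } }
}.each(&:join)
with_isolation = a # 3, whichever update runs first
```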

  27. ACID
    • Durable
    • Once a transaction is committed it will stay
    that way
    “Save it once, read it forever”

  28. What is a Database?
    • RDBMS
    • Relational
    • Flexible
    • Has a schema
    • Most likely ACID compliant
    • Typically fast under low load or when
    optimized

  29. What is SQL?
    • Structured Query Language
    • The language databases speak
    • Based on relational algebra
    • Insert
    • Query
    • Update
    • Delete
    “SELECT Company, Country FROM Customers
    WHERE Country = 'USA' ”

  30. Why people <3 SQL
    • Relational algebra is powerful
    • SQL is proven
    • well understood
    • well documented

  31. Why people </3 SQL
    • Relational algebra is hard
    • Different databases support different SQL
    syntax
    • Yet another programming language to learn

  32. SQL != Database
    • SQL is used to talk to an RDBMS (database)
    • SQL is not an RDBMS

  33. What is NoSQL?
    Not A
    Relational
    Database

  34. Types of NoSQL
    • Distributed Systems
    • Document Store
    • Graph Database
    • Key-Value Store
    • Eventually Consistent Systems
    Mix And Match ↑

  35. Key Value Stores
    • Non Relational
    • Typically No Schema
    • Map one Key (a string) to a Value (some
    object)
    Example: Redis

  36. Key Value Example
    redis = Redis.new
    redis.set("foo", "bar")
    redis.get("foo")
    >> "bar"

  37. Key Value Example
    redis = Redis.new
    redis.set("foo", "bar")
    redis.get("foo")
    >> "bar"
    Key Value
    Key
    Value

  38. Key Value
    • Like a database that can only ever use the
    primary key (id)
    YES
    select * from users where id = '3';
    NO
    select * from users where name = 'schneems';
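The same restriction can be sketched with a Ruby `Hash` standing in for the key-value store. The data and the `ids_by_name` second key are hypothetical, for illustration only.

```ruby
# A Hash as a stand-in for a key-value store keyed by id.
users = {
  3 => { name: "schneems" },
  4 => { name: "example" }
}

# YES: lookup by key goes straight to the value.
by_id = users[3]

# NO: there is no lookup by name. Either scan every value...
by_scan = users.find { |_id, user| user[:name] == "schneems" }

# ...or maintain a second key yourself that maps name to id.
ids_by_name = { "schneems" => 3 }
by_second_key = users[ids_by_name["schneems"]]
```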

  39. NoSQL @ Gowalla
    • Redis (key-value store)
    • Store “Likes” & Analytics
    • Memcache (key-value store)
    • Cache Database results
    • Cassandra
    • (eventually consistent, with-schema, key
    value store)
    • Store “feeds” or “timelines”
    • Solr (search index)

  40. Memcache
    • Key-Value Store
    • Open Source
    • Distributed
    • In memory (ram) only
    • fast, but volatile
    • Not ACID
    • Memory object caching system

  41. Memcache Example
    memcache = Memcache.new
    memcache.set("foo", "bar")
    memcache.get("foo")
    >> "bar"

  42. Memcache
    • Can store whole objects
    memcache = Memcache.new
    user = User.where(:username => "schneems").first
    memcache.set("user:3", user)
    user_from_cache = memcache.get("user:3")
    user_from_cache == user
    >> true
    user_from_cache.username
    >> "schneems"
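Storing a whole object means turning it into bytes on the way in and back into an object on the way out. A minimal sketch of that round trip, assuming the client serializes with Ruby's built-in `Marshal`; the `User` struct and the `cache` hash are hypothetical stand-ins for a model and a memcache server.

```ruby
# Hypothetical stand-ins: a Struct for the model, a Hash for the cache server.
User = Struct.new(:id, :username)
cache = {}

user = User.new(3, "schneems")

# Going in: the object is serialized to a byte string.
cache["user:3"] = Marshal.dump(user)

# Coming out: the bytes are turned back into an equivalent object.
user_from_cache = Marshal.load(cache["user:3"])

same_values = (user_from_cache == user)             # true: Structs compare by value
same_object = user_from_cache.equal?(user)          # false: it is a copy, not the original
```

The cached copy is a snapshot, so later changes to the original object do not show up in the cache until the key is set again.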

  43. Memcache @ Gowalla
    • Cache Common Queries
    • Decreases Load on DB (postgres)
    • Enables higher throughput from DB
    • Faster response than DB
    • Users see quicker page load time

  44. What to Cache?
    • Objects that change infrequently
    • users
    • spots (places)
    • etc.
    • Expensive(ish) sql queries
    • Friend ids for users
    • User ids for people visiting spots
    • etc.

  45. Memcache Distributed
    [Diagram: three memcache nodes, A, B, and C]

  46. Memcache Distributed
    [Diagram: nodes A, B, and C, with a new node D added]
    Easily add more nodes
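One simple way a client can decide which node holds a key is to hash the key and pick a node from the list. This is a sketch under that assumption, not a description of any particular memcache client; `node_for` is a hypothetical helper.

```ruby
require "zlib"

# Hypothetical helper: hash the key and use it to pick one node from the list.
def node_for(key, nodes)
  nodes[Zlib.crc32(key) % nodes.size]
end

three_nodes = %w[A B C]
four_nodes  = %w[A B C D]

before = node_for("user:3", three_nodes)
after  = node_for("user:3", four_nodes) # may differ once D joins
```

With plain modulo hashing, adding a node moves many keys to a different node, which shows up as cache misses. Consistent hashing is a common way to limit how many keys move.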

  47. Memcache <3’s DB
    • We use them Together
    • If memcache doesn’t have a value
    • Fetch from the database
    • Set the key from database
    • Hard
    • Cache Invalidation : (
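The fetch-on-miss steps above, and the invalidation problem, can be sketched with two hashes. `CachedUsers` is a hypothetical class; one hash stands in for memcache and the other for the database.

```ruby
# Hypothetical read-through cache in front of a database.
class CachedUsers
  def initialize(database)
    @database = database
    @cache = {}
  end

  # If the cache has no value, fetch from the database and set the key.
  def find(id)
    key = "user:#{id}"
    return @cache[key] if @cache.key?(key)

    @cache[key] = @database.fetch(id)
  end

  # Writes that go through here invalidate the cached copy.
  def update(id, value)
    @database[id] = value
    @cache.delete("user:#{id}")
  end
end

database = { 3 => "schneems" }
users = CachedUsers.new(database)

users.find(3)              # miss: fetched from the database, then cached
users.find(3)              # hit: served from the cache
database[3] = "richard"    # a write that bypasses the cache...
stale = users.find(3)      # ...leaves the old value in the cache
users.update(3, "richard") # a write that invalidates the key
fresh = users.find(3)
```

Every code path that changes the database has to remember to invalidate the matching key, which is what makes this the hard part.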

  48. Redis
    • Key Value Store
    • Open Source
    • Not Distributed (yet)
    • Extremely Quick
    • “Data structure server”

  49. Redis Example, again
    redis = Redis.new
    redis.set("foo", "bar")
    redis.get("foo")
    >> "bar"

  50. Redis - Has Data Types
    • Strings
    • Hashes
    • Lists
    • Sets
    • Sorted Sets

  51. Redis Example, sets
    redis = Redis.new
    redis.sadd("foo", "bar")
    redis.smembers("foo")
    >> ["bar"]
    redis.sadd("foo", "fly")
    redis.smembers("foo")
    >> ["bar", "fly"]

  52. Redis => Likeable
    • Very Fast response
    • ~ 50 queries per page view
    • ~ 1 ms per query
    • http://github.com/Gowalla/likeable
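A sketch of why sets suit "likes", using Ruby's own `Set` in place of a Redis set. This is not Likeable's actual implementation, and the key names are hypothetical.

```ruby
require "set"

# One set of user keys per liked item; missing keys start as empty sets.
likes = Hash.new { |hash, key| hash[key] = Set.new }

likes["spot:42:likes"].add("user:3") # comparable to SADD
likes["spot:42:likes"].add("user:7")
likes["spot:42:likes"].add("user:3") # liking twice changes nothing

like_count = likes["spot:42:likes"].size               # comparable to SCARD
liked_by_7 = likes["spot:42:likes"].include?("user:7") # comparable to SISMEMBER
```

Counting likes and checking whether one user liked something are both single lookups by key, which fits a page that issues many small queries.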

  53. Cassandra
    • Open Source
    • Distributed
    • Key Value Store
    • Eventually Consistent
    • Sort of not ACID
    • Uses A Schema
    • ColumnFamilies

  54. Cassandra Distributed
    Eventual Consistency
    [Diagram: nodes A, B, C, and D; data comes in to one node and is copied to extra nodes ... eventually]
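Eventual consistency can be sketched as writes that land on one node right away and reach the others later. This is a toy model, not Cassandra's replication protocol; `EventuallyConsistentCluster` and `replicate!` are hypothetical names.

```ruby
# Hypothetical cluster: a write lands on one node and is queued for the rest.
class EventuallyConsistentCluster
  def initialize(node_names)
    @nodes = node_names.each_with_object({}) { |name, nodes| nodes[name] = {} }
    @pending = []
  end

  # Data in: stored on one node now, copied to the others later.
  def write(node, key, value)
    @nodes.fetch(node)[key] = value
    (@nodes.keys - [node]).each { |other| @pending << [other, key, value] }
  end

  def read(node, key)
    @nodes.fetch(node)[key]
  end

  # Stand-in for background replication catching up.
  def replicate!
    @pending.each { |node, key, value| @nodes.fetch(node)[key] = value }
    @pending.clear
  end
end

cluster = EventuallyConsistentCluster.new(%w[A B C D])
cluster.write("A", "feed:3", "checked in")

before_sync = cluster.read("D", "feed:3") # nil: D has not received the copy yet
cluster.replicate!
after_sync = cluster.read("D", "feed:3")  # now every node agrees
```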

  55. Cassandra @ Gowalla
    Activity Feeds

  56. Cassandra @ Gowalla
    • Chronologic
    • http://github.com/Gowalla/chronologic

  57. Should I use
    NoSQL?

  58. Pick the
    right tool

  59. Tradeoffs
    • Every Data store has them
    • Know your data store
    • Strengths
    • Weaknesses

  60. NoSQL vs. RDBMS
    • No Magic Bullet
    • Use Both!!!
    • Model data in a datastore you understand
    • Switch when/if you need to
    • Understand Your Options

  61. Questions?
    Richard Schneeman
    @schneems works for @Gowalla
