DON'T PANIC: Large scale web development

Tuenti
October 19, 2013

At Tuenti, every feature we develop has to be ready to scale to millions of users as soon as its code is released to production. By using caching patterns, cache priming, dark launches and other strategies, we try to ensure that the infrastructure will cope with the load. In this talk we want to give real, non-trivial examples of code written to scale... and other cool things about scalability in general.

Transcript

  1. Jorge Lería & William Viana
    DON’T PANIC! Large scale web development
    Tuenti

  3. in general

  4. Jorge Lería
    @jorgeleria

  5. William Viana
    @vianasw

  7. WILL

  21. Category: scalability
    Level: advanced
    Language: Spanish
    Speakers: 2

  22. THIS IS NOT
    ROCKET SCIENCE

  23. BUT COULD
    BE DANGEROUS

  24. “The effect of a small mistake with that
    many users can be a disaster…”

  25. Disaster example
    Back in 2010 we had games on Tuenti
    We had a system to share sessions with third parties
    At first it wasn’t using our Backend Framework
    And we launched GreenFarm

  26. Disaster example
    (ASCII art: a cow, labelled COW)

  27. Disaster example
    (ASCII art: the same cow)
    We had a problem with concurrency
    We detected duplicate gamer IDs
    The database was corrupted
    No backup (it happened during the launch)

  28. Disaster example
    (ASCII art: the same cow, now labelled DEAD COW, with X for eyes)
    We had a problem with concurrency
    We detected duplicate gamer IDs
    The database was corrupted
    No backup (it happened during the launch)
    =
    RESET THE GAME

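The slides do not say exactly where the concurrency bug was, so the following is only a hypothetical illustration of how duplicate IDs typically appear: two requests read the last ID before either one writes the new value back. The class names are invented for the example. The fix shown is to make the read and the increment one atomic step; in a MySQL-backed system that job is usually delegated to an `AUTO_INCREMENT` column or a `UNIQUE` constraint rather than done in application code.

```python
import threading


class UnsafeIdAllocator:
    """Hypothetical read-then-write allocator: two callers can read the same value."""

    def __init__(self):
        self.last_id = 0

    def read_next(self):
        return self.last_id + 1

    def commit(self, new_id):
        self.last_id = new_id


class AtomicIdAllocator:
    """Hypothetical allocator where reading and incrementing happen under one lock."""

    def __init__(self):
        self._last_id = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            self._last_id += 1
            return self._last_id


# Replay the bad interleaving deterministically: both requests read before either writes.
unsafe = UnsafeIdAllocator()
first = unsafe.read_next()
second = unsafe.read_next()
unsafe.commit(first)
unsafe.commit(second)
print(first, second)  # 1 1: two gamers, one ID

# With the atomic allocator, 8 threads x 1000 allocations never collide.
atomic = AtomicIdAllocator()
ids = []


def worker():
    for _ in range(1000):
        ids.append(atomic.next_id())


threads = [threading.Thread(target=worker) for _ in range(8)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(len(ids), len(set(ids)))  # 8000 8000
```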
  30. a.- A hat
    b.- A boa constrictor digesting an elephant
    c.- Something much more frightening

  33. Tuenti... do you still use that?

  34. Three days ago
    we were #2

  35. ~16,000,000 users

  36. ~3,200,000,000 images

  37. How do we manage all of that?

  38. With a ton of servers...
    1000+ servers
    550 frontend (PHP) servers
    300 database (MySQL) servers
    100 memcached servers
    100 image servers
    etc. (thumbnails, photo processors, chat, batch,
    dev, stats, backups, HBase)

  39. ...and a great team

  40. Technologies
    PHP
    MySQL
    memcache
    HBase
    Hadoop
    Hive
    HipHop
    nginx
    Erlang
    ...

  41. All of these technologies are free software
    (free as in freedom)

  42. In software engineering, scalability is the ability of a system,
    network, or process, to handle growing amounts of work in a
    graceful manner or its ability to be enlarged to accommodate
    that growth.
    It's all about handling growth

  43. What is NOT scalability
    Scalability ≠ Speed
    Scalability ≠ Throughput
    (Sometimes slower code is better for scalability)

  45. social network != other sites

  46. Master/slave architecture
    Schema divided into logical blocks
    One cluster per data type
    (diagram: Photos and Notifications clusters)

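One way to read this slide: every data type (photos, notifications, ...) gets its own master/slave cluster, writes go to the master and reads can be spread over the slaves. The sketch below is a hypothetical router under that assumption; the host names and the `pick_host` function are invented for illustration and are not part of Tuenti's Backend Framework.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Cluster:
    master: str
    replicas: list[str] = field(default_factory=list)


# Hypothetical topology: one cluster per data type, as on the slide.
CLUSTERS = {
    "photos": Cluster("photos-master", ["photos-slave-1", "photos-slave-2"]),
    "notifications": Cluster("notifications-master", ["notifications-slave-1"]),
}


def pick_host(data_type: str, write: bool) -> str:
    """Return the database host that should serve this query."""
    cluster = CLUSTERS[data_type]  # the data type selects the cluster
    if write or not cluster.replicas:
        return cluster.master  # all writes go to the master
    return random.choice(cluster.replicas)  # reads are spread over the slaves


print(pick_host("photos", write=True))  # photos-master
print(pick_host("notifications", write=False))  # notifications-slave-1
```

A caveat with this split: in a typical MySQL setup the slaves replicate asynchronously and can lag behind the master, so a read that must see a write made by the same request may need to go to the master.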
  47. Partitioning is not limited to databases:
    memcache and frontend farms are partitioned too
    (diagram: Databases)

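The slides do not say how keys are assigned to memcache servers or frontends. A common scheme, shown here only as an assumption, is consistent hashing: each server owns many points on a hash ring, and a key goes to the first point after its hash. Compared with `hash(key) % number_of_servers`, adding a server moves only the keys that land on the new server's points instead of reshuffling almost everything, which matters when every cold cache entry turns into database load. `HashRing` and the server names are hypothetical.

```python
import bisect
import hashlib


class HashRing:
    """Consistent hashing: each server owns `vnodes` points on a ring of integers."""

    def __init__(self, servers, vnodes=100):
        ring = []
        for server in servers:
            for i in range(vnodes):
                ring.append((self._hash(f"{server}#{i}"), server))
        ring.sort()
        self._points = [point for point, _ in ring]
        self._owners = [server for _, server in ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        # First ring point after the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owners[idx]


ring = HashRing(["mc1", "mc2", "mc3"])
bigger = HashRing(["mc1", "mc2", "mc3", "mc4"])
keys = [f"user:{uid}" for uid in range(10_000)]
moved = [k for k in keys if ring.server_for(k) != bigger.server_for(k)]
# Every key that moved went to the new server mc4; nothing shuffles between old servers.
print(len(moved) / len(keys), {bigger.server_for(k) for k in moved})
```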
  48. “All magic comes with a price…
    ...in complexity”
    - No longer relational (no joins or counts)
    - Denormalization

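Once photos and likes can live in different partitions, a `COUNT(*)` with a join is no longer an option, so counts have to be stored next to the data that is read. A minimal in-memory sketch of that denormalization, assuming a like count kept on the photo record; `PhotoStore` and its methods are hypothetical.

```python
from collections import defaultdict


class PhotoStore:
    """Hypothetical stand-in for partitioned storage, kept in memory for the example."""

    def __init__(self):
        self.photos = {}  # photo_id -> photo record (one partition)
        self.likes = defaultdict(set)  # photo_id -> user ids who liked it (maybe another partition)

    def add_photo(self, photo_id, owner_id):
        self.photos[photo_id] = {"owner": owner_id, "like_count": 0}

    def like(self, photo_id, user_id):
        if user_id in self.likes[photo_id]:
            return  # a repeated like must not bump the counter twice
        self.likes[photo_id].add(user_id)
        # Denormalization: the count lives on the photo record, updated on every write.
        self.photos[photo_id]["like_count"] += 1

    def like_count(self, photo_id):
        return self.photos[photo_id]["like_count"]  # one key lookup: no COUNT, no join


store = PhotoStore()
store.add_photo(1, owner_id=7)
for user in (2, 3, 2):
    store.like(1, user)
print(store.like_count(1))  # 2
```

The price in complexity is visible even here: every write path that touches likes must also update the counter, and in real partitioned storage the two writes are not in one transaction, so the counter can drift.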
  49. Thankfully, we have a very good
    Backend Framework

  50. “Backend Framework team is the team responsible for the data
    access layer in Tuenti, providing services for persistent storage
    interaction and their support.”

  51. Backend framework
    All data partitioning is transparent
    Fully integrated memcache as a storage layer
    Data structures like “Cached lists”

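The slides don't describe the API of the Backend Framework's “cached lists”, so this is only a guess at the general idea: a list whose reads go to memcache first and fall back to the database on a miss, and whose writes update the database and then the cached copy. The `CachedList` class and its parameters are hypothetical, with plain dicts standing in for memcache and MySQL.

```python
class CachedList:
    """Hypothetical list stored in a database and mirrored in a cache (dicts stand in for both)."""

    def __init__(self, cache, db, key):
        self.cache = cache
        self.db = db
        self.key = key
        self.db_reads = 0  # instrumentation for the example

    def get(self):
        items = self.cache.get(self.key)
        if items is None:  # cache miss: read the database once...
            self.db_reads += 1
            items = list(self.db.get(self.key, []))
            self.cache[self.key] = items  # ...and prime the cache for every later reader
        return list(items)  # a copy, so callers cannot mutate the cached value

    def append(self, item):
        self.db.setdefault(self.key, []).append(item)  # the database stays the source of truth
        cached = self.cache.get(self.key)
        if cached is not None:
            self.cache[self.key] = cached + [item]  # update the cached copy instead of invalidating it


db = {"friends:1": [2, 3]}
cache = {}
friends = CachedList(cache, db, "friends:1")
friends.get()
friends.get()
friends.append(4)
print(friends.get(), friends.db_reads)  # [2, 3, 4] 1
```

With a real memcache shared by hundreds of frontends, the read-modify-write in `append` is itself a race; memcached's `gets`/`cas` (check-and-set) commands exist for that case, and deleting the key instead of updating it is the simpler alternative.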
  52. Feature disabling/enabling
    “Allows you to gradually open a feature or
    disable one that is broken”

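Gradual opening is commonly done by hashing the user ID into a stable bucket and comparing it with a rollout percentage, with a global switch to turn a broken feature off. The flag table, the `popular_photos` flag name and the functions are hypothetical; the slides only describe the behaviour, not how Tuenti implemented it.

```python
import hashlib

# Hypothetical flag table; in production it would live in shared configuration.
FLAGS = {"popular_photos": {"enabled": True, "percentage": 10}}


def bucket(feature: str, user_id: int) -> int:
    """Stable bucket in [0, 100) for a (feature, user) pair."""
    digest = hashlib.sha1(f"{feature}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(feature: str, user_id: int) -> bool:
    flag = FLAGS.get(feature)
    if flag is None or not flag["enabled"]:
        return False  # unknown or switched-off features are off for everybody
    return bucket(feature, user_id) < flag["percentage"]


users = range(10_000)
at_10 = {u for u in users if is_enabled("popular_photos", u)}
FLAGS["popular_photos"]["percentage"] = 50  # open it a bit more
at_50 = {u for u in users if is_enabled("popular_photos", u)}
FLAGS["popular_photos"]["enabled"] = False  # kill switch
print(len(at_10), len(at_50), at_10 <= at_50, any(is_enabled("popular_photos", u) for u in users))
```

Including the feature name in the hash means different features are not always tried on the same unlucky 10% of users, and because buckets are stable, raising the percentage only adds users: nobody sees the feature appear and then disappear.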
  54. Dark launches
    Usually done when we want to fill a cache, before a
    feature is actually released

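A minimal sketch of a dark launch that primes a cache, assuming the new feature's expensive computation can run on real traffic while users keep seeing the old page. Every name here (`popular_photos`, `dark_launch_prime`, the cache key format) is hypothetical. The two properties that matter are that the result is never shown and that a failure in the new code cannot break the visible request.

```python
import logging

cache = {}  # stand-in for memcache


def old_home_page(user_id):
    return f"home page for {user_id}"


def popular_photos(user_id):
    """Hypothetical expensive computation behind the feature being dark-launched."""
    return [f"photo-{user_id}-{i}" for i in range(3)]


def dark_launch_prime(user_id):
    """Run the new code path only to fill the cache; its result is never rendered."""
    key = f"popular_photos:{user_id}"
    if key in cache:
        return
    try:
        cache[key] = popular_photos(user_id)
    except Exception:
        logging.exception("dark launch failed for user %s", user_id)  # log it, never surface it


def handle_request(user_id, dark_launch_enabled=True):
    page = old_home_page(user_id)  # what the user actually sees
    if dark_launch_enabled:
        # Called inline to keep the example short; a real system would rather defer it
        # (for example to an asynchronous job) so it adds no latency to the page.
        dark_launch_prime(user_id)
    return page


print(handle_request(42))  # home page for 42
print(cache["popular_photos:42"])  # ['photo-42-0', 'photo-42-1', 'photo-42-2']
```

The `dark_launch_enabled` switch would typically be driven by the same gradual feature-enabling mechanism, so the extra load can be increased step by step.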
  55. Asynchronous jobs
    Running on a queue system with throttling enabled,
    used for heavy operations.

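The slides do not say which throttling algorithm the queue system uses; a token bucket is one common choice and is shown here as an assumption. The worker drains a queue of heavy jobs but never runs more than `rate` jobs per second on average. `TokenBucket` and `run_worker` are hypothetical names, and the clock and sleep functions are injectable so the example runs without waiting.

```python
import time
from collections import deque


class TokenBucket:
    """Allows `rate` operations per second on average, with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def run_worker(queue, bucket, handle, sleep=time.sleep):
    """Drain `queue`, running at most `bucket.rate` jobs per second on average."""
    while queue:
        if bucket.try_acquire():
            handle(queue.popleft())
        else:
            sleep(1 / bucket.rate)  # wait instead of hammering the databases


class FakeClock:
    """Simulated time so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
done = []
run_worker(deque(range(6)), TokenBucket(rate=2, capacity=2, clock=clock),
           lambda job: done.append((job, clock.now)), sleep=clock.sleep)
print(done)  # [(0, 0.0), (1, 0.0), (2, 0.5), (3, 1.0), (4, 1.5), (5, 2.0)]
```

The first two jobs use the initial burst; after that the worker settles at two jobs per second, however long the queue gets.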
  56. Staging
    “Test and review a newer version before it is moved
    into production by sending users to a specific
    server where the new code is deployed”

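A hypothetical sketch of the routing step: requests from a chosen set of users are sent to the server where the new version is deployed, while everyone else keeps hitting the normal frontend farm. The host names, the user set and `frontend_for` are invented; the slides do not say how users were selected.

```python
# Hypothetical host names and user selection.
STAGING_HOST = "staging-1"
PRODUCTION_HOSTS = ["frontend-1", "frontend-2", "frontend-3"]
STAGING_USERS = {1001, 1002}  # e.g. the people reviewing the new version


def frontend_for(user_id: int) -> str:
    if user_id in STAGING_USERS:
        return STAGING_HOST  # the server with the new version deployed
    return PRODUCTION_HOSTS[user_id % len(PRODUCTION_HOSTS)]  # everybody else


print(frontend_for(1001))  # staging-1
print(frontend_for(4))  # frontend-2
```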
  58. Before we start:
    what makes a photo popular?
    how much is it going to be used?
    is it going to be slow?
    what are our constraints?

  59. So we run some numbers and decide how to
    implement an MVP

  60. 1st approach: Let’s build something quick with
    the existing backend (but that still scales)

  61. For every friend of the user...

  62. For every friend of the user...
    retrieve all of their photos
    (diagram: Photos)

  63. For every friend of the user...
    retrieve all of their photos,
    and for every photo, retrieve all the likes
    and comments
    (diagram: Photos → Likes, Comments)

  64. Of course, this doesn’t scale very well

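To see why the first approach does not scale, it helps to count backend calls. This sketch implements the fan-out from the previous slides with injected lookup functions and counts how often each is called. The popularity score (likes plus comments) and the figures of 200 friends with 100 photos each are assumptions for illustration, not Tuenti's numbers.

```python
class CountingLookup:
    """Wraps a lookup function and counts the backend calls it would make."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def naive_popular_photos(user_id, friends_of, photos_of, likes_of, comments_of):
    """1st approach: every friend, every photo, every like and comment."""
    scores = {}
    for friend in friends_of(user_id):
        for photo in photos_of(friend):
            # Assumed popularity score: number of likes plus number of comments.
            scores[photo] = len(likes_of(photo)) + len(comments_of(photo))
    return sorted(scores, key=scores.get, reverse=True)


# Assumed sizes: 200 friends with 100 photos each, no interactions yet.
lookups = [
    CountingLookup(lambda user: range(200)),  # friends_of
    CountingLookup(lambda friend: [(friend, i) for i in range(100)]),  # photos_of
    CountingLookup(lambda photo: []),  # likes_of
    CountingLookup(lambda photo: []),  # comments_of
]
naive_popular_photos(0, *lookups)
print(sum(lookup.calls for lookup in lookups))  # 40201 calls for a single page view
```

The cost grows with friends × photos per friend, and all of it is paid again on every page view.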
  65. We can do better

  66. Users with activities
    (diagram: users B, F and H)

  67. Users with activities → User events
    (diagram: users B, F and H, with events such as Status, Photo and Comment)

  68. Users with activities → User events → Likes and comments
    (diagram: users B, F and H, their Status, Photo and Comment events, plus Likes and Comments)

  69. Store the result in a cache

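The improved first approach can be sketched as: start only from friends with recent activity, look only at their photo events, fetch likes and comments just for those photos, and store the ranked result in a cache. The lookup functions are injected and hypothetical, the score is simply the number of interactions, and the one-hour TTL is an assumption; the slides give neither.

```python
import time

CACHE_TTL_SECONDS = 3600  # assumed; the slides do not give a TTL
cache = {}  # stand-in for memcache: key -> (expires_at, value)


def popular_photos(user_id, friends_with_activity, photo_events_of, interactions_of, now=time.time):
    key = f"popular_photos:{user_id}"
    hit = cache.get(key)
    if hit is not None and hit[0] > now():
        return hit[1]  # served from the cache, no backend calls

    scores = {}
    for friend in friends_with_activity(user_id):  # 1. only friends with recent activity
        for photo in photo_events_of(friend):  # 2. only their photo events
            scores[photo] = interactions_of(photo)  # 3. likes + comments for those photos
    result = sorted(scores, key=scores.get, reverse=True)
    cache[key] = (now() + CACHE_TTL_SECONDS, result)  # 4. store the result in a cache
    return result


backend_calls = []


def friends_with_activity(user_id):
    backend_calls.append(user_id)
    return ["B", "F", "H"]


events = {"B": ["p1"], "F": ["p2", "p3"], "H": []}
interactions = {"p1": 5, "p2": 1, "p3": 9}
print(popular_photos(1, friends_with_activity, events.get, interactions.get, now=lambda: 0))
print(popular_photos(1, friends_with_activity, events.get, interactions.get, now=lambda: 10))
print(len(backend_calls))  # 1: the second call was a cache hit
```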
  70. So we released it to the world, crossed our fingers, and
    opened it up little by little, keeping an eye on the graphs

  71. Well, at least that was the
    plan...

  73. After the hotfix, it kind of worked and it was
    enough to validate the product.

  75. “Don't listen to the direct reaction,
    just look at the numbers”
    Facebook guy (in yesterday’s talk)

  76. 2nd approach: The generative approach

  77. For every user, store every like and comment on a photo
    in memcache storage

  78. Users with activities → Photo events
    (diagram: users B, F and H, each with events such as “Like on Photo” and “Comment”)
    Much easier
    Smaller memory footprint

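A minimal sketch of the generative approach, under one reading of the slides: whenever a user likes or comments on a photo, the event is appended to that user's list in memcache, so building the page means reading one short list per friend instead of walking every photo. The key layout, the function names and the scoring (one point per interaction) are assumptions; a dict of lists stands in for memcache.

```python
from collections import Counter, defaultdict

# Stand-in for memcache: user_id -> list of (photo_id, kind) events that user generated.
photo_events = defaultdict(list)


def on_photo_interaction(actor_id, photo_id, kind):
    """Write path: called when a user likes or comments on a photo; one cheap append."""
    photo_events[actor_id].append((photo_id, kind))


def popular_photos(friend_ids, limit=10):
    """Read path: scan only the friends' event lists and count interactions per photo."""
    scores = Counter()
    for friend in friend_ids:
        for photo_id, _kind in photo_events.get(friend, []):
            scores[photo_id] += 1
    return [photo_id for photo_id, _ in scores.most_common(limit)]


on_photo_interaction("B", "p1", "like")
on_photo_interaction("F", "p1", "comment")
on_photo_interaction("F", "p2", "like")
on_photo_interaction("H", "p3", "comment")
on_photo_interaction("X", "p9", "like")  # not a friend, so it is ignored below
print(popular_photos(["B", "F", "H"]))  # ['p1', 'p2', 'p3']
```

The work moves from read time to write time, and only what is needed to rank photos is stored, which is one plausible source of the smaller memory footprint mentioned on the slide.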
  83. Do the dark launch, wait for the caches to fill
    up, and switch to the new implementation

  84. Profit

  85. Until the beliebers came...

  86. “It’s not enough to make it work for the average
    user, you’ll have to make it work for power users
    with power-user friends as well.”
    Socrates

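One way to keep a generative design safe for power users with power-user friends is to put hard caps on both sides: each user's event list keeps only its most recent events, and a single page view reads at most a fixed number of friends' lists. The caps, the names and the choice of which friends to read (here simply the first ones in the list) are assumptions for illustration; the slides do not say how Tuenti handled it.

```python
from collections import Counter, deque

MAX_EVENTS_PER_USER = 100  # assumed caps, to be tuned from real numbers
MAX_FRIENDS_SCANNED = 500

photo_events = {}  # stand-in for memcache: user_id -> capped list of photo ids


def on_photo_interaction(actor_id, photo_id):
    # Capped list: once it is full, appending drops the oldest event.
    events = photo_events.setdefault(actor_id, deque(maxlen=MAX_EVENTS_PER_USER))
    events.append(photo_id)


def popular_photos(friend_ids, limit=10):
    # Worst case per page view: MAX_FRIENDS_SCANNED * MAX_EVENTS_PER_USER events,
    # no matter how many friends or interactions a user has.
    scores = Counter()
    for friend in friend_ids[:MAX_FRIENDS_SCANNED]:
        scores.update(photo_events.get(friend, ()))
    return [photo_id for photo_id, _ in scores.most_common(limit)]


for i in range(1_000):  # a belieber liking a thousand photos
    on_photo_interaction("fan", f"p{i}")
print(len(photo_events["fan"]), photo_events["fan"][0])  # 100 p900
```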
  87. Any questions?
    @jorgeleria @vianasw @TuentiEng
