Elastic scaling in a (micro)service oriented architecture

Splitting an application into multiple independent services can be a good way to keep it scalable and to preserve stability and developer productivity in larger, growing teams. But splitting the codebase, creating APIs and deploying the code on some servers is not enough: your services also need to know where and how other services can be reached. Classical approaches such as hardcoding everything in every service or using a central load balancer can quickly lead to scalability and maintainability problems. In this talk I'll show how we at ResearchGate tackled this challenge. With the help of tools like Consul and haproxy we created a setup that allows us to quickly boot and shut down services. This ensures that all servers are utilized optimally and that we can react to load spikes quickly and automatically.

Bastian Hofmann

February 18, 2016

Transcript

  1. Elastic Scaling in a
    (Micro)service
    oriented
    Architecture
    @BastianHofmann

  4. Microservices

  6. Service Oriented
    Architecture

  7. Monolith

  8. http://blog.philipphauer.de/microservices-nutshell-pros-cons/
    Monolith Microservices

  9. Benefits

  10. Problems

  11. Problems

  12. Challenges

  13. Microservice

  15. Cloud Solutions

  17. Using the cloud is
    not always possible

  18. … or even desirable

  19. Doing it yourself
    creates challenges

  20. Performance

  21. Latency

  22. Stability

  23. Reliability

  24. Transparency

  25. Learning Curves

  26. Code Reuse

  27. Maintenance

  28. Elastic Scaling?

  29. How can we solve
    them?

  30. A lot of this is also
    useful for monoliths

  33. •A big monolith
    •Multiple small to medium sized services
    •Lots of shared libraries
    •Tools and utilities
    •Hadoop jobs
    •Flink jobs
    •Server Provisioning

  34. •PHP
    •Javascript
    •Java
    •Scala
    •Bash
    •Python
    •Puppet
    •Ruby
    •Go

  35. •Nginx
    •PHP-FPM
    •Glassfish
    •Jetty
    •Dropwizard
    •haproxy
    •PostgreSQL
    •MongoDB
    •Memcached
    •Infinispan
    •Solr
    •Zookeeper
    •Elasticsearch
    •Logstash
    •Kibana
    •Graphite
    •StatsD
    •RabbitMQ
    •Hortonworks Data Platform
    •HBase
    •Hive
    •Consul
    •Vault
    •CheckMK
    •Azkaban
    •ActiveMQ
    •Apache HTTPD
    •Docker
    •Kafka

  36. Several hundred
    servers

  38. Questions? Ask

  39. http://speakerdeck.com/u/bastianhofmann

  40. https://www.flickr.com/photos/npobre/2601582256/

  41. Deployment

  42. How to get the
    services on our
    servers?

  43. Diverse technology
    stacks

  44. The same for every
    service

  45. One Click
    Deployment

  47. Automation

  48. Build/Test/Release
    pipeline

  50. https://www.flickr.com/photos/[email protected]/5580348753/

  51. Base boxes

  52. Services installed in
    a sandbox

  53. https://www.docker.com/

  54. https://twitter.com/mfdii/status/697532387240996864

  55. Docker & PHP - development and
    deployment
    Szymon Skórczyński
    Thursday 16:00

  56. Availability

  57. Zero Downtime
    Deployments

  58. Server
    Server Server
    Server

  59. Stability

  60. Canary
    environments

  61. Server
    Server Server
    Server

  62. Fast rollbacks

  63. •Ansible
    •Capistrano
    •Saltstack
    •Custom
    •…

  64. Running the service

  65. How do I stop and
    start a service and
    ensure it keeps
    running?

  66. Diverse technology
    stacks

  67. The same for every
    service

  68. •Supervisord
    •Upstart
    •S6
    •Runit
    •Monit
    •Circus
    •Restartd
    •…

  69. Releases

  70. How to synchronize
    changes over
    services?

  71. API Versioning

  72. GET /v23/foo/abr
    Host: myservice.local

  73. GET /foo/abr
    Host: myservice.local
    X-Version: 23

  74. GET /foo/abr?version=23
    Host: myservice.local

  75. GET /foo/abr
    Host: myservice.local
    Accept: application/vnd.company.v23+json
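
The four versioning styles on slides 72 to 75 (path prefix, custom header, query parameter, vendor media type) could be resolved on the server roughly as below. This is a minimal Python sketch, not code from the talk: `resolve_version`, `DEFAULT_VERSION`, the precedence order and the case-sensitive header lookup are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qs

DEFAULT_VERSION = 1  # hypothetical fallback when the client sends no version


def resolve_version(path_and_query, headers):
    """Return (version, path without version prefix) for one request."""
    parts = urlsplit(path_and_query)
    path = parts.path

    # 1. Version as path prefix: GET /v23/foo/abr
    match = re.match(r"^/v(\d+)(/.*)$", path)
    if match:
        return int(match.group(1)), match.group(2)

    # 2. Custom header: X-Version: 23
    if "X-Version" in headers:
        return int(headers["X-Version"]), path

    # 3. Query parameter: GET /foo/abr?version=23
    query = parse_qs(parts.query)
    if "version" in query:
        return int(query["version"][0]), path

    # 4. Vendor media type: Accept: application/vnd.company.v23+json
    accept = headers.get("Accept", "")
    match = re.search(r"application/vnd\.company\.v(\d+)\+json", accept)
    if match:
        return int(match.group(1)), path

    return DEFAULT_VERSION, path
```

Whichever style is chosen, resolving the version in one place keeps the routing code free of version checks.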

  76. Feature Flags

  77. public function hasAccess() {
    return featureFlag()->isActive(
    FeatureFlag::TEST_ONE
    );
    }

  80. Shared database

  81. Headers

  82. GET /foo/abr
    Host: myservice.local
    X-Flag-NewFeature: 1
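
One way to combine the two flag sources named on the slides (a shared database and request headers) is sketched below in Python. The class `FeatureFlags`, its methods and the rule that a header overrides the stored value are assumptions for illustration; only the `X-Flag-` header prefix comes from the slide.

```python
class FeatureFlags:
    """Feature flag lookup with per-request header overrides."""

    def __init__(self, stored_flags):
        # stored_flags stands in for the shared database, e.g. {"NewFeature": False}
        self._stored = dict(stored_flags)

    def is_active(self, name, headers=None):
        # A header such as "X-Flag-NewFeature: 1" wins over the stored value.
        override = (headers or {}).get("X-Flag-" + name)
        if override is not None:
            return override == "1"
        return self._stored.get(name, False)

    @staticmethod
    def headers_to_forward(headers):
        # Pass overrides on to downstream services so that every service
        # handling the same request sees the same flag state.
        return {k: v for k, v in headers.items() if k.startswith("X-Flag-")}


flags = FeatureFlags({"NewFeature": False})
request_headers = {"Host": "myservice.local", "X-Flag-NewFeature": "1"}
enabled = flags.is_active("NewFeature", request_headers)
```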

  83. Configuration
    Management

  84. How do I
    synchronize
    configuration over
    services?

  85. {
    "db_user": "user",
    "db_pw": "pw",
    "serviceA": "serviceA.local:8018"
    }

  86. Config file on disk

  87. Duplication

  88. Inconsistencies

  89. Consul
    https://www.consul.io/

  90. •Consul
    •Zookeeper
    •etcd
    •…

  91. Consul
    Server
    Consul
    Server
    Consul
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server

  92. https://github.com/sensiolabs/consul-php-sdk

  93. Key/Value Store

  94. $kv->put('test/foo/bar', 'bazinga');
    $kv->get('test/foo/bar', ['raw' => true]);
    $kv->delete('test/foo/bar');
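
The SDK calls above map onto Consul's HTTP key/value endpoint (`/v1/kv/<key>`). The Python sketch below only builds the three requests without sending them; the agent address `127.0.0.1:8500` and the helper `kv_request` are assumptions for illustration.

```python
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumption: a local Consul agent on its default port


def kv_request(method, key, value=None, raw=False):
    """Build (but do not send) a request against Consul's /v1/kv endpoint."""
    url = f"{CONSUL}/v1/kv/{key}"
    if raw:
        url += "?raw"  # ask for the plain value instead of the JSON envelope
    data = value.encode("utf-8") if value is not None else None
    return urllib.request.Request(url, data=data, method=method)


put = kv_request("PUT", "test/foo/bar", "bazinga")
get = kv_request("GET", "test/foo/bar", raw=True)
delete = kv_request("DELETE", "test/foo/bar")
# urllib.request.urlopen(put) would send the request to a running agent.
```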

  95. Credentials

  96. $kv->put('test/db/pw', 'secret_pw');

  97. https://www.vaultproject.io/

  98. Cycling of
    credentials

  99. Service Discovery

  100. How does one
    service know where
    another service is?

  101. Hostname + Port

  102. Server
    Service A
    Server
    Service B
    Service C Service C

  103. Configuration

  104. $config = [
    'serviceA' => [
    '192.168.0.1:8001',
    '192.168.0.2:8001',
    ],
    'serviceB' => [
    '192.168.0.1:8002',
    ],
    'serviceC' => [
    '192.168.0.2:8003',
    ]
    ];

  105. Consul
    https://www.consul.io/

  106. Load balancing?

  107. Round robin in the
    client

  108. $config = [
    'serviceA' => [
    '192.168.0.1:8001',
    '192.168.0.2:8001',
    ],
    'serviceB' => [
    '192.168.0.1:8002',
    ],
    'serviceC' => [
    '192.168.0.2:8003',
    ]
    ];
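
Round robin in the client over a static list like the one above can be as small as the following Python sketch. The class name `RoundRobin` and its method are hypothetical; the addresses are the ones from the slide.

```python
import itertools

config = {
    "serviceA": ["192.168.0.1:8001", "192.168.0.2:8001"],
    "serviceB": ["192.168.0.1:8002"],
    "serviceC": ["192.168.0.2:8003"],
}


class RoundRobin:
    """Hands out the configured addresses of a service in turn."""

    def __init__(self, config):
        # One endless iterator per service name.
        self._cycles = {name: itertools.cycle(addrs) for name, addrs in config.items()}

    def next_address(self, service):
        return next(self._cycles[service])


balancer = RoundRobin(config)
picked = [balancer.next_address("serviceA") for _ in range(3)]
```

The list is fixed at startup, so this client keeps sending requests to an instance that has gone down.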

  109. Service/Server
    down?

  110. $config = [
    'serviceA' => [
    '192.168.0.1:8001',
    '192.168.0.2:8001',
    ],
    'serviceB' => [
    '192.168.0.1:8002',
    ],
    'serviceC' => [
    '192.168.0.2:8003',
    ]
    ];

  111. Health checks

  112. GET /health HTTP/1.1
    Host: serviceA.local
    HTTP/1.1 200 OK
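
A client or load balancer can use such a health endpoint to drop dead instances before picking a target. The Python sketch below assumes every instance answers `GET /health` with status 200 when healthy; `http_probe` and `healthy_instances` are hypothetical names, and the probe is injectable so the filtering can be tried without a network.

```python
import urllib.error
import urllib.request


def http_probe(address, timeout=1.0):
    """Return True if GET http://<address>/health answers with status 200."""
    try:
        with urllib.request.urlopen(f"http://{address}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout or a non-2xx answer all count as unhealthy.
        return False


def healthy_instances(addresses, probe=http_probe):
    """Keep only the addresses whose health check currently passes."""
    return [address for address in addresses if probe(address)]


# Usage with a fake probe that treats one server as down:
down = {"192.168.0.2:8001"}
alive = healthy_instances(
    ["192.168.0.1:8001", "192.168.0.2:8001"],
    probe=lambda address: address not in down,
)
```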

  113. Central load
    balancer

  114. Server
    Service A
    Server
    Service B
    Service C Service C
    Load balancer

  115. Scalability?

  116. Elasticity?

  117. Consul
    https://www.consul.io/

  118. Consul
    Server
    Consul
    Server
    Consul
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server
    Consul
    Agent
    Server

  119. Consul for Service
    Discovery

  120. Consul
    Agent
    Server
    Service A
    Registration
    Health check
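
Registration and health check can be sent to the local agent in one call to Consul's `/v1/agent/service/register` endpoint. The Python sketch below only builds the JSON body; the service name, port, `/health` path and the 10 second interval are illustrative assumptions, not values from the talk.

```python
import json


def registration_payload(name, port, health_path="/health", interval="10s"):
    """Body for PUT /v1/agent/service/register on the local Consul agent."""
    return {
        "Name": name,
        "Port": port,
        "Check": {
            # The agent polls this URL and marks the service as failing on errors.
            "HTTP": f"http://localhost:{port}{health_path}",
            "Interval": interval,
        },
    }


body = json.dumps(registration_payload("serviceA", 8001))
# Sending `body` with HTTP PUT to
# http://127.0.0.1:8500/v1/agent/service/register registers the service.
```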

  121. Consul API
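
Over HTTP, healthy instances of a service can be listed with `GET /v1/health/service/<name>?passing`. The Python sketch below parses such a response; `SAMPLE_RESPONSE` is a hypothetical, heavily shortened example, and `parse_instances` is a name made up for illustration.

```python
import json

# Hypothetical, shortened response of
# GET /v1/health/service/web-frontend?passing
SAMPLE_RESPONSE = json.dumps([
    {"Node": {"Address": "10.0.3.83"}, "Service": {"Address": "", "Port": 80}},
    {"Node": {"Address": "10.0.1.109"}, "Service": {"Address": "10.0.1.109", "Port": 8080}},
])


def parse_instances(response_body):
    """Turn the health endpoint response into a list of (host, port) tuples."""
    instances = []
    for entry in json.loads(response_body):
        service = entry["Service"]
        # An empty service address means "use the address of the node".
        host = service.get("Address") or entry["Node"]["Address"]
        instances.append((host, service["Port"]))
    return instances


instances = parse_instances(SAMPLE_RESPONSE)
```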

  122. DNS

  123. $ dig web-frontend.service.consul. ANY
    ; <<>> DiG 9.8.3-P1 <<>> web-frontend.service.consul. ANY
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29981
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;web-frontend.service.consul. IN ANY
    ;; ANSWER SECTION:
    web-frontend.service.consul. 0 IN A 10.0.3.83
    web-frontend.service.consul. 0 IN A 10.0.1.109

  124. Consul-Template
    https://github.com/hashicorp/consul-template

  125. Server
    Service A
    Server
    Service B
    Service C Service C
    Load balancer
    Consul
    Template
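
Consul-Template itself renders Go templates and reloads the load balancer when the service catalog changes. As a stand-in for that idea, the Python sketch below regenerates an haproxy backend section from a list of healthy instances; the function `render_backend` and the server naming scheme are assumptions for illustration.

```python
def render_backend(service, instances):
    """Render an haproxy backend section for the given (host, port) instances."""
    lines = [f"backend {service}", "    balance roundrobin"]
    for index, (host, port) in enumerate(instances, start=1):
        # "check" lets haproxy run its own health check per server as well.
        lines.append(f"    server {service}-{index} {host}:{port} check")
    return "\n".join(lines) + "\n"


haproxy_config = render_backend(
    "web-frontend", [("10.0.3.83", 80), ("10.0.1.109", 80)]
)
```

Writing the result to the haproxy configuration file and reloading haproxy whenever the instance list changes is the part that Consul-Template automates.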

  126. Single Point of
    Failure

  127. Server
    Service A
    Server
    Service B
    Service C Service C
    Load balancer
    Consul
    Template Load balancer
    Consul
    Template

  128. Monitoring

  129. How are my
    services behaving?

  130. Central Log
    Management

  131. Elasticsearch

    Kibana
    Logstash

  132. Logstash
    elasticsearch
    webserver webserver webserver
    AMQP
    log log log
    logstash logstash logstash

  134. Tracing IDs

  135. web server http service
    http service
    http service
    http service
    create
    unique
    trace_id for
    request
    user request
    trace_id
    trace_id
    trace_id
    trace_id
    log
    log
    log
    log
    log

  136. X-Trace-Id: bbr8ehb984tbab894
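
The flow from slide 135 (create a trace ID at the edge, pass it to every downstream call, put it in every log line) might look like this in Python. Only the header name comes from the slide; the function names and the use of a random UUID as the ID format are assumptions.

```python
import uuid

TRACE_HEADER = "X-Trace-Id"


def ensure_trace_id(incoming_headers):
    """Reuse the caller's trace ID, or create one for a fresh user request."""
    return incoming_headers.get(TRACE_HEADER) or uuid.uuid4().hex


def outgoing_headers(trace_id, extra=None):
    """Headers for a call to another service, carrying the same trace ID."""
    headers = dict(extra or {})
    headers[TRACE_HEADER] = trace_id
    return headers


def log_line(trace_id, message):
    """Prefix every log message so central logging can group by request."""
    return f"trace_id={trace_id} {message}"


trace_id = ensure_trace_id({"X-Trace-Id": "bbr8ehb984tbab894"})
downstream = outgoing_headers(trace_id, {"Accept": "application/json"})
```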

  137. https://www.loggly.com/

  138. https://getsentry.com/

  139. Measure everything

  140. Server metrics

  141. Application metrics

  142. StatsD + Graphite

  143. webserver webserver webserver
    statsd statsd
    statsd
    graphite
    aggregated
    UDP message
    statsd
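
Application metrics reach StatsD as small plain-text UDP datagrams such as `name:1|c` for a counter or `name:320|ms` for a timing. A minimal Python sketch follows; the metric names are made up, and `127.0.0.1:8125` assumes a StatsD daemon on its default port.

```python
import socket


def format_counter(name, value=1):
    """StatsD counter line, e.g. b'web.requests:1|c'."""
    return f"{name}:{value}|c".encode("ascii")


def format_timing(name, milliseconds):
    """StatsD timing line, e.g. b'web.response_time:320|ms'."""
    return f"{name}:{milliseconds}|ms".encode("ascii")


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: UDP neither blocks nor fails if nobody is listening."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


counter = format_counter("web.requests")
timing = format_timing("web.response_time", 320)
# send(counter) would emit the datagram to the local StatsD daemon.
```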

  144. https://www.librato.com

  145. http://www.soasta.com/

  146. Profiling

  147. XHProf

  150. Use it in production
    for a subset of
    requests

  151. newrelic.com

  152. https://tideways.io/

  153. https://blackfire.io/

  154. Toolbars

  157. Make it accessible

  158. Aggregation

  159. Handling failures

  160. What do I do when
    something breaks?

  161. Errors happen

  162. Detecting regressions

  163. Server outages

  164. Database
    overloads

  165. Bugs

  166. Service A Service B
    200 OK

  167. Service A Service B
    5xx

  168. Service A Service B
    Timeout

  169. Circuit Breakers

  170. Service A Service B
    200 OK
    Circuit
    Breaker
    Status: closed
    Error rate: 0

  171. Service A Service B
    Error
    Circuit
    Breaker
    Status: -> open
    Error rate:
    > threshold

  172. Service A Service B
    Circuit
    Breaker
    Status: -> open
    Error rate:
    > threshold

  173. Service A Service B
    Error
    Circuit
    Breaker
    Status: -> open
    Error rate:
    > threshold
    Test if still failing

  174. Service A Service B
    200 OK
    Circuit
    Breaker
    Status: -> closed
    Error rate: 0
    Test if still failing
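
The state changes on slides 170 to 174 can be sketched as a small Python class. It is a simplification, not Hystrix or Phystrix: it counts consecutive failures instead of tracking an error rate, and the names `CircuitBreaker`, `CircuitOpenError`, `failure_threshold` and `retry_after` are chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, retry_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.retry_after = retry_after  # seconds before a test call is allowed
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.retry_after:
            return "half-open"  # time to test if the service is still failing
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open, call skipped")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on too many failures, or re-open after a failed test call.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, Service A fails fast instead of waiting for timeouts, which also takes load off the struggling Service B.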

  175. https://github.com/Netflix/Hystrix

  176. https://github.com/odesk/phystrix

  177. Phystrix does not
    scale well

  178. Gracefully handling
    exceptions

  179. Component based
    frontend

  187. Degrading
    Functionality

  188. Profile Publications Publication
    Publication
    Publication
    AboutMe
    LeftColumn Image
    Menu
    Institution

  189. Profile Publications Publication
    Publication
    Publication
    AboutMe
    LeftColumn Image
    Menu
    EXCEPTION
    Institution
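
In a component tree like the one above, an exception in one component (here Institution) only has to remove that component, not the whole page. The Python sketch below illustrates the idea; the dictionary-based component format and the `render` function are assumptions, not the frontend framework used in the talk.

```python
def render(component):
    """Render a component tree; a failing component yields an empty string."""
    try:
        body = component["render"]()
    except Exception:
        # Degrade: drop this component (and its children), keep the rest.
        return ""
    children = "".join(render(child) for child in component.get("children", []))
    name = component["name"]
    return f'<div class="{name}">{body}{children}</div>'


def broken_institution():
    raise RuntimeError("institution service is down")


page = {
    "name": "Profile",
    "render": lambda: "",
    "children": [
        {"name": "AboutMe", "render": lambda: "about"},
        {"name": "Institution", "render": broken_institution},
    ],
}
html = render(page)
```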

  190. Test it

  191. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  192. Scalability

  193. How do I handle
    traffic spikes?

  194. Elasticity

  195. Service A Service B
    200 OK
    Circuit
    Breaker

  196. Service A Service B
    Circuit
    Breaker
    Service C
    Circuit
    Breaker

  197. Throttling

  198. Service A Service B
    Circuit
    Breaker
    Service C
    Circuit
    Breaker
    Only allow xx% of calls

  200. Priority

  201. Service A Service B
    Circuit
    Breaker
    Service C
    Circuit
    Breaker
    100% of calls
    10% of calls
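
Priority-based throttling of this kind can be sketched as a per-caller allowance in front of the shared service. In the Python sketch below, which caller gets 100% and which gets 10% is an assumption (the slide text does not say), and `Throttle` is a hypothetical name; the random source is injectable so the behavior can be checked deterministically.

```python
import random


class Throttle:
    """Lets through roughly the given fraction of calls."""

    def __init__(self, allowed_fraction, rng=random.random):
        self.allowed_fraction = allowed_fraction
        self._rng = rng

    def allow(self):
        # rng() is in [0, 1), so 1.0 always allows and 0.0 never does.
        return self._rng() < self.allowed_fraction


# Assumed priorities: calls from Service A matter more than calls from Service C.
throttles = {
    "serviceA": Throttle(1.0),
    "serviceC": Throttle(0.1),
}


def may_call(caller):
    return throttles[caller].allow()
```

Lowering `allowed_fraction` for low-priority callers while the target is overloaded, and raising it again once everything is ok, gives the behavior shown on the following slide.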

  202. Service A Service B
    Circuit
    Breaker
    Service C
    Circuit
    Breaker
    100% of calls
    wait until everything is ok

  203. Elasticity

  204. Service A Service B
    Circuit
    Breaker
    Service C
    Circuit
    Breaker
    Service B

  205. Development
    Environment

  206. How do I ensure a
    productive dev
    environment?

  207. Diverse technology
    stacks

  208. Diverse
    environments

  209. https://www.docker.com/

  210. Central
    DEV
    Production
    Near
    Production
    Nightly
    DEV

  211. Large scale
    refactorings

  212. Monorepo

  213. https://qafoo.com/talks/15_08_froscon_monorepos.pdf

  214. Global Code
    Search

  215. https://github.com/etsy/hound

  216. Complete Solutions

  218. https://mesosphere.github.io/marathon/

  219. Kubernetes at the Home Office
    Billie Thompson
    Friday 11:30

  220. https://www.flickr.com/photos/darkdwarf/19701555974/

  221. https://joind.in/talk/38117

  222. http://twitter.com/BastianHofmann
    http://lanyrd.com/people/BastianHofmann
    http://speakerdeck.com/u/bastianhofmann
    [email protected]
