Elastic scaling in a (micro)service oriented architecture

Splitting an application into multiple independent services can be a good way to keep it scalable and to ensure stability and developer productivity in larger, growing teams. But splitting the codebase, creating APIs and deploying the code on some servers is not enough: your services also need to know where and how other services are accessible. Classical approaches like hardcoding everything in every service or having a central load balancer can quickly lead to problems in terms of scalability and maintainability. In this talk I'll show how we at ResearchGate tackled this challenge. With the help of tools like Consul and haproxy we created a setup that allows us to quickly boot and shut down services. This ensures that all servers are utilized optimally and that we can react to load spikes quickly and automatically.
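
The idea of combining service discovery with haproxy can be illustrated with a minimal sketch. It is not the ResearchGate setup: in practice a tool such as consul-template would render the configuration from Consul's catalog, whereas here the list of instances is passed in by hand. The `Instance` class, the `render_backend` function and the addresses are hypothetical names chosen for illustration; only the `backend`, `balance roundrobin` and `server ... check` directives are standard haproxy configuration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One running copy of a service, as a discovery tool might report it (hypothetical)."""
    address: str
    port: int
    healthy: bool


def render_backend(service: str, instances: list[Instance]) -> str:
    """Render an haproxy backend section that lists only the healthy instances."""
    lines = [f"backend {service}", "    balance roundrobin"]
    # Sort so the rendered file is stable and only changes when membership changes.
    healthy = sorted(
        (i for i in instances if i.healthy),
        key=lambda i: (i.address, i.port),
    )
    for n, inst in enumerate(healthy, start=1):
        lines.append(f"    server {service}-{n} {inst.address}:{inst.port} check")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    discovered = [
        Instance("192.168.0.2", 8001, True),
        Instance("192.168.0.1", 8001, True),
        Instance("192.168.0.3", 8001, False),  # failing its health check, so left out
    ]
    print(render_backend("serviceA", discovered))
```

Re-rendering this section and reloading haproxy whenever an instance starts, stops or fails a health check is what lets services be booted and shut down without editing configuration by hand.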

Bastian Hofmann

February 18, 2016

Transcript

  1. Elastic Scaling in a (Micro)service oriented Architecture @BastianHofmann

  4. Microservices

  6. Service Oriented Architecture

  7. Monolith

  8. http://blog.philipphauer.de/microservices-nutshell-pros-cons/ Monolith Microservices

  9. Benefits

  10. Problems

  11. Problems

  12. Challenges

  13. Microservice

  15. Cloud Solutions

  17. Using the cloud is not always possible

  18. … or even desirable

  19. Doing it yourself creates challenges

  20. Performance

  21. Latency

  22. Stability

  23. Reliability

  24. Transparency

  25. Learning Curves

  26. Code Reuse

  27. Maintenance

  28. Elastic Scaling?

  29. How can we solve them

  30. A lot of this is also useful for monoliths

  33. •A big monolith •Multiple small to medium sized services •Lots

    of shared libraries •Tools and utilities •Hadoop jobs •Flink jobs •Server Provisioning
  34. •PHP •Javascript •Java •Scala •Bash •Python •Puppet •Ruby •Go

  35. •Nginx •PHP-FPM •Glassfish •Jetty •Dropwizard •haproxy •PostgreSQL •MongoDB •Memcached •Infinispan

    •Solr •Zookeeper •Elasticsearch •Logstash •Kibana •Graphite •StatsD •RabbitMQ •Hortonworks Data Platform •HBase •Hive •Consul •Vault •CheckMK •Azkaban •ActiveMQ •Apache HTTPD •Docker •Kafka
  36. Several hundred servers

  38. Questions? Ask

  39. http://speakerdeck.com/u/bastianhofmann

  40. https://www.flickr.com/photos/npobre/2601582256/

  41. Deployment

  42. How to get the services on our servers?

  43. Diverse technology stacks

  44. The same for every service

  45. One Click Deployment

  47. Automation

  48. Build/Test/Release pipeline

  50. https://www.flickr.com/photos/40987321@N02/5580348753/

  51. Base boxes

  52. Services installed in a sandbox

  53. https://www.docker.com/

  54. https://twitter.com/mfdii/status/697532387240996864

  55. Docker & PHP - development and deployment Szymon Skórczyński Thursday

    16:00
  56. Availability

  57. Zero Downtime Deployments

  58. Server Server Server Server

  59. Stability

  60. Canary environments

  61. Server Server Server Server

  62. Fast rollbacks

  63. •Ansible •Capistrano •Saltstack •Custom •….

  64. Running the service

  65. How do I stop and start a service and ensure

    it keeps running?
  66. Diverse technology stacks

  67. The same for every service

  68. •Supervisord •Upstart •S6 •Runit •Monit •Circus •Restartd •…

  69. Releases

  70. How to synchronize changes over services?

  71. API Versioning

  72. GET /v23/foo/abr Host: myservice.local

  73. GET /foo/abr Host: myservice.local X-Version: 23

  74. GET /foo/abr?version=23 Host: myservice.local

  75. GET /foo/abr Host: myservice.local Accept: application/vnd.company.v23+json

  76. Feature Flags

  77. public function hasAccess() { return featureFlag()->isActive( FeatureFlag::TEST_ONE ); }

  80. Shared database

  81. Headers

  82. GET /foo/abr Host: myservice.local X-Flag-NewFeature: 1

  83. Configuration Management

  84. How do I synchronize configuration over services?

  85. [ "db_user": "user", "db_pw": "pw", "serviceA": "serviceA.local:8018" ]

  86. Config file on disk

  87. Duplication

  88. Inconsistencies

  89. Consul https://www.consul.io/

  90. •Consul •Zookeeper •etcd •…

  91. Consul Server Consul Server Consul Server Consul Agent Server Consul Agent Server Consul Agent Server Consul Agent Server
  92. https://github.com/sensiolabs/consul-php-sdk

  93. Key/Value Store

  94. $kv->put('test/foo/bar', 'bazinga'); $kv->get('test/foo/bar', ['raw' => true]); $kv->delete('test/foo/bar');

  95. Credentials

  96. $kv->put('test/db/pw', 'secret_pw');

  97. https://www.vaultproject.io/

  98. Cycling of credentials

  99. Service Discovery

  100. How does one service know where another service is?

  101. Hostname + Port

  102. Server Service A Server Service B Service C Service C

  103. Configuration

  104. $config = [ 'serviceA' => [ '192.168.0.1:8001', '192.168.0.2:8001', ], 'serviceB'

    => [ '192.168.0.1:8002', ], 'serviceC' => [ '192.168.0.2:8003', ] ];
  105. Consul https://www.consul.io/

  106. Load balancing?

  107. Round robin in the client

  108. $config = [ 'serviceA' => [ '192.168.0.1:8001', '192.168.0.2:8001', ], 'serviceB'

    => [ '192.168.0.1:8002', ], 'serviceC' => [ '192.168.0.2:8003', ] ];
  109. Service/Server down?

  110. $config = [ 'serviceA' => [ '192.168.0.1:8001', '192.168.0.2:8001', ], 'serviceB'

    => [ '192.168.0.1:8002', ], 'serviceC' => [ '192.168.0.2:8003', ] ];
  111. Health checks

  112. GET /health HTTP/1.1 Host: serviceA.local HTTP/1.1 200 OK

  113. Central load balancer

  114. Server Service A Server Service B Service C Service C

    Load balancer
  115. Scalability?

  116. Elasticity?

  117. Consul https://www.consul.io/

  118. Consul Server Consul Server Consul Server Consul Agent Server Consul Agent Server Consul Agent Server Consul Agent Server
  119. Consul for Service Discovery

  120. Consul Agent Server Service A Registration Health check

  121. Consul API

  122. DNS

  123. admin@hashicorp: dig web-frontend.service.consul. ANY
    ; <<>> DiG 9.8.3-P1 <<>> web-frontend.service.consul. ANY
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 29981
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 0
    ;; QUESTION SECTION:
    ;web-frontend.service.consul. IN ANY
    ;; ANSWER SECTION:
    web-frontend.service.consul. 0 IN A 10.0.3.83
    web-frontend.service.consul. 0 IN A 10.0.1.109
  124. Consul-Template https://github.com/hashicorp/consul-template

  125. Server Service A Server Service B Service C Service C

    Load balancer Consul Template
  126. Single Point of Failure

  127. Server Service A Server Service B Service C Service C

    Load balancer Consul Template Load balancer Consul Template
  128. Monitoring

  129. How are my services behaving?

  130. Central Log Management

  131. Elasticsearch Kibana Logstash

  132. Logstash elasticsearch webserver webserver webserver AMQP log log log logstash

    logstash logstash
  134. Tracing IDs

  135. web server http service http service http service http service

    create unique trace_id for request user request trace_id trace_id trace_id trace_id log log log log log
  136. X-Trace-Id: bbr8ehb984tbab894

  137. https://www.loggly.com/

  138. https://getsentry.com/

  139. Measure everything

  140. Server metrics

  141. Application metrics

  142. StatsD + Graphite

  143. webserver webserver webserver statsd statsd statsd graphite aggregated UDP message

    statsd
  144. https://www.librato.com

  145. http://www.soasta.com/

  146. Profiling

  147. XHProf

  150. Use it in production for a subset of requests

  151. newrelic.com

  152. https://tideways.io/

  153. https://blackfire.io/

  154. Toolbars

  157. Make it accessible

  158. Aggregation

  159. Handling failures

  160. What do I do when something breaks?

  161. Errors happen

  162. Detecting regressions

  163. Server outages

  164. Database overloads

  165. Bugs

  166. Service A Service B 200 OK

  167. Service A Service B 5xx

  168. Service A Service B Timeout

  169. Circuit Breakers

  170. Service A Service B 200 OK Circuit Breaker Status: closed

    Error rate: 0
  171. Service A Service B Error Circuit Breaker Status: -> open

    Error rate: > threshold
  172. Service A Service B Circuit Breaker Status: -> open Error

    rate: > threshold
  173. Service A Service B Error Circuit Breaker Status: -> open

    Error rate: > threshold Test if still failing
  174. Service A Service B 200 OK Circuit Breaker Status: ->

    close Error rate: 0 Test if still failing
  175. https://github.com/Netflix/Hystrix

  176. https://github.com/odesk/phystrix

  177. Phystrix does not scale well

  178. Gracefully handling exceptions

  179. Component based frontend

  187. Degrading Functionality

  188. Profile Publications Publication Publication Publication AboutMe LeftColumn Image Menu Institution

  189. Profile Publications Publication Publication Publication AboutMe LeftColumn Image Menu EXCEPTION

    Institution
  190. Test it

  191. http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  192. Scalability

  193. How do I handle traffic spikes?

  194. Elasticity

  195. Service A Service B 200 OK Circuit Breaker

  196. Service A Service B Circuit Breaker Service C Circuit Breaker

  197. Throttling

  198. Service A Service B Circuit Breaker Service C Circuit Breaker

    Only allow xx% of calls
  200. Priority

  201. Service A Service B Circuit Breaker Service C Circuit Breaker

    100% of calls 10% of calls
  202. Service A Service B Circuit Breaker Service C Circuit Breaker

    100% of calls wait until everything is ok
  203. Elasticity

  204. Service A Service B Circuit Breaker Service C Circuit Breaker

    Service B
  205. Development Environment

  206. How do I ensure a productive dev environment?

  207. Diverse technology stacks

  208. Diverse environments

  209. https://www.docker.com/

  210. Central DEV Production Near Production Nightly DEV

  211. Large scale refactorings

  212. Monorepo

  213. https://qafoo.com/talks/15_08_froscon_monorepos.pdf

  214. Global Code Search

  215. https://github.com/etsy/hound

  216. Complete Solutions

  218. https://mesosphere.github.io/marathon/

  219. Kubernetes at the Home Office Billie Thompson Friday 11:30

  220. https://www.flickr.com/photos/darkdwarf/19701555974/

  221. https://joind.in/talk/38117

  222. http://twitter.com/BastianHofmann http://lanyrd.com/people/BastianHofmann http://speakerdeck.com/u/bastianhofmann mail@bastianhofmann.de