
How to mock a mocking bird - testing dynamic infrastructure

A presentation on the major causes of outages, how to avoid them, and how to mitigate the unavoidable ones through testing and better code quality.

Ranjib Dey

April 18, 2014

Transcript

  1. About the talk
     • Operations specific to distributed systems
     • Types and sources of failures
     • Resiliency patterns
     • Strategies for introducing testing
  2. Common causes of outages
     i. Code changes
     ii. Deployments
     iii. Dependency issues – e.g. GitHub is down
     iv. External factors
         i. Traffic spikes
         ii. Inconsistent I/O
  3. Amplifiers of outages
     • Topology
       – Zookeeper over WAN?
       – MySQL synchronous replication in a network with high latency
     • Type of service
       – Persistence layers are not latency tolerant
       – Web services and deployments
  4. Amplifiers of outages
     • Coupling
       – DB migrations and deployment
     • Code quality
       – Inefficient algorithms (sort, object allocation, mutability; see the sketch below)
       – Inefficient SQL queries
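
     To make the object-allocation and mutability point concrete, here is a small
     illustrative Ruby benchmark (not from the deck): appending with += allocates a
     new, ever-larger string on every iteration, while mutating one buffer with <<
     reuses it.

       require 'benchmark'

       lines = Array.new(50_000) { 'some log line' }

       Benchmark.bm(10) do |bm|
         # Allocates a brand-new string on every iteration
         bm.report('+= (slow)') do
           out = ''
           lines.each { |l| out += l }
         end

         # Mutates a single buffer in place; far fewer allocations
         bm.report('<< (fast)') do
           out = ''
           lines.each { |l| out << l }
         end
       end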
  5. Reference topology
     Too simplistic:
     • Not cross-region
     • Third-party dependencies
     • Operations services
  6. Fault tolerant topology
     • Not designed, but emerged
     • Internet, genes, social networks
     • Evolved in response to scale and failures
  7. Testing : Stage 1
     Assert the happy path scenario (most frequently used) works:

       Feature: zookeeper cluster provisioning
         Scenario: Bootstrapping a zookeeper cluster
           Given I have a chef server with all our cookbooks
           When I run `knife provision zk 3`
           Then I should have "3" nodes with "zk" role
  8. Testing : Stage 1
     Assert absence of known bugs (regressions):

       Feature: zookeeper cluster provisioning
         Scenario: Bootstrapping a zookeeper cluster
           Given I have a chef server with all our cookbooks
           When I run `knife provision zk 3`
           Then I should have "3" nodes with "zk" role
           And all zk nodes should have zk.cnf populated
  9. Testing : Stage 1
     I. Tools: Cucumber, aruba, rspec (step-definition sketch below)
     II. Most valuable with broken or non-deterministic tools
     III. Time consuming
     IV. Steep learning curve
     V. Limited documentation
     VI. Example works: @lordcope, @sethvargo
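
     As a rough illustration of how the custom step above could be wired up with
     aruba and rspec (this file is not from the deck; the shape of knife's JSON
     output is an assumption):

       # features/step_definitions/provision_steps.rb — illustrative sketch.
       # aruba already supplies the built-in step for: When I run `...`
       require 'aruba/cucumber'
       require 'json'

       Then(/^I should have "(\d+)" nodes with "(.*)" role$/) do |count, role|
         cmd = %(knife search node "role:#{role}" -F json)
         run_simple(cmd)
         # Assumption: knife's JSON output carries a top-level "results" count
         data = JSON.parse(output_from(cmd)[/\{.*\}/m])
         expect(data['results']).to eq(count.to_i)
       end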
  10. Testing : Stage 2

       bash 'extract_sumologic' do
         user 'root'
         cwd node[:sumologic][:rootdir]
         code <<-EOH
           [ -x collectorbin ] && collectorbin stop
           tar zxf #{node[:sumologic][:collector][:tarball]}
           chmod 755 sumocollector/collector
           cp sumocollector/tanuki/wrapperdir/wrapper sumocollector
         EOH
         if !File.exists? node[:sumologic][:rootdir]
           action :run
         else
           action :nothing
         end
       end
  11. Testing : Stage 2

       execute 'extract_sumologic' do
         user 'root'
         cwd node[:sumologic][:rootdir]
         command 'cp sumocollector/tanuki/wrapperdir wrapper'
         only_if { File.exists? node[:sumologic][:rootdir] }
       end

       it 'extracts sumologic' do
         stub_command("test -f #{node[:sumologic][:rootdir]}")
         expect(runner).to run_execute('extract_sumologic').with(
           user: 'root',
           cwd: node[:sumologic][:rootdir],
           command: 'cp sumocollector/tanuki/wrapperdir wrapper'
         )
       end
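
     Filled out into a complete ChefSpec file, the fragment above might look like
     this sketch (cookbook name, recipe name, and attribute values are assumptions;
     ChefSpec::Runner was the runner class current in 2014):

       # spec/unit/recipes/default_spec.rb — illustrative only
       require 'chefspec'

       describe 'sumologic::default' do          # assumed cookbook/recipe name
         let(:runner) do
           ChefSpec::Runner.new do |node|
             node.set[:sumologic][:rootdir] = '/opt/sumologic'
           end.converge(described_recipe)
         end

         before do
           # Intercepts shell guards so no real command is executed
           stub_command('test -f /opt/sumologic').and_return(true)
         end

         it 'extracts the collector' do
           expect(runner).to run_execute('extract_sumologic').with(
             user: 'root',
             cwd:  '/opt/sumologic'
           )
         end
       end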
  12. Testing : Stage 2

       include_recipe 'foo'

       code
       some more code
       even more code

       package 'foo' do
         action :install
       end

       template 'baz' do
         action :create
       end

       service 'bar' do
         action [:start, :enable]
       end
  13. Testing : Stage 2

       include_recipe 'foo'
       extend Foo
       value = process(node)

       package 'foo' do
         action :install
       end

       template 'baz' do
         action :create
       end

       service 'bar' do
         action [:start, :enable]
       end

       module Foo
         def process(node)
           code
           some more code
           even more code
         end
       end
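
     The payoff of this extraction is that the logic can now be unit tested with
     plain RSpec, with no Chef run at all. Since the slide's process body is
     pseudocode, the sketch below assumes, purely for illustration, that it
     computes the collector tarball path:

       # Hypothetical concrete version of the slide's module
       module Foo
         def process(node)
           ::File.join(node[:sumologic][:rootdir],
                       node[:sumologic][:collector][:tarball])
         end
       end

       # Plain RSpec: no Chef server, no converge
       require 'rspec'

       describe Foo do
         let(:helper) { Object.new.extend(Foo) }
         let(:node) do
           { sumologic: { rootdir: '/opt/sumologic',
                          collector: { tarball: 'collector.tar.gz' } } }
         end

         it 'builds the tarball path from node attributes' do
           expect(helper.process(node)).to eq('/opt/sumologic/collector.tar.gz')
         end
       end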
  14. Testing : Stage 2
      • Enforce better design
      • Consolidate repeats
      • Use appropriate language / stdlib alternatives
  15. Testing : Stage 2

       search(:node, 'roles:cassandra').partition do |other|
         node.ec2.placement_availability_zone ==
           other.ec2.placement_availability_zone
       end
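
     Search calls like this are what couple a recipe to a live Chef server;
     ChefSpec can stub them out. A sketch (the recipe name and node data are
     assumptions):

       require 'chefspec'

       describe 'cassandra::default' do   # hypothetical recipe holding the search
         let(:peer_a) do
           stub_node('cass1') { |n| n.automatic[:ec2][:placement_availability_zone] = 'us-east-1a' }
         end
         let(:peer_b) do
           stub_node('cass2') { |n| n.automatic[:ec2][:placement_availability_zone] = 'us-east-1b' }
         end
         let(:runner) { ChefSpec::Runner.new.converge(described_recipe) }

         before do
           # Return canned nodes instead of querying a real server
           stub_search(:node, 'roles:cassandra').and_return([peer_a, peer_b])
         end

         it 'converges with the search stubbed' do
           expect { runner }.to_not raise_error
         end
       end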
  16. Testing : Stage 2
      • Enforce better design
      • Consolidate repeats
      • Use appropriate language / stdlib alternatives
      • Use appropriate Chef idioms
  17. Testing : Stage 2

       # Compile-time conditional:
       include_recipe 'foo'

       if node[:foo]
         package 'foo' do
           action :install
         end
         template 'baz' do
           action :create
         end
         service 'bar' do
           action :start
         end
       end

       # Converge-time guards:
       include_recipe 'foo'

       package 'foo' do
         action :install
         only_if { node[:foo] }
       end

       template 'baz' do
         action :create
         only_if { node[:foo] }
       end

       service 'bar' do
         action :start
         only_if { node[:foo] }
       end
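
     With the guard version, the behaviour can be asserted from ChefSpec by
     flipping the attribute. A sketch (the cookbook name is assumed; ChefSpec
     evaluates block guards during its converge):

       require 'chefspec'

       describe 'foo::default' do   # hypothetical recipe with the guarded resources
         context 'when node[:foo] is set' do
           let(:runner) do
             ChefSpec::Runner.new { |node| node.set[:foo] = true }.converge(described_recipe)
           end

           it { expect(runner).to start_service('bar') }
         end

         context 'when node[:foo] is not set' do
           let(:runner) { ChefSpec::Runner.new.converge(described_recipe) }

           it { expect(runner).to_not start_service('bar') }
         end
       end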
  18. Testing : Stage 2
      • Lint & unit testing
        – Typos, syntax errors, logic – knife, ChefSpec, rubocop, foodcritic
          (Rakefile sketch below)
      • Fast
      • Easier to adopt
      • Invaluable for long-term maintainability
      • Shared conventions
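
     One convenient way to wire these tools into a single entry point is a
     Rakefile; a sketch (the task layout is a choice, not from the deck):

       # Rakefile — run every lint/unit stage with a bare `rake`
       require 'rspec/core/rake_task'

       desc 'Ruby syntax check for all cookbook files'
       task :syntax do
         Dir['**/*.rb'].each { |f| sh "ruby -c #{f}" }
       end

       desc 'Style and correctness lints'
       task :lint do
         sh 'rubocop .'
         sh 'foodcritic .'
       end

       RSpec::Core::RakeTask.new(:unit) do |t|
         t.pattern = 'spec/**/*_spec.rb'
       end

       task default: [:syntax, :lint, :unit]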
  19. Testing : Stage 3
      • Deployments
        – Dark launching (sketch below)
        – Canary releases
        – Blue-green deployment
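
     The slide names dark launching without showing code; as a purely hypothetical
     sketch, the idea is to run the new code path on live traffic, log any
     divergence, but always serve the old path's answer:

       # Hypothetical dark-launch wrapper; every class and flag name is illustrative
       def price_for(cart)
         old_result = LegacyPricing.price(cart)

         if Feature.enabled?(:new_pricing)        # hypothetical flag store
           begin
             new_result = NewPricing.price(cart)  # exercised on real traffic
             log_mismatch(cart, old_result, new_result) if new_result != old_result
           rescue => e
             log_dark_launch_error(e)             # the new path must never break users
           end
         end

         old_result   # users always get the old, trusted answer
       end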
  20. Testing : Stage 4
      • Dependency
        – Version compatibility
          • ChefSpec
          • ServerSpec (example below)
        – Hosted services
          • Degraded mode
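
     A minimal serverspec file for a converged zookeeper node might look like the
     sketch below (written against current serverspec conventions; the service and
     client port 2181 are standard zookeeper, but the spec itself is not from the
     deck):

       # spec/zookeeper_spec.rb — asserts the real, converged machine
       require 'serverspec'

       set :backend, :exec   # run the checks on the local machine

       describe package('zookeeper') do
         it { should be_installed }
       end

       describe service('zookeeper') do
         it { should be_enabled }
         it { should be_running }
       end

       describe port(2181) do
         it { should be_listening }
       end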
  21. Testing : Stage 5
      • External factors
        – Traffic patterns
          • Gatling, JMeter
        – Network isolation, latency, jitter etc.
          • iptables, tc (helper sketch below)
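
     A throwaway Ruby helper around `tc netem` can inject latency and jitter for
     the duration of a test block; a sketch (requires root, and the helper name is
     made up):

       # Hypothetical helper: add latency/jitter on an interface while a block runs
       def with_network_delay(interface: 'eth0', delay_ms: 200, jitter_ms: 50)
         system("tc qdisc add dev #{interface} root netem " \
                "delay #{delay_ms}ms #{jitter_ms}ms") or raise 'tc failed'
         yield
       ensure
         system("tc qdisc del dev #{interface} root netem")
       end

       # e.g. run the integration suite under 200ms +/- 50ms latency
       with_network_delay(delay_ms: 200, jitter_ms: 50) do
         system('rspec spec/integration')
       end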
  22. Testing : Stage 5
      • External factors
        – Resource starvation in shared environments
          • ulimit, cgroup for memory (sketch below)
          • nice, cgroup for CPU
          • cgroup blkio for I/O
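
     As a sketch of the cgroup approach to memory starvation (cgroup v1 paths,
     root required; the helper and commands are illustrative):

       # Hypothetical cgroup-v1 helper: run a block's process tree under a memory cap
       require 'fileutils'

       def with_memory_cap(name, limit_bytes)
         dir = "/sys/fs/cgroup/memory/#{name}"
         FileUtils.mkdir_p(dir)
         File.write("#{dir}/memory.limit_in_bytes", limit_bytes.to_s)
         File.write("#{dir}/tasks", Process.pid.to_s)   # current process + children
         yield
       end

       # e.g. verify a service degrades gracefully with only 256 MB available
       with_memory_cap('starvation-test', 256 * 1024 * 1024) do
         system('rspec spec/degraded')
       end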
  23. Testing : Stage 6
      • Combining failures
        – Whole environment provisioning
          • Containers, VMs etc.
          • chef-metal
        – Feedback-driven tests
          • Benchmark across services
          • Measure & enforce minimal system resources
          • Alert on rate of change
  24. Search, ssh & execute

       def knife(klass, *name_args)
         $stdout.sync = true
         klass.load_deps
         plugin = klass.new
         yield plugin.config if Kernel.block_given?
         plugin.name_args = name_args
         plugin.run
       end

       def knife_ssh(search, command, password, concurrency)
         knife Chef::Knife::Ssh, search, command do |config|
           config[:ssh_password] = password
           config[:host_key_verify] = false
           config[:concurrency] = concurrency
         end
       end
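
     Driving the helper, e.g. to restart zookeeper across the cluster (the query,
     command, and concurrency are illustrative):

       # Restart zookeeper on every node with the zk role, 5 nodes at a time
       knife_ssh('roles:zk', 'sudo service zookeeper restart', ENV['SSH_PASSWORD'], 5)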
  25. Summary
      • Accept failures
        – Make them inexpensive and isolated
      • Design matters
        – Read
        – Incremental changes
      • Communication influences design
        – Avoid knowledge silos
        – Adopt cross-team reviews