How to mock a mocking bird - testing dynamic infrastructure

A presentation on the major causes of outages, how to avoid them, and how to mitigate the unavoidable ones through testing and better code quality.

Ranjib Dey

April 18, 2014

Transcript

  1. How to mock a mocking bird - testing dynamic infrastructure

  2. About the talk

    • Operations specific to distributed systems
    • Types and sources of failures
    • Resiliency patterns
    • Strategies for introducing testing
  3. Common causes of outages

    i. Code changes
    ii. Deployments
    iii. Dependency issues
      – e.g. GitHub is down
    iv. External factors
      i. Traffic spikes
      ii. Inconsistent I/O
  4. Amplifiers of outages

    • Topology
      – Zookeeper over WAN?
      – MySQL synchronous replication in network with
    • Type of service
      – Persistence layers are not latency tolerant
      – Web services and deployments
  5. Amplifiers of outages

    • Coupling
      – DB migrations and deployment
    • Code quality
      – Inefficient algorithms (sort, object allocation, mutability)
      – Inefficient SQL queries
  6. Reference topology

    Too simplistic
    • Not cross region
    • Third party dependencies
    • Operation services
  7. Real life topology

  8. Fault tolerant topology

    • Not designed, but emerged
    • Internet, genes, social networks
    • Evolved in response to scale and failures
  9. Testing : Stage 1

    Assert happy path scenario (most frequently used) works

    Feature: zookeeper cluster provisioning
      Scenario: Bootstrapping a zookeeper cluster
        Given I have a chef server with all our cookbooks
        When I run `knife provision zk 3`
        Then I should have "3" nodes with "zk" role
  10. Testing : Stage 1

    Assert absence of known bugs (regressions)

    Feature: zookeeper cluster provisioning
      Scenario: Bootstrapping a zookeeper cluster
        Given I have a chef server with all our cookbooks
        When I run `knife provision zk 3`
        Then I should have "3" nodes with "zk" role
        And all zk nodes should have zk.cnf populated
  11. Testing : Stage 1

    I. Tools: Cucumber, Aruba, RSpec
    II. Most valuable with broken or non-deterministic tools
    III. Time consuming
    IV. Steep learning curve
    V. Limited documentation
    VI. Example works: @lordcope, @sethvargo
  12. Testing : Stage 2

    • Enforce better design

  13. Testing : Stage 2

    bash 'extract_sumologic' do
      user 'root'
      cwd node[:sumologic][:rootdir]
      code <<-EOH
        [ -x collectorbin ] && collectorbin stop
        tar zxf #{node[:sumologic][:collector][:tarball]}
        chmod 755 sumocollector/collector
        cp sumocollector/tanuki/wrapperdir/wrapper sumocollector
      EOH
      if !File.exist?(node[:sumologic][:rootdir])
        action :run
      else
        action :nothing
      end
    end
  14. Testing : Stage 2

    execute 'extract_sumologic' do
      user 'root'
      cwd node[:sumologic][:rootdir]
      command 'cp sumocollector/tanuki/wrapperdir wrapper'
      only_if { File.exist?(node[:sumologic][:rootdir]) }
    end

    stub_command("test -f #{node[:sumologic][:rootdir]}")
    expect(runner).to run_execute('extract_sumologic').with(
      user: 'root',
      cwd: node[:sumologic][:rootdir],
      command: 'cp sumocollector/tanuki/wrapperdir wrapper'
    )
  15. Testing : Stage 2

    • Enforce better design
    • Consolidate repeats
  16. Testing : Stage 2

    include_recipe 'foo'
    # code
    # some more code
    # even more code
    package 'foo' do
      action :install
    end
    template 'baz' do
      action :create
    end
    service 'bar' do
      action [:start, :enable]
    end
  17. Testing : Stage 2

    include_recipe 'foo'
    extend Foo
    value = process(node)
    package 'foo' do
      action :install
    end
    template 'baz' do
      action :create
    end
    service 'bar' do
      action [:start, :enable]
    end

    module Foo
      def process(node)
        # code
        # some more code
        # even more code
      end
    end
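Once the logic lives in a module, it can be exercised with plain Ruby, without converging a node. The sketch below illustrates that idea only: the body of `process`, the `:memory_mb` attribute, and the `FakeRecipe` class are hypothetical stand-ins, not code from the talk.

```ruby
# Hypothetical helper module. The heap-size rule is an illustrative
# assumption; the talk does not say what `process` computes.
module Foo
  # Return a heap size in MB: half of the node's memory, capped at 8192.
  def process(node)
    total_mb = node.fetch(:memory_mb)
    [total_mb / 2, 8192].min
  end
end

# Stand-in for a recipe context; a plain Hash stands in for the Chef node.
class FakeRecipe
  include Foo
end

recipe = FakeRecipe.new
puts recipe.process({ memory_mb: 4096 })
puts recipe.process({ memory_mb: 65_536 })
```

Because `process` only needs something that responds to `fetch`, the test never has to load Chef.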
  18. Testing : Stage 2

    • Enforce better design
    • Consolidate repeats
    • Use appropriate language / stdlib alternatives
  19. Testing : Stage 2

    search(:node, 'roles:cassandra').partition do |other|
      node.ec2.placement_availability_zone == other.ec2.placement_availability_zone
    end
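`partition` is Ruby's standard `Enumerable#partition`: it returns two arrays, the elements for which the block is truthy and the rest. A minimal sketch with plain structs in place of Chef search results (the `Node` struct, host names, and zones are made up for illustration):

```ruby
# Plain structs stand in for Chef node objects (hypothetical data).
Node = Struct.new(:name, :zone)

me = Node.new('cass-1', 'us-east-1a')
others = [
  Node.new('cass-2', 'us-east-1a'),
  Node.new('cass-3', 'us-east-1b'),
  Node.new('cass-4', 'us-east-1a')
]

# First array: nodes in my zone. Second array: everything else.
same_zone, other_zones = others.partition { |other| other.zone == me.zone }

puts "same zone: #{same_zone.map(&:name).join(', ')}"
puts "other zones: #{other_zones.map(&:name).join(', ')}"
```

One call replaces two separate `select`/`reject` passes or a hand-written loop that fills two arrays.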
  20. Testing : Stage 2

    • Enforce better design
    • Consolidate repeats
    • Use appropriate language / stdlib alternatives
    • Use appropriate Chef idioms
  21. Testing : Stage 2

    include_recipe 'foo'
    if node[:foo]
      package 'foo' do
        action :install
      end
      template 'baz' do
        action :create
      end
      service 'bar' do
        action :start
      end
    end

    include_recipe 'foo'
    package 'foo' do
      action :install
      only_if { node[:foo] }
    end
    template 'baz' do
      action :create
      only_if { node[:foo] }
    end
    service 'bar' do
      action :start
      only_if { node[:foo] }
    end
  22. Testing : Stage 2

    • Lint & unit testing
    • Typos, syntax errors, logic
      – knife, ChefSpec, rubocop, foodcritic
    • Fast
    • Easier to adopt
    • Invaluable for long term maintainability
    • Shared conventions
  23. Testing : Stage 3

    • Deployments
      – Dark launching
      – Canary releases
      – Blue-green deployment
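A canary release needs a stable way to pick the small group of hosts that gets the change first. The sketch below shows one possible approach, hashing host names into percentage buckets; the `canary?` helper, the host names, and the 20% figure are assumptions for illustration, not something prescribed by the talk.

```ruby
require 'zlib'

# Hypothetical helper: true when the host falls in the canary group.
# CRC32 of the host name gives a stable bucket in 0..99, so the same
# hosts are selected on every run.
def canary?(hostname, percent)
  Zlib.crc32(hostname) % 100 < percent
end

hosts = %w[web-1 web-2 web-3 web-4 web-5]
canaries, rest = hosts.partition { |host| canary?(host, 20) }

puts "canary: #{canaries.inspect}"
puts "remaining: #{rest.inspect}"
```

With a small fleet the selected share can differ noticeably from the requested percentage, since each host is bucketed independently.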
  24. Testing : Stage 4

    • Dependency
      – Version compatibility
        • ChefSpec
        • ServerSpec
      – Hosted services
        • Degraded mode
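Degraded mode means that when a hosted dependency fails or hangs, the caller falls back to something less fresh instead of failing outright. A minimal sketch using Ruby's standard `Timeout` module; the `with_fallback` helper, the two-second default, and the cached value are hypothetical.

```ruby
require 'timeout'

# Hypothetical helper: run the block against a dependency, but return
# `fallback` if it raises or takes longer than `limit` seconds.
def with_fallback(fallback, limit: 2)
  Timeout.timeout(limit) { yield }
rescue StandardError => e
  # Timeout::Error is a StandardError, so slow calls land here too.
  warn "dependency unavailable (#{e.class}); running in degraded mode"
  fallback
end

cached = ['cookbook-a 1.0.0']
result = with_fallback(cached) { raise IOError, 'hosted service is down' }
puts result.inspect
```

A test for degraded mode can then simulate the outage by passing a block that raises or sleeps, and assert that the fallback is returned.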
  25. Testing : Stage 5

    • External factors
      – Traffic patterns
        • Gatling, JMeter
      – Network isolation, latency, jitter etc.
        • iptables, tc
  26. Testing : Stage 5

    • External factors
      – Resource starvation in shared environments
        • ulimit, cgroup for memory
        • nice, cgroup for CPU
        • cgroup blkio for I/O
  27. Testing : Stage 6

    • Combining failures
      – Whole environment provisioning
        • Containers, VMs etc.
        • chef-metal
      – Feedback driven tests
        • Benchmark across services
        • Measure & enforce minimal system resources
        • Alert on rate of change
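Alerting on rate of change means watching how fast a metric moves between samples rather than its absolute value. A minimal sketch; the `Sample` struct, the `rate_alerts` helper, the sample data, and the threshold are all made up for illustration.

```ruby
# One measurement: a timestamp in seconds and the metric value.
Sample = Struct.new(:time, :value)

# Hypothetical helper: return the timestamps at which the metric changed
# faster than `max_rate` units per second compared to the previous sample.
def rate_alerts(samples, max_rate)
  samples.each_cons(2).select do |prev, cur|
    rate = (cur.value - prev.value) / (cur.time - prev.time).to_f
    rate.abs > max_rate
  end.map { |_prev, cur| cur.time }
end

samples = [Sample.new(0, 10), Sample.new(60, 12), Sample.new(120, 100)]
puts rate_alerts(samples, 0.5).inspect
```

Here the jump from 12 to 100 within 60 seconds (about 1.47 units per second) exceeds the 0.5 threshold, while the earlier step from 10 to 12 does not.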
  28. Testing : Stage 7

    • Combining failures
      – Message passing/orchestration
        • mco
        • Ansible
        • knife-ssh
        • serf
  29. Search, ssh & execute

    def knife(klass, *name_args)
      $stdout.sync = true
      klass.load_deps
      plugin = klass.new
      yield plugin.config if block_given?
      plugin.name_args = name_args
      plugin.run
    end

    def knife_ssh(search, command, password, concurrency)
      knife Chef::Knife::Ssh, search, command do |config|
        config[:ssh_password] = password
        config[:host_key_verify] = false
        config[:concurrency] = concurrency
      end
    end
  30. Summary

    • Accept failures
      – Make them inexpensive, isolated
    • Design matters
      – Read
      – Incremental changes
    • Communication influences design
      – Avoid knowledge silos
      – Adopt cross team reviews
  31. Thank you ranjib@pagerduty.com @RanjibDey