How to mock a mocking bird - testing dynamic infrastructure

A presentation on the major causes of outages, how to avoid them, and how to mitigate the unavoidable ones through testing and better code quality.

Ranjib Dey

April 18, 2014

Transcript

  1. How to mock a mocking bird
    testing dynamic infrastructure

  2. About the talk

    Operations specific to
    distributed systems

    Types and sources of
    failures

    Resiliency patterns

    Strategies for introducing
    testing

  3. Common causes for outages
    i. Code changes
    ii. Deployments
    iii. Dependency issues
    – e.g. GitHub is down
    iv. External factors
    – Traffic spikes
    – Inconsistent I/O

  4. Amplifiers of outages

    Topology
    – Zookeeper over WAN?
    – MySQL Synchronous replication in network
    with

    Type of service
    – Persistence layers are not latency tolerant
    – Web services and deployments

  5. Amplifiers of outages

    Coupling
    – DB migrations and deployment

    Code quality
    – Inefficient algorithms (sort, object allocation, mutability)
    – Inefficient SQL queries

  6. Reference topology
    Too simplistic

    Not cross
    region

    Third party
    dependencies

    Operation
    services

  7. Real life topology

  8. Fault tolerant topology

    Not designed, but
    emerged

    Internet, genes,
    social networks

    Evolved in
    response to scale
    and failures

  9. Testing : Stage 1
    Assert that the happy-path scenario (the most frequently used one) works
    Feature: zookeeper cluster provisioning
      Scenario: Bootstrapping a zookeeper cluster
        Given I have a chef server with all our cookbooks
        When I run `knife provision zk 3`
        Then I should have "3" nodes with "zk" role

  10. Testing : Stage 1
    Assert absence of known bugs (regressions)
    Feature: zookeeper cluster provisioning
      Scenario: Bootstrapping a zookeeper cluster
        Given I have a chef server with all our cookbooks
        When I run `knife provision zk 3`
        Then I should have "3" nodes with "zk" role
        And all zk nodes should have zk.cnf populated

  11. Testing : Stage 1
    I. Tools: Cucumber, Aruba, RSpec
    II. Most valuable with broken or non-deterministic tools
    III. Time consuming
    IV. Steep learning curve
    V. Limited documentation
    VI. Example works: @lordcope, @sethvargo

  12. Testing : Stage 2

    Enforce better design

  13. Testing : Stage 2
    bash 'extract_sumologic' do
      user 'root'
      cwd node[:sumologic][:rootdir]
      code <<-EOH
        [ -x collectorbin ] && collectorbin stop
        tar zxf #{node[:sumologic][:collector][:tarball]}
        chmod 755 sumocollector/collector
        cp sumocollector/tanuki/wrapperdir/wrapper sumocollector
      EOH
      if !File.exist? node[:sumologic][:rootdir]
        action :run
      else
        action :nothing
      end
    end

  14. Testing : Stage 2
    # Recipe
    execute 'extract_sumologic' do
      user 'root'
      cwd node[:sumologic][:rootdir]
      command 'cp sumocollector/tanuki/wrapperdir wrapper'
      only_if "test -f #{node[:sumologic][:rootdir]}"
    end

    # Spec: stub the guard command before the run converges
    stub_command("test -f #{node[:sumologic][:rootdir]}").and_return(true)
    expect(runner).to run_execute('extract_sumologic').with(
      user: 'root',
      cwd: node[:sumologic][:rootdir],
      command: 'cp sumocollector/tanuki/wrapperdir wrapper'
    )

  15. Testing : Stage 2

    Enforce better design

    Consolidate repeats

  16. Testing : Stage 2
    include_recipe 'foo'
    # code
    # some more code
    # even more code
    package 'foo' do
      action :install
    end
    template 'baz' do
      action :create
    end
    service 'bar' do
      action [:start, :enable]
    end

  17. Testing : Stage 2
    # Recipe
    include_recipe 'foo'
    extend Foo
    value = process(node)
    package 'foo' do
      action :install
    end
    template 'baz' do
      action :create
    end
    service 'bar' do
      action [:start, :enable]
    end

    # Library
    module Foo
      def process(node)
        # code
        # some more code
        # even more code
      end
    end
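
Once the logic lives in a module, it can be exercised without converging a recipe. The sketch below assumes a hypothetical `process` that derives a heap size from node attributes; the attribute name `memory_mb`, the sizing rule and the `FakeRecipe` class are invented for illustration and are not from the talk.

```ruby
module Foo
  # Hypothetical rule: give the heap half of the node's memory, capped at 8192 MB.
  def process(node)
    total_mb = node.fetch(:memory_mb)
    [total_mb / 2, 8192].min
  end
end

# Stand-in for a recipe context; a plain hash stands in for the Chef node.
class FakeRecipe
  include Foo
end

recipe = FakeRecipe.new
small = recipe.process({ memory_mb: 4096 })
large = recipe.process({ memory_mb: 32_768 })
puts "small node heap: #{small} MB"
puts "large node heap: #{large} MB"
```

Because `process` only needs something that responds to `fetch`, a unit test can pass a hash and never touch a Chef server.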

  18. Testing : Stage 2

    Enforce better design

    Consolidate repeats

    Use appropriate language / stdlib alternatives

  19. Testing : Stage 2
    search(:node, 'roles:cassandra').partition do |other|
      node.ec2.placement_availability_zone ==
        other.ec2.placement_availability_zone
    end
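
The same `partition` idiom can be tried outside Chef. This is a minimal sketch in which a hypothetical `Node` struct and made-up host names stand in for real search results; `Enumerable#partition` itself is standard Ruby and returns the matching and non-matching elements as two arrays.

```ruby
# Hypothetical stand-in for a Chef node; real nodes expose the zone as
# node.ec2.placement_availability_zone.
Node = Struct.new(:name, :zone)

# Split peers into those in the same availability zone and the rest.
def split_by_zone(current, others)
  others.partition { |other| other.zone == current.zone }
end

me = Node.new('cass-1', 'us-east-1a')
peers = [
  Node.new('cass-2', 'us-east-1a'),
  Node.new('cass-3', 'us-east-1b'),
  Node.new('cass-4', 'us-east-1a')
]

same_zone, other_zones = split_by_zone(me, peers)
puts "same zone:   #{same_zone.map(&:name).join(', ')}"
puts "other zones: #{other_zones.map(&:name).join(', ')}"
```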

  20. Testing : Stage 2

    Enforce better design

    Consolidate repeats

    Use appropriate language / stdlib alternatives

    Use appropriate chef idioms.

  21. Testing : Stage 2
    # Before: a conditional wrapped around the resources
    include_recipe 'foo'
    if node[:foo]
      package 'foo' do
        action :install
      end
      template 'baz' do
        action :create
      end
      service 'bar' do
        action :start
      end
    end

    # After: a guard on each resource
    include_recipe 'foo'
    package 'foo' do
      action :install
      only_if { node[:foo] }
    end
    template 'baz' do
      action :create
      only_if { node[:foo] }
    end
    service 'bar' do
      action :start
      only_if { node[:foo] }
    end

  22. Testing : Stage 2

    Lint & Unit testing

    Typos, syntax errors, logic
    – knife, ChefSpec, RuboCop, Foodcritic

    Fast

    Easier to adopt

    Invaluable for long term maintainability

    Shared conventions

  23. Testing : Stage 3

    Deployments
    – Dark launching
    – Canary releases
    – Blue-green deployment
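
One way to picture a canary release is as a stable split of hosts into a small canary group and everyone else. The sketch below is only an illustration: the `canary?` helper, the 10 percent figure and the host names are assumptions, not something the talk prescribes. It hashes the host name with `Zlib.crc32` from Ruby's standard library so the same host always lands in the same bucket.

```ruby
require 'zlib'

# Hypothetical helper: true when the host falls into the canary bucket.
# The checksum makes the choice deterministic across runs.
def canary?(hostname, percent)
  Zlib.crc32(hostname) % 100 < percent
end

hosts = (1..20).map { |i| format('web-%02d.example.com', i) }

# Deploy to the canaries first, then to the rest once they look healthy.
canaries, rest = hosts.partition { |host| canary?(host, 10) }
puts "canary hosts: #{canaries.size}, remaining hosts: #{rest.size}"
```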

  24. Testing : Stage 4

    Dependency
    – Version compatibility
      ChefSpec
      ServerSpec
    – Hosted services
      Degraded mode
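
Degraded mode for a hosted dependency can be sketched as a wrapper that returns a fallback value when the call fails or takes too long. The `with_fallback` helper and the sample data are hypothetical; `Timeout.timeout` comes from Ruby's standard library and is used here only for brevity, since a real client would normally rely on its own connect and read timeouts.

```ruby
require 'timeout'

# Hypothetical wrapper: run the block, but serve the fallback if the
# dependency raises or does not answer within timeout_s seconds.
def with_fallback(fallback, timeout_s: 1)
  Timeout.timeout(timeout_s) { yield }
rescue StandardError => e
  warn "dependency failed (#{e.class}), serving degraded response"
  fallback
end

# Healthy dependency: the block's result is returned.
fresh = with_fallback([]) { %w[item-1 item-2] }

# Failing dependency: the cached fallback is returned instead.
degraded = with_fallback(['cached-item']) do
  raise IOError, 'hosted service unreachable'
end

puts "fresh: #{fresh.inspect}, degraded: #{degraded.inspect}"
```

A test for degraded mode then only has to make the block fail and assert that the caller still gets a usable answer.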

  25. Testing : Stage 5

    External factors
    – Traffic patterns
      Gatling, JMeter
    – Network isolation, latency, jitter, etc.
      iptables, tc

  26. Testing : Stage 5

    External factors
    – Resource starvation in shared environments
      ulimit, cgroup for memory
      nice, cgroup for CPU
      cgroup blkio for I/O

  27. Testing : Stage 6

    Combining failures
    – Whole environment provisioning
      Containers, VMs, etc.
      chef-metal
    – Feedback driven tests
      Benchmark across services
      Measure & enforce minimal system resources
      Alert on rate of change
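
Alerting on rate of change means watching how fast a metric moves rather than its absolute value. A minimal sketch, assuming samples arrive as `[timestamp_seconds, value]` pairs; the helper names, the threshold and the disk-usage numbers are all made up for illustration.

```ruby
# samples: array of [timestamp_seconds, value] pairs, oldest first.
# Returns the average change in value per second over the window.
def rate_of_change(samples)
  (t0, v0), (t1, v1) = samples.first, samples.last
  return 0.0 if t1 == t0
  (v1 - v0).to_f / (t1 - t0)
end

# Alert when the metric moves faster than max_rate in either direction.
def alert?(samples, max_rate)
  rate_of_change(samples).abs > max_rate
end

# Hypothetical disk usage in MB, sampled once a minute.
disk_used_mb = [[0, 1000], [60, 1030], [120, 1600]]

rate = rate_of_change(disk_used_mb)
puts "rate: #{rate} MB/s, alert: #{alert?(disk_used_mb, 2.0)}"
```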

  28. Testing : Stage 7

    Combining failures
    – Message passing/orchestration
      mco
      Ansible
      knife-ssh
      serf

  29. Search, ssh & execute
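    require 'chef/knife'
    require 'chef/knife/ssh'

    # Run a knife plugin in-process; the optional block can adjust its config
    def knife(klass, *name_args)
      $stdout.sync = true
      klass.load_deps
      plugin = klass.new
      yield plugin.config if block_given?
      plugin.name_args = name_args
      plugin.run
    end

    # Search for nodes and run a command on each of them over ssh
    def knife_ssh(search, command, password, concurrency)
      knife Chef::Knife::Ssh, search, command do |config|
        config[:ssh_password] = password
        config[:host_key_verify] = false
        config[:concurrency] = concurrency
      end
    end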

  30. Summary

    Accept failures
    – Make them inexpensive, isolated

    Design matters
    – Read
    – Incremental changes

    Communication influences design
    – Avoid knowledge silos
    – Adopt cross team reviews

  31. Thank you
    [email protected] @RanjibDey
