
Failover and Recovery Test Automation

Exactpro

April 20, 2019

Transcript

  1. 20 April 2019, Tbilisi, Georgia
    Ivan Shamrai, Senior NFT Analyst, Exactpro
    Failover and Recovery Test Automation

  2. (image-only slide)

  3. Financial infrastructures
    ● Exchanges
    ● Broker systems
    ● Clearing agencies
    ● Ticker plants
    ● Surveillance systems
    Risks associated with a financial infrastructure outage:
    ● Lost profit
    ● Data loss
    ● Damaged reputation

  4. Distributed high-performance computing
    ● Bare-metal servers (no virtualization, no Docker or other handy tools)
    ● Horizontal scalability
    ● Redundancy (no single point of failure)

  5. Resilience tests
    ● Hardware outages
    ○ Network equipment failovers (Switches, Ports, Network adapters)
    ○ Server isolations
    ● Software outages
    ○ Simulation of various outage types (SIGKILL, SIGSTOP); see the sketch after this list
    ○ Failovers in different system states (at startup / during the trading day / during an auction)
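    A minimal sketch of how the signal-based outage simulation above could be scripted, assuming a Linux host and components that can be located by process name; the helper names and the example component are illustrative, not the actual tooling:

    import os
    import signal
    import subprocess

    def pids_of(process_name):
        # Find PIDs of a component by (partial) process name via pgrep.
        out = subprocess.run(["pgrep", "-f", process_name],
                             capture_output=True, text=True)
        return [int(pid) for pid in out.stdout.split()]

    def simulate_outage(process_name, kind="kill"):
        # SIGKILL imitates a hard crash, SIGSTOP imitates a hung process.
        sig = signal.SIGKILL if kind == "kill" else signal.SIGSTOP
        for pid in pids_of(process_name):
            os.kill(pid, sig)

    # e.g. crash the primary matching engine during an auction step:
    # simulate_outage("MatchingEnginePrimary", kind="kill")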

  6. What cases to test?
    ● Failover – failure of the active primary instance (standby becomes active)
    ● Failback – failure of the active standby instance
    ● Standby failure – failure of the passive standby instance
    ● Double failure – simultaneous failure of both instances (the four cases are enumerated in the sketch below)
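    The four cases can be driven as parametrised scenarios; a sketch, where the instance labels and the kill() callback are assumptions for illustration:

    # The four resilience cases from this slide, expressed as test parameters.
    FAILURE_SCENARIOS = {
        "failover":        ["primary"],                      # active primary dies, standby takes over
        "failback":        ["active_standby"],               # standby that became active dies
        "standby_failure": ["passive_standby"],              # passive standby dies, no switchover expected
        "double_failure":  ["primary", "passive_standby"],   # both instances die at once
    }

    def run_scenario(name, kill):
        # kill() is whatever mechanism the harness uses (e.g. simulate_outage above).
        for instance in FAILURE_SCENARIOS[name]:
            kill(instance)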

  7. What kinds of data do we need and when?
    ● Pre-SOD (before start of day): system snapshots and backups
    ● Real-time:
    ○ System metrics of all servers and all components (processes)
    ○ Captured traffic of injected load and system responses
    ○ Log files of the system
    ● Post-EOD (after end of day): log data for passive testing and results analysis

  8. (image-only slide)

  9. Defect mining in collected data
    ● Log entries per second
    ● Warnings per second
    ● Errors per second
    ● Transaction statistics
    ● Response time (latency)
    ● Throughput
    ● Disk usage
    ● RAM usage
    ● CPU usage
    ● Network stats
    Data sources: system statistics, captured traffic, log files

  10. Rules and thresholds
    Example rules (evaluated in the sketch below):
    ALERT: METRIC: RSS,  GROWTH: 1 GB, TIME: 10 MIN
    ALERT: METRIC: DISK, GROWTH: 10%,  TIME: 1 HOUR
    Example chart:
    Server: MP101, Process: MatchingEngine Primary, Metric: RSS (resident set size)
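    A minimal sketch of how such growth-over-time rules could be evaluated against a metric stream; the class and parameter names are illustrative, not the actual rule engine:

    from collections import deque

    class GrowthAlert:
        # Fires when a metric grows by more than max_growth within window_sec,
        # e.g. RSS growth of 1 GB within 10 minutes.
        def __init__(self, metric, max_growth, window_sec):
            self.metric = metric
            self.max_growth = max_growth
            self.window_sec = window_sec
            self.samples = deque()                      # (timestamp, value) pairs

        def add(self, ts, value):
            self.samples.append((ts, value))
            while self.samples and ts - self.samples[0][0] > self.window_sec:
                self.samples.popleft()                  # drop samples outside the window
            growth = value - min(v for _, v in self.samples)
            if growth > self.max_growth:
                return f"ALERT: {self.metric} grew by {growth} within {self.window_sec}s"
            return None

    rss_rule  = GrowthAlert("RSS",  1 * 1024 ** 3, 10 * 60)   # 1 GB in 10 minutes
    disk_rule = GrowthAlert("DISK", 10,            60 * 60)   # 10 percentage points in 1 hour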

  11. Spikes and stairs detection
    Example chart:
    Server: OE102, Process: FixGateway Standby, Metric: RSS (resident set size)

  12. Spikes and stairs detection
    Example:
    ● A CPU usage spike occurred on the TransactionRouter component at ~11:49
    ● Most likely the last scenario step executed before 11:49 caused that spike
    ● Information about this anomaly and the steps that produced it is populated into the final report (a detection sketch follows below)
    Example chart:
    Server: CA104, Process: TransactionRouter Primary, Metric: CPU usage
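    A hedged sketch of one way to flag spikes and stairs in a sampled metric; the heuristic and thresholds are illustrative, not the detector actually used:

    import statistics

    def detect_anomalies(samples, jump=3.0, window=10):
        # samples: list of (timestamp, value) for one metric, e.g. CPU usage.
        # A point far from the preceding window's mean is flagged; if the level
        # stays shifted afterwards it is reported as a "stair", otherwise a "spike".
        events = []
        for i in range(window, len(samples) - window):
            prev = [v for _, v in samples[i - window:i]]
            nxt = [v for _, v in samples[i + 1:i + 1 + window]]
            mean = statistics.mean(prev)
            stdev = statistics.pstdev(prev) or 1.0
            ts, value = samples[i]
            if abs(value - mean) > jump * stdev:
                kind = "stair" if abs(statistics.mean(nxt) - mean) > jump * stdev else "spike"
                events.append((ts, kind, value))
        return events

    # Matching the example above: a spike on TransactionRouter CPU usage at ~11:49
    # would be reported together with the last scenario step executed before it.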

  13. Data reconciliation checks
    ● Consistency across different data streams
    ○ Client’s messages
    ○ Public market data
    ○ Aggregated market data
    ● Consistency between the data streams and the system’s database (see the sketch below)
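    A minimal reconciliation sketch across the streams listed above; the message shape and the exec_id field are assumptions for illustration:

    def reconcile(client_msgs, public_md, db_rows):
        # Each input is an iterable of dicts carrying a common identifier.
        client_ids = {m["exec_id"] for m in client_msgs}
        public_ids = {m["exec_id"] for m in public_md}
        db_ids     = {r["exec_id"] for r in db_rows}
        return {
            "missing_in_public_md": client_ids - public_ids,
            "missing_in_database":  client_ids - db_ids,
            "unknown_in_public_md": public_ids - client_ids,
        }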

  14. How to collect data in real-time?
    ● Use of available system tools
    ● Use of monitoring provided by a proprietary software vendor
    ● Use of third-party monitoring tools

  15. How about reinventing the wheel?
    ● Independent
    ● Incorporate all the features we need in one tool
    ● Remote-controlled
    ● Support of different output formats: protobuf, JSON, raw binary data
    ● Support of multiple data consumers with different visibility
    ● Deliver data on a need-to-know basis only
    ● Uniform data format across all environments
    ● Low footprint

  16. Downsides of the brand new bicycle
    ● Green code: not well tested in the field
    ● Requires additional resources for support
    ● Solves only a particular problem

  17. Who should receive real-time data?
    ● Different tests require dozens of different metrics
    ● A tester is not able to track all the changes
    ● All the data should be analyzed on the fly
    ● Test behaviour should be changed depending on the received data

  18. High-level view of real-time monitoring
    Diagram legend (reconstructed from the flattened figure):
    ● Daemon_S1 … Daemon_SN (Server 1 … Server N) and Daemon_M (Management Server): collecting system info, log parsing, command execution
    ● Daemon_I (QA Server): load control and test script execution
    ● Router: communication between the daemons and the controllers
    ● TestManager (TM): automated execution of test scenarios, collecting and processing test information
    ● Data Processor: transforms, collects and stores data for future use, feeding data visualisation/reporting and data storage
    A minimal daemon-side sketch follows below.
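    A minimal sketch of the daemon side of this picture: sample host metrics and push them to the Router as JSON. The host name, port and message fields are assumptions; the real daemons also parse logs and execute commands on request:

    import json
    import socket
    import time

    ROUTER_ADDR = ("mgmt-server", 9000)        # hypothetical Router endpoint

    def sample_host():
        # Read free memory from /proc/meminfo as a stand-in for "system info".
        with open("/proc/meminfo") as f:
            mem = dict(line.split(":", 1) for line in f)
        return {
            "host": socket.gethostname(),
            "ts": time.time(),
            "mem_free_kb": int(mem["MemFree"].split()[0]),
        }

    def run_daemon(interval=1.0):
        # Push one JSON datagram per interval towards the Router.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(json.dumps(sample_host()).encode(), ROUTER_ADDR)
            time.sleep(interval)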

  19. Passive monitoring
    Diagram legend (reconstructed from the flattened figure):
    ● Management Server: TestManager, Data Processor, Router
    ● Matching Server: Daemon MEP, MatchingEnginePrimary, matching log
    ● Monitoring Server: Daemon MON, system events log, system metrics log
    The MON Daemon collects system metrics and messages; the MEP Daemon parses the matching log and provides the Router with up-to-date system info.
    Example monitoring messages:
    MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}
    MatchingEnginePrimary {STATE: READY}
    MatchingEnginePrimary {INTERNAL LATENCY: 10}
    System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}

  20. Active monitoring
    Same topology as on the previous slide (Management Server: TestManager, Data Processor, Router; Matching Server: Daemon MEP, MatchingEnginePrimary, matching log; Monitoring Server: Daemon MON, system events log, system metrics log).
    When real-time data is not required, a user or an automated scenario can stop or update a task for an active monitor to reduce system load, e.g. an RPC call: “Stop matching log monitor” (see the sketch below).
    System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}
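    A hedged sketch of the RPC control path, using the standard-library XML-RPC server as a stand-in for whatever RPC mechanism the daemons actually expose; method and task names are illustrative:

    from xmlrpc.server import SimpleXMLRPCServer

    # Tasks the daemon is currently running, e.g. the matching-log monitor.
    active_tasks = {"matching_log_monitor": {"enabled": True, "interval_sec": 1}}

    def stop_task(name):
        active_tasks[name]["enabled"] = False
        return f"{name} stopped"

    def update_task(name, interval_sec):
        active_tasks[name]["interval_sec"] = interval_sec
        return f"{name} interval set to {interval_sec}s"

    server = SimpleXMLRPCServer(("0.0.0.0", 9001), allow_none=True)
    server.register_function(stop_task)
    server.register_function(update_task)
    # server.serve_forever()
    # The TestManager (or a user) would then call e.g. stop_task("matching_log_monitor")
    # when real-time matching-log data is not needed.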

  21. Post-EOD data
    ● Checkpoints from the TestManager tool
    ● System and hardware usage stats
    ● Essential internal metrics from the system under test

  22. What’s wrong with system logs?
    Bias: logs should be human friendly. Example excerpt:
    ...
    ~|=============================================================================
    ~|Disk I/O statistics
    ~|=============================================================================
    ~|Device Reads/sec Writes/sec AvgQSize AvgWait AvgSrvceTime
    ~|sda 0.0 ( 0.0kB) 4.1 ( 22.4kB) 0.0 0.0ms 0.0ms
    ~|sdb 0.0 ( 0.0kB) 0.0 ( 0.0kB) 0.0 0.0ms 0.0ms
    ~|sdc 0.0 ( 0.0kB) 10.7 ( 70.5kB) 0.0 0.0ms 0.0ms
    20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
    ~|=============================================================================
    ~|MemTotal: 263868528 kB
    ~|MemFree: 252390192 kB
    ...

  23. What’s wrong with system logs?
    Not standardized – the same metric changes format between releases:
    Release 1:
    Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),
    Release 2:
    Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
    Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0

  24. How to deal with creative loggers?
    ● Accept the reality
    ● No one will change the log format just for you
    ● No one will ask you prior to a log format change
    ● Regexp-ish patterns are our “best friends” (see the sketch after this list)
    ● Automatic log format analysis
    UNKNOWN METRIC DETECTED:
    [SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
    KNOWN METRICS:
    [SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
    [SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
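    A minimal sketch of matching metrics with regex-like patterns and flagging anything no pattern covers (the UNKNOWN METRIC case above); the patterns are written against the sample lines only and are not the actual rule set:

    import re

    KNOWN_PATTERNS = [
        re.compile(r"\[SystemComponent:\d+\]: AToB \| Rate=(?P<rate>[\d.]+)"),
        re.compile(r"\[SystemComponent:\d+\]: ABToWorker \| Rate=(?P<rate>[\d.]+)"),
    ]

    def classify(line):
        # Return the parsed metric if a known pattern matches, otherwise flag it.
        for pattern in KNOWN_PATTERNS:
            match = pattern.search(line)
            if match:
                return "known", match.groupdict()
        return "unknown", None      # report as UNKNOWN METRIC DETECTED

    # classify("[SystemComponent:1]: A To B | Rate=0.00 ...") -> ("unknown", None)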

  25. Where to store and how long?
    ● Data is sensitive and should be stored on the client’s side
    ● Data volume is huge for the limited hardware resources in the test environment
    ● Data retention (a cleanup sketch follows below):
    ○ Current data (retained for 2 weeks): HW stats, system metrics, system configs, traffic
    ○ Historical data: anonymous production data, system configs, aggregated test reports
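    A sketch of a retention sweep for the current-data store under the two-week policy above; the paths and the aggregation step are assumptions:

    import os
    import time

    RETENTION_SEC = 14 * 24 * 3600            # two weeks of "current" data

    def sweep(current_dir="/data/current"):
        # Files older than the retention window are aggregated into the
        # historical store (not shown) and then removed from the current store.
        cutoff = time.time() - RETENTION_SEC
        for name in os.listdir(current_dir):
            path = os.path.join(current_dir, name)
            if os.path.getmtime(path) < cutoff:
                # aggregate_into_history(path)   # hypothetical aggregation step
                os.remove(path)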

  26. How to use?
    ● Reporting
    ● Analysis
    ● Test improvement

  27. Reporting

  28. Software Testing is Relentless Learning