
Failover and Recovery Test Automation

Exactpro

April 20, 2019

Transcript

  1. Failover and Recovery Test Automation
     Ivan Shamrai, Senior NFT Analyst, Exactpro
     20 April 2019, Tbilisi, Georgia
  2. Financial infrastructures
     • Exchanges
     • Broker systems
     • Clearing agencies
     • Ticker plants
     • Surveillance systems
     Risks associated with a financial infrastructure outage:
     • Lost profit
     • Data loss
     • Damaged reputation
  3. Distributed high-performance computing
     • Bare-metal servers (no virtualization, no Docker or other handy tools)
     • Horizontal scalability
     • Redundancy (no single point of failure)
  4. Resilience tests
     • Hardware outages
       ◦ Network equipment failovers (switches, ports, network adapters)
       ◦ Server isolations
     • Software outages
       ◦ Simulation of various outage types (SIGKILL, SIGSTOP); see the sketch below
       ◦ Failovers in different system states (at startup / during the trading day / during an auction)
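
     As a rough illustration of the software-outage simulation above, the Python sketch
     below crashes or freezes a component process by name. The component names, the use
     of pgrep and local execution are illustrative assumptions; the real harness drives
     outages remotely through its daemons.

       import os
       import signal
       import subprocess

       def simulate_outage(process_name: str, mode: str = "kill") -> None:
           """Send SIGKILL (crash) or SIGSTOP (freeze) to every PID whose
           command line matches process_name."""
           pids = subprocess.run(["pgrep", "-f", process_name],
                                 capture_output=True, text=True).stdout.split()
           sig = signal.SIGKILL if mode == "kill" else signal.SIGSTOP
           for pid in pids:
               os.kill(int(pid), sig)

       # Example (hypothetical component names):
       # simulate_outage("MatchingEnginePrimary", mode="kill")   # hard crash
       # simulate_outage("FixGateway", mode="stop")              # simulated hang
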
  5. What cases to test?
     • Failover – failure of the active primary instance (the standby becomes active)
     • Failback – failure of the active standby instance
     • Standby failure – failure of the passive standby instance
     • Double failure – simultaneous failure of both instances
  6. What kinds of data do we need, and when?
     • Pre-SOD (before start of day): system snapshots and backups
     • Real-time:
       ◦ System metrics of all servers and all components (processes)
       ◦ Captured traffic of the injected load and the system responses
       ◦ Log files of the system
     • Post-EOD (after end of day): log data for passive testing and results analysis
  7. Defect mining in collected data
     Data sources: system statistics, captured traffic, log files.
     Metrics derived from them:
     • Log entries per second
     • Warnings per second
     • Errors per second
     • Transaction statistics
     • Response time (latency)
     • Throughput
     • Disk usage
     • RAM usage
     • CPU usage
     • Network stats
  8. Rules and thresholds
     Example alert rules:
     • ALERT: METRIC: RSS, GROWTH: 1 GB, TIME: 10 MIN
     • ALERT: METRIC: DISK, GROWTH: 10%, TIME: 1 HOUR
     Chart: Server MP101, process MatchingEngine (primary), metric RSS (resident set size)
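
     A minimal sketch of how such a growth-over-time rule could be evaluated against
     collected samples. The rule structure and the (timestamp, value) sample format are
     assumptions for illustration, not the tool's actual configuration syntax.

       from dataclasses import dataclass

       @dataclass
       class GrowthRule:
           metric: str          # e.g. "RSS"
           max_growth: float    # e.g. 1 GB expressed in bytes
           window_sec: float    # e.g. 600 seconds (10 minutes)

       def check_rule(rule: GrowthRule, samples: list[tuple[float, float]]) -> list[str]:
           """samples: (unix_timestamp, value) pairs sorted by time.
           Returns an alert for every point where growth within the window exceeds the limit."""
           alerts, start = [], 0
           for end in range(len(samples)):
               # slide the window start forward so it spans at most window_sec
               while samples[end][0] - samples[start][0] > rule.window_sec:
                   start += 1
               growth = samples[end][1] - samples[start][1]
               if growth > rule.max_growth:
                   alerts.append(f"ALERT: {rule.metric} grew by {growth:.0f} "
                                 f"within {rule.window_sec:.0f}s at t={samples[end][0]:.0f}")
           return alerts

       # rss_rule = GrowthRule("RSS", max_growth=1 * 1024**3, window_sec=600)
       # alerts = check_rule(rss_rule, rss_samples)
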
  9. Spikes and stairs detection
     Chart: Server OE102, process FixGateway (standby), metric RSS (resident set size)
  10. Spikes and stairs detection (example)
     • A CPU usage spike occurred on the TransactionRouter component at ~11:49
     • The last scenario step executed before 11:49 most likely caused that spike
     • Information about this abnormality and the steps that produced it is populated into the final report
     Chart: Server CA104, process TransactionRouter (primary), metric CPU usage
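
     A rough sketch of one way to flag such spikes automatically, by comparing each
     sample to a rolling baseline. The window size and deviation factor are illustrative
     assumptions, not the thresholds used by the tool.

       import statistics

       def detect_spikes(samples, window=30, factor=3.0):
           """samples: list of (timestamp, value) pairs. Flags points that deviate from
           the mean of the preceding `window` samples by more than `factor` stdevs."""
           spikes = []
           for i in range(window, len(samples)):
               history = [v for _, v in samples[i - window:i]]
               mean = statistics.mean(history)
               stdev = statistics.pstdev(history) or 1e-9   # guard against flat history
               ts, value = samples[i]
               if abs(value - mean) / stdev > factor:
                   spikes.append((ts, value))
           return spikes

       # cpu_spikes = detect_spikes(cpu_samples)
       # ...then correlate each spike timestamp with the scenario steps for the report
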
  11. Data reconciliation checks
     • Consistency across different data streams (see the sketch below)
       ◦ Client’s messages
       ◦ Public market data
       ◦ Aggregated market data
     • Consistency between the data streams and the system’s database
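
     A simplified sketch of such a check: the same records should be visible in every
     stream and in the database. The stream names and key fields are assumptions for
     illustration only.

       def reconcile(streams: dict[str, set[tuple]]) -> dict[str, set[tuple]]:
           """streams maps a stream name to the set of normalized records seen in it.
           Returns, per stream, the records present elsewhere but missing from it."""
           all_records = set().union(*streams.values())
           return {name: all_records - records for name, records in streams.items()}

       # Hypothetical keys (order id, executed quantity, price):
       # breaks = reconcile({
       #     "client_messages":    {("ORD-1", 100, 101.5)},
       #     "public_market_data": {("ORD-1", 100, 101.5)},
       #     "database":           set(),   # a record missing here is a reconciliation break
       # })
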
  12. How to collect data in real time?
     • Use the available system tools
     • Use the monitoring provided by the proprietary software vendor
     • Use third-party monitoring tools
  13. How about reinventing the wheel?
     • Independent
     • Incorporates all the features we need in one tool
     • Remotely controlled
     • Supports different output formats: protobuf, JSON, raw binary data
     • Supports multiple data consumers with different visibility
     • Delivers data on a need-to-know basis only
     • Uniform data format across all environments
     • Low footprint
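
     To make the "uniform data format" idea concrete, here is a minimal sketch of a
     metric sample such a daemon might emit as JSON. The field names and the publish
     call are assumptions, not the tool's actual wire format.

       import json
       import time

       def make_sample(server: str, process: str, metrics: dict) -> bytes:
           """One uniform sample, identical across environments and consumers."""
           sample = {
               "ts": time.time(),     # collection timestamp
               "server": server,      # e.g. "MP101"
               "process": process,    # e.g. "MatchingEnginePrimary"
               "metrics": metrics,    # e.g. {"rss_bytes": 524288000, "cpu_pct": 15.0}
           }
           # the same dict could be serialized as protobuf or raw binary instead
           return json.dumps(sample).encode()

       # router.publish(make_sample("MP101", "MatchingEnginePrimary",
       #                            {"rss_bytes": 524_288_000, "cpu_pct": 15.0}))
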
  14. Downsides of the brand-new bicycle
     • Green code: not yet well tested in the field
     • Requires additional resources for support
     • Solves only a particular problem
  15. Who should receive real-time data?
     • Different tests require dozens of different metrics
     • A tester cannot track all the changes manually
     • All the data should be analyzed on the fly
     • Test behaviour should change depending on the received data (see the sketch below)
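
     A hedged sketch of what "changing test behaviour depending on received data" could
     look like in a scenario script: wait until the standby reports READY before
     injecting the next load phase. The helper names and the READY state string are
     assumptions.

       import time

       def wait_for_state(get_latest_state, component: str, expected: str = "READY",
                          timeout: float = 120.0, poll: float = 1.0) -> bool:
           """Polls the monitoring feed until `component` reports the expected state."""
           deadline = time.time() + timeout
           while time.time() < deadline:
               if get_latest_state(component) == expected:
                   return True
               time.sleep(poll)
           return False

       # In a failover scenario (hypothetical helpers):
       # kill_primary("MatchingEnginePrimary")
       # assert wait_for_state(monitor.state_of, "MatchingEngineStandby")
       # inject_load(rate=5000)   # continue only once the standby has taken over
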
  16. High-level view of real-time monitoring
     Architecture diagram spanning the Management Server, a QA Server and Servers 1..N, with these components and roles:
     • Daemon_S1..Daemon_SN, Daemon_M: collecting system info, parsing logs, executing commands
     • Daemon_I: load control and test script execution
     • Router: communication between the daemons and the controllers
     • TestManager (TM): automated execution of test scenarios, collecting and processing test information
     • Data Processor: transforms, collects and stores data for future use; data visualisation and reporting; data storage
  17. Passive monitoring
     Diagram: Management Server (TestManager, Data Processor, Router), Matching Server (Daemon MEP, MatchingEnginePrimary, matching log), Monitoring Server (Daemon MON, system events log, system metrics log).
     • The MON daemon collects system metrics and messages, e.g. System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}
     • The MEP daemon parses the matching log and provides the router with actual system info, e.g. MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}, MatchingEnginePrimary {STATE: READY}, MatchingEnginePrimary {INTERNAL LATENCY: 10}
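
     A hedged sketch of the passive path: tail the matching log and turn matched lines
     into samples for the router. The log line pattern and the publish callback are
     assumptions, not the daemon's real implementation.

       import re
       import time

       # hypothetical pattern for a state line in the matching log
       STATE_RE = re.compile(r"^(?P<ts>\S+)\|(?P<component>\w+)\|STATE\|(?P<state>\w+)$")

       def tail_and_publish(path: str, publish) -> None:
           """Follows the log file and publishes every recognized state change."""
           with open(path, "r") as log:
               log.seek(0, 2)               # start at the end of the file, like `tail -f`
               while True:
                   line = log.readline()
                   if not line:
                       time.sleep(0.1)
                       continue
                   match = STATE_RE.match(line.strip())
                   if match:
                       publish({"component": match["component"], "state": match["state"]})

       # tail_and_publish("/var/log/matching/matching.log", router_publish)
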
  18. Active monitoring
     Diagram: the same topology as passive monitoring; the TestManager issues an RPC call ("Stop matching log monitor") to the MEP daemon, while the MON daemon keeps reporting System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}.
     When real-time data is not required, a user or an automated scenario can stop or update a task on an active monitor to reduce system load.
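
     A minimal sketch of such a control call, using Python's standard-library XML-RPC
     purely as a stand-in transport. The daemons use their own RPC mechanism; the
     stop_task method and the port are hypothetical.

       # daemon side: expose a control method that stops a named monitoring task
       from xmlrpc.server import SimpleXMLRPCServer

       active_tasks = {"matching_log_monitor": True}

       def stop_task(name: str) -> bool:
           """Disable a monitoring task so the daemon stops tailing and parsing it."""
           if name in active_tasks:
               active_tasks[name] = False
               return True
           return False

       server = SimpleXMLRPCServer(("0.0.0.0", 9100), allow_none=True)
       server.register_function(stop_task)
       # server.serve_forever()

       # client side (TestManager or a scenario step):
       # import xmlrpc.client
       # daemon = xmlrpc.client.ServerProxy("http://matching-server:9100")
       # daemon.stop_task("matching_log_monitor")
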
  19. Post-EOD data
     • Checkpoints from the TestManager tool
     • System and hardware usage stats
     • Essential internal metrics from the system under test
  20. What's wrong with system logs? Bias: logs should be human-friendly
     Example log excerpt:
       ...
       ~|=============================================================================
       ~|Disk I/O statistics
       ~|=============================================================================
       ~|Device  Reads/sec       Writes/sec      AvgQSize  AvgWait  AvgSrvceTime
       ~|sda     0.0 (  0.0kB)   4.1 ( 22.4kB)   0.0       0.0ms    0.0ms
       ~|sdb     0.0 (  0.0kB)   0.0 (  0.0kB)   0.0       0.0ms    0.0ms
       ~|sdc     0.0 (  0.0kB)  10.7 ( 70.5kB)   0.0       0.0ms    0.0ms
       20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
       ~|=============================================================================
       ~|MemTotal:       263868528 kB
       ~|MemFree:        252390192 kB
       ...
  21. What's wrong with system logs? Not standardized
     Release 1:
       Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),
     Release 2:
       Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
       Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0
  22. How to deal with creative loggers?
     UNKNOWN METRIC DETECTED:
       [SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
     KNOWN METRICS:
       [SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
       [SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
     • Accept the reality
     • No one will change the log format just for you
     • No one will ask you before changing the log format
     • Regex-like patterns are our “best friends”
     • Automatic log format analysis (see the sketch below)
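
     A hedged sketch of flagging lines that no longer match any known metric pattern.
     The two patterns below only approximate the "known metrics" above; they are not the
     production pattern set, which is derived by the automatic log format analysis.

       import re

       # approximate patterns for the two known metric lines shown above
       KNOWN_PATTERNS = [
           re.compile(r"\[SystemComponent:\d+\]: AToB \| Rate=[\d.]+ \[.*\], Mode=\w+, Max Queues=\[.*\]"),
           re.compile(r"\[SystemComponent:\d+\]: ABToWorker \| Rate=[\d.]+ \(.*\), Mode=\w+, Max Queues=\(.*\)"),
       ]

       def classify(line: str) -> str:
           """Returns 'known' if the line matches an expected format, otherwise flags it."""
           if any(p.search(line) for p in KNOWN_PATTERNS):
               return "known"
           return "UNKNOWN METRIC DETECTED"

       # classify("[SystemComponent:1]: A To B | Rate=0.00 (W=0.00,...), Mode=LOW_LATENCY, ...")
       # -> "UNKNOWN METRIC DETECTED"  (the renamed "A To B" label no longer matches)
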
  23. Where to store the data, and for how long?
     • The data is sensitive and should be stored on the client’s side
     • The data volume is huge for the limited hardware resources of the test environment
     Data retention:
     • Current data (kept for ~2 weeks): HW stats, system metrics, system configs, traffic
     • Historical data: anonymous production data, system configs, aggregated test reports
  24. How to use the collected data?
     • Reporting
     • Analysis
     • Test improvement