
Failover and Recovery Test Automation

Exactpro

April 20, 2019

Transcript

  1. 20 April 2019, Tbilisi, Georgia
    Ivan Shamrai, Senior NFT Analyst, Exactpro
    Failover and Recovery Test Automation

  2. (image-only slide)

  3. Financial infrastructures
    ● Exchanges
    ● Broker systems
    ● Clearing agencies
    ● Ticker plants
    ● Surveillance systems
    Risks associated with a financial infrastructure outage:
    ● Lost profit
    ● Data loss
    ● Damaged reputation

  4. Distributed high-performance computing
    ● Bare-metal servers (no virtualization, no Docker or other handy tools)
    ● Horizontal scalability
    ● Redundancy (no single point of failure)

  5. Resilience tests
    ● Hardware outages
    ○ Network equipment failovers (Switches, Ports, Network adapters)
    ○ Server isolations
    ● Software outages
    ○ Simulation of various outage types (SIGKILL, SIGSTOP); see the sketch after this list
    ○ Failovers in different system states (at startup / during the trading day / during an auction)
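    A minimal sketch of how the signal-based outage simulation above could be scripted, assuming a Linux host and components that can be located by process name; the helper names and the example component are illustrative, not the actual tooling:

    import os
    import signal
    import subprocess

    def pids_of(process_name):
        # Find PIDs of a component by (partial) process name via pgrep.
        out = subprocess.run(["pgrep", "-f", process_name],
                             capture_output=True, text=True)
        return [int(pid) for pid in out.stdout.split()]

    def simulate_outage(process_name, kind="kill"):
        # SIGKILL imitates a hard crash, SIGSTOP imitates a hung process.
        sig = signal.SIGKILL if kind == "kill" else signal.SIGSTOP
        for pid in pids_of(process_name):
            os.kill(pid, sig)

    # e.g. crash the primary matching engine during an auction step:
    # simulate_outage("MatchingEnginePrimary", kind="kill")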

  6. What cases to test?
    ● Failover – failure of the active primary instance (standby becomes active)
    ● Failback – failure of the active standby instance
    ● Standby failure – failure of the passive standby instance
    ● Double failure – simultaneous failure of both instances (the four cases are enumerated in the sketch below)
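    The four cases can be driven as parametrised scenarios; a sketch, where the instance labels and the kill() callback are assumptions for illustration:

    # The four resilience cases from this slide, expressed as test parameters.
    FAILURE_SCENARIOS = {
        "failover":        ["primary"],                      # active primary dies, standby takes over
        "failback":        ["active_standby"],               # standby that became active dies
        "standby_failure": ["passive_standby"],              # passive standby dies, no switchover expected
        "double_failure":  ["primary", "passive_standby"],   # both instances die at once
    }

    def run_scenario(name, kill):
        # kill() is whatever mechanism the harness uses (e.g. simulate_outage above).
        for instance in FAILURE_SCENARIOS[name]:
            kill(instance)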

  7. What kinds of data do we need and when?
    ● Pre-SOD (before start of day): system snapshots and backups
    ● Real-time:
    ○ System metrics of all servers and all components (processes)
    ○ Captured traffic of injected load and system responses
    ○ Log files of the system
    ● Post-EOD (after end of day): log data for passive testing and results analysis

  8. (image-only slide)

  9. Defect mining in collected data
    ● Log entries per second
    ● Warnings per second
    ● Errors per second
    ● Transaction statistics
    ● Response time (latency)
    ● Throughput
    ● Disk usage
    ● RAM usage
    ● CPU usage
    ● Network stats
    Data sources: system statistics, captured traffic, log files

  10. Rules and thresholds
    Example rules (evaluated in the sketch below):
    ALERT: METRIC: RSS,  GROWTH: 1 GB, TIME: 10 MIN
    ALERT: METRIC: DISK, GROWTH: 10%,  TIME: 1 HOUR
    Example chart:
    Server: MP101, Process: MatchingEngine Primary, Metric: RSS (resident set size)
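    A minimal sketch of how such growth-over-time rules could be evaluated against a metric stream; the class and parameter names are illustrative, not the actual rule engine:

    from collections import deque

    class GrowthAlert:
        # Fires when a metric grows by more than max_growth within window_sec,
        # e.g. RSS growth of 1 GB within 10 minutes.
        def __init__(self, metric, max_growth, window_sec):
            self.metric = metric
            self.max_growth = max_growth
            self.window_sec = window_sec
            self.samples = deque()                      # (timestamp, value) pairs

        def add(self, ts, value):
            self.samples.append((ts, value))
            while self.samples and ts - self.samples[0][0] > self.window_sec:
                self.samples.popleft()                  # drop samples outside the window
            growth = value - min(v for _, v in self.samples)
            if growth > self.max_growth:
                return f"ALERT: {self.metric} grew by {growth} within {self.window_sec}s"
            return None

    rss_rule  = GrowthAlert("RSS",  1 * 1024 ** 3, 10 * 60)   # 1 GB in 10 minutes
    disk_rule = GrowthAlert("DISK", 10,            60 * 60)   # 10 percentage points in 1 hour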

  11. Spikes and stairs detection
    Example chart:
    Server: OE102, Process: FixGateway Standby, Metric: RSS (resident set size)

  12. Spikes and stairs detection
    Example:
    ● A CPU usage spike occurred on the TransactionRouter component at ~11:49
    ● Most likely the last scenario step executed before 11:49 caused that spike
    ● Information about this anomaly and the steps that produced it is populated into the final report (a detection sketch follows below)
    Example chart:
    Server: CA104, Process: TransactionRouter Primary, Metric: CPU usage
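    A hedged sketch of one way to flag spikes and stairs in a sampled metric; the heuristic and thresholds are illustrative, not the detector actually used:

    import statistics

    def detect_anomalies(samples, jump=3.0, window=10):
        # samples: list of (timestamp, value) for one metric, e.g. CPU usage.
        # A point far from the preceding window's mean is flagged; if the level
        # stays shifted afterwards it is reported as a "stair", otherwise a "spike".
        events = []
        for i in range(window, len(samples) - window):
            prev = [v for _, v in samples[i - window:i]]
            nxt = [v for _, v in samples[i + 1:i + 1 + window]]
            mean = statistics.mean(prev)
            stdev = statistics.pstdev(prev) or 1.0
            ts, value = samples[i]
            if abs(value - mean) > jump * stdev:
                kind = "stair" if abs(statistics.mean(nxt) - mean) > jump * stdev else "spike"
                events.append((ts, kind, value))
        return events

    # Matching the example above: a spike on TransactionRouter CPU usage at ~11:49
    # would be reported together with the last scenario step executed before it.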

  13. Data reconciliation checks
    ● Consistency across different data streams
    ○ Client’s messages
    ○ Public market data
    ○ Aggregated market data
    ● Consistency between the data streams and the system’s database (see the sketch below)
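    A minimal reconciliation sketch across the streams listed above; the message shape and the exec_id field are assumptions for illustration:

    def reconcile(client_msgs, public_md, db_rows):
        # Each input is an iterable of dicts carrying a common identifier.
        client_ids = {m["exec_id"] for m in client_msgs}
        public_ids = {m["exec_id"] for m in public_md}
        db_ids     = {r["exec_id"] for r in db_rows}
        return {
            "missing_in_public_md": client_ids - public_ids,
            "missing_in_database":  client_ids - db_ids,
            "unknown_in_public_md": public_ids - client_ids,
        }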

  14. How to collect data in real-time?
    ● Use of available system tools
    ● Use of monitoring provided by a proprietary software vendor
    ● Use of third-party monitoring tools

  15. How about reinventing the wheel?
    ● Independent
    ● Incorporate all the features we need in one tool
    ● Remote-controlled
    ● Support of different output formats: protobuf, JSON, raw binary data
    ● Support of multiple data consumers with different visibility
    ● Deliver data on a need-to-know basis only
    ● Uniform data format across all environments
    ● Low footprint

  16. Downsides of the brand new bicycle
    ● Green code: not well tested in the field
    ● Requires additional resources for support
    ● Solves only a particular problem

  17. Who should receive real-time data?
    ● Different tests require dozens of different metrics
    ● A tester is not able to track all the changes
    ● All the data should be analyzed on the fly
    ● Test behaviour should be changed depending on the received data

  18. High-level view of real-time monitoring
    Diagram legend (reconstructed from the flattened figure):
    ● Daemon_S1 … Daemon_SN (Server 1 … Server N) and Daemon_M (Management Server): collecting system info, log parsing, command execution
    ● Daemon_I (QA Server): load control and test script execution
    ● Router: communication between the daemons and the controllers
    ● TestManager (TM): automated execution of test scenarios, collecting and processing test information
    ● Data Processor: transforms, collects and stores data for future use, feeding data visualisation/reporting and data storage
    A minimal daemon-side sketch follows below.
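    A minimal sketch of the daemon side of this picture: sample host metrics and push them to the Router as JSON. The host name, port and message fields are assumptions; the real daemons also parse logs and execute commands on request:

    import json
    import socket
    import time

    ROUTER_ADDR = ("mgmt-server", 9000)        # hypothetical Router endpoint

    def sample_host():
        # Read free memory from /proc/meminfo as a stand-in for "system info".
        with open("/proc/meminfo") as f:
            mem = dict(line.split(":", 1) for line in f)
        return {
            "host": socket.gethostname(),
            "ts": time.time(),
            "mem_free_kb": int(mem["MemFree"].split()[0]),
        }

    def run_daemon(interval=1.0):
        # Push one JSON datagram per interval towards the Router.
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        while True:
            sock.sendto(json.dumps(sample_host()).encode(), ROUTER_ADDR)
            time.sleep(interval)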

  19. Passive monitoring
    Diagram legend (reconstructed from the flattened figure):
    ● Management Server: TestManager, Data Processor, Router
    ● Matching Server: Daemon MEP, MatchingEnginePrimary, matching log
    ● Monitoring Server: Daemon MON, system events log, system metrics log
    The MON Daemon collects system metrics and messages; the MEP Daemon parses the matching log and provides the Router with up-to-date system info.
    Example monitoring messages:
    MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}
    MatchingEnginePrimary {STATE: READY}
    MatchingEnginePrimary {INTERNAL LATENCY: 10}
    System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}

  20. Active monitoring
    Same topology as on the previous slide (Management Server: TestManager, Data Processor, Router; Matching Server: Daemon MEP, MatchingEnginePrimary, matching log; Monitoring Server: Daemon MON, system events log, system metrics log).
    When real-time data is not required, a user or an automated scenario can stop or update a task for an active monitor to reduce system load, e.g. an RPC call: “Stop matching log monitor” (see the sketch below).
    System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}
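    A hedged sketch of the RPC control path, using the standard-library XML-RPC server as a stand-in for whatever RPC mechanism the daemons actually expose; method and task names are illustrative:

    from xmlrpc.server import SimpleXMLRPCServer

    # Tasks the daemon is currently running, e.g. the matching-log monitor.
    active_tasks = {"matching_log_monitor": {"enabled": True, "interval_sec": 1}}

    def stop_task(name):
        active_tasks[name]["enabled"] = False
        return f"{name} stopped"

    def update_task(name, interval_sec):
        active_tasks[name]["interval_sec"] = interval_sec
        return f"{name} interval set to {interval_sec}s"

    server = SimpleXMLRPCServer(("0.0.0.0", 9001), allow_none=True)
    server.register_function(stop_task)
    server.register_function(update_task)
    # server.serve_forever()
    # The TestManager (or a user) would then call e.g. stop_task("matching_log_monitor")
    # when real-time matching-log data is not needed.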

  21. Post-EOD data
    ● Checkpoints from the TestManager tool
    ● System and hardware usage stats
    ● Essential internal metrics from the system under test

  22. What’s wrong with system logs?
    Bias: logs should be human friendly. Example excerpt:
    ...
    ~|=============================================================================
    ~|Disk I/O statistics
    ~|=============================================================================
    ~|Device Reads/sec Writes/sec AvgQSize AvgWait AvgSrvceTime
    ~|sda 0.0 ( 0.0kB) 4.1 ( 22.4kB) 0.0 0.0ms 0.0ms
    ~|sdb 0.0 ( 0.0kB) 0.0 ( 0.0kB) 0.0 0.0ms 0.0ms
    ~|sdc 0.0 ( 0.0kB) 10.7 ( 70.5kB) 0.0 0.0ms 0.0ms
    20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
    ~|=============================================================================
    ~|MemTotal: 263868528 kB
    ~|MemFree: 252390192 kB
    ...

  23. What’s wrong with system logs?
    Not standardized – the same metric changes format between releases:
    Release 1:
    Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),
    Release 2:
    Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
    Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0

  24. How to deal with creative loggers?
    ● Accept the reality
    ● No one will change the log format just for you
    ● No one will ask you prior to a log format change
    ● Regexp-ish patterns are our “best friends” (see the sketch after this list)
    ● Automatic log format analysis
    UNKNOWN METRIC DETECTED:
    [SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
    KNOWN METRICS:
    [SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
    [SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
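    A minimal sketch of matching metrics with regex-like patterns and flagging anything no pattern covers (the UNKNOWN METRIC case above); the patterns are written against the sample lines only and are not the actual rule set:

    import re

    KNOWN_PATTERNS = [
        re.compile(r"\[SystemComponent:\d+\]: AToB \| Rate=(?P<rate>[\d.]+)"),
        re.compile(r"\[SystemComponent:\d+\]: ABToWorker \| Rate=(?P<rate>[\d.]+)"),
    ]

    def classify(line):
        # Return the parsed metric if a known pattern matches, otherwise flag it.
        for pattern in KNOWN_PATTERNS:
            match = pattern.search(line)
            if match:
                return "known", match.groupdict()
        return "unknown", None      # report as UNKNOWN METRIC DETECTED

    # classify("[SystemComponent:1]: A To B | Rate=0.00 ...") -> ("unknown", None)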

  25. Where to store and how long?
    ● Data is sensitive and should be stored on the client’s side
    ● Data volume is huge for the limited hardware resources in the test environment
    ● Data retention (a cleanup sketch follows below):
    ○ Current data (retained for 2 weeks): HW stats, system metrics, system configs, traffic
    ○ Historical data: anonymous production data, system configs, aggregated test reports
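    A sketch of a retention sweep for the current-data store under the two-week policy above; the paths and the aggregation step are assumptions:

    import os
    import time

    RETENTION_SEC = 14 * 24 * 3600            # two weeks of "current" data

    def sweep(current_dir="/data/current"):
        # Files older than the retention window are aggregated into the
        # historical store (not shown) and then removed from the current store.
        cutoff = time.time() - RETENTION_SEC
        for name in os.listdir(current_dir):
            path = os.path.join(current_dir, name)
            if os.path.getmtime(path) < cutoff:
                # aggregate_into_history(path)   # hypothetical aggregation step
                os.remove(path)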

  26. How to use?
    ● Reporting
    ● Analysis
    ● Test improvement

  27. Reporting

  28. Software Testing is Relentless Learning