Slide 1

Slide 1 text

Build Software to Test Software exactpro.com Defects mining in Exchange trading systems 08/11/2018 Pavel Medvedev, Stanislav Klimakov, Mikhail Yamkovy

Slide 2

Slide 2 text

Contents
- Exactpro company overview
- Intro into trading Exchange systems
- Testing approach
- Creating and handling load profile
- Performance testing
- Resilience testing
- Resilience in market infrastructures
- Automation of resilience testing
- Defects Mining in test data
- Challenges of proprietary software testing in the client’s environment
- Monitoring tools deployment
- Data collection
- Data storage and analysis

Slide 3

Slide 3 text

EXACTPRO: Build Software to Test Software
• A specialist firm focused on functional and non-functional testing of exchanges, clearing houses, depositories and other market infrastructures
• Incorporated in 2009 with 10 people, the company has grown significantly as satisfied clients required more services, and now employs 550 specialists
• Part of London Stock Exchange Group (LSEG) from May 2015 until January 2018; management buyout from LSEG in January 2018
• We provide software testing services for the mission-critical technology that underpins global financial markets. Our clients are regulated by the FCA, the Bank of England and their counterparts in other countries

Slide 4

Slide 4 text

We have a global software Quality Assurance client network

Slide 5

Slide 5 text

Trading system types
• Proprietary Trading & HFT
• Brokerage
• Execution Venue

Slide 6

Slide 6 text

Typical requirements for Exchange system
● Daily capacity - 200+ mln transactions
● Peak rates - 40,000 transactions per second
● Average round-trip latency - dozens of microseconds
● Availability - 100%

Slide 7

Slide 7 text

Typical requirements for Exchange system
● Daily capacity - 100+ mln transactions
● Peak rates - 40k+ transactions per second
● Average round-trip latency - <100 microseconds
● Availability - 100%
(Illustration labels: 3000 trx, 2.5 cm, <1 mm)

Slide 8

Slide 8 text

Defining NFT

Slide 9

Slide 9 text

Defining NFT
Non-functional testing answers the question “HOW”

Slide 10

Slide 10 text


Slide 11

Slide 11 text

Non-Functional Testing

Slide 12

Slide 12 text

Test Coverage – Exitus Acta Probat

Slide 13

Slide 13 text

Exchange testing common scheme

Slide 14

Slide 14 text

Test preparation

Slide 15

Slide 15 text

Test results analysis

Do we actually send what we thought we sent?

• Evaluation of message rate per millisecond and of order mix balance

Message rate per millisecond (% of samples):

Number of msgs per millisecond   Inbound (into System)   Outbound (from System)
<5                               55.64%                  55.01%
5-8                              3.67%                   4.05%
8-10                             2.60%                   2.77%
10-15                            5.32%                   5.39%
15-20                            5.88%                   5.95%
20-80                            26.85%                  26.78%
>80                              0.05%                   0.05%

Order mix, Partition 1 (share per ME core):

Message Type   Core 0   Core 1   Core 2   Core 3   Total
Order          3.74%    3.02%    2.00%    4.14%    12.89%
Cancel         3.56%    2.89%    1.93%    4.02%    12.39%
Amend          0.60%    0.53%    0.34%    0.68%    2.16%
Quote          0.32%    0.11%    0.16%    0.27%    0.85%
Trades         0.24%    0.18%    0.13%    0.29%    0.84%

• Internal monitoring stats arbitration:
- Matching Engine’s NEW_ORDERS, CANCELS, AMENDS, etc – rates per second and total amount of transactions

MatchingEngine | NEW | Total=11896058 (2608833,3126952,3532034,2628239), Current=430 (85,103,141,101), Peak=2728 (721,661,746,600)
MatchingEngine | AMEND | Total=45509 (9493,12145,13535,10336), Current=1 (0,0,1,0), Peak=11 (5,5,6,5)
MatchingEngine | CANCEL | Total=9350063 (1957683,2492535,2784674,2115171), Current=357 (72,83,115,87), Peak=2086 (400,565,627,494)
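A minimal sketch of how a per-millisecond rate distribution like the one in the table could be computed from captured message timestamps. The function name, the microsecond timestamp input, the treatment of idle milliseconds as zero-rate samples and the inclusive lower bucket bounds are all assumptions; the slides do not describe the actual tool.

```python
from collections import Counter

# Bucket labels follow the table; each bucket includes its lower bound and
# excludes its upper bound (an assumption, the slide does not say).
BUCKETS = [("<5", 0, 5), ("5-8", 5, 8), ("8-10", 8, 10), ("10-15", 10, 15),
           ("15-20", 15, 20), ("20-80", 20, 80), (">80", 80, float("inf"))]


def rate_distribution(timestamps_us):
    """Return {bucket label: % of one-millisecond samples}.

    timestamps_us: message timestamps in microseconds (hypothetical input).
    """
    if not timestamps_us:
        return {label: 0.0 for label, _, _ in BUCKETS}
    per_ms = Counter(ts // 1000 for ts in timestamps_us)
    samples = max(per_ms) - min(per_ms) + 1
    # Milliseconds without any message count as samples with rate 0.
    hist = Counter({"<5": samples - len(per_ms)})
    for count in per_ms.values():
        for label, low, high in BUCKETS:
            if low <= count < high:
                hist[label] += 1
                break
    return {label: 100.0 * hist[label] / samples for label, _, _ in BUCKETS}
```

Running the same function over the inbound and the outbound capture gives the two columns to compare.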

Slide 16

Slide 16 text

Latency end-to-end

Latency percentiles:

%       avg   max
100     82    518
99.99   82    408
99.9    82    139
99      80    103
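One way such a table could be produced, as a sketch only: for each percentile level, keep the fastest p% of samples (nearest-rank cut) and report their average and maximum. This reading of the "avg" and "max" columns is an assumption, as are the function name and the default levels.

```python
import math


def percentile_rows(latencies, levels=(100, 99.99, 99.9, 99)):
    """Return [(level, avg, max)] over the fastest level% of samples."""
    ordered = sorted(latencies)
    if not ordered:
        raise ValueError("no latency samples")
    rows = []
    for level in levels:
        # Nearest-rank cut: keep the smallest ceil(level% of N) samples.
        keep = max(1, math.ceil(len(ordered) * level / 100))
        head = ordered[:keep]
        rows.append((level, sum(head) / keep, head[-1]))
    return rows
```

With this definition the "max" column at level p is the p-th percentile itself, which is why it drops sharply once the slowest outliers are excluded.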

Slide 17

Slide 17 text

Daily life cycle
• DLC test
The test is executed in conjunction with the Functional test team.
– Pass the system through a Production-like schedule:
  • All trading cycles
  • All scheduled sessions
– Apply appropriate load during the various phases
– Perform some functional tests under load
– Data consistency check
  • reconcile output from various sources
  • check data for integrity

Slide 18

Slide 18 text

Other Non-Functional tests
• Rapid user actions tests (connect-disconnect, logon-logout)
– System should withstand such user behavior
– HW resource consumption should not grow
• Slow consumer tests
– System should handle such users and should have protection against them
– HW resource consumption should not grow
• Intensive usage of recovery channels
– System should be able to handle a high number of requests on recovery channels and should be able to satisfy them
• Massive actions from Market Operations (mass order cancels, mass trade cancels, mass instrument halts)
– System should handle Market Operations’ actions such as a mass cancel of 10k active orders or trades
• Resilience tests

Slide 20

Slide 20 text

Resilience?

Slide 21

Slide 21 text

Financial infrastructures
• Exchanges
• Broker systems
• Clearing agencies
• Ticker plants
• Surveillance systems

Risks associated with financial infrastructure outage:
• Lost profit
• Data loss
• Damaged reputation

Slide 22

Slide 22 text

Distributed high-performance computing
• Bare-metal servers (no virtualization)
• Horizontal scalability
• Redundancy (absence of single point of failure)

Slide 23

Slide 23 text

Resilience tests
● Hardware outages
  ○ Network equipment failovers (Switches, Ports, Network adapters)
  ○ Server isolations
● Software outages
  ○ Simulation of various outage types (SIGKILL, SIGSTOP)
  ○ Failovers during different system states (at startup / trading day / during auction)

Slide 24

Slide 24 text

What cases to test?
• Failover – failure of active primary instance (standby becomes active)
• Failback – failure of active standby instance
• Standby failure – failure of passive standby instance
• Double failure – simultaneous failure of both instances
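The four cases can be written down as a small classification function, which is handy when a report has to label each outage step. This is a sketch: the function name and the "primary"/"standby" strings are hypothetical, and a failure of a passive primary is left unclassified because the slide does not name it.

```python
def classify_outage(failed, active):
    """Name the resilience case for one outage.

    failed: set of instances that went down, drawn from {"primary", "standby"}
    active: the instance that was active just before the outage
    """
    if failed == {"primary", "standby"}:
        return "double failure"
    if failed == {active}:
        # The active instance died: failover if it was the primary,
        # failback if the standby had already taken over.
        return "failover" if active == "primary" else "failback"
    if failed == {"standby"}:
        return "standby failure"
    return "unclassified"
```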

Slide 25

Slide 25 text

What tools do we use for resilience testing?
• Test-manager with DSL scenario language
• System monitoring tools
• Load injection tool
• Traffic capturing and parsing tools
• Tools for data storage, visualisation and analysis

Slide 26

Slide 26 text

What kind of data is useful to analyse test results?
• System metrics of all servers and all components (processes)
• Captured traffic of injected load and system responses
• Log files of the system

Slide 27

Slide 27 text


Slide 28

Slide 28 text

Defects mining in collected data

Log files
● Log entries per second
● Warnings per second
● Errors per second

Captured traffic
● Transaction statistics
● Response time (latency)
● Throughput

System statistics
● Disk usage
● RAM usage
● CPU usage
● Network stats

Slide 29

Slide 29 text

Avoiding «dark data»
Symptoms of «dark data» disease:
● Collecting data «just in case» without knowing its actual purpose
● Storing an excessive amount of historical data (in non-aggregated form) from previous test runs

Slide 30

Slide 30 text

Overnight low touch testing
● Testing is performed without human participation
● Human-friendly reports
● Data is our main value. Non-aggregated data is stored until the report has been seen by a QA engineer, in case a more detailed investigation is needed afterwards

Workflow:
1. Prepare environment and test tools (performed by human)
2. Test execution, with real-time data collection and processing (performed by machine)
3. Final report evaluation (performed by human)

Slide 31

Slide 31 text

Rules and thresholds

ALERT:
  METRIC : RSS
  GROWTH : 1GB
  TIME   : 10 MIN

ALERT:
  METRIC : DISK
  GROWTH : 10%
  TIME   : 1 HOUR

Chart: Server: MP101, Process: MatchingEngine Primary, Metric: RSS (resident set size)
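A rule of this kind (metric, allowed growth, time window) can be evaluated with a sliding window over the collected samples. The sketch below is an assumption about how such a check might work, not the actual validator; `GrowthRule`, `check_growth` and the `relative` flag (for percentage rules such as the DISK one) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class GrowthRule:
    metric: str
    max_growth: float       # absolute units, or percent when relative=True
    window_s: int           # sliding window length in seconds
    relative: bool = False


def check_growth(rule, samples):
    """Return [(timestamp, growth)] for samples that break the rule.

    samples: list of (timestamp_s, value) tuples sorted by timestamp.
    Growth is measured against the lowest value inside the window.
    """
    alerts = []
    start = 0
    for end, (ts, value) in enumerate(samples):
        while ts - samples[start][0] > rule.window_s:
            start += 1
        low = min(v for _, v in samples[start:end + 1])
        growth = value - low
        if rule.relative:
            if low <= 0:
                continue  # relative growth is undefined for such a baseline
            growth = 100.0 * growth / low
        if growth >= rule.max_growth:
            alerts.append((ts, growth))
    return alerts
```

For example, `GrowthRule("RSS", 1024, 600)` with values in megabytes corresponds to the first alert on the slide.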

Slide 32

Slide 32 text

Spikes and stairs detection
Chart: Server: OE102, Process: FixGateway Standby, Metric: RSS (resident set size)

Slide 33

Slide 33 text

Spikes and stairs detection
Chart: Server: CA104, Process: TransactionRouter Primary, Metric: CPU usage
Example:
• A CPU usage spike happened on the TransactionRouter component at ~11:49
• Most likely the last scenario step performed before 11:49 caused that spike
• Information about this abnormality and the steps that produced it will be included in the final report
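The slides do not say which algorithm is used for detection. As one illustrative approach (an assumption, not the authors' method), a sample can be compared with the median of the window before it and the median of the window starting at it: a jump that does not move the following median is a spike, a jump that does is a stair.

```python
from statistics import median


def classify_anomalies(values, window=5, threshold=3.0):
    """Return [(index, kind)] with kind "spike" or "stair".

    window and threshold are hypothetical tuning parameters; the last
    `window - 1` samples are not classified because they lack look-ahead.
    """
    events = []
    i = window
    while i + window <= len(values):
        before = median(values[i - window:i])
        after = median(values[i:i + window])
        if abs(values[i] - before) >= threshold:
            if abs(after - before) >= threshold:
                events.append((i, "stair"))
                i += window  # skip ahead so one step is reported once
                continue
            events.append((i, "spike"))
        i += 1
    return events
```

The index of each event can then be matched against scenario step timestamps to find the step that most likely caused it.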

Slide 34

Slide 34 text

Data reconciliation checks
● Consistency across different data streams
  ○ Client’s messages
  ○ Public market data
  ○ Aggregated market data
● Consistency between data streams and system’s database
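A reconciliation check between two streams boils down to matching records by a shared key and comparing selected fields. The sketch below assumes records are dictionaries keyed by a hypothetical `trade_id` with `price` and `qty` fields; real stream formats will differ.

```python
def reconcile(stream_a, stream_b, key="trade_id", fields=("price", "qty")):
    """Compare two lists of record dicts and report inconsistencies."""
    a = {record[key]: record for record in stream_a}
    b = {record[key]: record for record in stream_b}
    return {
        # Present in one stream but absent from the other.
        "missing_in_b": sorted(a.keys() - b.keys()),
        "missing_in_a": sorted(b.keys() - a.keys()),
        # Present in both, but at least one compared field differs.
        "mismatched": sorted(
            k for k in a.keys() & b.keys()
            if any(a[k][f] != b[k][f] for f in fields)
        ),
    }
```

The same function can compare a data stream against rows read from the system's database once both are mapped to the same record shape.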

Slide 35

Slide 35 text

DSL scenario example

start load 3000

# Case 1: Failover of MatchingEngine Primary
kill -9 primary MatchingEngine
smoke
start primary MatchingEngine

# Case 2: Failback of MatchingEngine Standby
kill -9 standby MatchingEngine
smoke
start standby MatchingEngine

stop load
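To show how little machinery such a scenario language needs, here is a sketch of a parser that turns the scenario text into a list of steps. The grammar (one command per line, `#` comments used as labels) is inferred from this single example; the real Test Manager DSL is not specified in the slides.

```python
SCENARIO = """\
start load 3000

# Case 1: Failover of MatchingEngine Primary
kill -9 primary MatchingEngine
smoke
start primary MatchingEngine

# Case 2: Failback of MatchingEngine Standby
kill -9 standby MatchingEngine
smoke
start standby MatchingEngine

stop load
"""


def parse_scenario(text):
    """Return a list of steps; each step carries the most recent comment label."""
    steps = []
    label = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            label = line.lstrip("#").strip()
            continue
        command, *args = line.split()
        steps.append({"label": label, "command": command, "args": args})
    return steps
```

Keeping the label on every step lets the final report name the case during which an anomaly was observed.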

Slide 36

Slide 36 text

Report produced by Test Manager

Slide 37

Slide 37 text

What do we get?

Pros:
• Better test coverage in comparison with manual execution
• Test environments used 24/7 (an idle system does not help to find issues)
• Effort goes into test coverage and tool improvement rather than test execution

Cons:
• Test harness needs constant support
• Improving automated scenarios requires higher tester qualification
• Validators may pass an issue that a tester could have noticed in real time
• Test cases and data analysis methods need regular review (to prevent the pesticide paradox)

Slide 39

Slide 39 text

Introduction
● Challenges of proprietary software testing in the client’s environment
● Monitoring tools deployment
● Data collection challenges
● Data storage and analysis challenges

Slide 40

Slide 40 text

Production and production-like
● Legacy: stable, trusted, suitable to work with a particular system
● No ability to make changes in runtime
● No Docker, AppImage and other handy tools
● Portable tools are everything

Slide 41

Slide 41 text

Proprietary software in the client’s environment
● Incomplete specification
● Unknown data exchange and storage formats
● Access and other restrictions
Diagram labels: Gateway, Sequencer, Matching, MarketData, Test DB; each component is marked "in? out?"; protocols: FIX, ITCH, internal system messages

Slide 42

Slide 42 text

What kinds of data do we need and when?
● Pre-SOD: system snapshots and backups
● Real-time: system metrics for active testing
● Post-EOD: log data for passive testing and results analysis

Slide 43

Slide 43 text

How to collect data in real-time?
● Use of available system tools
● Use of monitoring provided by a proprietary software vendor
● Use of third party monitoring tools

Slide 44

Slide 44 text

How about reinventing the wheel?
● Independent
● Incorporates all the features we need in one tool
● Remote controlled
● Support of different output formats: protobuf, json, raw binary data
● Support of multiple data consumers with different visibility
● Delivers data on a need-to-know basis only
● Uniform data format across all environments
● Low footprint

Slide 45

Slide 45 text

Downsides of the brand new bicycle
● Green code: not well tested in the field
● Requires additional resources for support
● Solves only a particular problem

Slide 46

Slide 46 text

Who should receive real-time data?
● Different tests require dozens of different metrics
● A tester is not able to track all the changes
● All the data should be analyzed on the fly
● Test behaviour should be changed depending on the received data

Slide 47

Slide 47 text

High level view on real-time monitoring

Hosts shown: Management Server, QA Server, Server 1, Server 2, ..., Server N

Components:
● Daemon_S (Daemon_S1, Daemon_S2, ..., Daemon_SN): collecting system info, logs parsing, commands execution
● Daemon_M: collecting system info, logs parsing, commands execution
● Daemon_I: load control and test scripts execution
● Router: communication between daemons and controllers
● TM (TestManager): automated execution of test scenarios, collecting and processing test information
● Data Processor: transform, collect and store data for future use

Data flows: data visualisation and reporting; data storage and analysis; data analysis and management

Slide 48

Slide 48 text

Passive monitoring

Diagram:
● Management Server: TestManager, Data Processor, Router
● Matching Server: Daemon MEP, MatchingEnginePrimary, Matching log
● Monitoring Server: Daemon MON, System events log, System metrics log

Messages:
MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}
MatchingEnginePrimary {STATE: READY}
MatchingEnginePrimary {INTERNAL LATENCY: 10}
System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}

● The MON Daemon collects system metrics and messages
● The MEP Daemon parses the matching log and provides the router with up-to-date system info
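On the consuming side, messages in the `Source {Key: Value, ...}` shape shown on the slide have to be turned back into structured data. A minimal sketch, assuming that shape is the actual wire text (the slides also mention protobuf, json and raw binary outputs, which would not need this step); `parse_message` is a hypothetical helper.

```python
import re

# Source name, then a brace-delimited list of "Key: Value" pairs.
MESSAGE_RE = re.compile(r"^(?P<source>[\w ]+?)\s*\{(?P<body>[^}]*)\}$")


def parse_message(text):
    """Return (source, {key: value}) with all values kept as strings."""
    match = MESSAGE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised message: {text!r}")
    fields = {}
    for pair in match.group("body").split(","):
        if not pair.strip():
            continue
        name, _, value = pair.partition(":")
        fields[name.strip()] = value.strip()
    return match.group("source"), fields
```

Values are left as strings on purpose: units such as `MB` and `%` vary per metric, so conversion belongs to the rule that consumes the metric.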

Slide 49

Slide 49 text

Active monitoring

Diagram:
● Management Server: TestManager, Data Processor, Router
● Matching Server: Daemon MEP, MatchingEnginePrimary, Matching log
● Monitoring Server: Daemon MON, System events log, System metrics log

RPC call: Stop matching log monitor

Message:
System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}

When real-time data is not required, a user or an automated scenario can stop or update a task for an active monitor to reduce system load.

Slide 50

Slide 50 text

Post-EOD data
● Checkpoints from the TestManager tool
● System and hardware usage stats
● Essential internal metrics from the system under test

Slide 51

Slide 51 text

What’s wrong with system logs?
Bias: logs should be human friendly

...
~|=============================================================================
~|Disk I/O statistics
~|=============================================================================
~|Device Reads/sec Writes/sec AvgQSize AvgWait AvgSrvceTime
~|sda 0.0 ( 0.0kB) 4.1 ( 22.4kB) 0.0 0.0ms 0.0ms
~|sdb 0.0 ( 0.0kB) 0.0 ( 0.0kB) 0.0 0.0ms 0.0ms
~|sdc 0.0 ( 0.0kB) 10.7 ( 70.5kB) 0.0 0.0ms 0.0ms
20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
~|=============================================================================
~|MemTotal: 263868528 kB
~|MemFree: 252390192 kB
...
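A human-friendly log like this first has to be regrouped into records before anything can be measured. The sketch below assumes, from this one sample only, that a record starts with a `timestamp|code|kind|title` header and that lines beginning with `~|` continue the previous record; both function names are hypothetical.

```python
def group_records(lines):
    """Attach "~|" continuation lines to the preceding header record."""
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("~|"):
            if records:  # continuation before any header is dropped
                records[-1]["body"].append(line[2:])
            continue
        parts = line.split("|", 3)
        if len(parts) != 4:
            continue  # not a header in the assumed format
        timestamp, code, kind, title = parts
        records.append({"timestamp": timestamp, "code": code,
                        "kind": kind.strip(), "title": title, "body": []})
    return records


def key_values(record):
    """Extract "Name: value" rows from a record body, e.g. /proc/meminfo lines."""
    out = {}
    for row in record["body"]:
        name, sep, value = row.partition(":")
        if sep and name.strip():
            out[name.strip()] = value.strip()
    return out
```

Tabular bodies such as the disk I/O block need their own column parser; only the grouping step is shared.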

Slide 52

Slide 52 text

What’s wrong with system logs?
Not standardized

Release 1:
Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),

Release 2:
Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0
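One way to survive such format changes is to avoid matching the whole line and to extract only `name=number` pairs, which look the same in both releases. This is a sketch of that idea, not the parser used in the project; `extract_metrics` is a hypothetical name and the three leading pipe-separated fields are assumed from the samples.

```python
import re

# Any "name=number" pair, regardless of the brackets or separators around it.
PAIR_RE = re.compile(r"(\w+)=(-?\d+(?:\.\d+)?)")


def extract_metrics(line):
    """Return (component, metric_name, {name: float}) for one log line."""
    parts = [part.strip() for part in line.split("|")]
    if len(parts) < 4:
        raise ValueError(f"unexpected line layout: {line!r}")
    component, name = parts[1], parts[2]
    payload = "|".join(parts[3:])
    values = {key: float(value) for key, value in PAIR_RE.findall(payload)}
    return component, name, values
```

The metric name still differs between releases ("Transfer Queue" versus "Transfer Queue from Rcv Thread to Main Thread"), so a name mapping has to be maintained separately.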

Slide 53

Slide 53 text

How to deal with creative loggers?
● Accept the reality
● No one will change log format just for you
● No one will ask you prior to log format change
● Regexp-like patterns are our “best friends”
● Automatic log formats analysis

UNKNOWN METRIC DETECTED:
[SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)

KNOWN METRICS:
[SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
[SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
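Automatic format analysis can be approximated by reducing each line to a shape signature and flagging shapes that have not been seen before. The sketch below is an assumed approach, not the project's implementation: numbers are replaced by `#` and whitespace is normalised, so two lines with the same layout but different values share a signature.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")


def signature(line):
    """Collapse a log line to its shape: numbers become '#', spaces are squeezed."""
    return re.sub(r"\s+", " ", NUMBER_RE.sub("#", line)).strip()


def find_unknown(lines, known_signatures):
    """Return the lines whose shape is not in the set of known signatures."""
    return [line for line in lines if signature(line) not in known_signatures]
```

The known set can be built from a reference run of the previous release, so that a renamed or re-bracketed metric is reported on the first run against the new build.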

Slide 54

Slide 54 text

Where to store and how long?
● Data is sensitive and should be stored on the client’s side
● Data volume is huge for limited hardware resources in the test environment
● Data retention

Current data:
● HW stats
● System metrics
● System configs
● Traffic

Historical data:
● Anonymous production data
● System configs
● Aggregated test reports

Timeline marker: 2 weeks

Slide 55

Slide 55 text

How to use?
● Reporting
● Analysis
● Tests improvement

Slide 56

Slide 56 text

Reporting

Slide 57

Slide 57 text

Reporting

Slide 58

Slide 58 text

Analysis

Slide 59

Slide 59 text

Tests improvement
● Comparison of test conditions
● Comparison of test results
● Inspect historical data to introduce more realistic scenarios

Slide 60

Slide 60 text

Software Testing is Relentless Learning