
Defects mining in Exchange trading systems

Exactpro
November 08, 2018

Data Fest Tbilisi 2018, 8 November 2018

Pavel Medvedev, NFT Director, Exactpro
Mikhail Yamkovy, Senior NFT Analyst, Exactpro
Stanislav Klimakov, Senior NFT Analyst, Exactpro

Exactpro website: https://exactpro.com/
Follow us:
FB https://www.facebook.com/exactpro/
Twitter https://twitter.com/exactpro
LinkedIn https://www.linkedin.com/company/exactpro-systems-llc
Instagram https://www.instagram.com/exactpro/
Vimeo https://vimeo.com/exactpro
Youtube https://youtube.com/exactprosystems

Transcript

  1. Build Software to Test Software exactpro.com Defects mining in Exchange

    trading systems 08/11/2018 Pavel Medvedev, Stanislav Klimakov, Mikhail Yamkovy
  2. 2 Build Software to Test Software exactpro.com Contents

    - Exactpro company overview
    - Intro to Exchange trading systems
    - Testing approach
    - Creating and handling a load profile
    - Performance testing
    - Resilience testing
    - Resilience in market infrastructures
    - Automation of resilience testing
    - Defect mining in test data
    - Challenges of proprietary software testing in the client’s environment
    - Monitoring tools deployment
    - Data collection
    - Data storage and analysis
  3. 3 Build Software to Test Software exactpro.com EXACTPRO Build Software

    to Test Software • A specialist firm focused on functional and non-functional testing of exchanges, clearing houses, depositories and other market infrastructures • Incorporated in 2009 with 10 people, the company has experienced significant growth as satisfied clients require more services; it now employs 550 specialists • Part of London Stock Exchange Group (LSEG) from May 2015 until January 2018, when Exactpro’s management bought the company out of LSEG • We provide software testing services for the mission-critical technology that underpins global financial markets. Our clients are regulated by the FCA, the Bank of England and their counterparts in other countries.
  4. 4 Build Software to Test Software exactpro.com We have a

    global software Quality Assurance client network
  5. 5 Build Software to Test Software exactpro.com Trading systems types

    Proprietary Trading & HFT Brokerage Execution Venue
  6. 6 Build Software to Test Software exactpro.com Typical requirements for

    Exchange system • Daily capacity - 200+ mln transactions • Peak rates - 40,000 transactions per second • Average round-trip latency - dozens of microseconds • Availability - 100%
  7. 7 Build Software to Test Software exactpro.com Typical requirements for

    Exchange system • Daily capacity - 100+ mln transactions • Peak rates - 40k+ transactions per second • Average round-trip latency - <100 microseconds • Availability - 100% (slide illustration: 3000 trx, 2.5 cm, <1 mm)
  8. 15 Build Software to Test Software exactpro.com Test results analysis

    Do we actually send what we think we send? • Evaluation of the message rate per millisecond and of the order mix balance • Internal monitoring stats arbitration: the Matching Engine’s NEW_ORDERS, CANCELS, AMENDS, etc. – rates per second and total number of transactions

    MatchingEngine | NEW    | Total=11896058 (2608833,3126952,3532034,2628239), Current=430 (85,103,141,101), Peak=2728 (721,661,746,600)
    MatchingEngine | AMEND  | Total=45509 (9493,12145,13535,10336), Current=1 (0,0,1,0), Peak=11 (5,5,6,5)
    MatchingEngine | CANCEL | Total=9350063 (1957683,2492535,2784674,2115171), Current=357 (72,83,115,87), Peak=2086 (400,565,627,494)

    Message rate per millisecond:

    Msgs per ms | % samples, Inbound (into System) | % samples, Outbound (from System)
    <5          | 55.64%                           | 55.01%
    5-8         |  3.67%                           |  4.05%
    8-10        |  2.60%                           |  2.77%
    10-15       |  5.32%                           |  5.39%
    15-20       |  5.88%                           |  5.95%
    20-80       | 26.85%                           | 26.78%
    >80         |  0.05%                           |  0.05%

    Order mix per Matching Engine core (Partition 1):

    Message Type | core 0 | core 1 | core 2 | core 3 | Total
    Order        | 3.74%  | 3.02%  | 2.00%  | 4.14%  | 12.89%
    Cancel       | 3.56%  | 2.89%  | 1.93%  | 4.02%  | 12.39%
    Amend        | 0.60%  | 0.53%  | 0.34%  | 0.68%  |  2.16%
    Quote        | 0.32%  | 0.11%  | 0.16%  | 0.27%  |  0.85%
    Trades       | 0.24%  | 0.18%  | 0.13%  | 0.29%  |  0.84%
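The per-millisecond rate table above can be produced by bucketing message timestamps into 1 ms slots and counting how often each rate range occurs. A minimal Python sketch of the idea; the bucket boundaries mirror the table, but the function name and half-open bucket semantics are illustrative, not the actual tooling:

```python
from collections import Counter

# Half-open buckets mirroring the slide's "msgs per millisecond" table
BUCKETS = ((0, 5), (5, 8), (8, 10), (10, 15), (15, 20), (20, 80), (80, float("inf")))

def rate_distribution(timestamps_ms, buckets=BUCKETS):
    """Count messages per 1 ms slot, then report which fraction of the
    sampled milliseconds falls into each rate bucket."""
    per_ms = Counter(int(t) for t in timestamps_ms)   # msgs seen in each slot
    slots = list(per_ms.values())
    return {
        (lo, hi): (sum(1 for n in slots if lo <= n < hi) / len(slots) if slots else 0.0)
        for lo, hi in buckets
    }
```

Running the same computation over both the injected (inbound) and captured (outbound) streams makes the two distribution columns directly comparable.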
  9. 16 Build Software to Test Software exactpro.com Latency end-to-end

    Latency percentiles:

    Percentile | avg (µs) | max (µs)
    100        | 82       | 518
    99.99      | 82       | 408
    99.9       | 82       | 139
    99         | 80       | 103
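Percentile tables of this kind are commonly computed with a nearest-rank method over the recorded round-trip samples. A hedged sketch of that computation only, not the actual analysis code:

```python
import math

def latency_percentile(samples, pct):
    """Nearest-rank percentile: the smallest observed latency such that
    `pct` percent of samples are less than or equal to it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no latency samples")
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]
```

Nearest-rank is preferred over interpolation here because it always returns a latency that was actually observed, which matters when chasing tail outliers.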
  10. 17 Build Software to Test Software exactpro.com Daily life cycle

    • DLC test – executed in conjunction with the functional test team: – Run the system through a production-like schedule: • All trading cycles • All scheduled sessions – Apply appropriate load during the various phases – Perform some functional tests under load – Data consistency check: • reconcile output from various sources • check data for integrity
  11. 18 Build Software to Test Software exactpro.com Other Non-Functional tests

    • Rapid user action tests (connect-disconnect, logon-logout) – The system should withstand such user behavior – HW resource consumption should not grow • Slow consumer tests – The system should handle such users and should have protection against them – HW resource consumption should not grow • Intensive usage of recovery channels – The system should be able to handle a high number of requests on recovery channels and be able to satisfy them • Massive actions from Market Operations (mass order cancels, mass trade cancels, mass instrument halts) – The system should handle Market Operations actions such as a mass cancel of 10k active orders or trades • Resilience tests
  12. Build Software to Test Software exactpro.com Defects mining in Exchange

    trading systems 08/11/2018 Pavel Medvedev, Stanislav Klimakov, Mikhail Yamkovy
  13. 21 Build Software to Test Software exactpro.com Financial infrastructures •

    Exchanges • Broker systems • Clearing agencies • Ticker plants • Surveillance systems Risks associated with financial infrastructure outage: • Lost profit • Data loss • Damaged reputation
  14. 22 Build Software to Test Software exactpro.com Distributed high-performance computing

    • Bare-metal servers (no virtualization) • Horizontal scalability • Redundancy (absence of single point of failure)
  15. 23 Build Software to Test Software exactpro.com Resilience tests •

    Hardware outages ◦ Network equipment failovers (Switches, Ports, Network adapters) ◦ Server isolations • Software outages ◦ Simulation of various outage types (SIGKILL, SIGSTOP) ◦ Failovers during different system state (at startup / trading day / during auction)
  16. 24 Build Software to Test Software exactpro.com • Failover –

    failure of active primary instance (standby becomes active) • Failback – failure of active standby instance • Standby failure – failure of passive standby instance • Double failure – simultaneous failure of both instances What cases to test?
  17. 25 Build Software to Test Software exactpro.com What tools do we

    use for resilience testing? • Test-manager with a DSL scenario language • System monitoring tools • Load injection tool • Traffic capturing and parsing tools • Tools for data storage, visualisation and analysis
  18. 26 Build Software to Test Software exactpro.com What kind of

    data is useful for analysing test results? • System metrics of all servers and all components (processes) • Captured traffic of the injected load and system responses • Log files of the system
  19. 28 Build Software to Test Software exactpro.com Defects mining in

    collected data • Log files: log entries per second, warnings per second, errors per second • Captured traffic: transaction statistics, response time (latency), throughput • System statistics: disk usage, RAM usage, CPU usage, network stats
  20. 29 Build Software to Test Software exactpro.com Avoiding «dark data»

    Symptoms of the «dark data» disease: • Collecting data «just in case», without knowing its actual purpose • Storing an excessive amount of historical data (in non-aggregated form) from previous test runs
  21. 30 Build Software to Test Software exactpro.com Overnight low touch

    testing • Testing is performed without human participation • Human-friendly reports • Data is our main value: non-aggregated data is stored until the report is seen by a QA engineer (in case a more detailed investigation is needed afterwards)

    Workflow: prepare environment and test tools (performed by human) → test execution (performed by machine) → real-time data collection and processing (performed by machine) → final report evaluation (performed by human)
  22. 31 Build Software to Test Software exactpro.com Rules and thresholds

    ALERT: METRIC: RSS,  GROWTH: 1GB, TIME: 10 MIN
    ALERT: METRIC: DISK, GROWTH: 10%, TIME: 1 HOUR

    Chart: Server: MP101, Process: MatchingEngine Primary, Metric: RSS (resident set size)
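A growth-over-window rule such as “RSS grows by more than 1 GB within 10 minutes” can be checked with a sliding window of samples. A minimal sketch, assuming samples arrive as (timestamp, value) pairs; the class and parameter names are illustrative, not the production rule engine:

```python
from collections import deque

class GrowthAlert:
    """Fires when a metric grows by more than `limit` within `window_s`
    seconds, mirroring rules like ALERT: METRIC: RSS, GROWTH: 1GB, TIME: 10 MIN."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.samples = deque()                      # (timestamp, value)

    def update(self, ts, value):
        self.samples.append((ts, value))
        # keep only samples inside the time window
        while ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()
        lowest = min(v for _, v in self.samples)
        return value - lowest > self.limit          # True => raise an alert
```

Comparing against the window minimum rather than the first sample makes the rule fire on sustained growth even when the metric briefly dips inside the window.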
  23. 32 Build Software to Test Software exactpro.com Spikes and stairs

    detection

    Chart: Server: OE102, Process: FixGateway Standby, Metric: RSS (resident set size)
  24. 33 Build Software to Test Software exactpro.com Spikes and stairs

    detection Example: • A CPU usage spike occurred on the TransactionRouter component at ~11:49 • Most likely the last scenario step executed prior to 11:49 caused the spike • Information about this abnormality and the steps that produced it is included in the final report

    Chart: Server: CA104, Process: TransactionRouter Primary, Metric: CPU usage
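Spike detection of this kind can be approximated by comparing each sample against a rolling median of the preceding window. A crude sketch; the window size and factor are illustrative thresholds, not the real detection rules:

```python
import statistics

def find_spikes(series, window=5, factor=3.0):
    """Return indexes where a sample exceeds `factor` times the median of
    the preceding `window` samples."""
    spikes = []
    for i in range(window, len(series)):
        baseline = statistics.median(series[i - window:i])
        if baseline > 0 and series[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

Correlating a flagged index with the scenario-step timeline yields the “steps that produced it” information for the final report.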
  25. 34 Build Software to Test Software exactpro.com Data reconciliation checks

    • Consistency across different data streams ◦ Client’s messages ◦ Public market data ◦ Aggregated market data • Consistency between data streams and system’s database
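A reconciliation pass can be sketched as a key-based join between two streams, flagging entries that are missing from the public feed or whose fields disagree. The field names below are illustrative; real venues use protocol-specific schemas:

```python
def reconcile(private_stream, public_stream, fields=("qty", "price")):
    """Cross-check two streams keyed by order_id: every private-channel
    message must appear in the public feed with matching fields."""
    public_by_id = {m["order_id"]: m for m in public_stream}
    issues = []
    for msg in private_stream:
        other = public_by_id.get(msg["order_id"])
        if other is None:
            issues.append((msg["order_id"], "missing in public feed"))
        elif any(msg[f] != other[f] for f in fields):
            issues.append((msg["order_id"], "field mismatch"))
    return issues
```

The same join, run against a dump of the system’s database, covers the stream-vs-database consistency check.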
  26. 35 Build Software to Test Software exactpro.com DSL scenario example

    start load 3000

    # Case 1: Failover of MatchingEngine Primary
    kill -9 primary MatchingEngine
    smoke
    start primary MatchingEngine

    # Case 2: Failback of MatchingEngine Standby
    kill -9 standby MatchingEngine
    smoke
    start standby MatchingEngine

    stop load
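A scenario language of this shape can be driven by a very small dispatcher in which each non-comment line’s first token selects a handler. This is a sketch of the idea only, not Exactpro’s test manager:

```python
def run_scenario(script, actions):
    """Execute a failover-DSL script: skip blanks and '#' comments,
    dispatch each line to actions[first_token](*rest_of_tokens)."""
    for raw in script.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        cmd, *args = line.split()
        actions[cmd](*args)
```

Handlers for `start`, `kill`, `smoke` and `stop` would wrap the load injector, process control and smoke checks; the dispatcher itself stays protocol-agnostic.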
  27. 37 Build Software to Test Software exactpro.com What do we

    get?

    Cons: • The test harness needs constant support • Higher tester qualification is required to improve automated scenarios • Validators may pass an issue that a tester could have noticed in real time • Test cases and data-analysis methods need regular review (to prevent the pesticide paradox)

    Pros: • Better test coverage in comparison with manual execution • Test environments are used 24/7 (an idle system does not help to find issues) • Effort goes into test coverage and tool improvement rather than test execution
  28. Build Software to Test Software exactpro.com Defects mining in Exchange

    trading systems 08/11/2018 Pavel Medvedev, Stanislav Klimakov, Mikhail Yamkovy
  29. 39 Build Software to Test Software exactpro.com Introduction • Challenges

    of proprietary software testing in the client’s environment • Monitoring tools deployment • Data collection challenges • Data storage and analysis challenges
  30. 40 Build Software to Test Software exactpro.com Production and production

    like • Legacy software: stable, trusted, suitable for working with a particular system • No ability to make changes at runtime • No Docker, AppImage or other handy tools • Portable tools are everything
  31. 41 Build Software to Test Software exactpro.com Proprietary software in

    the client’s environment • No complete specification • Unknown data exchange and storage formats • Access and other restrictions

    Diagram: Gateway → Sequencer → Matching → MarketData, each with unknown inputs and outputs (in? out?); protocols: FIX, ITCH, internal system messages; test DBs attached
  32. 42 Build Software to Test Software exactpro.com What kinds of

    data do we need and when? • Pre-SOD (before start of day): system snapshots and backups • Real-time: system metrics for active testing • Post-EOD (after end of day): log data for passive testing and results analysis
  33. 43 Build Software to Test Software exactpro.com How to collect

    data in real-time? • Use of available system tools • Use of monitoring provided by a proprietary software vendor • Use of third party monitoring tools
  34. 44 Build Software to Test Software exactpro.com How about reinventing

    the wheel? • Independent • Incorporates all the features we need in one tool • Remote-controlled • Support for different output formats: protobuf, json, raw binary data • Support for multiple data consumers with different visibility • Delivers data on a need-to-know basis only • Uniform data format across all environments • Low footprint
  35. 45 Build Software to Test Software exactpro.com Downsides of the

    brand new bicycle • Green code: not well tested in the field • Requires additional resources for support • Solves only a particular problem
  36. 46 Build Software to Test Software exactpro.com Who should receive

    real-time data? • Different tests require dozens of different metrics • A tester is not able to track all the changes manually • All the data should be analyzed on the fly • Test behaviour should change depending on the received data
  37. 47 Build Software to Test Software exactpro.com High level view

    on real-time monitoring

    Diagram components: • Daemon_S1…Daemon_SN (Server 1…Server N): collecting system info, log parsing, command execution • Daemon_I (QA Server): load control and test-script execution • Daemon_M and Router (Management Server): communication between daemons and controllers • TM (TestManager): automated execution of test scenarios, collecting and processing test information • Data Processor: transforms, collects and stores data for future use; data visualisation and reporting; data storage, analysis and management
  38. 48 Build Software to Test Software exactpro.com Passive monitoring

    Management Server: TestManager, Data Processor, Router. Matching Server: Daemon MEP watching MatchingEnginePrimary and its matching log. Monitoring Server: Daemon MON watching the system events log and system metrics log.

    • The MON daemon collects system metrics and messages
    • The MEP daemon parses the matching log and provides the router with actual system info

    Sample messages:
    MatchingEnginePrimary {PID: 1234, RSS: 500MB, CPU Usage: 15%}
    MatchingEnginePrimary {STATE: READY}
    MatchingEnginePrimary {INTERNAL LATENCY: 10}
    System {CPU Usage: 15%, Free Mem: 50%, Free Disk Space: 80%}
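Status lines in the `Source {Key: Value, ...}` shape shown above are easy to parse mechanically. A minimal sketch assuming exactly that format; a real daemon would be more defensive about malformed input:

```python
import re

def parse_status(line):
    """Split 'MatchingEnginePrimary {PID: 1234, RSS: 500MB}' into
    (source, {field: value}); returns None for non-matching lines."""
    m = re.fullmatch(r"(\S+) \{(.*)\}", line.strip())
    if not m:
        return None
    source, body = m.groups()
    fields = dict(part.split(": ", 1) for part in body.split(", "))
    return source, fields
```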
  39. 49 Build Software to Test Software exactpro.com Active monitoring

    When real-time data is not required, a user or an automated scenario can stop or update a task for an active monitor (via an RPC call) to reduce system load, e.g. “stop matching log monitor” sent through the Router to the MEP daemon on the Matching Server.

    Sample message:
    System {CPU Usage: 1%, Free Mem: 75%, Free Disk Space: 83%}
  40. 50 Build Software to Test Software exactpro.com Post-EOD data •

    Checkpoints from the TestManager tool • System and hardware usage stats • Essential internal metrics from the system under test
  41. 51 Build Software to Test Software exactpro.com What’s wrong with

    system logs? Bias: logs should be human-friendly

    ...
    ~|=============================================================================
    ~|Disk I/O statistics
    ~|=============================================================================
    ~|Device     Reads/sec        Writes/sec       AvgQSize  AvgWait  AvgSrvceTime
    ~|sda        0.0 (  0.0kB)    4.1 ( 22.4kB)    0.0       0.0ms    0.0ms
    ~|sdb        0.0 (  0.0kB)    0.0 (  0.0kB)    0.0       0.0ms    0.0ms
    ~|sdc        0.0 (  0.0kB)   10.7 ( 70.5kB)    0.0       0.0ms    0.0ms
    20181030074410.191|504|TEXT |System Memory Information (from /proc/meminfo)
    ~|=============================================================================
    ~|MemTotal:       263868528 kB
    ~|MemFree:        252390192 kB
    ...
  42. 52 Build Software to Test Software exactpro.com What’s wrong with

    system logs? Not standardized across releases.

    Release 1:
    Oct 30 2017 13:30:28 | SystemComponent:1 | Transfer Queue| Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Max Queues=(Pub=0, Pvt=0),

    Release 2:
    Dec 12 2017 08:10:13 | SystemComponent:1 | Transfer Queue from Rcv Thread to Main Thread | Rate=0.00 | W=0.00 | L=0.00 | Q=0.00 | T=0.00
    Dec 12 2017 08:10:13 | SystemComponent:1 | Max Queues from Rcv Thread to Main Thread | Pub=0, Pvt=0
  43. 53 Build Software to Test Software exactpro.com How to deal

    with creative loggers? • Accept the reality • No one will change the log format just for you • No one will ask you prior to a log format change • Regexp-ish patterns are our “best friends” • Automatic log format analysis

    UNKNOWN METRIC DETECTED:
    [SystemComponent:1]: A To B | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)

    KNOWN METRICS:
    [SystemComponent:1]: AToB | Rate=0.00 [W=0.00,L=0.00, Q=0.00, T=0.00], Mode=LOW_LATENCY, Max Queues=[Pub=0, Pvt=0]
    [SystemComponent:1]: ABToWorker | Rate=0.00 (W=0.00,L=0.00, Q=0.00, T=0.00), Mode=LOW_LATENCY, Max Queues=(Pub=0, Pvt=0)
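The known/unknown split can be implemented as a library of regular expressions tried against every metric line, with non-matching lines reported for human review. The patterns below cover only the two queue-stats formats shown on the slide and are illustrative; a real pattern library would be maintained per release of the system under test:

```python
import re

# Illustrative patterns for the two queue-stats formats shown above
KNOWN_PATTERNS = [
    re.compile(r"\[SystemComponent:\d+\]: \w+ \| Rate=[\d.]+ \[[^]]*\], Mode=\w+, Max Queues=\[[^]]*\]"),
    re.compile(r"\[SystemComponent:\d+\]: \w+ \| Rate=[\d.]+ \([^)]*\), Mode=\w+, Max Queues=\([^)]*\)"),
]

def classify(line):
    """Return 'known' if any pattern matches, else 'unknown metric'."""
    return "known" if any(p.search(line) for p in KNOWN_PATTERNS) else "unknown metric"
```

Here the renamed metric `A To B` fails the `\w+` name token and is surfaced as an unknown metric, which is exactly the signal a format change should produce.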
  44. 54 Build Software to Test Software exactpro.com Where to store

    and how long? • Data is sensitive and must be stored on the client’s side • The data volume is huge for the limited hardware resources in the test environment • Data retention:

    Current data (retained for 2 weeks): • HW stats • System metrics • System configs • Traffic
    Historical data: • Anonymous production data • System configs • Aggregated test reports
  45. 55 Build Software to Test Software exactpro.com How to use?

    • Reporting • Analysis • Tests improvement
  46. 59 Build Software to Test Software exactpro.com Tests improvement •

    Comparison of test conditions • Comparison of test results • Inspect historical data to introduce more realistic scenarios