EXTENT-2017: Climbing Out of the Stability Sinkhole - Survivor’s Guide

EXTENT-2017: Software Testing & Trading Technology Trends Conference
29 June, 2017, 10 Paternoster Square, London

Climbing Out of the Stability Sinkhole - Survivor’s Guide
Sergei Poliakoff, CIO, Moscow Exchange

Would you like to know more?
Visit our website: extentconf.com
Follow us:
https://www.linkedin.com/company/exactpro-systems-llc?trk=biz-companies-cym
https://twitter.com/exactpro
#extentconf
#exactpro

Exactpro

June 30, 2017

Transcript

  1. UNIQUE REAL-TIME RISK MANAGEMENT = STABILITY CHALLENGE

    Three risk management models:
    - BASIC: risk and margin calculated in batches (end-of-day or mini batch), with end-of-day and intra-day margin calls.
    - REAL-TIME POST-TRADE: risk and margin calculated post-trade in real time, with a real-time post-trade risk check and margin call.
    - REAL-TIME ORDER PRE-CLEARING: projected risk calculated before accepting the order. If the projected margin requirement breaches available margin, the order is not admitted to the matchbook.

    Diagram labels: orders, trades, hypothetical trade risk calculation (~400 µs), ~12000 orders/second, account shutoff.
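The pre-clearing rule on this slide (compute the projected margin requirement for a hypothetical trade, then refuse the order if it breaches available margin) can be sketched roughly as follows. This is a minimal illustration, not the exchange's risk model: the `Account` fields, the `margin_rate` parameter, and the flat percentage-of-notional formula are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Account:
    available_margin: float     # collateral the account can pledge (hypothetical field)
    current_requirement: float  # margin already required by open positions (hypothetical field)


def projected_requirement(account: Account, price: float, quantity: int,
                          margin_rate: float) -> float:
    """Margin requirement if the order traded in full (the hypothetical trade)."""
    # Assumption: margin is a flat fraction of notional. Real models are far richer.
    return account.current_requirement + abs(quantity) * price * margin_rate


def admit_order(account: Account, price: float, quantity: int,
                margin_rate: float = 0.1) -> bool:
    """Return True if the order may enter the matchbook, False if it is rejected."""
    projected = projected_requirement(account, price, quantity, margin_rate)
    # The order is not admitted when the projected requirement breaches available margin.
    return projected <= account.available_margin


account = Account(available_margin=1000.0, current_requirement=400.0)
small_order_admitted = admit_order(account, price=100.0, quantity=50)
large_order_admitted = admit_order(account, price=100.0, quantity=70)
```

With these illustrative numbers the smaller order fits within available margin and the larger one does not. The point of the sketch is only that the check runs before the order reaches matching, which is why it sits on the latency-critical path.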
  2. THE INGLORIOUS 2015

    Critical faults in 2015: incident, dates, solutions.

    HARDWARE FAULT DISRUPTING THE BACKUP SCHEME (August 12, September 1, September 8)
    - Hardware replacement and upgrade (< 3 years)
    - Migration to «flat» network topology
    - Network segregation
    - Human resource development in operation and maintenance department
    - New Tier III data center

    CLEARING MODULE FAULTS (January 12, March 5, June 15)
    - Segregation of Trading and Clearing modules
    - Emergency limit check scheme
    - Orders risk check model update
    - Development process improvement
    - Software Development Life Cycle practices implementation
    - Introduction of “destructive testing”
    - Testing cycle extension

    TRADE ENGINE FAULTS (September 21)
    - Trade engine cloning (as part of the trading and clearing modules segregation programme)
    - Common development process improvement

    [Timeline: critical faults in 2015 by month, January to December. Legend: H = hardware faults, CM = clearing module faults, TE = trade engine faults.]
  3. OLD SOFTWARE DEV PRACTICES COULDN’T COPE WITH COMPLEXITY => NEW DEV PROCESS

    SOFTWARE DEVELOPMENT LIFE CYCLE (SDLC)
    - Unit-like testing practice
    - Extending auto test coverage
    - Referencing changes to projects, tasks and issues
    - Regular code review practices
    - Static and dynamic code analysis
    - Continuous Integration (CI)
    - Auto deployment

    QUALITY ASSURANCE PRACTICES, 2014 vs 2017 – NEW & IMPROVED PRACTICES
    Legend: previous, improved, new. Stages: development, testing, implementation.
    - Unit tests
    - UAT (external)
    - Acceptance testing
    - Integration testing (cross-system)
    - Simulation testing
    - Destructive testing
    - Manual functional testing
    - Manual regression testing
    - Automatic functional testing
    - Automatic regression testing
    - Testing metrics
    - Static code review

    TOOLS (function: tool)
    - Auto deployment: Clarive + Ansible
    - Static code analysis: Coverity
    - Code review: GitLab, Crucible
    - Continuous integration: Jenkins
    - Bug tracking, task tracking: Serena, Jira
    - Testing lifecycle management: HP ALM
    - Dynamic code analysis (Spectra): Valgrind
    - Regression testing: PyTests
    - Dynamic code analysis (ASTS): Undefined Sanitizer, Address Sanitizer
    - Auto documentation: Doxygen

    Auto deployment coverage: 100% for real-time systems. Other tools coverage: ~80% of current releases.
  4. CAN’T SPEND 100% TIME TESTING, NEED TO DELIVER!

    RELEASE CYCLE
    - Release 1: functional task approval (5 weeks), development (13 weeks), testing (14 weeks)
    - Release 2: preparing (5 weeks), development (13 weeks), testing (14 weeks)

    After the active phase of development, developers:
    - participate in bug correction
    - start working on the tasks of the next release
    - improve methods and tools for testing
    - work on approved non-release tasks
    - work on optimization and technological development

    [Timeline: release tasks development and testing. Durations shown: 3 weeks, 2 weeks, 13 weeks, 12 weeks. Labels: approval of the release composition, deployment plan and acceptance testing working group; week of silence; functional task approval; terms of reference and design; development; bugfixing; refined functional requirements; start of development of release tasks; end of development of release tasks; release testing; new development practices; new testing practices.]
  5. THE GREAT NETWORK MELTDOWN OF AUGUST, AND WHAT WE DID ABOUT IT

    Network evolution (periods shown: 2013, 2014 - 2015, 2016-2017):
    - HIERARCHICAL NETWORK (spanning tree): main load on the root devices; network storms non-persistent.
    - OPTIMIZED FLAT NETWORK (Spine & Leaf): increased resistance to network storms.
    - SEGMENTED NETWORK (Spine & Leaf): containment of network damage in one segment.

    Diagram labels: development, testing, game stand, game segment, operational systems, test & game systems, office, servers network, external access.

    Implementing the Spine & Leaf «flat» network topology significantly decreases the likelihood that a network storm will again have serious consequences.
  6. HARDWARE: NEWER IS BETTER

    TECHNICAL POLICY REQUIREMENTS (INTRODUCED IN 2014)
    System: criticality class, equipment update period
    - Trading system engine (real-time): 1A, 3 years
    - Main production systems: 2A, 3A, 4 years
    - Responsible systems: B, 4 years
    - Non-critical systems: C, 5 years
    - Network: ---, 5 years

    [Charts: HARDWARE PARK BY AGE, 2015 STATE and TARGET STATE. Percentage of hardware composition (%) by criticality class (1A, 2A, 3A, B, C, Network). Series: less than 3 years (2013 - 2015), 3 - 5 years (2011 - 2012), older than 5 years (2003 - 2010).]
    Data labels as extracted, 2015 state: 58 30 22 19 29 26 16 33 24 13 66 16 54 44 57 58 34.
    Data labels as extracted, target state: 100 88 70 76 83 12 30 24 17 96 4.
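The technical policy on this slide maps each criticality class to a maximum equipment age. A rough sketch of how such a policy could be checked against an inventory is shown below. The dictionary values come from the slide; the `"NETWORK"` key, the function name, and the use of a commissioning date as the age reference are assumptions for illustration only.

```python
from datetime import date

# Equipment update period in years per criticality class (values from the slide).
# "NETWORK" is a hypothetical key: the slide lists network gear without a class.
UPDATE_PERIOD_YEARS = {"1A": 3, "2A": 4, "3A": 4, "B": 4, "C": 5, "NETWORK": 5}


def is_overdue(criticality_class: str, commissioned: date, today: date) -> bool:
    """Return True if the equipment is older than its class allows."""
    limit_years = UPDATE_PERIOD_YEARS[criticality_class]
    # Approximate age in years; 365.25 smooths over leap years.
    age_years = (today - commissioned).days / 365.25
    return age_years > limit_years


check_date = date(2015, 6, 1)
engine_overdue = is_overdue("1A", date(2012, 1, 1), check_date)
noncritical_overdue = is_overdue("C", date(2012, 1, 1), check_date)
```

The same commissioning date gives different answers for the two classes, which is the effect the policy is after: the real-time trading engine is refreshed soonest.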
  7. NEW DATA CENTER

    Slide sections: RELIABILITY, SECURITY, FURTHER DEVELOPMENT CAPACITY.
    - Tier-3 certified data center delivers 99.98% availability
    - Compliance with the safety requirements of the Payment Card Industry Data Security Standard (PCI DSS) v.3 to ensure the security of customer information
    - High level of safety and resistance to adverse external influences
    - Stand-alone building
    - Security service
    - Administered surrounding territory, guarded area
    - Access control system
    - CCTV monitoring
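To put the 99.98% availability figure in perspective, it can be converted into the downtime it allows per year. The sketch below does only that arithmetic; the function name and the choice of a 365-day year are assumptions.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a non-leap year


def allowed_downtime_hours(availability_percent: float) -> float:
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


downtime_hours = allowed_downtime_hours(99.98)
downtime_minutes = downtime_hours * 60
```

For 99.98% this works out to roughly 1.75 hours, or about 105 minutes, per year.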
  8. ICING ON A RELIABILITY CAKE: BETTER PERFORMANCE

    Results below are from the annual joint Exchange/brokers stress tests of core infrastructure, fall 2016.

    LOAD TESTING RESULTS

    ASTS+
    - Average response time: 230 µs
    - 90% responses: < 270 µs
    - 99% responses (transaction frequency less than 50 per second): < 400 µs
    - 99% responses (transaction frequency higher than 500 per second): 1500 µs
    - 99.9% responses (typical real market frequency of transactions): < 600 µs

    SPECTRA
    - Average response time (1): 250 µs
    - Under load up to 50 000 Tr/sec: < 250 µs
    - 99% responses: < 1000 µs
    (1) For peak frequencies of 20 000 - 30 000 transactions per second, forecast for the next year.

    [Chart: TWIME AND CGATE COMPARISON. Axes: transactions, clients’ transaction time.]
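The figures on this slide are averages and percentiles of response time. As a rough illustration of how such a summary is derived from raw latency samples, the sketch below uses the nearest-rank percentile method. The method, the function names, and the synthetic sample data are assumptions; the slide does not say how the exchange computed its numbers.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def summarize(samples_us):
    """Summarize response times (microseconds) the way the slide reports them."""
    return {
        "mean": sum(samples_us) / len(samples_us),
        "p90": percentile(samples_us, 90),
        "p99": percentile(samples_us, 99),
        "p99.9": percentile(samples_us, 99.9),
    }


# Synthetic data for illustration only: 1000 samples of 1..1000 microseconds.
summary = summarize(list(range(1, 1001)))
```

Percentiles are reported alongside the average because a small number of slow responses barely moves the mean but shows up clearly at the 99% and 99.9% levels.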
  9. 2016 – RETURN TO STABILITY, HOPEFULLY LASTING

    [Chart: real-time systems availability, %, for 2012, 2013, 2014, 2015, 2016 and Q1-2 2017. Series: target value; average value, 2012-2017.]
    Data labels as extracted: 99.98, 99.98, 99.98, 99.95, 99.98, 99.98.