
EXTENT-2017: Climbing Out of the Stability Sinkhole - Survivor’s Guide

EXTENT-2017: Software Testing & Trading Technology Trends Conference
29 June, 2017, 10 Paternoster Square, London

Climbing Out of the Stability Sinkhole - Survivor’s Guide
Sergei Poliakoff, CIO, Moscow Exchange

Would you like to know more?
Visit our website: extentconf.com
Follow us:
https://www.linkedin.com/company/exactpro-systems-llc?trk=biz-companies-cym
https://twitter.com/exactpro
#extentconf
#exactpro

Exactpro

June 30, 2017

Transcript

  1. CLIMBING OUT OF THE STABILITY
    SINKHOLE


  2. [Diagram: three risk management models compared side by side]

    BASIC
      Orders flow straight into the match book and produce trades
      Risk and margin are calculated in batches (end-of-day or mini batch)
      Margin calls are issued end-of-day or intra-day

    REAL-TIME POST-TRADE
      Risk and margin are calculated post-trade in real time
      Real-time post-trade risk check, margin call and account shutoff

    REAL-TIME ORDER PRE-CLEARING
      Projected risk is calculated before accepting the order (hypothetical trade risk calculation, ~400 µs)
      If the projected margin requirement breaches the available margin, the order is not admitted to the match book
      Sustained at ~12000 orders/second

    UNIQUE REAL-TIME RISK MANAGEMENT = STABILITY CHALLENGE
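
The pre-clearing model above is, in effect, a margin check executed inside the order path. As a rough Python sketch only (the class names, the price map and the flat 10% margin rate are assumptions made for this illustration, not the exchange's actual risk model), the admit-or-reject decision could look like this:

from dataclasses import dataclass, field

@dataclass
class Order:
    account_id: str
    instrument: str
    side: str          # "BUY" or "SELL"
    quantity: int

@dataclass
class Account:
    available_margin: float
    positions: dict = field(default_factory=dict)   # instrument -> signed quantity

def projected_margin_requirement(account, order, prices, margin_rate=0.10):
    """Price the portfolio as if the order had already traded (the 'hypothetical
    trade' calculation) and return the margin that portfolio would require."""
    positions = dict(account.positions)
    signed_qty = order.quantity if order.side == "BUY" else -order.quantity
    positions[order.instrument] = positions.get(order.instrument, 0) + signed_qty
    # Toy risk model: margin proportional to gross notional exposure.
    return sum(abs(qty) * prices[instr] * margin_rate for instr, qty in positions.items())

def pre_clear(account, order, prices):
    """Admit the order to the match book only if the projected margin
    requirement stays within the account's available margin."""
    return projected_margin_requirement(account, order, prices) <= account.available_margin

# Example: an account holding 50 FUT_A with 1,000 of available margin.
acct = Account(available_margin=1_000.0, positions={"FUT_A": 50})
prices = {"FUT_A": 100.0, "FUT_B": 200.0}
print(pre_clear(acct, Order("acc1", "FUT_B", "BUY", 30), prices))   # False: projected margin 1,100 > 1,000, order rejected
print(pre_clear(acct, Order("acc1", "FUT_B", "SELL", 10), prices))  # True: projected margin 700 fits within available margin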


  3. INCIDENTS, DATES AND SOLUTIONS

    HARDWARE FAULTS DISRUPTING THE BACKUP SCHEME (August 12, September 1, September 8)
      Hardware replacement and upgrade (< 3 years)
      Migration to «flat» network topology
      Network segregation
      Human resource development in the operations and maintenance department
      New Tier III data center

    CLEARING MODULE FAULTS (January 12, March 5, June 15)
      Segregation of the Trading and Clearing modules
      Emergency limit check scheme
      Order risk check model update
      Development process improvement
      Software Development Life Cycle practices implementation
      Introduction of “destructive testing”
      Testing cycle extension

    TRADE ENGINE FAULTS (September 21)
      Trade engine cloning (as part of the trading and clearing modules segregation programme)
      Common development process improvement

    THE INGLORIOUS 2015
    [Timeline: critical faults across 2015, January to December; clearing module (CM) faults in January, March and June, hardware (H) faults in August and September, and a trade engine (TE) fault in September]
    CRITICAL FAULTS IN 2015 TIMESCALE


  4. OLD SOFTWARE DEV PRACTICES COULDN’T COPE WITH COMPLEXITY => NEW DEV PROCESS

    SOFTWARE DEVELOPMENT LIFE CYCLE (SDLC)
      Unit-like testing practice
      Extension of auto-test coverage
      Changes referenced to projects, tasks and issues
      Regular code review practices
      Static and dynamic code analysis
      Continuous integration (CI)
      Auto deployment

    QUALITY ASSURANCE PRACTICES: FROM 2014 (PREVIOUS) TO 2017 (IMPROVED AND NEW)
    Practices across development, testing and implementation: unit tests, UAT (external), acceptance testing,
    integration testing (cross-system), simulation testing, destructive testing, manual functional testing,
    manual regression testing, automatic functional testing, automatic regression testing, testing metrics,
    static code review.

    TOOLS
      Clarive + Ansible: auto deployment
      Coverity: static code analysis
      GitLab, Crucible: code review
      Jenkins: continuous integration
      Serena, Jira: bug tracking, task tracking
      HP ALM: testing lifecycle management
      Valgrind: dynamic code analysis (Spectra)
      PyTests: regression testing
      Undefined Sanitizer, Address Sanitizer: dynamic code analysis (ASTS)
      Doxygen: auto documentation

    COVERAGE
      Auto deployment: 100% for real-time systems
      Other tools: ~80% of current releases
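
To give a feel for the automatic regression layer, here is a minimal pytest-style sketch. The FakeMatchingEngine below is a hypothetical in-memory stand-in invented for the illustration; the real PyTests suite drives the actual trading and clearing gateways.

import pytest

class FakeMatchingEngine:
    """Toy engine for the sketch: crossing limit orders trade at the resting price."""
    def __init__(self):
        self.book = []      # resting orders as (side, price, qty) tuples
        self.trades = []    # executed trades as (price, qty) tuples

    def submit(self, side, price, qty):
        opposite = "SELL" if side == "BUY" else "BUY"
        for resting in list(self.book):
            r_side, r_price, r_qty = resting
            crosses = (side == "BUY" and price >= r_price) or \
                      (side == "SELL" and price <= r_price)
            if qty > 0 and r_side == opposite and crosses:
                fill = min(qty, r_qty)
                self.trades.append((r_price, fill))
                self.book.remove(resting)
                if r_qty > fill:
                    self.book.append((r_side, r_price, r_qty - fill))
                qty -= fill
        if qty > 0:
            self.book.append((side, price, qty))

@pytest.fixture
def engine():
    return FakeMatchingEngine()

def test_crossing_orders_produce_a_trade(engine):
    engine.submit("SELL", 100.0, 10)
    engine.submit("BUY", 101.0, 10)
    assert engine.trades == [(100.0, 10)]
    assert engine.book == []

def test_non_crossing_orders_rest_in_the_book(engine):
    engine.submit("SELL", 102.0, 5)
    engine.submit("BUY", 101.0, 5)
    assert engine.trades == []
    assert len(engine.book) == 2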


  5. RELEASE CYCLE

    AFTER THE ACTIVE PHASE OF DEVELOPMENT, DEVELOPERS
      Participate in bug correction
      Start working on the tasks of the next release
      Improve testing methods and tools
      Work on approved non-release tasks
      Work on optimization and technological development

    Each release passes through approval and preparation of functional tasks (5 weeks), development (13 weeks)
    and testing (14 weeks), with consecutive releases overlapping: while Release 1 is still in testing,
    Release 2 is already in development.

    [Diagram: two overlapping release timelines (Release 1 and Release 2) with phase durations of 3, 2, 13 and
    12 weeks and milestones: working group, functional task approval, terms of reference and design, refined
    functional requirements, start and end of development of release tasks, development and bugfixing, release
    testing, approval of the release composition, deployment plan and acceptance testing, and a week of
    silence; the new development practices cover release task development, the new testing practices cover
    release testing]

    CAN’T SPEND 100% OF THE TIME TESTING, NEED TO DELIVER!


  6. [Diagram: network evolution in three stages]

    2013: HIERARCHICAL NETWORK, SPANNING TREE
      Development, testing, the game stand and the operational systems share one network
      The main load falls on the root devices
      Network storms

    2014-2015: OPTIMIZED FLAT NETWORK, SPINE & LEAF
      Network storms are non-persistent
      Increased resistance to network storms

    2016-2017: SEGMENTED NETWORK, SPINE & LEAF
      Separate segments for the operational systems, development, testing, the game segment, test and game
    systems, and office development, each with its own external access
      Network damage is contained within one segment

    Implementation of the Spine & Leaf «flat» network topology significantly reduces the likelihood that a
    network storm will again have serious consequences.

    THE GREAT NETWORK MELTDOWN OF AUGUST, AND WHAT WE DID ABOUT IT


  7. TECHNICAL POLICY REQUIREMENTS (INTRODUCED IN 2014)

    SYSTEM, CRITICALITY CLASS, EQUIPMENT UPDATE PERIOD
      Trading system engine (real-time): class 1A, update every 3 years
      Main production systems: classes 2A and 3A, update every 4 years
      Responsible systems: class B, update every 4 years
      Non-critical systems: class C, update every 5 years
      Network: update every 5 years

    HARDWARE PARK BY AGE
    [Charts: percentage of hardware composition (%) per criticality class (1A, 2A, 3A, B, C, Network), broken
    down by age group: less than 3 years (2013-2015), 3-5 years (2011-2012), older than 5 years (2003-2010);
    one chart shows the 2015 state, the other the target state]

    HARDWARE: NEWER IS BETTER
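
One trivial way to read the policy table above is as an inventory rule: a box is due for replacement once its age exceeds the update period of its criticality class. A small sketch follows; only the class-to-period mapping comes from the table, while the dates and the reporting date are made up for the example.

from datetime import date

# Update periods from the 2014 technical policy, in years per criticality class.
UPDATE_PERIOD_YEARS = {"1A": 3, "2A": 4, "3A": 4, "B": 4, "C": 5, "NETWORK": 5}

def due_for_replacement(criticality_class, commissioned, today=date(2017, 6, 29)):
    """True once the equipment's age exceeds the allowed update period."""
    age_years = (today - commissioned).days / 365.25
    return age_years > UPDATE_PERIOD_YEARS[criticality_class]

print(due_for_replacement("1A", date(2013, 5, 1)))   # True: a real-time (1A) box older than 3 years
print(due_for_replacement("C", date(2014, 1, 1)))    # False: a non-critical box still inside its 5-year window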


  8. RELIABILITY
      Tier III certified data center delivering 99.98% availability
      Compliance with the safety requirements of the Payment Card Industry Data Security Standard (PCI DSS)
    v.3 to ensure the security of customer information
      High level of safety and resistance to adverse external influences

    SECURITY
      Stand-alone building
      Security service
      Administered surrounding territory, guarded area
      Access control system
      CCTV monitoring

    FURTHER DEVELOPMENT CAPACITY

    NEW DATA CENTER


  9. ICING ON THE RELIABILITY CAKE: BETTER PERFORMANCE
    Results below are from the annual joint Exchange/brokers stress tests of the core infrastructure, fall 2016.

    LOAD TESTING RESULTS

    ASTS+
      Average response time: 230 µs
      90% of responses: < 270 µs
      99% of responses (transaction frequency below 50 per second): < 400 µs
      99% of responses (transaction frequency above 500 per second): 1500 µs
      99.9% of responses (typical real-market transaction frequency): < 600 µs

    SPECTRA
      Average response time¹: 250 µs
      Under load of up to 50 000 transactions/second: < 250 µs
      99% of responses: < 1000 µs
    ¹ For the peak frequencies of 20 000 - 30 000 transactions per second forecast for the next year

    [Chart: TWIME and CGATE comparison, clients’ transaction time vs. number of transactions]

  10. 2016 – RETURN TO STABILITY, HOPEFULLY LASTING
    [Chart: real-time systems availability, %, by year, against the target value and the 2012-2017 average:
    99.98 (2012), 99.98 (2013), 99.98 (2014), 99.95 (2015), 99.98 (2016), 99.98 (Q1-2 2017)]
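
For orientation, the percentages on this slide follow the usual uptime ratio: available time divided by scheduled service time. A back-of-the-envelope sketch, in which the trading calendar is an assumption for illustration rather than the exchange's actual service-level definition:

def availability_pct(downtime_minutes, scheduled_minutes):
    """Availability as a percentage of scheduled service time."""
    return 100.0 * (scheduled_minutes - downtime_minutes) / scheduled_minutes

# Assume ~251 trading days and a 9-hour trading session per day.
scheduled = 251 * 9 * 60                           # ~135,540 scheduled minutes per year
print(round(availability_pct(27, scheduled), 2))   # ~99.98: roughly half an hour of downtime per year
print(round(availability_pct(68, scheduled), 2))   # ~99.95: closer to the 2015 figure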
