
10 insane things on Big Data

Luis Belloch
September 22, 2016


Ten war stories while doing large-scale data management in Accudelta, and some approaches we took to survive the tsunami.



Transcript

  1. 10 insane things on big data
    LUIS BELLOCH

    MONEYMATE / ACCUDELTA
    SEPT. 2016
    ETSINF UPV


  2. Since 1991, ~100 employees
    Offices in Dublin, Boston, London, New York, Stockholm, Milan and Valencia.
    Valencia is an engineering-only office.
    Clients: BlackRock, Fidelity, J.P. Morgan US, M&G, Prudential,
    Charles Schwab, Schroders, State Street, Columbia Threadneedle,
    Canada Life, IFDS, New Ireland, ...


  3. (no transcript text)

  4. (no transcript text)

  5. #1 Wild Data
    So… is that a bunch of Excel and
    CSV files randomly piled up?
    - Day 1, MoneyMate developer


  6. (no transcript text)

  7. (no transcript text)

  8. #2 Timing ⏰
    Data is inconsistent most of the time!


  9. #3 Schema Agnostic
    Every client has its own schema,
    and the loading system has to be fast.


  10. #3 Schema Agnostic
    • Reduced load time from 22 h to 9 min
    • In-Memory and DB modes
    • Avoid write-locks as much as possible
    • Homeostasis: resilient/adaptive loading
    • Reactive async publishing
    LOADING
    PUBLISHING


  11. #4 Parallel Testing
    Replay one month of events through the system,
    … using two software versions,
    … then compare row by row, cell by cell.
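
The row-by-row, cell-by-cell comparison could be sketched roughly as follows. This is only an illustration, not the tooling used at MoneyMate: the function name `diff_tables`, the `"<row>"` marker and the idea of matching rows through a single key column are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Diff = Tuple[object, str, object, object]


def diff_tables(old: List[Row], new: List[Row], key: str) -> List[Diff]:
    """Return (key, column, old_value, new_value) for every differing cell.

    Rows present on only one side are reported once with column "<row>".
    Assumes key values are unique and mutually comparable (hypothetical setup).
    """
    old_by_key = {row[key]: row for row in old}
    new_by_key = {row[key]: row for row in new}
    diffs: List[Diff] = []
    for k in sorted(old_by_key.keys() | new_by_key.keys()):
        a = old_by_key.get(k)
        b = new_by_key.get(k)
        if a is None or b is None:
            # The row exists in only one of the two outputs.
            diffs.append((k, "<row>", a, b))
            continue
        for col in sorted(a.keys() | b.keys()):
            if a.get(col) != b.get(col):
                diffs.append((k, col, a.get(col), b.get(col)))
    return diffs
```

Feeding it the output of the current version as `old` and the candidate version as `new`, an empty result would mean the two versions agree on every cell for the replayed period.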


  12. #5 Schema Evolutions
    • ~50 MB of SQL, plus several CSVs
    • VCS and code review friendly
    • Test-data & container migrations
    • Forward-only, no rollbacks
    • Exercised many times per day through CI builds
    • etcd for distributed locks and coordination
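
A forward-only migration runner can be very small. The sketch below uses SQLite from the Python standard library purely for illustration; the `schema_version` table, the `MIGRATIONS` list and the `fund` table are hypothetical names, and the etcd-based locking mentioned above is not shown.

```python
import sqlite3

# Hypothetical, ordered migration steps. There is deliberately no "down" script.
MIGRATIONS = [
    (1, "CREATE TABLE fund (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE fund ADD COLUMN currency TEXT"),
]


def migrate(conn, migrations=MIGRATIONS):
    """Apply every migration newer than the recorded version, in order.

    Returns the list of version numbers applied by this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)"
    )
    current = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM schema_version"
    ).fetchone()[0]
    applied = []
    for version, sql in sorted(migrations):
        if version <= current:
            continue  # already applied; forward-only means we never undo it
        with conn:  # commits the version record once the step has run
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_version (version) VALUES (?)", (version,)
            )
        applied.append(version)
    return applied
```

Running `migrate` twice against the same database applies nothing the second time, which is what makes it cheap to exercise on every CI build.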


  13. #6a Market
    Right after Brexit, one of our clients started to
    load data on a daily basis instead of monthly.


  14. #6b Government
    The Solvency II regulation was delayed by more than 2 years


  15. #7 Latency, the hard way
    Minimum one-way network latency between New York and Dublin

    Distance: d = 5111.28 km
    Best fiber refractive index: n = 1.5 (n = c / v)
    Max speed in that fiber: v_max = c / n = 199,861,639 m/s
    t_fiber = d / v_max ≈ 25.57 ms
    t_min = d / c ≈ 17.05 ms
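
The arithmetic on this slide can be checked with a few lines of Python. The function name `one_way_latency_ms` is made up for this sketch; the distance and refractive index are the values from the slide, and the speed of light is the standard defined constant.

```python
C = 299_792_458.0  # speed of light in vacuum, in m/s


def one_way_latency_ms(distance_km, refractive_index=1.0):
    """Lower bound on one-way propagation delay, in milliseconds.

    refractive_index=1.0 models light in vacuum; 1.5 is the fiber
    value quoted on the slide (n = c / v, so v = c / n).
    """
    speed = C / refractive_index           # propagation speed in m/s
    seconds = distance_km * 1000.0 / speed
    return seconds * 1000.0


if __name__ == "__main__":
    d = 5111.28  # New York to Dublin, in km, as given on the slide
    print(f"vacuum: {one_way_latency_ms(d):.2f} ms")
    print(f"fiber:  {one_way_latency_ms(d, 1.5):.2f} ms")
```

These are one-way figures and match the slide to within rounding; a round trip costs at least twice as much, before any routing or switching overhead.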


  16. (http://www.nanex.net/aqck2/4680.html)


  17. #8 DIY Cluster
    Cloud? Over my dead body.
    - One of our lovely customers


  18. That moment when you realize an undersea cable broke and the cluster is down (2014)


  19. #9 Who needs a cluster?
    Most of the problems are small.

    Distributed systems are hard.


  20. #10 Small Data
    Big data is an excuse,
    a catalyst for improving the tools we have today.


  21. thanks!
    @luisbelloch
