
10 insane things on Big Data

Luis Belloch
September 22, 2016


Ten war stories from doing large-scale data management at Accudelta, and some of the approaches we took to survive the tsunami.




  1. 10 insane things on big data LUIS BELLOCH 

  2. Since 1991, ~100 employees. Offices in Dublin, Boston, London, New York, Stockholm, Milan and Valencia; Valencia is an engineering office only. BlackRock, Fidelity, J.P. Morgan US, M&G, Prudential, Charles Schwab, Schroders, State Street, Columbia Threadneedle, Canada Life, IFDS, New Ireland, ...
  3. (image-only slide)
  4. (image-only slide)
  5. #1 Wild Data: "So… is that a bunch of Excel and CSV files randomly piled up?" - Day 1, MoneyMate developer
  6. (image-only slide)
  7. (image-only slide)
  8. #2 Timing ⏰ Data is inconsistent most of the time!

  9. #3 Schema Agnostic: Every client has their own schema, and the system has to be fast.
  10. #3 Schema Agnostic • Reduced load time from 22 h to 9 min • In-memory and DB modes • Avoid write-locks as much as possible • Homeostasis: resilient/adaptive loading • Reactive async publishing (diagram: LOADING → PUBLISHING)
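One common way to avoid write-locks on the table readers are querying is a shadow-table swap: bulk-load into a staging table, then publish it with a rename. A minimal sketch of that pattern; SQLite is used here only for brevity (SQLite actually locks the whole file, so the real benefit shows up on servers with table-level locking), and the `prices`/`prices_staging` names are illustrative, not from the deck:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (isin TEXT, nav REAL)")
conn.execute("INSERT INTO prices VALUES ('IE00EXAMPLE0', 71.2)")

# Load the new batch into a shadow table: readers keep hitting `prices`,
# so the slow bulk insert never holds a write-lock on the live table.
conn.execute("CREATE TABLE prices_staging (isin TEXT, nav REAL)")
conn.execute("INSERT INTO prices_staging VALUES ('IE00EXAMPLE0', 71.9)")

# Publish with two renames inside one short transaction.
with conn:
    conn.execute("ALTER TABLE prices RENAME TO prices_old")
    conn.execute("ALTER TABLE prices_staging RENAME TO prices")
conn.execute("DROP TABLE prices_old")

print(conn.execute("SELECT nav FROM prices").fetchone()[0])  # 71.9
```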
  11. #4 Parallel Testing: Replay one month of events in the system, using two software versions, then compare the outputs row-by-row, cell-by-cell.
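The comparison step boils down to a nested diff over the two runs' outputs. A minimal sketch; the `diff_cells` helper and the sample rows are hypothetical, not the deck's actual tooling:

```python
def diff_cells(rows_a, rows_b):
    """Compare two runs' outputs row-by-row, cell-by-cell.

    Returns (row_index, col_index, value_a, value_b) tuples;
    both runs are assumed to produce the same shape of output.
    """
    mismatches = []
    for i, (row_a, row_b) in enumerate(zip(rows_a, rows_b)):
        for j, (a, b) in enumerate(zip(row_a, row_b)):
            if a != b:
                mismatches.append((i, j, a, b))
    return mismatches

old_run = [["fund", "nav"], ["F1", "10.00"], ["F2", "20.00"]]
new_run = [["fund", "nav"], ["F1", "10.00"], ["F2", "20.01"]]
print(diff_cells(old_run, new_run))  # [(2, 1, '20.00', '20.01')]
```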
  12. #5 Schema Evolutions • ~50 MB of SQL, several more CSVs • VCS and code-review friendly • Test-data & container migrations • Forward-only, no rollbacks • Exercised many times per day through CI builds • etcd distributed locks for coordination
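A forward-only migration runner reduces to a version table plus an ordered apply loop: there is no "down" path, so a bad migration is fixed by shipping a new, higher-numbered one. A minimal sketch assuming SQLite and a hypothetical `schema_version` table; the deck's etcd locking around the runner is omitted:

```python
import sqlite3

def migrate(conn, migrations):
    """Apply versioned, forward-only migrations exactly once, in order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (v INTEGER PRIMARY KEY)")
    applied = {v for (v,) in conn.execute("SELECT v FROM schema_version")}
    for version, sql in sorted(migrations):
        if version in applied:
            continue
        with conn:  # the statement and its version stamp commit together
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))

conn = sqlite3.connect(":memory:")
migrations = [
    (1, "CREATE TABLE funds (isin TEXT)"),
    (2, "ALTER TABLE funds ADD COLUMN nav REAL"),
]
migrate(conn, migrations)
migrate(conn, migrations)  # idempotent: applied versions are skipped
print([v for (v,) in conn.execute("SELECT v FROM schema_version")])  # [1, 2]
```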
  13. #6a Market: Right after Brexit, one of our clients started to load data on a daily basis instead of monthly.
  14. #6b Government: Solvency II regulation was delayed by more than two years.

  15. #7 Latency, the hard way: minimum network latency between New York and Dublin.
     Distance: d = 5111.28 km
     Best fiber refractive index: n = 1.5 (n = c / v)
     Max speed in that fiber: v = c / n = 199,861,639 m/s
     t_fiber = d / v = 25.57 ms
     t_min = d / c = 17.05 ms
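The slide's figures follow directly from d, c and n; a quick check (variable names are mine):

```python
C = 299_792_458       # speed of light in vacuum, m/s
N = 1.5               # refractive index of the fiber
D = 5_111_280         # New York-Dublin distance in metres (5111.28 km)

v_fiber = C / N                   # ~199,861,639 m/s in the glass
t_fiber_ms = D / v_fiber * 1000   # one-way time in fiber
t_min_ms = D / C * 1000           # one-way time at c, the physical floor

print(f"{t_fiber_ms:.2f} ms in fiber, {t_min_ms:.2f} ms at c")
# 25.57 ms in fiber, 17.05 ms at c
```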
  16. (http://www.nanex.net/aqck2/4680.html)

  17. #8 DIY Cluster: "Cloud? Over my dead body." - One of our lovely customers
  18. That moment when you realize the undersea cable broke and the cluster is down (2014)
  19. #9 Who needs a cluster? Most of the problems are small. Distributed systems are hard.
  20. #10 Small Data: Big data is an excuse, a catalyst for improving the tools we have today.
  21. thanks! @luisbelloch