
Refactoring database storage @ Kiwi.com

Petr Joachim (Kiwi.com) at Kiwi.com Brno-Moscow Python Meetup

Video: http://www.moscowpython.ru/meetup/1/refactoring-database-storage-at-kiwi-dot-com/

Moscow Python Meetup

March 23, 2017

Transcript

  1. What do we do at Kiwi.com?
     - We search flights and sell flight tickets
     - We need to own all the data all the time
     - We build our own flight combinations
  2. Problems with flights
     - The fundamental problem with flight data is that it changes - a lot
     - We process around 40,000 flight entries per second
  3. Problems with flights
     - We update every row in our dataset at least once every 48 hours (some rows more often)
     - The database holds 1.2 billion flights, covering only 6 months into the future (we perform around 3.5 billion updates each day)
     - That is around 2 TB of stored data
  4. Problems with flights
     Flight data is a quite complicated structure. It contains things like:
     - one-way, return, direct and multi-stop flights (we call the individual legs segments)
     - weird pricing policies, fare classes, special offers
     - child and infant prices, points of sale, different providers, …
  5. Problems with flights
     - There is no global identifier of a flight!
     - A flight is identified by all its segments
     - segment = (flight_number, source_airport, destination_airport, airline, date_of_departure)
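
     A flight's identity can therefore be expressed as a hash over the ordered
     tuple of its segments. A minimal Python sketch, assuming illustrative
     names (Segment, flight_key) rather than Kiwi.com's actual code:

         import hashlib
         from typing import NamedTuple, Tuple

         class Segment(NamedTuple):
             flight_number: str
             source_airport: str
             destination_airport: str
             airline: str
             date_of_departure: str  # ISO date, e.g. "2017-03-23"

         def flight_key(segments: Tuple[Segment, ...]) -> str:
             # A flight is identified by all its segments, so a stable key
             # can be derived by hashing their ordered concatenation.
             raw = "|".join("/".join(s) for s in segments)
             return hashlib.sha1(raw.encode("utf-8")).hexdigest()
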
  6. Previous solution
     - At first there was a single PostgreSQL database with all the data in it
     - It was filled by a Python script consuming a Redis queue of incoming data from the data acquisition team
  7. Previous solution
     - Then a single PostgreSQL server was no longer enough
     - So an engineer decided to split the data across two databases, sharding at the application level
  8. Previous solution
     - So she created a thing called the "router", which knows which shard each flight belongs on
     - The routing data is stored in Redis
     - Routing is based on the combination (source, destination, date_of_departure)
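
     A rough sketch of what such an application-level router could look like;
     the shard list, key format and stable-hash assignment are assumptions
     made for illustration:

         import hashlib
         import redis

         # Hypothetical shard DSNs; the real deployment grew far beyond two.
         SHARDS = ["postgres://pg-shard-0/flights", "postgres://pg-shard-1/flights"]

         router = redis.StrictRedis(host="router-redis")

         def shard_for(source, destination, date_of_departure):
             # Look up the shard for this routing key; on first sight assign
             # one by stable hashing so the placement stays sticky.
             key = "route:%s:%s:%s" % (source, destination, date_of_departure)
             shard = router.get(key)
             if shard is None:
                 digest = hashlib.md5(key.encode("utf-8")).hexdigest()
                 shard = SHARDS[int(digest, 16) % len(SHARDS)]
                 router.set(key, shard)
                 return shard
             return shard.decode("utf-8")
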
  9. Previous solution
     - Later on, PostgreSQL write speed became a problem
     - So the engineer wrote a Redis layer to cache PostgreSQL data
     - And an insanely complex update/invalidate mechanism to keep everything in PostgreSQL consistent with as few writes as possible
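
     The real mechanism was, as the slide says, insanely complex; the sketch
     below shows only its core idea of skipping a PostgreSQL write when the
     cached copy proves nothing changed (table layout and key names are
     assumptions):

         import json
         import psycopg2
         import redis

         cache = redis.StrictRedis(host="cache-redis")
         pg = psycopg2.connect("dbname=flights")  # hypothetical DSN

         def upsert_flight(flight_id, flight):
             payload = json.dumps(flight, sort_keys=True)
             cached = cache.get("flight:%s" % flight_id)
             if cached is not None and cached.decode("utf-8") == payload:
                 return  # identical to what is stored: skip the PostgreSQL write
             with pg, pg.cursor() as cur:
                 cur.execute(
                     "INSERT INTO flights (id, data) VALUES (%s, %s) "
                     "ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data",
                     (flight_id, payload),
                 )
             cache.set("flight:%s" % flight_id, payload)
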
  10. Previous solution
     So we ended up with more than 50 servers running:
     - a Redis queue for incoming data
     - Redis to cache data
     - PostgreSQL to store data
     - a crazy Python update script driving the whole thing
  11. Previous solution
     And of course the system itself doesn't handle redundancy, data balancing, durability, or adding new hosts
  12. Objective: Make it simpler
     - We need a simpler yet more reliable architecture
     - We'd like to make it more testable
     - And less fragile
  13. Preparations
     We mirrored the incoming data (remember those 50+ Redis queues) into one central queue solution (we tried several)
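
     The slides don't say which central queue won. Assuming, purely for
     illustration, a Redis-compatible one, the fan-in amounts to draining
     each per-server queue into one central list:

         import redis

         SHARD_HOSTS = ["queue-%d" % i for i in range(50)]  # hypothetical hostnames
         central = redis.StrictRedis(host="central-queue")

         def mirror(shard_host):
             # Block on one per-server incoming queue and copy every entry
             # into the central queue (run one such worker per shard host).
             shard = redis.StrictRedis(host=shard_host)
             while True:
                 _queue, entry = shard.brpop("incoming")
                 central.lpush("incoming", entry)
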
  14. Implementation
     We prepared a simplistic consumer script (in Python, of course) so we could test candidate databases under this kind of load
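
     A minimal sketch of such a consumer; the queue name and the throughput
     reporting are assumptions, and write stands in for whichever candidate
     database is under test:

         import time
         import redis

         queue = redis.StrictRedis(host="central-queue")

         def consume(write, report_every=10000):
             # Pop flight entries off the central queue, hand each one to the
             # candidate database, and report the sustained write throughput.
             done = 0
             started = time.monotonic()
             while True:
                 entry = queue.rpop("incoming")
                 if entry is None:
                     time.sleep(0.1)
                     continue
                 write(entry)
                 done += 1
                 if done % report_every == 0:
                     rate = done / (time.monotonic() - started)
                     print("%d entries consumed, %.0f writes/s" % (done, rate))
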
  15. Cassandra
     - We are now big enough to use a storage like that
     - It has all we need (hopefully), because with the Redis caching layer we had already lost many features of a relational database
  16. Dark sides of Cassandra
     - Written in Java
     - No classic transactions; there are only row-level (lightweight) transactions (UPSERT, INSERT IF, UPDATE IF, …)
     - It has an SQL-like syntax with tables, rows and columns
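
     With the DataStax Python driver, a lightweight transaction is just a
     conditional write; the keyspace, table and values below are hypothetical:

         from datetime import date
         from cassandra.cluster import Cluster

         session = Cluster(["cassandra-host"]).connect("flights_ks")

         # IF NOT EXISTS turns the INSERT into a lightweight transaction:
         # Cassandra runs a Paxos round and applies the write only if no row
         # with this primary key exists yet.
         result = session.execute(
             "INSERT INTO flights (source, destination, date_of_departure,"
             " flight_number, data) VALUES (%s, %s, %s, %s, %s) IF NOT EXISTS",
             ("PRG", "SVO", date(2017, 3, 23), "OK901", "{}"),
         )
         print(result.one()[0])  # the '[applied]' column: False if the row existed
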
  17. Dark sides of Cassandra
     - Primary indexes only - they can be composite: ((partition1, partition2), cluster1, cluster2, …)
     - You must specify the whole partition key in the WHERE clause; clustering keys can then be added one by one (in order), and a range is allowed only on the last clustering key used (see the sketch below)
     - There are secondary indexes, but you should not use them, for performance reasons
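
     The WHERE-clause rules follow directly from how the primary key is laid
     out. A hypothetical schema keyed like the old router makes them concrete:

         from datetime import date
         from cassandra.cluster import Cluster

         session = Cluster(["cassandra-host"]).connect("flights_ks")

         # (source, destination) is the partition key; date_of_departure and
         # flight_number are clustering keys, in that order.
         session.execute(
             "CREATE TABLE IF NOT EXISTS flights ("
             " source text, destination text,"
             " date_of_departure date, flight_number text, data text,"
             " PRIMARY KEY ((source, destination), date_of_departure, flight_number))"
         )

         # OK: the whole partition key is given and the range sits on the
         # last clustering key used.
         session.execute(
             "SELECT data FROM flights WHERE source = %s AND destination = %s"
             " AND date_of_departure >= %s AND date_of_departure < %s",
             ("PRG", "SVO", date(2017, 3, 23), date(2017, 3, 31)),
         )

         # Rejected by Cassandra: the partition key is incomplete.
         # session.execute("SELECT data FROM flights WHERE source = 'PRG'")
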
  18. Bright sides of Cassandra
     - It has nearly linear scalability
     - It can process hundreds of thousands of requests per second (on 25 servers across 2 data centres it handles around 300,000 writes and 100,000 reads per second)
  19. Bright sides of Cassandra
     - It is fast enough - we don't need the Redis cache layer
     - It is a cluster - we don't need a "router" to manually shard data in the application
     - It has redundancy and all the other features built in
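
     Redundancy is declared per keyspace. A sketch matching the two data
     centres mentioned on the previous slide (data-centre names and
     replication factors are assumptions):

         from cassandra.cluster import Cluster

         session = Cluster(["cassandra-host"]).connect()

         # Three replicas of every row in each of the two data centres, kept
         # in sync by Cassandra itself - no application-level router needed.
         session.execute(
             "CREATE KEYSPACE IF NOT EXISTS flights_ks WITH replication ="
             " {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3}"
         )
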
  20. Couple of stories
     We tried Scylla first (a C++ alternative to Cassandra promising 10 times better performance), but we don't use it - it was not mature enough at the time
  21. Couple of stories
     We hired a consulting company from Serbia to help us set up the cluster and to review our code and data model
  22. Couple of stories
     There was a firewall rule that kept Cassandra from reaching its own local JMX service
  23. Refactoring clean up
     - We have a tool which compares data loaded from the old solution with data loaded from the new one
     - A Python module for reading this data in all other parts of our system is currently under development
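
     The essence of such a comparison tool is a diff between the two storages'
     answers for the same flights; a minimal sketch with assumed reader
     callables:

         def compare(flight_ids, read_old, read_new):
             # Fetch each flight from both storages and collect the ids whose
             # payloads disagree, so discrepancies can be investigated.
             mismatched = []
             for fid in flight_ids:
                 if read_old(fid) != read_new(fid):
                     mismatched.append(fid)
             return mismatched
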