Problems with flights The fundamental problem with flight data is that it changes constantly, and a lot. We process around 40,000 flight entries per second
Problems with flights We update every row in our dataset at least once per 48 hours (some of them more often). Our database holds 1.2 billion flights (we update around 3.5 billion rows each day), and that covers only 6 months into the future. This is around 2 TB of stored data
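A quick sanity check shows the two slides agree with each other: 3.5 billion row updates per day works out to roughly the 40,000 entries per second quoted above.

```python
# 3.5 billion row updates per day, 86,400 seconds in a day
UPDATES_PER_DAY = 3.5e9
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

per_second = UPDATES_PER_DAY / SECONDS_PER_DAY
print(round(per_second))  # ~40,500, i.e. the ~40,000 entries/second above
```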
Problems with flights Flight data has a quite complicated structure. It includes things like: one-way, return, direct and with-stops flights (we call the legs segments); weird pricing policies, fare classes, special offers, child and infant prices, points of sale, different providers, …
Problems with flights There is no global identifier for a flight! A flight is identified by all of its segments: segment = (flight_number, source_airport, destination_airport, airline, date_of_departure)
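Since a flight has no global ID, its identity is the ordered tuple of all its segments. A minimal sketch of that idea (field names come from the slide; the sample values are made up):

```python
from typing import NamedTuple, Tuple, Iterable

class Segment(NamedTuple):
    """One leg of a flight; the fields listed on the slide."""
    flight_number: str
    source_airport: str
    destination_airport: str
    airline: str
    date_of_departure: str  # e.g. "2016-07-01"

def flight_id(segments: Iterable[Segment]) -> Tuple[Segment, ...]:
    """A flight has no global ID: the ordered tuple of all its segments *is* the identity."""
    return tuple(segments)

# Illustrative data only
outbound = Segment("OK123", "PRG", "LHR", "OK", "2016-07-01")
inbound = Segment("OK124", "LHR", "PRG", "OK", "2016-07-08")
return_flight = flight_id([outbound, inbound])
```

Two flights are the same flight exactly when every segment matches; change any field of any segment and you have a different flight.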
Previous solution At first there was a single PostgreSQL database with all the data in it. To fill that database, there was a Python script and a Redis queue for incoming data from the data acquisition team
Previous solution Then a single PostgreSQL server was no longer enough, so an engineer decided to run two databases and implement sharding at the application level
Previous solution So she created a component called the “router”, which knows where each flight should live. She used Redis to store its data. Routing is based on the combination (source, destination, date_of_departure)
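The routing idea can be sketched like this. This is an assumption about how such a router might look, not the actual implementation: an in-memory dict stands in for Redis, and the shard names are made up.

```python
def routing_key(source: str, destination: str, date_of_departure: str) -> str:
    """The combination the slide mentions, flattened into one lookup key."""
    return f"{source}:{destination}:{date_of_departure}"

# In production this mapping lived in Redis; a plain dict stands in here.
routes = {}

def assign_shard(source: str, destination: str, date: str, shards: list) -> str:
    """Return the shard for a route, assigning one deterministically on first sight."""
    key = routing_key(source, destination, date)
    if key not in routes:
        routes[key] = shards[hash(key) % len(shards)]
    return routes[key]
```

Every writer and reader consults the same mapping, so all flights for one (source, destination, date) combination land on the same PostgreSQL shard.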
Previous solution Later on, PostgreSQL write speed became a problem. So the engineer wrote a Redis layer to cache PostgreSQL data, and an insanely complex update/invalidate mechanism to keep everything in PostgreSQL consistent with as few writes as possible
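The core of that caching idea, reduced to its simplest form (a sketch only; dicts stand in for Redis and PostgreSQL, and the real mechanism was far more complex):

```python
cache = {}     # stands in for Redis
database = {}  # stands in for PostgreSQL

def read(key):
    """Cache-aside read: hit Redis first, fall back to PostgreSQL and repopulate."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value

def write(key, value):
    """Skip the database write entirely when nothing changed - fewer PostgreSQL writes."""
    if database.get(key) == value:
        return False
    database[key] = value
    cache.pop(key, None)  # invalidate; the next read repopulates the cache
    return True
```

The "few writes" trick is the early return: most incoming flight updates carry unchanged data, so comparing against the stored row avoids touching PostgreSQL at all.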
Previous solution So we ended up with more than 50 servers, each running: a Redis queue for incoming data, Redis to cache data, PostgreSQL to store data, and a crazy Python update script driving the whole thing
Preparations We prepared a copy of the incoming data queue (remember, those 50+ Redis queues) into one central queue solution (we tried several of them)
Cassandra We are now big enough to use a storage system like that. It has everything we need (hopefully); with the Redis caching we had already lost many features of a relational database anyway
Dark sides of Cassandra Written in Java. No classic transactions; there are only row-level (lightweight) transactions (UPSERT, INSERT IF, UPDATE IF, …). It has an SQL-like syntax with tables, rows and columns
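The semantics of a lightweight transaction are essentially a per-row compare-and-set. A sketch of those semantics in Python (a dict stands in for the table; this models the behavior, not Cassandra itself):

```python
table = {}  # stands in for one Cassandra table, keyed by primary key

def insert_if_not_exists(key, row):
    """Mimics CQL's INSERT ... IF NOT EXISTS, which reports an [applied] flag."""
    if key in table:
        return False  # not applied; the existing row wins
    table[key] = row
    return True

def update_if(key, expected, new_row):
    """Mimics UPDATE ... IF <condition>: apply only if the current row matches."""
    if table.get(key) != expected:
        return False  # not applied; someone changed the row first
    table[key] = new_row
    return True
```

There is no multi-row atomicity here: each call succeeds or fails for one row only, which is exactly the limitation the slide points out.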
Dark sides of Cassandra Primary indexes only, and they can be composite: ((partition1, partition2), cluster1, cluster2, …). You need to specify the whole partition key in the WHERE clause; then you can add clustering keys one by one (in order), and you can use a range only on the last clustering key used. Secondary indexes exist, but you should not use them because of their performance
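That WHERE-clause restriction can be written down as a small checker. This is a sketch of the rule as stated on the slide, not of Cassandra itself:

```python
def query_is_valid(partition_key, clustering_keys, eq_cols, range_cols):
    """Check a query shape against the Cassandra rules from the slide:
    - every partition-key column must be restricted by equality,
    - clustering keys may be restricted only in declared order (a prefix),
    - a range is allowed only on the single next clustering key."""
    # 1. the whole partition key must appear with equality
    if not set(partition_key) <= set(eq_cols):
        return False
    # 2. clustering keys restricted by equality must form an in-order prefix
    used = [c for c in clustering_keys if c in eq_cols]
    if used != clustering_keys[:len(used)]:
        return False
    # 3. at most one range, and only on the clustering key right after the prefix
    if len(range_cols) > 1:
        return False
    if range_cols:
        remaining = clustering_keys[len(used):]
        if not remaining or range_cols[0] != remaining[0]:
            return False
    return True
```

For example, with primary key ((source, destination), date, flight_number), filtering on source and destination with a range on date is fine, but a range on flight_number without pinning date is not.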
Bright sides of Cassandra It scales nearly linearly. It can process hundreds of thousands of requests per second (on 25 servers in 2 data centres it handles around 300,000 writes and 100,000 reads per second)
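Those cluster-wide numbers translate into a rough per-node figure (a simple average that ignores replication overhead and uneven load):

```python
writes, reads, nodes = 300_000, 100_000, 25
per_node = (writes + reads) // nodes
print(per_node)  # 16000 operations per second per node, on average
```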
Bright sides of Cassandra It is fast enough that we don’t need the Redis cache layer. It is a cluster, so we don’t need any “router” to manually shard data in the application. It has redundancy and all the other features
Couple of stories We tried Scylla in the first place (a C++ alternative to Cassandra promising 10 times better performance), but we don’t use it; it was not mature enough
Refactoring clean up We have a tool that compares data loaded from the old solution and from the new one. We will have a Python module for reading that data in all other parts of our system (right now under development)
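The comparison tool boils down to diffing two result sets keyed by flight identity. A minimal sketch of that idea (the `key` function and record shape are assumptions for illustration):

```python
def compare(old_rows, new_rows, key=lambda r: r["id"]):
    """Diff two result sets; returns keys missing in new, extra in new, and differing."""
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    differing = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return missing, extra, differing
```

Running it on the same query against both backends gives an empty result exactly when the migration is faithful, which makes it easy to wire into automated checks.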