Problems with flights The fundamental problem with flight data is that it changes constantly, and a lot. We process around 40,000 flight entries per second
Problems with flights We update every row in our dataset at least once per 48 hours (some of them more often). Our database holds 1.2 billion flights (we update around 3.5 billion rows each day), and that covers only 6 months into the future. This is around 2 TB of stored data
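A quick sanity check shows the two slides agree with each other: 3.5 billion row updates per day works out to roughly the 40,000 entries per second quoted above.

```python
# 3.5 billion row updates per day, 86,400 seconds in a day
UPDATES_PER_DAY = 3.5e9
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

per_second = UPDATES_PER_DAY / SECONDS_PER_DAY
print(round(per_second))  # ~40,500, i.e. the ~40,000 entries/second above
```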
Problems with flights Flight data has a quite complicated structure. It includes things like: one-way, return, direct and with-stops flights (we call the legs segments); weird pricing policies, fare classes, special offers, child and infant prices, points of sale, different providers, …
Problems with flights There is no global identifier for a flight! A flight is identified by all of its segments: segment = (flight_number, source_airport, destination_airport, airline, date_of_departure)
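Since a flight has no global ID, its identity is the ordered tuple of all its segments. A minimal sketch of that idea (field names come from the slide; the sample values are made up):

```python
from typing import NamedTuple, Tuple, Iterable

class Segment(NamedTuple):
    """One leg of a flight; the fields listed on the slide."""
    flight_number: str
    source_airport: str
    destination_airport: str
    airline: str
    date_of_departure: str  # e.g. "2016-07-01"

def flight_id(segments: Iterable[Segment]) -> Tuple[Segment, ...]:
    """A flight has no global ID: the ordered tuple of all its segments *is* the identity."""
    return tuple(segments)

# Illustrative data only
outbound = Segment("OK123", "PRG", "LHR", "OK", "2016-07-01")
inbound = Segment("OK124", "LHR", "PRG", "OK", "2016-07-08")
return_flight = flight_id([outbound, inbound])
```

Two flights are the same flight exactly when every segment matches; change any field of any segment and you have a different flight.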
Previous solution At first there was a single PostgreSQL database with all the data in it. To fill that database, there was a Python script and a Redis queue for incoming data from the data acquisition team
Previous solution Then a single PostgreSQL server was no longer enough, so an engineer decided to run two databases and implement sharding at the application level
Previous solution So she created a component called the “router”, which knows where each flight should live. She used Redis to store its data. Routing is based on the combination (source, destination, date_of_departure)
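The routing idea can be sketched like this. This is an assumption about how such a router might look, not the actual implementation: an in-memory dict stands in for Redis, and the shard names are made up.

```python
def routing_key(source: str, destination: str, date_of_departure: str) -> str:
    """The combination the slide mentions, flattened into one lookup key."""
    return f"{source}:{destination}:{date_of_departure}"

# In production this mapping lived in Redis; a plain dict stands in here.
routes = {}

def assign_shard(source: str, destination: str, date: str, shards: list) -> str:
    """Return the shard for a route, assigning one deterministically on first sight."""
    key = routing_key(source, destination, date)
    if key not in routes:
        routes[key] = shards[hash(key) % len(shards)]
    return routes[key]
```

Every writer and reader consults the same mapping, so all flights for one (source, destination, date) combination land on the same PostgreSQL shard.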
Previous solution Later on, PostgreSQL write speed became a problem. So the engineer wrote a Redis layer to cache PostgreSQL data, and an insanely complex update/invalidate mechanism to keep everything in PostgreSQL consistent with as few writes as possible
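The core of that caching idea, reduced to its simplest form (a sketch only; dicts stand in for Redis and PostgreSQL, and the real mechanism was far more complex):

```python
cache = {}     # stands in for Redis
database = {}  # stands in for PostgreSQL

def read(key):
    """Cache-aside read: hit Redis first, fall back to PostgreSQL and repopulate."""
    if key in cache:
        return cache[key]
    value = database.get(key)
    if value is not None:
        cache[key] = value
    return value

def write(key, value):
    """Skip the database write entirely when nothing changed - fewer PostgreSQL writes."""
    if database.get(key) == value:
        return False
    database[key] = value
    cache.pop(key, None)  # invalidate; the next read repopulates the cache
    return True
```

The "few writes" trick is the early return: most incoming flight updates carry unchanged data, so comparing against the stored row avoids touching PostgreSQL at all.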
Previous solution So we ended up with more than 50 servers, each running: a Redis queue for incoming data, Redis to cache data, PostgreSQL to store data, and a crazy Python update script driving the whole thing
Preparations We prepared a copy of the incoming data queue (remember, those 50+ Redis queues) into one central queue solution (we tried several of them)
Cassandra We are now big enough to use a storage system like that. It has everything we need (hopefully); with the Redis caching we had already lost many features of a relational database anyway
Dark sides of Cassandra Written in Java. No classic transactions; there are only row-level (lightweight) transactions (UPSERT, INSERT IF, UPDATE IF, …). It has an SQL-like syntax with tables, rows and columns
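The semantics of a lightweight transaction are essentially a per-row compare-and-set. A sketch of those semantics in Python (a dict stands in for the table; this models the behavior, not Cassandra itself):

```python
table = {}  # stands in for one Cassandra table, keyed by primary key

def insert_if_not_exists(key, row):
    """Mimics CQL's INSERT ... IF NOT EXISTS, which reports an [applied] flag."""
    if key in table:
        return False  # not applied; the existing row wins
    table[key] = row
    return True

def update_if(key, expected, new_row):
    """Mimics UPDATE ... IF <condition>: apply only if the current row matches."""
    if table.get(key) != expected:
        return False  # not applied; someone changed the row first
    table[key] = new_row
    return True
```

There is no multi-row atomicity here: each call succeeds or fails for one row only, which is exactly the limitation the slide points out.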
Dark sides of Cassandra Primary indexes only, and they can be composite: ((partition1, partition2), cluster1, cluster2, …). You need to specify the whole partition key in the WHERE clause; then you can add clustering keys one by one (in order), and you can use a range only on the last clustering key used. Secondary indexes exist, but you should not use them because of their performance
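That WHERE-clause restriction can be written down as a small checker. This is a sketch of the rule as stated on the slide, not of Cassandra itself:

```python
def query_is_valid(partition_key, clustering_keys, eq_cols, range_cols):
    """Check a query shape against the Cassandra rules from the slide:
    - every partition-key column must be restricted by equality,
    - clustering keys may be restricted only in declared order (a prefix),
    - a range is allowed only on the single next clustering key."""
    # 1. the whole partition key must appear with equality
    if not set(partition_key) <= set(eq_cols):
        return False
    # 2. clustering keys restricted by equality must form an in-order prefix
    used = [c for c in clustering_keys if c in eq_cols]
    if used != clustering_keys[:len(used)]:
        return False
    # 3. at most one range, and only on the clustering key right after the prefix
    if len(range_cols) > 1:
        return False
    if range_cols:
        remaining = clustering_keys[len(used):]
        if not remaining or range_cols[0] != remaining[0]:
            return False
    return True
```

For example, with primary key ((source, destination), date, flight_number), filtering on source and destination with a range on date is fine, but a range on flight_number without pinning date is not.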
Bright sides of Cassandra It scales nearly linearly. It can process hundreds of thousands of requests per second (on 25 servers in 2 data centres it handles around 300,000 writes and 100,000 reads per second)
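Those cluster-wide numbers translate into a rough per-node figure (a simple average that ignores replication overhead and uneven load):

```python
writes, reads, nodes = 300_000, 100_000, 25
per_node = (writes + reads) // nodes
print(per_node)  # 16000 operations per second per node, on average
```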
Bright sides of Cassandra It is fast enough that we don’t need the Redis cache layer. It is a cluster, so we don’t need any “router” to manually shard data in the application. It has redundancy and all the other features
Couple of stories We tried Scylla in the first place (a C++ alternative to Cassandra promising 10 times better performance), but we don’t use it; it was not mature enough
Refactoring clean up We have a tool that compares data loaded from the old solution and from the new one. We will have a Python module for reading that data in all other parts of our system (right now under development)
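The comparison tool boils down to diffing two result sets keyed by flight identity. A minimal sketch of that idea (the `key` function and record shape are assumptions for illustration):

```python
def compare(old_rows, new_rows, key=lambda r: r["id"]):
    """Diff two result sets; returns keys missing in new, extra in new, and differing."""
    old = {key(r): r for r in old_rows}
    new = {key(r): r for r in new_rows}
    missing = sorted(old.keys() - new.keys())
    extra = sorted(new.keys() - old.keys())
    differing = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return missing, extra, differing
```

Running it on the same query against both backends gives an empty result exactly when the migration is faithful, which makes it easy to wire into automated checks.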