Event Sourcing, CQRS and scalability: a status update

Davide Taviani

April 12, 2017

Transcript

  1. Event sourcing, CQRS and scalability: a status update

  2. About me
     • From Italy
     • MSc in Mathematics (scientific / parallel computing, combinatorial optimization)
     • Joined Studyflow ~3 years ago
     • info@davidetaviani.com
  3. We are building a secondary-education platform for Dutch high schools.
     • https://www.studyflow.nl
     • We provide 2 courses (Rekenen & Taal)
     • 250+ schools, 100k+ students
  4. Our stack:
     • A Rails application used for publishing (but we’re “getting rid of it”)
     • A Clojure/ClojureScript stack (both an SPA with reagent and request/response)
     • A custom event-sourcing library: rill
     • PostgreSQL for our event store
     • Analytics with ElasticSearch and Kibana
  5. Our stack:
     • On bare metal (at Online.net)
       ◦ 1 database
       ◦ 2 application servers
       ◦ 1 load balancer + failovers
  6. Our team:
     • 3 developers
     • 1 UX designer
     • 1 product designer
     • You? We’re looking for a (frontend) developer! https://studyflow.homerun.co
  7. Event Sourcing

  8. Event sourcing
     We use domain events, i.e. we record things that happened in our domain. Such events are:
     • meaningful within the domain
     • the result of an explicit interaction
     • immutable
     Because of these properties, we consider them our only source of truth.
  9. What does an event look like?
     type QuestionAnsweredCorrectly
     timestamp $T
     question-id $Z
     student-id $X
     answer $Y
     It means: “student $X answered $Y for the question $Z at time $T”.
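In the Clojure stack, such an event can be pictured as a plain immutable map. This is only a sketch: the field names and keyword style here are assumptions, not rill’s actual representation.

```clojure
;; Sketch of a domain event as an immutable Clojure map.
;; Field names are illustrative; rill's actual shape may differ.
(def event
  {:type        :section-test/question-answered-correctly
   :timestamp   #inst "2017-04-12T10:15:00.000-00:00"
   :question-id "question-123"
   :student-id  "student-456"
   :answer      "42"})
```

Because it is just data, an event can be stored, serialized, and replayed without any behavior attached to it.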
  10. Why event sourcing?
     • Interaction and intermediate states are very important in our domain, maybe even more so than the final state (the journey vs the destination). Example: recording info about questions answered incorrectly might be more important than just knowing that a student successfully completed a chapter.
     • Events are immutable, so our system is “append only”, making reasoning easier.
     • Events as the source of truth are very useful when investigating with Kibana: we can tell exactly what has happened.
  11. Clojure fits really well

  12. From our business perspective, events can help us answer interesting questions:
     • Which questions are the most difficult?
     • How different are quick learners from slow ones?
     • What kind of mistake is the most common for a particular question?
     • Is reading an explanation (theory) after a mistake useful?
     But, more importantly:
     • A lot of things that we don’t know yet!
  13. Our event store
     • A big log where all events live together, one after the other: event 1, event 2, event 3, … event 331,999,999, event 332,000,000
     • We use PostgreSQL, and we make use of a few additional columns:
       ◦ stream-id: id of the aggregate the event refers to
       ◦ stream-order: local order within a stream
       ◦ insert-order: global order within all events
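As a sketch, the table behind this could look roughly like the following. This is hypothetical DDL: the column names follow the slide, but everything else (types, payload column, constraints) is an assumption about how rill lays things out.

```sql
-- Hypothetical events table; column names follow the slide,
-- everything else is an assumption.
CREATE TABLE events (
  insert_order BIGSERIAL PRIMARY KEY, -- global order within all events
  stream_id    TEXT    NOT NULL,      -- id of the aggregate the event refers to
  stream_order BIGINT  NOT NULL,      -- local order within a stream
  payload      JSONB   NOT NULL,      -- the serialized event itself
  UNIQUE (stream_id, stream_order)
);
```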
  14. Using the stream-id we can look up individual aggregates, such as:
     • a student practicing in a particular section
     • a student’s account information
     • course material published
     • a student doing an assessment
     (e.g. event 1 → student1, event 2 → student1, event 3 → student2, … event 331,999,999 → course material, event 332,000,000 → student1)
  15. These events can then be “replayed” in 2 ways:
     1. One after the other, to make a sort of global view of everything
     2. Selectively, for one aggregate, to create a materialized view of it
     Why do we need these two ways?
  16. CQRS: Command Query Responsibility Segregation
     • On one side we read events from start to end
     • On the other we retrieve single aggregates
  17. CQRS: Command Query Responsibility Segregation
     • Realizing that the write side (commands) and the read side (queries) have different needs
     • They can be scaled independently
     • The read side is updated asynchronously
     • The write side is updated synchronously when a command about that specific aggregate is fired
  18. The read model
     The read model is how our database would probably look if we did things traditionally. It is just a big in-memory Clojure map:
     • Memory is cheap (at least with bare metal)
     • Memory is fast
     We store in the read model all the information that we need to display to the user, e.g. all the dashboarding.
  19. How do we build the read-model? (and the aggregates too)

  20. Every event has a handler:
     Event Type                              How the event is handled
     SectionTest/QuestionAssigned            “Set the current question to X”
     SectionTest/QuestionAnsweredCorrectly   “Mark X as correct, advance progress by 1”
     SectionTest/Finished                    “Mark the section as finished”
     ChapterQuiz/Failed                      “Reset the progress in the chapter quiz”
     Student/Created                         “Create a student account”
     • handle-event is a simple multimethod dispatching on event type.
     • Our read model is built simply as: (reduce handle-event nil events)
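A minimal sketch of what this multimethod could look like. Event shapes and read-model keys are illustrative assumptions, not rill’s actual API; only the dispatch-on-type and `(reduce handle-event nil events)` pattern come from the slides.

```clojure
;; Sketch of the handler multimethod, dispatching on event type.
;; Event and read-model shapes are illustrative.
(defmulti handle-event (fn [_model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id]}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))))

(defmethod handle-event :default
  [model _event]
  model) ;; ignore events this view doesn't care about

;; The read model is just a fold over the whole event log:
(defn build-read-model [events]
  (reduce handle-event nil events))
```

Starting the reduce from `nil` works because `assoc-in`/`update-in` build the nested maps as needed.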
  21. Clojure multimethods fit really well

  22. The read model is just derived from the combination of Events + Handlers.
     Events are immutable, so we can’t do much about those. But handlers are a whole different story...
  23. Let’s assume we want to build a page which shows how much time the students spend practicing. We can change the event handlers:
     Event Type                              How the event is handled
     SectionTest/QuestionAssigned            “Set the current question to X and track activity”
     SectionTest/QuestionAnsweredCorrectly   “Mark X as correct, advance progress by 1, track activity”
     SectionTest/Finished                    “Mark the section as finished and track activity”
     ChapterQuiz/Failed                      “Reset the progress in the chapter quiz and track activity”
     Student/Created                         “Create a student account”
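Changing a handler is then just editing one defmethod. A sketch, assuming events carry a :timestamp field; here “activity” is simply a vector of timestamps per student, while real session tracking would be more involved.

```clojure
;; Sketch: the same handler, now also tracking activity.
;; Assumes events carry a :timestamp; field names are illustrative.
(defmulti handle-event (fn [_model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id timestamp]}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))
      ;; new: record when the student was active
      (update-in [:students student-id :activity] (fnil conj []) timestamp)))
```

Nothing else changes: the same `(reduce handle-event nil events)` fold now also produces the activity data.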
  24. If we rebuild the read model by replaying all events… it will now contain all the activity information, as if it had been there all along, since the beginning!
  25. But everything has a price... especially when you are starting to have quite a few events.
  26. How many events do we have?
     • 1st event: 14/08/2014
     • 100M: 15/11/2015 (+458 days)
     • 200M: 08/09/2016 (+299 days)
     • 300M: 16/01/2017 (+131 days)
     We have >1.1M events / day!
  27. >1.1M events / day! (and going up)

  28. It seems to grow exponentially

  29. Another view of our event store, with Kibana

  30. In order to rebuild the read model we then need to replay 300 million events... and it takes ~8 hours, as of now.
  31. Can we read the events faster? While we are at it, can we write the events faster?
  32. Can we scale?

  33. (image-only slide)
  34. Why is it important to go faster at reading?
     • We don’t need to prepare the read model hours in advance
     • Quicker turnaround to fix critical bugs
  35. Let’s take another look at the events. Before, we mentioned 2 ways in which we can read them. If we look at the school the students are in, for example, we can see that there is a 3rd way!
     (e.g. event 1 → school 1, event 2 → school 1, event 3 → school 2, … event 331,999,999 → school 2, event 332,000,000 → course published)
  36. Why don’t we separate the schools?
     • all the publishing stuff in the same place
     • all the administrative stuff (internal) in the same place
     • etc.
  37. Our domain helps us: we don’t have any cross-school interaction (for now), so we can replay events for different schools in parallel! Our application server has 40 threads, so a big speedup is achievable: from 8 hours to 20 minutes.
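A sketch of such a parallel replay, assuming every event carries a :school-id and that there is no cross-school interaction, so each school’s fold is independent. Here `pmap` stands in for whatever thread pool the real replayer uses.

```clojure
;; Sketch: replay each school's events in parallel, producing one
;; read model per school. Assumes events carry a :school-id.
(defn replay-all [events handle-event]
  (->> events
       (group-by :school-id)          ;; preserves per-school event order
       (pmap (fn [[school-id school-events]]
               [school-id (reduce handle-event nil school-events)]))
       (into {})))
```

Because `group-by` keeps the original order within each school, every per-school replay still sees its events in stream order.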
  38. Let’s partition the event store! Time for a new column in PostgreSQL...
  39. Event sourcing says that events are immutable. Reality does not necessarily agree, so we sometimes cheat a bit. Two ways of changing events:
     • Active migrations
     • Passive migrations
     Spoiler alert: there is a reason why they tell you not to do it.
  40. Active migrations
     We wrap the function that we use to retrieve events with some middleware: 1 event goes in, 1 event goes out.
     Example: UserAgreement/Accepted was fine when implemented. One year later, we revised the agreement and people needed to accept it again, so we added a revision field to the event. Where do we put the logic that says a missing revision field actually means revision 1?
     • In another event type? (and keep track of both old and new events)
     • In the handler for the event? (everywhere we handle that event)
     • In an active migration? (mostly this)
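An active migration can be sketched as middleware around the event-reading function. Names here are hypothetical; only the “1 event in, 1 event out” shape and the revision example come from the slide.

```clojure
;; Sketch of an active migration: middleware around the function
;; that reads events. One event goes in, one event comes out.
(defn upgrade-user-agreement [event]
  (if (and (= (:type event) :user-agreement/accepted)
           (nil? (:revision event)))
    (assoc event :revision 1)  ;; old events implicitly meant revision 1
    event))

(defn wrap-migrations [read-events]
  (fn [& args]
    (map upgrade-user-agreement (apply read-events args))))
```

The stored events never change; every reader just sees the upgraded shape, which is why an active migration is easy to rewrite later.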
  41. Passive migrations
     • Stream the event store and append the events somewhere else (potentially after transforming them)
     • We also used this (with no transformation) when moving from one machine to a bigger one
     But it’s quite challenging to do on a live system:
  42. While you can rewrite an active migration (it’s just code in your repo), you can’t really go back from a passive migration. This caused us a couple of headaches just recently.
     Protip: have a bunch of consistency checks you can run before you make a definite switch.
  43. After the partitioning is done, we need to keep appending new events to the correct partitions.
  44. What about writing?

  45. Even if the event store is partitioned, we are still appending events 1 by 1, as of now.
  46. Why is it important to go faster at writing?
     No matter how good our infrastructure is, appending events 1-by-1 does not scale.
     • Current capacity is around 800 events/second
     • The more schools we have, the more users are active at the same time, the more events we need to append every second, and so on...
  47. We could append the events separately for each school!
     This is promising, because it makes our capacity scale up with the number of schools that we have. Compared to now, we could process events 250x faster!
     Other scaling advantages due to our domain:
     • Schools cannot be infinitely large
     • Students are only doing one thing at a time
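One way to picture per-school appending, using a Clojure agent per school as a stand-in for a per-partition transactor. This is a sketch: a real writer would append to that school’s PostgreSQL partition rather than to an in-memory vector.

```clojure
;; Sketch: one serialized writer per school, so appends to different
;; schools no longer contend on a single global insert-order.
(def writers (atom {}))  ;; school-id -> agent holding that school's log

(defn writer-for [school-id]
  (or (get @writers school-id)
      (get (swap! writers
                  (fn [ws]
                    (if (contains? ws school-id)
                      ws
                      (assoc ws school-id (agent [])))))
           school-id)))

(defn append-event! [event]
  ;; sends are processed in order per agent, preserving stream-order
  ;; within a school while schools proceed independently
  (send (writer-for (:school-id event)) conj event))
```

Each agent serializes its own appends, which is exactly the per-partition ordering guarantee we want; across schools there is no ordering at all, which is the drawback discussed next.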
  48. But once again, everything has a price...

  49. Drawbacks
     • No guarantees on the order of events in different partitions
     • Potentially hard / impossible to do stuff across schools, at least as we would do it now
     • One transactor per partition can complicate things a bit
  50. Summary

  51. Event sourcing is awesome because we can retrieve a lot of information from events, even retroactively. Along with benefits, event sourcing also brings challenges (scaling the reading and writing of events). We are partitioning the events by school in order to do parallel reads / writes.
  52. (image-only slide)