Event Sourcing, CQRS and scalability: a status update

Davide Taviani

April 12, 2017

Transcript

  1. Event sourcing, CQRS and scalability: a status update

  2. About me
     • From Italy
     • MSc in Mathematics (scientific / parallel computing, combinatorial optimization)
     • Joined Studyflow ~3 years ago
     • info@davidetaviani.com
  3. We are building a secondary-education platform for Dutch high schools.
     • https://www.studyflow.nl
     • We provide 2 courses (Rekenen & Taal)
     • 250+ schools, 100k+ students
  4. Our stack:
     • A Rails application used for publishing (but we’re “getting rid of it”)
     • A Clojure/ClojureScript stack (both an SPA with reagent and request/response)
     • A custom event-sourcing library: rill
     • PostgreSQL for our event store
     • Analytics with ElasticSearch and Kibana
  5. Our stack:
     • On bare metal (at Online.net)
       ◦ 1 database
       ◦ 2 application servers
       ◦ 1 load balancer + failovers
  6. Our team:
     • 3 developers
     • 1 UX designer
     • 1 product designer
     • You? We’re looking for a (frontend) developer! https://studyflow.homerun.co
  7. Event Sourcing

  8. Event sourcing
     We use domain events, i.e. we record things that happened in our domain. Such events are:
     • meaningful within the domain
     • the result of an explicit interaction
     • immutable
     Because of these properties, we consider them our only source of truth.
  9. What does an event look like?
     type QuestionAnsweredCorrectly
     timestamp $T
     question-id $Z
     student-id $X
     answer $Y
     It means: “student $X answered $Y for the question $Z at time $T”.
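In the Clojure stack, such an event can be pictured as a plain immutable map. This is only a sketch: the field names and keyword style here are assumptions, not rill’s actual representation.

```clojure
;; Sketch of a domain event as an immutable Clojure map.
;; Field names are illustrative; rill's actual shape may differ.
(def event
  {:type        :section-test/question-answered-correctly
   :timestamp   #inst "2017-04-12T10:15:00.000-00:00"
   :question-id "question-123"
   :student-id  "student-456"
   :answer      "42"})
```

Because it is just data, an event can be stored, serialized, and replayed without any behavior attached to it.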
  10. Why event sourcing?
     • Interaction and intermediate states are very important in our domain, maybe even more so than the final state (the journey vs the destination). Example: recording info about questions answered incorrectly might be more important than just knowing that a student successfully completed a chapter.
     • Events are immutable, so our system is “append only”, making reasoning easier.
     • Events as the source of truth are very useful when investigating with Kibana: we can tell exactly what has happened.
  11. Clojure fits really well

  12. From our business perspective, events can help us answer interesting questions:
     • Which questions are the most difficult?
     • How different are quick learners from slow ones?
     • What kind of mistake is the most common for a particular question?
     • Is reading an explanation (theory) after a mistake useful?
     But, more importantly:
     • A lot of things that we don’t know yet!
  13. Our event store
     • A big log where all events live together, one after the other: event 1, event 2, event 3, … event 331,999,999, event 332,000,000
     • We use PostgreSQL, and we make use of a few additional columns:
       ◦ stream-id: id of the aggregate the event refers to
       ◦ stream-order: local order within a stream
       ◦ insert-order: global order within all events
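As a sketch, the table behind this could look roughly like the following. This is hypothetical DDL: the column names follow the slide, but everything else (types, payload column, constraints) is an assumption about how rill lays things out.

```sql
-- Hypothetical events table; column names follow the slide,
-- everything else is an assumption.
CREATE TABLE events (
  insert_order BIGSERIAL PRIMARY KEY, -- global order within all events
  stream_id    TEXT    NOT NULL,      -- id of the aggregate the event refers to
  stream_order BIGINT  NOT NULL,      -- local order within a stream
  payload      JSONB   NOT NULL,      -- the serialized event itself
  UNIQUE (stream_id, stream_order)
);
```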
  14. Using the stream-id we can look up individual aggregates, such as:
     • a student practicing in a particular section
     • a student’s account information
     • course material published
     • a student doing an assessment
     (e.g. event 1 → student1, event 2 → student1, event 3 → student2, … event 331,999,999 → course material, event 332,000,000 → student1)
  15. These events can then be “replayed” in 2 ways:
     1. One after the other, to make a sort of global view of everything
     2. Selectively, for one aggregate, to create a materialized view of it
     Why do we need these two ways?
  16. CQRS: Command Query Responsibility Segregation
     • On one side we read events from start to end
     • On the other we retrieve single aggregates
  17. CQRS: Command Query Responsibility Segregation
     • Realizing that the write side (commands) and the read side (queries) have different needs
     • They can be scaled independently
     • The read side is updated asynchronously
     • The write side is updated synchronously when a command about that specific aggregate is fired
  18. The read model
     The read model is how our database would probably look if we did things traditionally. It is just a big in-memory Clojure map:
     • Memory is cheap (at least with bare metal)
     • Memory is fast
     We store in the read model all the information that we need to display to the user, e.g. all the dashboarding.
  19. How do we build the read-model? (and the aggregates too)

  20. Every event has a handler:
     Event Type                              How the event is handled
     SectionTest/QuestionAssigned            “Set the current question to X”
     SectionTest/QuestionAnsweredCorrectly   “Mark X as correct, advance progress by 1”
     SectionTest/Finished                    “Mark the section as finished”
     ChapterQuiz/Failed                      “Reset the progress in the chapter quiz”
     Student/Created                         “Create a student account”
     • handle-event is a simple multimethod dispatching on event type.
     • Our read model is built simply as: (reduce handle-event nil events)
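A minimal sketch of what this multimethod could look like. Event shapes and read-model keys are illustrative assumptions, not rill’s actual API; only the dispatch-on-type and `(reduce handle-event nil events)` pattern come from the slides.

```clojure
;; Sketch of the handler multimethod, dispatching on event type.
;; Event and read-model shapes are illustrative.
(defmulti handle-event (fn [_model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id]}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))))

(defmethod handle-event :default
  [model _event]
  model) ;; ignore events this view doesn't care about

;; The read model is just a fold over the whole event log:
(defn build-read-model [events]
  (reduce handle-event nil events))
```

Starting the reduce from `nil` works because `assoc-in`/`update-in` build the nested maps as needed.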
  21. Clojure multimethods fit really well

  22. The read model is just derived from the combination of Events + Handlers.
     Events are immutable, so we can’t do much about those. But handlers are a whole different story...
  23. Let’s assume we want to build a page which shows how much time the students spend practicing. We can change the event handlers:
     Event Type                              How the event is handled
     SectionTest/QuestionAssigned            “Set the current question to X and track activity”
     SectionTest/QuestionAnsweredCorrectly   “Mark X as correct, advance progress by 1, track activity”
     SectionTest/Finished                    “Mark the section as finished and track activity”
     ChapterQuiz/Failed                      “Reset the progress in the chapter quiz and track activity”
     Student/Created                         “Create a student account”
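Changing a handler is then just editing one defmethod. A sketch, assuming events carry a :timestamp field; here “activity” is simply a vector of timestamps per student, while real session tracking would be more involved.

```clojure
;; Sketch: the same handler, now also tracking activity.
;; Assumes events carry a :timestamp; field names are illustrative.
(defmulti handle-event (fn [_model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id timestamp]}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))
      ;; new: record when the student was active
      (update-in [:students student-id :activity] (fnil conj []) timestamp)))
```

Nothing else changes: the same `(reduce handle-event nil events)` fold now also produces the activity data.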
  24. If we rebuild the read model by replaying all events… it will now contain all the activity information, as if it had been there all along, since the beginning!
  25. But everything has a price... especially when you are starting to have quite a few events.
  26. How many events do we have?
     • 1st event: 14/08/2014
     • 100M: 15/11/2015 (+458 days)
     • 200M: 08/09/2016 (+299 days)
     • 300M: 16/01/2017 (+131 days)
     We have >1.1M events / day!
  27. >1.1M events / day! (and going up)

  28. It seems to grow exponentially

  29. Another view of our event store, with Kibana

  30. In order to rebuild the read model we then need to replay 300 million events... and it takes ~8 hours, as of now.
  31. Can we read the events faster? While we are at it, can we write the events faster?
  32. Can we scale?

  33. (image-only slide)
  34. Why is it important to go faster at reading?
     • We don’t need to prepare the read model hours in advance
     • Quicker turnaround to fix critical bugs
  35. Let’s take another look at the events. Before, we mentioned 2 ways in which we can read them. If we look at the school the students are in, for example, we can see that there is a 3rd way!
     (e.g. event 1 → school 1, event 2 → school 1, event 3 → school 2, … event 331,999,999 → school 2, event 332,000,000 → course published)
  36. Why don’t we separate the schools?
     • all the publishing stuff in the same place
     • all the administrative stuff (internal) in the same place
     • etc.
  37. Our domain helps us: we don’t have any cross-school interaction (for now), so we can replay events for different schools in parallel! Our application server has 40 threads, so a big speedup is achievable: from 8 hours to 20 minutes.
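A sketch of such a parallel replay, assuming every event carries a :school-id and that there is no cross-school interaction, so each school’s fold is independent. Here `pmap` stands in for whatever thread pool the real replayer uses.

```clojure
;; Sketch: replay each school's events in parallel, producing one
;; read model per school. Assumes events carry a :school-id.
(defn replay-all [events handle-event]
  (->> events
       (group-by :school-id)          ;; preserves per-school event order
       (pmap (fn [[school-id school-events]]
               [school-id (reduce handle-event nil school-events)]))
       (into {})))
```

Because `group-by` keeps the original order within each school, every per-school replay still sees its events in stream order.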
  38. Let’s partition the event store! Time for a new column in PostgreSQL...
  39. Event sourcing says that events are immutable. Reality does not necessarily agree, so we sometimes cheat a bit. Two ways of changing events:
     • Active migrations
     • Passive migrations
     Spoiler alert: there is a reason why they tell you not to do it.
  40. Active migrations
     We wrap the function that we use to retrieve events with some middleware: 1 event goes in, 1 event goes out.
     Example: UserAgreement/Accepted was fine when implemented. One year later, we revised the agreement and people needed to accept it again, so we added a revision field to the event. Where do we put the logic that says a missing revision field actually means revision 1?
     • In another event type? (and keep track of both old and new events)
     • In the handler for the event? (everywhere we handle that event)
     • In an active migration? (mostly this)
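An active migration can be sketched as middleware around the event-reading function. Names here are hypothetical; only the “1 event in, 1 event out” shape and the revision example come from the slide.

```clojure
;; Sketch of an active migration: middleware around the function
;; that reads events. One event goes in, one event comes out.
(defn upgrade-user-agreement [event]
  (if (and (= (:type event) :user-agreement/accepted)
           (nil? (:revision event)))
    (assoc event :revision 1)  ;; old events implicitly meant revision 1
    event))

(defn wrap-migrations [read-events]
  (fn [& args]
    (map upgrade-user-agreement (apply read-events args))))
```

The stored events never change; every reader just sees the upgraded shape, which is why an active migration is easy to rewrite later.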
  41. Passive migrations
     • Stream the event store and append the events somewhere else (potentially after transforming them)
     • We also used this (with no transformation) when moving from one machine to a bigger one
     But it’s quite challenging to do on a live system:
  42. While you can rewrite an active migration (it’s just code in your repo), you can’t really go back from a passive migration. This caused us a couple of headaches just recently.
     Protip: have a bunch of consistency checks you can run before you make a definite switch.
  43. After the partitioning is done, we need to keep appending new events to the correct partitions.
  44. What about writing?

  45. Even if the event store is partitioned, we are still appending events 1 by 1, as of now.
  46. Why is it important to go faster at writing?
     No matter how good our infrastructure is, appending events 1-by-1 does not scale.
     • Current capacity is around 800 events/second
     • The more schools we have, the more users are active at the same time, the more events we need to append every second, and so on...
  47. We could append the events separately for each school!
     This is promising, because it makes our capacity scale up with the number of schools that we have. Compared to now, we could process events 250x faster!
     Other scaling advantages due to our domain:
     • Schools cannot be infinitely large
     • Students are only doing one thing at a time
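One way to picture per-school appending, using a Clojure agent per school as a stand-in for a per-partition transactor. This is a sketch: a real writer would append to that school’s PostgreSQL partition rather than to an in-memory vector.

```clojure
;; Sketch: one serialized writer per school, so appends to different
;; schools no longer contend on a single global insert-order.
(def writers (atom {}))  ;; school-id -> agent holding that school's log

(defn writer-for [school-id]
  (or (get @writers school-id)
      (get (swap! writers
                  (fn [ws]
                    (if (contains? ws school-id)
                      ws
                      (assoc ws school-id (agent [])))))
           school-id)))

(defn append-event! [event]
  ;; sends are processed in order per agent, preserving stream-order
  ;; within a school while schools proceed independently
  (send (writer-for (:school-id event)) conj event))
```

Each agent serializes its own appends, which is exactly the per-partition ordering guarantee we want; across schools there is no ordering at all, which is the drawback discussed next.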
  48. But once again, everything has a price...

  49. Drawbacks
     • No guarantees on the order of events in different partitions
     • Potentially hard / impossible to do stuff across schools, at least as we would do it now
     • One transactor per partition can complicate things a bit
  50. Summary

  51. Event sourcing is awesome because we can retrieve a lot of information from events, even retroactively. Along with benefits, event sourcing also brings challenges (scaling the reading and writing of events). We are partitioning the events by school in order to do parallel reads / writes.
  52. (image-only slide)