Event Sourcing, CQRS and scalability: a status update

Davide Taviani

April 12, 2017

Transcript

  1. Event sourcing, CQRS and
    scalability: a status update


  2. About me
    ● From Italy
    ● MSc in Mathematics (scientific / parallel computing, combinatorial
    optimization)
    ● Joined Studyflow ~3 years ago
    [email protected]


  3. We are building a secondary education platform for Dutch high schools.
    ● https://www.studyflow.nl
    ● We provide 2 courses (Rekenen & Taal)
    ● 250+ schools, 100k+ students


  4. Our stack:
    ● A Rails application used for publishing (but we’re “getting rid of it”)
    ● Clojure/ClojureScript stack (both SPA with reagent and
    request/response)
    ● A custom event sourcing library: rill
    ● PostgreSQL for our EventStore
    ● Analytics with ElasticSearch and Kibana


  5. Our stack:
    ● On bare metal (on Online.net)
    ○ 1 Database
    ○ 2 Application servers
    ○ 1 Load balancer
    + Failovers


  6. Our team:
    ● 3 Developers
    ● 1 UX designer
    ● 1 Product designer
    ● You?
    We’re looking for a (Frontend) Developer!
    https://studyflow.homerun.co


  7. Event Sourcing


  8. Event sourcing
    We use domain events, i.e. we record things that
    happened in our domain.
    Such events are:
    ● meaningful within the domain
    ● the result of an explicit interaction
    ● immutable
    Because of these properties, we consider them our only source of
    truth


  9. What does an event look like?
    type QuestionAnsweredCorrectly
    timestamp $T
    question-id $Z
    student-id $X
    answer $Y
    It means: “student $X answered $Y for the question $Z
    at time $T”.


  10. Why event sourcing?
    ● Interaction and intermediate states are very important in our domain, maybe
    even more so than the final state (the journey vs the destination).
    Example: Recording info about questions answered incorrectly might be more
    important than just knowing that a student successfully completed a chapter.
    ● Events are immutable, so our system is “append only”, making reasoning easier
    ● Events as source of truth are very useful when investigating with Kibana: we can
    tell exactly what has happened


  11. Clojure fits really well


  12. From our business perspective, events can help us answer
    interesting questions:
    ● Which questions are the most difficult?
    ● How different are quick learners from slow ones?
    ● What kind of mistake is the most common for a particular question?
    ● Is reading an explanation (theory) after a mistake useful?
    But, more importantly:
    ● A lot of things that we don’t know yet!


  13. event 1
    event 2
    event 3
    event 4
    event 5
    event 6
    …..
    event 331.999.997
    event 331.999.998
    event 331.999.999
    event 332.000.000
    Our event-store
    ● A big log where all events are
    together, one after the other
    ● We use PostgreSQL and we make use
    of a few additional columns
    stream-id: id of the aggregate the event refers to
    stream-order: local order within a stream
    insert-order: global order across all events


  14. Event Stream-id
    event 1 student1
    event 2 student1
    event 3 student2
    event 4
    ….
    event 331.999.998 student3
    event 331.999.999 course material
    event 332.000.000 student1
    Using the stream-id we can look up
    individual aggregates, such as:
    ● a student practicing in a particular
    section
    ● a student’s account information
    ● published course material
    ● a student doing an assessment


  15. These events can then be “replayed” in 2 ways:
    1. One after the other, to make a sort
    of global view of everything
    2. Selectively for one aggregate, to
    create a materialized view of it
    event 1 student 1
    event 2 student 1
    event 3 student 2
    event 4
    ….
    event 331.999.998 student 3
    event 331.999.999 course material
    event 332.000.000 student 1
    Why do we need these two ways?
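    The two replay modes can be sketched as two small Clojure functions
    over a seq of event maps carrying the columns described earlier
    (the :stream-id, :stream-order and :insert-order keys match our
    columns; the handler is any reducing function):

```clojure
;; Sketch of the two replay modes (shapes are illustrative).
(defn replay-all
  "1. Replay everything in global order: a view of the whole system."
  [handle-event events]
  (reduce handle-event nil (sort-by :insert-order events)))

(defn replay-aggregate
  "2. Replay one stream only: a materialized view of a single aggregate."
  [handle-event events stream-id]
  (->> events
       (filter #(= stream-id (:stream-id %)))
       (sort-by :stream-order)
       (reduce handle-event nil)))
```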


  16. CQRS: Command Query Responsibility Segregation
    Here we read events
    from start to end
    Here we retrieve single
    aggregates


  17. CQRS: Command Query Responsibility Segregation
    ● Realizing that the write side (commands) and the read side (queries)
    have different needs
    ● They can be scaled independently
    ● The read side is updated asynchronously
    ● The write side is updated synchronously when a command about a
    specific aggregate is fired


  18. The read model
    The read model is how our database would probably look if we did things
    traditionally.
    It is just a big in-memory Clojure map:
    ● Memory is cheap (at least with bare metal)
    ● Memory is fast
    We store in the read model all the information that we need to display to the user,
    e.g. all the dashboarding.


  19. How do we build the read-model?
    (and the aggregates too)


  20. Every event has a handler:
    Event Type How the event is handled
    SectionTest/QuestionAssigned “Set the current question to X”
    SectionTest/QuestionAnsweredCorrectly “Mark X as correct, advance progress by 1”
    SectionTest/Finished “Mark the section as finished”
    ChapterQuiz/Failed “Reset the progress in the chapter quiz”
    Student/Created “Create a student account”
    ● handle-event is a simple multimethod dispatching on event type.
    ● Our read-model is built simply as:
    (reduce handle-event nil events)
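    The handler table above can be sketched as a runnable multimethod
    (the event shapes and read-model layout are illustrative, not our
    production schema):

```clojure
;; Sketch: one multimethod dispatching on event type.
(defmulti handle-event (fn [_read-model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [read-model {:keys [student-id question-id]}]
  (-> read-model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))))

(defmethod handle-event :section-test/finished
  [read-model {:keys [student-id section-id]}]
  (assoc-in read-model [:students student-id :finished section-id] true))

;; Events we don't (yet) care about leave the model untouched:
(defmethod handle-event :default [read-model _event] read-model)

;; The whole read-model is just a fold over the event log:
(defn build-read-model [events]
  (reduce handle-event nil events))
```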


  21. Clojure multimethods
    fit really well


  22. The read model is just derived from the combination
    of Events + Handlers
    Events are immutable, so we can’t do
    much about those.
    But handlers are a whole different
    story...


  23. Let’s assume we want to build a page which shows how much time the
    students spend practicing.
    We can change the event handlers:
    Event Type How the event is handled
    SectionTest/QuestionAssigned “Set the current question to X and track
    activity”
    SectionTest/QuestionAnsweredCorrectly “Mark X as correct, advance progress by 1, track
    activity”
    SectionTest/Finished “Mark the section as finished and track activity”
    ChapterQuiz/Failed “Reset the progress in the chapter quiz and
    track activity”
    Student/Created “Create a student account”
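    The change amounts to extending a handler; a sketch of one updated
    handler, now also tracking activity timestamps (the :activity
    bookkeeping is illustrative):

```clojure
;; Sketch: the same handler as before, extended to track activity.
(defmulti handle-event (fn [_read-model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [read-model {:keys [student-id question-id timestamp]}]
  (-> read-model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))
      ;; the new part: record when the student was active
      (update-in [:students student-id :activity] (fnil conj []) timestamp)))
```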


  24. If we rebuild the read model by replaying all
    events…
    it will now contain all the activity
    information as if it had been there all along,
    since the beginning!


  25. But everything has a price...
    … especially when you are
    starting to have quite a few
    events


  26. How many events do we have?
    ● 1st: 14/08/2014
    ● 100M: 15/11/2015 (+458 days)
    ● 200M: 08/09/2016 (+299 days)
    ● 300M: 16/01/2017 (+131 days)
    We have
    >1.1M events / day!


  27. >1.1M events / day!
    (and going up)


  28. It seems to grow exponentially


  29. Another view of our event store with Kibana


  30. In order to rebuild the read model
    we then need to replay 300 million
    events...
    … and it takes ~ 8
    hours, as of now


  31. Can we read the events faster?
    And while we are at it, can we write the events
    faster?


  32. Can we scale?


  33. Why is it important to read faster?
    ● We don’t need to prepare a new read model hours in advance
    ● Quicker turnaround when fixing critical bugs


  34. Let’s take a look again at the events
    event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    …..
    event 331.999.997 school 2
    event 331.999.998 school 1
    event 331.999.999 school 2
    event 332.000.000 course published
    Before, we mentioned 2 ways in
    which we can read these events.
    If we look at the school in which
    the students are, for example,
    we can see that there is a 3rd
    way!


  35. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Why don’t we separate the
    schools?
    ● all the publishing stuff in the same place
    ● all the administrative stuff (internal) in
    the same place
    ● etc.


  36. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Our domain helps us: we don’t have any cross-school interaction (for now), so
    we can replay events for different schools in parallel!
    Our application server has 40 threads, so a
    big speedup is achievable:
    from 8 hours down to ~20 minutes
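    The parallel replay can be sketched as one reduction per school
    partition, merged at the end (the :school-id key and the
    one-model-per-school shape are illustrative):

```clojure
;; Sketch: group events by school, fold each partition on its own
;; thread via pmap, keep one read-model per school.
(defn replay-by-school [handle-event events]
  (->> (group-by :school-id events)
       (pmap (fn [[school-id school-events]]
               [school-id
                (reduce handle-event nil
                        (sort-by :insert-order school-events))]))
       (into {})))
```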


  37. Let’s partition the event store!
    Time for a new column in PostgreSQL...


  38. Event sourcing says that events are immutable.
    Reality does not necessarily agree, so we sometimes cheat a bit.
    Two ways of changing events:
    ● Active migrations
    ● Passive migrations
    Spoiler alert: there is a reason why they tell you not to do it


  39. Active migrations
    We wrap the function that we use to retrieve events with some middleware:
    1 event goes in, 1 event goes out.
    Example:
    UserAgreement/Accepted was fine when implemented.
    One year later, we revised the agreement and people needed to accept it again, so we
    added a revision field to the event.
    Where do we put the logic that a missing revision field actually means revision 1?
    ● In another event type (and keep track of both old and new events)
    ● In the handler for the event (everywhere we handle that event)
    ● In an active migration (mostly what we do)
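    An active migration can be sketched as middleware around event
    retrieval: one event in, one event out, defaulting a missing
    :revision to 1 (function and key names are illustrative, not rill's
    API):

```clojure
;; Sketch: active migration as retrieval middleware.
(defn assume-revision-1 [event]
  (if (and (= :user-agreement/accepted (:type event))
           (not (contains? event :revision)))
    (assoc event :revision 1)
    event))

(defn wrap-migrations
  "Wraps the function used to retrieve events; 1 event in, 1 event out."
  [retrieve-events]
  (fn [& args]
    (map assume-revision-1 (apply retrieve-events args))))
```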


  40. Passive migrations
    ● Stream the event store and append
    events somewhere else (potentially
    after transforming them)
    ● We also used this (with no transformation)
    when moving from one machine to a bigger
    machine
    But it’s quite challenging to do on a live
    system…


  41. While you can rewrite an active migration (it’s just code in your repo), you can’t really
    go back from a passive migration.
    This has caused us a couple of headaches quite recently.
    Pro tip: have a bunch of consistency checks you can run before you make the definitive
    switch


  42. After the partition is done, we need to
    keep adding events in the correct
    partitions


  43. What about writing?


  44. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Even if the event-store is
    partitioned, we are still appending
    events one by one, as of now
    event 8 school 1
    event 9 school 2
    event 10 school 1


  45. Why is it important to write faster?
    No matter how good our infrastructure is,
    appending events one by one does not scale
    ● Current capacity is around 800
    events/second
    ● The more schools we have, the more
    users are active at the same time, the
    more events we need to append every
    second, and so on...


  46. We could append the events separately for each school!
    event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    This is promising, because it makes our
    capacity scale up with the number of
    schools that we have.
    Compared to now, we could process
    events 250x faster!
    event 8 school 1
    event 9 school 2
    event 10 school 1
    Other scaling advantages due to our
    domain:
    ● Schools cannot be infinitely large
    ● Students are only doing one thing at a
    time
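    The per-school appending can be sketched with one serialized writer
    per partition, modelled here with a Clojure agent per school (an
    illustration of the idea only; in reality this would be one writer
    per PostgreSQL partition):

```clojure
;; Sketch: one agent per school acts as a serialized append log.
(def school-logs (atom {}))

(defn log-for [school-id]
  ;; create the agent for this school on first use
  (-> (swap! school-logs update school-id #(or % (agent [])))
      (get school-id)))

(defn append-event! [event]
  ;; different schools append in parallel;
  ;; within one school, order is preserved
  (send-off (log-for (:school-id event)) conj event))
```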


  47. But once again, everything has a price...


  48. Drawbacks
    ● No guarantees on the order of events in different partitions
    ● Potentially hard / impossible to do stuff across schools, at
    least how we would do it now
    ● One transactor per partition can complicate it a bit


  49. Event sourcing is awesome because we can retrieve a lot of information from
    events, even retroactively.
    Along with benefits, event sourcing also brings challenges (scaling the
    reading and writing of events).
    We are partitioning the events by school in order to do parallel reads
    and writes
