
Event Sourcing, CQRS and scalability: a status update


Davide Taviani

April 12, 2017

Transcript

  1. Event sourcing, CQRS and
    scalability: a status update


  2. About me
    ● From Italy
    ● MSc in Mathematics (scientific / parallel computing, combinatorial
    optimization)
    ● Joined Studyflow ~3 years ago
    [email protected]


  3. We are building a secondary education platform for Dutch high schools.
    ● https://www.studyflow.nl
    ● We provide 2 courses (Rekenen & Taal)
    ● 250+ schools, 100k+ students


  4. Our stack:
    ● A Rails application used for publishing (but we’re “getting rid of it”)
    ● Clojure/ClojureScript stack (both SPA with reagent and
    request/response)
    ● A custom event-sourcing library: rill
    ● PostgreSQL as our event store
    ● Analytics with ElasticSearch and Kibana


  5. Our stack:
    ● On bare metal (on Online.net)
    ○ 1 Database
    ○ 2 Application servers
    ○ 1 Load balancer
    + Failovers


  6. Our team:
    ● 3 developers
    ● 1 UX designer
    ● 1 Product designer
    ● You?
    We’re looking for a (Frontend) Developer!
    https://studyflow.homerun.co


  7. Event Sourcing


  8. Event sourcing
    We use domain events, i.e. we record things that
    happened in our domain.
    Such events are:
    ● meaningful within the domain
    ● the result of an explicit interaction
    ● immutable
    Because of these properties, we consider them our only source of
    truth


  9. What does an event look like?
    type QuestionAnsweredCorrectly
    timestamp $T
    question-id $Z
    student-id $X
    answer $Y
    It means: “student $X answered $Y to question $Z
    at time $T”.
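In Clojure such an event is naturally a plain map. A hypothetical sketch (the exact keys and values are assumptions, not our production schema):

```clojure
;; Hypothetical sketch of the event as a plain Clojure map; the exact
;; keys are assumptions, not our production schema.
(def example-event
  {:type        :section-test/question-answered-correctly
   :timestamp   #inst "2017-04-12T10:15:00.000-00:00"
   :question-id "question-z"
   :student-id  "student-x"
   :answer      "42"})
```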


  10. Why event sourcing?
    ● Interaction and intermediate states are very important in our domain, maybe
    even more so than the final state (the journey vs the destination).
    Example: Recording info about questions answered incorrectly might be more
    important than just knowing that a student successfully completed a chapter.
    ● Events are immutable, so our system is “append only”, making reasoning easier
    ● Events as source of truth are very useful when investigating with Kibana: we can
    tell exactly what has happened


  11. Clojure fits really well


  12. From our business perspective, events can help us answer
    interesting questions:
    ● Which questions are the most difficult?
    ● How different are quick learners from slow ones?
    ● What kind of mistake is the most common for a particular question?
    ● Is reading an explanation (theory) after a mistake useful?
    But, more importantly:
    ● A lot of things that we don’t know yet!


  13. Our event store
    ● A big log where all events are together, one after the other:
    event 1, event 2, event 3, … event 331.999.999, event 332.000.000
    ● We use PostgreSQL and we make use of a few additional columns:
    stream-id Id of the aggregate the event refers to
    stream-order Local order within a stream
    insert-order Global order within all events


  14. Event Stream-id
    event 1 student1
    event 2 student1
    event 3 student2
    event 4
    ….
    event 331.999.998 student3
    event 331.999.999 course material
    event 332.000.000 student1
    Using the stream-id we can look up
    individual aggregates, such as:
    ● a student practicing in a particular
    section
    ● a student account information
    ● course material published
    ● a student doing an assessment


  15. These events can then be “replayed” in two ways:
    1. One after the other, to make a sort of global view of everything
    2. Selectively for one aggregate, to create a materialized view of it
    event 1 student 1
    event 2 student 1
    event 3 student 2
    event 4
    ….
    event 331.999.998 student 3
    event 331.999.999 course material
    event 332.000.000 student 1
    Why do we need these two ways?
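Both replay modes are folds over the same events; only the event selection differs. A sketch, where `fetch-all-events`, `fetch-stream` and `handle-event` are assumed helpers, not rill's actual API:

```clojure
;; Sketch only: `fetch-all-events`, `fetch-stream` and `handle-event`
;; are assumed helpers, not rill's actual API.
(declare fetch-all-events fetch-stream handle-event)

;; 1. Replay everything in insert-order to build a global view:
(defn build-read-model []
  (reduce handle-event nil (fetch-all-events)))

;; 2. Replay one stream in stream-order to materialize a single aggregate:
(defn load-aggregate [stream-id]
  (reduce handle-event nil (fetch-stream stream-id)))
```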


  16. CQRS: Command Query Responsibility Segregation
    ● Read side: here we read events from start to end
    ● Write side: here we retrieve single aggregates


  17. CQRS: Command Query Responsibility Segregation
    ● The write side (commands) and the read side (queries) have different
    needs
    ● They can be scaled independently
    ● The read side is updated asynchronously
    ● The write side is updated synchronously when a command about that
    specific aggregate is fired


  18. The read model
    The read model is what our database would probably look like if we did things
    traditionally.
    It is just a big in-memory Clojure map:
    ● Memory is cheap (at least with bare metal)
    ● Memory is fast
    We store in the read model all the information that we need to display to the user, e.g.
    all the dashboard data.


  19. How do we build the read-model?
    (and the aggregates too)


  20. Every event has a handler:
    Event Type How the event is handled
    SectionTest/QuestionAssigned “Set the current question to X”
    SectionTest/QuestionAnsweredCorrectly “Mark X as correct, advance progress by 1”
    SectionTest/Finished “Mark the section as finished”
    ChapterQuiz/Failed “Reset the progress in the chapter quiz”
    Student/Created “Create a student account”
    ● handle-event is a simple multimethod dispatching on event type.
    ● Our read-model is built simply as:
    (reduce handle-event nil events)
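A minimal sketch of what such a multimethod could look like (the event shapes and read-model layout are illustrative, not our production code):

```clojure
;; Minimal sketch; event shapes and read-model layout are illustrative,
;; not our production code.
(defmulti handle-event (fn [_model event] (:type event)))

(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id]}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))))

(defmethod handle-event :student/created
  [model {:keys [student-id name]}]
  (assoc-in model [:students student-id] {:name name :progress 0}))

;; Unknown event types leave the model untouched.
(defmethod handle-event :default [model _event] model)

;; The read model is then just a fold over the log:
;; (reduce handle-event nil events)
```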


  21. Clojure multimethods
    fit really well


  22. The read model is just derived from the combination
    of Events + Handlers
    Events are immutable, so we can’t do
    much about those.
    But handlers are a whole different
    story...


  23. Let’s assume we want to build a page which shows how much time the
    students spend practicing.
    We can change the event handlers:
    Event Type How the event is handled
    SectionTest/QuestionAssigned “Set the current question to X and track activity”
    SectionTest/QuestionAnsweredCorrectly “Mark X as correct, advance progress by 1, track activity”
    SectionTest/Finished “Mark the section as finished and track activity”
    ChapterQuiz/Failed “Reset the progress in the chapter quiz and track activity”
    Student/Created “Create a student account”
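Assuming the handle-event multimethod from slide 20, the change is local to the handlers; `track-activity` here is a hypothetical helper:

```clojure
;; Sketch: `track-activity` is a hypothetical helper that records the
;; event's timestamp against the student's practice activity.
(defn track-activity [model {:keys [student-id timestamp]}]
  (update-in model [:students student-id :activity] (fnil conj []) timestamp))

;; The existing handler just gains one extra step:
(defmethod handle-event :section-test/question-answered-correctly
  [model {:keys [student-id question-id] :as event}]
  (-> model
      (assoc-in [:students student-id :answers question-id] :correct)
      (update-in [:students student-id :progress] (fnil inc 0))
      (track-activity event)))
```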


  24. If we rebuild the read model by replaying all
    events, it will now contain all the activity
    information as if it had been there all along!


  25. But everything has a price...
    … especially when you are
    starting to have quite a few
    events


  26. How many events do we have?
    ● 1st: 14/08/2014
    ● 100M: 15/11/2015 (+458 days)
    ● 200M: 08/09/2016 (+299 days)
    ● 300M: 16/01/2017 (+131 days)
    We have
    >1.1M events / day!


  27. >1.1M events / day!
    (and going up)


  28. It seems to grow exponentially


  29. Another view of our event store with Kibana


  30. In order to rebuild the read model
    we then need to replay 300 million
    events…
    and it currently takes ~8 hours


  31. Can we read the events faster?
    While we are at it, can we write the events
    faster?


  32. Can we scale?



  34. Why is it important to go faster at reading?
    ● We don’t need to prepare it hours in advance
    ● Quicker turnaround to fix critical bugs


  35. Let’s take a look again at the events
    event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    …..
    event 331.999.997 school 2
    event 331.999.998 school 1
    event 331.999.999 school 2
    event 332.000.000 course published
    Before, we mentioned 2 ways in
    which we can read these events.
    If we look at the school in which
    the students are, for example,
    we can see that there is a 3rd
    way!


  36. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Why don’t we separate the
    schools?
    ● all the publishing stuff in the same place
    ● all the administrative stuff (internal) in
    the same place
    ● etc.


  37. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Our domain helps us: we don’t have any cross-school interaction (for now), so
    we can replay events for different schools in parallel!
    Our application server has 40 threads, so a
    big speedup is achievable:
    from 8 hours down to 20 minutes
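Since schools never interact, per-school replays are independent folds and can be sketched with pmap; `school-ids`, `fetch-school-events` and `handle-event` are assumed helpers:

```clojure
;; Sketch only: `school-ids`, `fetch-school-events` and `handle-event`
;; are assumed helpers. Each school's events are folded independently,
;; so partitions can be replayed in parallel.
(declare school-ids fetch-school-events handle-event)

(defn build-read-models []
  (->> (school-ids)
       (pmap (fn [school-id]
               [school-id
                (reduce handle-event nil (fetch-school-events school-id))]))
       (into {})))
```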


  38. Let’s partition the event store!
    Time for a new column in PostgreSQL...


  39. Event sourcing says that events are immutable.
    Reality does not necessarily agree, so we sometimes cheat a bit.
    Two ways of changing events:
    ● Active migrations
    ● Passive migrations
    Spoiler alert: there is a reason why they tell you not to do it


  40. Active migrations
    We wrap the function that we use to retrieve events with some middleware
    1 event goes in, 1 event goes out
    Example:
    UserAgreement/Accepted was fine when implemented.
    A year later, we revised the agreement and people needed to accept it again, so we
    added a revision field to the event.
    Where do we put the logic that a missing revision field actually means revision 1?
    ● Add another event type (and keep track of both old and new events)
    ● Handler for the event (everywhere we handle that event)
    ● In an active migration (mostly)
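An active migration like this can be sketched as middleware over event retrieval; `fetch-events` is an assumed function, not rill's actual API:

```clojure
;; Sketch of an active migration: wrap event retrieval so that old
;; UserAgreement/Accepted events without a :revision field come out
;; as revision 1. `fetch-events` is an assumed retrieval function.
(declare fetch-events)

(defn with-default-revision [event]
  (if (and (= (:type event) :user-agreement/accepted)
           (nil? (:revision event)))
    (assoc event :revision 1)
    event))

;; One event goes in, one event comes out:
(defn fetch-events-migrated [& args]
  (map with-default-revision (apply fetch-events args)))
```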


  41. Passive migrations
    ● Stream the event store and append
    events somewhere else (potentially
    after transforming them)
    ● We also used this (with no transformation)
    when moving from one machine to a bigger
    machine
    But it is quite challenging to do on a live
    system


  42. While you can rewrite an active migration (it’s just code in your repo) you can’t really
    go back from a passive migration.
    This caused us a couple of headaches just recently.
    Pro tip: have a set of consistency checks you can run before you make the final
    switch


  43. After the partition is done, we need to
    keep adding events in the correct
    partitions


  44. What about writing?


  45. event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    Even if the event store is
    partitioned, we are still appending
    events one by one, as of now
    event 8 school 1
    event 9 school 2
    event 10 school 1


  46. Why is it important to go faster at writing?
    No matter how good our infrastructure is,
    appending events one by one does not scale
    ● Current capacity is around 800
    events/second
    ● The more schools we have, the more
    users active at the same time, the
    more events we need to append every
    second, and so on...


  47. We could append the events separately for each school!
    event 1 school 1
    event 2 school 1
    event 3 school 2
    event 4 school 1
    event 5 school 3
    event 6 school 3
    event 7 school 2
    This is promising, because it makes our
    capacity scale up with the number of
    schools that we have.
    Compared to now, we could process
    events 250x faster!
    event 8 school 1
    event 9 school 2
    event 10 school 1
    Other scaling advantages due to our
    domain:
    ● Schools cannot be infinitely large
    ● Students are only doing one thing at a
    time
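One way to sketch per-school appends in Clojure is a serial writer (an agent) per school, so different schools append concurrently while each partition stays ordered; `append-to-store!` is an assumed side-effecting write, not our implementation:

```clojure
;; Sketch: one agent per school serializes appends within a partition,
;; while different partitions append concurrently.
;; `append-to-store!` is an assumed side-effecting write.
(declare append-to-store!)

(def school-writers (atom {}))

(defn writer-for [school-id]
  ;; Create the school's agent on first use, then reuse it.
  (get (swap! school-writers
              (fn [writers]
                (if (contains? writers school-id)
                  writers
                  (assoc writers school-id (agent nil)))))
       school-id))

(defn append-event! [school-id event]
  (send-off (writer-for school-id)
            (fn [_] (append-to-store! school-id event))))
```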


  48. But once again, everything has a price...


  49. Drawbacks
    ● No guarantees on the order of events in different partitions
    ● Potentially hard / impossible to do things across schools, at
    least the way we would do it now
    ● One transactor per partition can complicate things a bit


  50. Summary


  51. Event sourcing is awesome because we can retrieve a lot of information from
    events, even retroactively
    Along with benefits, event sourcing also brings challenges (scaling the
    reading and writing of events)
    We are partitioning the events by school, in order to do parallel reads and
    writes

