Slide 1

Slide 1 text

Reactive Amsterdam February 2017 Maintaining Simplicity in Log Based Architectures

Slide 2

Slide 2 text

About Friso van Vollenhoven
- Mostly worked in software dev and related roles.
- Former CTO at a (big) data analytics and machine learning company.
- Now CTO at FashionTrade.
- I am the proud owner of a three-character Twitter handle: @fzk.
- I have 19 endorsements for Awesomeness on LinkedIn.

Slide 3

Slide 3 text

About FashionTrade
- A B2B platform for fashion wholesale.
- Fashion brands and retailers can connect and do business.
- E-commerce for (fashion) businesses.
- Tagline: “We simplify wholesale so you can Connect, Trade & Grow.”

Slide 4

Slide 4 text

No content

Slide 5

Slide 5 text

Reactive?

Slide 6

Slide 6 text

State and Truth

Slide 7

Slide 7 text

The bigger picture

Slide 8

Slide 8 text

The life of a product
- Product information enters the brand integration API
- Validation
- Product information is merged with existing data that applies (price lists, stock levels, existing images, etc.)
- Product enters the search engine, but only if complete (i.e. it has a known price, availability, etc.)
- Product information is used for orders, confirmations, etc.

Slide 9

Slide 9 text

The life of a product

Slide 10

Slide 10 text

State and Truth Examples
- Truth: product information (master data, stock levels, prices)
- Derived state: ElasticSearch index, merged (derived) product information, product image sets (thumbnails), calculated product metadata

Slide 11

Slide 11 text

State and Truth Examples
- Truth: brand reached out to retailer to connect
- State: match making in progress between brand and retailer
- On successful match making workflow, new truth: connection established between brand and retailer

Slide 12

Slide 12 text

State vs. Truth
- Derived state is cheap: it can be reconstructed based on truth, which allows for agility (e.g. build a new ElasticSearch index based on existing product information)
- Truth is hard to reconstruct: it comes from external sources, so invest more in design thinking there
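A minimal sketch of the "derived state is cheap" idea (event shapes are hypothetical, not FashionTrade's actual schema): a search index can be rebuilt at any time by replaying the log of truth events from the beginning.

```python
# Rebuild derived state (a toy "search index") by replaying truth events.
# Event shapes here are made up, for illustration only.

def rebuild_index(truth_log):
    """Fold a log of product events into a dict keyed by product id."""
    index = {}
    for event in truth_log:
        product = index.setdefault(event["id"], {})
        product.update(event["fields"])  # later events win per field
    # Only complete products (known price and stock) enter the index.
    return {pid: p for pid, p in index.items()
            if "price" in p and "stock" in p}

truth_log = [
    {"id": "sku-1", "fields": {"name": "Jacket", "price": 120}},
    {"id": "sku-2", "fields": {"name": "Scarf"}},   # incomplete: no price
    {"id": "sku-1", "fields": {"stock": 7}},        # later enrichment
]
index = rebuild_index(truth_log)
```

Because the truth log is the input, a new index schema is just a new fold over the same log; the old index can be thrown away.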

Slide 13

Slide 13 text

Services

Slide 14

Slide 14 text

Why services
- Separates business concerns
- Allows people to work on many things concurrently
- Works well with a log-based data architecture
- Makes for organisational scalability, at the cost of added complexity in delivery

Slide 15

Slide 15 text

However...

Slide 16

Slide 16 text

No content

Slide 17

Slide 17 text

Services and Dependencies: local state
- Services maintain local state, managed autonomously by the service internally
- 99% use CRUD; some use a light version of Event Sourcing

Slide 18

Slide 18 text

Services and Dependencies: local state
- Technologies: Google Cloud Datastore and ElasticSearch
- The former is often authoritative; the latter is not allowed to be (it can only contain derived state)

Slide 19

Slide 19 text

Services and Dependencies: communication
- Services depend on other services’ state
- Delivery is managed by the transport; contents are provided by the producing service
- Synchronous call (RPC) or asynchronous event (messaging)

Slide 20

Slide 20 text

A Word on Sync vs. Async
- It is sometimes said that sync dependencies are a bad thing
- It’s so bad that Google released an open source framework for creating sync / RPC dependencies based on their internal experiences (gRPC, http://www.grpc.io/)
- Perhaps it’s not about sync vs. async; it’s about autonomy of clients with respect to servers
- Update client and server side independently
- This equally holds for producers / consumers

Slide 21

Slide 21 text

Services and Dependencies: communication
- Technologies: HTTP-based RPC (based on Swagger and client code generation) and messages on Kafka
- Payload is JSON for both (more on that later)
- A typical path for a service is to use RPC first to get things done, and later move to a Kafka-based dependency and maintain local state

Slide 22

Slide 22 text

Enter Kafka

Slide 23

Slide 23 text

Why do we all like Kafka so much?
- The Kafka protocol is pipelined over TCP: outstanding requests queue up in the TCP send buffer, and TCP already has back pressure (congestion control, slow start)
- Kafka producer: async in the happy flow, becomes blocking under upstream contention (back pressure again)
- Kafka consumer: poll based, not push (the client requests the next batch of messages)
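A toy illustration of the poll model (an in-memory log standing in for a partition, no real Kafka client): the consumer asks for the next batch at its own pace, which is what gives consumption natural back pressure.

```python
# Toy poll-based consumption: the consumer pulls batches at its own pace
# instead of having messages pushed at it. In-memory log, made-up data.

log = [f"msg-{i}" for i in range(10)]   # stand-in for a Kafka partition

def poll(log, offset, max_records=4):
    """Return the next batch starting at `offset`, poll-style."""
    batch = log[offset:offset + max_records]
    return batch, offset + len(batch)

offset = 0
consumed = []
while offset < len(log):
    batch, offset = poll(log, offset)   # the consumer decides when to ask
    consumed.extend(batch)              # ...and processes at its own speed
```

A slow consumer simply polls less often; nothing upstream has to buffer on its behalf beyond the log itself.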

Slide 24

Slide 24 text

Also...
- Utilizes zero-copy send for file transfer (can saturate a 1Gb link)
- Configurable, durable retention
- Configurable per-message durability (# of acking nodes)
- Queue compaction: saves the last seen message for each key
- Local ordering guarantees: important when using idempotent messages
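Queue compaction can be pictured as keeping only the latest message per key while preserving log order; a small simulation (keys and values are made up):

```python
# Simulate Kafka log compaction: keep only the last message for each key,
# ordered by where that surviving message sat in the original log.
# Keys/values are illustrative only.

def compact(log):
    """Return the compacted log: latest value per key, in log order."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later offsets overwrite earlier
    return [(key, value)
            for key, (offset, value) in sorted(latest.items(),
                                               key=lambda kv: kv[1][0])]

log = [("sku-1", "v1"), ("sku-2", "v1"), ("sku-1", "v2")]
compacted = compact(log)
```

A new consumer replaying a compacted topic still sees the latest state for every key, without reading every historical update.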

Slide 25

Slide 25 text

Distributed, Persistent, Reactive… BINGO!
- Kafka is as close to a reactive stream as it gets
- Use that; no need for an additional abstraction
- Use it for streaming only; keep other paradigms for local state

Slide 26

Slide 26 text

Things we do with Kafka
- A service that consumes products, stock levels and price lists, and produces enriched product information on a separate topic to be used for search, orders, product details display, etc.
- When we deploy a new version of search, we build a new index by deploying the new search service with a new consumer group and resetting the consumer offset
- We can translate any business entity creation, update or deletion into a business metric, using a simple consumer that translates events into a metrics aggregation for Datadog

Slide 27

Slide 27 text

Maintaining Autonomy

Slide 28

Slide 28 text

Solution 1: Strict Versioning
- The server side needs to support all versions that ever existed; the client sticks to a specific version of the service
- + Easy to understand
- - Simple / small changes are costly

Slide 29

Slide 29 text

Solution 2: Schema evolution
- When adding a new field to an entity, it must be optional
- Removing fields from the schema can’t be done, but a producer can stop populating optional fields
- Readers / consumers / clients must have sensible handling of empty optionals: usually default values, sometimes different behaviour
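A sketch of the reader side (field names and defaults are hypothetical): a consumer that fills in defaults for absent optional fields can read records from both old and new producers.

```python
# Schema-evolution-friendly reader: absent optional fields fall back to
# defaults, so old and new producers can coexist. Field names are made up.

DEFAULTS = {"currency": "EUR", "discount": 0}

def read_price(record):
    """Project a raw record onto the fields this consumer cares about,
    applying defaults for optional fields the producer omitted."""
    return {
        "sku": record["sku"],        # required field
        "amount": record["amount"],  # required field
        "currency": record.get("currency", DEFAULTS["currency"]),
        "discount": record.get("discount", DEFAULTS["discount"]),
    }

old = read_price({"sku": "sku-1", "amount": 100})        # old producer
new = read_price({"sku": "sku-1", "amount": 100,
                  "currency": "USD", "discount": 10})    # new producer
```

This is the discipline the slide asks for: the reader owns its projection and its defaults, rather than assuming every producer populates every field.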

Slide 30

Slide 30 text

Solution 2: Schema evolution
- + Adding fields is relatively simple
- + Clients / consumers are concerned with schemas and projections, not versions
- + Allows for contextual interpretation of older versions
- - Requires discipline
- - Sometimes you want things not to be optional

Slide 31

Slide 31 text

Solution 3: Schema Translation
- There is some middleware that knows how to translate any version to any other version; sometimes used in event sourcing
- + Transparent to the services
- - Doesn’t allow for contextual translation between versions

Slide 32

Slide 32 text

Autonomy: Conclusion
- You will end up with a combination of versioning and schema evolution; schema translation doesn’t scale in practice
- Sometimes you just want to delete the Kafka topic and start over, but you can’t, so you start a new topic and use a new version on there
- Consumers can move to the new topic gradually
- This requires coordination, just like monolithic deployments

Slide 33

Slide 33 text

Real Life Concerns

Slide 34

Slide 34 text

Backup and Restore
- Back up the truths as part of service local state
- Restore service local state, then trigger authoritative services to replay local state on Kafka; this builds up all derived state
- Backup is solved externally to the service, which is hard to do when all services use different persistence
- Still work in progress...

Slide 35

Slide 35 text

Breaking Business Changes
- This is when you switch to a new version of something, e.g. in the new design our representation of a brand doesn’t match the old one
- Schema evolution makes things overly complex here (union types, etc.)
- Create new entities instead; change to a new API version / new Kafka topic
- This kind of change requires coordination; this is normal

Slide 36

Slide 36 text

Schema discipline
- Currently using JSON for everything, with schemas defined in code; code attracts logic, and schemas shouldn’t have logic
- JSON is more troublesome than anticipated: it is really easy to publish evolution-incompatible messages on a queue
- It was a conscious decision to lower the learning curve while bootstrapping development
- Will move to a binary message format with formal schema definitions as soon as possible, most likely Avro
- gRPC looks very promising for synchronous dependencies

Slide 37

Slide 37 text

Consumer lag
- A high-level metric for the health of consumers: it measures overall staleness in the system, so be worried when it goes up
- Using Burrow (https://github.com/linkedin/Burrow) to keep track, with a custom integration with Datadog for metrics / monitoring / alerting
- In Kubernetes we could use consumer lag as an auto-scaling trigger; that doesn’t solve stuck consumers, of course
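The metric itself is simple arithmetic (the offsets below are made-up example values): lag is the distance between the broker's log end offset and the consumer group's committed offset, summed over partitions.

```python
# Consumer lag per partition: log end offset minus committed offset.
# Offset values are hypothetical, for illustration only.

def total_lag(end_offsets, committed_offsets):
    """Sum of per-partition lag for one consumer group."""
    return sum(end_offsets[p] - committed_offsets[p] for p in end_offsets)

end_offsets = {0: 1500, 1: 1480, 2: 1510}        # broker's log end offsets
committed_offsets = {0: 1500, 1: 1450, 2: 1300}  # consumer group commits
lag = total_lag(end_offsets, committed_offsets)  # 0 + 30 + 210
```

A steadily growing total means consumers are falling behind producers; a flat non-zero total with no commits at all is the stuck-consumer case the slide warns about.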

Slide 38

Slide 38 text

Some Observations

Slide 39

Slide 39 text

No content

Slide 40

Slide 40 text

No content

Slide 41

Slide 41 text

Warning: shameless “we’re hiring” slide coming up...

Slide 42

Slide 42 text

Vacancies
- Back end engineer (JVM, Python): Core Platform, Customer Success, Solutions
- Front end engineer (JavaScript, React / Redux)
- Infrastructure / deployment engineer: responsible for infra, Kafka + ES clusters and the build + deployment pipeline

Slide 43

Slide 43 text

Conclusions

Slide 44

Slide 44 text

To keep it simple
- Mix paradigms, but use idiomatic implementations
- Understanding of concepts is more important than implementations; people can read API docs
- Think harder about the models of the sources of truth; Kafka allows any derived state to be cheaply recomputed
- Cheap recomputation requires cheap deployment, schema evolution, and queue compaction
- Remember: no single abstraction deals with all production aspects of a system

Slide 45

Slide 45 text

Questions?

Slide 46

Slide 46 text

THANKS ! www.fashiontrade.com | [email protected]