
Maintaining Simplicity in Log Based Architectures

At FashionTrade we use (micro-)services that are independently deployed and maintained, and we rely heavily on Kafka-based logs for distribution of data. A key design goal is that services build up local state in service-specific representations. In this talk we'll go into the conceptual and implementation details of several approaches to typical challenges in log- and event-based setups, including schema changes, staleness, decoupling, and mixing asynchronous with synchronous dependencies. In the end we elaborate on the implementation choices we made to strive for a healthy trade-off between simplicity and correctness of design.

FashionTrade.com Engineering

February 21, 2017


Transcript

  1. About Friso van Vollenhoven
     Mostly worked in software dev and related roles. Former CTO at a (big) data analytics and machine learning company. Now CTO at FashionTrade. I am the proud owner of a three-character Twitter handle: @fzk. I have 19 endorsements for Awesomeness on LinkedIn.

  2. About FashionTrade
     A B2B platform for fashion wholesale, where fashion brands and retailers can connect and do business. E-commerce for (fashion) businesses. Tagline: “We simplify wholesale so you can Connect, Trade & Grow.”

  3. The life of a product
     - Product information enters the brand integration API
     - Validation
     - Product information is merged with existing data that applies: price lists, stock levels, existing images, etc.
     - The product enters the search engine, but only if complete (i.e. it has a known price, availability, etc.)
     - Product information is used for orders, confirmations, etc.

  4. State and Truth Examples
     - Truth: product information (master data, stock levels, prices)
     - Derived state: ElasticSearch index, merged (derived) product information, product image sets (thumbnails), calculated product metadata

  5. State and Truth Examples
     - Truth: a brand reached out to a retailer to connect
     - State: match making in progress between brand and retailer
     - On a successful match making workflow, new truth: connection established between brand and retailer

  6. State vs. Truth
     - Derived state is cheap: it can be reconstructed based on truth, which allows for agility (e.g. build a new ElasticSearch index based on existing product information)
     - Truth is hard to reconstruct: it comes from external sources, so invest more in design thinking

  7. Why services
     Separates business concerns. Allows people to work on many things concurrently. Works well with a log based data architecture. Makes for organisational scalability, at the cost of added complexity in delivery.

  8. Services and Dependencies: local state
     Services maintain local state, managed autonomously by the service internally. 99% use CRUD; some use a light version of Event Sourcing.

  9. Services and Dependencies: local state
     Technologies: Google Cloud Datastore and ElasticSearch. The former is often authoritative; the latter is not allowed to be (it can only contain derived state).

  10. Services and Dependencies: communication
      Services depend on other services’ state. Delivery is managed by the transport; contents are provided by the producing service. Two flavours: synchronous calls (RPC) and asynchronous events (messaging).

  11. A Word on Sync vs. Async
      It is sometimes said that sync dependencies are a bad thing. It’s so bad that Google released an open source framework for creating sync / RPC dependencies based on their internal experiences (gRPC, http://www.grpc.io/). Perhaps it’s not about sync vs. async; it’s about autonomy of clients with respect to servers: update the client and server side independently. The same holds for producers / consumers.

  12. Services and Dependencies: communication
      Technologies: HTTP based RPC (based on Swagger and client code generation) and messages on Kafka. The payload is JSON for both (more on that later). The typical path for a service is to use RPC first to get things done, and later move to a Kafka based dependency and maintain local state.

  13. Why do we all like Kafka so much?
      The Kafka protocol is pipelined over TCP: outstanding requests queue up in the TCP send buffer, and TCP already has back pressure (congestion control, slow start). The Kafka producer is async in the happy flow, but becomes blocking under upstream contention (back pressure again). The Kafka consumer is poll based, not push: the client requests the next batch of messages.

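      To make the producer and consumer behaviour concrete, here is a minimal sketch against the Kafka Java client API. The broker address, topic name, group id, and payloads are all assumptions for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaFlowSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Producer: send() returns immediately in the happy flow; the callback fires
        // once the broker acks. When the client-side buffer fills up (upstream
        // contention), send() blocks -- that is the back pressure described above.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(
                new ProducerRecord<>("products", "sku-123", "{\"name\":\"sneaker\"}"),
                (metadata, exception) -> {
                    if (exception != null) exception.printStackTrace();
                });
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "product-consumer"); // illustrative group id
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Consumer: poll based, not push -- the client asks for the next batch.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("products"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s -> %s%n", r.key(), r.value()));
        }
    }
}
```
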
  14. Also...
      Utilizes zero copy send for file transfer (it can saturate a 1Gb link). Configurable, durable retention, with configurable per-message durability (the number of ack nodes). Queue compaction saves the last seen message for each key. Local ordering guarantees, which is important when using idempotent messages. A compaction sketch follows below.

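      Queue compaction is a per-topic setting. Below is a hedged sketch of creating a compacted topic with the Kafka AdminClient; the broker address, topic name, and partition / replication numbers are made-up examples.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CompactedTopicSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // cleanup.policy=compact keeps the last seen message per key, so a
            // fresh consumer can rebuild current state by reading the whole topic.
            NewTopic topic = new NewTopic("product-state", 6, (short) 3) // illustrative name and sizing
                .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```
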
  15. Distributed, Persistent, Reactive… BINGO!
      Kafka is as close to a reactive stream as it gets. Use that; there is no need for an additional abstraction. Use it for streaming only, and keep other paradigms for local state.

  16. Things we do with Kafka
      - A service that consumes products, stock levels, and price lists, and produces enriched product information on a separate topic, to be used for search, orders, product details display, etc.
      - When we deploy a new version of search, we build a new index by deploying the new search service with a new consumer group and resetting the consumer offset (see the sketch below).
      - We can translate any business entity creation, update, or deletion into a business metric, using a simple consumer that translates events into metrics and sends them to Datadog.

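      A minimal sketch of the reindexing trick: a brand new consumer group has no committed offsets, so with auto.offset.reset=earliest it replays the topic from the start. The group id, topic name, and indexing function here are hypothetical.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReindexSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "search-indexer-v2");       // illustrative: a fresh group per search deployment
        props.put("auto.offset.reset", "earliest");       // no committed offsets yet -> start at the beginning
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("enriched-products")); // illustrative topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                records.forEach(r -> indexDocument(r.key(), r.value()));
            }
        }
    }

    // hypothetical sink: write the document into the new ElasticSearch index
    static void indexDocument(String id, String json) { }
}
```
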
  17. Solution 1: Strict Versioning
      The server side needs to support all versions that ever existed; the client sticks to a specific version of the service.
      + Easy to understand
      - Simple / small changes are costly

  18. Solution 2: Schema evolution
      When adding a new field to an entity, it must be optional. Removing fields from the schema can’t be done, but a producer can stop populating optional fields. Readers / consumers / clients must have sensible handling of empty optionals: usually default values, sometimes different behaviour. A sketch of that handling follows below.

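      Since the payloads are currently JSON, here is one hedged way to give consumers that "sensible handling of empty optionals" using Jackson. The Product shape and field names are invented for this example.

```java
import com.fasterxml.jackson.databind.DeserializationFeature;
import com.fasterxml.jackson.databind.ObjectMapper;

public class SchemaEvolutionSketch {

    public static class Product {
        public String sku;                 // always present
        public String name;               // always present
        public String season = "UNKNOWN"; // added later: optional, with a default value
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper()
            // old consumers must not break when a *newer* producer adds fields
            .configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false);

        // an "old" message without the new field: the default kicks in
        Product p = mapper.readValue("{\"sku\":\"sku-123\",\"name\":\"sneaker\"}", Product.class);
        System.out.println(p.season); // prints UNKNOWN
    }
}
```

      Together these two settings cover both directions: a default value handles old messages read by a new consumer, and ignoring unknown properties handles new messages read by an old consumer.
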
  19. Solution 2: Schema evolution
      + Adding fields is relatively simple
      + Clients / consumers are concerned with schemas and projections, not versions
      + Allows for contextual interpretation of older versions
      - Requires discipline
      - Sometimes you want things not to be optional

  20. Solution 3: Schema Translation
      There is some middleware that knows how to translate any version to any other version; this is sometimes used in event sourcing.
      + Transparent to the services
      - Doesn’t allow for contextual translation between versions

  21. Autonomy: Conclusion
      You will end up with a combination of versioning and schema evolution; schema translation doesn’t scale in practice. Sometimes you just want to delete the Kafka topic and start over. But you can’t, so you start a new topic and use a new version on there; consumers can then move to the new topic gradually. This requires coordination, just like monolithic deployments.

  22. Backup and Restore
      Backup the truths as part of service local state, restore service local state, and trigger authoritative services to replay their local state on Kafka; that builds up all derived state (see the replay sketch below). Backup is solved external to the service, which is hard to do when all services use different persistence. Still work in progress...

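      A hedged sketch of the replay step: an authoritative service re-emits every entity it owns onto its topic, keyed by entity id, so downstream services can rebuild derived state. The loadAllProducts() source and the topic name are hypothetical stand-ins for a Datastore scan.

```java
import java.util.Properties;
import java.util.stream.Stream;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ReplaySketch {

    static final class StoredProduct {
        final String id;
        final String json;
        StoredProduct(String id, String json) { this.id = id; this.json = json; }
    }

    // hypothetical stand-in for scanning the service's authoritative store
    static Stream<StoredProduct> loadAllProducts() {
        return Stream.of(new StoredProduct("sku-123", "{\"name\":\"sneaker\"}"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed by entity id: on a compacted topic, replaying is idempotent --
            // the last message per key wins, so repeated replays converge.
            loadAllProducts().forEach(p ->
                producer.send(new ProducerRecord<>("products", p.id, p.json)));
        }
    }
}
```
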
  23. Breaking Business Changes
      This is when you switch to a new version of something, e.g. in the new design our representation of a brand doesn’t match the old one. Schema evolution makes things overly complex here (union types, etc.); create new entities instead, and change to a new API version / new Kafka topic. This kind of change requires coordination. This is normal.

  24. Schema discipline
      Currently we use JSON for everything, with schemas defined in code. Code attracts logic, and schemas shouldn’t have logic. JSON is more troublesome than anticipated: it is really easy to publish evolution-incompatible messages on a queue. This was a conscious decision to lower the learning curve while bootstrapping development. We will move to a binary message format with formal schema definitions as soon as possible, most likely Avro; gRPC looks very promising for synchronous dependencies.

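      To give a hedged taste of the planned direction: in Avro, the schema itself carries the evolution rules, so a later-added field needs a declared default for old messages to remain readable. The record and field names here are invented.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecordBuilder;

public class AvroSchemaSketch {
    public static void main(String[] args) {
        // A formal schema: the later-added "season" field is optional with a default.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Product\",\"fields\":["
            + "{\"name\":\"sku\",\"type\":\"string\"},"
            + "{\"name\":\"season\",\"type\":[\"null\",\"string\"],\"default\":null}"
            + "]}");

        // Omitting the optional field is fine: the builder applies the default.
        GenericData.Record product = new GenericRecordBuilder(schema)
            .set("sku", "sku-123")
            .build();
        System.out.println(product.get("season")); // null, per the schema default
    }
}
```
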
  25. Consumer lag
      A high level metric for the health of consumers; it determines overall staleness in the system. Be worried when it goes up. We use Burrow (https://github.com/linkedin/Burrow) to keep track, with a custom integration with Datadog for metrics / monitoring / alerting. In Kubernetes we could use consumer lag as an auto-scaling trigger. It doesn’t solve stuck consumers, of course. A sketch of what lag means follows below.

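      For intuition, here is a rough sketch of computing consumer lag by hand with the Kafka AdminClient (Burrow does this continuously and also evaluates the trend). The group id and broker address are assumptions.

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker

        try (AdminClient admin = AdminClient.create(props)) {
            // Where the group's consumers are: their committed offsets per partition.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                .listConsumerGroupOffsets("search-indexer") // illustrative group id
                .partitionsToOffsetAndMetadata().get();

            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
                // Where the log ends: the latest offset per partition.
                Map<TopicPartition, Long> ends = consumer.endOffsets(committed.keySet());
                // lag = log end offset - committed offset, per partition
                committed.forEach((tp, offset) ->
                    System.out.printf("%s lag=%d%n", tp, ends.get(tp) - offset.offset()));
            }
        }
    }
}
```
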
  26. Vacancies
      - Back end engineer (JVM, Python): Core Platform, Customer Success, Solutions
      - Front end engineer (JavaScript, React / Redux)
      - Infrastructure / deployment engineer: responsible for infra, the Kafka + ES clusters, and the build + deployment pipeline

  27. To keep it simple
      Mix paradigms, but use idiomatic implementations. Understanding of concepts is more important than implementations; people can read API docs. Think harder about the models of the sources of truth: Kafka allows any derived state to be cheaply recomputed, but that requires cheap deployment, schema evolution, and queue compaction. Remember: no single abstraction deals with all production aspects of a system.