
Kafka Summit London 2024 - Evolve Your Schemas in a Better Way!

Tim
March 20, 2024


Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibility and Schema Registry. The only constant in life is change! The same applies to your Kafka events flowing through your streaming applications.

The Confluent Schema Registry allows us to control how schemas can evolve over time without breaking the compatibility of our streaming applications. But when you start with Kafka and (Avro) schemas, this can be pretty overwhelming.

Join Kosta and Tim as we dive into the tricky world of backward and forward compatibility in schema design. During this deep dive talk, we are going to answer questions like:

* What compatibility level to pick?

* What changes can I make when evolving my schemas?

* What options do I have when I need to introduce a breaking change?

* Should we automatically register schemas from our applications? Or do we need a separate step in our deployment process to promote schemas to higher-level environments?

* What to promote first? My producer, consumer or schema?

* How do you generate Java classes from your Avro schemas using Maven or Gradle, and how do you integrate this into your project(s)?

* How do you build an automated test suite (unit tests) to gain more confidence and verify you are not breaking compatibility, even before deploying a new version of your schema or application?

With live demos, we'll show you how to make schema changes work seamlessly, emphasizing the crucial decisions and sharing real-life examples, pitfalls, and best practices for promoting schemas on the consumer and producer sides.

Explore the ins and outs of Apache Avro and the Schema Registry with us at the Kafka Summit! Start evolving your schemas in a better way today, and join this talk!


Transcript

  1. Evolve your schemas in a better way! A deep dive

    into Avro schema compatibility and Schema Registry Tim van Baarsen & Kosta Chuturkov
  2. ING www.ing.jobs • 60,000+ employees • Serve 37+ million customers

    • Corporate clients and financial institutions in over 40 countries
  3. Kafka @ ING Frontrunners in Kafka since 2014 Running in

    production: • 9 years • 7000+ topics • Serving 1000+ Development teams • Self service topic management
  4. Kafka @ ING Traffic is growing by roughly 10% monthly.
    Chart: messages produced per second (average), 2015-2024.
  5. What are we going to cover today?
    • Why schemas?
    • What compatibility level to pick?
    • What changes can I make when evolving my schemas?
    • What options do I have when I need to introduce a breaking change?
    • Should we automatically register schemas from our applications?
    • How do you generate Java classes from your Avro schemas, and how do you build an automated test suite (unit tests)?
  6. Why schemas? The only constant in life is change! The

    same applies to your Kafka events flowing through your streaming applications.
  7. Why schemas? Diagram: Producer Application (Kafka client, Serializer) → your-topic on the Kafka cluster → Consumer Application (Kafka client, Deserializer). Producer responsibilities: send, serialization (key and value). Consumer responsibilities: subscribe, deserialization (key and value), heartbeat. The broker is not responsible for type checking, schema validation, or other constraints. Data in a Kafka topic is just stored as bytes!
  8. Why schemas? Consumers and producers are decoupled at runtime.
  9. Why schemas? Consumers and producers are decoupled at runtime; same diagram with a second consumer application added.
  10. Why schemas? Producer and consumers are indirectly coupled on the data format.
  11. Why schemas? Indirectly coupled on the data format: what fields and types of data can I expect? Where is the documentation of the fields?
  12. Why schemas? Some requirements changed and we need to introduce a new field. Don't cause inconsistency, keep it compatible, and don't disrupt my service.
  13. Why schemas? A schema sits between producer and consumers and describes the data format.
  14. Why schemas? Same diagram, with the schema shared by producer and consumers.
  15. Why schemas? We need the schema the data was written with to be able to read it.
  16. Why schemas? Don't send the schema each time we send data.
  17. Why schemas? Enter the Confluent Schema Registry.
  18. Why schemas? Same diagram, now with the Confluent Schema Registry next to the cluster.
  19. Why schemas? The producer's KafkaAvroSerializer registers the schema with the Confluent Schema Registry.
  20. Why schemas? The Schema Registry assigns a schema id (id: 1); the producer sends that id with every record instead of the full schema.
  21. Why schemas? The consumer's KafkaAvroDeserializer reads the schema id from each record and uses it to fetch the schema from the Confluent Schema Registry.
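    As a concrete illustration of the schema id traveling with each record, here is a minimal Java sketch (assuming the standard Confluent wire format: a magic byte of 0, then a 4-byte schema id, then the Avro binary payload) that reads the id back out of a serialized value; the class and method names are purely illustrative:

        import java.nio.ByteBuffer;

        public class WireFormatPeek {
            // Confluent wire format: [magic byte 0][4-byte schema id][Avro binary payload]
            public static int schemaIdOf(byte[] serializedValue) {
                ByteBuffer buffer = ByteBuffer.wrap(serializedValue);
                byte magicByte = buffer.get();   // always 0 for the Confluent wire format
                if (magicByte != 0) {
                    throw new IllegalArgumentException("Not a Confluent-serialized payload");
                }
                return buffer.getInt();          // big-endian schema id assigned by the Schema Registry
            }
        }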
  22. Why schemas? The deserializer loads the schema from the registry by id. The Confluent Schema Registry is a runtime dependency, so it needs high availability.
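    To make the registry interaction above concrete, here is a minimal producer configuration sketch; the schema matches the Customer example used later in the deck, and the broker and registry URLs are placeholders:

        import java.util.Properties;
        import org.apache.avro.Schema;
        import org.apache.avro.generic.GenericData;
        import org.apache.avro.generic.GenericRecord;
        import org.apache.kafka.clients.producer.KafkaProducer;
        import org.apache.kafka.clients.producer.ProducerRecord;
        import org.apache.kafka.common.serialization.StringSerializer;
        import io.confluent.kafka.serializers.KafkaAvroSerializer;

        public class CustomerProducerSketch {
            public static void main(String[] args) {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");          // placeholder broker address
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", KafkaAvroSerializer.class.getName());
                props.put("schema.registry.url", "http://localhost:8081"); // placeholder Schema Registry URL

                Schema schema = new Schema.Parser().parse(
                    "{\"type\":\"record\",\"name\":\"Customer\",\"namespace\":\"com.example\","
                    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"isJointAccountHolder\",\"type\":\"boolean\"}]}");

                GenericRecord customer = new GenericData.Record(schema);
                customer.put("name", "Jack");
                customer.put("isJointAccountHolder", true);

                try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
                    // On first send the serializer registers the schema and embeds only its id in the record.
                    producer.send(new ProducerRecord<>("your-topic", "customer-key", customer));
                }
            }
        }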
  23. Avro
    § At ING we prefer Avro
    § Apache Avro™ is a data serialization system that offers rich data structures and uses a compact, binary format.
    { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string" }, { "name": "isJointAccountHolder", "type": "boolean" } ] }
    { "name": "Jack", "isJointAccountHolder": true }
  24. Avro field types
    • primitive types (null, boolean, int, long, float, double, bytes, and string)
    • complex types (record, enum, array, map, union, and fixed)
    • logical types (decimal, uuid, date, …)
    { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string" }, { "name": "isJointAccountHolder", "type": "boolean" }, { "name": "country", "type": { "name": "Country", "type": "enum", "symbols": ["US", "UK", "NL"] } }, { "name": "dateJoined", "type": "long", "logicalType": "timestamp-millis" } ] }
    { "name": "Jack", "isJointAccountHolder": true, "country": "UK", "dateJoined": 1708944593285 }
  25. Maps
    Note: the "values" type applies to the values in the map; the keys are always strings.
    Example Java Map representation: Map<String, Long> customerPropertiesMap = new HashMap<>();
    { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string" }, { "name": "isJointAccountHolder", "type": "boolean" }, { "name": "country", "type": { "name": "Country", "type": "enum", "symbols": ["US", "UK", "NL"] } }, { "name": "dateJoined", "type": "long", "logicalType": "timestamp-millis" }, { "name": "customerPropertiesMap", "type": { "type": "map", "values": "long" } } ] }
    { "name": "Jack", "isJointAccountHolder": true, "country": "UK", "dateJoined": 1708944593285, "customerPropertiesMap": { "key1": 1708, "key2": 1709 } }
  26. Fixed
    { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string" }, { "name": "isJointAccountHolder", "type": "boolean" }, { "name": "country", "type": { "name": "Country", "type": "enum", "symbols": ["US", "UK", "NL"] } }, { "name": "dateJoined", "type": "long", "logicalType": "timestamp-millis" }, { "name": "customerPropertiesMap", "type": { "type": "map", "values": "long" }, "doc": "Customer properties" }, { "name": "annualIncome", "type": ["null", { "name": "AnnualIncome", "type": "fixed", "size": 32 }], "doc": "Annual income of the Customer.", "default": null } ] }
    { "name": "Jack", "isJointAccountHolder": true, "country": "UK", "dateJoined": 1708944593285, "customerPropertiesMap": { "key1": 1708, "key2": 1709 }, "annualIncome": [64, -9, 92, …] }
  27. Unions
    • Unions are represented using JSON arrays
    • For example, ["null", "string"] declares a schema which may be either a null or a string.
    • Question: Who thinks this is a valid definition?
    { … "fields": [ { "name": "firstName", "type": ["null", "string"], "doc": "The first name of the Customer." }, … ] }
    { … "fields": [ { "name": "firstName", "type": ["null", "string", "int"], "doc": "The first name of the Customer." }, … ] }
    org.apache.kafka.common.errors.SerializationException: Error serializing Avro message…
    Caused by: org.apache.avro.UnresolvedUnionException: Not in union ["null","string","int"]: true
  28. Aliases
    • Named types and fields may have aliases
    • Aliases function by re-writing the writer's schema using aliases from the reader's schema.
    Consumer: { … "fields": [ { "name": "customerName", "aliases": [ "name" ], "type": "string", "doc": "The name of the Customer.", "default": null }, … ] }
    Producer: { … "fields": [ { "name": "name", "type": "string", "doc": "The name of the Customer.", "default": null }, … ] }
  29. BACKWARD
    Producer 1: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer 1 read: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer 2 read: V2 (Delete field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
  30. BACKWARD
    Producer 1: V2
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer 1 read: V2
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer 2 read: V2 (Delete field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer 3 read: V3 (Add optional field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "3", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null } ] }
  31. BACKWARD TRANSITIVE
    Producer: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer read: V2 (Delete field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" } ] }
    Consumer read: V3 (Add optional field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "3", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null } ] }
    Consumer read: V…n (Delete field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "…n", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null } ] }
    Compatible
  32. FORWARD
    Consumer read: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "annualIncome", "type": ["null","double"], "default": null } ] }
    Producer write: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "annualIncome", "type": ["null","double"], "default": null } ] }
    Producer write: V2 (Delete Optional Field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "annualIncome", "type": ["null","double"], "default": null } ] }
    Producer write: V3 (Add Required Field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "3", "fields": [ { "name": "name", "type": "string" }, { "name": "annualIncome", "type": ["null","double"], "default": null }, { "name": "dateOfBirth", "type": "string", "doc": "The date of birth for the Customer." } ] }
  33. FORWARD TRANSITIVE
    Consumer read: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null } ] }
    Producer: V2 (Delete Optional Field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null } ] }
    Producer: V3 (Add Field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "3", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null }, { "name": "dateOfBirth", "type": "string" } ] }
    Producer: V…n (Add Field)
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "…n", "fields": [ { "name": "name", "type": "string" }, { "name": "occupation", "type": "string" }, { "name": "annualIncome", "type": ["null","int"], "default": null }, { "name": "dateOfBirth", "type": "string" }, { "name": "phoneNumber", "type": "string" } ] }
    Compatible
  34. FULL
    Producer: V1
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "1", "fields": [ { "name": "name", "type": ["null","string"], "default": null }, { "name": "occupation", "type": ["null","string"], "default": null }, { "name": "annualIncome", "type": ["null","int"], "default": null }, { "name": "dateOfBirth", "type": ["null","string"], "default": null } ] }
    Consumer read: V2
    { "type": "record", "namespace": "com.example", "name": "Customer", "version": "2", "fields": [ { "name": "name", "type": ["null","string"], "default": null }, { "name": "occupation", "type": ["null","string"], "default": null }, { "name": "annualIncome", "type": ["null","int"], "default": null }, { "name": "dateOfBirth", "type": ["null","string"], "default": null }, { "name": "phoneNumber", "type": ["null","string"], "default": null } ] }
    NOTE:
    • The default values apply only on the consumer side.
    • On the producer side you need to set a value for the field.
  35. Available compatibility types (from the Confluent Schema Registry documentation)
    • BACKWARD: new schema can be used to read old data
    • FORWARD: old schema can be used to read new data
    • FULL: both backward and forward
    • NONE: no compatibility enforced
  36. What compatibility to use?
    • If you are the topic owner and the producer, in control of evolving the schema, and you don't want to break existing consumers, use FORWARD
    • If you are the topic owner and a consumer, use BACKWARD, so you can upgrade first and then ask the producer to evolve its schema with the fields you need
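    Compatibility is configured per subject in the Schema Registry. One way to do that from Java is sketched below using the Confluent Schema Registry client (the same can be done with curl against the registry's /config/<subject> endpoint); the registry URL is a placeholder and the subject names assume the default TopicNameStrategy:

        import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
        import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

        public class SubjectCompatibilitySketch {
            public static void main(String[] args) throws Exception {
                // Placeholder registry URL; 100 is the local schema cache capacity.
                SchemaRegistryClient client = new CachedSchemaRegistryClient("http://localhost:8081", 100);

                // Topic owner and producer: don't break existing consumers.
                client.updateCompatibility("customers-topic-forward-value", "FORWARD");

                // Topic owner and consumer: upgrade the reader first, then ask the producer to evolve.
                client.updateCompatibility("customers-topic-backward-value", "BACKWARD");

                System.out.println(client.getCompatibility("customers-topic-backward-value"));
            }
        }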
  37. Backward Compatibility Demo: Components
    Kafka cluster, topic: customers-topic-backward
    • kafka-producer-one (schema v1): send
    • kafka-consumer-first (schema v2): poll
    • kafka-consumer-second (schema v3): poll
    • kafka-consumer-third (schema v4): poll
    Schema changes along the way: occupation (required), annualIncome (optional), age (required)
    Adding a required field is not a backward compatible change!
  38. Forward Compatibility Demo: Components
    Kafka cluster, topic: customers-topic-forward
    • kafka-producer-one (schema v2): send
    • kafka-producer-two (schema v3): send
    • kafka-producer-three (schema v4): send
    • kafka-consumer-first (schema v1): poll
    Schema changes along the way: annualIncome (optional), dateOfBirth (required), phoneNumber (required)
    Removing a required field is not a forward compatible change!
  39. Plugins: Avro Schema to Java Class
    Avro Schema (.avsc) → avro-maven-plugin → .java → maven-compiler-plugin → .class → maven-jar-plugin → .jar
  40. Plugins: Avro Schema to Java Class
    Avro Schema (.avsc) → avro-maven-plugin → .java → maven-compiler-plugin → .class → maven-jar-plugin → .jar
    • Validation of Avro syntax
    • No validation of compatibility!
  41. package com.example.avro.customer; /** Avro schema for our customer. */ @org.apache.avro.specific.AvroGenerated

    public class Customer extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord { private static final long serialVersionUID = 1600536469030327220L; public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\" CustomerBackwardDemo\",\"namespace\":\"com.example.avro.custom er\",\"doc\":\"Avro schema for our customer.\",\"fields\":[{\"name\":\"name\",\"type\":{\"type\":\"string\",\" avro.java.string\":\"String\"},\"doc\":\"The name of the Customer.\"},{\"name\":\"occupation\",\"type\":{\"type\":\"string\",\"avro .java.string\":\"String\"},\"doc\":\"The occupation of the Customer.\"}],\"version\":1}"); … } { "namespace": "com.example.avro.customer", "type": "record", "name": "Customer", "version": 1, "doc": "Avro schema for our customer.", "fields": [ { "name": "name", "type": "string", "doc": "The name of the Customer." }, { "name": "occupation", "type": "string", "doc": "The occupation of the Customer." } ] } Plugins: Avro Schema to Java Class Customer.avsc Customer.java
  42. Plugins: Avro Schema to Java Class .jar Producer Kafka client

    Consumer Kafka client Kafka Streams App Specific record: Customer.class
  43. Test Avro compatibility Integration test style Confluent Schema Registry Subject:

    customer-value Compatibility: BACKWARD REST API V1 V2 V1 V2 V3 Validate compatibility • Curl • Confluent CLI • Confluent Maven Plugin Your Java project Registered in the Schema Registry
  44. Test Avro compatibility Integration test style Confluent Schema Registry Subject:

    customer-value Compatibility: BACKWARD REST API V1 V2 V1 V2 V3 Validate compatibility • Curl • Confluent CLI • Confluent Schema Registry Maven Plugin Your Java project Registered in the Schema Registry Automate in Your Maven build
  45. Test Avro compatibility Integration test style Confluent Schema Registry Subject:

    customer-value Compatibility: BACKWARD REST API V1 V2 V1 V2 V3 Validate compatibility • curl • Confluent CLI • Confluent Schema Registry Maven Plugin Your Java project Registered in the Schema Registry Automate in Your Maven build
  46. Test Avro compatibility: Unit tests Unit test style V1 V2

    V3 Validate compatibility • curl • Confluent CLI • Confluent Schema Registry Maven Plugin • Unit tests Your Java project Automate in Your Maven build Validate compatibility
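    One way to write such a unit test is with Avro's own SchemaCompatibility helper; a minimal sketch follows with the schemas inlined (in a real project you would load the generated classes or the .avsc files from the classpath):

        import org.apache.avro.Schema;
        import org.apache.avro.SchemaCompatibility;
        import org.apache.avro.SchemaCompatibility.SchemaCompatibilityType;
        import org.junit.jupiter.api.Test;

        import static org.junit.jupiter.api.Assertions.assertEquals;

        class CustomerSchemaCompatibilityTest {

            private final Schema v1 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Customer\",\"namespace\":\"com.example\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"occupation\",\"type\":\"string\"}]}");

            private final Schema v2 = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Customer\",\"namespace\":\"com.example\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"occupation\",\"type\":\"string\"},"
                + "{\"name\":\"annualIncome\",\"type\":[\"null\",\"int\"],\"default\":null}]}");

            @Test
            void v2ReaderCanReadV1Data() { // BACKWARD: the new schema can read old data
                SchemaCompatibility.SchemaPairCompatibility result =
                    SchemaCompatibility.checkReaderWriterCompatibility(v2, v1);
                assertEquals(SchemaCompatibilityType.COMPATIBLE, result.getType());
            }
        }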
  47. Should we auto register schemas?
    • By default, client applications automatically register new schemas
    • Auto registration is performed by the producers only
    • For development environments you can use auto schema registration
    • For production environments the best practice is:
      • to register schemas outside the client application
      • to control when schemas are registered with the Schema Registry and how they evolve
    • You can disable auto schema registration on the producer: auto.register.schemas: false
    • Schema Registry: Schema Registry security plugin
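    A minimal sketch of what that looks like on the producer side, assuming schemas are promoted to the registry in a separate deployment step (broker and registry URLs are placeholders):

        import java.util.Properties;
        import org.apache.kafka.common.serialization.StringSerializer;
        import io.confluent.kafka.serializers.KafkaAvroSerializer;

        public class NoAutoRegisterProducerConfig {
            public static Properties producerProps() {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");          // placeholder broker address
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", KafkaAvroSerializer.class.getName());
                props.put("schema.registry.url", "http://localhost:8081"); // placeholder Schema Registry URL
                props.put("auto.register.schemas", false); // schemas are promoted outside the application
                props.put("use.latest.version", true);     // use the latest schema already registered for the subject
                return props;
            }
        }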
  48. package com.example.avro.customer; /** Avro schema for our customer. */ @org.apache.avro.specific.AvroGenerated

    public class Customer extends org.apache.avro.specific.SpecificRecordBase implements org.apache.avro.specific.SpecificRecord { private static final long serialVersionUID = 1600536469030327220L; public static final org.apache.avro.Schema SCHEMA$ = new org.apache.avro.Schema.Parser().parse("{\"type\":\"record\",\"name\":\"CustomerBackwardDemo \",\"namespace\":\"com.example.avro.customer\",\"doc\":\"Avro schema for our customer.\",\"fields\":[{\"name\":\"name\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\" },\"doc\":\"The name of the Customer.\"},{\"name\":\"occupation\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"String\"},\ "doc\":\"The occupation of the Customer.\"}],\"version\":1}"); … } Auto register schema lessons learned • Maven Avro plugin: additional information appended to the schema in Java code • Producer (KafkaAvroSerializer): auto.register.schemas: false • When serializing the Avro Schema is derived from Customer Java object { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": ”name", "type": "string”, ”doc": "The name of the Customer.” } ] } Mismatch in schema comparison Avro Schema (avsc) registered in Schema Registry Avro Schema (Java) in producer
  49. Auto register schema lessons learned
    • If you are using the Avro Maven plugin, it is recommended to set this property on the KafkaAvroSerializer: avro.remove.java.properties: true
    Note: There is an open issue for the Avro Maven Plugin for this: AVRO-2838
    Avro Schema (avsc): { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string", "doc": "The name of the Customer." } ] }
    Avro Schema (as Java String): { "type": "record", "namespace": "com.example", "name": "Customer", "fields": [ { "name": "name", "type": "string", "doc": "The name of the Customer." } ] }
    No mismatch in schema comparison
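    A sketch of where that property goes, assuming the same KafkaAvroSerializer-based producer configuration as before (URLs are placeholders):

        import java.util.Properties;
        import org.apache.kafka.common.serialization.StringSerializer;
        import io.confluent.kafka.serializers.KafkaAvroSerializer;

        public class RemoveJavaPropertiesConfigSketch {
            public static Properties producerProps() {
                Properties props = new Properties();
                props.put("bootstrap.servers", "localhost:9092");          // placeholder broker address
                props.put("key.serializer", StringSerializer.class.getName());
                props.put("value.serializer", KafkaAvroSerializer.class.getName());
                props.put("schema.registry.url", "http://localhost:8081"); // placeholder Schema Registry URL
                // Strip the "avro.java.string" hints the Avro Maven plugin bakes into generated classes,
                // so the schema derived from the Java object matches the plain .avsc in the registry.
                props.put("avro.remove.java.properties", true);
                return props;
            }
        }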
  50. Schema Evolution Guidelines Rules of the Road for Modifying Schemas

    If you want to make your schema evolvable, then follow these guidelines. § Provide a default value for fields in your schema, as this allows you to delete the field later. § Don’t change a field’s data type. § Don’t rename an existing field (use aliases instead).
  51. Breaking changes. How to move forward? What can you do?

    • “Force push“ schema • BACKWARD -> NONE -> BACKWARD • Allow for downtime? • Both producers and consumer under your control? • Last resort • “Produce to multiple topics” • V1 topic • V2 topic • Migrate consumers • Transaction atomic operation • Data Contracts for Schema Registry • Field level transformations
  52. Wrap up Communication § Important to communicate changes between producing

    and consuming teams Gain more confidence § Add unit/integration tests to make sure your changes are compatible
  53. Wrap up
    Schema registration
    § Don't allow applications to register schemas automatically
    § Don't assume applications will set auto.register.schemas=false
    § Make sure to have security measures in place
    Be aware of pitfalls
    § Avro Maven plugin adds: "avro.java.string"
    § Deserialization exceptions on the consumer side