Data Serialization Using Apache Avro

KMKLabs
March 28, 2018

In my Tech Talk session, I present binary data serialization (the process of converting data structures or object state into a binary format that can be stored or transmitted and reconstructed later) using Apache Avro. The talk focuses on why a binary format is preferable to plain JSON and what its main benefits are. It also explains Avro schema definitions in detail and shows how to use them.

Transcript

  1. Data Serialization Using Apache Avro

    Presented by: Moh. Ropiyudin
    KMK’s Tech Talk – 10 Nov 2017
  2. Problem Statement (case: timeline backup, restore, feeds)

    Protocol to use?
    • Native
    • JSON
    • Protobuf
    • Avro → the chosen one

    Key factors?
    • Ability to grow
    • Efficient
    • Performance
    • Simple – easy to use and maintain
  3. Important Consideration

    JSON vs. binary:
    • Protocol size: 3x
    • Protocol time
    • Protocol ease of use
  4. What is Avro?

    Avro is a data serialization system. Avro provides:
    • Rich data structures.
    • A compact, fast, binary data format.
    • A container file, to store persistent data.
    • Simple integration with dynamic languages.

    Developer(s): Apache Software Foundation
  5. Schema Definition

    A schema is represented in JSON:
    • Primitive types: null, boolean, int, long, float, double, bytes, string
    • Complex types: records, enums, arrays, maps, unions and fixed

        {
          "namespace": "com.puter.avro",
          "type": "record",
          "name": "Employee",
          "fields": [
            {"name": "name", "type": "string"},
            {"name": "dob", "type": {"type": "long", "logicalType": "timestamp-millis"}},
            {"name": "height", "type": "int"},
            {"name": "previosCompany", "type": "string"},
            {"name": "favoriteColor", "type": ["string", "null"]}
          ]
        }
  6. Compact Binary Representation

    Octal representation
    Size: 352 bytes
    Real data: 32 bytes

        {
          "type": "record",
          "name": "Person",
          "fields": [
            {"name": "userName", "type": "string"},
            {"name": "favouriteNumber", "type": ["null", "long"]},
            {"name": "interests", "type": {"type": "array", "items": "string"}}
          ]
        }
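To make the compactness concrete, the sketch below hand-encodes one Person record with the rules from the Avro specification: a long is written as a zigzag variable-length integer, a string as its byte length followed by UTF-8 bytes, a union as the index of the chosen branch followed by the value, and an array as a counted block closed by a zero count. It uses only the Ruby standard library, not the Avro gem. The sample values are an assumption (the slide does not show the record itself), and the helper names are hypothetical.

```ruby
# Zigzag maps signed integers to unsigned ones so small magnitudes stay short.
def zigzag(n)
  (n << 1) ^ (n >> 63)
end

# Variable-length encoding: 7 bits per byte, high bit set on all but the last byte.
def encode_long(n)
  z = zigzag(n)
  bytes = []
  loop do
    byte = z & 0x7f
    z >>= 7
    if z.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80)
  end
  bytes.pack('C*')
end

# A string is its UTF-8 byte length (as a long) followed by the bytes.
def encode_string(s)
  data = s.encode('UTF-8').b
  encode_long(data.bytesize) + data
end

# Fields are written in schema order, with no field names or type tags.
def encode_person(user_name, favourite_number, interests)
  out = encode_string(user_name)
  if favourite_number.nil?
    out += encode_long(0)                                  # union branch 0: "null"
  else
    out += encode_long(1) + encode_long(favourite_number)  # union branch 1: "long"
  end
  unless interests.empty?
    out += encode_long(interests.size)                     # one block holding all items
    interests.each { |item| out += encode_string(item) }
  end
  out + encode_long(0)                                     # zero count ends the array
end

# Assumed sample values, not taken from the slide.
ENCODED = encode_person("Martin", 1337, ["daydreaming", "hacking"])
puts ENCODED.unpack1('H*')
puts ENCODED.bytesize   # 32
```

With these values the record encodes to 32 bytes, the same figure the slide gives for the real data. None of those bytes carry field names; the schema supplies them at read time.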
  7. Top 3 Features of Avro

    • Schema evolution
    • Untagged data
    • Dynamic typing
  8. Schema Evolution

    • Avro requires schemas when data is written or read.
    • We can use different schemas for serialization and deserialization.
    • Avro will handle the missing, extra, or modified fields.
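The handling of missing, extra, and modified fields can be pictured with a small pure-Ruby model. This is a simplified sketch, not the Avro library's resolver: it covers only flat records, default values, and the numeric promotions the Avro specification allows (int to long, float, or double; long to float or double; float to double). The schemas, the `resolve` helper, and the sample record are hypothetical.

```ruby
# Schema the data was written with.
WRITER_FIELDS = [
  { "name" => "username", "type" => "string" },
  { "name" => "age",      "type" => "int" },
  { "name" => "nickname", "type" => "string" }   # extra: unknown to the reader
].freeze

# Schema the application reads with.
READER_FIELDS = [
  { "name" => "username", "type" => "string" },
  { "name" => "age",      "type" => "long" },    # modified: int promoted to long
  { "name" => "verified", "type" => "boolean", "default" => false }  # missing in writer
].freeze

# Writer type => reader types it may be promoted to.
PROMOTIONS = {
  "int"   => %w[long float double],
  "long"  => %w[float double],
  "float" => %w[double]
}.freeze

# Builds the record as the reader sees it.
def resolve(record, writer_fields, reader_fields)
  writer_types = writer_fields.map { |f| [f["name"], f["type"]] }.to_h
  reader_fields.each_with_object({}) do |field, result|
    name = field["name"]
    if writer_types.key?(name)
      written = writer_types[name]
      wanted  = field["type"]
      unless written == wanted || PROMOTIONS.fetch(written, []).include?(wanted)
        raise ArgumentError, "cannot read #{name}: #{written} as #{wanted}"
      end
      result[name] = record.fetch(name)
    elsif field.key?("default")
      result[name] = field["default"]   # missing in the data: use the reader's default
    else
      raise ArgumentError, "field #{name} is missing and has no default"
    end
  end                                   # writer-only fields are simply skipped
end

WRITTEN  = { "username" => "john", "age" => 25, "nickname" => "jj" }.freeze
RESOLVED = resolve(WRITTEN, WRITER_FIELDS, READER_FIELDS)
puts RESOLVED.inspect
```

The resolved record keeps `username` and `age`, drops `nickname`, and fills `verified` from the reader's default. A reader field with neither written data nor a default is an error, which is why fields added later should carry defaults.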
  9. Untagged Data

    • Providing a schema with the binary data allows each datum to be written without per-value overhead.
    • The result is a more compact data encoding and faster data processing.
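A minimal sketch of what "untagged" means in practice: the payload below contains only values, and the decoder walks the schema's field list to know what to read next. It uses only the Ruby standard library and supports just `string`, `int`, and `long`; the field list, payload bytes, and helper names are assumptions made for illustration.

```ruby
require 'stringio'

# Reads one zigzag-encoded, variable-length integer from the stream.
def read_long(io)
  shift = 0
  result = 0
  loop do
    byte = io.readbyte
    result |= (byte & 0x7f) << shift
    break if (byte & 0x80).zero?   # high bit clear marks the last byte
    shift += 7
  end
  (result >> 1) ^ -(result & 1)    # undo zigzag
end

# Reads a length-prefixed UTF-8 string.
def read_string(io)
  length = read_long(io)
  io.read(length).force_encoding('UTF-8')
end

# The schema drives decoding: field order and types come from it, not from the data.
def decode_record(bytes, fields)
  io = StringIO.new(bytes.b)
  fields.each_with_object({}) do |field, record|
    record[field["name"]] =
      case field["type"]
      when "string"      then read_string(io)
      when "int", "long" then read_long(io)
      else raise ArgumentError, "unsupported type: #{field['type']}"
      end
  end
end

USER_FIELDS = [
  { "name" => "username", "type" => "string" },
  { "name" => "age",      "type" => "int" }
].freeze

# 0x08 = length 4, then "john", then 0x32 = zigzag(25). No names, no type tags.
PAYLOAD = [0x08, 0x6a, 0x6f, 0x68, 0x6e, 0x32].pack('C*')
DECODED = decode_record(PAYLOAD, USER_FIELDS)
puts DECODED.inspect
```

The payload is 6 bytes, whereas the compact JSON text `{"username":"john","age":25}` is 28 bytes, because JSON repeats the field names in every record. The trade-off is that the bytes cannot be interpreted without the schema.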
  10. Dynamic Typing

    • Serialization and deserialization without code generation.
    • Well suited to dynamically typed languages such as Ruby and Python.
    • Code generation is still available in Avro for statically typed languages as an optional optimization.

        require 'avro'

        SCHEMA = <<-JSON
        { "type": "record",
          "name": "User",
          "fields" : [
            {"name": "username", "type": "string"},
            {"name": "age", "type": "int"},
            {"name": "verified", "type": "boolean", "default": false}
          ]}
        JSON

        # Write two records to an Avro container file.
        file = File.open('data.avr', 'wb')
        schema = Avro::Schema.parse(SCHEMA)
        writer = Avro::IO::DatumWriter.new(schema)
        dw = Avro::DataFile::Writer.new(file, writer, schema)
        dw << {"username" => "john", "age" => 25, "verified" => true}
        dw << {"username" => "ryan", "age" => 23, "verified" => false}
        dw.close
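  11. Serialize – Deserialize (example)

        import java.io.ByteArrayInputStream
        import java.io.ByteArrayOutputStream
        import org.apache.avro.io.DecoderFactory
        import org.apache.avro.io.EncoderFactory
        import org.apache.avro.specific.SpecificDatumReader
        import org.apache.avro.specific.SpecificDatumWriter
        // Assumption: DateTime is Joda-Time, as used by Avro 1.8.x generated classes.
        import org.joda.time.DateTime

        // Employee is the class generated from the Employee schema.
        val employee = Employee.newBuilder().apply {
            name = "name1"
            dob = DateTime.parse("2017-10-26T18:00:00Z")
            previosCompany = "previousCompany1"
            favoriteColor = "favoriteColor1"
            height = 10
        }.build()

        // Serialize the record to a byte array.
        val out = ByteArrayOutputStream()
        out.use {
            val encoder = EncoderFactory.get().directBinaryEncoder(out, null)
            val writer = SpecificDatumWriter<Employee>(Employee.getClassSchema())
            writer.write(employee, encoder)
            encoder.flush()
        }
        val employeeByteData = out.toByteArray()

        // Deserialize the bytes back into an Employee.
        val input = ByteArrayInputStream(employeeByteData)
        val decoder = DecoderFactory.get().directBinaryDecoder(input, null)
        val reader = SpecificDatumReader<Employee>(Employee.getClassSchema())
        val afterDeserializeEmployee = reader.read(null, decoder)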