Introduction to Apache Avro

Avro is a language-agnostic data serialization and RPC framework initially developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.
Example code used in the presentation can be found at:
http://github.com/dmtrs/introduction-to-avro

Dimitrios Flaco Mengidis

March 22, 2016

Transcript

  1. Goals

    • Establish a baseline for discussion on serialization for this Meetup group
    • Consider serialization as a helpful tool for our API challenges
    • Get some early experience through a demo
  2. Overview

    • Created by Doug Cutting (Lucene, Nutch and Hadoop, among others)
    • Joined Apache in April 2009 as a Hadoop subproject
    • Version 1.0.0 released July 2009
    • Apache top-level project since May 2010 (version 1.3.2)
    • Latest version, 1.8.0, released January 2016
    • Apache License, Version 2.0
  3. In Apache Hadoop

    Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on clusters. Avro is used as:
      1. A serialization format for persistent data.
      2. A wire format for communication between Hadoop nodes, and from clients to Hadoop services.
  4. Definition

    Avro is a data serialization and remote procedure call (RPC) framework. It uses JSON for defining data types and protocols (the schema definition) and serializes data in a compact binary format. The schema definition is required when reading from binary.
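To make the Definition slide concrete, here is a minimal sketch of the binary encoding for a two-field record, written with the Python standard library only rather than an Avro library. The `User` schema, its field names and the helper function names are assumptions made for illustration. The sketch covers only `long` and `string` fields: a long is written as a zig-zag variable-length integer, a string as its byte length followed by UTF-8 bytes, and a record as its fields in schema order.

```python
import json

# Hypothetical schema for illustration; record and field names are assumptions.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
""")


def encode_long(n):
    """Zig-zag encode a 64-bit signed integer, then write it as a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def decode_long(buf, pos):
    """Read one varint starting at pos; return (value, next position)."""
    z, shift = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos  # undo the zig-zag mapping


def encode_record(schema, record):
    """Concatenate field values in schema order; no field names are written."""
    out = bytearray()
    for field in schema["fields"]:
        value = record[field["name"]]
        if field["type"] == "long":
            out += encode_long(value)
        elif field["type"] == "string":
            raw = value.encode("utf-8")
            out += encode_long(len(raw)) + raw
        else:
            raise ValueError("type not covered by this sketch: %r" % field["type"])
    return bytes(out)


def decode_record(schema, buf):
    """Decoding walks the schema: without it the bytes carry no structure."""
    record, pos = {}, 0
    for field in schema["fields"]:
        if field["type"] == "long":
            record[field["name"]], pos = decode_long(buf, pos)
        elif field["type"] == "string":
            length, pos = decode_long(buf, pos)
            record[field["name"]] = buf[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("type not covered by this sketch: %r" % field["type"])
    return record


payload = encode_record(SCHEMA, {"id": 27, "name": "foo"})
print(payload.hex())                   # 3606666f6f
print(decode_record(SCHEMA, payload))  # {'id': 27, 'name': 'foo'}
```

The payload is five bytes and contains neither field names nor type tags, which is why the format is compact and also why a reader cannot interpret it without the schema.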
  5. Supported types

    • Primitive types
      ◦ null
      ◦ boolean
      ◦ int, long, float, double
      ◦ bytes
      ◦ string
    • Complex types
      ◦ record
      ◦ enum
      ◦ array
      ◦ map
      ◦ union
      ◦ fixed
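As an illustration of how these types look in a schema, the sketch below declares one record that uses several primitive types and each of the complex types, then parses it with Python's `json` module. The `Talk` record, its field names and the `type_names` helper are assumptions made for this example, not part of the presentation.

```python
import json

# Hypothetical schema for illustration; record, field and symbol names are assumptions.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Talk",
  "fields": [
    {"name": "title",    "type": "string"},
    {"name": "minutes",  "type": "int"},
    {"name": "recorded", "type": "boolean"},
    {"name": "level",    "type": {"type": "enum", "name": "Level",
                                  "symbols": ["BEGINNER", "ADVANCED"]}},
    {"name": "tags",     "type": {"type": "array", "items": "string"}},
    {"name": "links",    "type": {"type": "map", "values": "string"}},
    {"name": "speaker",  "type": ["null", "string"], "default": null},
    {"name": "checksum", "type": {"type": "fixed", "name": "MD5", "size": 16}}
  ]
}
"""


def type_names(schema):
    """Collect the Avro type names used in a parsed schema.

    Simplification: a string is always treated as a type name, so references
    to previously defined named types are not resolved.
    """
    if isinstance(schema, str):
        return {schema}
    if isinstance(schema, list):  # a JSON array denotes a union
        names = {"union"}
        for branch in schema:
            names |= type_names(branch)
        return names
    kind = schema["type"]
    names = {kind}
    if kind == "record":
        for field in schema["fields"]:
            names |= type_names(field["type"])
    elif kind == "array":
        names |= type_names(schema["items"])
    elif kind == "map":
        names |= type_names(schema["values"])
    return names


schema = json.loads(SCHEMA_JSON)
print(sorted(type_names(schema)))
```

Note that a union is written as a plain JSON array of its branch types, while the other complex types are JSON objects with a `"type"` key.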
  6. Schema evolution

    Set of rules to follow for schema evolution: http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution
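A few of the record-level resolution rules can be sketched in plain Python. This is a simplified illustration that works on an already decoded dictionary, whereas Avro implementations apply resolution while decoding. It covers only three rules: writer fields unknown to the reader are ignored, reader fields missing from the writer take their default (or raise an error if there is none), and numeric types may be promoted (`int` to `long`, `float` or `double`; `long` to `float` or `double`; `float` to `double`). Aliases, nested types and unions are left out. The `WRITER` and `READER` schemas and the `resolve` function are assumptions made for this example.

```python
# Hypothetical writer (old) and reader (new) schemas; all names are assumptions.
WRITER = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "int"},
    {"name": "nickname", "type": "string"},
]}
READER = {"type": "record", "name": "User", "fields": [
    {"name": "id", "type": "long"},                      # promoted from int
    {"name": "email", "type": "string", "default": ""},  # new field with default
]}

# Numeric promotions from writer type to allowed reader types.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
}


def resolve(writer, reader, datum):
    """Project a datum written with `writer` onto the `reader` schema."""
    writer_types = {f["name"]: f["type"] for f in writer["fields"]}
    result = {}
    for field in reader["fields"]:
        name, rtype = field["name"], field["type"]
        if name in writer_types:
            wtype = writer_types[name]
            if wtype != rtype and rtype not in PROMOTIONS.get(wtype, ()):
                raise TypeError("cannot read %s as %s for field %r" % (wtype, rtype, name))
            value = datum[name]
            if rtype in ("float", "double"):
                value = float(value)
            result[name] = value
        elif "default" in field:
            result[name] = field["default"]  # field unknown to the writer
        else:
            raise KeyError("field %r has no writer value and no default" % name)
    # Writer fields absent from the reader schema (here: nickname) are dropped.
    return result


print(resolve(WRITER, READER, {"id": 7, "nickname": "flaco"}))
# {'id': 7, 'email': ''}
```

In practice this means a new field added with a default can be read from old data, while a new field without a default cannot.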
  7. Language Agnostic

    http://apache.cc.uoc.gr/avro/avro-1.8.0/
    • C
    • C++
    • C#
    • Java
    • JavaScript
    • PHP
    • Ruby
    • Python, Python3
    • Microsoft .NET (https://hadoopsdk.codeplex.com/wikipage?title=Avro%20Library)
  8. Questions?

    • Data Serialization and Evolution: http://docs.confluent.io/2.0.0/avro.html
    • Apache Avro™ 1.8.0 Specification: http://avro.apache.org/docs/1.8.0/spec.html
    • Code examples from presentation: https://github.com/dmtrs/introduction-to-avro
    • Benchmark comparing serialization libraries on the JVM: https://github.com/eishay/jvm-serializers/wiki
    • Schema resolution rules: http://avro.apache.org/docs/1.7.7/spec.html#Schema+Resolution