Apache Avro – Let's get practical with code

09 Dec, 2018
In my previous posts I talked about data from a high-level perspective and introduced Apache Avro, a data serialization system. This time we are going to get hands-on with Apache Avro and look at schemas, serialization and schema evolution.

Introducing the domain

We need a domain, and because this is a blog, let's keep it simple: a single person entity. The person entity will change over time, and we will have three versions. The initial version, v1.Person, has a single name field. In v2.Person we have added an age field. In v3.Person we have changed the ordering of the fields.
We can declare this domain in Scala as follows:

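// Each version lives in its own object so that v1.Person, v2.Person and v3.Person can coexist.
object v1 { case class Person(name: String = "") }
object v2 { case class Person(name: String = "", age: Int = 0) }
object v3 { case class Person(age: Int = 0, name: String = "") }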

Avro Schemas

Apache Avro schema definitions are declared in JSON. When we generate schemas from these case classes we get the following Avro schemas.

v1.Person:

{
    "type" : "record",
    "name" : "Person",
    "namespace" : "binxio",
    "fields" : [ {
      "name" : "name",
      "type" : "string",
      "default" : ""
    } ]
}

v2.Person:

{
    "type" : "record",
    "name" : "Person",
    "namespace" : "binxio",
    "fields" : [ {
      "name" : "name",
      "type" : "string",
      "default" : ""
    }, {
      "name" : "age",
      "type" : "int",
      "default" : 0
    } ]
}

v3.Person:

{
    "type" : "record",
    "name" : "Person",
    "namespace" : "binxio",
    "fields" : [ {
      "name" : "age",
      "type" : "int",
      "default" : 0
    }, {
      "name" : "name",
      "type" : "string",
      "default" : ""
    } ]
}

Serializing

When we serialize a record to Avro we get the following Avro datums, shown here as hexadecimal strings. The name ‘Dennis’ is encoded as ‘0C44656E6E6973’: the length 6 is zig-zag encoded as ‘0C’, followed by the six UTF-8 bytes of the string. The number ‘42’ is encoded as ‘54’ because zig-zag encoding maps 42 to 84, which is 54 in hexadecimal.

v1.Person:

v1.Person("Dennis").toAvroBinary().hex
"0C44656E6E6973"

v2.Person:

v2.Person("Dennis", 42).toAvroBinary().hex
"0C44656E6E697354"

v3.Person:

v3.Person(42, "Dennis").toAvroBinary().hex
"540C44656E6E6973"
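
The hexadecimal strings above can be reproduced by hand. The following is a minimal sketch of the binary encoding for the two field types used in this domain, based on what the Avro specification says about them: ints and longs are zig-zag encoded and written as variable-length bytes, and strings are a length followed by UTF-8 bytes. The names AvroEncodingSketch, writeLong, writeString, personV1, personV2 and personV3 are made up for this illustration. They are not part of Avro, nor of the toAvroBinary helper used in this post.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets

// Hypothetical helper object, for illustration only.
object AvroEncodingSketch {
  // Zig-zag maps signed numbers to unsigned ones so that small magnitudes
  // stay small: 0 -> 0, -1 -> 1, 1 -> 2, 42 -> 84.
  def zigZag(n: Long): Long = (n << 1) ^ (n >> 63)

  // Variable-length encoding: 7 bits per byte, least significant group first.
  // A set high bit means that more bytes follow.
  def writeLong(n: Long, out: ByteArrayOutputStream): Unit = {
    var v = zigZag(n)
    while ((v & ~0x7FL) != 0L) {
      out.write(((v & 0x7F) | 0x80).toInt)
      v >>>= 7
    }
    out.write(v.toInt)
  }

  // A string is its UTF-8 byte length (encoded as a long) followed by the bytes.
  def writeString(s: String, out: ByteArrayOutputStream): Unit = {
    val bytes = s.getBytes(StandardCharsets.UTF_8)
    writeLong(bytes.length.toLong, out)
    out.write(bytes, 0, bytes.length)
  }

  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xFF}%02X").mkString

  // A record is just its fields, written in schema order, with no tags or names.
  def personV1(name: String): String = {
    val out = new ByteArrayOutputStream()
    writeString(name, out)
    hex(out.toByteArray)
  }

  def personV2(name: String, age: Int): String = {
    val out = new ByteArrayOutputStream()
    writeString(name, out)
    writeLong(age.toLong, out)
    hex(out.toByteArray)
  }

  def personV3(age: Int, name: String): String = {
    val out = new ByteArrayOutputStream()
    writeLong(age.toLong, out)
    writeString(name, out)
    hex(out.toByteArray)
  }
}
```

Calling AvroEncodingSketch.personV2("Dennis", 42) yields the same string as the v2.Person example. Note that the only difference between the v2 and v3 datums is the order of the bytes; nothing in the datum itself says which field is which.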

Deserializing

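When we deserialize an Avro datum, we need to provide both the writer schema and the reader schema. In the examples below the first type parameter is the reader schema and the second is the writer schema.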

v1.Person:

"0C44656E6E6973".parseAvroBinary[v1.Person, v1.Person] 
v1.Person("Dennis")

v2.Person:

"0C44656E6E697354".parseAvroBinary[v2.Person, v2.Person]
v2.Person("Dennis", 42)

v3.Person:

"540C44656E6E6973".parseAvroBinary[v3.Person, v3.Person]
v3.Person(42, "Dennis")
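
Why is the writer schema needed at all? Because a datum carries no field names or type tags, the bytes can only be interpreted by walking the writer's fields in order. The sketch below illustrates this with a hand-written decoder for the int and string types. The names AvroDecodingSketch, Cursor, decode, v2Fields and v3Fields are hypothetical and exist only for this example; real code would use an Avro library rather than this simplified decoder.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper object, for illustration only.
object AvroDecodingSketch {
  def unhex(s: String): Array[Byte] =
    s.grouped(2).map(h => Integer.parseInt(h, 16).toByte).toArray

  // Reads values from a byte array, keeping track of the current position.
  final class Cursor(bytes: Array[Byte]) {
    private var pos = 0

    // Reverse of the variable-length zig-zag encoding.
    def readLong(): Long = {
      var raw = 0L
      var shift = 0
      var more = true
      while (more) {
        val b = bytes(pos) & 0xFF
        pos += 1
        raw |= (b & 0x7F).toLong << shift
        shift += 7
        more = (b & 0x80) != 0
      }
      (raw >>> 1) ^ -(raw & 1)
    }

    // A string is a length followed by that many UTF-8 bytes.
    def readString(): String = {
      val length = readLong().toInt
      val s = new String(bytes, pos, length, StandardCharsets.UTF_8)
      pos += length
      s
    }
  }

  // Field name and Avro type, in the order the writer schema declares them.
  val v2Fields: Seq[(String, String)] = Seq("name" -> "string", "age" -> "int")
  val v3Fields: Seq[(String, String)] = Seq("age" -> "int", "name" -> "string")

  // Walks the datum field by field, exactly in the writer's order.
  def decode(hexDatum: String, writerFields: Seq[(String, String)]): Seq[(String, Any)] = {
    val in = new Cursor(unhex(hexDatum))
    writerFields.map {
      case (name, "string") => name -> in.readString()
      case (name, "int")    => name -> in.readLong().toInt
      case (name, other)    =>
        throw new IllegalArgumentException(s"unsupported type '$other' for field '$name'")
    }
  }
}
```

Decoding "540C44656E6E6973" with v3Fields gives age 42 and name Dennis. Decoding the same bytes with v2Fields fails, because the first byte is then read as a string length of 42, which runs past the end of the datum.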

Schema Evolution

The schemas that we have defined all have default values for their fields, which makes them fully compatible: a reader on any version can read data written with any other version. Each Avro datum is written with a writer schema. When we ask for a different representation, Avro resolves the writer schema against the reader schema and returns the data in the requested version, filling in defaults for fields the writer did not know about.

Writer: v1.Person => Reader: v2.Person:

"0C44656E6E6973".parseAvroBinary[v2.Person, v1.Person]
v2.Person("Dennis", 0)

Writer: v1.Person => Reader: v3.Person:

"0C44656E6E6973".parseAvroBinary[v3.Person, v1.Person]
v3.Person(0, "Dennis")

Writer: v3.Person => Reader: v1.Person:

"540C44656E6E6973".parseAvroBinary[v1.Person, v3.Person]
v1.Person("Dennis")    

Writer: v3.Person => Reader: v2.Person:

"540C44656E6E6973".parseAvroBinary[v2.Person, v3.Person]
v2.Person("Dennis", 42)

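Writer: v4.Person => Reader: v1.Person (v4.Person is not defined in this post; its datum carries extra fields after age and name, which the v1 reader skips):

"540C44656E6E697302021C4C61617065727376656C64203237021248696C76657273756D020E31323133205642".parseAvroBinary[v1.Person, v4.Person]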
v1.Person("Dennis")
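
The record part of this resolution can be sketched in a few lines. The example below assumes the datum has already been decoded with the writer schema into name/value pairs, and then projects those pairs onto the reader's fields: fields are matched by name, fields the reader does not know are dropped, and missing fields take the reader's default. The names SchemaResolutionSketch, Field and resolve are invented for this illustration and are not Avro's API. Real Avro resolution covers more than this, such as type promotion and aliases.

```scala
// Hypothetical helper object, for illustration only.
object SchemaResolutionSketch {
  // A reader field: its name and, optionally, the default from the reader schema.
  final case class Field(name: String, default: Option[Any])

  val v1Reader: Seq[Field] = Seq(Field("name", Some("")))
  val v2Reader: Seq[Field] = Seq(Field("name", Some("")), Field("age", Some(0)))
  val v3Reader: Seq[Field] = Seq(Field("age", Some(0)), Field("name", Some("")))

  // written: the values decoded with the writer schema, keyed by field name.
  // The result follows the reader's field order.
  def resolve(written: Map[String, Any], readerFields: Seq[Field]): Seq[(String, Any)] =
    readerFields.map { field =>
      val value = written
        .get(field.name)          // the writer wrote this field: use its value
        .orElse(field.default)    // otherwise fall back to the reader's default
        .getOrElse(throw new IllegalArgumentException(
          s"no value and no default for field '${field.name}'"))
      field.name -> value
    }
}
```

Resolving Map("name" -> "Dennis") against v2Reader yields the name plus the default age of 0, which matches the v1 to v2 example. Resolving Map("age" -> 42, "name" -> "Dennis") against v1Reader keeps only the name, which matches the v3 to v1 example. Without defaults on the reader side, the first case would be an error instead.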

Cross Domain Evolution

Apache Avro can evolve schemas across domains if necessary. Here a datum written as v1.Person is read as v1.Cat, a record that also has a name field.

Writer: v1.Person => Reader: v1.Cat:

"0C44656E6E6973".parseAvroBinary[v1.Cat, v1.Person]
v1.Cat("Dennis")

Apparently I’m also a cat!

Conclusion

In this blog we have created a simple domain, generated Avro schemas for it, serialized records to Avro datums, evolved the schemas and even did a cross-domain evolution.
The examples use Scala, a programming language for the JVM, but Apache Avro has language bindings for C, C++, C#, Go, Haskell, Java, Perl, PHP, Python, Ruby, Scala, TypeScript and more.
The Avro datums that we’ve generated in this blog can be read by other systems as well, because Apache Avro is an open data serialization system.
Apache Avro is used by high-volume, high-performance, high-throughput data processing systems like Apache Kafka, Apache Hadoop and Apache Spark.
