Apache Avro was created by Doug Cutting, the creator of Hadoop, Lucene and Nutch, specifically as an open data serialization system. Apache Avro is used by data processing systems like Hadoop, Spark and Kafka, and is the de-facto data serialization system for high-volume, high-throughput data processing.
Apache Avro is a data serialization system, and the great thing is that it can be used like a toolkit. As we are going to focus on its core feature, the serialization system itself, we will be looking at types and schemas, schema evolution, the serialization encodings and the schema registry.
Apache Avro provides language bindings for C, C++, C#, Go, Haskell, Java, Perl, PHP, Python, Ruby, Scala, TypeScript and more. The reference implementation of Avro is developed and released as a Java library.
Although it is possible to write data transformations using the Java programming language, another language on the JVM is much more suitable.
The Scala programming language is designed with data transformation in mind. To support declaring data transformation pipelines, Scala provides very concise function literals and abstractions like type classes, which make it easy to transform data from one representation into another in a simple and concise way.
In order to serialize data structures, you need a strategy to define data structures using types. Avro supports primitive types, logical types and complex types.
| Category | Types |
|----------|-------|
| Primitive types | `null`, `boolean`, `int`, `long`, `float`, `double`, `bytes`, `string` |
| Logical types | `decimal`, `date`, `time-millis`, `timestamp-millis` |
| Complex types | `record`, `enum`, `array`, `map`, `union`, `fixed` |
Schemas define the structure, the type, the meaning (through documentation) of data. By having schemas, data consumers understand the data produced by data producers. By having schemas other concepts like schema evolution and lazy data transformation become possible.
Apache Avro separates schemas from the serialized format. Separating schemas from the serialized format provides an opportunity to optimize both the serialized format and the schema format. The Avro serialization format has been optimized to be as compact and fast as possible. The schema format has been optimized to be language independent, to be stored, to be indexed and made available for search, and to describe data using a rich type system that allows for contextual information by means of documentation.
Separating schema from data provides an opportunity to reason about data in multiple dimensions. One dimension is serializing data in its most elemental form, without any other information than the data itself. The other dimension is the availability of a schema catalogue, which contains all definitions from all domains. The schema catalogue is indexed, continuously updated, searchable and made available to all data stakeholders. The stakeholders can use the catalogue to reason about data domains, explore what is available and, if applicable, start consuming data products.
Avro schemas are defined in JSON. Avro schemas can be stored in a data store and used by both the data producer and data consumer.
Apache Avro supports many types, but the most used is the record type. The record type is useful for describing complex types with nested relationships, and domain entities are described using Avro records.
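As an illustration, a hypothetical `User` entity could be described as a record. All names and fields below are invented for the example; the schema exercises primitive types (`long`, `string`), a logical type (`timestamp-millis`), and complex types (`record` and a `union` for the optional field):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.domain",
  "doc": "A user of the platform.",
  "fields": [
    {"name": "id", "type": "long", "doc": "Unique identifier."},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "createdAt", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```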
Schema evolution is a data management strategy that allows for the evolution of a schema over time. When a schema evolution strategy is in place, downstream consumers are able to consume old, current and future schemas seamlessly.
There are three compatibility patterns:
| Pattern | Meaning |
|---------|---------|
| Backwards compatible | Data encoded with an older schema can be read with a newer schema |
| Forwards compatible | Data encoded with a newer schema can be read with an older schema |
| Full compatible | Old data can be read with a new schema and new data can be read with an old schema |
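For example, adding a field with a default value is a fully compatible change. An old reader simply skips the unknown field in new data, and a new reader reading old data fills in the default. The schema below is illustrative, showing a second version of a hypothetical `User` record whose first version only had `id`:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}
```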
Apache Avro is a data serialization system that provides primitives to serialize data and to describe data. It does not prescribe the serialization strategy that fits our goals: depending on our vision, our goals, our ambition and our culture, we must design a data serialization system that fits what we want to achieve. If full compatibility is a requirement, meaning that schemas are both backwards and forwards compatible, we can design our schemas in such a way that they are.
Now that we know that Apache Avro separates schema from data, we start to see something very interesting. Schemas are independent from data. In fact, data does not play a role when it comes to defining schemas.
Schemas are the most important part of working with Apache Avro. Schemas are all about definitions: they are how we capture our domains, define entities, describe what they mean by means of documentation, and annotate whether an entity carries a certain risk. Schemas provide a very rich vocabulary for describing domains. Schemas are also extensible, which means we can add our own definitions that can be used by other layers in our data serialization system.
Apache Avro schemas allow changes to a definition. A domain entity will often change over time, and Apache Avro allows a definition to evolve with it. Depending on your use case, you can decide whether the evolution support that Avro provides is fit for purpose, or write your own evolution strategy; most data platforms provide their own proprietary schema evolution engine. The most important Apache Avro schema resolution rules are listed below:
| Writer schema | Reader schema | Resolution |
|---------------|---------------|------------|
| Fields in a different order | | Fields are matched by name, not by position |
| Defines a field | Field is absent | The writer's field is ignored |
| Field is absent | New field with a default value | The default value is used |
| `enum` | The written symbol must be present in the reader's enum | Error otherwise |
| `array` of a type | The item types must resolve | Error otherwise |
| `map` of a type | The value types must resolve | Error otherwise |
| `fixed` | Name and size must match | Error otherwise |
| `int` | `long`, `float` or `double` | The value is promoted |
| `long` | `float` or `double` | The value is promoted |
| `float` | `double` | The value is promoted |
| `string` | `bytes` | The value is promoted |
| `bytes` | `string` | The value is promoted |
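A hypothetical writer/reader pair illustrates several rules at once: fields are matched by name rather than position, the writer-only `note` field is ignored, the reader-only `status` field gets its default, and `id` is promoted from `int` to `long`. Both schemas are invented for the example and wrapped in one JSON object purely for presentation:

```json
{
  "writerSchema": {
    "type": "record", "name": "Order",
    "fields": [
      {"name": "id", "type": "int"},
      {"name": "note", "type": "string"}
    ]
  },
  "readerSchema": {
    "type": "record", "name": "Order",
    "fields": [
      {"name": "status", "type": "string", "default": "new"},
      {"name": "id", "type": "long"}
    ]
  }
}
```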
In Apache Avro, there are two parties involved in the serialization system. The writer produces data, and at the time of production it uses a schema; this is called the Writer Schema, because it is the schema the writer used when it serialized the data. The other party is the reader, which consumes the data. The reader can use any schema, even a schema from a whole different domain, but let's say it uses the exact same schema the writer used. The schema that the reader uses is called the Reader Schema.
The Apache Avro serialization system contains a delta engine, called schema resolution in the Avro specification. The engine needs the Writer Schema and the Reader Schema in order to determine what transformation must be applied to make the data fit the schema of the reader. This is called schema-on-read. The result is a data product that the reader can consume: the system marshals the data into a record for processing. When the programming language is Scala, you would most often map it onto a case class.
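The matching-by-name and default-value rules can be sketched in a few lines. This is not the Avro library, just a simplified illustration in Python that ignores type promotion and nested types:

```python
def resolve_record(datum, writer_fields, reader_fields):
    """Resolve a decoded record against the reader schema: match fields
    by name, drop writer-only fields, and fill reader-only fields from
    their defaults. Simplified sketch; real Avro schema resolution also
    handles type promotion, unions and nested types."""
    written = {f["name"] for f in writer_fields}
    out = {}
    for f in reader_fields:
        name = f["name"]
        if name in written:
            out[name] = datum[name]      # field present in both: keep the value
        elif "default" in f:
            out[name] = f["default"]     # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

writer = [{"name": "id"}, {"name": "note"}]
reader = [{"name": "id"}, {"name": "status", "default": "new"}]
print(resolve_record({"id": 7, "note": "hi"}, writer, reader))
# {'id': 7, 'status': 'new'}
```

The writer-only `note` field is dropped, and the reader-only `status` field is filled in from its default, exactly as the resolution table above prescribes.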
Apache Avro supports two serialization formats: a binary encoding and a JSON encoding. For high-performance, high-volume processing the binary encoding is advisable.
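At the heart of the binary encoding, `int` and `long` values are written as zigzag-encoded variable-length integers, as specified by Avro. A minimal sketch of the long encoder in Python:

```python
def encode_long(n: int) -> bytes:
    """Encode a long the way Avro's binary encoding does: zigzag-map
    the value so small negative numbers stay small, then emit 7-bit
    groups, least-significant first, with the high bit of each byte
    as a continuation flag."""
    z = (n << 1) ^ (n >> 63)       # zigzag: 0,-1,1,-2,2 -> 0,1,2,3,4
    z &= (1 << 64) - 1             # keep 64 bits (Python ints are unbounded)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"
```

A string is then encoded as its length (a long, as above) followed by its UTF-8 bytes, which is why small values and short strings stay very compact.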
Apache Avro supports two binary encodings: the raw message encoding and the Single Object Encoding. Raw binary encoded data is called an Avro Datum and is encoded as per the Avro binary encoding specification. The Single Object Encoding is a binary wrapper around the Avro Datum, consisting of an identification marker and the fingerprint of the Writer Schema. A reader can use the fingerprint to look up the schema of the writer in order to deserialize the Avro Datum.
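The framing is small enough to sketch: a two-byte marker `C3 01`, the 8-byte little-endian CRC-64-AVRO fingerprint of the Writer Schema, then the Avro Datum. Note that real implementations fingerprint the schema's Parsing Canonical Form; for brevity this sketch fingerprints the raw schema JSON text:

```python
EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed/polynomial from the Avro spec

# Precompute the byte-wise CRC table, as in the spec's reference code.
_table = []
for i in range(256):
    fp = i
    for _ in range(8):
        fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
    _table.append(fp)

def crc64_avro(data: bytes) -> int:
    """CRC-64-AVRO schema fingerprint, as defined in the Avro spec."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _table[(fp ^ b) & 0xFF]
    return fp

def single_object(datum: bytes, schema_json: str) -> bytes:
    """Wrap an Avro Datum in the Single Object Encoding: 2-byte marker
    C3 01, 8-byte little-endian Writer Schema fingerprint, then the
    raw binary datum."""
    fp = crc64_avro(schema_json.encode("utf-8"))
    return b"\xC3\x01" + fp.to_bytes(8, "little") + datum

# b"\x0e" is the long value 7 in Avro's zigzag/varint encoding.
msg = single_object(b"\x0e", '"long"')
assert msg[:2] == b"\xc3\x01" and len(msg) == 10 + 1
```

The 10-byte header is all the reader needs to find the Writer Schema and deserialize the rest of the message.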
Apache Avro schemas are made available for consumption by both writer and reader by means of a schema registry. Data processing platforms that use Apache Avro, such as those built around Apache Kafka, provide a schema registry. A schema registry provides a serving layer for your metadata: a strategy for storing and retrieving Avro schemas.
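The contract of a registry is simple: register a schema under an id (typically a fingerprint) and let readers look it up later. A toy in-memory sketch, not any real registry's API:

```python
class InMemorySchemaRegistry:
    """Toy sketch of a schema registry: store schema texts keyed by an
    id so readers can resolve the Writer Schema from the id carried
    alongside (or inside) each message. Real registries are shared
    services with a network API; this dict only illustrates the contract."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json: str) -> int:
        schema_id = hash(schema_json)  # stand-in for a real schema fingerprint
        self._schemas[schema_id] = schema_json
        return schema_id

    def lookup(self, schema_id: int) -> str:
        return self._schemas[schema_id]

registry = InMemorySchemaRegistry()
sid = registry.register('{"type": "record", "name": "User", "fields": []}')
assert registry.lookup(sid) == '{"type": "record", "name": "User", "fields": []}'
```

The writer registers its schema once and tags each message with the id; readers resolve the id back to the Writer Schema before deserializing.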
In this blog we have looked at the features of Apache Avro, a data serialization system. Apache Avro is the de-facto data serialization system for high-volume, high-throughput data processing systems like Apache Kafka, Apache Hadoop and Apache Spark.
Apache Avro can be tailored so that it provides a data serialization system that fits our vision, our goals, our ambition, our culture and what we want to achieve. Apache Avro separates schema from data and that allows for a highly flexible schema evolution strategy.
Having data available in its most elemental form allows us to transform it and add identification wrappers like the Single Object Encoding, but also data security wrappers, which can be applied even at a per-field level.
Next time we'll look at how to work with Apache Avro, serialize and deserialize data, and what the schemas and binary encoding look like in code.