Apache Avro was created by Doug Cutting, the creator of Hadoop, Lucene and Nutch, specifically as an open data serialization system. Apache Avro is used by data processing systems like Hadoop, Spark and Kafka, and is the de-facto data serialization system for high-volume, high-throughput data processing.
Apache Avro is a data serialization system, and the great thing is that it can be used like a toolkit. As we are going to focus on its core feature, the serialization system itself, we will be looking at types and schemas, schema evolution, the serialization encodings and the schema registry.
Apache Avro provides language bindings for C, C++, C#, Go, Haskell, Java, Perl, PHP, Python, Ruby, Scala, TypeScript and more. The reference implementation of Avro is developed and released as a Java library.
Although it is possible to write data transformations using the Java programming language, another language on the JVM is much more suitable.
The Scala programming language is designed with data transformation in mind. To support declaring data transformation pipelines, Scala provides very concise function literals and abstractions like type classes, which make it easy to transform data from one representation into another in a simple and concise way.
In order to serialize data structures, you need a strategy to define data structures using types. Avro supports primitive types, logical types and complex types.
| Category | Types |
|----------|-------|
| Primitive types | `null`, `boolean`, `int`, `long`, `float`, `double`, `bytes`, `string` |
| Logical types | `decimal`, `date`, `time-millis`, `timestamp-millis` |
| Complex types | `record`, `enum`, `array`, `map`, `union`, `fixed` |
Schemas define the structure, the type, the meaning (through documentation) of data. By having schemas, data consumers understand the data produced by data producers. By having schemas other concepts like schema evolution and lazy data transformation become possible.
Apache Avro separates schemas from the serialized format. Separating schemas from the serialized format provides an opportunity to optimize both the serialized format and the schema format. The Avro serialization format has been optimized to be as compact and fast as possible. The schema format has been optimized to be language independent, to be stored, to be indexed and made available for search, and to describe data using a rich type system that allows for contextual information by means of documentation.
Separating schema from data provides an opportunity to reason about data in multiple dimensions. One dimension is serializing data in its most elemental form, without any other information than the data itself. The other dimension is the availability of a schema catalogue, which contains all definitions from all domains. The schema catalogue is indexed, continuously updated, searchable and made available to all data stakeholders. The stakeholders can use the catalogue to reason about data domains, explore what is available and, if applicable, start consuming data products.
Avro schemas are defined in JSON. Avro schemas can be stored in a data store and used by both the data producer and data consumer.
Apache Avro supports many types, but the most used is the record type. The record type is useful for describing complex types with nested relationships, and domain entities are described using Avro records.
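As an illustration, a hypothetical `User` entity could be described as a record. All names and fields below are invented for the example; the schema exercises primitive types (`long`, `string`), a logical type (`timestamp-millis`), and complex types (`record` and a `union` for the optional field):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "com.example.domain",
  "doc": "A user of the platform.",
  "fields": [
    {"name": "id", "type": "long", "doc": "Unique identifier."},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null},
    {"name": "createdAt", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}
```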
Schema evolution is a data management strategy that allows for the evolution of a schema over time. When a schema evolution strategy is in place, downstream consumers are able to consume old, current and future schemas seamlessly.
There are three compatibility patterns:
| Pattern | Meaning |
|---------|---------|
| Backwards compatible | Data encoded with an older schema can be read with a newer schema |
| Forwards compatible | Data encoded with a newer schema can be read with an older schema |
| Full compatible | Old data can be read with a new schema and new data can be read with an old schema |
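For example, adding a field with a default value is a fully compatible change. An old reader simply skips the unknown field in new data, and a new reader reading old data fills in the default. The schema below is illustrative, showing a second version of a hypothetical `User` record whose first version only had `id`:

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "country", "type": "string", "default": "unknown"}
  ]
}
```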
Apache Avro is a data serialization system that provides primitives to serialize data and to describe data. It does not prescribe the serialization strategy that fits our goals: depending on our vision, our goals, our ambition and our culture, we must design a data serialization system that fits what we want to achieve. If full compatibility is a requirement, meaning that schemas are both backwards and forwards compatible, we can design our schemas in such a way that they are.
Now that we know that Apache Avro separates schema from data, we start to see something very interesting. Schemas are independent from data. In fact, data does not play a role when it comes to defining schemas.
Schemas are the most important part of working with Apache Avro. Schemas are all about definitions: they are how we capture our domains, define entities, describe what they mean by means of documentation, and annotate whether an entity carries a certain risk. Schemas provide a very rich vocabulary for describing domains. Schemas are also extensible, which means we can add our own definitions that can be used by other layers in our data serialization system.
Apache Avro schemas allow changes to a definition. A domain entity will often change over time, and Apache Avro allows a definition to evolve with it. Depending on your use case, you can decide whether the evolution support that Avro provides is fit for purpose, or write your own evolution strategy; most data platforms provide their own proprietary schema evolution engine. The most important Apache Avro schema resolution rules are listed below:
| Writer schema | Reader schema | Resolution |
|---------------|---------------|------------|
| Fields in a different order | | Fields are matched by name, not by position |
| Defines a field | Field is absent | The writer's field is ignored |
| Field is absent | New field with a default value | The default value is used |
| `enum` | The written symbol must be present in the reader's enum | Error otherwise |
| `array` of a type | The item types must resolve | Error otherwise |
| `map` of a type | The value types must resolve | Error otherwise |
| `fixed` | Name and size must match | Error otherwise |
| `int` | `long`, `float` or `double` | The value is promoted |
| `long` | `float` or `double` | The value is promoted |
| `float` | `double` | The value is promoted |
| `string` | `bytes` | The value is promoted |
| `bytes` | `string` | The value is promoted |
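A hypothetical writer/reader pair illustrates several rules at once: fields are matched by name rather than position, the writer-only `note` field is ignored, the reader-only `status` field gets its default, and `id` is promoted from `int` to `long`. Both schemas are invented for the example and wrapped in one JSON object purely for presentation:

```json
{
  "writerSchema": {
    "type": "record", "name": "Order",
    "fields": [
      {"name": "id", "type": "int"},
      {"name": "note", "type": "string"}
    ]
  },
  "readerSchema": {
    "type": "record", "name": "Order",
    "fields": [
      {"name": "status", "type": "string", "default": "new"},
      {"name": "id", "type": "long"}
    ]
  }
}
```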
In Apache Avro, there are two parties involved in the serialization system. The writer produces data, and at the time of production it uses a schema; this is called the Writer Schema, because it is the schema the writer used when it serialized the data. The other party is the reader, which consumes the data. The reader can use any schema, even a schema from a whole different domain, but let's say it uses the exact same schema the writer used. The schema that the reader uses is called the Reader Schema.
The Apache Avro serialization system contains a delta engine, called schema resolution in the Avro specification. The engine needs the Writer Schema and the Reader Schema in order to determine what transformation must be applied to make the data fit the schema of the reader. This is called schema-on-read. The result is a data product that the reader can consume: the system marshals the data into a record for processing. When the programming language is Scala, you would most often map it onto a case class.
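The matching-by-name and default-value rules can be sketched in a few lines. This is not the Avro library, just a simplified illustration in Python that ignores type promotion and nested types:

```python
def resolve_record(datum, writer_fields, reader_fields):
    """Resolve a decoded record against the reader schema: match fields
    by name, drop writer-only fields, and fill reader-only fields from
    their defaults. Simplified sketch; real Avro schema resolution also
    handles type promotion, unions and nested types."""
    written = {f["name"] for f in writer_fields}
    out = {}
    for f in reader_fields:
        name = f["name"]
        if name in written:
            out[name] = datum[name]      # field present in both: keep the value
        elif "default" in f:
            out[name] = f["default"]     # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return out

writer = [{"name": "id"}, {"name": "note"}]
reader = [{"name": "id"}, {"name": "status", "default": "new"}]
print(resolve_record({"id": 7, "note": "hi"}, writer, reader))
# {'id': 7, 'status': 'new'}
```

The writer-only `note` field is dropped, and the reader-only `status` field is filled in from its default, exactly as the resolution table above prescribes.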
Apache Avro supports two serialization formats: a binary encoding and a JSON encoding. For high-performance, high-volume processing the binary encoding is advisable.
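At the heart of the binary encoding, `int` and `long` values are written as zigzag-encoded variable-length integers, as specified by Avro. A minimal sketch of the long encoder in Python:

```python
def encode_long(n: int) -> bytes:
    """Encode a long the way Avro's binary encoding does: zigzag-map
    the value so small negative numbers stay small, then emit 7-bit
    groups, least-significant first, with the high bit of each byte
    as a continuation flag."""
    z = (n << 1) ^ (n >> 63)       # zigzag: 0,-1,1,-2,2 -> 0,1,2,3,4
    z &= (1 << 64) - 1             # keep 64 bits (Python ints are unbounded)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

assert encode_long(0) == b"\x00"
assert encode_long(-1) == b"\x01"
assert encode_long(1) == b"\x02"
assert encode_long(64) == b"\x80\x01"
```

A string is then encoded as its length (a long, as above) followed by its UTF-8 bytes, which is why small values and short strings stay very compact.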
Apache Avro supports two binary encodings: the raw message encoding and the Single Object Encoding. Raw binary encoded data is called an Avro Datum and is encoded as per the Avro binary encoding specification. The Single Object Encoding is a binary wrapper around the Avro Datum, consisting of an identification marker and the fingerprint of the Writer Schema. A reader can use the fingerprint to look up the schema of the writer in order to deserialize the Avro Datum.
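The framing is small enough to sketch: a two-byte marker `C3 01`, the 8-byte little-endian CRC-64-AVRO fingerprint of the Writer Schema, then the Avro Datum. Note that real implementations fingerprint the schema's Parsing Canonical Form; for brevity this sketch fingerprints the raw schema JSON text:

```python
EMPTY = 0xC15D213AA4D7A795  # CRC-64-AVRO seed/polynomial from the Avro spec

# Precompute the byte-wise CRC table, as in the spec's reference code.
_table = []
for i in range(256):
    fp = i
    for _ in range(8):
        fp = (fp >> 1) ^ (EMPTY & -(fp & 1))
    _table.append(fp)

def crc64_avro(data: bytes) -> int:
    """CRC-64-AVRO schema fingerprint, as defined in the Avro spec."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ _table[(fp ^ b) & 0xFF]
    return fp

def single_object(datum: bytes, schema_json: str) -> bytes:
    """Wrap an Avro Datum in the Single Object Encoding: 2-byte marker
    C3 01, 8-byte little-endian Writer Schema fingerprint, then the
    raw binary datum."""
    fp = crc64_avro(schema_json.encode("utf-8"))
    return b"\xC3\x01" + fp.to_bytes(8, "little") + datum

# b"\x0e" is the long value 7 in Avro's zigzag/varint encoding.
msg = single_object(b"\x0e", '"long"')
assert msg[:2] == b"\xc3\x01" and len(msg) == 10 + 1
```

The 10-byte header is all the reader needs to find the Writer Schema and deserialize the rest of the message.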
Apache Avro schemas are made available for consumption by both writer and reader by means of a schema registry. Data processing platforms that use Apache Avro, such as those built around Apache Kafka, provide a schema registry. A schema registry provides a serving layer for your metadata: a strategy for storing and retrieving Avro schemas.
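The contract of a registry is simple: register a schema under an id (typically a fingerprint) and let readers look it up later. A toy in-memory sketch, not any real registry's API:

```python
class InMemorySchemaRegistry:
    """Toy sketch of a schema registry: store schema texts keyed by an
    id so readers can resolve the Writer Schema from the id carried
    alongside (or inside) each message. Real registries are shared
    services with a network API; this dict only illustrates the contract."""

    def __init__(self):
        self._schemas = {}

    def register(self, schema_json: str) -> int:
        schema_id = hash(schema_json)  # stand-in for a real schema fingerprint
        self._schemas[schema_id] = schema_json
        return schema_id

    def lookup(self, schema_id: int) -> str:
        return self._schemas[schema_id]

registry = InMemorySchemaRegistry()
sid = registry.register('{"type": "record", "name": "User", "fields": []}')
assert registry.lookup(sid) == '{"type": "record", "name": "User", "fields": []}'
```

The writer registers its schema once and tags each message with the id; readers resolve the id back to the Writer Schema before deserializing.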
In this blog we have looked at the features of Apache Avro, a data serialization system. Apache Avro is the de-facto data serialization system for high-volume, high-throughput data processing systems like Apache Kafka, Apache Hadoop and Apache Spark.
Apache Avro can be tailored so that it provides a data serialization system that fits our vision, our goals, our ambition, our culture and what we want to achieve. Apache Avro separates schema from data and that allows for a highly flexible schema evolution strategy.
Having data available in its most elemental form allows us to transform it and add identification wrappers like the Single Object Encoding, but also data security wrappers, which can be applied even at a per-field level.
Next time we'll look at how to work with Apache Avro, serialize and deserialize data, and what the schemas and binary encoding look like in code.