Introduction
Apache Avro™ is the leading serialization format for record data, and a first choice for streaming data pipelines. It offers excellent schema evolution.
In this article, we discuss how to author Avro enums so that data written with the new version of the schema can be read by consumers using the previous version of the schema. In other words, we are authoring the Avro schema so that it is forward compatible.
Avro Enums
In Avro, you can define enums as follows:
"type": { "name": "color", "type": "enum", "symbols": ["red", "blue", "green"] }
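Enum symbols are always strings in Avro; in the binary encoding, a value is written as the zero-based position of its symbol in the schema.
The Apache Avro documentation describes the specification for enums.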
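To make the on-the-wire representation concrete, here is a minimal Python sketch of how an enum value is laid out in Avro's binary encoding: the writer emits the zero-based position of the symbol as a zigzag, variable-length int. The helper names (zigzag_varint, encode_enum) are made up for illustration and are not part of any Avro library.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode an int the way Avro encodes int/long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes map to small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_enum(symbol: str, symbols: list) -> bytes:
    """Hypothetical helper: encode a symbol as its position in the writer's symbol list."""
    return zigzag_varint(symbols.index(symbol))


symbols = ["red", "blue", "green"]
encoded = encode_enum("blue", symbols)
print(encoded.hex())  # "blue" is at position 1, which zigzag-encodes to the single byte 02
```

Because only a position is written, a reader needs the writer's schema to turn that position back into a symbol name before it can look the name up in its own schema.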
Avro Compatibility
Now we come to the main part of this blog post. What happens if you read data containing an enum symbol that is not part of the reader schema?
Let's start by defining the first version of the schema in a file called "v1.avsc". This is the reader schema.
{
  "name": "v1",
  "type": "record",
  "namespace": "com.acme",
  "doc": "Enum test",
  "fields": [
    {
      "name": "color",
      "type": {
        "name": "enum",
        "type": "enum",
        "symbols": [
          "unknown",
          "red",
          "blue",
          "green"
        ],
        "default": "unknown"
      },
      "default": "unknown"
    }
  ]
}
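Note that the schema carries two defaults: the one inside the enum type is used when a reader meets a symbol it does not know, while the one on the field is used when the field is missing from the written data. The Avro specification requires the enum-level default to be one of the enum's own symbols. A quick sanity check along those lines can be sketched in Python with only the standard library; the function name check_enum_defaults is an assumption for illustration, and the sketch only inspects enums declared directly as a field's type.

```python
import json

V1_SCHEMA = json.loads("""
{
  "name": "v1", "type": "record", "namespace": "com.acme",
  "fields": [
    {"name": "color",
     "type": {"name": "enum", "type": "enum",
              "symbols": ["unknown", "red", "blue", "green"],
              "default": "unknown"},
     "default": "unknown"}
  ]
}
""")


def check_enum_defaults(record_schema: dict) -> list:
    """Return warnings about enum-typed fields; an empty list means nothing was flagged."""
    warnings = []
    for field in record_schema["fields"]:
        ftype = field["type"]
        if not (isinstance(ftype, dict) and ftype.get("type") == "enum"):
            continue  # this sketch ignores unions, nested records, and named references
        symbols = ftype["symbols"]
        if "default" not in ftype:
            # Legal Avro, but readers will fail on symbols they do not know.
            warnings.append(f"{field['name']}: enum has no default symbol")
        elif ftype["default"] not in symbols:
            warnings.append(f"{field['name']}: enum default is not a declared symbol")
        if "default" in field and field["default"] not in symbols:
            warnings.append(f"{field['name']}: field default is not a declared symbol")
    return warnings


print(check_enum_defaults(V1_SCHEMA))
```

For the v1 schema above, the returned list is empty.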
We have the schema. How do we generate data? For that, we need the avro-tools jar. Download avro-tools and store it in a local folder.
Now you can generate data as follows. First, write a JSON message corresponding to the schema into a file called "v1.json":
{ "color": "red" }
Now, convert the JSON to Avro using avro-tools:
java -jar ~/DevTools/avro-tools-1.11.1.jar fromjson --schema-file v1.avsc v1.json > v1.avro
Now, we evolve the schema. Add another symbol, "yellow", to the "color" enum. Store it in a file called "v2.avsc".
{
  "name": "v1",
  "type": "record",
  "namespace": "com.acme",
  "doc": "Enum test",
  "fields": [
    {
      "name": "color",
      "type": {
        "name": "enum",
        "type": "enum",
        "symbols": [
          "unknown",
          "red",
          "blue",
          "green",
          "yellow"
        ],
        "default": "unknown"
      },
      "default": "unknown"
    }
  ]
}
Create a new JSON message in a file called "v2.json" that uses the new enum value:
{ "color": "yellow" }
Let's convert this to Avro:
java -jar ~/DevTools/avro-tools-1.11.1.jar fromjson --schema-file v2.avsc v2.json > v2.avro
Now let's see what happens if you try to read the v2.avro file using the v1 schema (v1.avsc). Remember that the v1 schema does not have the symbol "yellow" in the enum.
We will use the tojson command from avro-tools to convert the Avro file to JSON.
$ java -jar ~/DevTools/avro-tools-1.11.1.jar tojson --reader-schema-file v1.avsc v2.avro
{"color":"unknown"}
As you can see, when the v2 Avro message is read with a schema that does not have the new enum symbol, the unrecognized symbol is converted to the default value declared on that enum.
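The rule the reader applies here can be sketched in a few lines of Python. This is a simplified model of enum resolution, not the Avro library's code: the function resolve_enum and the exception class AvroTypeError are stand-ins invented for illustration.

```python
from typing import Optional


class AvroTypeError(Exception):
    """Stand-in for the error a real Avro reader raises; not the library's class."""


def resolve_enum(writer_symbol: str, reader_symbols: list,
                 reader_default: Optional[str] = None) -> str:
    """Map a symbol written with the writer's schema onto the reader's enum."""
    if writer_symbol in reader_symbols:
        return writer_symbol          # symbol is known to the reader: pass it through
    if reader_default is not None:
        return reader_default         # unknown symbol: fall back to the enum's default
    raise AvroTypeError(f"No match for {writer_symbol}")


v1_symbols = ["unknown", "red", "blue", "green"]
print(resolve_enum("red", v1_symbols, "unknown"))     # red
print(resolve_enum("yellow", v1_symbols, "unknown"))  # unknown
```

The lookup is by symbol name rather than by position, so a reader can also tolerate a writer that lists the same symbols in a different order.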
Now, let's see what happens if the enum has no default value in the reader schema. Let's create a new schema without the enum-level default (the field-level default stays) in a file called "v1-nodefault.avsc".
{
  "name": "v1",
  "type": "record",
  "namespace": "com.acme",
  "doc": "Enum test",
  "fields": [
    {
      "name": "color",
      "type": {
        "name": "enum",
        "type": "enum",
        "symbols": [
          "unknown",
          "red",
          "blue",
          "green"
        ]
      },
      "default": "unknown"
    }
  ]
}
Use this schema to read the v2.avro file:
java -jar ~/DevTools/avro-tools-1.11.1.jar tojson --reader-schema-file v1-nodefault.avsc v2.avro
This results in the following exception, which Java prints to stderr:
Exception in thread "main" org.apache.avro.AvroTypeException: No match for yellow
at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:269)
at org.apache.avro.generic.GenericDatumReader.readEnum(GenericDatumReader.java:268)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:182)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:263)
at org.apache.avro.file.DataFileStream.next(DataFileStream.java:248)
at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:98)
at org.apache.avro.tool.Main.run(Main.java:67)
at org.apache.avro.tool.Main.main(Main.java:56)
As you can see, removing the default value from the enum in the reader schema makes data written with the new schema unreadable by consumers using the previous schema.
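This kind of breakage can be caught before any data is written by comparing the two schema versions. The following Python sketch assumes the enum definitions have already been extracted from the .avsc files as dictionaries; the function name unreadable_symbols is hypothetical, and a real pipeline would more likely rely on a schema registry or the Avro library's own compatibility checks.

```python
def unreadable_symbols(writer_enum: dict, reader_enum: dict) -> list:
    """List the symbols a writer may emit that the reader has no way to resolve."""
    if "default" in reader_enum:
        return []  # any unknown symbol falls back to the reader's default
    known = set(reader_enum["symbols"])
    return [s for s in writer_enum["symbols"] if s not in known]


v2_enum = {"type": "enum", "name": "enum",
           "symbols": ["unknown", "red", "blue", "green", "yellow"],
           "default": "unknown"}
v1_enum = {"type": "enum", "name": "enum",
           "symbols": ["unknown", "red", "blue", "green"],
           "default": "unknown"}
v1_nodefault_enum = {"type": "enum", "name": "enum",
                     "symbols": ["unknown", "red", "blue", "green"]}

print(unreadable_symbols(v2_enum, v1_enum))            # []
print(unreadable_symbols(v2_enum, v1_nodefault_enum))  # ['yellow']
```

An empty result means every value the new writer can produce is readable by the old reader, either directly or through the default.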
Conclusion
We can conclude by stating the best practice for Avro enums.
Always declare a default value in Avro enums. This allows the schema to evolve while maintaining forward compatibility: data written with newer versions of the schema can still be read by consumers using previous versions.