Thursday, June 6, 2024


Authoring Avro Enums for Extensibility


Introduction

Apache Avro™ is a widely used serialization format for record data and a common choice for streaming data pipelines. It offers excellent support for schema evolution.

In this article, we discuss how to author Avro enums so that data written with a new version of a schema can be read by consumers still using the previous version. In other words, we author the Avro schema so that it is forward compatible.

Avro Enums


In Avro, you can define enums as follows:

"type": { "name": "color", "type": "enum", "symbols": ["red", "blue", "green"] }


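Enum symbols are always strings in an Avro schema and in Avro's JSON encoding. In the binary encoding, an enum value is written as the zero-based position of its symbol in the schema.

The Apache Avro documentation describes the specification for enums.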
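To make the binary representation concrete, here is a minimal sketch in Python (standard library only, not the Avro library) of how an enum value would be encoded under the Avro specification: the symbol's zero-based position, written as a zigzag varint like any Avro int. The function names `zigzag_varint` and `encode_enum` are illustrative and do not come from any Avro API.

```python
def zigzag_varint(n):
    """Encode an integer as Avro encodes int/long: zigzag, then base-128 varint."""
    # Zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... (valid for 64-bit values).
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_enum(symbol, symbols):
    """Encode an enum value as the position of its symbol in the schema."""
    return zigzag_varint(symbols.index(symbol))


symbols = ["red", "blue", "green"]
for s in symbols:
    print(s, encode_enum(s, symbols).hex())
# red 00
# blue 02
# green 04
```

Because only the position is written, the reader needs the writer's schema to turn that position back into a symbol name. This is why Avro data files embed the writer schema.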

Avro Compatibility


Now, we come to the main part of this blog post. What happens if you read data containing an enum symbol that is not part of the reader schema?

Let's start by defining the first version of the schema in a file called "v1.avsc". This is the reader schema.

{ "name": "v1", "type": "record", "namespace": "com.acme", "doc": "Enum test", "fields": [ { "name": "color", "type": { "name": "enum", "type": "enum", "symbols": [ "unknown", "red", "blue", "green" ], "default": "unknown" }, "default": "unknown" } ] }

We have the schema. How do we generate data? For that we need the avro-tools jar. Download avro-tools and store it in a local folder.

Now you can create an Avro data file as follows. First, write a JSON message that conforms to the schema.

Write the following into a file called "v1.json":
{ "color": "red" }

Now, convert the JSON to Avro using avro-tools:

java -jar ~/DevTools/avro-tools-1.11.1.jar fromjson --schema-file v1.avsc v1.json > v1.avro


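Now, we evolve the schema. Add another symbol, "yellow", to the enum used by the "color" field, and store the result in a file called v2.avsc. The record name stays "v1" so that the reader and writer schemas still match by name.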

{ "name": "v1", "type": "record", "namespace": "com.acme", "doc": "Enum test", "fields": [ { "name": "color", "type": { "name": "enum", "type": "enum", "symbols": [ "unknown", "red", "blue", "green", "yellow" ], "default": "unknown" }, "default": "unknown" } ] }

Create a new JSON message in the file v2.json that uses the new enum value:

{ "color": "yellow" }

Let's convert this to Avro.

java -jar ~/DevTools/avro-tools-1.11.1.jar fromjson --schema-file v2.avsc v2.json > v2.avro

Now let's see what happens if you try to read the v2.avro file using the v1 schema (v1.avsc). Remember that the v1 schema does not have the symbol "yellow" in the enum.

We will use the tojson command from avro-tools to convert the Avro file to JSON.

$ java -jar ~/DevTools/avro-tools-1.11.1.jar tojson --reader-schema-file v1.avsc v2.avro
{"color":"unknown"}

As you can see, when the v2 Avro message is read with a schema that lacks the new enum symbol, the unrecognized symbol is converted to the default value of that enum.
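The behaviour seen here follows the enum resolution rule in the Avro specification: if the writer's symbol is not present in the reader's enum and the reader's enum has a default, the default is used; otherwise an error is signalled. The following is a minimal sketch of that rule in plain Python. It is not the Avro library's implementation, and `resolve_enum_symbol` and `EnumResolutionError` are hypothetical names introduced only for illustration.

```python
class EnumResolutionError(Exception):
    """Raised when a writer symbol cannot be mapped onto the reader's enum."""


def resolve_enum_symbol(writer_symbol, reader_symbols, reader_default=None):
    """Apply the enum resolution rule for a single value."""
    if writer_symbol in reader_symbols:
        return writer_symbol          # symbol known to the reader: keep it
    if reader_default is not None:
        return reader_default         # unknown symbol: fall back to the enum default
    raise EnumResolutionError(f"No match for {writer_symbol}")


v1_symbols = ["unknown", "red", "blue", "green"]

print(resolve_enum_symbol("red", v1_symbols, "unknown"))     # red
print(resolve_enum_symbol("yellow", v1_symbols, "unknown"))  # unknown
```

Note that the match is made on the symbol name, not on its position, so appending "yellow" to the end of the list does not disturb the existing symbols.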

Now, let's see what happens if the enum has no default value in the schema. Let's create a schema without it, in a file called v1-nodefault.avsc:

{ "name": "v1", "type": "record", "namespace": "com.acme", "doc": "Enum test", "fields": [ { "name": "color", "type": { "name": "enum", "type": "enum", "symbols": [ "unknown", "red", "blue", "green" ] }, "default": "unknown" } ] }


Use this schema to read the v2.avro file.

java -jar ~/DevTools/avro-tools-1.11.1.jar tojson --reader-schema-file v1-nodefault.avsc v2.avro


This will result in the following exception being printed to stderr:


Exception in thread "main" org.apache.avro.AvroTypeException: No match for yellow
	at org.apache.avro.io.ResolvingDecoder.readEnum(ResolvingDecoder.java:269)
	at org.apache.avro.generic.GenericDatumReader.readEnum(GenericDatumReader.java:268)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:182)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:260)
	at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:248)
	at org.apache.avro.generic.GenericDatumReader.readWithoutConversion(GenericDatumReader.java:180)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:161)
	at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:154)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:263)
	at org.apache.avro.file.DataFileStream.next(DataFileStream.java:248)
	at org.apache.avro.tool.DataFileReadTool.run(DataFileReadTool.java:98)
	at org.apache.avro.tool.Main.run(Main.java:67)
	at org.apache.avro.tool.Main.main(Main.java:56)

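As you can see, without a default value on the enum in the reader schema, data written with the new schema cannot be read with the previous schema. Note that the field-level "default": "unknown" is still present in v1-nodefault.avsc. It does not help here, because a field default only applies when the field is missing from the writer's data, not when the enum symbol is unknown.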
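The difference between the two reader schemas can be expressed as a small check. The sketch below, again plain Python rather than the Avro library, lists the writer symbols that a given reader enum could not resolve. The helper name `unresolvable_symbols` is hypothetical, and the dictionaries simply mirror the enum definitions from the three schema files above.

```python
def unresolvable_symbols(writer_enum, reader_enum):
    """Return the writer symbols that the reader enum cannot resolve."""
    if "default" in reader_enum:
        return []  # every unknown symbol falls back to the enum default
    known = set(reader_enum["symbols"])
    return [s for s in writer_enum["symbols"] if s not in known]


v1 = {"type": "enum", "name": "enum",
      "symbols": ["unknown", "red", "blue", "green"], "default": "unknown"}
v1_nodefault = {"type": "enum", "name": "enum",
                "symbols": ["unknown", "red", "blue", "green"]}
v2 = {"type": "enum", "name": "enum",
      "symbols": ["unknown", "red", "blue", "green", "yellow"], "default": "unknown"}

print(unresolvable_symbols(v2, v1))            # []
print(unresolvable_symbols(v2, v1_nodefault))  # ['yellow']
```

An empty list means a reader on that schema can read data written with the newer one. A non-empty list names the symbols that would trigger the "No match for ..." exception.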

Conclusion

Now, we can conclude by stating the best practice for Avro Enums.

Use a default value in Avro enums so that the schema can evolve while remaining forward compatible: consumers still on a previous version of the schema can then read data written with a newer version that adds symbols.
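One way to apply this practice consistently is to check schemas for enums that lack a default before publishing them. The following is a rough sketch of such a check using only Python's json module. The function `enums_without_default` and its path notation are assumptions made for this example, not part of avro-tools or any Avro API, and the walker only covers inline type definitions in records, arrays, maps and unions.

```python
import json


def enums_without_default(schema, path="$"):
    """Yield the location of every enum type lacking a type-level 'default'."""
    if isinstance(schema, list):
        # A JSON array is a union: check each branch.
        for i, branch in enumerate(schema):
            yield from enums_without_default(branch, f"{path}[{i}]")
    elif isinstance(schema, dict):
        kind = schema.get("type")
        if kind == "enum":
            if "default" not in schema:
                yield path
        elif kind == "record":
            for field in schema.get("fields", []):
                yield from enums_without_default(
                    field["type"], f"{path}.{field['name']}")
        elif kind == "array":
            yield from enums_without_default(schema["items"], f"{path}[]")
        elif kind == "map":
            yield from enums_without_default(schema["values"], f"{path}{{}}")
        elif isinstance(kind, (dict, list)):
            yield from enums_without_default(kind, path)
    # Plain strings are primitives or references to named types: nothing to check.


V1_NODEFAULT = """
{ "name": "v1", "type": "record", "namespace": "com.acme", "doc": "Enum test",
  "fields": [ { "name": "color",
                "type": { "name": "enum", "type": "enum",
                          "symbols": [ "unknown", "red", "blue", "green" ] },
                "default": "unknown" } ] }
"""

print(list(enums_without_default(json.loads(V1_NODEFAULT))))  # ['$.color']
```

The field-level default on "color" does not satisfy the check, since only the default declared on the enum type itself is used when an unknown symbol is encountered.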
