Sunday, July 7, 2024

Kafka Privacy Toolkit Part 1: Protect PII data using a custom single message transform.

Introduction


Kafka is a highly scalable and durable distributed streaming platform that has become the go-to choice for handling large volumes of data across a variety of industries. Its ecosystem of services is vast and powerful, enabling businesses to solve complex problems and uncover insights from their data in real time.

One of the key services in the Kafka ecosystem is Kafka Connect. It is a framework that enables streaming integration between Kafka and other data systems. Kafka Connect is flexible and extensible, with a plugin-based architecture that allows it to integrate seamlessly with various data systems. Users can write custom connectors to integrate Kafka with any data system that has a Java API, and Kafka Connect's RESTful API can be used to configure and manage connector instances. With Kafka Connect, businesses can leverage the power of Kafka to enable streaming integration between different data systems with ease.

Data Privacy & Compliance


Organizations process vast amounts of data, and some of it includes sensitive Personally Identifiable Information (PII) such as names, addresses, and Social Security numbers. It is important to protect that data from unauthorized access and usage.

The financial industry offers one use case where removing PII while transferring data to an external system with Kafka Connect is crucial. Financial institutions are required to comply with data protection laws and maintain the confidentiality of their clients' information. Kafka Connect enables these institutions to transfer data seamlessly between internal systems and external systems, such as data warehouses or third-party analytics vendors, while ensuring that sensitive information is obfuscated. This enhances data security, helps maintain compliance with regulations, and reduces the risk of data breaches.

In this blog post, we will explore how to use a custom Kafka Single Message Transform (SMT) to obfuscate PII data in a Kafka message. This will help you maintain data privacy and security while using Kafka effectively.

Drop Messages with PII

You can use the Confluent Filter SMT to drop entire messages that match a condition, and the condition can be written to match on PII data in the message.

For example, given the input records:
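(The records below are illustrative; assume schemaless JSON values, for example on an employees topic.)

```json
{"name": "Jon Doe", "salary": 85000}
{"name": "Jane Roe", "salary": 0}
```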


And the Kafka Connect SMT config:
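The snippet below is a minimal sketch using the Confluent Filter SMT; the transform alias dropPII is arbitrary, and filter.condition is a JsonPath expression evaluated against the record value:

```json
"transforms": "dropPII",
"transforms.dropPII.type": "io.confluent.connect.transforms.Filter$Value",
"transforms.dropPII.filter.condition": "$[?(@.salary > 0)]",
"transforms.dropPII.filter.type": "exclude",
"transforms.dropPII.missing.or.null.behavior": "include"
```

With filter.type set to exclude, any record matching the condition is dropped; records without a salary field pass through because of missing.or.null.behavior=include.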

The output will be:
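```json
{"name": "Jane Roe", "salary": 0}
```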


As you can see, the "Jon Doe" record was dropped because it had a salary field value that was greater than zero.

Remove sensitive values in PII fields


Use the Confluent MaskField transform to mask the data in PII fields. By default it blanks a field out with a type-appropriate empty value, and it can also replace the field with a fixed token value. When combined with predicates, it is a powerful mechanism for masking PII data only when certain conditions are met.

For example, given the input records:
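(Again, illustrative schemaless JSON values:)

```json
{"name": "Jon Doe", "salary": 85000}
{"name": "Jane Roe", "salary": 95000}
```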


And the Kafka Connect SMT config:
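The snippet below is a minimal sketch using the MaskField transform; the alias maskSalary, the predicate name onEmployeeTopic, and the topic pattern are illustrative:

```json
"transforms": "maskSalary",
"transforms.maskSalary.type": "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.maskSalary.fields": "salary",
"transforms.maskSalary.predicate": "onEmployeeTopic",
"predicates": "onEmployeeTopic",
"predicates.onEmployeeTopic.type": "org.apache.kafka.connect.transforms.predicates.TopicNameMatches",
"predicates.onEmployeeTopic.pattern": "employees.*"
```

With no explicit replacement configured, MaskField substitutes the field type's null-equivalent (0 for numeric fields, an empty string for strings); the predicate restricts the transform to topics whose names match the pattern.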

The output will be:
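```json
{"name": "Jon Doe", "salary": 0}
{"name": "Jane Roe", "salary": 0}
```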


In the output, the salary field in every record was masked, replaced with the default value for its type (0 for a numeric field).

Manage JSON payloads in Kafka messages


Sometimes your Kafka message contains fields that are themselves JSON payloads encoded as strings. This can happen, for example, when using a DynamoDB source connector, which embeds the entire DynamoDB row as a JSON string in the Document field of the message.

The challenge comes when trying to mask PII fields inside a JSON payload that is embedded in a single field of a Kafka message.

For this, you can use the custom mask-json-field transform. It allows you to mask fields at any level of nesting in the embedded JSON payload, and it even supports arrays.

For example, given the input record:
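(An illustrative record, with the DynamoDB row serialized as a JSON string inside the Document field:)

```json
{
  "Document": "{\"name\": \"Jon Doe\", \"salary\": 85000, \"ssn\": \"123-45-6789\"}"
}
```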


And the Kafka Connect SMT config:
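The snippet below is only a sketch: the class and property names are placeholders, since the exact configuration keys depend on the mask-json-field implementation, so check its documentation for the real names:

```json
"transforms": "maskJson",
"transforms.maskJson.type": "com.example.connect.transforms.MaskJsonField$Value",
"transforms.maskJson.json.field": "Document",
"transforms.maskJson.mask.fields": "salary,ssn",
"transforms.maskJson.mask.replacements": "0,xxx-xx-xxxx"
```

The idea is to point the transform at the string field holding the embedded JSON (Document), list the JSON fields to mask, and supply a replacement value for each.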

The output will be:
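```json
{
  "Document": "{\"name\": \"Jon Doe\", \"salary\": 0, \"ssn\": \"xxx-xx-xxxx\"}"
}
```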


In the output, the salary field was replaced with zero, and the ssn field was replaced with "xxx-xx-xxxx".

Conclusion

In this article, I showed how to use Kafka Connect Single Message Transforms to drop messages containing PII fields, or to mask PII fields within messages.

You can use these transforms to help ensure that the data flowing into or out of your Kafka topics complies with data protection regulations.