Schema Registry for Confluent Platform

Schema Registry provides a centralized repository for managing and validating schemas for topic message data, and for serialization and deserialization of the data over the network. Producers and consumers of Kafka topics can use schemas to ensure data consistency and compatibility as schemas evolve. Schema Registry is a key component for data governance, helping to ensure data quality, adherence to standards, visibility into data lineage, audit capabilities, collaboration across teams, efficient application development protocols, and system performance.

Quick starts

About Schema Registry

Schema Registry is a key component of Stream Governance, available in the Essentials and Advanced Stream Governance packages on Confluent Cloud, and as a premium feature of self-managed Confluent Platform. Schema Registry provides a centralized repository for managing and validating schemas used in data processing and serialization (into and out of binary format). As an event-based system, Confluent brokers use Schema Registry to intelligently transfer Kafka topic message data and events between producers and consumers.

[Image: Schema Registry in the Confluent ecosystem]

Schema Registry provides several benefits, including data validation, compatibility checking, versioning, and evolution. It also simplifies the development and maintenance of data pipelines and reduces the risk of data compatibility issues, data corruption, and data loss.

Schema Registry enables you to define schemas for your data formats, version them, and register them with the registry. Once registered, a schema can be shared and reused across different systems and applications. When a producer sends data to a message broker, the serializer registers the schema (if it is not already registered) and embeds the schema ID in each message, and Schema Registry ensures that the schema is valid and compatible with the expected schema for the topic.
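As a minimal sketch of this flow, the snippet below configures a Kafka producer with the Confluent Avro serializer. The broker and registry URLs, the "orders" topic, and the Order record schema are illustrative assumptions, not values from this page.

```java
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker and Schema Registry endpoints.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer registers the schema (if needed) and embeds its ID in each message.
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // Illustrative Avro schema for an "Order" record.
        String schemaString = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"}]}";
        Schema schema = new Schema.Parser().parse(schemaString);

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "order-42");
        order.put("amount", 19.99);

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The serializer validates the record against the schema before sending.
            producer.send(new ProducerRecord<>("orders", order.get("id").toString(), order));
        }
    }
}
```

Because the schema is registered once and only its ID travels with each message, the per-message overhead stays small even for large schemas.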

Data ecosystem

Schema Registry provides the following services:

At its core, Schema Registry has two main parts:

Schema Registry seamlessly integrates with the rest of the Confluent ecosystem:

Understanding schemas

A schema defines the structure of message data, including the allowed data types, their format, and the relationships between them. It acts as a blueprint for the data, describing the structure of data records, the data types of individual fields, the relationships between fields, and any constraints or rules that apply to the data.

Schemas are used in various data processing systems, including databases, message brokers, and distributed event and data processing frameworks. They help ensure that data is consistent, accurate, and can be efficiently processed and analyzed by different systems and applications. Schemas facilitate data sharing and interoperability between different systems and organizations.
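To make this concrete, the sketch below parses a hypothetical Avro schema in which field names, types, an optional field with a default, and doc strings play the roles described above; none of the identifiers come from this page.

```java
import org.apache.avro.Schema;

public class SchemaExample {
    public static void main(String[] args) {
        // A hypothetical Avro schema: field names, types, a default value, and doc strings
        // together describe the structure and constraints of each record.
        String userSchema = "{"
                + "\"type\": \"record\","
                + "\"name\": \"User\","
                + "\"namespace\": \"com.example\","
                + "\"fields\": ["
                + "  {\"name\": \"id\", \"type\": \"long\", \"doc\": \"Unique user identifier\"},"
                + "  {\"name\": \"email\", \"type\": \"string\"},"
                + "  {\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}"
                + "]}";

        Schema schema = new Schema.Parser().parse(userSchema);
        System.out.println("Parsed schema for record: " + schema.getFullName());
        // Print each field and its type to show what the schema constrains.
        schema.getFields().forEach(f -> System.out.println(f.name() + " : " + f.schema()));
    }
}
```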

Common data problems solved by Schema Registry

Schema Registry solves the following common problems of working with large-scale data systems:

Schema Registry features and benefits

Schema Registry helps improve reliability, flexibility, and scalability of systems and applications by providing a standard way to manage and validate schemas used by producers and consumers.

Schema Registry provides the following advantages with regard to data management:

Compatibility and schema evolution

Apache Kafka® producers write data to Kafka topics and Kafka consumers read data from Kafka topics. There is an implicit “contract” that producers write data with a schema that can be read by consumers, even as producers and consumers evolve their schemas. Schema Registry helps ensure that this contract is met with compatibility checks.

It is useful to think about schemas as APIs. Applications depend on APIs and expect that any changes to those APIs remain compatible so the applications continue to run. Similarly, streaming applications depend on schemas and expect that any changes to those schemas remain compatible so they continue to run. Schema evolution therefore requires compatibility checks to ensure that the producer-consumer contract is not broken. This is where Schema Registry helps: it provides centralized schema management and compatibility checks as schemas evolve.
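The sketch below shows one way to run such a check programmatically, assuming a subject named "orders-value" already holds an earlier version of the Order schema and the Schema Registry Java client is on the classpath; the endpoint and subject name are assumptions. The evolved schema adds an optional field with a default, which is a backward-compatible change.

```java
import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class CompatibilityCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical Schema Registry endpoint; 100 is the size of the local schema cache.
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // Evolved schema: adds a "currency" field with a default value, so consumers
        // using the previous schema can still read records written with this one.
        String evolved = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"},"
                + "{\"name\":\"currency\",\"type\":\"string\",\"default\":\"USD\"}]}";

        // Ask the registry whether the new schema is compatible with the latest
        // version registered under the assumed subject "orders-value".
        boolean compatible = client.testCompatibility("orders-value", new AvroSchema(evolved));
        System.out.println("Compatible with latest registered version? " + compatible);
    }
}
```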

To learn more about how Schema Registry manages compatibility, see the following topic in either the Confluent Cloud or Confluent Platform documentation:

Data serialization

A schema is typically used in data serialization, which is the process of converting data structures or objects into a format that can be transmitted across a network or stored in a file. In this context, a schema defines the format of the serialized data and is used to validate the data as it is being deserialized by another system or application. Confluent Schema Registry supports Avro, JSON Schema, and Protobuf serializers and deserializers (serdes). When you write producers and consumers using these supported formats, they handle the details of the wire format for you, so you don’t have to worry about how messages are mapped to bytes.
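On the consuming side, a sketch of the same idea is shown below, again with assumed broker, group, topic, and registry values. The Confluent Avro deserializer reads the schema ID embedded in each message, fetches and caches the corresponding schema from Schema Registry, and reconstructs the record, so application code never touches the wire format.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker, consumer group, and Schema Registry endpoints.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-readers");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // KafkaAvroDeserializer uses the schema ID in each message to fetch the writer's
        // schema from Schema Registry and decode the Avro payload.
        props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer");

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> record : records) {
                System.out.println(record.key() + " -> " + record.value());
            }
        }
    }
}
```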

Always start with a Schema Registry

If you start without Schema Registry and retrofit it later, you increase the workload: you build on custom serialization code that you must then partially pull apart and re-do.

Starting from the beginning of a project with Schema Registry is a best practice for several reasons, some of which are already explained in previous sections:

What’s supported