How Schemas Make the World Go Round

Today, most businesses are data-driven and rely heavily on fast insights generated from that data. Traditionally, organizations stored and processed their data in one central location or data center, organizing raw data by attributes such as primary keys, IDs, types, and maximum lengths. That approach was once sufficient, but it breaks down when events are produced and processed at the edge, far from your central database.

For example, you might work closely with data streaming, which involves continuous data generation at a high velocity and volume. In its raw form, the massive amounts of data generated by a streaming source make little to no sense. To handle all this data, you might take a document-based approach, like using NoSQL.

Similarly, edge computing moves part of the data processing as close to the streaming source as possible, a shift driven by the growing volume of data that streaming devices generate. With your message broker sitting in front of the database (in other words, at the edge), the database can no longer act as the contract for the shape of your data, so something else has to.

The solution to handling and working with large amounts of data is schemas. A schema is a blueprint that defines the structure of the messages a system processes. The uniformity it enforces ensures that every participant across the network interprets the data the same way.

The Apache Pulsar Schema is one of the most critical components of Apache Pulsar, an open-source distributed messaging and streaming platform. In this article, we’ll explore the role of schemas in data streaming and how JSON, Avro, and Apache Pulsar work together to make edge computing possible.

Schemas in Apache Pulsar

In Apache Pulsar, the client on the sender side never comes into direct contact with the client on the receiver’s side; messages pass through the broker. The two sides therefore need a contract that defines how messages are structured so that both can understand them, which is exactly what a schema provides.

Avro and JSON schemas ensure that clients in Apache Pulsar maintain uniformity in their message structures. This means that consumers are guaranteed that messages sent by producers will be readable to them.

JSON Schemas

JSON schemas contain the specifications used in defining the structure of JSON data. They also provide a contract detailing the required JSON data and how it should be handled. The main purpose of a JSON schema is to support the validation, hyperlinking, documentation, and navigation of the JSON data.

Apache Pulsar can use a JSON schema to enforce a predefined structure on messages. That matters because JSON on its own carries only the data and says nothing about which fields or types a consumer should expect.

Here is an example of a JSON schema:

  {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "title": "Student",
    "description": "Students taking a programming course",
    "type": "object",
    "properties": {
      "id": {
        "description": "Unique number identifying a student",
        "type": "integer"
      },
      "name": {
        "description": "Student full name",
        "type": "string"
      }
    },
    "required": ["id", "name"]
  }

Let’s look at some of the important keywords from the example above:

$schema specifies which version (draft) of the JSON Schema specification the schema is written against.
title is the name of the schema. It should describe the objects the schema represents, such as a student, customer, or product.
description is the description of what the schema will contain.
type sets the first constraint on the data; in this example, the data must be a JSON object.
properties defines the keys and the value types that will be used.
required lists the properties that every valid object must include.
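
To make the roles of type, properties, and required concrete, here is a minimal sketch in Python that checks a message against the Student schema above. The validate function and the PYTHON_TYPES mapping are hypothetical helpers written for this illustration; they cover only the keywords used in the example and are not a full JSON Schema validator.

```python
import json

# The Student schema from the example above.
STUDENT_SCHEMA = json.loads("""
{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "Student",
  "description": "Students taking a programming course",
  "type": "object",
  "properties": {
    "id": {"description": "Unique number identifying a student", "type": "integer"},
    "name": {"description": "Student full name", "type": "string"}
  },
  "required": ["id", "name"]
}
""")

# Hypothetical mapping from JSON Schema type names to Python types;
# only the types used by the Student schema are covered.
PYTHON_TYPES = {"object": dict, "integer": int, "string": str}


def validate(message, schema):
    """Return a list of problems; an empty list means the message conforms.

    Handles only `type`, `properties`, and `required`.
    """
    if not isinstance(message, PYTHON_TYPES[schema["type"]]):
        return [f"expected {schema['type']}"]
    errors = []
    for key in schema.get("required", []):
        if key not in message:
            errors.append(f"missing required property: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in message:
            value = message[key]
            expected = PYTHON_TYPES[rule["type"]]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                errors.append(f"{key}: expected {rule['type']}")
    return errors


print(validate({"id": 1, "name": "Ada"}, STUDENT_SCHEMA))  # []
print(validate({"id": "1"}, STUDENT_SCHEMA))
```

The second call reports two problems: the name property is missing, and id is a string rather than an integer. In practice you would rely on a full validator or on the schema support built into your messaging client rather than a hand-rolled check like this one.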

Avro Schemas

Avro is an open-source, schema-based data serialization system. An Avro schema declares the name and type of each field in a record. The schema itself is written in JSON, making it easy to read; the data is stored in a compact binary format. Because Avro has implementations in many programming languages, it is effective for exchanging data between processing systems written in different languages.

Avro schemas are also resilient and can handle changes to the data structure without failing, a capability known as schema evolution. Users can change the current schema to accommodate new data fields, typically by giving those fields default values, without making data stored using the old schema unreadable.
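
The idea behind schema evolution can be sketched in a few lines of Python. This is a simplified illustration, not Avro's actual resolution algorithm: the two Student schemas and the resolve function are hypothetical, and the sketch covers only added fields with defaults and dropped fields.

```python
# Two versions of a hypothetical Student record schema. Version 2 adds an
# "email" field with a default so data written with version 1 stays readable.
SCHEMA_V1 = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}
SCHEMA_V2 = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": ""},
    ],
}


def resolve(record, reader_schema):
    """Project a decoded record onto the reader's schema.

    Fields the reader does not know are dropped, and fields missing from
    the data take the reader's default value.
    """
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field: {name}")
    return resolved


old_record = {"id": 7, "name": "Ada"}  # written with SCHEMA_V1
print(resolve(old_record, SCHEMA_V2))  # {'id': 7, 'name': 'Ada', 'email': ''}
```

The default value is what makes the change safe: a reader on the new schema can still fill in every field when it encounters data written with the old one.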

Avro schemas are particularly useful when working with clients that aren’t in direct contact with each other, like an Apache Pulsar producer and consumer. With that in mind, Avro schemas play a key role in powering edge computing, where different devices process data in different locations.

Below is an example of an Avro schema:

  {
    "type": "record",
    "namespace": "com.productfiles",
    "name": "name",
    "fields": [
    { "name": "first", "type": "string" }
    ]
  }

An Avro record schema like this one is written in JSON and contains four attributes: type, name, namespace, and fields:

type identifies the kind of schema being defined. For a record schema, it must be record.
name identifies the name of the schema. It is important because, when combined with the namespace, it gives a schema a unique identity. For instance, the full name of the schema above is com.productfiles.name.
namespace is a dot-separated qualifier, much like a Java package name, that groups related schemas and keeps their names from colliding; it is not a URL or a storage location. The concept will feel familiar in distributed systems such as Apache Pulsar, which also organize resources into namespaces.
fields is where the actual schema definition is found. It specifies the fields in the schema and the data type of each one. A field can hold a primitive type, such as a string or an integer, or a complex type, such as another record. In the example above, there’s a single field named first, which is expected to hold data of type string.
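
As a small illustration of how these attributes fit together, the following Python sketch loads the schema above and derives its full name and field types. The full_name and field_types helpers are hypothetical names introduced here; an Avro library would normally do this for you when it parses a schema.

```python
import json

# The Avro schema from the example above.
AVRO_SCHEMA = json.loads("""
{
  "type": "record",
  "namespace": "com.productfiles",
  "name": "name",
  "fields": [
    { "name": "first", "type": "string" }
  ]
}
""")


def full_name(schema):
    """Join namespace and name with a dot to form the schema's full name."""
    name = schema["name"]
    namespace = schema.get("namespace")
    # A name that already contains a dot is treated as fully qualified.
    if "." in name or not namespace:
        return name
    return f"{namespace}.{name}"


def field_types(schema):
    """Map each field name to its declared type."""
    return {field["name"]: field["type"] for field in schema["fields"]}


print(full_name(AVRO_SCHEMA))    # com.productfiles.name
print(field_types(AVRO_SCHEMA))  # {'first': 'string'}
```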

Avro Versus JSON: When to Use Each

The most notable difference between Avro and JSON is what each schema prioritizes. A JSON schema focuses on describing and validating the data itself, while an Avro schema also determines how that data is serialized and stored. Another difference is the type system: an Avro field can hold primitive types as well as complex ones such as records, enums, arrays, maps, and unions, while a JSON schema is limited to the types that JSON itself provides.

Avro schemas are efficient when dealing with big data because Avro stores the data in a compact binary format. That makes it an efficient data serialization framework when deploying an edge computing system.
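
To see where the compactness comes from, here is a rough Python sketch of how Avro's binary encoding represents a simple record: integers are zigzag-encoded and written as variable-length bytes, strings are written as a length followed by UTF-8 bytes, and field names are never written because the schema already defines their order. The Student schema and the encode_* helpers are assumptions made for this illustration, only the int, long, and string types are handled, and the container file format that wraps real Avro data files is out of scope.

```python
import json

# Hypothetical Avro schema used only for this size comparison.
SCHEMA = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}


def encode_long(n):
    """Encode an integer with zigzag coding followed by a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag maps small negatives to small positives
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s):
    """Encode a string as its UTF-8 byte length (a long) plus the bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_record(record, schema):
    """Concatenate field encodings in schema order; names are not written."""
    encoders = {"string": encode_string, "long": encode_long, "int": encode_long}
    return b"".join(
        encoders[field["type"]](record[field["name"]]) for field in schema["fields"]
    )


record = {"id": 42, "name": "Ada Lovelace"}
avro_bytes = encode_record(record, SCHEMA)
json_bytes = json.dumps(record).encode("utf-8")

print(len(avro_bytes))  # 14
print(len(json_bytes))
```

The binary form needs one byte for the id, one byte for the string length, and twelve bytes for the name itself, while the JSON text repeats the field names and punctuation in every message.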

Another feature that makes Avro a great tool when dealing with big data is the inclusion of sync markers in its data files. These markers make it easy to split large data sets into smaller blocks that can be processed independently and quickly, so if you’re working with edge computing, you’ll find it beneficial to use Avro schemas.

Meanwhile, JSON is a good fit when transmitting smaller amounts of data because it is human-readable and almost every language can parse it. JSON is also usually the natural choice when working with REST APIs, where it is lightweight and widely supported.

Here is how JSON, Avro, Apache Pulsar, and edge computing come together:

Avro schemas rely on JSON format to achieve easy schema definition and independence from programming languages.
Data streaming services such as Apache Pulsar rely on compact and resilient schemas such as Avro schemas to handle big data efficiently.
Edge computing relies on a data streaming service such as Apache Pulsar to transmit data between the different machines.

Conclusion

As businesses increasingly depend on data and the insights drawn from it, it’s crucial to move away from centralized, manual ways of processing data. Instead, you should use schemas, which enable you to predefine message structures so that there’s uniformity and consistency across your system.

These predefined message structures have made the age of distributed systems a reality, with data streaming and edge computing becoming common practices. This is because all a component needs to participate is the approved schema.

As you’ve seen in this article, Avro schemas are at the forefront as a standard structure for data streaming. Their functionality within Apache Pulsar enables you to process different data types and records and handle big data efficiently.

As data streaming’s popularity continues to rise, the popularity of schemas will rise with it.