
docs: docs for AVRO and JSON_SR keys #6700

Merged: 2 commits, Dec 4, 2020
Changes from 1 commit
90 changes: 68 additions & 22 deletions docs/concepts/schemas.md
@@ -7,7 +7,7 @@ keywords: ksqldb, schema, evolution, avro, protobuf, json, csv
---

Data sources like streams and tables have an associated schema. This schema defines the columns
available in the data, just like a the columns in a traditional SQL database table.
available in the data, just like the columns in a traditional SQL database table.

## Key vs Value columns

@@ -77,30 +77,25 @@ ksqlDB is [configured to use it](../operate-and-deploy/installation/server-confi

Here's what you can do with schema inference in ksqlDB:

- Declare streams and tables on {{ site.ak }} topics with supported value formats by using
`CREATE STREAM` and `CREATE TABLE` statements, without needing to declare the value columns.
- Declare streams and tables on {{ site.ak }} topics with supported key and value formats by using
`CREATE STREAM` and `CREATE TABLE` statements, without needing to declare the key and/or value columns.
- Declare derived views with `CREATE STREAM AS SELECT` and `CREATE TABLE AS SELECT` statements.
The schema of the view is registered in {{ site.sr }} automatically.
- Convert data to different formats with `CREATE STREAM AS SELECT` and
`CREATE TABLE AS SELECT` statements, by declaring the required output
  format in the `WITH` clause. For example, you can convert a stream from
  Avro to JSON, as sketched just after this list.
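
A minimal sketch of such a conversion, assuming a source stream named `pageviews_avro` already exists:

```sql
-- Derive a JSON copy of an existing Avro-backed stream.
CREATE STREAM pageviews_json
  WITH (VALUE_FORMAT='JSON') AS
  SELECT * FROM pageviews_avro;
```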

Only the schema of the message *value* can be retrieved from {{ site.sr }}. Message
*keys* must be compatible with the [`KAFKA` format](../developer-guide/serialization.md#kafka)
to be accessible within ksqlDB. ksqlDB ignores schemas that have been registered
for message keys.

!!! note
Message *keys* in Avro and Protobuf are not supported. If your message keys
Message *keys* in Protobuf are not supported. If your message keys
are in an unsupported format, see [What to do if your key is not set or is in a different format](../developer-guide/syntax-reference.md#what-to-do-if-your-key-is-not-set-or-is-in-a-different-format).
JSON message keys can be accessed by defining the key as a single `STRING` value, which will
contain the JSON document.
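
For instance, such a JSON key could be declared like this (a sketch; the topic and column names are hypothetical):

```sql
-- The key column receives the raw JSON document as a plain string.
CREATE STREAM orders (
    orderKey STRING KEY
  ) WITH (
    KAFKA_TOPIC='orders-topic',
    VALUE_FORMAT='JSON'
  );
```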

Although ksqlDB doesn't support loading the message key's schema from {{ site.sr }},
you can provide the key column definition within the `CREATE TABLE` or `CREATE STREAM`
statement, if the data records are compatible with ksqlDB. This is known as
_partial schema inference_, because the key schema is provided explicitly.
If declaring a stream or table with a key format that is different from its
Suggested change (Member):
- If declaring a stream or table with a key format that is different from its
+ If you're declaring a stream or table with a key format that's different from its

value format, and only one of the two formats supports schema inference,
you can explicitly provide the columns for the format that does not support schema inference
while still having ksqlDB load columns for the format that does support schema inference
from m {{ site.sr }}. This is known as _partial schema inference_. To infer value columns
Suggested change (Member):
- from m {{ site.sr }}. This is known as _partial schema inference_. To infer value columns
+ from {{ site.sr }}. This is known as _partial schema inference_. To infer value columns

for a keyless stream, set the key format to the [`NONE` format](../developer-guide/serialization.md#none).
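
For example, a keyless stream whose value columns are inferred might be declared like this (a sketch; the topic name is hypothetical):

```sql
-- No key columns: the key format is NONE, and value columns
-- are loaded from the latest registered Avro schema.
CREATE STREAM clicks WITH (
    KAFKA_TOPIC='clicks-avro-topic',
    KEY_FORMAT='NONE',
    VALUE_FORMAT='AVRO'
  );
```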

Tables require a `PRIMARY KEY`, so you must supply one explicitly in your
`CREATE TABLE` statement. `KEY` columns are optional for streams, so if you
Expand Down Expand Up @@ -141,6 +136,27 @@ time the statement is first executed.

#### With a key column

The following statement shows how to create a new `pageviews` stream by reading
from a {{ site.ak }} topic that has Avro-formatted key and message values.

```sql
CREATE STREAM pageviews WITH (
KAFKA_TOPIC='pageviews-avro-topic',
KEY_FORMAT='AVRO',
VALUE_FORMAT='AVRO'
);
```

Comment on lines +145 to +146 (Contributor): Should we also mention FORMAT='AVRO'? I suspect this might be somewhat common.

Reply (Contributor Author): Were you imagining a tip saying that this is equivalent to simply specifying FORMAT, updating these examples to use FORMAT, or something else? The FORMAT property is documented in https://github.com/confluentinc/ksql/blob/master/docs/developer-guide/ksqldb-reference/create-stream.md (and similar for CT, CSAS, and CTAS), which feels like the more appropriate place to formally introduce it, but I can certainly slip in a mention of it here as well.

Reply (Contributor Author): Discussed offline -- leaving this as is for now. Will see if we can consolidate information about keys into a natural place in a subsequent PR.

In the previous example, ksqlDB infers the key and value columns automatically from the latest
registered schemas for the `pageviews-avro-topic` topic. ksqlDB uses the most
recent schemas at the time the statement is first executed.

!!! note
The key and value schemas must be registered in {{ site.sr }} under the subjects
`pageviews-avro-topic-key` and `pageviews-avro-topic-value`, respectively.

#### With partial schema inference

The following statement shows how to create a new `pageviews` stream by reading
from a {{ site.ak }} topic that has Avro-formatted message values and a
`KAFKA`-formatted `INT` message key.
@@ -150,11 +166,12 @@ CREATE STREAM pageviews (
pageId INT KEY
) WITH (
KAFKA_TOPIC='pageviews-avro-topic',
KEY_FORMAT='KAFKA',
VALUE_FORMAT='AVRO'
);
```

In the previous example, you need only supply the key column in the CREATE
In the previous example, only the key column is supplied in the CREATE
statement. ksqlDB infers the value columns automatically from the latest
registered schema for the `pageviews-avro-topic` topic. ksqlDB uses the most
recent schema at the time the statement is first executed.
@@ -165,6 +182,31 @@ recent schema at the time the statement is first executed.

### Create a new table

#### With key and value schema inference

The following statement shows how to create a new `users` table by reading
from a {{ site.ak }} topic that has Avro-formatted key and message values.

```sql
CREATE TABLE users (
userId BIGINT PRIMARY KEY
) WITH (
KAFKA_TOPIC='users-avro-topic',
KEY_FORMAT='AVRO',
VALUE_FORMAT='AVRO'
);
```

In the previous example, ksqlDB infers the key and value columns automatically from the latest
registered schemas for the `users-avro-topic` topic. ksqlDB uses the most
recent schemas at the time the statement is first executed.

!!! note
The key and value schemas must be registered in {{ site.sr }} under the subjects
`users-avro-topic-key` and `users-avro-topic-value`, respectively.

#### With partial schema inference

The following statement shows how to create a new `users` table by reading
from a {{ site.ak }} topic that has Avro-formatted message values and a
`KAFKA`-formatted `BIGINT` message key.
@@ -174,11 +216,12 @@ CREATE TABLE users (
userId BIGINT PRIMARY KEY
) WITH (
KAFKA_TOPIC='users-avro-topic',
KEY_FORMAT='KAFKA',
VALUE_FORMAT='AVRO'
);
```

In the previous example, you need only supply the key column in the CREATE
In the previous example, only the key column is supplied in the CREATE
statement. ksqlDB infers the value columns automatically from the latest
registered schema for the `users-avro-topic` topic. ksqlDB uses the most
recent schema at the time the statement is first executed.
@@ -229,17 +272,17 @@ CREATE TABLE pageviews_by_url
```

!!! note
The schema will be registered in {{ site.sr }} under the subject
The value schema will be registered in {{ site.sr }} under the subject
`PAGEVIEWS_BY_URL-value`.

### Converting formats

ksqlDB enables you to change the underlying value format of streams and tables.
ksqlDB enables you to change the underlying key and value formats of streams and tables.
This means that you can easily mix and match streams and tables with different
data formats and also convert between value formats. For example, you can join
data formats and also convert between formats. For example, you can join
a stream backed by Avro data with a table backed by JSON data.
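
As a sketch of such a mixed-format join (the stream, table, and column names are hypothetical):

```sql
-- pageviews_avro is an Avro-backed stream; users_json is a JSON-backed table.
CREATE STREAM enriched_pageviews AS
  SELECT pv.pageId, u.userId
  FROM pageviews_avro pv
  JOIN users_json u ON pv.userId = u.userId;
```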

The example below converts a JSON-formatted topic into Avro. Only the
The example below converts a topic with JSON-formatted values into Avro. Only the
`VALUE_FORMAT` is required to achieve the data conversion. ksqlDB generates an
appropriate Avro schema for the new `PAGEVIEWS_AVRO` stream automatically and
registers the schema with {{ site.sr }}.
@@ -260,13 +303,16 @@ CREATE STREAM pageviews_avro
```

!!! note
The schema will be registered in {{ site.sr }} under the subject
The value schema will be registered in {{ site.sr }} under the subject
`PAGEVIEWS_AVRO-value`.

For more information, see
[Changing Data Serialization Format from JSON to Avro](https://www.confluent.io/stream-processing-cookbook/ksql-recipes/changing-data-serialization-format-json-avro)
in the [Stream Processing Cookbook](https://www.confluent.io/product/ksql/stream-processing-cookbook).

You can convert between different key formats in an analogous manner by specifying the
`KEY_FORMAT` property instead of `VALUE_FORMAT`.
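
A minimal sketch, reusing the `pageviews` stream from the examples above:

```sql
-- Only the key serialization changes; the key columns themselves are unchanged.
CREATE STREAM pageviews_avro_key
  WITH (KEY_FORMAT='AVRO') AS
  SELECT * FROM pageviews;
```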

## Valid Identifiers

Column and field names must be valid identifiers.
6 changes: 3 additions & 3 deletions docs/developer-guide/serialization.md
@@ -142,7 +142,7 @@ This data format supports all SQL
| Feature | Supported |
|------------------------------|-----------|
| As value format | Yes |
| As key format | `JSON`: Yes, `JSON_SR`: No |
| As key format | `JSON`: Yes, `JSON_SR`: Yes |
| [Schema Registry required][0]| `JSON`: No, `JSON_SR`: Yes |
| [Schema inference][1] | `JSON`: No, `JSON_SR`: Yes|
| [Single field unwrapping][2] | Yes |
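
With `JSON_SR` now usable as a key format, a declaration might look like this (a sketch; the topic name is hypothetical):

```sql
-- Key and value columns are inferred from the schemas registered
-- under the topic's -key and -value subjects.
CREATE STREAM users_sr WITH (
    KAFKA_TOPIC='users-jsonsr-topic',
    KEY_FORMAT='JSON_SR',
    VALUE_FORMAT='JSON_SR'
  );
```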
@@ -251,7 +251,7 @@ used.
| Feature | Supported |
|------------------------------|-----------|
| As value format | Yes |
| As key format | No |
| As key format | Yes |
| [Schema Registry required][0]| Yes |
| [Schema inference][1] | Yes |
| [Single field wrapping][2] | Yes |
@@ -284,7 +284,7 @@ And an Avro record serialized with the schema:
"name": "UserDetails",
"fields": [
{ "name": "id", "type": "long" },
{ "name": "name", "type": "string" }
{ "name": "name", "type": "string" },
{ "name": "age", "type": "int" }
]
}