Auto Schema Generation #233

Closed
jwyglendowski-precisionlender opened this issue Apr 24, 2019 · 12 comments
jwyglendowski-precisionlender commented Apr 24, 2019

I am evaluating avsc ("avsc": "^5.4.7") for use with large objects that have mostly static fields, some of which are extremely complex and nested. The idea was to use the automatic schema generation support and see how far that would go.

In my first attempt I was able to encode and decode a large JSON payload without issue, until I tried opening the result with another library and found that the sub-record types were not getting named. This led me to issue #108, which describes a solution that made sense in my case, since the problem with the auto-generated schema was that sub-record types were not being named.

So I went down the route described in issue #108 and was able to encode a file, but not to decode it. Looking at the raw output, I can see that the root record type's name was set but the sub-types' names were not.

I am attaching the sample code and a snippet of the generated schema before and after using the name hook.

Schema Before

{"type":"record","fields":[{"name":"events","type":{"type":"array","items":{"type":"record","fields":[{"name":"sequence","type":"int"},{"name":"fieldItem","type":"string"},{"name":"userId","type":"string"},{"name":"tenantId","type":"string"},{"name":"applicationId","type":"string"},{"name":"contextId","type":"string"},{"name":"type","type":"string"},{"name":"locale","type":"string"},{"name":"name","type":"string"},{"name":"subject","type":"string"},{"name":"viewState","type":"string"},{"name":"metaData","type":{"type":"record","fields":[{"name":"userFullName","type":"string"},{"name":"userEmail","type":"string"},{"name":"username","type":"string"},{"name":"userId","type":"string"}]}},{"name":"startTime","type":"float"},{"name":"schemaVersion","type":"string"},{"name":"data","type":{"type":"array","items":{"type":"record","fields":[{"name":"somefield","type":{"type":"record","fields":[{"name":"id","type":"string"},

Schema After (Look Towards the end)

{"name":"Auto11999","type":"record","fields":[{"name":"events","type":{"type":"array","items":{"name":"Auto11998","type":"record","fields":[{"name":"sequence","type":"int"},{"name":"fieldItem","type":"string"},{"name":"userId","type":"string"},{"name":"tenantId","type":"string"},{"name":"applicationId","type":"string"},{"name":"contextId","type":"string"},{"name":"type","type":"string"},{"name":"locale","type":"string"},{"name":"name","type":"string"},{"name":"subject","type":"string"},{"name":"viewState","type":"string"},{"name":"metaData","type":{"name":"Auto0","type":"record","fields":[{"name":"userFullName","type":"string"},{"name":"userEmail","type":"string"},{"name":"username","type":"string"},{"name":"userId","type":"string"}]}},{"name":"startTime","type":"float"},{"name":"schemaVersion","type":"string"},{"name":"data","type":{"type":"array","items":{"type":"record","fields":[{"name":"somefield","type":{"type":"record","fields":[{"name":"id","type":"string"}

The source JSON payload in this test is 8 MB, making it too large to post here. Any suggestions would be appreciated.

Sample Code

const avro = require('avsc');
const fs = require('fs');

const eventJson = fs.readFileSync('./large-payload-enough-to-fail-8mb.json').toString('utf8');
const event = JSON.parse(eventJson);

function createNamingHook() {
    let index = 0;
    return function (schema, opts) {
        switch (schema.type) {
            case 'enum':
            case 'fixed':
            case 'record':
                schema.name = `Auto${index++}`;
                break;
            default:
        }
    };
}

const type = avro.Type.forValue(event, {
    typeHook: createNamingHook()
});

if (fs.existsSync("./event.avro")) { fs.unlinkSync("./event.avro"); }

const encoder = avro.createFileEncoder("./event.avro", type);
encoder.end(event);
encoder.destroy();

avro.createFileDecoder('./event.avro')
    .on('metadata', function (type) {
        console.log(type);
    })
    .on('data', function (record) {
        console.log(JSON.stringify(record));
    })
    .on('error', function (err) {
        console.log(err);
    })
    .on('end', function (args) {
        console.log("end");
    });

jwyglendowski-precisionlender commented Apr 25, 2019

I think the issue revolves around the fact that the object I am passing into forValue has fields with arrays that have no root node. All of the item record types that are not getting named fall into that category.

Will Get Skipped

{
    "data": [
        {
            "field1": {
                "id": "467346b1-8f59-4079-be88-6b55a2466335"
            },
            "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
        },
        {
            "field1": {
                "id": "467346b1-8f59-4079-be88-6b55a2466335"
            },
            "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
        }
    ]
}

If my base JSON were formatted like the following:

{
    "data": [
        {
            "dataItem": {
                "field1": {
                    "id": "467346b1-8f59-4079-be88-6b55a2466335"
                },
                "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
            }
        },
        {
            "dataItem": {
                "field1": {
                    "id": "467346b1-8f59-4079-be88-6b55a2466335"
                },
                "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
            }
        }
    ]
}

Then I believe it would work. What would be nice is if there were some way to pass the parent's name to the type hook, and to ensure that the hook gets called in this specific case where there is no root node for the array objects. This is probably never an issue if the data begins its life as a C# or Java object that is serialized to JSON.

jwyglendowski-precisionlender commented:

So I attempted using the example I posted in my last comment and that didn't work either. The resulting encoded file never includes any data.

mtth commented Apr 26, 2019

> I think the issue revolves around the fact that the object that I am passing into forValue has fields with arrays that have no root node. All the items record types that are not getting named fall into that category.

Great find! There was indeed a bug where type options (which include the type hook) weren't properly passed when combining array types; f69c869 should fix this. Can you upgrade to 5.4.9 and try again?

jwyglendowski-precisionlender commented:

@mtth Okay, I updated to 5.4.9 and now the auto-naming works across the schema. So that bug, based on my test, is fixed. It doesn't explain why the file never gets encoded, but I will follow up on that in issue #232. Thank you so much for the quick turnaround; it's very much appreciated.

jwyglendowski-precisionlender commented Apr 26, 2019

I thought everything was fixed, but when I ran the resulting Avro file through C# it failed on a record type missing a name. I found another edge case and am trying to pin down exactly what it is.

jwyglendowski-precisionlender commented:

@mtth Here are some snippets from my large file that didn't get names. I am not sure if this helps or not.

{
	"type": "array",
	"items": ["string", {
		"type": "record",
		"fields": [{
			"name": "field1",
			"type": "float"
		}, {
			"name": "field2",
			"type": "float"
		}, {
			"name": "field3",
			"type": "int"
		}]
	}]
}
{
	"name": "someName",
	"type": {
		"type": "record",
		"fields": [{
			"name": "major",
			"type": "int"
		}, {
			"name": "minor",
			"type": "int"
		}, {
			"name": "patch",
			"type": "int"
		}, {
			"name": "build",
			"type": "int"
		}]
	}
}
{
	"name": "anotherName",
	"type": {
		"type": "array",
		"items": {
			"type": "record",
			"fields": [{
				"name": "field1",
				"type": "float"
			},  {
				"name": "nameTwo",
				"type": {
					"type": "record",
					"fields": [{
						"name": "major",
						"type": "int"
					}]
				}
			}]
		}
	}
}

mtth commented Apr 26, 2019

Thanks! I just fixed another edge case where the options weren't wired through properly; can you upgrade to 5.4.10 and see if you still get errors?

jwyglendowski-precisionlender commented Apr 26, 2019

@mtth That did the trick. I do have one suggestion, though I am not sure if it's possible, relating to the type hook. It would be nice if you could enable something like the code below. If we knew the parent's name, we could produce a record name with more meaning. The jsonschema-avro project follows a similar convention for dealing with records that have no names.

It would also be nice, when the root of the schema is a record type, to be able to specify its name somehow.

If you do decide to make that change I would be happy to run it against my monster 6 meg file and review the resulting schema.

Thanks a bunch for your help. Hopefully those edge cases help someone else out.
A+ Jason Wyglendowski
Have a good weekend

function createNamingHook() {
    let index = 0;
    return function (schema, opts) {
        switch (schema.type) {
            case 'enum':
            case 'fixed':
            case 'record':
                if (!schema.name) {
                    // `parentName` is hypothetical: the library would have
                    // to supply it for this to work.
                    schema.name = parentName
                        ? `${parentName}.record`
                        : `Auto${index++}`;
                }
                break;
            default:
        }
    };
}

mtth commented Apr 27, 2019

Great that it works!

> If we knew the parent name then we could make a record name that had more meaning.

Unfortunately, this is tricky: children's schemas are generated before their parents' (so there is no "parent name" yet). An alternative could be to provide the path to the value currently being parsed, similar to what isValid's error hook provides. Would that still be useful enough for your use case?

jwyglendowski-precisionlender commented:

@mtth It definitely sounds like it would work in my use case. I should be able to create a strategy to determine the name based on the path.

mtth commented May 7, 2019

Great, I filed #234 to track adding it to avsc.

If you are only using type-related logic from avsc, consider switching to @avro/types, which I already released (it's one of the packages avsc will be broken into at the next major release). Here's how you can use it:

const {Type} = require('@avro/types');

const val = {
  one: [1,2,3],
  two: {
    three: 'four',
  },
  five: [
    {six: 6},
    {six: '6'},
  ],
};

function typeHook(schema, opts, scope) {
  // scope.path is the path to the type about to be generated.
}

const type = Type.forValue(val, {typeHook});
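
Building on that skeleton, here is one way the hook body could derive names from the path. This is only a sketch: the exact shape of `scope.path` isn't documented above, so it is assumed here to be an array of field names (strings) and array indices (numbers), and `createPathNamingHook` is a hypothetical helper, not part of the library.

```javascript
// Sketch of a naming hook that derives record names from the proposed
// `scope.path` argument. Assumes `scope.path` is an array of field names
// (strings) and array indices (numbers) — an assumption, not a documented API.
function createPathNamingHook() {
  let index = 0; // fallback counter for types reached via an empty path
  return function (schema, opts, scope) {
    if (schema.type !== 'record' && schema.type !== 'enum' && schema.type !== 'fixed') {
      return; // only named types need a name
    }
    // Keep only the field-name components of the path, dropping array indices.
    const parts = ((scope && scope.path) || []).filter((p) => typeof p === 'string');
    schema.name = parts.length
      ? parts.map((p) => p.charAt(0).toUpperCase() + p.slice(1)).join('')
      : `Auto${index++}`; // root (or otherwise anonymous) types fall back to Auto names
  };
}
```

With a value like `val` above, the record under `two` would end up named `Two` and the item records under `five` something like `Five`, instead of an opaque `Auto` number.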

mtth closed this as completed May 7, 2019
jwyglendowski-precisionlender commented:

@mtth Thanks I will check out that example. A+
