Auto Schema Generation #233

Closed
jwyglendowski-precisionlender opened this issue Apr 24, 2019 · 12 comments
jwyglendowski-precisionlender commented Apr 24, 2019

I am evaluating avsc ("avsc": "^5.4.7") for use with large objects that have mostly static fields, some of which are extremely complex and nested. The idea was to use the automatic schema generation support and see how far that would go.

In my first attempt I was able to encode and decode a large JSON payload without issue, until I tried opening the result with another library and found that the sub-record types were not getting named. This led me to issue #108, which describes a solution that made sense in my case, since the problem with the auto-generated schema was that sub-record types were not being named.

So I went down the route described in issue #108 and was able to encode a file, but not to decode it. Looking at the raw output, I can see that the root record type's name was set but the sub-types' names were not.

I am attaching the sample code and a snippet of the generated schema before and after using the name hook.

Schema Before

{"type":"record","fields":[{"name":"events","type":{"type":"array","items":{"type":"record","fields":[{"name":"sequence","type":"int"},{"name":"fieldItem","type":"string"},{"name":"userId","type":"string"},{"name":"tenantId","type":"string"},{"name":"applicationId","type":"string"},{"name":"contextId","type":"string"},{"name":"type","type":"string"},{"name":"locale","type":"string"},{"name":"name","type":"string"},{"name":"subject","type":"string"},{"name":"viewState","type":"string"},{"name":"metaData","type":{"type":"record","fields":[{"name":"userFullName","type":"string"},{"name":"userEmail","type":"string"},{"name":"username","type":"string"},{"name":"userId","type":"string"}]}},{"name":"startTime","type":"float"},{"name":"schemaVersion","type":"string"},{"name":"data","type":{"type":"array","items":{"type":"record","fields":[{"name":"somefield","type":{"type":"record","fields":[{"name":"id","type":"string"},

Schema After (Look Towards the end)

{"name":"Auto11999","type":"record","fields":[{"name":"events","type":{"type":"array","items":{"name":"Auto11998","type":"record","fields":[{"name":"sequence","type":"int"},{"name":"fieldItem","type":"string"},{"name":"userId","type":"string"},{"name":"tenantId","type":"string"},{"name":"applicationId","type":"string"},{"name":"contextId","type":"string"},{"name":"type","type":"string"},{"name":"locale","type":"string"},{"name":"name","type":"string"},{"name":"subject","type":"string"},{"name":"viewState","type":"string"},{"name":"metaData","type":{"name":"Auto0","type":"record","fields":[{"name":"userFullName","type":"string"},{"name":"userEmail","type":"string"},{"name":"username","type":"string"},{"name":"userId","type":"string"}]}},{"name":"startTime","type":"float"},{"name":"schemaVersion","type":"string"},{"name":"data","type":{"type":"array","items":{"type":"record","fields":[{"name":"somefield","type":{"type":"record","fields":[{"name":"id","type":"string"}

The source JSON payload in this test is 8 MB, making it too large to post here. Any suggestions would be appreciated.

Sample Code

const avro = require('avsc');
const fs = require('fs');

const eventJson = fs.readFileSync('./large-payload-enough-to-fail-8mb.json').toString('utf8');
const event = JSON.parse(eventJson);

function createNamingHook() {
    let index = 0;
    return function (schema, opts) {
        switch (schema.type) {
            case 'enum':
            case 'fixed':
            case 'record':
                schema.name = `Auto${index++}`;
                break;
            default:
        }
    };
}

const type = avro.Type.forValue(event, {
    typeHook: createNamingHook()
});

if (fs.existsSync("./event.avro")) { fs.unlinkSync("./event.avro"); }

const encoder = avro.createFileEncoder("./event.avro", type);
encoder.end(event);
encoder.destroy();

avro.createFileDecoder('./event.avro')
    .on('metadata', function (type) {
        console.log(type);
    })
    .on('data', function (record) {
        console.log(JSON.stringify(record));
    })
    .on('error', function (err) {
        console.log(err);
    })
    .on('end', function (args) {
        console.log("end");
    });

jwyglendowski-precisionlender commented Apr 25, 2019

I think the issue revolves around the fact that the object I am passing into forValue has fields with arrays that have no root node. All of the item record types that are not getting named fall into that category.

Will Get Skipped

{
    "data": [
        {
            "field1": {
                "id": "467346b1-8f59-4079-be88-6b55a2466335"
            },
            "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
        },
        {
            "field1": {
                "id": "467346b1-8f59-4079-be88-6b55a2466335"
            },
            "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
        }
    ]
}

If my base JSON were formatted like the following:

{
    "data": [
        {
            "dataItem": {
                "field1": {
                    "id": "467346b1-8f59-4079-be88-6b55a2466335"
                },
                "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
            }
        },
        {
            "dataItem": {
                "field1": {
                    "id": "467346b1-8f59-4079-be88-6b55a2466335"
                },
                "field2": "0c12e5cc-3a7a-4b26-86e0-f3ad95092631"
            }
        }
    ]
}

Then I believe it would work. What would be nice is if there were some way to pass the parent's name to the type hook, and to ensure that the hook gets called in this specific case where there is no root node for the array objects. This is probably never an issue if the data begins its life as a C# or Java object that is serialized to JSON.

jwyglendowski-precisionlender commented:

So I attempted using the example I posted in my last comment and that didn't work either. The resulting encoded file never includes any data.

mtth commented Apr 26, 2019

> I think the issue revolves around the fact that the object that I am passing into forValue has fields with arrays that have no root node. All the items record types that are not getting named fall into that category.

Great find! There was indeed a bug where type options (which include the type hook) weren't properly passed when combining array types; f69c869 should fix this. Can you upgrade to 5.4.9 and try again?

jwyglendowski-precisionlender commented:

@mtth Okay, I updated to 5.4.9 and now the auto-naming works across the schema. So that bug, based on my test, is fixed. It doesn't explain why the file never gets encoded, but I will follow up on that in issue #232. Thank you so much for the quick turnaround; it's very much appreciated.

jwyglendowski-precisionlender commented Apr 26, 2019

I thought everything was fixed, but when I ran the resulting Avro file through C# it failed on a record type missing a name. I found another edge case and am trying to pin down exactly what it is.

jwyglendowski-precisionlender commented:

@mtth Here are some snippets from my large file that didn't get names. I am not sure if this helps or not.

{
	"type": "array",
	"items": ["string", {
		"type": "record",
		"fields": [{
			"name": "field1",
			"type": "float"
		}, {
			"name": "field2",
			"type": "float"
		}, {
			"name": "field3",
			"type": "int"
		}]
	}]
}
{
	"name": "someName",
	"type": {
		"type": "record",
		"fields": [{
			"name": "major",
			"type": "int"
		}, {
			"name": "minor",
			"type": "int"
		}, {
			"name": "patch",
			"type": "int"
		}, {
			"name": "build",
			"type": "int"
		}]
	}
}
{
	"name": "anotherName",
	"type": {
		"type": "array",
		"items": {
			"type": "record",
			"fields": [{
				"name": "field1",
				"type": "float"
			},  {
				"name": "nameTwo",
				"type": {
					"type": "record",
					"fields": [{
						"name": "major",
						"type": "int"
					}]
				}
			}]
		}
	}
}

mtth commented Apr 26, 2019

Thanks! I just fixed another edge case where the options weren't wired through properly; can you upgrade to 5.4.10 and see if you still get errors?

jwyglendowski-precisionlender commented Apr 26, 2019

@mtth That did the trick. I do have one suggestion, though I am not sure if it's possible, relating to the type hook. It would be nice if you could enable something like the code below. If we knew the parent's name, we could produce a record name with more meaning. The jsonschema-avro project follows a similar convention for dealing with records that have no names.

It would also be nice, when the root of the schema is a record type, to be able to specify its name somehow.

If you do decide to make that change I would be happy to run it against my monster 6 meg file and review the resulting schema.

Thanks a bunch for your help. Hopefully those edge cases help someone else out.
A+ Jason Wyglendowski
Have a good weekend

function createNamingHook() {
    let index = 0;
    return function (schema, opts) {
        switch (schema.type) {
            case 'enum':
            case 'fixed':
            case 'record':
                if (!schema.name) {
                    // `parentName` is hypothetical: the library would have
                    // to supply it for this to work.
                    schema.name = parentName
                        ? `${parentName}.record`
                        : `Auto${index++}`;
                }
                break;
            default:
        }
    };
}

mtth commented Apr 27, 2019

Great that it works!

> If we knew the parent name then we could make a record name that had more meaning.

Unfortunately, this is tricky: children's schemas are generated before their parents' (so there is no "parent name" yet). An alternative could be to provide the path to the value currently being parsed, similar to what isValid's error hook provides. Would that still be useful enough for your use case?

jwyglendowski-precisionlender commented:

@mtth It definitely sounds like it would work in my use case. I should be able to create a strategy to determine the name based on the path.

mtth commented May 7, 2019

Great, I filed #234 to track adding it to avsc.

If you are only using type-related logic from avsc, consider switching to @avro/types, which I already released (it's one of the packages avsc will be broken into at the next major release). Here's how you can use it:

const {Type} = require('@avro/types');

const val = {
  one: [1,2,3],
  two: {
    three: 'four',
  },
  five: [
    {six: 6},
    {six: '6'},
  ],
};

function typeHook(schema, opts, scope) {
  // scope.path is the path to the type about to be generated.
}

const type = Type.forValue(val, {typeHook});
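
Building on that skeleton, here is one way the hook body could derive names from the path. This is only a sketch: the exact shape of `scope.path` isn't documented above, so it is assumed here to be an array of field names (strings) and array indices (numbers), and `createPathNamingHook` is a hypothetical helper, not part of the library.

```javascript
// Sketch of a naming hook that derives record names from the proposed
// `scope.path` argument. Assumes `scope.path` is an array of field names
// (strings) and array indices (numbers) — an assumption, not a documented API.
function createPathNamingHook() {
  let index = 0; // fallback counter for types reached via an empty path
  return function (schema, opts, scope) {
    if (schema.type !== 'record' && schema.type !== 'enum' && schema.type !== 'fixed') {
      return; // only named types need a name
    }
    // Keep only the field-name components of the path, dropping array indices.
    const parts = ((scope && scope.path) || []).filter((p) => typeof p === 'string');
    schema.name = parts.length
      ? parts.map((p) => p.charAt(0).toUpperCase() + p.slice(1)).join('')
      : `Auto${index++}`; // root (or otherwise anonymous) types fall back to Auto names
  };
}
```

With a value like `val` above, the record under `two` would end up named `Two` and the item records under `five` something like `Five`, instead of an opaque `Auto` number.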

mtth closed this as completed May 7, 2019
jwyglendowski-precisionlender commented:

@mtth Thanks I will check out that example. A+
