Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow named types in unions #469

Merged
merged 24 commits into from
Sep 29, 2024
Merged

feat: allow named types in unions #469

merged 24 commits into from
Sep 29, 2024

Conversation

joscha
Copy link
Contributor

@joscha joscha commented Sep 11, 2024

The Avro specification defines:

Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum

see https://avro.apache.org/docs/1.11.1/specification/#unions

meaning that if a record type has a name, it is valid to mix multiple of them in a union, unwrapped.

The same change should possibly be made for enums as well.

The Avro specification defines:

> Unions may not contain more than one schema with the same type, except for the named types record, fixed and enum

see https://avro.apache.org/docs/1.11.1/specification/#unions

meaning that if a `record` type has a name, it is valid to mix multiple of them in a union, unwrapped.
lib/types.js Outdated Show resolved Hide resolved
@joscha joscha closed this Sep 11, 2024
@joscha
Copy link
Contributor Author

joscha commented Sep 11, 2024

Closing this for now, I think it needs an additional fix in the bucket selection for getValueBucket, see

avsc/lib/types.js

Line 1261 in 7cb76a6

let index = this._bucketIndices[getValueBucket(val)];

@joscha
Copy link
Contributor Author

joscha commented Sep 11, 2024

For my current project's, very specific use-case (avro schemas generated from an openapi definition and then the data queried via a generated client as well, so the class names match) this version works:

function getValueBucket(val) {
  if (val === null) {
    return 'null';
  }
  var bucket = typeof val;
  if (bucket === 'object') {
    // Could be bytes, fixed, array, map, or record.
    if (Array.isArray(val)) {
      return 'array';
    } else if (Buffer.isBuffer(val)) {
      return 'buffer';
    }
    if (val.constructor.name) {
      return 'model.' + val.constructor.name;
    };
  }
  return bucket;
}

however

    if (val.constructor.name) {
      return 'model.' + val.constructor.name;
    };

doesn't transfer to raw JSON. The current namespace would need to be passed into the UnwrappedUnion class as well, in order to qualify the type.

Generally, there is a discriminator (an attribute named type) on the containing object and the object to serialize, which can be used to discriminate, but that would need a lot more introspection on the types. I wonder if a custom resolver in getValueBucket would be acceptable, so the serialization implementation can decide? Thoughts?

@joscha joscha reopened this Sep 11, 2024
lib/types.js Outdated Show resolved Hide resolved
@joscha
Copy link
Contributor Author

joscha commented Sep 11, 2024

Reopened this with the proper implementation respecting the namespace and inferred arrays holding objects. Would be great to get some feedback on this and see if this is the right direction. If yes, will add tests of course. All current tests pass with this change.

@mtth
Copy link
Owner

mtth commented Sep 13, 2024

Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?

Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).

@joscha
Copy link
Contributor Author

joscha commented Sep 13, 2024

Unions of named types are already supported using wrapped unions. Can you share some context on your use-case to understand why they aren't a good fit?

Of course. I am using this API from Affinity CRM (OpenAPI spec at the top or a current extract here) to generate a Typescript client with openapi-generator/typescript. You can see the implementation of it here.

Our current Affinity CRM instance has multiple 10 thousand items in it of the Company type.
A Company type can have an arbitrary number of fields with each field having one of many different value types (see here). The overall data is about 1.4 GB of raw JSON.

I am using openapi-generator/avro-schema to generate an Avro schema for the same data.

I then read the data in batches of 100 from the API and stream them directly through avsc on disk using a file encoder.
From there I import them into Snowflake.

I want the data in Snowflake to be:

  • as simple as possible
  • as close as possible to the OpenAPI response from the API

I also want to:

  • ideally not having to post-process the data before sending it to avsc; wrapped unions would require that, as I need to change the JSON data passed to avsc to something like
    field.value = {
      'model.TextValue': field.value
    }
    in order to adhere to the rule that the value needs to sit in an object with a single key of its type name
  • not change the automatically generated Avro schema to resemble wrapped unions as avsc understands them for two reasons:
    1. I don't want to bind the data structure / Avro schema to the current implementation of avsc
    2. I don't want to have to apply a patch (bad) or manual changes (even worse) to the generated schema each time the OpenAPI spec from the API gets an update.

Does that make sense?

Note also that you can already force an unwrapped representation for arbitrary unions using logical types (see here, and the linked comment).

Yes, I am currently using wrapUnions: 'never', however the unions I have do almost certainly not need wrapping. I believe the auto value is currently not deciding correctly. With the fix in the value buckets I don't need to enforce using unwrapped unions, the automatic detection does it right.

@mtth
Copy link
Owner

mtth commented Sep 15, 2024

Thank you for the context.

This change will not help you since you are uploading Avro-encoded files to Snowflake. All unions have the same Avro encoding, independent of their in-memory representation. You will have to check on the Snowflake side to see how they represent the data.

W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.

@joscha
Copy link
Contributor Author

joscha commented Sep 16, 2024

You will have to check on the Snowflake side to see how they represent the data.

with 'never' I seem to just get a plain JSON object. I haven't tried wrapped unions.

W.r.t. the logic, a major concern is that it only works with values with a named constructor. No other values currently have this requirement.

Yes, hence the guard.

@mtth
Copy link
Owner

mtth commented Sep 18, 2024

Yes, hence the guard.

It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:

const type = avro.Type.forSchema([
  'null',
  {type: 'record', name: 'test.Foo', fields: [{name: 'id', type: 'string'}]},
]);

type.toBuffer({id: 'abc'}); // Not valid anymore

(We should add a test for this.)

Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:

const val = type.fromBuffer(Buffer.from('0206616263', 'hex')); // Generated record instance
type.toBuffer(val); // Throws

For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.

Here is one way we could achieve this:

  1. We allow the unwrapUnions option to also accept a function, which is expected to return a disambiguation function (e.g. (types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined) or nothing.
  2. This function is called at schema parsing time on each union with its branches' types. If it returns a non-null (function) value, that function will be called each time a value's branch needs to be inferred and should return the branch's index. In this case the union will use an unwrapped representation. Otherwise, the union will be wrapped.

This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.

WDYT?

@joscha
Copy link
Contributor Author

joscha commented Sep 18, 2024

Yes, hence the guard.

It's not sufficient unfortunately. It would break many users of unwrapped unions, for example the common nullable case:

Another issue is that the library should be self-contained: any constructors required for use with unions should be included in--or generated by--the library. Record constructors are already generated but meant to be opt-in. They also aren't compatible with this implementation, which means that certain values don't roundtrip anymore:

For this feature to be worth adding, I think it needs to allow arbitrary record representations. Even if we assume the use of constructors to disambiguate branches, there is no obvious single way to name them. Another user would be justified in requesting that their own records, named slightly differently, also be supported.

Okay, I get your point, thank you for elaborating. Reading the constructor name is definitely not a great implementation option anyway.

Here is one way we could achieve this:

  1. We allow the unwrapUnions option to also accept a function, which is expected to return a disambiguation function (e.g. (types: ReadonlyArray<Type>) => ((val: unknown) => number) | undefined) or nothing.
  2. This function is called at schema parsing time on each union with its branches' types. If it returns a non-null (function) value, that function will be called each time a value's branch needs to be inferred and should return the branch's index. In this case the union will use an unwrapped representation. Otherwise, the union will be wrapped.

This single function gives users full flexibility to decide how to disambiguate unions. In your case, it could be a simple constructor check using the naming convention from the PR. Other users would be free to use their own constructor naming convention, or something else entirely.

WDYT?

I like it. This is what we would use instead of the result from getValueBucket in those cases when a projection function has been supplied AND returns a defined value, correct?

(We should add a test for this.)

I can do this when I open the pull request for the above?

@joscha
Copy link
Contributor Author

joscha commented Sep 19, 2024

I believe the above implementation suggestion can also solve #225 (comment) - you can simply return the discriminator of the union to ensure there is no wrapping needed. (cc @hath995 - might be a bit late for you possibly)

@mtth
Copy link
Owner

mtth commented Sep 21, 2024

I like it. This is what we would use instead of the result from getValueBucket in those cases when a projection function has been supplied AND returns a defined value, correct?

That's right.

I can do this when I open the pull request for the above?

A separate tiny PR adding the test would be best.

@joscha
Copy link
Contributor Author

joscha commented Sep 23, 2024

A separate tiny PR adding the test would be best.

#477

lib/types.js Outdated Show resolved Hide resolved
@joscha
Copy link
Contributor Author

joscha commented Sep 23, 2024

@mtth PTAL

Copy link
Owner

@mtth mtth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @joscha.

Will you also want this backported to 5.x? (It will probably be at least a few weeks before 6.0 is released.)

lib/types.js Outdated Show resolved Hide resolved
types/index.d.ts Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
@joscha
Copy link
Contributor Author

joscha commented Sep 24, 2024

Will you also want this backported to 5.x? (It will probably be at least a few weeks before 6.0 is released.)

I wasn't going to - given it's not a fix, I think it could reasonably be expected to upgrade if the feature is needed? Do you think we need to?

lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
@joscha joscha requested a review from mtth September 24, 2024 10:02
@joscha joscha changed the title fix: allow named types in unions feat: allow named types in unions Sep 25, 2024
test/test_types.js Outdated Show resolved Hide resolved
test/test_types.js Show resolved Hide resolved
test/test_types.js Outdated Show resolved Hide resolved
types/index.d.ts Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
test/test_types.js Show resolved Hide resolved
lib/types.js Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
@joscha joscha requested a review from mtth September 28, 2024 22:05
test/test_types.js Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
lib/types.js Outdated Show resolved Hide resolved
@joscha joscha requested a review from mtth September 29, 2024 06:39
@mtth mtth merged commit aeada8c into mtth:master Sep 29, 2024
3 checks passed
@mtth
Copy link
Owner

mtth commented Sep 29, 2024

Thanks @joscha!

@mtth
Copy link
Owner

mtth commented Sep 29, 2024

I think it could reasonably be expected to upgrade if the feature is needed? Do you think we need to?

I was checking in case you needed it to be backported, to use before 6.0 is released. Otherwise, yes - no reason to.

@joscha joscha deleted the patch-2 branch September 29, 2024 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants