Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some schema updates do not support dots inside a field name #10875

Open
1 of 3 tasks
adamhonen opened this issue Aug 4, 2024 · 0 comments · May be fixed by #10883
Open
1 of 3 tasks

Some schema updates do not support dots inside a field name #10875

adamhonen opened this issue Aug 4, 2024 · 0 comments · May be fixed by #10883
Labels
bug Something isn't working

Comments

@adamhonen
Copy link

adamhonen commented Aug 4, 2024

Apache Iceberg version

1.6.0 (latest release)

Query engine

None

Please describe the bug 🐞

I've reproduced this with the following test added to TestSchemaUpdate.java:

  @Test
  public void testAddFieldWithDotInName() {
    Schema expected =
            new Schema(
                    required(1, "id", Types.IntegerType.get()),
                    optional(2, "dot", Types.StructType.of(
                            required(3, "field", Types.IntegerType.get()))),
                    optional(4, "dot.field", Types.IntegerType.get()));

    Schema schema = new Schema(
            required(1, "id", Types.IntegerType.get()),
            optional(2, "dot", Types.StructType.of(
                            required(3, "field", Types.IntegerType.get()))));

    Schema result =
            new SchemaUpdate(schema, schema.highestFieldId())
                    .addColumn(null,"dot.field", Types.IntegerType.get())
                    .apply();

    assertThat(result.asStruct()).isEqualTo(expected.asStruct());
  }

Seems like there are some internal maps (e.g. Schema.nameToId) that use plain strings as keys. These strings get built through joining nested fields using dots. This makes field nested under dot and a non-nested field dot.field result in the same key, causing the above test to fail.

I started looking at what would be needed to resolve this. Ideally the key isn't a string, but maybe a list of strings that represents different levels of nesting, or something similar. Otherwise, the dots should be escaped and unescaped when dealing with fields that actually have a dot in the name.

The main issue with this is that there are already quite a few methods that implicitly assume the field name doesn't have a dot inside. Some examples: deleteColumn, moveAfter and basically anything on UpdateSchema that doesn't accept a parent field (though even then, what if the parent field has a dot inside?). So more methods (or another interface with said methods) would need to be added in order to support the use case of a dot inside field names, in order to not break existing code.

Edit: I see this is actually a known limitation and s similar solution as suggested above was already discussed, but decided against for the sake of simplicity. IMO this limitation should still be documented in the spec instead of only in error messages.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@adamhonen adamhonen added the bug Something isn't working label Aug 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant