
Upgrade Delta Lake version to 2.4 for deletion vector support #349

Merged
merged 5 commits into from
Feb 27, 2024

Conversation

@ashvina ashvina commented Feb 27, 2024

Fixes #340

Support for Deletion Vectors was added in Delta Lake version 2.4. The upgrade also requires updating the Spark runtime version to 3.4+.

In addition to changing the versions of the dependencies, this change also incorporates all the backward-incompatible changes in the Delta API:

  1. `getSnapshotAt` change: it no longer accepts an optional timestamp (second method argument). This argument was not provided or used by XTable and can safely be dropped in all invocations.
  2. `AddFile` API change: it now requires information about the deletion vector as a parameter. As writing deletion vectors is not supported in the current version of XTable, a null is passed in the `AddFile` call.
  3. Transaction update API change: it now requires a Catalyst `Expression` object, instead of a generic string, to be linked to an update operation. This change replaces the string used by XTable with a `Literal` expression.
  4. `getSnapshot` API change: it no longer requires a timestamp to initialize the current snapshot. This change removes the extra argument from the method invocation in XTable.
  5. `DeltaLog` metadata change: the metadata is now available through the `DeltaLog`'s snapshot instance, instead of through the `DeltaLog` itself as in older versions.
  6. Update API change: it now requires the caller to choose whether defaults should be ignored. It appears that defaults need to be ignored for operations like copy. By default, ignore-defaults is false for most operations, so that is the value chosen in XTable as well.
  7. The updated Spark version requires catalog and SQL extension configurations in the session definition. This change adds these two configs wherever a Spark instance is created for writing Delta Lake commits.
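As a rough Java sketch of call sites affected by items 1–6, the shapes below follow the Delta 2.4 API described above; all variable names (`deltaLog`, `txn`, `tablePath`, etc.) are illustrative, Scala-interop plumbing (`Option`, Scala collections) is elided, and this is not XTable's actual code:

```java
// Hypothetical sketch; assumes delta-spark 2.4 on the classpath.
DeltaLog deltaLog = DeltaLog.forTable(spark, tablePath);

// (1)/(4) Snapshot lookups no longer take the extra timestamp argument.
Snapshot snapshot = deltaLog.getSnapshotAt(version); // was getSnapshotAt(version, timestampOpt)

// (5) Table metadata now hangs off the snapshot, not the DeltaLog itself.
Metadata metadata = deltaLog.snapshot().metadata();  // was deltaLog.metadata()

// (2) AddFile now carries a deletion-vector descriptor as a parameter;
// XTable does not write deletion vectors yet, so null is supplied.
AddFile addFile = new AddFile(path, partitionValues, size, modificationTime,
    dataChange, stats, tags, null /* deletionVector */);

// (3)/(6) The update operation attached to a commit takes a Catalyst
// Expression (here a Literal) rather than a plain string.
OptimisticTransaction txn = deltaLog.startTransaction();
txn.commit(actions, updateOperationWith(Literal.fromObject(predicateString)));
```

`updateOperationWith` stands in for however the operation object is assembled; the point is only that the predicate argument changed from `String` to a Catalyst `Expression`.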

@ashvina ashvina linked an issue Feb 27, 2024 that may be closed by this pull request
@@ -77,8 +77,6 @@ public static void initSpark() {
spark = SparkSession.builder().config(sparkConf).getOrCreate();
}
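The two session settings that item 7 of the description refers to would look roughly like this when building the session; the two config keys are the standard Delta Lake ones, while the app name and master are placeholder values:

```java
// Sketch of the session setup the change adds; mirrors the snippet above.
SparkConf sparkConf = new SparkConf()
    .setAppName("xtable-delta")   // illustrative app name
    .setMaster("local[2]")        // illustrative master
    // Registers Delta's SQL parser/planner extensions.
    .set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    // Catalog implementation that lets Spark resolve Delta tables.
    .set("spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog");

SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
```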

@ParameterizedTest
Contributor Author
TODO: Need help in fixing this test. This test is currently failing with an error related to serialization.
Caused by: java.lang.AbstractMethodError: Receiver class org.apache.spark.sql.adapter.Spark3_4Adapter does not define or inherit an implementation of the resolved method 'abstract org.apache.spark.sql.avro.HoodieAvroSerializer createAvroSerializer(org.apache.spark.sql.types.DataType, org.apache.avro.Schema, boolean)' of interface org.apache.spark.sql.hudi.SparkAdapter. at org.apache.hudi.AvroConversionUtils$.createInternalRowToAvroConverter(AvroConversionUtils.scala:59)

Contributor
I'll take a look at this one but there are more tests failing.

@ashvina ashvina force-pushed the 340-update-delta-lake-version-to-2.4+ branch from f11984a to dd72522 Compare February 27, 2024 23:11
@ashvina ashvina merged commit 961d9b0 into main Feb 27, 2024
1 check passed
@ashvina ashvina deleted the 340-update-delta-lake-version-to-2.4+ branch February 27, 2024 23:47
@vinishjail97 vinishjail97 mentioned this pull request Aug 16, 2024
Successfully merging this pull request may close these issues.

Update Delta Lake version to 2.4+ (and spark to 3.4+)
2 participants