adding blog entry on E-commerce Funnel Analysis with StarRocks: 87 Million Records, Apache Hudi, Apache Iceberg, Delta Lake (MinIO, Apache HMS, Apache xTable) #360

Closed
wants to merge 4 commits into from

Conversation

alberttwong
Contributor

What is the purpose of the pull request

adding a new blog entry

Verify this pull request

Manually verified the change by running the website locally.

ashvina and others added 3 commits March 1, 2024 10:53
This is a performance optimization change and extends the improvements added to DeltaClient to IcebergClient.

The current code in the Iceberg client generates unnecessary objects when computing the file diff to find new and removed files. The process first converts all table format data files of the current snapshot to OneDataFiles, uses OneDataFiles to compute the diff, and then converts the resulting OneDataFiles collection back to table format data file objects for writing. There is an unnecessary round trip here. For large tables with thousands of data files in a snapshot, this results in the creation of a large number of objects unnecessarily.

This change optimizes this process by skipping the unnecessary conversions. This optimization does not change the behavior of the translation. This change does not break backward compatibility and is already covered by existing tests.
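
The idea behind this optimization can be illustrated with a minimal sketch, written in Python for brevity (XTable itself is Java). The function name `diff_by_path` and the dict-based file records are hypothetical and are not XTable's actual API. The sketch only shows that comparing files by path lets the diff hand back the original table-format objects directly, without converting them to an intermediate representation and back.

```python
def diff_by_path(previous_files, latest_files):
    """Compute which files to add and which to remove, keyed by path.

    previous_files: dict of path -> native table-format file object
                    (what the target table currently tracks)
    latest_files:   dict of path -> source-side file description
                    (what the snapshot being synced contains)

    Both shapes are assumptions made for this illustration.
    """
    # Files present in the new snapshot but unknown to the target table.
    to_add = [f for path, f in latest_files.items() if path not in previous_files]
    # Files the target table tracks that no longer exist in the snapshot.
    # These are returned as the original native objects, so no conversion
    # round trip is needed before writing the removal.
    to_remove = [f for path, f in previous_files.items() if path not in latest_files]
    return to_add, to_remove
```

Only the entries in `to_add` would still need converting to the target format; the removed files are passed through untouched.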
Support for Deletion Vectors was added in Delta Lake version 2.4.
The upgrade also requires updating the Spark runtime version to 3.4+.

In addition to changing the versions of the dependencies, this change also incorporates all the
backward-incompatible changes in the Delta API.
1. getSnapshotAt change: it does not accept an optional timestamp (2nd method argument)
anymore. This argument was not provided / used by XTable and can safely be ignored in all
invocations.
2. addFile api change: It now requires information about the deletion vector as a parameter. As writing
deletion vectors is not supported in the current version of XTable, a null is provided to the addFile
method call.
3. update transaction api change: It now requires a Catalyst Expression object, instead of a
generic string object, to be linked to an update operation. This change replaces the string object
used by XTable with a Literal expression.
4. getSnapshot api change: It does not require a timestamp to initialize current snapshot anymore.
This change removes the additional argument in the method invocation in XTable.
5. DeltaLog metadata change: The metadata is now available through the DeltaLog's snapshot instance,
instead of being made available through the DeltaLog itself like in the older versions.
6. change in the update api: it now requires the user to choose if defaults need to be ignored. It
seems that the defaults need to be ignored for operations like copy. By default, the value for
ignore-defaults is false for most operations. Hence it is the chosen value in XTable as well.
7. The updated Spark version requires catalog and SQL extension configurations in the session definition.
This change adds these two configs wherever a Spark instance is created for writing Delta Lake
commits.
8. Remove deprecated Spark config spark.sql.iceberg.handle-timestamp-without-timezone
9. Swap hudi-utilities dependency for hudi-sync-common
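
For item 7, the two settings are presumably the ones the Delta Lake documentation lists for Spark sessions: the SQL extension and the catalog implementation. The sketch below, in Python, collects them in a dict and applies them to any builder that exposes a `config(key, value)` method, as PySpark's `SparkSession.builder` does. The helper name `apply_delta_configs` is hypothetical, and whether XTable sets exactly these values is an assumption based on the Delta Lake docs rather than on the diff itself.

```python
# Session settings documented by Delta Lake for Spark 3.x.
# Assumption: these are the "catalog and SQL extension" configs the commit refers to.
DELTA_SESSION_CONFIGS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def apply_delta_configs(builder):
    """Apply the Delta settings to a builder with a chainable config(key, value).

    Hypothetical helper for illustration; works with any object whose
    config() method returns a builder, such as pyspark's SparkSession.builder.
    """
    for key, value in DELTA_SESSION_CONFIGS.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, usage would look like `apply_delta_configs(SparkSession.builder.appName("example")).getOrCreate()`.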

Co-authored-by: Timothy Brown <[email protected]>
…llion Records, Apache Hudi, Apache Iceberg, Delta Lake (MinIO, Apache HMS, Apache xTable)

Signed-off-by: alberttwong <[email protected]>
Member

@vinothchandar vinothchandar left a comment

This is very cool, thanks @alberttwong! Can we separate the code changes (intended?) from the blog changes?

@alberttwong
Contributor Author

I was on an old version, and I think when I did the git pull it just auto-merged. I'm hoping that it'll do the same and just add the delta. If you want, I can make a clean PR.

@alberttwong alberttwong closed this by deleting the head repository Mar 2, 2024
@alberttwong
Contributor Author

closing this PR and opening a clean PR.
