adding blog entry on E-commerce Funnel Analysis with StarRocks: 87 Million Records, Apache Hudi, Apache Iceberg, Delta Lake (MinIO, Apache HMS, Apache xTable) #360

Closed
wants to merge 4 commits into from

Commits on Mar 1, 2024

  1. Avoid unnecessary Iceberg datafile to onedatafile conversions (#330)

    This is a performance optimization change and extends the improvements added to DeltaClient to IcebergClient.
    
    The current code in the Iceberg client generates unnecessary objects when computing the file diff to find new and removed files. The process first converts all table format data files of the current snapshot to OneDataFiles, uses OneDataFiles to compute the diff, and then converts the resulting OneDataFiles collection back to table format data file objects for writing. There is an unnecessary round trip here. For large tables with thousands of data files in a snapshot, this results in the creation of a large number of objects unnecessarily.
    
    This change optimizes this process by skipping the unnecessary conversions. This optimization does not change the behavior of the translation. This change does not break backward compatibility and is already covered by existing tests.
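
The idea described above can be sketched as a diff keyed on file path, so that neither side has to be converted into the other's type. This is a minimal illustration, not XTable's actual code: the class `SnapshotFileDiff`, its `Diff` holder, and the path-extractor parameters are hypothetical names, and it assumes file paths are unique within each listing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: compare two file listings by path without converting
// the objects of one listing into the type of the other.
public class SnapshotFileDiff {

    /** Holds files to add (in the incoming type) and files to remove (in the existing type). */
    public static final class Diff<P, T> {
        public final List<T> toAdd;
        public final List<P> toRemove;

        Diff(List<T> toAdd, List<P> toRemove) {
            this.toAdd = toAdd;
            this.toRemove = toRemove;
        }
    }

    /**
     * @param previous     files already tracked by the target table format
     * @param previousPath extracts the path of a previously tracked file
     * @param latest       files present in the latest source snapshot
     * @param latestPath   extracts the path of a file in the latest snapshot
     */
    public static <P, T> Diff<P, T> diff(
            Collection<P> previous, Function<P, String> previousPath,
            Collection<T> latest, Function<T, String> latestPath) {
        // Index the previous files by path; entries left over at the end were removed.
        Map<String, P> remaining = new LinkedHashMap<>();
        for (P file : previous) {
            remaining.put(previousPath.apply(file), file);
        }
        List<T> added = new ArrayList<>();
        for (T file : latest) {
            // A path that is not in the index is new; a path that is in the index is unchanged.
            if (remaining.remove(latestPath.apply(file)) == null) {
                added.add(file);
            }
        }
        return new Diff<>(added, new ArrayList<>(remaining.values()));
    }

    public static void main(String[] args) {
        // Plain strings stand in for table-format data file objects here.
        Diff<String, String> d = diff(
                Arrays.asList("a.parquet", "b.parquet"), Function.<String>identity(),
                Arrays.asList("b.parquet", "c.parquet"), Function.<String>identity());
        System.out.println("add=" + d.toAdd + " remove=" + d.toRemove);
    }
}
```

Only the files that end up in `toAdd` would then need conversion for writing, which is where the saving comes from when most files are unchanged between snapshots.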
    ashvina authored and alberttwong committed Mar 1, 2024
    d5d1a13
  2. Upgrade Delta Lake version:2.4 for deletion vector support (#349)

    Support for Deletion Vectors has been added in Delta Lake version 2.4.
    The upgrade also requires updating the Spark runtime version to 3.4+.
    
    In addition to changing the version of the dependencies, this change also incorporates all the
    backward incompatible changes in the Delta API.
    1. getSnapshotAt change: it does not accept an optional timestamp (2nd method argument)
    anymore. This argument was not provided / used by XTable and can safely be ignored in all
    invocations.
    2. addFile api change: It now requires information about the deletion vector as a parameter. As
    writing deletion vectors is not supported in the current version of XTable, a null is provided to
    the addFile method call.
    3. update transaction api change: It now requires a Catalyst Expression object, instead of a
    generic string object, to be linked to an update operation. This change replaces the string object
    used by XTable with a Literal-expression.
    4. getSnapshot api change: It does not require a timestamp to initialize current snapshot anymore.
    This change removes the additional argument in the method invocation in XTable.
    5. DeltaLog metadata change: The metadata is now available through the DeltaLog's snapshot instance,
    instead of being made available through the DeltaLog itself like in the older versions.
    6. change in the update api: it now requires the user to choose if defaults need to be ignored. It
    seems that the defaults need to be ignored for operations like copy. By default, the value for
    ignore-defaults is false for most operations. Hence it is the chosen value in XTable also.
    7. The updated Spark version requires catalog and SQL extension configurations in the session
    definition. This change adds these two configs wherever a Spark instance is created for writing
    Delta Lake commits.
    8. Remove deprecated spark config spark.sql.iceberg.handle-timestamp-without-timezone
    9. swap hudi-utilities dependency for hudi-sync-common
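
For item 7, the commit message does not spell out the two configuration entries. The sketch below assumes they are the SQL extension and catalog settings that the Delta Lake documentation lists for Spark sessions; the class and method names (`DeltaSparkConf`, `deltaSessionConf`) are hypothetical and exist only to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: collects the session settings Delta Lake documents for Spark.
// Assumption: these are the two configs the commit refers to.
public class DeltaSparkConf {

    public static Map<String, String> deltaSessionConf() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Registers Delta's SQL extension with the Spark session.
        conf.put("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension");
        // Makes the default session catalog Delta-aware.
        conf.put("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog");
        return conf;
    }

    public static void main(String[] args) {
        // With Spark on the classpath, each entry would be passed to
        // SparkSession.builder().config(key, value) before getOrCreate().
        deltaSessionConf().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the entries in one place means every code path that creates a Spark instance for Delta writes applies the same settings.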
    
    Co-authored-by: Timothy Brown <[email protected]>
    2 people authored and alberttwong committed Mar 1, 2024
    b310888
  3. adding blog entry on E-commerce Funnel Analysis with StarRocks: 87 Million Records, Apache Hudi, Apache Iceberg, Delta Lake (MinIO, Apache HMS, Apache xTable)
    
    Signed-off-by: alberttwong <[email protected]>
    alberttwong committed Mar 1, 2024
    5d54f5a

Commits on Mar 2, 2024

  1. f93c8f3