[HUDI-3179] Extracted common `AbstractHoodieTableFileIndex` to be shared across engines #4520

alexeykudinkin · 2022-01-06T01:49:30Z

What is the purpose of the pull request

Extracted the common Hudi table file index, AbstractHoodieTableFileIndex, so that it can be shared across engines (Spark and Hive for now).

AbstractHoodieTableFileIndex is the engine-agnostic table file-index aspect of the current HoodieFileIndex implementation.

As such, this PR establishes the following split:

AbstractHoodieTableFileIndex: generic file-index component responsible for providing accurate file listings based on the timeline, and instant(s) of interest.
SparkHoodieTableFileIndex: Spark-specific implementation of the AbstractHoodieTableFileIndex
HoodieFileIndex: Spark SQL's FileIndex implementation for Hudi Tables based on SparkHoodieTableFileIndex
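
The three-way split above can be pictured with a minimal Java sketch. Every class and method name in it is hypothetical and heavily simplified; it is not the actual Hudi code, and the wiring between the real classes may differ (the sketch uses delegation for the outermost layer purely for illustration). The only detail borrowed from this PR is the idea that partition parsing is left to the engine-specific implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: every name here is hypothetical and simplified,
// not the actual Hudi classes or signatures.
class FileIndexLayersSketch {

    // Layer 1: engine-agnostic listing logic plus an engine-specific hook.
    static abstract class TableFileIndex {
        private final Map<String, List<String>> filesByPartition;

        TableFileIndex(Map<String, List<String>> filesByPartition) {
            this.filesByPartition = filesByPartition;
        }

        // Hook that each engine implements with its own partition-value representation.
        abstract Object[] parsePartitionValues(String partitionPath);

        // Shared by every engine: files of one partition (empty if the partition is unknown).
        List<String> listFiles(String partitionPath) {
            return filesByPartition.getOrDefault(partitionPath, Collections.emptyList());
        }

        List<String> partitionPaths() {
            return new ArrayList<>(filesByPartition.keySet());
        }
    }

    // Layer 2: an engine-specific implementation (stand-in for the Spark one).
    static class EngineTableFileIndex extends TableFileIndex {
        EngineTableFileIndex(Map<String, List<String>> filesByPartition) {
            super(filesByPartition);
        }

        @Override
        Object[] parsePartitionValues(String partitionPath) {
            // Assumption: partition values are the "/"-separated path segments.
            return partitionPath.split("/");
        }
    }

    // Layer 3: adapter exposing the index through a query-engine-facing API.
    static class QueryFileIndexAdapter {
        private final TableFileIndex index;

        QueryFileIndexAdapter(TableFileIndex index) {
            this.index = index;
        }

        // Flattens the per-partition listings into one list of input files.
        List<String> allInputFiles() {
            List<String> all = new ArrayList<>();
            for (String partition : index.partitionPaths()) {
                all.addAll(index.listFiles(partition));
            }
            return all;
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("2022/01/06", Arrays.asList("a.parquet", "b.parquet"));
        files.put("2022/01/07", Arrays.asList("c.parquet"));

        TableFileIndex index = new EngineTableFileIndex(files);
        System.out.println(new QueryFileIndexAdapter(index).allInputFiles());
        System.out.println(index.parsePartitionValues("2022/01/06").length);
    }
}
```

The point of the layering is that a second engine would only need its own subclass of the first layer, while the listing logic stays in one place.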

Brief change log

Please see above

Verify this pull request

This pull request is already covered by existing tests.

Committer checklist

Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

yihua

Overall LGTM. I left a few nits.

yihua · 2022-01-13T23:49:28Z

...park-datasource/hudi-spark/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala

+ * path with the partition columns in this case.
+ *
+ */
+abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,


I think the naming convention for base/abstract classes should be made consistent in the repo, "Base*" or "*Base" or "Abstract*". It is inconsistent for other classes now. For new classes, should we pick one and stick to it, while cleaning up the rest later?

Agreed, some sort of guideline would be helpful to make sure the code base is consistent.

I usually follow this rule:

If it contains some functionality that has to be extended (i.e. an abstract class), then I go for Abstract.

If it just extracts some common functionality, but isn't abstract, I go for Base.

But I am also fine with settling on a particular suffix/prefix.

Let's use Base as the prefix for abstract classes as well, since they have common functionality, and a developer can tell the class is abstract from the keyword?

I'll merge this PR and you can follow up with the renaming in #4531 .

yihua · 2022-01-14T03:02:37Z

...park-datasource/hudi-spark/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala

+    }
+
+    (tableType, queryType) match {
+      case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) =>


Should incremental be supported here as well?

Yes, this is a bigger refactoring that we will tackle separately
HUDI-3247
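
For readers following the question above, here is a rough Java sketch of this kind of dispatch on table type and query type. The enums, the method, and the listing modes are all hypothetical. Only the MERGE_ON_READ snapshot case is visible in the excerpt; the catch-all branch, and the fact that incremental queries land in it, is an assumption made for illustration.

```java
// Illustrative sketch only: the enums and method below are hypothetical,
// not Hudi's actual types or constants.
class ListingModeSketch {

    enum TableType { COPY_ON_WRITE, MERGE_ON_READ }
    enum QueryType { SNAPSHOT, READ_OPTIMIZED, INCREMENTAL }
    enum ListingMode { BASE_AND_LOG_FILES, BASE_FILES_ONLY }

    static ListingMode select(TableType tableType, QueryType queryType) {
        // The only case visible in the excerpt: snapshot queries on merge-on-read tables.
        if (tableType == TableType.MERGE_ON_READ && queryType == QueryType.SNAPSHOT) {
            return ListingMode.BASE_AND_LOG_FILES;
        }
        // Assumed catch-all: every other combination, INCREMENTAL included, lands here.
        return ListingMode.BASE_FILES_ONLY;
    }

    public static void main(String[] args) {
        System.out.println(select(TableType.MERGE_ON_READ, QueryType.SNAPSHOT));
        System.out.println(select(TableType.MERGE_ON_READ, QueryType.INCREMENTAL));
    }
}
```

Under these assumptions an incremental query gets no dedicated branch, which is the gap the question points at.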

hudi-bot · 2022-01-14T19:49:19Z

CI report:

f902af4 Azure: SUCCESS

vinothchandar · 2022-01-19T00:35:53Z

@alexeykudinkin we have now introduced Scala into hadoop-mr. Can we get rid of this? Otherwise we will have to publish different mr-bundles for different Scala versions, and I really don't want to open that door.

vinothchandar

@alexeykudinkin could you confirm that this is just breaking up the classes with no code changes?

vinothchandar · 2022-01-19T16:20:46Z

hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala

@@ -69,122 +54,123 @@ import scala.util.{Failure, Success, Try}
 * , we read it as a Non-Partitioned table because we cannot know how to mapping the partition
 * path with the partition columns in this case.
 *
+ * TODO rename to HoodieSparkSqlFileIndex


vinothchandar · 2022-01-19T16:32:40Z

...-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala

+import org.apache.spark.unsafe.types.UTF8String
+
+/**
+ * Implementation of the [[AbstractHoodieTableFileIndex]] for Spark


In general, I would like to move towards a model where base classes are called HoodieXXX and engine-specific classes are called SparkXXX, SparkSQLXXX and so on. Just something to keep in mind as we pull out hierarchies.

vinothchandar · 2022-01-19T19:06:41Z

...tasource/hudi-spark-common/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala

+ * @param shouldIncludePendingCommits flags whether file-index should exclude any pending operations
+ * @param fileStatusCache transient cache of fetched [[FileStatus]]es
+ */
+abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,


I am not quite sure if this is the final abstraction for us to adopt. The FileIndex APIs are pretty Spark-centric. But I am cool with it if it helps get us to a better spot for now.

alexeykudinkin changed the title ~~[HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines~~ [HUDI-3179][Stacked on 4417] Extracted common AbstractHoodieTableFileIndex to be shared across engines Jan 6, 2022

alexeykudinkin force-pushed the ak/rpath-ref-2 branch from 26b8cc2 to dfcae11 Compare January 6, 2022 02:47

yihua self-assigned this Jan 11, 2022

alexeykudinkin force-pushed the ak/rpath-ref-2 branch from bcbc166 to ce2a803 Compare January 12, 2022 01:50

alexeykudinkin changed the title ~~[HUDI-3179][Stacked on 4417] Extracted common AbstractHoodieTableFileIndex to be shared across engines~~ [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines Jan 12, 2022

yihua reviewed Jan 14, 2022

Alexey Kudinkin added 21 commits January 14, 2022 10:25

2cb8a44 Extracted HoodieTableFileIndex component abstracting handling of the Hudi tables file listing, filtering; Extracted `SparkHoodieTableFileIndex` to bear Spark specific extensions of the `HoodieTableFileIndex`
1cd8c0a Rebased HoodieFileIndex onto SparkHoodieTableFileIndex
28b7742 Fixed refs
5ecd7e2 Tidying up
3e87fc0 HoodieTableFileIndex > AbstractHoodieTableFileIndex
6ae85f8 Carrying over changes lost during rebase (7d046f, 1f7afb)
1a56efa Fixed partitioning columns parsing after recent API changes
53761ce Fixed compilation
1a204a5 Fixed incorrect instant being queried by
61cec5c Tidying up
f23cc60 Cleaned up field-map generation seq
3aee47d Cleaned up PartitionPath abstraction, preparing it to be split into engine-specific impls
a22bd6a Abstracted parsePartitionRow to be defined in engine specific impls
425853f Tidying up
21158d8 Abstracted specifiedQueryInstant to be provided by engine-specific impl
ab0726c Tidying up
9c076d7 Removed dep on Spark's FileStatusCache from `AbstractHoodieTableFileIndex`
27d50ae lint
ac70491 Fixed tests
9cde9d8 Tidying up
d3e8360 Tidying up java-docs
    # Conflicts: # hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java
f902af4 Moving AbstractHoodieTableFileIndex, SparkHoodieTableFileIndex to "hudi-spark-common"

alexeykudinkin force-pushed the ak/rpath-ref-2 branch from ce2a803 to f902af4 Compare January 14, 2022 18:41

yihua merged commit 75caa7d into apache:master Jan 17, 2022

vinothchandar reviewed Jan 19, 2022

vinishjail97 mentioned this pull request Jan 24, 2022

FixIgnoreKey nsivabalan/hudi#11

Closed

5 tasks

vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022

e9e12b9 [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines (apache#4520)

liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022

ca2a782 [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines (apache#4520)

vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022

c1c14bf [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines (apache#4520)
