-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HUDI-3179] Extracted common AbstractHoodieTableFileIndex
to be shared across engines
#4520
Conversation
AbstractHoodieTableFileIndex
to be shared across enginesAbstractHoodieTableFileIndex
to be shared across engines
26b8cc2
to
dfcae11
Compare
bcbc166
to
ce2a803
Compare
AbstractHoodieTableFileIndex
to be shared across enginesAbstractHoodieTableFileIndex
to be shared across engines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM. I left a few nits.
* path with the partition columns in this case. | ||
* | ||
*/ | ||
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the naming convention for base/abstract classes should be made consistent in the repo, "Base*" or "*Base" or "Abstract*". It is inconsistent for other classes now. For new classes, should we pick one and stick to it, while cleaning up the rest later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, some sort of guideline would be helpful to make sure code base is consistent.
I usually follow the rule of
- If it contains some functionality that has to be extended (ie abstract class) then i go for
Abstract
- If it's just extracts some common functionality, but isn't abstract i go for
Base
But am also fine to settle on particular suffix/prefix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use Base
as the prefix for abstract classes as well, since they have common functionality, and dev knows the class is abstract by keyword?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll merge this PR and you can follow up with the renaming in #4531 .
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/utils/HoodieInputFormatUtils.java
Show resolved
Hide resolved
...rk3/src/main/scala/org/apache/spark/sql/execution/datasources/Spark3ParsePartitionUtil.scala
Outdated
Show resolved
Hide resolved
...park-datasource/hudi-spark/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala
Outdated
Show resolved
Hide resolved
...park-datasource/hudi-spark/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala
Outdated
Show resolved
Hide resolved
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/hudi/SparkHoodieTableFileIndex.scala
Outdated
Show resolved
Hide resolved
...park-datasource/hudi-spark/src/main/scala/org/apache/hudi/AbstractHoodieTableFileIndex.scala
Outdated
Show resolved
Hide resolved
} | ||
|
||
(tableType, queryType) match { | ||
case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should incremental be supported here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a bigger refactoring that we will tackle separately
HUDI-3247
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sg
…e Hudi tables file listing, filtering; Extracted `SparkHoodieTableFileIndex` to bear Spark specific extensions of the `HoodieTableFileIndex`
… engine-specific impls
# Conflicts: # hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HiveHoodieTableFileIndex.java
… "hudi-spark-common"
ce2a803
to
f902af4
Compare
@alexeykudinkin we have now introduced scala into hadoop-mr. Can we get rid of this? We will get into publishing different mr-bundles now with different scala versions. I really don't want to open that door. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alexeykudinkin could you confirm that this is just breaking up the classes with no code changes
@@ -69,122 +54,123 @@ import scala.util.{Failure, Success, Try} | |||
* , we read it as a Non-Partitioned table because we cannot know how to mapping the partition | |||
* path with the partition columns in this case. | |||
* | |||
* TODO rename to HoodieSparkSqlFileIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
import org.apache.spark.unsafe.types.UTF8String | ||
|
||
/** | ||
* Implementation of the [[AbstractHoodieTableFileIndex]] for Spark |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, I would like to mode towards a model where base classes are called HoodieXXX
and engine specific classes are called SparkXXX
, SparkSQLXXX
and so on. Just something to keep in mind as we pull out hierarchies
* @param shouldIncludePendingCommits flags whether file-index should exclude any pending operations | ||
* @param fileStatusCache transient cache of fetched [[FileStatus]]es | ||
*/ | ||
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not quite sure if this is the final abstraction for us to adopt. The FileIndex
apis are pretty spark centric. But cool with it if it helps gets us to a better spot for now
…red across engines (apache#4520)
…red across engines (apache#4520)
…red across engines (apache#4520)
Tips
What is the purpose of the pull request
Extracted common Hudi table's file-index
AbstractHoodieTableFileIndex
to be shared across engines (Spark, Hive for now)AbstractHoodieTableFileIndex
is defined as engine agnostic table's file-index aspect of the currentHoodieFileIndex
implementation.As such following split is established in this PR:
AbstractHoodieTableFileIndex
: generic file-index component responsible for providing accurate file listings based on the timeline, and instant(s) of interest.SparkHoodieTableFileIndex
: Spark-specific implementation of theAbstractHoodieTableFileIndex
HoodieFileIndex
: Spark SQL'sFileIndex
implementation for Hudi Tables based onSparkHoodieTableFileIndex
Brief change log
Please see above
Verify this pull request
This pull request is already covered by existing tests, such as (please describe tests).
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.