Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines #4520

Merged
merged 22 commits into from
Jan 17, 2022

Conversation

alexeykudinkin
Copy link
Contributor

Tips

What is the purpose of the pull request

Extracted common Hudi table's file-index AbstractHoodieTableFileIndex to be shared across engines (Spark, Hive for now)

AbstractHoodieTableFileIndex is defined as engine agnostic table's file-index aspect of the current HoodieFileIndex implementation.

As such following split is established in this PR:

  • AbstractHoodieTableFileIndex: generic file-index component responsible for providing accurate file listings based on the timeline, and instant(s) of interest.
  • SparkHoodieTableFileIndex: Spark-specific implementation of the AbstractHoodieTableFileIndex
  • HoodieFileIndex: Spark SQL's FileIndex implementation for Hudi Tables based on SparkHoodieTableFileIndex

Brief change log

Please see above

Verify this pull request

This pull request is already covered by existing tests, such as (please describe tests).

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@alexeykudinkin alexeykudinkin changed the title [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines [HUDI-3179][Stacked on 4417] Extracted common AbstractHoodieTableFileIndex to be shared across engines Jan 6, 2022
@yihua yihua self-assigned this Jan 11, 2022
@alexeykudinkin alexeykudinkin changed the title [HUDI-3179][Stacked on 4417] Extracted common AbstractHoodieTableFileIndex to be shared across engines [HUDI-3179] Extracted common AbstractHoodieTableFileIndex to be shared across engines Jan 12, 2022
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM. I left a few nits.

* path with the partition columns in this case.
*
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Copy link
Contributor

@yihua yihua Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the naming convention for base/abstract classes should be made consistent in the repo, "Base*" or "*Base" or "Abstract*". It is inconsistent for other classes now. For new classes, should we pick one and stick to it, while cleaning up the rest later?

Copy link
Contributor Author

@alexeykudinkin alexeykudinkin Jan 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, some sort of guideline would be helpful to make sure code base is consistent.

I usually follow the rule of

  • If it contains some functionality that has to be extended (ie abstract class) then i go for Abstract
  • If it's just extracts some common functionality, but isn't abstract i go for Base

But am also fine to settle on particular suffix/prefix

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use Base as the prefix for abstract classes as well, since they have common functionality, and dev knows the class is abstract by keyword?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll merge this PR and you can follow up with the renaming in #4531 .

}

(tableType, queryType) match {
case (MERGE_ON_READ, QUERY_TYPE_SNAPSHOT_OPT_VAL) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should incremental be supported here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a bigger refactoring that we will tackle separately
HUDI-3247

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sg

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua yihua merged commit 75caa7d into apache:master Jan 17, 2022
@vinothchandar
Copy link
Member

@alexeykudinkin we have now introduced scala into hadoop-mr. Can we get rid of this? We will get into publishing different mr-bundles now with different scala versions. I really don't want to open that door.

Copy link
Member

@vinothchandar vinothchandar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alexeykudinkin could you confirm that this is just breaking up the classes with no code changes

@@ -69,122 +54,123 @@ import scala.util.{Failure, Success, Try}
* , we read it as a Non-Partitioned table because we cannot know how to mapping the partition
* path with the partition columns in this case.
*
* TODO rename to HoodieSparkSqlFileIndex
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

import org.apache.spark.unsafe.types.UTF8String

/**
* Implementation of the [[AbstractHoodieTableFileIndex]] for Spark
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, I would like to mode towards a model where base classes are called HoodieXXX and engine specific classes are called SparkXXX , SparkSQLXXX and so on. Just something to keep in mind as we pull out hierarchies

* @param shouldIncludePendingCommits flags whether file-index should exclude any pending operations
* @param fileStatusCache transient cache of fetched [[FileStatus]]es
*/
abstract class AbstractHoodieTableFileIndex(engineContext: HoodieEngineContext,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not quite sure if this is the final abstraction for us to adopt. The FileIndex apis are pretty spark centric. But cool with it if it helps gets us to a better spot for now

@vinishjail97 vinishjail97 mentioned this pull request Jan 24, 2022
5 tasks
vingov pushed a commit to vingov/hudi that referenced this pull request Jan 26, 2022
liusenhua pushed a commit to liusenhua/hudi that referenced this pull request Mar 1, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants