[HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs #4175

Merged
merged 3 commits into from
Mar 22, 2022

rmahindra123
Contributor

Refactor hive sync tool / config to use reflection and standardize configs
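The description does not spell out how reflection is used. As a rough sketch of the general pattern only, a sync tool can be instantiated from its class name through a known constructor signature. Everything below (`SyncTool`, `LoggingSyncTool`, `SyncToolLoader`, and the `example.sync.table` key) is a hypothetical stand-in, not the actual Hudi API:

```java
import java.lang.reflect.Constructor;
import java.util.Properties;

// Hypothetical contract; the real Hudi sync tool base class is not reproduced here.
interface SyncTool {
    String syncHoodieTable();
}

// Hypothetical implementation used only to exercise the loader.
class LoggingSyncTool implements SyncTool {
    private final Properties props;

    public LoggingSyncTool(Properties props) {
        this.props = props;
    }

    @Override
    public String syncHoodieTable() {
        // "example.sync.table" is a placeholder key, not a real Hudi config key.
        return "synced " + props.getProperty("example.sync.table", "unknown");
    }
}

final class SyncToolLoader {
    private SyncToolLoader() {
    }

    // Looks up the class by name and calls its public (Properties) constructor.
    static SyncTool load(String className, Properties props) {
        try {
            Class<?> clazz = Class.forName(className);
            Constructor<?> ctor = clazz.getConstructor(Properties.class);
            return (SyncTool) ctor.newInstance(props);
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate sync tool " + className, e);
        }
    }
}
```

With this shape, the caller only needs the implementation's class name in a config value, and every implementation has to expose the same constructor signature.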

@rmahindra123 rmahindra123 changed the title [HUDI-2883] [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs Dec 1, 2021
@rmahindra123 rmahindra123 changed the title [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs [HUDI-2883] WIP Refactor hive sync tool / config to use reflection and standardize configs Dec 2, 2021
@rmahindra123 rmahindra123 marked this pull request as draft December 2, 2021 22:13
@rmahindra123 rmahindra123 force-pushed the rm_ref_hive_sync branch 5 times, most recently from 8e0bc42 to 9c13f28 Compare December 6, 2021 21:31
@rmahindra123 rmahindra123 changed the title [HUDI-2883] WIP Refactor hive sync tool / config to use reflection and standardize configs [HUDI-2883] Refactor hive sync tool / config to use reflection and standardize configs Dec 7, 2021
@rmahindra123 rmahindra123 marked this pull request as ready for review December 7, 2021 00:38
@vinothchandar vinothchandar self-assigned this Dec 7, 2021
@nsivabalan
Contributor

@rmahindra123: I can help review this patch. Can you rebase onto the latest master? Also, let's sync up sometime so I can get some context on the patch.

@nsivabalan nsivabalan self-assigned this Jan 17, 2022
@nsivabalan nsivabalan added the priority:critical production down; pipelines stalled; Need help asap. label Feb 8, 2022
if (this.config.isHiveLocal()) {
writer.getDeltaStreamerWrapper().getDeltaSyncService().getDeltaSync()
.syncHive(getLocalHiveServer().getHiveConf());
hiveSyncTool = new HiveSyncTool(writer.getWriteConfig().getProps(),
Contributor

Have we removed or deprecated DeltaSync.syncHive(...) in this patch? If not, I would prefer to go through the DeltaSync syncHive API.

Contributor Author

This is just a helper function, right? Calling the method through DeltaSync seems like an odd flow. I also removed the method from DeltaSync, since this was the only place it was used. We can discuss more 1:1 if required.

hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_ASSUME_DATE_PARTITION.key(), "true");
hiveSyncProps.setProperty(HiveSyncConfig.HIVE_USE_PRE_APACHE_INPUT_FORMAT.key(), "false");
hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "datestr");
hiveSyncProps.setProperty(HiveSyncConfig.HIVE_BATCH_SYNC_PARTITION_NUM.key(), "3");
Contributor

I don't see jdbcUrl being set here, and this patch additionally sets BATCH_SYNC_PARTITION_NUM. Is that intentional?

Contributor Author

Good catch. Thanks for the thorough review.

  1. I haven't removed it; it's set on line 123:
     hiveSyncProps.setProperty(HiveSyncConfig.HIVE_URL.key(), hiveTestService.getJdbcHive2Url());
  2. As for the batch sync partition num, every test was already setting it manually via HiveTestUtil.hiveSyncConfig.batchSyncNum = 3;

@nsivabalan
Contributor

@rmahindra123: Just to be cautious, can you test this patch's hive sync functionality against an existing Hudi table?

  1. Create a new table and run hive sync using the older Hudi (master).
  2. Then use this patch to do another round of ingestion, perhaps adding a new partition this time. Ensure hive sync works without any config changes compared to step 1.

@nsivabalan
Contributor

nsivabalan commented Feb 17, 2022

@xiarixiaoyao: Can you help us test this patch? Rajesh refactored hive sync quite a bit, mostly around class structure, config usage, instantiation, etc. It would be great if you could test it and let us know whether it looks good.
By the way, feel free to review the patch as well if you are interested.

Contributor

@nsivabalan nsivabalan left a comment

Left some comments. For some of the resolved comments, I will sync up with you directly.

@rmahindra123
Contributor Author

Addressed reviewer comments.

@nsivabalan
Contributor

@xiarixiaoyao: Did you get a chance to test this patch out?

@xiarixiaoyao
Contributor

@nsivabalan @rmahindra123 I will review and test this PR this weekend.

@xushiyan xushiyan self-assigned this Mar 11, 2022
@xushiyan xushiyan added priority:blocker and removed priority:critical production down; pipelines stalled; Need help asap. labels Mar 11, 2022
hiveSyncProps.setProperty(HiveSyncConfig.META_SYNC_PARTITION_FIELDS.key(), "datestr");
hiveSyncProps.setProperty(HiveSyncConfig.HIVE_BATCH_SYNC_PARTITION_NUM.key(), "3");

hiveSyncConfig = new HiveSyncConfig(hiveSyncProps);
Member

I'd prefer a builder pattern to construct HiveSyncConfig reliably, instead of passing in uncontrolled "raw" props.
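To make the builder suggestion concrete, here is a minimal sketch of what such a builder could look like. `SyncConfigSketch`, its method names, the `example.sync.*` keys, and the default batch size of 1000 are all hypothetical placeholders; this is not the HiveSyncConfig API from the patch:

```java
import java.util.Properties;

// Hypothetical stand-in for a sync config class; keys are placeholders, not real Hudi keys.
final class SyncConfigSketch {
    static final String PARTITION_FIELDS_KEY = "example.sync.partition_fields";
    static final String BATCH_NUM_KEY = "example.sync.batch_num";

    private final Properties props;

    private SyncConfigSketch(Properties props) {
        this.props = props;
    }

    String partitionFields() {
        return props.getProperty(PARTITION_FIELDS_KEY, "");
    }

    int batchNum() {
        // 1000 is an assumed default chosen for this sketch only.
        return Integer.parseInt(props.getProperty(BATCH_NUM_KEY, "1000"));
    }

    static Builder newBuilder() {
        return new Builder();
    }

    static final class Builder {
        private final Properties props = new Properties();

        // Seeds the builder from existing properties; typed setters can then override them.
        Builder fromProperties(Properties base) {
            props.putAll(base);
            return this;
        }

        Builder withPartitionFields(String fields) {
            props.setProperty(PARTITION_FIELDS_KEY, fields);
            return this;
        }

        Builder withBatchNum(int batchNum) {
            // Validation happens at construction time instead of at first use.
            if (batchNum <= 0) {
                throw new IllegalArgumentException("batch num must be positive: " + batchNum);
            }
            props.setProperty(BATCH_NUM_KEY, Integer.toString(batchNum));
            return this;
        }

        SyncConfigSketch build() {
            // Copy so later builder calls cannot mutate an already-built config.
            Properties copy = new Properties();
            copy.putAll(props);
            return new SyncConfigSketch(copy);
        }
    }
}
```

A test setup could then read `SyncConfigSketch.newBuilder().withPartitionFields("datestr").withBatchNum(3).build()` rather than a series of raw `setProperty` calls, and a mistyped key or invalid value would fail early.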


- public AbstractSyncTool(Properties props, FileSystem fileSystem) {
+ public AbstractSyncTool(TypedProperties props, Configuration conf, FileSystem fs) {
Member

I think a cleaner interface would be

Suggested change
- public AbstractSyncTool(TypedProperties props, Configuration conf, FileSystem fs) {
+ public AbstractSyncTool(HoodieSyncConfig syncConfig, FileSystem fs) {

hadoopConf should be set from fs.getConf(). Also, props is not controllable; it is hard for the user to decide what to pass in here.

HoodieSyncConfig is the matching config class for SyncTool, as the names suggest. Internally we can still do props = syncConfig.getProps();
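A small sketch of the constructor shape suggested here, where the tool takes a config object plus a file system and derives everything else from them. To stay self-contained, it uses hypothetical stand-in types (`ConfStandIn`, `FsStandIn`, `SyncConfigStandIn`, `SyncToolSketch`) in place of Hadoop's `Configuration` and `FileSystem` and Hudi's `HoodieSyncConfig`:

```java
import java.util.Properties;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
final class ConfStandIn {
    final String name;

    ConfStandIn(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a Hadoop FileSystem, which exposes getConf().
final class FsStandIn {
    private final ConfStandIn conf;

    FsStandIn(ConfStandIn conf) {
        this.conf = conf;
    }

    ConfStandIn getConf() {
        return conf;
    }
}

// Hypothetical stand-in for the sync config class that wraps the raw properties.
final class SyncConfigStandIn {
    private final Properties props;

    SyncConfigStandIn(Properties props) {
        this.props = props;
    }

    Properties getProps() {
        return props;
    }
}

abstract class SyncToolSketch {
    protected final Properties props;
    protected final ConfStandIn conf;
    protected final FsStandIn fs;

    // Only two inputs: props come from the config object, conf comes from the file system.
    protected SyncToolSketch(SyncConfigStandIn syncConfig, FsStandIn fs) {
        this.props = syncConfig.getProps();
        this.conf = fs.getConf();
        this.fs = fs;
    }
}
```

Because the conf is never passed separately, a caller cannot hand the tool a configuration that disagrees with the file system it was given.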

Comment on lines 163 to 170
TypedProperties metaProps = new TypedProperties();
metaProps.putAll(props);
metaProps.put(HoodieSyncConfig.META_SYNC_BASE_PATH, cfg.targetBasePath);
metaProps.put(HoodieSyncConfig.META_SYNC_BASE_FILE_FORMAT, cfg.baseFileFormat);
if (props.getBoolean(HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC.key(), HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC.defaultValue())) {
metaProps.put(HiveSyncConfig.HIVE_SYNC_BUCKET_SYNC_SPEC, HiveSyncConfig.getBucketSpec(props.getString(HoodieIndexConfig.BUCKET_INDEX_HASH_FIELD.key()),
props.getInteger(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS.key())));
}
Member

With the builder pattern, we should be able to simplify this sort of props construction.

@xushiyan xushiyan force-pushed the rm_ref_hive_sync branch 2 times, most recently from 95caf2b to 60d9472 Compare March 17, 2022 13:49
@apache apache deleted a comment from hudi-bot Mar 17, 2022
- this.bucketSpec = props.getString(HIVE_SYNC_BUCKET_SYNC_SPEC.key(), null);
+ this.bucketSpec = getStringOrDefault(HIVE_SYNC_BUCKET_SYNC_SPEC);
Member

@rmahindra123 Not sure why this was specifically defaulted to null. The default "" seems OK.
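For illustration, a getStringOrDefault-style lookup can be sketched as below: the default lives with the property definition, so call sites never pass an ad hoc fallback such as null. `ConfigPropertySketch`, `ConfigReaderSketch`, and the `example.sync.bucket_spec` key are hypothetical and only approximate the idea; they are not Hudi's actual classes:

```java
import java.util.Properties;

// Hypothetical, simplified stand-in for a typed config property (key plus default).
final class ConfigPropertySketch {
    private final String key;
    private final String defaultValue;

    ConfigPropertySketch(String key, String defaultValue) {
        this.key = key;
        this.defaultValue = defaultValue;
    }

    String key() {
        return key;
    }

    String defaultValue() {
        return defaultValue;
    }
}

final class ConfigReaderSketch {
    // Placeholder key; the default is the empty string rather than null.
    static final ConfigPropertySketch BUCKET_SPEC =
        new ConfigPropertySketch("example.sync.bucket_spec", "");

    private final Properties props;

    ConfigReaderSketch(Properties props) {
        this.props = props;
    }

    // Returns the configured value, or the default declared on the property itself.
    String getStringOrDefault(ConfigPropertySketch property) {
        return props.getProperty(property.key(), property.defaultValue());
    }
}
```

With an empty-string default, downstream code can call methods such as `isEmpty()` on the result without a null check.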

this.syncComment = getBooleanOrDefault(HIVE_SYNC_COMMENT);
}

// enhance the similar function in child class
public static HiveSyncConfig copy(HiveSyncConfig cfg) {
Member

This is not used.

- this(new TypedProperties(props), new Configuration(), fileSystem);
+ this(new TypedProperties(props), fileSystem.getConf(), fileSystem);
Member

@rmahindra123 If the user provides an fs, shall we always use its conf? I'm actually not sure why the new API allows passing a different conf rather than using the one from fileSystem.

Contributor

+1

@xushiyan xushiyan force-pushed the rm_ref_hive_sync branch 3 times, most recently from d20111e to de57bff Compare March 18, 2022 13:37
@apache apache deleted a comment from hudi-bot Mar 18, 2022
@apache apache deleted a comment from hudi-bot Mar 18, 2022
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Contributor

@nsivabalan nsivabalan left a comment

LGTM

- this(new TypedProperties(props), new Configuration(), fileSystem);
+ this(new TypedProperties(props), fileSystem.getConf(), fileSystem);
Contributor

+1

@nsivabalan nsivabalan merged commit 5f570ea into apache:master Mar 22, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Mar 23, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Mar 23, 2022
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Mar 24, 2022
vingov pushed a commit to vingov/hudi that referenced this pull request Apr 3, 2022
…andardize configs (apache#4175)

- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <[email protected]>
Co-authored-by: Rajesh Mahindra <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>
stayrascal pushed a commit to stayrascal/hudi that referenced this pull request Apr 12, 2022
…andardize configs (apache#4175)

- Refactor hive sync tool / config to use reflection and standardize configs

Co-authored-by: sivabalan <[email protected]>
Co-authored-by: Rajesh Mahindra <[email protected]>
Co-authored-by: Raymond Xu <[email protected]>
6 participants