
[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluggable manner for fileset catalog #5020

Merged: 48 commits merged into apache:main on Oct 16, 2024

Conversation

yuqi1129
Contributor

What changes were proposed in this pull request?

Add a framework to support multiple storage systems within the Hadoop catalog.

Why are the changes needed?

Some users want Gravitino to manage file systems like S3 or GCS.

Fix: #5019

Does this PR introduce any user-facing change?

N/A.

How was this patch tested?

Existing test.
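
For context, here is a minimal sketch of the kind of pluggable provider interface such a framework introduces: each storage backend (local, HDFS, S3, GCS, ...) supplies one implementation that is looked up by URI scheme at runtime. Method names and signatures below are illustrative, not necessarily the merged contract.

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch only: one implementation per storage backend.
public interface FileSystemProvider {

  // Create or fetch a Hadoop FileSystem instance for the given path and config.
  FileSystem getFileSystem(Path path, Map<String, String> config) throws IOException;

  // The URI scheme this provider serves, e.g. "file", "hdfs", or "s3a".
  String scheme();

  // The provider name referenced from configuration, e.g. "local" or "hdfs".
  String name();
}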

@yuqi1129
Contributor Author

@xiaozcy
Please take a look and check whether this PR meets your requirements, thanks.

build.gradle.kts (outdated review thread, resolved)
@jerryshao
Contributor

I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so the fs is pluggable.

@yuqi1129
Contributor Author

yuqi1129 commented Oct 9, 2024

refactor the client-side GravitinoVirtualFileSystem to leverage this interface so the fs is pluggable

I will refactor the Java client for the fileset in a separate PR.

@yuqi1129 yuqi1129 self-assigned this Oct 9, 2024
@yuqi1129
Contributor Author

yuqi1129 commented Oct 9, 2024

I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so the fs is pluggable.

@jerryshao
I have updated the PR; please review it again, thanks.

@yuqi1129
Contributor Author

@jerryshao
The code has been updated.

Contributor

@jerryshao left a comment

Can you please carefully review your code several times to avoid typos, polish your code structure, and simplify the logic as much as you can?

(String)
propertiesMetadata
.catalogPropertiesMetadata()
.getOrDefault(config, HadoopCatalogPropertiesMetadata.DEFAULT_FS_PROVIDER);
Contributor

What is the default value for this property? Should it be "file"?

}

LOG.warn(
"Can't get schema from path: {} and default filesystem provider is null, using"
Contributor

scheme, not schema...

return getFileSystemByScheme(LOCAL_FILE_SCHEME, config, path);
}

return getFileSystemByScheme(path.toUri().getScheme(), config, path);
Contributor

if (scheme == null && defaultFSProvider == null) {
  LOG.warn(xxx)
}

newScheme = scheme == null ? defaultFSProvider.getScheme() : scheme;
getFileSystemByScheme(newScheme);

@@ -90,6 +90,10 @@ public class HadoopCatalogOperations implements CatalogOperations, SupportsSchem

private CatalogInfo catalogInfo;

private final Map<String, FileSystemProvider> fileSystemProvidersMap = Maps.newHashMap();

private String defaultFilesystemProvider;
Contributor

You can maintain a default FileSystemProvider, not just a string. Besides, it is "FileSystem", not "Filesystem".

throws IOException {
FileSystemProvider provider = fileSystemProvidersMap.get(scheme);
if (provider == null) {
throw new IllegalArgumentException("Unsupported scheme: " + scheme);
Contributor

You should clearly tell the user why the exception happened; the message here does not make the actual reason easy to understand.
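
As an illustration of the reviewer's point, a hedged sketch of a more descriptive message, reusing the names from the snippet above (the exact wording in the merged code may differ):

FileSystemProvider provider = fileSystemProvidersMap.get(scheme);
if (provider == null) {
  throw new IllegalArgumentException(
      String.format(
          "Unsupported scheme: %s. No FileSystemProvider is registered for this scheme; "
              + "registered schemes are %s. Please configure the corresponding provider "
              + "via the filesystem-providers property.",
          scheme, fileSystemProvidersMap.keySet()));
}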

PropertyEntry.stringOptionalPropertyEntry(
DEFAULT_FS_PROVIDER,
"Default file system provider, used to create the default file system "
+ "candidate value is 'local', 'hdfs' or others specified in the "
Contributor

file, not local...

@jerqi
Collaborator

jerqi commented Oct 15, 2024

pluginized -> pluggable.

@jerryshao changed the title from "[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluginized manner for fileset catalog" to "[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluggable manner for fileset catalog" on Oct 15, 2024
@@ -71,6 +75,11 @@ tasks.build {
dependsOn("javadoc")
}

tasks.compileJava {
Contributor

Why do we need this?

Contributor Author

This module depends on catalog-hadoop. However, because the output directories of the tasks :catalogs:catalog-hadoop:jar and :catalogs:catalog-hadoop:runtimeJars are the same (lib/build/), we need to add these two dependencies, or the Gradle compile will fail. The same goes for the other modules:

tasks.test {
  val skipITs = project.hasProperty("skipITs")
  if (skipITs) {
    exclude("**/integration/test/**")
  } else {
    dependsOn(":catalogs:catalog-hadoop:jar", ":catalogs:catalog-hadoop:runtimeJars")
    dependsOn(":catalogs:catalog-hive:jar", ":catalogs:catalog-hive:runtimeJars")
    dependsOn(":catalogs:catalog-kafka:jar", ":catalogs:catalog-kafka:runtimeJars")
  }
}

Comment on lines 146 to 150
this.defaultFileSystemProviderScheme =
StringUtils.isNotBlank(defaultFileSystemProviderClassName)
? FileSystemUtils.getSchemeByFileSystemProvider(
defaultFileSystemProviderClassName, fileSystemProvidersMap)
: LOCAL_FILE_SCHEME;
Contributor

You can create the default fs provider and get the scheme from it; there is no need for this logic.
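
A sketch of the suggested simplification, assuming a hypothetical FileSystemUtils.createFileSystemProvider helper and a hypothetical LocalFileSystemProvider fallback:

// Resolve the configured provider once (hypothetical helper), fall back to the
// local provider, and derive the default scheme directly from the provider.
this.defaultFileSystemProvider =
    StringUtils.isNotBlank(defaultFileSystemProviderClassName)
        ? FileSystemUtils.createFileSystemProvider(defaultFileSystemProviderClassName)
        : new LocalFileSystemProvider();
this.defaultFileSystemProviderScheme = defaultFileSystemProvider.scheme();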

Contributor Author

If we use the service-loading mechanism to load file system providers, I don't see why we need to provide the filesystem-providers configuration; please see below:

// The fileSystemProviders parameter is useless if we use ServiceLoader.
public static Map<String, FileSystemProvider> getFileSystemProviders(String fileSystemProviders) {
  Map<String, FileSystemProvider> resultMap = Maps.newHashMap();
  ServiceLoader<FileSystemProvider> fileSystemProvidersLoader =
      ServiceLoader.load(FileSystemProvider.class);
  fileSystemProvidersLoader.forEach(
      fileSystemProvider -> resultMap.put(fileSystemProvider.name(), fileSystemProvider));
  return resultMap;
}

Then the only configuration we need to provide is default-filesystem-provider. @jerryshao, what do you think about it; is that okay for you?
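
For context on the ServiceLoader approach: implementations are discovered through a provider-configuration file on the classpath whose name is the interface's fully qualified name. A sketch, assuming the interface lives in the package shown; the implementation class names are hypothetical:

# File: META-INF/services/org.apache.gravitino.catalog.hadoop.fs.FileSystemProvider
org.example.S3FileSystemProvider
org.example.GCSFileSystemProvider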

@@ -125,6 +132,10 @@ public void initialize(URI name, Configuration configuration) throws IOException

initializeClient(configuration);

// Register the default local and HDFS FileSystemProvider
String fileSystemProviders = configuration.get(FS_FILESYSTEM_PROVIDERS);
fileSystemProvidersMap.putAll(FileSystemUtils.getFileSystemProviders(fileSystemProviders));
Contributor

gvfs will depend on hadoop-catalog; can you please check the dependencies of gvfs to make sure the shaded jar is correct?

Contributor Author

Some existing ITs already cover the current code; moreover, I verified this logic while developing the Python GVFS client.

@@ -125,6 +132,10 @@ public void initialize(URI name, Configuration configuration) throws IOException

initializeClient(configuration);

// Register the default local and HDFS FileSystemProvider
String fileSystemProviders = configuration.get(FS_FILESYSTEM_PROVIDERS);
Contributor

Do you also need a default fileSystemProvider for gvfs?
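
For illustration, a sketch of how a gvfs client might wire this up; FS_FILESYSTEM_PROVIDERS is the constant from the diff above, while FS_DEFAULT_FILESYSTEM_PROVIDER is a hypothetical companion key that only echoes this question:

Configuration configuration = new Configuration();
// Comma-separated names of extra providers beyond the built-in local and HDFS ones
// (the provider names here are illustrative).
configuration.set(FS_FILESYSTEM_PROVIDERS, "s3,gcs");
// Hypothetical companion key: the provider to use when a path carries no scheme.
configuration.set(FS_DEFAULT_FILESYSTEM_PROVIDER, "s3");
FileSystem gvfs =
    FileSystem.get(URI.create("gvfs://fileset/catalog/schema/fileset"), configuration);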

docs/hadoop-catalog.md (outdated review thread, resolved)
@jerryshao jerryshao merged commit b06b511 into apache:main Oct 16, 2024
26 checks passed