[#5019] feat: (hadoop-catalog): Add a framework to support multi-storage in a pluggable manner for fileset catalog #5020
@xiaozcy |
...g-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/DefaultConfigurationProvider.java
...atalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/FileSystemProvider.java
bundles/s3-bundle/src/main/java/org/apache/gravitino/fileset/s3/S3FileSystemProvider.java
...atalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java
...alog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/HDFSFileSystemProvider.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/FileSystemProvider.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/FileSystemProvider.java
...log-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/LocalFileSystemProvider.java
I'm OK with the interface design. Please polish the code to make it ready for review, and also refactor the client-side GravitinoVirtualFileSystem to leverage this interface so that the fs is pluggable there too.
I will use another PR to refactor the Java client for the fileset.
@jerryshao
Can you please carefully review your code several times to avoid typos, polish the code structure, and simplify the logic as much as you can?
(String)
    propertiesMetadata
        .catalogPropertiesMetadata()
        .getOrDefault(config, HadoopCatalogPropertiesMetadata.DEFAULT_FS_PROVIDER);
What is the default value for this property? Should it be "file"?
}

LOG.warn(
    "Can't get schema from path: {} and default filesystem provider is null, using"
Typo: it should be "scheme", not "schema".
  return getFileSystemByScheme(LOCAL_FILE_SCHEME, config, path);
}

return getFileSystemByScheme(path.toUri().getScheme(), config, path);
if (scheme == null && defaultFSProvider == null) {
LOG.warn(xxx)
}
newScheme = scheme == null ? defaultFSProvider.getScheme() : scheme;
getFileSystemByScheme(newScheme);
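For reference, a minimal Java sketch of the fallback order suggested above. It is not the PR's code: the `Provider` interface, the `SchemeResolution` class, and the `resolveScheme` helper are hypothetical names, and the final fallback to `"file"` is an assumption taken from the `LOCAL_FILE_SCHEME` branch in the snippet under review. Note that the pseudocode as written would dereference a null `defaultFSProvider` when both values are missing, so the sketch returns the local scheme in that case.

```java
import java.net.URI;

// Hypothetical stand-in for the PR's FileSystemProvider; only the scheme matters here.
interface Provider {
  String scheme();
}

final class SchemeResolution {
  // Assumed fallback, mirroring LOCAL_FILE_SCHEME in the snippet under review.
  static final String LOCAL_FILE_SCHEME = "file";

  // Order: scheme from the path, then the default provider's scheme, then local.
  static String resolveScheme(URI path, Provider defaultProvider) {
    String scheme = path.getScheme();
    if (scheme != null) {
      return scheme;
    }
    if (defaultProvider == null) {
      // Both are missing: warn once and fall back instead of hitting a NullPointerException.
      System.err.println(
          "Can't get scheme from path: " + path
              + " and default file system provider is null, using " + LOCAL_FILE_SCHEME);
      return LOCAL_FILE_SCHEME;
    }
    return defaultProvider.scheme();
  }

  public static void main(String[] args) {
    Provider hdfs = () -> "hdfs";
    System.out.println(resolveScheme(URI.create("s3a://bucket/key"), hdfs));
    System.out.println(resolveScheme(URI.create("/tmp/fileset"), hdfs));
    System.out.println(resolveScheme(URI.create("/tmp/fileset"), null));
  }
}
```

The resolved scheme would then be passed to a lookup such as `getFileSystemByScheme`, so the map lookup happens in exactly one place.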
@@ -90,6 +90,10 @@ public class HadoopCatalogOperations implements CatalogOperations, SupportsSchem

  private CatalogInfo catalogInfo;

  private final Map<String, FileSystemProvider> fileSystemProvidersMap = Maps.newHashMap();

  private String defaultFilesystemProvider;
You can maintain a default FileSystemProvider, not just a string. Besides, it is FileSystem, not Filesystem.
    throws IOException {
  FileSystemProvider provider = fileSystemProvidersMap.get(scheme);
  if (provider == null) {
    throw new IllegalArgumentException("Unsupported scheme: " + scheme);
You should clearly tell the user why the exception happened; with the current message it is not easy for the user to work out the actual reason.
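One possible shape for a more helpful message, as a sketch only: `ProviderLookup` and `getProvider` are hypothetical names, the map is generic so that the example compiles without the PR's `FileSystemProvider` type, and the wording of the message is a suggestion rather than what the PR ended up using.

```java
import java.util.Map;
import java.util.TreeSet;

final class ProviderLookup {
  // Looks up a provider by scheme and explains what is registered when the lookup fails.
  static <T> T getProvider(Map<String, T> providers, String scheme) {
    T provider = providers.get(scheme);
    if (provider == null) {
      // Sort the registered schemes so that the message is stable across runs.
      throw new IllegalArgumentException(
          String.format(
              "No file system provider is registered for scheme '%s'. Registered schemes: %s. "
                  + "Check the path and make sure the matching provider is on the classpath.",
              scheme, new TreeSet<>(providers.keySet())));
    }
    return provider;
  }
}
```

Listing the registered schemes tells the user whether the path is wrong or a provider jar is missing.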
PropertyEntry.stringOptionalPropertyEntry(
    DEFAULT_FS_PROVIDER,
    "Default file system provider, used to create the default file system "
        + "candidate value is 'local', 'hdfs' or others specified in the "
The candidate value should be 'file', not 'local'.
@@ -71,6 +75,11 @@ tasks.build {
  dependsOn("javadoc")
}

tasks.compileJava {
Why do we need this?
This module depends on catalog-hadoop. However, the tasks :catalogs:catalog-hadoop:jar and :catalogs:catalog-hadoop:runtimeJars write to the same output directory (lib/build/), so we need to add these two dependencies or the Gradle compile will fail. The same goes for other modules, for example:
tasks.test {
  val skipITs = project.hasProperty("skipITs")
  if (skipITs) {
    exclude("**/integration/test/**")
  } else {
    dependsOn(":catalogs:catalog-hadoop:jar", ":catalogs:catalog-hadoop:runtimeJars")
    dependsOn(":catalogs:catalog-hive:jar", ":catalogs:catalog-hive:runtimeJars")
    dependsOn(":catalogs:catalog-kafka:jar", ":catalogs:catalog-kafka:runtimeJars")
  }
}
...adoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogPropertiesMetadata.java
this.defaultFileSystemProviderScheme =
    StringUtils.isNotBlank(defaultFileSystemProviderClassName)
        ? FileSystemUtils.getSchemeByFileSystemProvider(
            defaultFileSystemProviderClassName, fileSystemProvidersMap)
        : LOCAL_FILE_SCHEME;
You can create the default fs provider and get the scheme from it; there is no need for this logic.
If we use the service loader mechanism to load file system providers, I don't see why we need to provide the configuration filesystem-providers. Please see below:
// fileSystemProviders is useless if we use ServiceLoader
public static Map<String, FileSystemProvider> getFileSystemProviders(String fileSystemProviders) {
  Map<String, FileSystemProvider> resultMap = Maps.newHashMap();
  ServiceLoader<FileSystemProvider> fileSystemProvidersLoader =
      ServiceLoader.load(FileSystemProvider.class);
  fileSystemProvidersLoader.forEach(
      fileSystemProvider -> resultMap.put(fileSystemProvider.name(), fileSystemProvider));
  return resultMap;
}
Then the only configuration we need to provide is default-filesystem-provider. @jerryshao, what do you think? Is that okay with you?
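To make the proposal concrete, here is a rough sketch of discovering providers through `ServiceLoader` and then choosing the default one by name. The `NamedProvider` interface and the `ProviderRegistry` class are hypothetical and much smaller than the PR's `FileSystemProvider`, and the provider names in the usage note are made up for illustration. `ServiceLoader.load` only finds implementations that are registered under `META-INF/services` (or in a module descriptor), so the indexing step accepts any `Iterable` and can be exercised without that registration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.ServiceLoader;

// Hypothetical minimal contract; the PR's interface has more methods.
interface NamedProvider {
  String name();

  String scheme();
}

final class ProviderRegistry {
  // Indexes providers by name; a ServiceLoader is an Iterable, so it can be passed directly.
  static Map<String, NamedProvider> index(Iterable<NamedProvider> discovered) {
    Map<String, NamedProvider> byName = new HashMap<>();
    for (NamedProvider provider : discovered) {
      byName.put(provider.name(), provider);
    }
    return byName;
  }

  // Discovers every implementation registered for NamedProvider on the classpath.
  static Map<String, NamedProvider> loadFromClasspath() {
    return index(ServiceLoader.load(NamedProvider.class));
  }

  // Resolves the single configured default provider; its scheme is then read from it.
  static NamedProvider defaultProvider(Map<String, NamedProvider> byName, String defaultName) {
    NamedProvider provider = byName.get(defaultName);
    if (provider == null) {
      throw new IllegalArgumentException(
          "Default file system provider '" + defaultName + "' was not found; available: "
              + byName.keySet());
    }
    return provider;
  }
}
```

With this shape, a call such as `ProviderRegistry.defaultProvider(ProviderRegistry.loadFromClasspath(), "some-provider-name").scheme()` covers both points raised in this thread: no provider list has to be configured, and the default scheme comes from the default provider itself.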
.../catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemProvider.java
...atalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogOperations.java
...alog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/HDFSFileSystemProvider.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java
...adoop/src/main/java/org/apache/gravitino/catalog/hadoop/HadoopCatalogPropertiesMetadata.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java
@@ -125,6 +132,10 @@ public void initialize(URI name, Configuration configuration) throws IOException

  initializeClient(configuration);

  // Register the default local and HDFS FileSystemProvider
  String fileSystemProviders = configuration.get(FS_FILESYSTEM_PROVIDERS);
  fileSystemProvidersMap.putAll(FileSystemUtils.getFileSystemProviders(fileSystemProviders));
gvfs will depend on the hadoop-catalog. Can you please check the dependencies of gvfs to make sure the shaded jar is correct?
Some existing ITs already cover the current code. Moreover, I verified this logic when developing the Python GVFS client.
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java
...ogs/catalog-hadoop/src/main/java/org/apache/gravitino/catalog/hadoop/fs/FileSystemUtils.java
@@ -125,6 +132,10 @@ public void initialize(URI name, Configuration configuration) throws IOException

  initializeClient(configuration);

  // Register the default local and HDFS FileSystemProvider
  String fileSystemProviders = configuration.get(FS_FILESYSTEM_PROVIDERS);
Do you also need a default fileSystemProvider for gvfs?
What changes were proposed in this pull request?
Add a framework to support multiple storage systems within the Hadoop catalog.
Why are the changes needed?
Some users want Gravitino to manage file systems such as S3 or GCS.
Fix: #5019
Does this PR introduce any user-facing change?
N/A.
How was this patch tested?
Existing tests.