-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support create object store source tables without depending on environment variables #5732
Conversation
… store table through ListingTableFactory
datafusion/core/Cargo.toml
Outdated
@@ -85,7 +81,7 @@ lazy_static = { version = "^1.4.0" } | |||
log = "^0.4" | |||
num-traits = { version = "0.2", optional = true } | |||
num_cpus = "1.13.0" | |||
object_store = "0.5.4" | |||
object_store = { version = "0.5.4", features = ["aws"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about introducing cloud storage features into datafusion/core. Perhaps we could push down options into object_store through a general-purpose interface and handle the s3/oss logic there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinks for this feedback.
I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.
Which do you think is better? happy to discuss this 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.
I think the desire has been to keep ObjectStore specific dependencies out of DataFusion core (though it would be good to document that somewhere). DataFusion is already a large codebase to begin with and given it has hooks to extend the functionality I think we should be leveraging the extension points.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @r4ntix -- I agree with @yjshen that we are trying to keep DataFusion agnostic to different object store implementations, so ideally this configuration would be handled by users of DataFusion (such as the datafusion-cli).
Among other reasons, we currently don't have any CI coverage for these object store implementations.
@r4ntix -- could you possibly try to move the object store configuration into datafusion-cli
? Perhaps you can look at how ballista does this: https://github.com/apache/arrow-ballista
datafusion/core/Cargo.toml
Outdated
@@ -85,7 +81,7 @@ lazy_static = { version = "^1.4.0" } | |||
log = "^0.4" | |||
num-traits = { version = "0.2", optional = true } | |||
num_cpus = "1.13.0" | |||
object_store = "0.5.4" | |||
object_store = { version = "0.5.4", features = ["aws"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.
I think the desire has been to keep ObjectStore specific dependencies out of DataFusion core (though it would be good to document that somewhere). DataFusion is already a large codebase to begin with and given it has hooks to extend the functionality I think we should be leveraging the extension points.
|
||
state | ||
.runtime_env() | ||
.register_object_store(table_path.as_ref(), store); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if you can move this code for registering the object store dynamically out of core and into datafusion-cli ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion!
Considering that there is already a dependency on object_store in the current datafusion core, I think it would be a natural thing to support different object store implementations directly in datafusion core.
@alamb @yjshen
Is it possible to make this as an optional feature of datafusion core? That way it would be optional and out-of-the-box for CLI or other services that use Datafusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb @yjshen I created a new branch and moved the code from datafusion core to datafusion-cli: r4ntix@f988046
Because the necessary info of object store is dynamically fetched from the SQL(cmd.options), there is an additional sql parsing overhead here:
async fn exec_and_print(
ctx: &mut SessionContext,
print_options: &PrintOptions,
sql: String,
) -> Result<()> {
let now = Instant::now();
// parsing sql to get external table information
let plan = ctx.state().create_logical_plan(&sql).await?;
let df = match plan {
LogicalPlan::CreateExternalTable(cmd) => {
create_external_table(&ctx, &cmd)?;
ctx.sql(&sql).await?
}
_ => ctx.sql(&sql).await?,
};
let results = df.collect().await?;
print_options.print_batches(&results, now)?;
Ok(())
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think r4ntix@f988046 looks good too me -- if you want to get rid of the overhead of calling sql
twice you could perhaps allow the dispatch code on SessionContext diectly.
Like maybe add a
impl SessionContext {
pub async fn sql(&self, sql: &str) -> Result<DataFrame> {
// create a query planner
let plan = self.state().create_logical_plan(sql).await?;
self.execute_logical_plan(plan)?
}
// Execute a LogicalPlan (including command such as CreateExternalTable, etc)
fn execute_logical_plan(&self, plan: LogicalPlan) -> Result<DataFrame> {
match plan {
...
}
https://docs.rs/datafusion/latest/src/datafusion/execution/context.rs.html#318-320
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll submit a new commit for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @r4ntix -- I think this looks good to me. Thank you for moving the code.
I had some small documentation suggestions.
My biggest concern / question is about testing -- specifically do you think it is worth writing tests (especially for error conditions) for options
? I am mostly worried about potentially breaking this functionality with some future refactor but not catching that because of lack fo test coverage
docs/source/user-guide/cli.md
Outdated
- session_token | ||
- region | ||
|
||
It is also simplify sql statements by environment variables. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also simplify sql statements by environment variables. | |
It is also possible to simplify sql statements by environment variables. |
docs/source/user-guide/cli.md
Outdated
|
||
The CLI can query data in S3 if the following environment variables are defined: | ||
S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we should define S3 by adding a link:
S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. | |
[AWS S3](https://aws.amazon.com/s3/) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
docs/source/user-guide/cli.md
Outdated
```bash | ||
$ aws s3 cp test.csv s3://my-bucket/ | ||
upload: ./test.csv to s3://my-bucket/test.csv | ||
OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. | |
[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
docs/source/user-guide/cli.md
Outdated
|
||
## Registering GCS Data Sources | ||
|
||
GCS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. | |
[Google Cloud Storage](https://cloud.google.com/storage) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. |
} | ||
|
||
#[cfg(test)] | ||
mod tests { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think there is any value to porting these tests?
datafusion-cli/src/object_storage.rs
Outdated
@@ -1,193 +0,0 @@ | |||
// Licensed to the Apache Software Foundation (ASF) under one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice that we can consolidate the usage
@alamb Thank you very much for your very detailed advice! About tests, I will add |
I agree integration tests would be great, but also that they would take non trivial time to implement and maintain Unit tests that verify we at least setup the object store correctly I think would be sufficient initially |
Agree, I will add the unit tests as soon as possible. |
the CI test https://github.com/apache/arrow-datafusion/actions/runs/4617873525/jobs/8164772317?pr=5732 seems to have failed due to environmental reasons. I have restarted it |
let err = create_external_table_test(location, &sql) | ||
.await | ||
.unwrap_err(); | ||
assert!(err.to_string().contains("No RSA key found in pem file")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for negative tests
Which issue does this PR close?
Closes #5731
Rationale for this change
See #5731
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?