Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support create object store source tables without depending on environment variables #5732

Merged
merged 11 commits into from
Apr 5, 2023

Conversation

r4ntix
Copy link
Contributor

@r4ntix r4ntix commented Mar 25, 2023

Which issue does this PR close?

Closes #5731

Rationale for this change

See #5731

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Mar 25, 2023
@r4ntix r4ntix marked this pull request as ready for review March 25, 2023 12:31
@@ -85,7 +81,7 @@ lazy_static = { version = "^1.4.0" }
log = "^0.4"
num-traits = { version = "0.2", optional = true }
num_cpus = "1.13.0"
object_store = "0.5.4"
object_store = { version = "0.5.4", features = ["aws"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned about introducing cloud storage features into datafusion/core. Perhaps we could push down options into object_store through a general-purpose interface and handle the s3/oss logic there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinks for this feedback.
I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.
Which do you think is better? happy to discuss this 😄

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.

I think the desire has been to keep ObjectStore specific dependencies out of DataFusion core (though it would be good to document that somewhere). DataFusion is already a large codebase to begin with and given it has hooks to extend the functionality I think we should be leveraging the extension points.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @r4ntix -- I agree with @yjshen that we are trying to keep DataFusion agnostic to different object store implementations, so ideally this configuration would be handled by users of DataFusion (such as the datafusion-cli).

Among other reasons, we currently don't have any CI coverage for these object store implementations.

@r4ntix -- could you possibly try to move the object store configuration into datafusion-cli? Perhaps you can look at how ballista does this: https://github.com/apache/arrow-ballista

@@ -85,7 +81,7 @@ lazy_static = { version = "^1.4.0" }
log = "^0.4"
num-traits = { version = "0.2", optional = true }
num_cpus = "1.13.0"
object_store = "0.5.4"
object_store = { version = "0.5.4", features = ["aws"] }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the question here is whether we enable access to cloud storage by default for datafusion. If yes, it is the same whether to put the options in datafusion or in object_store. If not, I can make this as an optional feature for datafusion/core, and turn it on by default in the datafusion-cli.

I think the desire has been to keep ObjectStore specific dependencies out of DataFusion core (though it would be good to document that somewhere). DataFusion is already a large codebase to begin with and given it has hooks to extend the functionality I think we should be leveraging the extension points.


state
.runtime_env()
.register_object_store(table_path.as_ref(), store);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you can move this code for registering the object store dynamically out of core and into datafusion-cli ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion!

Considering that there is already a dependency on object_store in the current datafusion core, I think it would be a natural thing to support different object store implementations directly in datafusion core.

@alamb @yjshen
Is it possible to make this as an optional feature of datafusion core? That way it would be optional and out-of-the-box for CLI or other services that use Datafusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @yjshen I created a new branch and moved the code from datafusion core to datafusion-cli: r4ntix@f988046

Because the necessary info of object store is dynamically fetched from the SQL(cmd.options), there is an additional sql parsing overhead here:

async fn exec_and_print(
    ctx: &mut SessionContext,
    print_options: &PrintOptions,
    sql: String,
) -> Result<()> {
    let now = Instant::now();

    // parsing sql to get external table information
    let plan = ctx.state().create_logical_plan(&sql).await?;
    let df = match plan {
        LogicalPlan::CreateExternalTable(cmd) => {
            create_external_table(&ctx, &cmd)?;
            ctx.sql(&sql).await?
        }
        _ => ctx.sql(&sql).await?,
    };

    let results = df.collect().await?;
    print_options.print_batches(&results, now)?;

    Ok(())
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think r4ntix@f988046 looks good too me -- if you want to get rid of the overhead of calling sql twice you could perhaps allow the dispatch code on SessionContext diectly.

Like maybe add a

impl SessionContext {
    pub async fn sql(&self, sql: &str) -> Result<DataFrame> {
        // create a query planner
        let plan = self.state().create_logical_plan(sql).await?;
        self.execute_logical_plan(plan)?
}

  // Execute a LogicalPlan (including command such as CreateExternalTable, etc)
  fn execute_logical_plan(&self, plan: LogicalPlan) -> Result<DataFrame> {
        match plan {
         ...
   }

https://docs.rs/datafusion/latest/src/datafusion/execution/context.rs.html#318-320

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll submit a new commit for this

@r4ntix
Copy link
Contributor Author

r4ntix commented Apr 1, 2023

Hi @alamb @yjshen Thank you for the suggestions! I've submitted a new commit, cloud you help review it?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @r4ntix -- I think this looks good to me. Thank you for moving the code.

I had some small documentation suggestions.

My biggest concern / question is about testing -- specifically do you think it is worth writing tests (especially for error conditions) for options? I am mostly worried about potentially breaking this functionality with some future refactor but not catching that because of lack fo test coverage

- session_token
- region

It is also simplify sql statements by environment variables.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
It is also simplify sql statements by environment variables.
It is also possible to simplify sql statements by environment variables.


The CLI can query data in S3 if the following environment variables are defined:
S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we should define S3 by adding a link:

Suggested change
S3 data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
[AWS S3](https://aws.amazon.com/s3/) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.

```bash
$ aws s3 cp test.csv s3://my-bucket/
upload: ./test.csv to s3://my-bucket/test.csv
OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
OSS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
[Alibaba cloud OSS](https://www.alibabacloud.com/product/object-storage-service) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.


## Registering GCS Data Sources

GCS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
GCS data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.
[Google Cloud Storage](https://cloud.google.com/storage) data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement.

}

#[cfg(test)]
mod tests {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you think there is any value to porting these tests?

@@ -1,193 +0,0 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice that we can consolidate the usage

@r4ntix
Copy link
Contributor Author

r4ntix commented Apr 4, 2023

Thanks @r4ntix -- I think this looks good to me. Thank you for moving the code.

I had some small documentation suggestions.

My biggest concern / question is about testing -- specifically do you think it is worth writing tests (especially for error conditions) for options? I am mostly worried about potentially breaking this functionality with some future refactor but not catching that because of lack fo test coverage

@alamb Thank you very much for your very detailed advice!

About tests, I will add Unit tests for this first. Secondly, I am currently considering whether we can add Integration Tests, e.g. emulator tests of arrow-rs/object_store, but this may take extra time.

@alamb
Copy link
Contributor

alamb commented Apr 4, 2023

About tests, I will add Unit tests for this first. Secondly, I am currently considering whether we can add Integration Tests, e.g. emulator tests of arrow-rs/object_store, but this may take extra time.

I agree integration tests would be great, but also that they would take non trivial time to implement and maintain

Unit tests that verify we at least setup the object store correctly I think would be sufficient initially

@r4ntix
Copy link
Contributor Author

r4ntix commented Apr 4, 2023

I agree integration tests would be great, but also that they would take non trivial time to implement and maintain

Unit tests that verify we at least setup the object store correctly I think would be sufficient initially

Agree, I will add the unit tests as soon as possible.

@r4ntix r4ntix requested a review from alamb April 5, 2023 15:16
@alamb
Copy link
Contributor

alamb commented Apr 5, 2023

the CI test https://github.com/apache/arrow-datafusion/actions/runs/4617873525/jobs/8164772317?pr=5732 seems to have failed due to environmental reasons. I have restarted it

let err = create_external_table_test(location, &sql)
.await
.unwrap_err();
assert!(err.to_string().contains("No RSA key found in pem file"));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 for negative tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support create object store source tables without depending on environment variables
3 participants