Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SIP-95] Proposal for Catalog Support in SQL Lab #22862

Closed
JELGT2011 opened this issue Jan 26, 2023 · 22 comments
Closed

[SIP-95] Proposal for Catalog Support in SQL Lab #22862

JELGT2011 opened this issue Jan 26, 2023 · 22 comments
Assignees
Labels
sip Superset Improvement Proposal

Comments

@JELGT2011
Copy link

JELGT2011 commented Jan 26, 2023

[SIP] Proposal for Catalog Support in SQL Lab

Motivation

The current SQL Lab explorer is not very intuitive as there is no way to explore catalogs, and can only explore the default catalog.

Proposed Change

Add a new Catalog dropdown in the sqllab editor (similar to schema).

image

New or Changed Public Interfaces

This should only affect the sqllab query tab. The Catalog dropdown will be similar enough to the Schema explorer. The word schema is kind of overloaded throughout Superset, so it's not entirely clear to me yet whether a new endpoint is necessary.

New dependencies

None

Migration Plan and Compatibility

None

Rejected Alternatives

None

@betodealmeida
Copy link
Member

I would love to see this, but there are few complications. The biggest one is that we need support from the SQL Alchemy dialects, and I'm not sure how many support dynamic catalogs. For example, both Hive and Trino require the catalog to be specified when the connection is created:

  • trino://<username>:<password>@<host>:<port>/<catalog>/<schema>
  • hive://<host>:<port>/<catalog>

For engines like this the user would have to choose a default catalog when the database is added, and each DB engine spec in Superset would have to know how to replace the catalog in the URL, in case a user chooses one different from the default after the DB has been created.

It's definitely doable, and something we should do. We've been slowly adding support for catalogs:

catalog: Optional[str] = None

Are you planning to work on this, or are you just proposing it? If you're not going to work on it I can start taking a look.

@JELGT2011
Copy link
Author

I was planning to work on this, but of course I’d love help as I’ve never made a contribution to Superset. I was planning on doing a smaller task first (adding a configurable header/footer) since I figured this one might be complex.

@betodealmeida
Copy link
Member

I was planning to work on this, but of course I’d love help as I’ve never made a contribution to Superset. I was planning on doing a smaller task first (adding a configurable header/footer) since I figured this one might be complex.

Ah, great! From the wording of the SIP I wasn't sure, and people are welcome to propose changes even if they're not going to work on it. I'm happy to help in any way you need, let me know!

@afilipchik
Copy link

hey @betodealmeida, good to see you are still pushing Superset forward!

We worked around originally by concatenating catalog with schema, but patching got nasty, as we would pass it thought to Trino driver.

Here is how it looks:
image

@betodealmeida
Copy link
Member

betodealmeida commented Jan 26, 2023

I think a good approach for this would be modifying BaseEngineSpec.get_engine to allow passing a catalog. The method would then pass it to get_sqla_engine_with_context, which would call a DB engine specific method to mutate the SQLAlchemy URI so it uses the requested catalog. Something like (pseudocode):

BaseEngineSpec:

    @classmethod
    def get_engine(
        cls,
        database: Database,
        catalog: Optional[str] = None,
        schema: Optional[str] = None,
        source: Optional[QuerySource] = None,
    ) -> ContextManager[Engine]:
        # get_sqla_engine_with_context needs to call database.engine_spec.apply_catalog_to_sqlalchemy_uri
        return database.get_sqla_engine_with_context(
            catalog=catalog,
            schema=schema,
            source=source
        )
    
    @classmethod
    def apply_catalog_to_sqlalchemy_uri(cls, database: Database, catalog: str) -> URL:
        return database.sqlalchemy_uri

    @classmethod
    def get_catalog_names(
        cls,
        database: Database,
        inspect: Inspector,
    ) -> Set[str]:
        return set()

Individual DB engine specs like Trino/Hive/Snowflake/BigQuery would implement custom apply_catalog_to_sqlalchemy_uri methods to build the SQLAlchemy URI for a given catalog. They would also implement custom get_catalog_names to list all available catalogs, since there's no standard SQLAlchemy way of doing that, as far as I know.

@betodealmeida
Copy link
Member

betodealmeida commented Mar 14, 2023

Actually, we would need the apply_catalog_to_sqlalchemy_uri to also pass schema:

    @classmethod
    def apply_catalog_to_sqlalchemy_uri(
        cls,
        database: Database,
        catalog: Optional[str],
        schema: Optional[str],
    ) -> URL:
        try:
            connect_args = database.get_extra()["engine_params"]["connect_args"]
        except KeyError:
            connect_args = {}

        return database.sqlalchemy_uri, connect_args

For Postgres, this would look like this (untested):

    @classmethod
    def apply_catalog_to_sqlalchemy_uri(
        cls,
        database: Database,
        catalog: Optional[str],
        schema: Optional[str],
    ) -> Tuple[URL, Dict[str, Any]]:
        url, connect_args = super().apply_catalog_to_sqlalchemy_uri(database, catalog, schema)

        if catalog:
            # for Postgres a catalog is a database
            url = url.set(database=catalog)

        if schema:
            connect_args["options"] = f"-csearch_path={schema}"

        return url, connect_args

See the discussion at #23356.

@betodealmeida
Copy link
Member

Actually we already have a method to change the schema (adjust_database_uri), we should piggyback on that.

@betodealmeida
Copy link
Member

While working on an issue with schemas, I've done some foundational work for supporting catalogs:

Once we have that, I think we would still need the following steps:

  1. Implement catalog-level permissions (@dpgaspar can probably help here). We also need to modify the parser to extract the catalog from queries, and have the security manager handle catalogs when checking for access.
  2. Implement a get_catalog_names in the DB engine specs that support catalogs (Presto, Trino, Snowflake, BigQuery, Postgres, ...?). Also update their adjust_engine_params methods to modify the SQL Alchemy URI to use the chosen catalog.
  3. Add a catalog dropdown to SQL Lab, and make sure the selected catalog is being passed to _get_sqla_engine in the database model, and from there passed to adjust_engine_params.
  4. Modify the dataset model and add a catalog attribute.
  5. Add a catalog dropdown to the dataset creator.

@betodealmeida
Copy link
Member

I'm going to work on (1) and (2) this week, then we can put the SIP to vote and you (or someone else) can work on the UI part, @JELGT2011.

@JELGT2011
Copy link
Author

that's awesome news @betodealmeida! I've been a little delayed trying to upgrade our existing instance of superset, but I could definitely do some frontend changes (3), (4?), and (5). I think I have a draft work in progress on it already so that works out.

@rusackas rusackas added the sip Superset Improvement Proposal label Jun 7, 2023
@rusackas rusackas changed the title [SIP] Proposal for Catalog Support in SQL Lab [SIP-95] Proposal for Catalog Support in SQL Lab Jun 7, 2023
@rusackas
Copy link
Member

rusackas commented Jun 7, 2023

Numbering this as SIP-95. If you'd like to continue pursuing this, pleas submit it for a DISCUSS thread on the Superset @dev mailing list, so we can start progressing it toward a vote. Thanks!

@rusackas
Copy link
Member

rusackas commented Oct 3, 2023

It seems like this work is incrementally moving forward, but the SIP itself has still not been brought up for a DISCUSS thread or a VOTE thread on the Apache mailing list. @JELGT2011 / @betodealmeida is there any need or intent to continue with the SIP process here?

@kdazzle
Copy link

kdazzle commented Oct 5, 2023

It would be huge to get this going for Databricks. I could probably contribute - especially since part of the approach was laid out for different engines

@guenp
Copy link
Contributor

guenp commented Dec 5, 2023

I also support putting this SIP-95 up for a DISCUSS/VOTE thread! @motherduckdb has a similar scenario where a single DuckDB instance can connect to multiple databases (see https://duckdb.org/docs/sql/statements/attach.html). It would be great to be able to use the Catalog drop-down to scope the Schema options.

@rusackas
Copy link
Member

rusackas commented Feb 1, 2024

@guenp please feel free to kick off the Discussion thread on the [email protected] mailing list! Let me know if you need help with that, you can ping me here or on Superset Slack

@rusackas
Copy link
Member

rusackas commented Apr 3, 2024

Hi @JELGT2011 / @guenp if anyone wants to move this SIP forward, please put it up for a [DISCUSS] thread on the mailing list. Otherwise, it's at risk of being closed/discarded.

@betodealmeida
Copy link
Member

I'll bring it up for a discussion, I don't think we're far away from finishing this.

@john-bodley
Copy link
Member

@betodealmeida regarding your comment in the "[DISCUSS] [SIP-95] Proposal for Catalog Support in SQL Lab" email,

I'd like to increase the scope of SIP-95 by introducing catalogs not only in SQL Lab, but throughout the whole application.

I totally agree as it likely is a necessary requirement if one wanted to explore a SQL Lab result set.

@betodealmeida
Copy link
Member

@john-bodley right, I just called that out because the original SIP mentioned only SQL Lab in the title, but I don't see how that would work. 🙃

@rusackas
Copy link
Member

rusackas commented May 7, 2024

@betodealmeida I think you just need to close the vote for this with a [RESULT] thread

@betodealmeida
Copy link
Member

@betodealmeida I think you just need to close the vote for this with a [RESULT] thread

Done!

@betodealmeida
Copy link
Member

It's official, we now have catalog support in Apache Superset. Currently Postgres and Databricks are supported, and I have a PR out for:

  • Trino
  • Presto
  • BigQuery
  • Snowflake

Adding support to new DB engine specs should be straightforward, requiring just 3 methods. I suggest using #28416 as a reference implementation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sip Superset Improvement Proposal
Projects
Status: Implemented / Done
Development

No branches or pull requests

9 participants