[SIP-7] Proposal for ensuring Superset accurately represents the underlying data warehouse
Motivation
Over time a discrepancy can develop between the actual metadata of a Druid datasource or SQLA table and the corresponding Superset representation. Columns or metrics can be added, updated, or deleted, and thus drift between the systems grows over time. The proposal is to add functionality that better ensures consistency between Superset and the underlying data warehouses, which should help boost user confidence.
Proposed Change
Currently there is a mechanism to refresh Druid metadata per the following menu item:
Thus the plan would be to also include an option for refreshing SQLA table metadata. We could optionally also provide a mechanism to scan for new SQLA tables, though we should be aware that the scale of datasources (and their corresponding columns/metrics) could negatively impact the performance of Superset.
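For the SQLA case, one way the refresh (and the optional scan for new tables) could gather the warehouse-side metadata is via SQLAlchemy's inspector API. The sketch below is illustrative only; the fetch_warehouse_metadata helper and its return shape are assumptions, not existing Superset code:

```python
# Minimal sketch, assuming the caller supplies the database's SQLAlchemy URI;
# not taken from the Superset codebase.
from sqlalchemy import create_engine, inspect


def fetch_warehouse_metadata(sqlalchemy_uri, schema=None):
    """Return {table_name: {column_name: column_type}} for the given schema."""
    engine = create_engine(sqlalchemy_uri)
    inspector = inspect(engine)
    metadata = {}
    for table_name in inspector.get_table_names(schema=schema):
        metadata[table_name] = {
            col["name"]: str(col["type"])
            for col in inspector.get_columns(table_name, schema=schema)
        }
    return metadata
```

The same structure could drive both the per-table refresh (look up one table) and the scan for new tables (compare the full table list against what Superset already knows about).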
Refreshing
Regarding the term "refreshing", I propose the following be implemented for both Druid and SQLA datasources (see the sketch after the list below).
Columns/Metrics
New entities are added
Existing entities are updated
Obsolete entities are deleted†
†Note derived entities which reference an obsolete entity will also be deleted.
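To make the add/update/delete semantics concrete, here is a minimal sketch of the kind of diff the refresh could compute; diff_columns and the column_name-to-type mappings it consumes are hypothetical, not part of Superset:

```python
# Hypothetical diff of warehouse columns vs. Superset's recorded columns,
# illustrating the add/update/delete semantics listed above.
def diff_columns(warehouse_cols, superset_cols):
    """Both arguments map column_name -> type string."""
    # New entities: present in the warehouse but unknown to Superset.
    to_add = {c: t for c, t in warehouse_cols.items() if c not in superset_cols}
    # Existing entities whose type (or other metadata) has changed.
    to_update = {
        c: t
        for c, t in warehouse_cols.items()
        if c in superset_cols and superset_cols[c] != t
    }
    # Obsolete entities: recorded in Superset but gone from the warehouse.
    to_delete = [c for c in superset_cols if c not in warehouse_cols]
    return to_add, to_update, to_delete
```

Derived metrics whose expressions reference a deleted column would be removed in the same pass, per the note above; that cascade is out of scope for this sketch.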
For reference here's the current UI behavior. Starting with a baseline:
When one drops a column (sum_boys in this instance) from the underlying table, the UI state remains unchanged and the query fails to execute (as expected) with the no such column: sum_boys error:
Finally, if one deletes the column and corresponding metrics from the Superset datasource, the UI state remains unchanged (also expected, as the state is defined in its entirety by the form-data), though the query never runs as Superset rightfully raises an error stating that the metric is invalid:
These behaviors seem correct and should remain unchanged.
Datasources
An unanswered question remains about what should happen when a Druid or SQLA datasource is defined in Superset but no longer resides in the underlying data warehouse. One school of thought is that if the underlying Druid datasource or SQLA table no longer exists we should:
Delete all slices which reference the datasource
Delete all empty dashboards (if appropriate)†
Delete the datasource
†Note I'm unsure what the current logic is for having a dashboard with no slices.
The concern with this approach is that it is a fairly destructive process, i.e., significant IP is potentially lost when one deletes a slice. Would it make more sense for the underlying institution to control when/how datasources are deleted? For example, one could define a policy that if said datasource has been deleted (and not restored) for n consecutive days, then it is probably safe to delete it from Superset.
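As an illustration of such a policy, the check below assumes the refresh job records when a datasource was first observed to be missing; first_seen_missing is a hypothetical field, not an existing Superset column:

```python
# Illustrative retention-policy check: a datasource only becomes eligible for
# deletion after it has been missing from the warehouse for N consecutive days.
from datetime import datetime, timedelta


def eligible_for_deletion(first_seen_missing, grace_period_days=30, now=None):
    """first_seen_missing: datetime when the refresh first failed to find the
    datasource, or None if it is currently present in the warehouse."""
    now = now or datetime.utcnow()
    return first_seen_missing is not None and (
        now - first_seen_missing >= timedelta(days=grace_period_days)
    )
```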
New dependencies
The following existing PRs are required, which ensure we have uniqueness at the datasource/column/metric level and that the appropriate fields are non-nullable:
[wtforms] Using wtforms-json which prevents encoding empty strings in the database #5445
[wtforms] Strip leading/trailing whitespace #7084
[schema] Updating the datasources schema #5451
[schema] Updating the base column schema #5452
[schema] Updating the base metric schema #5453
Note many of these PRs require migrations which may need manual intervention, as the lack of constraints and non-nullability may have resulted in a corrupted (and thus complex) database in which procedurally defined migration rules are non-viable.
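For context, the kind of uniqueness constraint these PRs introduce could be expressed in an Alembic migration roughly as follows. This is only a sketch and is not taken from the linked PRs; the table_columns table and its column names are assumed to mirror Superset's column model:

```python
# Sketch of an Alembic migration adding a column-level uniqueness constraint;
# illustrative only, not the actual migration shipped in the linked PRs.
from alembic import op


def upgrade():
    op.create_unique_constraint(
        "uq_table_columns_table_id_column_name",
        "table_columns",
        ["table_id", "column_name"],
    )


def downgrade():
    op.drop_constraint(
        "uq_table_columns_table_id_column_name",
        "table_columns",
        type_="unique",
    )
```

A migration like this will fail outright on databases that already contain duplicates, which is exactly why manual intervention may be needed before the constraints can be applied.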
Open Questions
Should we delete datasources which no longer exist in the data warehouse?
How do we deal with custom SQL definitions when the underlying datasource changes? I suspect this is out-of-scope.
to: @betodealmeida @michellethomas @mistercrunch @timifasubaa