perf: refactor SIP-68 db migrations with INSERT SELECT FROM #19421
Conversation
❗ Please consider rebasing your branch to avoid db migration conflicts.
```diff
@@ -586,7 +586,7 @@ class BaseColumn(AuditMixinNullable, ImportExportMixin):
     type = Column(Text)
     groupby = Column(Boolean, default=True)
     filterable = Column(Boolean, default=True)
-    description = Column(Text)
+    description = Column(MediumText())
```
`MediumText` is the current type for these fields; they were updated in db migrations at some point. Updating for consistency.
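For context, a minimal sketch of what a dialect-aware `MediumText` type helper can look like (Superset's actual helper may differ): it compiles to MySQL's `MEDIUMTEXT` and falls back to generic `TEXT` on other backends.

```python
from sqlalchemy import Text
from sqlalchemy.dialects.mysql import MEDIUMTEXT


def MediumText() -> Text:
    # Generic TEXT everywhere, MEDIUMTEXT (up to 16 MB) on MySQL,
    # so long SQL expressions are not silently truncated.
    return Text().with_variant(MEDIUMTEXT(), "mysql")
```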
```diff
@@ -130,6 +131,7 @@
     "sum",
     "doubleSum",
 }
+ADDITIVE_METRIC_TYPES_LOWER = {op.lower() for op in ADDITIVE_METRIC_TYPES}
```
`metric_type.lower()` is compared with `doubleSum`, which will always be false. Not sure if the original casing is used elsewhere, so I added a new variable.
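A tiny repro of the case-sensitivity issue this comment describes (the set names mirror the diff above; the metric value is made up):

```python
ADDITIVE_METRIC_TYPES = {
    "count",
    "sum",
    "doubleSum",
}
ADDITIVE_METRIC_TYPES_LOWER = {op.lower() for op in ADDITIVE_METRIC_TYPES}

metric_type = "doubleSum"  # mixed-case value, e.g. as reported by Druid

# Buggy check: "doublesum" can never be found in a set containing "doubleSum".
is_additive_buggy = metric_type.lower() in ADDITIVE_METRIC_TYPES
# Fixed check: compare against the lower-cased set instead.
is_additive_fixed = metric_type.lower() in ADDITIVE_METRIC_TYPES_LOWER
```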
superset/migrations/shared/utils.py

```python
"""
Return all the dependencies from a SQL sql_text.
"""
dialect = "generic"
for dialect, sqla_dialects in sqloxide_dialects.items():
    if sqla_dialect in sqla_dialects:
        break
sql_text = RE_JINJA_BLOCK.sub(" ", sql_text)
sql_text = RE_JINJA_VAR.sub("abc", sql_text)
```
Interpolate Jinja vars to give sqloxide a higher chance of successfully parsing the SQL text.
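A sketch of that pre-processing step; the regex patterns here are illustrative stand-ins for the migration's actual `RE_JINJA_BLOCK` / `RE_JINJA_VAR`:

```python
import re

# Stand-in patterns: drop {% ... %} / {# ... #} blocks entirely, and
# replace {{ ... }} variables with a harmless literal so the remaining
# text is still parseable SQL.
RE_JINJA_BLOCK = re.compile(r"\{[%#][^{}]*?[%#]\}")
RE_JINJA_VAR = re.compile(r"\{\{[^{}]*?\}\}")

sql_text = "SELECT * FROM logs WHERE ds = {{ ds }} {# a comment #}"
sql_text = RE_JINJA_BLOCK.sub(" ", sql_text)
sql_text = RE_JINJA_VAR.sub("abc", sql_text)
```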
```python
description = sa.Column(MediumText())
warning_text = sa.Column(MediumText())
unit = sa.Column(sa.Text)
```
@betodealmeida `unit` is in `superset.columns.models` but not in the migration script. Should I keep it?
```python
def upgrade():
    # Create tables for the new models.
    op.create_table(
```
The manual specification of these `create_table` commands is not needed anymore. Tables are now created with `Base.metadata.create_all(bind=bind, tables=new_tables)`.
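A minimal sketch of the `create_all` approach, with a stand-in model in place of Superset's actual SIP-68 models:

```python
import sqlalchemy as sa
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class NewTable(Base):  # stand-in for the real SIP-68 model
    __tablename__ = "sl_tables"
    id = sa.Column(sa.Integer, primary_key=True)


# Inside an Alembic migration, `bind` would come from op.get_bind();
# here a throwaway in-memory engine shows the call shape.
bind = sa.create_engine("sqlite://")
new_tables = [NewTable.__table__]
# One call creates all listed tables, replacing N explicit op.create_table(...)s.
Base.metadata.create_all(bind=bind, tables=new_tables)
```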
```diff
@@ -522,7 +524,7 @@ class SqlaTable(Model, BaseDatasource):  # pylint: disable=too-many-public-metho
     foreign_keys=[database_id],
 )
 schema = Column(String(255))
-sql = Column(Text)
+sql = Column(MediumText())
```
I found out some columns need to be `MediumText` only after noticing that SQL parsing was failing because some SQL statements were cut off when copied to the new table.
Codecov Report

```
@@             Coverage Diff             @@
##           master   #19421      +/-   ##
==========================================
- Coverage   66.51%   66.44%   -0.07%
==========================================
  Files        1690     1689       -1
  Lines       64616    64738     +122
  Branches     6656     6649       -7
==========================================
+ Hits        42978    43016      +38
- Misses      19937    20019     +82
- Partials     1701     1703      +2
```
```diff
@@ -30,9 +30,7 @@
 import sqlalchemy as sa
 from alembic import op
 from sqlalchemy import and_, func, or_
 from sqlalchemy.dialects import postgresql
 from sqlalchemy.sql.schema import Table
```
Bycatch: clean up unused imports.
```python
)
inspector = inspect(engine)

# add missing tables
```
This logic of syncing table schemas from datasources is removed; it should live in a separate offline or async process. Currently users have to hit a "Sync columns" button in the datasource editor to trigger this.
Per comment above, would recommend pulling out anything that has changed in the db structure that doesn't affect performance into a new migration.
@eschutho I propose to change the current migration to a no-op and move my updated code to a new migration. I DM'ed @betodealmeida and @hughhhh earlier on Slack. Reposting the messages here for visibility:

> Hi, I noticed we are making more adjustments to the SIP-68 models and have prepared a couple more db migrations. I'm wondering whether we should bundle all these migrations (including the first one that's already merged) into one new migration and change the original migration to a no-op.

Happy to incorporate #19487 and #19425 into my PR if they are still needed.
This is a lot of work, @ktmud! Thanks for bringing this to the next level and making sure Superset works at scale!

I left a few comments; there are a few places where I think you need to wrap the call to `NewTable.load_or_create` in a list.
@betodealmeida I think I've addressed all your comments (and also changed how physical columns for tables are added during migrations). Can you take another look?
This looks great, Jesse!

I get the following error when applying this migration, even after updating the column. I don't know alembic much, but I opened PR #19786 which I believe will resolve the problem. Feel free to adjust to your liking.
@cemremengu thanks for reporting this. I'll work on a fix.
There is another one @ktmud, not sure about the fix yet:
@cemremengu does changing L580 to `if is_physical and drivername and expression:` fix the error? It should not happen, but it seems somehow some of your physical datasets have empty strings as the table name.
No, if it's a connection string error, you won't even reach line 3613.
Yes, I removed the + part and still got the error unfortunately.
Yes, but it is strange since I have 41 datasets currently. Is there any chance that I did one of the previous migrations incorrectly?

EDIT: It says it copied all 41 though:
@cemremengu that's very strange. Can you downgrade and rerun the whole migration?
Downgrading combined with the fix worked.

PS: I had to manually update `nextval` for the `sl_*` table sequences, but that might be due to my previous attempts.
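On the `nextval` point: after a bulk INSERT ... SELECT that copies explicit ids, Postgres serial sequences lag behind the copied rows. A hypothetical helper (table and column names are assumptions, not from this PR) that builds the realignment statements:

```python
def sequence_fix_sql(table: str, pk: str = "id") -> str:
    """Build a Postgres statement that realigns a serial sequence with the
    table's current max primary key; INSERT ... SELECT with explicit ids
    does not advance the sequence on its own."""
    return (
        f"SELECT setval(pg_get_serial_sequence('{table}', '{pk}'), "
        f"COALESCE((SELECT MAX({pk}) FROM {table}), 1))"
    )


# One statement per new table; each would be executed against the database.
statements = [
    sequence_fix_sql(t) for t in ("sl_tables", "sl_datasets", "sl_columns")
]
```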
SUMMARY
This is another take on #19406 and #19416. The goal is to rewrite the bulk of the loading + rewriting operations from Python to native SQL statements by utilizing INSERT SELECT FROM. The whole migration happens in 3 steps:

1. Copy `tables`, `table_columns`, and `sql_metrics` to the new `sl_xxx` tables using INSERT SELECT FROM.
2. Link `sl_datasets` and `sl_columns` to the new tables.
3. Copy the extra attributes (`verbose_name`, `d3format`, etc) stored in the original tables under `extra_json`, plus `is_managed_externally` and `external_url`, to the new columns.

Compared to #19406, this PR was able to cut down the migration time for our Superset instance from 7 hours to 1 hour while retaining the full functionality.
While working on the migration, I also noticed a couple of shortcomings in the original implementation, including problems in the shadow-writing logic. Most notably, missing `MediumText` for MySQL.

Instead of creating multiple db migrations to address these issues, it'd be much easier for end users if we just skip the original migration (dropping data for those who already migrated) and bundle everything in another migration:
- `created_on`, `changed_on`, `created_by`, and `changed_by` are also copied over when appropriate.
- Text columns (`expression`, `description`, etc) are now `MediumText`. This is consistent with current column types in the old tables.
- `sqlatable_id` on `NewDataset` is removed. However, there is no 1:1 mapping for `NewTable` in the old models, so they acquire new `uuid`s. In order to link `NewTable` with associated columns, `sqlatable_id` is added to `NewTable` during the db migration, but dropped once the migration is complete.
- Added a `Table.sync_columns` method; in the future this should be consolidated with `get_virtual_table_metadata` and `get_physical_table_metadata`.
- `dataset.owners` are also added (copying PR feat(sip-68)(wip): Add owners to Dataset Model #19487).

Known issues and next steps
- Columns of `NewTable` will only be copied from the dataset when it is being created. We'd need another mechanism to sync table columns, either when re-populating after `fetch_metadata` or in an async script. Since the Table model is not used anywhere yet, I think it's okay to leave this to another followup PR.
- The `sync_columns` method for tables is not currently in use and does more or less the same as `SqlaTable.fetch_metadata`. I think it's okay to keep this somewhat-duplicate, not-in-active-use method, as `SqlaTable` will be removed eventually.
- `NewTable.columns` will be updated only when a physical `SqlaTable` is first synced to the new models and a `NewTable` is created. Updating columns in a `SqlaTable` dataset will NOT update `NewTable.columns`, as we consider table columns and dataset columns to be separate, and the metadata syncing should only happen in one direction---that is, from tables to datasets, not the other way around.
as we consider table columns and dataset columns to be separate and the metadata syncing should only happen in one direction---that is from tables to datasets, not the other way around.BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TESTING INSTRUCTIONS
To test the migration:

```shell
superset db downgrade 2ed890b36b94 && SHOW_PROGRESS=1 superset db upgrade a9422eeaae74
```
A couple of queries to verify the correctness of the migrated data:
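The actual verification queries did not survive the page scrape; a generic sketch of the kind of row-count check that validates an INSERT ... SELECT copy (SQLite in-memory, made-up data):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_columns (id INTEGER PRIMARY KEY, column_name TEXT);
    CREATE TABLE sl_columns (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table_columns (column_name) VALUES ('ds'), ('num'), ('val');
    -- the migrated copy, analogous to the PR's INSERT SELECT FROM:
    INSERT INTO sl_columns (id, name)
        SELECT id, column_name FROM table_columns;
""")
# After migrating, the row counts of old and new tables should match.
old_count = conn.execute("SELECT COUNT(*) FROM table_columns").fetchone()[0]
new_count = conn.execute("SELECT COUNT(*) FROM sl_columns").fetchone()[0]
```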
ADDITIONAL INFORMATION