
fix: Fix SQL type merging for pre-existing target tables #898

Merged

Conversation

BuzzCutNorman
Contributor

This is an attempt at fixing issue #374. The solution checks whether the current table column type can accommodate the new type; if it can't, it calls the merge_sql_types() function and alters the table as before.

closes #374
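A minimal sketch of the flow described above (hypothetical helper names passed in for illustration; the real logic lives in the SDK's _adapt_column_type method in singer_sdk/streams/sql.py, and the exact comparison strategy is debated later in this thread):

```python
# Hypothetical sketch, not the SDK's actual code: keep the existing
# column type when it already accommodates the incoming type, and only
# merge + ALTER when it does not.

def adapt_column_type_sketch(current_type, sql_type, merge_sql_types, alter_column):
    """current_type/sql_type stand in for SQLAlchemy type instances;
    merge_sql_types and alter_column are injected for illustration."""
    # The existing column already accommodates the incoming type.
    if str(sql_type) == str(current_type):
        return
    # Otherwise ask merge_sql_types() for a compatible type.
    compatible = merge_sql_types([current_type, sql_type])
    if str(compatible) == str(current_type):
        return  # the merged type is what the column already has
    # Only now alter the table, as before.
    alter_column(compatible)

calls = []
adapt_column_type_sketch("VARCHAR(5)", "VARCHAR(40)",
                         merge_sql_types=lambda types: types[-1],
                         alter_column=calls.append)
print(calls)  # the merged type, since an ALTER was needed
```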

@codecov

codecov bot commented Aug 9, 2022

Codecov Report

Merging #898 (7dac7c2) into main (4d92a61) will decrease coverage by 0.07%.
The diff coverage is 55.00%.

@@            Coverage Diff             @@
##             main     #898      +/-   ##
==========================================
- Coverage   81.25%   81.18%   -0.08%     
==========================================
  Files          36       36              
  Lines        3543     3555      +12     
  Branches      712      718       +6     
==========================================
+ Hits         2879     2886       +7     
- Misses        489      492       +3     
- Partials      175      177       +2     
Impacted Files Coverage Δ
singer_sdk/streams/sql.py 77.93% <55.00%> (-0.85%) ⬇️


@edgarrmondragon edgarrmondragon changed the title 374 exiting table merge sql types fix: Fix SQL type merging for pre-existing target tables Aug 11, 2022
@edgarrmondragon
Collaborator

Hi @BuzzCutNorman, this came up while working on #904. This seems like an important bug so thanks for tackling it!

Your patch does seem to work, so let me know when it's ready for the team to review 😄

cc @aaronsteers

@BuzzCutNorman
Contributor Author

Hi @edgarrmondragon, Yay!!! I am glad to hear it worked for you. 🥳 I found after some testing that there is some code I need to remove. I won't be able to get to it today, but I am planning to clean it up Monday.

cc @aaronsteers

@BuzzCutNorman BuzzCutNorman marked this pull request as ready for review August 15, 2022 16:35
@visch
Contributor

visch commented Aug 15, 2022

Hi @edgarrmondragon, Yay!!! I am glad to hear it worked for you. 🥳 I found after some testing that there is some code I need to remove. I won't be able to get to it today, but I am planning to clean it up Monday.

cc @aaronsteers

I like what you did better than my idea! Took me a second to understand it. I'll play with this fix in target-postgres. I'm a little concerned about the type-checking mechanism, but I need to play with it a little; generally it looks good. (It's hard for me to say "yes, let's do this" without a few examples working in the wild, so I may just be a bit timid.)

@BuzzCutNorman
Contributor Author

BuzzCutNorman commented Aug 15, 2022

Yes, please test a ton. I did some checks with the StackOverflow database, but that doesn't have that many column types. I just removed the issubclass check since it never got used, and when I forced it, the conversion didn't work. I had misunderstood how TypeEngine subclasses work.

@visch
Contributor

visch commented Aug 15, 2022

Tried running this with a test on target-postgres and I hit an odd error. @BuzzCutNorman, see MeltanoLabs/target-postgres#9 (comment)

@BuzzCutNorman
Contributor Author

@edgarrmondragon @aaronsteers Looks like there are still issues with this approach. I am going to put this back to draft until I can look at the areas @visch highlighted.

@BuzzCutNorman BuzzCutNorman marked this pull request as draft August 15, 2022 21:47
@BuzzCutNorman
Contributor Author

Reviewing

Played with this a little bit

    if isinstance(sql_type, type(current_type)):

    (Pdb) print(f"{sql_type=},{str(sql_type)=}. {current_type=}, {str(current_type)=}")
    sql_type=VARCHAR(),str(sql_type)='VARCHAR'. current_type=VARCHAR(), str(current_type)='VARCHAR'
    (Pdb) print(type(sql_type))
    <class 'sqlalchemy.sql.sqltypes.VARCHAR'>

This will work for our "generic" types like VARCHAR. If we get to caring about varchar(6) vs varchar(1000) and needing to change between them, this will present an issue, but for the generic SDK this might be alright!

This is where str(sql_type) might be better. Ideally __eq__ would work for us here, but SQLAlchemy overrides it for other purposes (I'm not well versed enough in SQLAlchemy to understand it all yet).
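A quick illustration of the difference (assumes SQLAlchemy is installed; the values mirror the pdb session above):

```python
# isinstance() treats VARCHAR(5) and VARCHAR(1000) as the same type,
# while comparing str() renderings also accounts for the length.
from sqlalchemy.types import VARCHAR

current_type = VARCHAR(5)   # e.g. the existing target column
sql_type = VARCHAR(1000)    # e.g. the incoming stream type

print(isinstance(sql_type, type(current_type)))  # True: both VARCHAR
print(str(current_type), str(sql_type))          # VARCHAR(5) VARCHAR(1000)
print(str(sql_type) == str(current_type))        # False: lengths differ
```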

Setup

My source is a 5,000-row version of the Users table from the StackOverflow database on a SQL Server 2019 instance.
My target is a blank data warehouse database on a PostgreSQL 14 server.

I ran a meltano run job that tapped the UsersTest table on my source and created a UsersTest table on the target. In MSSQL the table looks like this.

CREATE TABLE [dbo].[UsersTest](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[AboutMe] [nvarchar](max) NULL,
	[Age] [int] NULL,
	[CreationDate] [datetime] NOT NULL,
	[DisplayName] [nvarchar](40) NOT NULL,
	[DownVotes] [int] NOT NULL,
	[EmailHash] [nvarchar](40) NULL,
	[LastAccessDate] [datetime] NOT NULL,
	[Location] [nvarchar](100) NULL,
	[Reputation] [int] NOT NULL,
	[UpVotes] [int] NOT NULL,
	[Views] [int] NOT NULL,
	[WebsiteUrl] [nvarchar](200) NULL,
	[AccountId] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO

The newly created table in postgres looks like this.

CREATE TABLE IF NOT EXISTS public."UsersTest"
(
    "Id" integer,
    "AboutMe" character varying COLLATE pg_catalog."default",
    "Age" integer,
    "CreationDate" timestamp without time zone,
    "DisplayName" character varying COLLATE pg_catalog."default",
    "DownVotes" integer,
    "EmailHash" character varying COLLATE pg_catalog."default",
    "LastAccessDate" timestamp without time zone,
    "Location" character varying COLLATE pg_catalog."default",
    "Reputation" integer,
    "UpVotes" integer,
    "Views" integer,
    "WebsiteUrl" character varying COLLATE pg_catalog."default",
    "AccountId" integer
)

I added the following logging to the _adapt_column_type function to assist in reviewing the difference between str() and type().

if isinstance(sql_type, type(current_type)):
    # The current column and sql type are the same
    # Nothing to do
    self.logger.info("!!! I found a same to same match. !!!")
    self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(type(current_type))} ")
    self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(current_type)} ")
    return
elif isinstance(sql_type.as_generic(), type(current_type.as_generic())):
    # The current column and sql generic types are the same
    # Nothing to do
    self.logger.info("!!! I found a generic to generic match. !!!")
    self.logger.info(f"sql_type: {str(sql_type.as_generic())} current_type class {str(type(current_type.as_generic()))} ")
    self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(type(current_type))} ")
    self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(current_type)} ")
compatible_sql_type = self.merge_sql_types([current_type, sql_type])

self.logger.info("!!! I asked merge_sql_type for a merge match. !!!")
self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(type(current_type))} compatible_sql_type {str(type(compatible_sql_type))} ")
self.logger.info(f"sql_type: {str(sql_type)} current_type class {str(current_type)} compatible_sql_type {str(compatible_sql_type)}")

if compatible_sql_type == current_type:
    # Nothing to do
    self.logger.info("!!! I believe these to be equal types !!!")
    self.logger.info(f"current_type class {str(current_type)} compatible_sql_type {str(compatible_sql_type)}")

Testing

The first thing I noticed is that, when the table is created, the lengths of the varchar columns present in the source are not copied over to the target.

I will truncate the UsersTest table on the target side and run the meltano run again to review the logging info that is returned.

level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a generic to generic match. !!!
level=INFO message=sql_type: DATETIME current_type class <class 'sqlalchemy.sql.sqltypes.DateTime'>
level=INFO message=sql_type: TIMESTAMP current_type class <class 'sqlalchemy.dialects.postgresql.base.TIMESTAMP'>
level=INFO message=sql_type: TIMESTAMP current_type class TIMESTAMP
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> 
level=INFO message=sql_type: VARCHAR current_type class VARCHAR 
level=INFO message=!!! I found a generic to generic match. !!! 
level=INFO message=sql_type: DATETIME current_type class <class 'sqlalchemy.sql.sqltypes.DateTime'>
level=INFO message=sql_type: TIMESTAMP current_type class <class 'sqlalchemy.dialects.postgresql.base.TIMESTAMP'>
level=INFO message=sql_type: TIMESTAMP current_type class TIMESTAMP
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> 
level=INFO message=sql_type: VARCHAR current_type class VARCHAR 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 

The majority of matches were same-to-same, except the TIMESTAMP columns, which were generic-to-generic matches. Looking at the fourth logging line, we can see that evaluating str() values would have given us a same-to-same match up front.

Let's pick on the DisplayName column, which is an nvarchar(40) on the source. On the target I am going to truncate the table, change the column to a varchar(40) just like the source, and do a meltano run.

level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR(40)

They were a same-to-same match but would have failed if evaluating string-to-string. On the target I am going to truncate the table, change the column to a varchar(5), which is too small for the source data, and do a meltano run.

level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> 
level=INFO message=sql_type: VARCHAR current_type class VARCHAR(5) 

It is still seen as a same-to-same match, which allowed meltano run to run headlong into a write failure. The string-to-string evaluation would have caught this. I am going to switch to the following string-to-string check and comment out the generic-to-generic check.

# Check if the existing column type and the sql type are the same
if str(sql_type) == str(current_type):

On the target I dropped the UsersTest table, ran the meltano run, and then truncated the table.

level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: TIMESTAMP current_type class <class 'sqlalchemy.dialects.postgresql.base.TIMESTAMP'>
level=INFO message=sql_type: TIMESTAMP current_type class TIMESTAMP
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: TIMESTAMP current_type class <class 'sqlalchemy.dialects.postgresql.base.TIMESTAMP'>
level=INFO message=sql_type: TIMESTAMP current_type class TIMESTAMP 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> 
level=INFO message=sql_type: VARCHAR current_type class VARCHAR 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!! 
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'> 
level=INFO message=sql_type: INTEGER current_type class INTEGER 
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR
level=INFO message=!!! I found a same to same match. !!!
level=INFO message=sql_type: INTEGER current_type class <class 'sqlalchemy.sql.sqltypes.INTEGER'>
level=INFO message=sql_type: INTEGER current_type class INTEGER

Everything was a same-to-same match, which is nice. The TIMESTAMP got evaluated as the same right off the bat and didn't need a second check. Let's truncate the UsersTest table and change DisplayName to a varchar(40).

level=INFO message=sql_type: TIMESTAMP current_type class TIMESTAMP
level=INFO message=!!! I asked merge_sql_type for a merge match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> compatible_sql_type <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR(40) compatible_sql_type VARCHAR(40)
level=INFO message=!!! I believe these to be equal types !!!
level=INFO message=current_type class VARCHAR(40) compatible_sql_type VARCHAR(40)

Since it failed the same-to-same match, it called the merge_sql_types function, which is great; this makes more sense than the previous isinstance() evaluation flow. The current_type and compatible_sql_type pass the same-to-same evaluation, which we see in the last two lines. Lastly, let's truncate UsersTest again and change DisplayName to a varchar(5).

level=INFO message=!!! I asked merge_sql_type for a merge match. !!!
level=INFO message=sql_type: VARCHAR current_type class <class 'sqlalchemy.sql.sqltypes.VARCHAR'> compatible_sql_type <class 'sqlalchemy.sql.sqltypes.VARCHAR'>
level=INFO message=sql_type: VARCHAR current_type class VARCHAR(5) compatible_sql_type VARCHAR(5)
level=INFO message=!!! I believe these to be equal types !!!
level=INFO message=current_type class VARCHAR(5) compatible_sql_type VARCHAR(5)

Again the initial same-to-same check fails and merge_sql_types gets called. I don't agree with the merge_sql_types solution, but I like how I got to this failure.

I am going to change to a single string-to-string check (if str(sql_type) == str(current_type):). I found the logging of these checks handy and was wondering if I should leave a terse version of them in at a debug level.

@BuzzCutNorman
Contributor Author

BuzzCutNorman commented Aug 19, 2022

@edgarrmondragon I have completed reviewing the comments @visch made during his review and made changes to address them. I also attempted a refresh of merge_sql_types. I am figuring that merge_sql_types will only be called when the column types are different. It loops to see if any of the types in the sorted type list sql_types will work. If there is a string or byte type class or subclass and its length is None or 0, it will return it. It will also return a type whose class is similar to the current_type. The main scenario for the similar-class testing on Postgres was TIMESTAMP vs DATETIME. Please review this when you get a chance.
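The loop just described could be sketched roughly like this (a hypothetical illustration, not the SDK's actual merge_sql_types implementation; it assumes SQLAlchemy and uses as_generic() for the "similar class" test, as in the logging experiment earlier in the thread):

```python
# Sketch of the described merge loop: walk the sorted candidate types
# and return the first that works, preferring unbounded string/byte
# types and types of a similar generic class to the current type.
from sqlalchemy.types import DATETIME, TIMESTAMP, LargeBinary, String, TypeEngine

def merge_sql_types_sketch(current_type: TypeEngine, sql_types: list) -> TypeEngine:
    for candidate in sql_types:
        # An unbounded (length None or 0) string/byte type can hold
        # anything in its family, so return it immediately.
        if isinstance(candidate, (String, LargeBinary)) and not candidate.length:
            return candidate
        # Otherwise return a candidate whose generic class matches the
        # current type's, e.g. Postgres TIMESTAMP vs generic DATETIME.
        if isinstance(candidate.as_generic(), type(current_type.as_generic())):
            return candidate
    raise ValueError("No compatible SQL type found")

print(merge_sql_types_sketch(DATETIME(), [TIMESTAMP()]))  # a DateTime-family match
```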

I also have a couple of questions:

  1. What are the minimum column adaptations the SDK should do by default? This is a question form of @visch's comment from his review:

If we get to caring about varchar(6) vs varchar(1000) and needing to change this will present an issue but for the generic SDK this might be alright!

  2. Is merge_column_type a function developers should be directed to override to match the merge behavior of the SQL RDBMS they are using?

@BuzzCutNorman BuzzCutNorman marked this pull request as ready for review August 19, 2022 17:25
@aaronsteers
Contributor

aaronsteers commented Sep 6, 2022

@BuzzCutNorman - re:

I am figuring that merge_sql_types will only be called when the column types are different.

I think there may be cases where the column types might be the same, or at least equivalent with each other. Equivalence is probably more likely than equality but just to be safe, I think an equality check probably does make sense within the method. Some types are not 'identical' even though they are 'equivalent'. For instance, in Redshift, INT8 is an alias for INTEGER, and DECIMAL is an alias for NUMERIC. Because equivalence can be tricky, it is probably a good idea to handle cases of differences as well as equivalence/equality.
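One hypothetical way to handle "equivalent but not identical" types is to normalize dialect-specific aliases to a canonical name before comparing. The alias table below just mirrors the Redshift examples mentioned above and is illustrative, not exhaustive:

```python
# Assumed alias mapping, following the examples in the comment above.
REDSHIFT_ALIASES = {"INT8": "INTEGER", "DECIMAL": "NUMERIC"}

def normalize(type_name: str) -> str:
    # Fold a dialect alias to its canonical name before comparison.
    name = type_name.upper()
    return REDSHIFT_ALIASES.get(name, name)

def equivalent(a: str, b: str) -> bool:
    return normalize(a) == normalize(b)

print(equivalent("int8", "INTEGER"))     # True: alias of the same type
print(equivalent("DECIMAL", "numeric"))  # True
print(equivalent("INTEGER", "NUMERIC"))  # False: genuinely different
```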

I also have a couple of questions:

  1. What are the minimum column adaptations the SDK should do by default? This is a question form of @visch's comment from his review:

If we get to caring about varchar(6) vs varchar(1000) and needing to change this will present an issue but for the generic SDK this might be alright!

I'm not sure I fully understand the question, but in case it helps, I do think the SDK should (eventually) know how to auto-expand from varchar(6) to varchar(1000). Does that address your question?

A bit of a tangent, but it could also be nice if the developer could choose to always bias towards larger data types, something like varchar(MAX), to prevent future resizing on columnar database targets where storage consumption is identical anyway.

  2. Is merge_column_type a function developers should be directed to override to match the merge behavior of the SQL RDBMS they are using?

I think the goal here with merge_column_type() would be that (over time) we tackle as many generic cases as possible. Since we inevitably will not get perfect coverage over night, and since some databases will have non-standard types and conversion rules, I have imagined this as the developer pretty much always having a merge_column_types() implementation and that they'd handle zero or more special cases before then invoking the base class's implementation for everything else.

What do you think?

cc @edgarrmondragon

@edgarrmondragon
Collaborator

I think the goal here with merge_column_type() would be that (over time) we tackle as many generic cases as possible. Since we inevitably will not get perfect coverage over night, and since some databases will have non-standard types and conversion rules, I have imagined this as the developer pretty much always having a merge_column_types() implementation and that they'd handle zero or more special cases before then invoking the base class's implementation for everything else.

@aaronsteers I agree. Once developers start filing bugs and feature requests, we can start figuring out what parts make sense to abstract away from them and which to expose in the "public" API. I think that's been the general approach for the trickier or more niche features in the SDK. For now, as a rule of thumb, we could be content with supporting SQLite and Postgres DDL.

@BuzzCutNorman
Contributor Author

@aaronsteers, @edgarrmondragon

Thank you for answering my questions. Yes, you did help me understand better how you want to shape target table columns to match a source, how developers should interact with and extend this process, and the dialects you initially want to support. This all sounds great to me.

The only item not touched on was SQL engine conversions. Examples are integers inserted into a varchar column or integers inserted into a numeric column; that is, the conversions that happen automagically when working with SQL.

Please let me know if you want to move forward with the code as it is now and work on the items above in subsequent PRs or would you like me to attempt to incorporate all or some of the items you mentioned into this PR?

@aaronsteers
Contributor

aaronsteers commented Sep 8, 2022

@aaronsteers, @edgarrmondragon

Thank you for answering my questions. Yes, you did help me understand better how you want to shape target table columns to match a source, how developers should interact with and extend this process, and the dialects you initially want to support. This all sounds great to me.

🙌 🙏

The only item not touched on was SQL engine conversions. Examples are integers inserted into a varchar column or integers inserted into a numeric column; that is, the conversions that happen automagically when working with SQL.

Agreed, this part is pretty hard. Generally, I think of this as varchar being "able" to hold an int (or datetime, etc.) but not vice versa. This means, if we take a pure "ALTER" approach, we may end up with strings if a column morphs across reasonable type barriers. Other implementations will sometimes "rename and stash" the old column in order to avoid this. So, if a last_activity column was previously datetime type and is becoming int, we could rename the current column to last_activity__bak_20220801 or similar and then add a new column last_activity as the int type.

This more advanced accommodation, though, is not something we have to tackle right away. And arguably, there are strong benefits to keeping all historic data in the same column even if it makes for some awkward type casting.
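The "rename and stash" idea above could be sketched as follows (illustrative SQL and naming only, not SDK code; the table and column names are just the example from the comment):

```python
# Rename the old column with a dated __bak suffix, then add a fresh
# column of the new type, so historic data is stashed rather than cast.
from datetime import date
from typing import Optional

def stash_and_replace_ddl(table: str, column: str, new_type: str,
                          today: Optional[date] = None) -> list:
    stamp = (today or date.today()).strftime("%Y%m%d")
    stashed = f"{column}__bak_{stamp}"
    return [
        f"ALTER TABLE {table} RENAME COLUMN {column} TO {stashed}",
        f"ALTER TABLE {table} ADD COLUMN {column} {new_type}",
    ]

for stmt in stash_and_replace_ddl("users", "last_activity", "INTEGER",
                                  today=date(2022, 8, 1)):
    print(stmt)
```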

Please let me know if you want to move forward with the code as it is now and work on the items above in subsequent PRs or would you like me to attempt to incorporate all or some of the items you mentioned into this PR?

I love what you have so far and I don't think we can/should try to solve every case at once. @edgarrmondragon - I'll look to you for the code signoff/approval side. If this is working and provides an improvement over what we had previously, it seems to me to be a valuable + stable increment.

@edgarrmondragon
Collaborator

edgarrmondragon left a comment

@BuzzCutNorman this looks great! Thanks for taking the time to contribute 😄. Codecov is complaining about a couple of branches not being covered, but they look alright to me.


Successfully merging this pull request may close these issues.

SQL Targets: Processing fails when target table already exists
4 participants