Online DDL: avoid SQL's CONVERT(...), convert programmatically if needed #16597
Conversation
…ify column's charset or collation Signed-off-by: Shlomi Noach <[email protected]>
…onvert for vplayer Signed-off-by: Shlomi Noach <[email protected]>
Signed-off-by: Shlomi Noach <[email protected]>
go/vt/vttablet/onlineddl/vrepl.go (Outdated)

@@ -289,6 +289,9 @@ func (v *VRepl) generateFilterQuery() error {
		sb.WriteString(fmt.Sprintf("CONCAT(%s)", escapeName(name)))
	case sourceCol.Type() == "json":
		sb.WriteString(fmt.Sprintf("convert(%s using utf8mb4)", escapeName(name)))
	case targetCol.Type() == "json" && sourceCol.Type() != "json":
This moves up from below so as to eliminate a case before we compare charsets for JSONs, which is not required and not beneficial.
Codecov Report

@@ Coverage Diff @@
## main #16597 +/- ##
==========================================
- Coverage 68.85% 68.84% -0.02%
==========================================
Files 1557 1557
Lines 199891 200003 +112
==========================================
+ Hits 137644 137697 +53
- Misses 62247 62306 +59
@@ -646,6 +654,24 @@ func appendFromRow(pq *sqlparser.ParsedQuery, buf *bytes2.Buffer, fields []*quer
	buf.WriteString(sqltypes.NullStr)
} else {
	vv := sqltypes.MakeTrusted(typ, row.Values[col.offset:col.offset+col.length])
Does this also allocate here, and again later on? Is it worth avoiding creating this value if we overwrite it later?
Done. No double allocation. Also, converged the two codepaths that do charset.Convert() into a single convertStringCharset() function.
@@ -257,7 +258,7 @@ func (tp *TablePlan) applyBulkInsert(sqlbuffer *bytes2.Buffer, rows []*querypb.R
	if i > 0 {
		sqlbuffer.WriteString(", ")
	}
-	if err := appendFromRow(tp.BulkInsertValues, sqlbuffer, tp.Fields, row, tp.FieldsToSkip); err != nil {
+	if err := tp.appendFromRow(tp.BulkInsertValues, sqlbuffer, tp.Fields, row, tp.FieldsToSkip); err != nil {
If we make this change, which I'm OK with, then we don't need to pass in the other tp struct values: tp.appendFromRow(sqlbuffer, row)
Good catch! Fixed.
💅 This is great! I think that this will solve so many edge cases we've seen in production. ❤️ Just a couple of minor points so far.
go/vt/vttablet/onlineddl/vrepl.go (Outdated)

if trivialCharset(fromCollation) && trivialCharset(toCollation) && targetCol.Type() != "json" {
	sb.WriteString(escapeName(name))
} else if fromCollation == toCollation && targetCol.Type() != "json" {
We don't want && targetCol.Type() != "json" here and just above, do we? We already handle the non-JSON to JSON case above. We'd fall into the else case below, where we'd say there's a collation conversion necessary even though there isn't. No?
In any event, I don't think this is a major issue, as the primary issue we've seen on the target/vplayer side is where we were unable to use the desired index because of the CONVERT usage, and you can't add indexes directly on JSON columns anyway.
We already handle the non-JSON to JSON case above.
You're right! We changed the case ordering, and now we don't need this check. Fixed: removed three unnecessary checks in total.
Yes, we're still left with a few CONVERT(...)s in the code: for JSONs and for ENUMs. For JSONs it's as you say - not something you can even put in a primary key or any unique key; for ENUMs it's more complex. I'll take it to another PR.
case sourceCol.Type() == "json":
	sb.WriteString(fmt.Sprintf("convert(%s using utf8mb4)", escapeName(name)))
@dbussink do you think this is still needed? I don't think so anymore, now that we have native JSON type support.
(against a v21 vtgate here)
❯ mysql commerce -e "create table json_test (id int not null primary key, j1 json); insert into json_test values (1, '{\"name\":\"Matt\"}')"
❯ mysql commerce -e "insert into json_test select id+10, j1 from json_test"
❯ mysql commerce -e "select * from json_test" --column-type-info
Field 1: `id`
Catalog: `def`
Database: `commerce`
Table: `json_test`
Org_table: `json_test`
Type: LONG
Collation: binary (63)
Length: 11
Max_length: 2
Decimals: 0
Flags: NOT_NULL PRI_KEY NO_DEFAULT_VALUE NUM PART_KEY
Field 2: `j1`
Catalog: `def`
Database: `commerce`
Table: `json_test`
Org_table: `json_test`
Type: JSON
Collation: binary (63)
Length: 4294967295
Max_length: 16
Decimals: 0
Flags: BLOB BINARY
+----+------------------+
| id | j1 |
+----+------------------+
| 1 | {"name": "Matt"} |
| 11 | {"name": "Matt"} |
+----+------------------+
I expect this to be bytes we pass on to MySQL "on the other side", and they are interpreted there as either a JSON field, or serialized as a utf8mb4 string if some other type on the target.
Either way, I don't think it's a major deal on the source/vcopier side, as the primary problems we've seen there are when these CONVERT calls then preclude us from using the desired index in the rowstreamer query, and you can't add indexes directly on JSON columns anyway.
Let's leave it like so for now.
JSON is a bit special anyway, since we can't use the direct textual representation; we turn it into a SQL expression using JSON_OBJECT so we lose as little type information as possible.
if conversion, ok := tp.ConvertCharset[col.field.Name]; ok && col.length >= 0 {
	// Non-null string value, for which we have a charset conversion instruction
	fromCollation := tp.CollationEnv.DefaultCollationForCharset(conversion.FromCharset)
Do we have to rely on the default collation for the charset (on from and to side)? If we take utf8mb4 for example:
mysql> show collation where charset = 'utf8mb4';
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation | Charset | Id | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci | utf8mb4 | 255 | Yes | Yes | 0 | NO PAD |
| utf8mb4_0900_as_ci | utf8mb4 | 305 | | Yes | 0 | NO PAD |
| utf8mb4_0900_as_cs | utf8mb4 | 278 | | Yes | 0 | NO PAD |
| utf8mb4_0900_bin | utf8mb4 | 309 | | Yes | 1 | NO PAD |
| utf8mb4_bg_0900_ai_ci | utf8mb4 | 318 | | Yes | 0 | NO PAD |
| utf8mb4_bg_0900_as_cs | utf8mb4 | 319 | | Yes | 0 | NO PAD |
| utf8mb4_bin | utf8mb4 | 46 | | Yes | 1 | PAD SPACE |
...
| utf8mb4_turkish_ci | utf8mb4 | 233 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_520_ci | utf8mb4 | 246 | | Yes | 8 | PAD SPACE |
| utf8mb4_unicode_ci | utf8mb4 | 224 | | Yes | 8 | PAD SPACE |
| utf8mb4_vietnamese_ci | utf8mb4 | 247 | | Yes | 8 | PAD SPACE |
| utf8mb4_vi_0900_ai_ci | utf8mb4 | 277 | | Yes | 0 | NO PAD |
| utf8mb4_vi_0900_as_cs | utf8mb4 | 300 | | Yes | 0 | NO PAD |
| utf8mb4_zh_0900_as_cs | utf8mb4 | 308 | | Yes | 0 | NO PAD |
+----------------------------+---------+-----+---------+----------+---------+---------------+
89 rows in set (0.00 sec)
If you're up for squeezing another change in here... I think we might want to make it ConvertCollation that we use in OnlineDDL; or, if we leave the field name the same, just use the collation name when possible rather than the charset name. The collation is specific, and it implies the character set. Perhaps we truly only care about the character set in this scenario, though... 🤔
Do we have to rely on the default collation for the charset (on from and to side)? If we take utf8mb4 for example:
It's a bit moot. We only use Collation as an intermediate step to get from the named charset (e.g. "latin1") into a Charset object. So we may as well use the default collation to get there.
Perhaps we truly only care about the character set in this scenario though... 🤔
This is worth digging into. If we do end up using collation rather than charset, then there's a few proto changes to make, so this will be outside the scope of this PR.
// Non-null string value, for which we have a charset conversion instruction
fromCollation := tp.CollationEnv.DefaultCollationForCharset(conversion.FromCharset)
if fromCollation == collations.Unknown {
	return vterrors.Errorf(vtrpcpb.Code_INVALID_ARGUMENT, "Character set %s not supported for column %s", conversion.FromCharset, col.field.Name)
Nit, but errors aren't supposed to be capitalized (due to wrapping). That applies throughout the new code in the PR.
Fixed! One place where I did leave the message capitalized is "Incorrect string value" - this string mimics the error message MySQL would have given for the equivalent SQL CONVERT(...) function, and I think we should keep this as it promotes consistency.
@@ -646,6 +654,24 @@ func appendFromRow(pq *sqlparser.ParsedQuery, buf *bytes2.Buffer, fields []*quer
	buf.WriteString(sqltypes.NullStr)
} else {
	vv := sqltypes.MakeTrusted(typ, row.Values[col.offset:col.offset+col.length])

	if conversion, ok := tp.ConvertCharset[col.field.Name]; ok && col.length >= 0 {
We don't want col.length > 0 here? If there are no chars/bytes then I wouldn't think we need to do anything in this regard.
Due to my bad English, I'm not sure if you mean we should use col.length >= 0 or if you mean we shouldn't use col.length >= 0. Just in case you mean the former: we do have col.length >= 0 at the end of this line, in case you've missed it. If you meant the latter: col.length >= 0 in this context is an indicator that the value is not NULL, and we should test this, or otherwise the conversion will break.
@dbussink pointed out that you meant to highlight > 0 rather than >= 0. Agreed, and fixed!
Signed-off-by: Shlomi Noach <[email protected]>
I'm backporting this to all supported versions as I see this as an important bugfix.
…eeded (vitessio#16597) Signed-off-by: Shlomi Noach <[email protected]>
Description
Fixes #16023
We have a clear picture and a fix to #16023. The original reason why we needed convert() in the first place is that vreplication and vstreamer both issue a SET NAMES binary. We will want to change that in the future, but in the meantime this PR conforms to the binary connection charset. So we used convert() to turn textual values into utf8mb4. On the other side, vplayer reads events from the binary log. It used programmatic conversion (charset.Convert()) of the data to utf8mb4 to align with vcopier.
convert()
to turn textual values intoutf8mb4
. On the other side,vplayer
is reading events from the binary log. It used programmatic conversion (charset.Convert()
) of the data toutf8mb4
to align withvcopier
What we are doing now:

- We avoid convert(), solving the sorting issue described in Bug Report: OnlineDDL PK conversion results in table scans #16023 (comment).
- For vcopier read data, we do introduce programmatic conversion of non-utf columns into their designated charsets.
- In vplayer, we do not convert at all if both source and target have the same charset.
- In vplayer, we do apply programmatic conversion of non-utf columns into their designated charsets, with similar logic as for vcopier.
- Upon a charset.Convert() error, we translate it into ERROR 1366 ("Incorrect string value ..."), which is a terminal error in vreplication, and so the migration bails out as soon as that happens. This can happen if e.g. we're converting a UTF column into ASCII and the UTF column contains a smiley emoji.

Because we do not convert the original charset to utf8mb4, we get to programmatically convert it to the specific target column. Previously (and this is perhaps the last piece of magic I have not dug into yet, and again likely caused by the binary charset) we did not need to convert into the target charset.
utf8mb4
, we get to programmatically convert it to the specific target column. Previously (and this is perhaps the last piece of magic I have not digged into yet, and again likely to be caused by thebinary
charset) we did not need to convert into the target charset.All the tests remain the same, and we introduce a couple new ones.
Related Issue(s)
Backport
I wish to backport this to all supported versions, seeing that this is a bugfix: without this fix some migrations will slow down to a near halt.
Checklist
Deployment Notes