Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

executor: fix hash join between datetime and timestamp #25915

Merged
merged 7 commits into from
Jul 6, 2021

Conversation

sylzd
Copy link
Contributor

@sylzd sylzd commented Jul 4, 2021

What problem does this PR solve?

Issue Number: close #25902

Problem Summary:

What is changed and how it works?

Proposal: xxx

What's Changed:

How it Works:

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression
    • Consumes more CPU
    • Consumes more MEM
  • Breaking backward compatibility

Release note

  • No release note

@ti-chi-bot ti-chi-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 4, 2021
@ti-srebot
Copy link
Contributor

@github-actions github-actions bot added the sig/execution SIG execution label Jul 4, 2021
@ti-chi-bot ti-chi-bot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 4, 2021
@sylzd
Copy link
Contributor Author

sylzd commented Jul 4, 2021

Timestamp hash table cannot be encoded by utc zone, because it's not a persistent table, and used by hash join. Is that used by any other places?

@ti-chi-bot ti-chi-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jul 4, 2021
@sylzd
Copy link
Contributor Author

sylzd commented Jul 5, 2021

/cc @XuHuaiyu @morgo

@ti-chi-bot ti-chi-bot requested review from morgo and XuHuaiyu July 5, 2021 01:53
@XuHuaiyu XuHuaiyu requested review from wshwsh12 and tiancaiamao and removed request for morgo and XuHuaiyu July 5, 2021 02:49
@XuHuaiyu
Copy link
Contributor

XuHuaiyu commented Jul 5, 2021

@wshwsh12 @tiancaiamao
May you help to review this bug-fix PR?

@mmyj
Copy link
Member

mmyj commented Jul 5, 2021

@wshwsh12 @tiancaiamao
May you help to review this bug-fix PR?

I can help to review it.

@tiancaiamao
Copy link
Contributor

@wshwsh12 @tiancaiamao
May you help to review this bug-fix PR?

I'll take a look.

@@ -330,14 +330,7 @@ func encodeHashChunkRowIdx(sc *stmtctx.StatementContext, row chunk.Row, tp *type
case mysql.TypeDate, mysql.TypeDatetime, mysql.TypeTimestamp:
flag = uintFlag
t := row.GetTime(idx)
// Encoding timestamp need to consider timezone.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think here is right, the decoding process should take the timezone into consideration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we do not store the timezone information in the TiKV, all timestamp data is stored as UTC.
They are converted to UTC first, and then into a uint64

The read back process should convert the uint64 to the correct timestamp considering the timezone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timestamp hash table cannot be encoded by utc zone, because it's not a persistent table, and used by hash join. Is that used by any other places?

As I see, the hash table is not a persistent table, It's used to do join operator after being getted from kv and converted to the right timezone, other than store in kv.

Copy link
Contributor Author

@sylzd sylzd Jul 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For instance, this sql can produce the right result, because we do not convert timezone when do join. Timezone convertion should be considered when it was getted from kv.

select  /*+ MERGE_JOIN(tt1, tt2) */ * from tt1 inner join tt2 on tt1.ts=tt2.ts;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly

@tiancaiamao
Copy link
Contributor

Here is what we need to change:

tidb/util/codec/codec.go

Lines 333 to 340 in cb06e11

// Encoding timestamp need to consider timezone.
// If it's not in UTC, transform to UTC first.
if t.Type() == mysql.TypeTimestamp && sc.TimeZone != time.UTC {
err = t.ConvertTimeZone(sc.TimeZone, time.UTC)
if err != nil {
return
}
}

tidb/util/codec/codec.go

Lines 510 to 517 in cb06e11

// Encoding timestamp need to consider timezone.
// If it's not in UTC, transform to UTC first.
if t.Type() == mysql.TypeTimestamp && sc.TimeZone != time.UTC {
err = t.ConvertTimeZone(sc.TimeZone, time.UTC)
if err != nil {
return
}
}

Could you help us remove it in this pull request, and also add a test like that:

mysql> select * from tt1 where ts in (select ts from tt2);
+---------------------+
| ts                  |
+---------------------+
| 2001-01-01 00:00:00 |
+---------------------+
1 row in set (0.00 sec)

mysql> set @@session.time_zone = '+10:00';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tt1 where ts in (select ts from tt2);
Empty set (0.01 sec)

Bug fix should comes with a unit test to prevent regression. @sylzd

@tiancaiamao
Copy link
Contributor

Explain: tikv doesn't store the time_zone information, so the timestamp is converted to UTC first before stored.
When the timestamp data is loaded into chunk, they are convert back considering the session time_zone.

mysql> insert into tt1 values ("2001-01-01 00:00:00");
Query OK, 1 row affected (0.00 sec)
# stored as `2000-12-31 16:00:00` in UTC

mysql> select * from tt1;
+---------------------+
| ts                  |
+---------------------+
| 2001-01-01 00:00:00 |
+---------------------+
1 row in set (0.01 sec)
# read back `2000-12-31 16:00:00` considering session time_zone=+8:00 gets `2001-01-01 00:00:00` UTC+8

mysql> set @@session.time_zone = '+10:00';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tt1;
+---------------------+
| ts                  |
+---------------------+
| 2001-01-01 02:00:00 |
+---------------------+
1 row in set (0.00 sec)
# read back `2000-12-31 16:00:00` considering session time_zone=+8:00 gets `2001-01-01 02:00:00` UTC+10

The bug of that issue is, the loading to chunk process had already handle the time_zone converting, so when hash join key comparing, the converting should not be done again.

func (c *hashRowContainer) matchJoinKey(buildRow, probeRow chunk.Row, probeHCtx *hashContext) (ok bool, err error) {

@sylzd
Copy link
Contributor Author

sylzd commented Jul 5, 2021

Explain: tikv doesn't store the time_zone information, so the timestamp is converted to UTC first before stored.
When the timestamp data is loaded into chunk, they are convert back considering the session time_zone.

mysql> insert into tt1 values ("2001-01-01 00:00:00");
Query OK, 1 row affected (0.00 sec)
# stored as `2000-12-31 16:00:00` in UTC

mysql> select * from tt1;
+---------------------+
| ts                  |
+---------------------+
| 2001-01-01 00:00:00 |
+---------------------+
1 row in set (0.01 sec)
# read back `2000-12-31 16:00:00` considering session time_zone=+8:00 gets `2001-01-01 00:00:00` UTC+8

mysql> set @@session.time_zone = '+10:00';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from tt1;
+---------------------+
| ts                  |
+---------------------+
| 2001-01-01 02:00:00 |
+---------------------+
1 row in set (0.00 sec)
# read back `2000-12-31 16:00:00` considering session time_zone=+8:00 gets `2001-01-01 02:00:00` UTC+10

The bug of that issue is, the loading to chunk process had already handle the time_zone converting, so when hash join key comparing, the converting should not be done again.

func (c *hashRowContainer) matchJoinKey(buildRow, probeRow chunk.Row, probeHCtx *hashContext) (ok bool, err error) {

yes, that's what I mean~

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Jul 5, 2021
@sylzd
Copy link
Contributor Author

sylzd commented Jul 5, 2021

@mmyj PTAL~


func (s *testSuiteJoinSerial) TestIssue25902(c *C) {
tk := testkit.NewTestKitWithInit(c, s.store)
tk.MustExec("drop table if exists tt1,tt2; ")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tt3

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

tk.MustExec("insert into tt3 values (\"2001-01-01 00:00:00\");")
tk.MustQuery("select * from tt1 where ts in (select ts from tt2);").Check(testkit.Rows("2001-01-01 00:00:00"))
tk.MustQuery("select * from tt1 where ts in (select ts from tt3);").Check(testkit.Rows("2001-01-01 00:00:00"))
tk.MustExec("set @@session.time_zone = '+10:00';")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we reset to time_zone to +08?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx, it fixed~

Copy link
Member

@mmyj mmyj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@ti-chi-bot
Copy link
Member

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • mmyj
  • tiancaiamao

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Jul 5, 2021
@tiancaiamao
Copy link
Contributor

/merge

@ti-chi-bot
Copy link
Member

This pull request has been accepted and is ready to merge.

Commit hash: 66fa1bd

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jul 6, 2021
@ti-chi-bot ti-chi-bot merged commit dc53b52 into pingcap:master Jul 6, 2021
@guo-shaoge
Copy link
Collaborator

/label needs-cherry-pick-4.0
/label needs-cherry-pick-5.0
/label needs-cherry-pick-5.1

@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #25988

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jul 6, 2021
@ti-srebot
Copy link
Contributor

cherry pick to release-5.0 in PR #25990

ti-srebot pushed a commit to ti-srebot/tidb that referenced this pull request Jul 6, 2021
@ti-srebot
Copy link
Contributor

cherry pick to release-5.1 in PR #25991

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
needs-cherry-pick-release-5.0 needs-cherry-pick-release-5.1 sig/execution SIG execution size/M Denotes a PR that changes 30-99 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

error implicit convertion between varchar and timestamp when join
7 participants