lightning: specify collation when parquet value to string datum #38391

Merged
merged 3 commits into from
Oct 17, 2022

Conversation

dsdashun
Contributor

What problem does this PR solve?

Issue Number: close #38351

Problem Summary:

What is changed and how it works?

In the parquet parser, when setting a value on a string datum, use the "utf8mb4_bin" collation instead of an empty collation. With a valid collation, the string conversion logic no longer reports errors, which improves performance.
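
As a rough illustration of the reasoning above, the following sketch models a per-row collation lookup by name. It is not TiDB's code: `collations`, `lookupCollation`, and `countLookupErrors` are hypothetical names, and the registry holds only two entries. The assumption is that an empty name fails the lookup and produces an error value for every row, while "utf8mb4_bin" resolves without one.

```go
package main

import "fmt"

// collations is a toy registry standing in for a real collation table.
// Hypothetical: the real set of supported collations is much larger.
var collations = map[string]struct{}{
	"utf8mb4_bin": {},
	"binary":      {},
}

// lookupCollation is a hypothetical stand-in for a by-name collation lookup.
// An unknown name (including the empty string) yields a freshly built error.
func lookupCollation(name string) error {
	if _, ok := collations[name]; !ok {
		return fmt.Errorf("unknown collation: %q", name)
	}
	return nil
}

// countLookupErrors simulates converting `rows` string values that all carry
// the same collation name, and counts how many lookups report an error.
func countLookupErrors(name string, rows int) int {
	errs := 0
	for i := 0; i < rows; i++ {
		if err := lookupCollation(name); err != nil {
			errs++
		}
	}
	return errs
}

func main() {
	const rows = 1000
	fmt.Println("empty collation errors:", countLookupErrors("", rows))
	fmt.Println("utf8mb4_bin errors:", countLookupErrors("utf8mb4_bin", rows))
}
```

In this toy model the empty name produces one error per row and the valid name produces none, which is the kind of per-value overhead the change is meant to avoid.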

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot
Member

ti-chi-bot commented Oct 11, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • lance6716
  • okJiang

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-triage-completed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Oct 11, 2022
@dsdashun
Contributor Author

/run-integration-br-test

@dsdashun
Contributor Author

/cc @lance6716 @D3Hunter

@dsdashun dsdashun marked this pull request as ready for review October 11, 2022 05:10
@ti-chi-bot ti-chi-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 11, 2022
@@ -458,7 +458,7 @@ func setDatumByString(d *types.Datum, v string, meta *parquet.SchemaElement) {
 		ts = ts.UTC()
 		v = ts.Format(utcTimeLayout)
 	}
-	d.SetString(v, "")
+	d.SetString(v, "utf8mb4_bin")

String encoding needs to be considered in more than one place: one is the string data in the parquet file, the other is the string variables in the memory of the lightning process, which are read by the parquet reader. Since a Go string is generally assumed to be UTF-8 encoded, I think this PR is OK. But I'm not sure whether a parquet file can use another encoding for string data, in which case the go-parquet reader would wrongly cast it to a Go string without any encoding or decoding.
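
The concern above can be shown with a small sketch. In Go, converting a byte slice to a string copies the bytes without transcoding, so non-UTF-8 input stays non-UTF-8. The helper names `castBytes` and `castIsValidUTF8` are hypothetical, and the Latin-1 bytes are an assumed example of what a differently encoded source might contain; whether any parquet writer actually produces such data is not established here.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// castBytes mimics a reader that hands raw column bytes to Go as a string.
// The conversion copies the bytes as-is; it does not transcode them.
func castBytes(raw []byte) string {
	return string(raw)
}

// castIsValidUTF8 reports whether the cast result is well-formed UTF-8.
func castIsValidUTF8(raw []byte) bool {
	return utf8.ValidString(castBytes(raw))
}

func main() {
	utf8Bytes := []byte("café")                   // UTF-8: 'é' is 0xC3 0xA9
	latin1Bytes := []byte{0x63, 0x61, 0x66, 0xE9} // Latin-1: 'é' is 0xE9

	for _, raw := range [][]byte{utf8Bytes, latin1Bytes} {
		fmt.Printf("% x valid UTF-8: %v\n", raw, castIsValidUTF8(raw))
	}
}
```

A check like this could be used to spot data that was cast without decoding, though it says nothing about which encoding the bytes were originally in.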

@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Oct 11, 2022
@dsdashun
Contributor Author

@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Oct 14, 2022
@dsdashun
Contributor Author

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 62d9688

@ti-chi-bot ti-chi-bot added status/can-merge Indicates a PR has been approved by a committer. needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. needs-cherry-pick-release-6.3 and removed do-not-merge/needs-triage-completed labels Oct 17, 2022
@dsdashun
Contributor Author

/merge

@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #38487.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Oct 17, 2022
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #38488.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Oct 17, 2022
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #38489.

@dsdashun dsdashun deleted the fix-38351 branch October 17, 2022 01:35
@sre-bot
Contributor

sre-bot commented Oct 17, 2022

TiDB MergeCI notify

🔴 Bad News! [1] CI still failing after this pr merged.
These failed integration tests don't seem to be introduced by the current PR.

CI Name | Result | Duration | Compare with Parent commit
idc-jenkins-ci-tidb/integration-common-test | 🔴 failed 2, success 15, total 17 | 17 min | Existing failure
idc-jenkins-ci/integration-cdc-test | 🟢 all 37 tests passed | 26 min | Existing passed
idc-jenkins-ci-tidb/integration-ddl-test | 🟢 all 6 tests passed | 22 min | Existing passed
idc-jenkins-ci-tidb/common-test | 🟢 all 11 tests passed | 8 min 52 sec | Existing passed
idc-jenkins-ci-tidb/tics-test | 🟢 all 1 tests passed | 5 min 52 sec | Existing passed
idc-jenkins-ci-tidb/sqllogic-test-2 | 🟢 all 28 tests passed | 4 min 10 sec | Existing passed
idc-jenkins-ci-tidb/sqllogic-test-1 | 🟢 all 26 tests passed | 4 min 1 sec | Existing passed
idc-jenkins-ci-tidb/mybatis-test | 🟢 all 1 tests passed | 3 min 0 sec | Existing passed
idc-jenkins-ci-tidb/integration-compatibility-test | 🟢 all 1 tests passed | 2 min 47 sec | Existing passed
idc-jenkins-ci-tidb/plugin-test | 🟢 build success, plugin test success | 4 min | Existing passed

Labels
needs-cherry-pick-release-5.4 Should cherry pick this PR to release-5.4 branch. needs-cherry-pick-release-6.1 Should cherry pick this PR to release-6.1 branch. release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. status/can-merge Indicates a PR has been approved by a committer. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lightning: Performance Regression on 6.2.0 Compared with 5.3.3 on Parquet Data Source with Strings
5 participants