Merge system - issue importing from phpbb3 utf8-bin - duplicated entry #209

Open
edipoferreira opened this issue Jun 19, 2018 · 3 comments

Comments

@edipoferreira

Hi, I'm trying to use the Merge System, but when importing users I run into a problem with names that contain special characters. For example:
On the phpBB board I have two users,
jonatas and Jônatas. The collation there is utf8_bin, but when MyBB tries to import them, it considers jonatas and Jônatas to be the same user and therefore reports a duplicate entry.

Has anyone faced this problem?
I tried a couple of encoding configurations in the merge, but nothing worked.
To add more information: I changed the collation of the username field in mybb_users to utf8_bin. Does anyone know whether leaving the field like this will cause any problems?

ALTER TABLE mybb_users CHANGE username username VARCHAR(120) CHARACTER SET utf8 COLLATE utf8_bin NOT NULL DEFAULT '';

It appears that I will be unable to migrate from phpBB3 because of this encoding issue.
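
One way to see how many accounts would be affected before running the import is to group the source usernames by a case- and accent-insensitive key. The following is only a rough sketch: `collation_key` and `find_collisions` are hypothetical helpers, the folding is an approximation of a case- and accent-insensitive collation rather than MySQL's exact rules, and the sample list (including "maria") is made up for illustration.

```python
import unicodedata
from collections import defaultdict


def collation_key(name):
    """Approximate a case- and accent-insensitive comparison key.

    Assumption: this only mimics such a collation; it is not MySQL's
    actual weight table.
    """
    # Split accented letters into base letter + combining mark (NFD).
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks, keeping only the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Fold case so that "J" and "j" compare as equal.
    return base.casefold()


def find_collisions(usernames):
    """Return groups of usernames that share the same folded key."""
    groups = defaultdict(list)
    for name in usernames:
        groups[collation_key(name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


# Hypothetical sample data; in practice the list would come from the
# phpBB users table.
sample = ["jonatas", "Jônatas", "maria"]
for key, names in find_collisions(sample).items():
    print(key, "->", names)
```

Any group it reports is a pair (or more) of names that an import into a case- and accent-insensitive column would likely reject as duplicates.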

@euantorano
Member

euantorano commented Jun 20, 2018 via email

@edipoferreira
Author

I think it is an issue with the collation in use. I installed a fresh copy of MyBB and tried to create both users as a test, without success: because of the collation, Jônatas and jonatas compare as equal. After I converted the field to utf8_bin, the second user was created normally.
I opened an issue on the MyBB repository to ask whether I can keep the username field as utf8_bin.
mybb/mybb#3267

@yuliu
Member

yuliu commented Oct 6, 2019

I think this issue relates to the character set rather than to the SQL collation. The duplicate-user check is implemented in the users base module and takes UTF-8 into account. However, there is more to it from the database collation perspective.

For @lordgittux's problem, Jônatas is indeed a duplicate of jonatas by the logic of the code in the base module; in case-insensitive collations:

  • j is regarded as the same as J
  • ô is regarded as the same as o
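
The two rules above can be illustrated outside the database. This is a minimal sketch, assuming that NFD decomposition plus case folding is a fair stand-in for an accent- and case-insensitive collation; it is not the Merge System's code, and `strip_accents` and `fold` are hypothetical helper names.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks such as the circumflex (o + U+0302 -> o)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text):
    """Apply both rules: strip accents, then ignore case."""
    return strip_accents(text).casefold()


a, b = "jonatas", "Jônatas"

# Byte-wise (binary-style) comparison: the two names differ.
print(a == b)

# Case- and accent-insensitive comparison: the two names collide.
print(fold(a) == fold(b))
```

A binary collation such as utf8_bin behaves like the first comparison, which is why the second user could be created after the column was converted.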

@euantorano, yep, here's the interesting point: it looks like the duplicate check in the users base module does not cover the circumflex diacritical mark (ˆ) or other letter variations. I'll dig into it more later.

I opened #226, which has some discussion of how the UTF-8 problem could relate to the user duplicate check.

Edited: typo.
Edited: more investigation.
