Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emojis and 4-byte characters cause Jethro to break #754

Closed
wants to merge 3 commits into from
Closed

Emojis and 4-byte characters cause Jethro to break #754

wants to merge 3 commits into from

Conversation

jefft
Copy link
Contributor

@jefft jefft commented Jan 29, 2022

Jethro currently breaks when emojis are entered. For instance, if I try to save SMS text containing '😊' into a Person note, it fails and an error is logged:

SQLSTATE[22007]: Invalid datetime format: 1366 Incorrect string value: '\xF0\x9F\x98\x8A' for column `jethro`.`note_comment`.`contents` at row 1
Line 240 of /srv/www/jethro/2.31.1/include/db_object.class.php

USER:       1
REFERER:    https://jethro/?view=_edit_note&note_type=person&noteid=20649
USER_AGENT: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36

REQUEST:
Array
(
    [view] => _edit_note
    [note_type] => person
    [noteid] => 20649
    [update_note_submitted] => 1
    [contents] => All good thanks, looking at getting back very soon
Thanks for reaching out 😊
    [status] => pending
    [assignee] => 1
    [action_date_d] => 24
    [action_date_m] => 1
    [action_date_y] => 2022
)

The problem is that MySQL's 'utf8' encoding isn't real UTF-8, but a 3-byte subset that doesn't support all characters. The correct encoding to use in MySQL is 'utfmb4'. For more information, see here or here.

When I checked my Jethro database I found the database-level encoding was correct:

MariaDB [jethro]> SELECT default_character_set_name FROM information_schema.SCHEMATA                                                                               
    -> WHERE schema_name = "jethro";                                                                                                                               
                                                                                                                                                                   
+----------------------------+                                                                                                                                     
| default_character_set_name |                                                                                                                                     
+----------------------------+                                                                                                                                     
| utf8mb4                    |                                                                                                                                     
+----------------------------+   

but table-level encodings were wrong:

MariaDB [jethro]> SELECT CCSA.character_set_name FROM information_schema.`TABLES` T,        information_schema.`COLLATION_CHARACTER_SET_APP                        
LICABILITY` CCSA WHERE CCSA.collation_name = T.table_collation   AND T.table_schema = "jethro";                                                                    
+--------------------+                                                                                                                                             
| character_set_name |                                                                                                                                             
+--------------------+                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               |                                                                                                                                             
| utf8               | 

Fixing this needs two changes:

  1. The database and table encodings need to be changed from utf8 to utf8mb4.
  2. Jethro needs to initiate connections with utf8mb4 rather than utf8, as is currently hardcoded when Jethro constructs the connection dsn:
    $dsn = $type . ':host=' . $host . (strlen($port) ? (';port=' . $port) : '') . ';dbname=' . DB_DATABASE . ';charset=utf8';

Correspondingly, the fix requires two parts:

  1. Fix the database encodings. ​I followed this advice to generate appropriate SQL to upgrade all database tables. The result is in upgrades/2022-upgrade-to-2.32.sql
  2. Fix the database DSN connection string; done in include/jethrodb.php.

This requires MySQL 5.5.3 and above (released in 2010). I have adjusted the minimum MySQL version in README.md.

Note: I hope changing character sets isn't opening a can of worms. I've only tested on one (mariadb 10.3.32) instance. There may be more places where utf8 needs changing to utf8mb4. I hope other users can test and confirm that upgrades go smoothly.

@tbar0970
Copy link
Owner

tbar0970 commented Feb 3, 2022

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

@tbar0970
Copy link
Owner

tbar0970 commented Mar 17, 2022

Nice summary of the issue here: https://make.wordpress.org/core/2015/04/02/the-utf8mb4-upgrade/

Holding off on this for now since on MySQL 5.1 emojis are working fine...

@vanoudt
Copy link
Contributor

vanoudt commented Sep 5, 2022

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

@vanoudt
Copy link
Contributor

vanoudt commented Sep 5, 2022

I've updated #606 too match this too!

@tbar0970
Copy link
Owner

tbar0970 commented Sep 5, 2022

Hmm, I couldn't reproduce the original problem in my systems. Will dig in further to work out why.

Could it be that you have the sms sanitising option turned on? That would automatically strip out all multi-byte characters... and you might never see the error!

That wouldn't be it - the original problem was triggered by pasting emojis into a Jethro note, not by sending anything out.

@jefft jefft closed this by deleting the head repository Jan 23, 2023
@tbar0970
Copy link
Owner

tbar0970 commented Mar 7, 2023

Hey I'm thinking now would be a good time to merge this.
Not sure what "closed by deleting the head repository" means or came from?

@tbar0970 tbar0970 reopened this Mar 7, 2023
@jefft
Copy link
Contributor Author

jefft commented Mar 7, 2023

Holding off on this for now since on MySQL 5.1 emojis are working fine...

Yes, it was wise to hold off for that reason. The SQL in this PR would corrupt some older databases.

As your experience shows, emojis in MySQL 5.1 do work, if your database encoding is latin1 and you're using ancient PHP (5.3.3) which seems hardcode latin1 as the transfer encoding (ignoring charset=utf8 in include/jethrodb.php). Say we have a latin1 table, and we ask MySQL to insert the 4 utf8 bytes representing 😊. MySQL goes "close enough lol ¯_(ツ)_/¯" and adds the 4 raw bytes, despite them not being valid latin1. When asked to retrieve data, MySQL goes 'here's your bytes whatever they are'. Everything just works because PHP and MySQL store and load the raw bytes unaltered.

Now what happens if you take one of these latin1 databases containing utf8 bytes, and run the SQL I proposed in this PR? E.g.:

ALTER TABLE `family_note` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

I suspect that might break your characters. Say you have a '😊' encoded as 4 bytes: MySQL will try to map those 4 utf8 bytes one by one, to whatever unicode characters live at those 3 codepoints.

For latin1 databases, what's needed is a more involved process involving a mysqldump, then search & replace, and import, as documented at https://www.whitesmith.co/blog/latin1-to-utf8/. I have a draft guide written on how to do that.

What about the less severe case, where tables are created as utf8, and we run SQL to CONVERT CHARACTER SET utf8mb4? I think that would be safe.

So there's 2 kinds of Jethro databases: those with old latin1 databases that need surgery, and those with utf8 databases that just need converting, and we cannot recommend the same SQL for both cases. Perhaps we'll need a PHP script, upgrades/2023-fixcharsets.php, which does the right thing.

@jefft jefft closed this Mar 7, 2023
@tbar0970
Copy link
Owner

tbar0970 commented Mar 8, 2023

Nice analysis.

Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import. And/or ancient emojis can just suffer.

@jefft
Copy link
Contributor Author

jefft commented Mar 8, 2023

Any Jethro system created in the last few years will have utf8 encoding on each table, but possibly latin1 as the encoding for the overall database. Does this put it in category 1 (major surgery) or category 2 (smooth conversion)?

I don't know, but I'd expect the column-level charset to take precedence. If your columns are utf8 you're probably okay - just run some SQL to convert it (and the columns) to utf8mb4.

If it's only a few really ancient databases that are in category 1, I'd be happy to cover them with a warning in the upgrade notes and some instructions about how to dump, tweak and re-import.

Yes, I think that's fine.

I've tweaked my comment above, to point out that 😁 has a 4-byte utf8 representation.

As for how to tell category 1 and category 2 databases apart, 4-byte emojis are a useful shibboleth. Ask your Jethro "can you pronounce 😁"?
image
A modern Jethro using utf8 will not be able to store that. utf8 can only store 3-byte unicode, not 4-byte. You can see a utf8 Jethro fall over at https://easyjethro.com.au/demo/.
However an old Jethro, with latin1 database and PHP 5.x will be able to store and display 😁. If the smiley shows, you know you'll need database surgery to convert latin1 to utf8mb4.

@tbar0970
Copy link
Owner

Further diagnostic steps from Jeff:

Try running this query:
SELECT T.table_name, CCSA.character_set_name FROM information_schema.TABLES T, information_schema.COLLATION_CHARACTER_SET_APPLICABILITY CCSA WHERE CCSA.collation_name = T.table_collation AND T.table_schema = DATABASE();

If you get lots of 'latin1', but smileys still work, your database is basically corrupt - storing utf8 bytes in latin1 tables. The fix is as described in https://www.whitesmith.co/blog/latin1-to-utf8/, to dump the database with no charset transcoding:

mysqldump --routines --events --no-data --skip-set-charset --default-character-set=latin1 jethro > jethro.sql

Then do a massive search & replace on the .sql dump:

cat jethro.sql | sed -e 's/CHARACTER SET [^ ]+/CHARACTER SET utf8mb4/g'
-e 's/DEFAULT CHARSET=[^;]+;/DEFAULT CHARSET=utf8mb4;/g'
-e 's/SET (character_set_client|character_set_results)( +)= ([^@][^ ]+)([ ;])/SET \1\2= utf8mb4\4/g'
-e 's/SET (collation_connection)( +)= ([^ ]+) /SET \1\2= utf8mb4_bin /g'
-e 's/COLLATE [^ ]+_ci/COLLATE utf8mb4_unicode_ci/g'
-e 's/SET NAMES [^ ]+ /SET NAMES utf8mb4 /g' \

fixedjethro.sql

Then import the fixed SQL.

@jefft
Copy link
Contributor Author

jefft commented Sep 30, 2024

That diagnostic SQL query, with non-broken formatting:

SELECT t.table_name, 
       ccsa.character_set_name,
       t.table_collation
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
  AND t.table_schema = database();

This shows the encoding and collation of each table, e.g.:

+----------------------------------+--------------------+--------------------+
| table_name                       | character_set_name | table_collation    |
+----------------------------------+--------------------+--------------------+
| person_status                    | utf8mb4            | utf8mb4_bin        |
| 2fa_trust                        | utf8mb4            | utf8mb4_bin        |
| custom_field                     | utf8mb4            | utf8mb4_unicode_ci |
| person_group_membership_status   | utf8mb4            | utf8mb4_unicode_ci |
| _abstract_note_old_backup        | utf8mb4            | utf8mb4_unicode_ci |
| staff_member                     | utf8mb4            | utf8mb4_unicode_ci |
....

Encodings

Your character_set_name might be:

  • utf8mb4 - this is ideal
  • utf8 or utf8mb3 (an alias for utf8) - this is less ideal but livable. You won't be able to store 4-byte characters like 😁 in Jethro. You might like to convert all tables to utf8mb4.
  • latin1. This is a problem if your database has any non-ASCII, international characters (including emojis). See above comment on how to dump and re-import your database.

Collations

As for collations: whatever the table collation is (for non-binary tables), must be the same for all tables:

SELECT DISTINCT 
CHARACTER_SET_NAME,
                table_collation,
                count(*)
FROM information_schema.`tables` t,
     information_schema.`collation_character_set_applicability` ccsa
WHERE ccsa.collation_name = t.table_collation
  AND t.table_schema = database()
GROUP BY 1,
         2
ORDER BY 3 DESC;
+--------------------+--------------------+----------+
| CHARACTER_SET_NAME | table_collation    | count(*) |
+--------------------+--------------------+----------+
| utf8mb4            | utf8mb4_unicode_ci |       56 |
| utf8mb4            | utf8mb4_bin        |        4 |
+--------------------+--------------------+----------+
2 rows in set (0.002 sec)

If you mix, say, utf8_general_ci and utf8_unicode_ci, you will get "General error: 1271 Illegal mix of collations for operation 'UNION'" errors.

@jefft jefft mentioned this pull request Sep 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants