Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compose: drop the 65535 node limit #343

Merged
merged 2 commits into from
May 15, 2023
Merged

Conversation

alois31
Copy link
Contributor

@alois31 alois31 commented May 14, 2023

In commit 1638409, the number of compose nodes was limited to 65535 to enable "future optimizations", which apparently means slightly reduced memory usage due to fitting in a uint16_t. At this time, it was mentioned that the author was not aware of "any compose files which come close".

However, I'm one of the users that actually do require a larger number of nodes for their compose file. Thus, drop the limit and use a uint32_t again.

@wismill
Copy link
Member

wismill commented May 14, 2023

These are the biggest compose files that I know:

  • Standard en_US is under 6k entries.
  • Bépo: adds about 2k entries to base en_US file.
  • Neo2: base file is under 9k entries, complex one reaches 40k entries. Complete one reaches 80k: it adds entries to input characters by their Unicode code point, such as: “uu17f” for U+017F “ſ”. I am sceptical to input rarely used characters this way.

I am curious to know how do you get so much entries in your Compose file. Would you like to explain?

@alois31
Copy link
Contributor Author

alois31 commented May 14, 2023

Neo2: base file is under 9k entries, complex one reaches 40k entries. Complete one reaches 80k: it adds entries to input characters by their Unicode code point, such as: “uu17f” for U+017F “ſ”. I am sceptical to input rarely used characters this way.

I'm using a derivative of the Neo2 compose file that has a bit more than 40k entries (with Unicode, without keypad, but with a newer Unicode version hacked in). However, several compose sequences seem to require more than one node, since the warnings start shortly after line 35000.

@alois31
Copy link
Contributor Author

alois31 commented May 14, 2023

Thinking a bit more about it, a binary tree that does not remember its insertion order should not require more than an average 2 pointer bits per node. Maybe there is a way to support large compose files without increasing memory usage at all.

Copy link
Member

@bluetech bluetech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't aware the were files that big, knowing about https://git.neo-layout.org/neo/neo-layout/src/branch/master/Compose/src would make optimizing the compose implementation more fun :)

This PR increases the compose_node size from 16 bytes to 20 bytes. Let's take an average system, which uses the en_US.UTF-8 compose file (~7000 nodes) and has ~25 GUI apps with Compose loaded. Before the memory usage would be 2.67MiB, after 3.33MiB. That sounds OK to me.

Of course with larger Compose files like the Neo one, let's say 100,000 nodes, the difference becomes larger ~38MiB vs. ~47MiB. But that's not the average case and not too bad either.

I left one comment, but LGTM otherwise.

@@ -75,16 +75,13 @@
* \0 is so offset 0 points to an empty string).
*/

/* Fits in uint16_t, also a good idea to have some limit. */
#define MAX_COMPOSE_NODES 65535
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's please keep some limit; I think limits are generally good to have for robustness. How about 1 million?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've raised the limit to 8 million. That way, even a Neo2 compose with all potential Unicode characters (a bit more than 1 million) should fit easily.

@bluetech
Copy link
Member

Thinking a bit more about it, a binary tree that does not remember its insertion order should not require more than an average 2 pointer bits per node. Maybe there is a way to support large compose files without increasing memory usage at all.

The implementation is currently a ternary tree; it was previously a regular trie. It's a pretty fun problem if you can come up with a better data structure. The evaluation parameters are: parsing time, lookup time, memory usage, graceful handling of edge cases.

@alois31
Copy link
Contributor Author

alois31 commented May 15, 2023

The implementation is currently a ternary tree; it was previously a regular trie. It's a pretty fun problem if you can come up with a better data structure. The evaluation parameters are: parsing time, lookup time, memory usage, graceful handling of edge cases.

I'm not saying that a ternary tree is bad, I'm saying that storing the nodes in insertion order increases memory usage. That said, I'm not aware of an implementation that can do better than 20 bytes per node without O(n^2) parse time (which would obviously be unacceptable).

In commit 1638409, the number of
compose nodes was limited to 65535 to enable "future optimizations",
which apparently means slightly reduced memory usage due to fitting in
a uint16_t. At this time, it was mentioned that the author was not
aware of "any compose files which come close".

However, I'm one of the users that actually do require a larger number
of nodes for their compose file. Thus, use a uint32_t again and raise
the limit significantly.
@wismill wismill added enhancement Indicates new feature requests compose Indicates a need for improvements or additions to Compose handling labels May 15, 2023
src/compose/parser.c Outdated Show resolved Hide resolved
@bluetech bluetech merged commit f3210cb into xkbcommon:master May 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compose Indicates a need for improvements or additions to Compose handling enhancement Indicates new feature requests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants