Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make all regexes constants lazy loaded from pregenerated files #9

Merged
merged 5 commits into from
Oct 5, 2021

Conversation

radarek
Copy link
Contributor

@radarek radarek commented Sep 30, 2021

This PR addresses the problem #8.

Instead of generating regexes on the fly, when gem is loaded, they are now pre-generated. Every regex constants is in separate file and is loaded through Ruby's autoload. If it is not used then it doesn't take memory.

To generate regexes, just run:

ruby data/generate_constants.rb

Before this PR, memory usage presented as:

$ ruby mem.rb
7.65625 # MB

Every run gives a little bit different result but it oscillates around ~7.5MB on mac os, and 6.5MB on linux.

After the this PR:

$ ruby mem.rb
0.82421875 # MB

Both on mac os and linux it oscillates around 0.82MB. Difference is 8-9x.

This was measured on ruby 2.7, by this program:

require 'get_process_mem'

def mem(&block)
  raise ArgumentError, 'missing block' unless block

  mem = GetProcessMem.new
  before = mem.mb
  block.call
  after = mem.mb
  return after - before
end

puts mem { require "./lib/unicode/emoji"; Unicode::Emoji::REGEX }

Also, time of require dropped 4x:

# before, main branch
$ ruby -rbenchmark -e 'puts Benchmark.measure { require "./lib/unicode/emoji" }'
  0.071269   0.012206   0.083475 (  0.087639)
# after, PR branch
$ ruby -rbenchmark -e 'puts Benchmark.measure { require "./lib/unicode/emoji" }'
  0.013224   0.011000   0.024224 (  0.024226)

@radarek
Copy link
Contributor Author

radarek commented Sep 30, 2021

I removed the code checking if native regex properties can be used. I'm not sure if I get the idea. Why not use the same regexes on all environment, even if they support emoji properties? Wouldn't it be better to have consistent behaviour on every platform? They way it was done can make some nuances, depending whether non-native or native emoji properties was used.

However, if this is still needed then I can handle that as well. It would require to generate two versions of regexes and dynamically decide which one should be loaded.

@janlelis
Copy link
Owner

janlelis commented Sep 30, 2021

Thanks for the PR, I will take a closer look over the next few days.

Regarding the native Emoji properties: It certainly adds complexity, but there is a lot to be won when Ruby's own regex properties can be used: A much smaller regex gets generated, which is good for not only for memory usage, but also for performance. I agree that there is potential for undesired behavior there, but since both - this library and the Ruby core - integrate the Unicode Standard data very closely, there shouldn't be any deviations (if so, it's a bug).

@radarek
Copy link
Contributor Author

radarek commented Oct 3, 2021

I made two changes.

Support for native regex properties

Generated regexes are saved to lib/unicode/emoji/generated and lib/unicode/emoji/generated_native directories.

Character classes

For a list of consecutive characters (at least 3) instead of big union a|b|c|d|...|x character class [a-x] is used.

To get memory usage of regexes, I used this code:

require 'objspace'

Unicode::Emoji.constants.sort.filter_map { const = Unicode::Emoji.const_get(_1); const.is_a?(Regexp) ? [_1, ObjectSpace.memsize_of(const)] : nil }

ObjectSpace.memsize_of - returns memory size for given object. It doesn't follow referenced objects, so for Array and Hashes it won't take into account actual sizes of elements. Regexp objects don't have a references so it is no a problem.

Comparing main branch with this PR:

# const name, usage on main (bytes), usage on PR (bytes), diff (main - PR bytes), diff %
REGEX main: 149000, PR: 123672, diff: 25328, -16%
REGEX_ANY main: 55024, PR: 5088, diff: 49936, -90%
REGEX_BASIC main: 60800, PR: 6680, diff: 54120, -89%
REGEX_INCLUDE_TEXT main: 179340, PR: 129996, diff: 49344, -27%
REGEX_PICTO main: 139240, PR: 2924, diff: 136316, -97%
REGEX_PICTO_NO_EMOJI main: 86176, PR: 4824, diff: 81352, -94%
REGEX_TEXT main: 61064, PR: 6760, diff: 54304, -88%
REGEX_VALID main: 399228, PR: 282988, diff: 116240, -29%
REGEX_VALID_INCLUDE_TEXT main: 429568, PR: 289312, diff: 140256, -32%
REGEX_WELL_FORMED main: 368200, PR: 40036, diff: 328164, -89%
REGEX_WELL_FORMED_INCLUDE_TEXT main: 428880, PR: 46360, diff: 382520, -89%
Total main: 2356520, PR: 938640, diff: 1417880, -60%

I didn't check it but after this change, some regexes may perform better. I suspect that when a big list of unions is used then regex engine has to check every union element one by one. When character class with a range is used it (probably) checks by comparing to beginning and end of a range.

data/generate_constants.rb Outdated Show resolved Hide resolved
@janlelis
Copy link
Owner

janlelis commented Oct 4, 2021

Hi Radosław, thank you for all the work - the PR looks great and is a huge memory improvement.

I am going to merge this, but I'd like you to ask about two small details:

  • What do you think about the above suggestion?
  • Could you create a (super simple) rake task, which triggers the regex build script?

@radarek
Copy link
Contributor Author

radarek commented Oct 4, 2021

Hi Radosław, thank you for all the work - the PR looks great and is a huge memory improvement.

I am going to merge this, but I'd like you to ask about two small details:

Before merging I was about to ask whether you think that current specs are sufficient to feel comfortable with this PR changes? I didn't dig too much but I see they are checking some stuff but not everything. For example, I had a misspelled constant name EXTENDED_PICTOGRAPHIC_NO_EMOJ (missing I) and it didn't catch that.

  • What do you think about the above suggestion?

I like it! I have no idea why I didn't think about that :D. I doesn't make a huge difference (from what I checked briefly, it saves just couple of kilobytes of memory) but there is no reason to not have it. I will create a commit.

  • Could you create a (super simple) rake task, which triggers the regex build script?

Absolutely, I'll do it.

@radarek radarek marked this pull request as ready for review October 4, 2021 16:23

desc "#{gemspec.name} | Generates all regex constants and saves them to lib/unicode/emoji/{generated,generated_native} directories"
task :generate_constants do
load "data/generate_constants.rb", true
Copy link
Contributor Author

@radarek radarek Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing true to load will "wrap" loaded code inside a temporary module. So for example, when loaded code make global include or defines global methods (like our script "generate_constants.rb" does) then it won't affect a global (main) object.

@janlelis
Copy link
Owner

janlelis commented Oct 5, 2021

Great thanks for the last adaptions!

Before merging I was about to ask whether you think that current specs are sufficient to feel comfortable with this PR changes? I didn't dig too much but I see they are checking some stuff but not everything. For example, I had a misspelled constant name EXTENDED_PICTOGRAPHIC_NO_EMOJ (missing I) and it didn't catch that.

Do you remember the details about this? (should have been caught by

describe "REGEX_PICTO" do
it "matches codepoints with Extended_Pictograph property, but no Emoji property" do
matches = "U+1F32D 🌭 HOT DOG, U+203C ‼ DOUBLE EXCLAMATION MARK, U+26E8 ⛨ BLACK CROSS ON SHIELD".scan(Unicode::Emoji::REGEX_PICTO_NO_EMOJI)
assert_equal ["⛨"], matches
end
end
)

@janlelis janlelis merged commit 8b107b9 into janlelis:main Oct 5, 2021
@radarek
Copy link
Contributor Author

radarek commented Oct 5, 2021

Great thanks for the last adaptions!
No problem. I'm happy you merged my PR!

Do you remember the details about this? (should have been caught by

describe "REGEX_PICTO" do
it "matches codepoints with Extended_Pictograph property, but no Emoji property" do
matches = "U+1F32D 🌭 HOT DOG, U+203C ‼ DOUBLE EXCLAMATION MARK, U+26E8 ⛨ BLACK CROSS ON SHIELD".scan(Unicode::Emoji::REGEX_PICTO_NO_EMOJI)
assert_equal ["⛨"], matches
end
end

)

This spec checks different constant (EXTENDED_PICTOGRAPHIC_NO_EMOJI vs REGEX_PICTO_NO_EMOJI).
Searching by "EXTENDED_PICTOGRAPHIC_NO_EMOJI" phrase, there is no single match in spec/ directory.

@janlelis
Copy link
Owner

janlelis commented Oct 6, 2021

@radarek Although the constants are useful and (I think) very clean, it's not yet encouraged to use them directly. I have opened #10 to track this

@jywarren
Copy link

Hi, just to clarify, does the major version number bump indicate a breaking change for downstream users of this gem, or can we continue using it as we had before? Thank you! Love the library!

@janlelis
Copy link
Owner

Hi @jywarren, I had chosen the major version step to indicate that there were fundamental changes in the implementation, so it might behave differently (better memory usage) and might have bugs because of that (doesn't seems so, so far) - however, there were no API changes. Thanks for using the library!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants