Make all regexes constants lazy loaded from pregenerated files #9

radarek · 2021-09-30T20:30:55Z

This PR addresses the problem described in #8.

Instead of generating the regexes on the fly when the gem is loaded, they are now pre-generated. Every regex constant lives in a separate file and is loaded through Ruby's autoload. A constant that is never used is never loaded, so it takes no memory.
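To illustrate the lazy loading described above, here is a minimal sketch of how autoload defers reading a file until the constant is first referenced. The module name LazyDemo, the file name demo_regex.rb and the regex itself are made up for the illustration; they are not the gem's actual layout.

```ruby
require "tmpdir"

# Hypothetical stand-in for one pre-generated constant file.
dir = Dir.mktmpdir
path = File.join(dir, "demo_regex.rb")
File.write(path, <<~RUBY)
  module LazyDemo
    REGEX = /a|b|c/
  end
RUBY

module LazyDemo; end

# Register the file for the constant; nothing is read from disk yet.
LazyDemo.autoload(:REGEX, path)

pending_before = LazyDemo.autoload?(:REGEX) # the registered path while still unloaded
matched = LazyDemo::REGEX.match?("b")       # first access requires the file
pending_after = LazyDemo.autoload?(:REGEX)  # nil once the constant has been loaded
```

Until LazyDemo::REGEX is referenced, the Regexp object does not exist at all, which is where the memory saving comes from.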

To generate regexes, just run:

ruby data/generate_constants.rb

Before this PR, memory usage presented as:

$ ruby mem.rb
7.65625 # MB

Every run gives a slightly different result, but it oscillates around 7.5 MB on macOS and 6.5 MB on Linux.

After this PR:

$ ruby mem.rb
0.82421875 # MB

On both macOS and Linux it oscillates around 0.82 MB, a difference of roughly 8-9x.

This was measured on ruby 2.7, by this program:

require 'get_process_mem'

def mem(&block)
  raise ArgumentError, 'missing block' unless block

  mem = GetProcessMem.new
  before = mem.mb
  block.call
  after = mem.mb
  return after - before
end

puts mem { require "./lib/unicode/emoji"; Unicode::Emoji::REGEX }

Also, the time to require the gem dropped roughly 4x (0.088 s to 0.024 s wall-clock):

# before, main branch
$ ruby -rbenchmark -e 'puts Benchmark.measure { require "./lib/unicode/emoji" }'
  0.071269   0.012206   0.083475 (  0.087639)
# after, PR branch
$ ruby -rbenchmark -e 'puts Benchmark.measure { require "./lib/unicode/emoji" }'
  0.013224   0.011000   0.024224 (  0.024226)

radarek · 2021-09-30T20:42:19Z

I removed the code that checks whether native regex properties can be used. I'm not sure I get the idea. Why not use the same regexes in all environments, even those that support emoji properties? Wouldn't it be better to have consistent behaviour on every platform? The way it was done can introduce subtle differences, depending on whether non-native or native emoji properties were used.

However, if this is still needed, I can handle that as well. It would require generating two versions of the regexes and deciding dynamically which one to load.

janlelis · 2021-09-30T21:10:04Z

Thanks for the PR, I will take a closer look over the next few days.

Regarding the native Emoji properties: It certainly adds complexity, but there is a lot to be gained when Ruby's own regex properties can be used: a much smaller regex gets generated, which is good not only for memory usage but also for performance. I agree that there is potential for undesired behavior, but since both this library and Ruby core integrate the Unicode Standard data very closely, there shouldn't be any deviations (if there are, it's a bug).

radarek · 2021-10-03T19:47:29Z

I made two changes.

Support for native regex properties

Generated regexes are saved to lib/unicode/emoji/generated and lib/unicode/emoji/generated_native directories.
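A rough sketch of how the choice between the two directories could be made at load time. The helper names and the file naming are assumptions for illustration only; the gem's real check may differ, for example by also comparing Ruby's Unicode data version with the one the gem targets.

```ruby
# Hypothetical check: does this Ruby's regex engine know the \p{Emoji} property?
def native_emoji_property_supported?
  Regexp.new("\\p{Emoji}")
  true
rescue RegexpError
  # Raised when the property name is unknown to the regex engine.
  false
end

GENERATED_DIRECTORY =
  native_emoji_property_supported? ? "generated_native" : "generated"

# Build the path a constant would be autoloaded from (file naming is assumed).
def generated_path(constant_name)
  File.join("unicode/emoji", GENERATED_DIRECTORY, constant_name.to_s.downcase)
end

puts generated_path(:REGEX_BASIC)
```

Deciding once, when the library is required, keeps the autoload registrations themselves identical for both variants.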

Character classes

For a run of at least 3 consecutive characters, a character class [a-x] is used instead of a big union a|b|c|d|...|x.
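The idea can be sketched as follows. This is not the code from data/generate_constants.rb; codepoint_runs and character_class are hypothetical helpers that only show how runs of consecutive codepoints collapse into ranges.

```ruby
# Hypothetical helper: collapse codepoints into [first, last] runs of consecutive values.
def codepoint_runs(codepoints)
  codepoints.sort.uniq
            .slice_when { |a, b| b != a + 1 }
            .map { |run| [run.first, run.last] }
end

# Build the source of a character class, using a range only for runs of 3 or more.
# An empty input is not handled in this sketch.
def character_class(codepoints)
  body = codepoint_runs(codepoints).map do |first, last|
    if last - first >= 2
      format("\\u{%X}-\\u{%X}", first, last)
    else
      (first..last).map { |cp| format("\\u{%X}", cp) }.join
    end
  end.join
  "[#{body}]"
end

codepoints = [0x1F600, 0x1F601, 0x1F602, 0x1F603, 0x1F680]
source = character_class(codepoints)
regex = Regexp.new(source)

puts source
puts regex.match?("\u{1F602}")
```

Four consecutive codepoints end up as one range and the isolated one stays a single escape, so the regex source grows with the number of runs, not the number of characters.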

To get memory usage of regexes, I used this code:

require 'objspace'

Unicode::Emoji.constants.sort.filter_map { const = Unicode::Emoji.const_get(_1); const.is_a?(Regexp) ? [_1, ObjectSpace.memsize_of(const)] : nil }

ObjectSpace.memsize_of returns the memory size of a given object. It doesn't follow referenced objects, so for Arrays and Hashes it won't take the actual sizes of their elements into account. Regexp objects don't hold references, so this is not a problem.
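A small sketch of the "doesn't follow references" behaviour, assuming CRuby (exact byte counts vary between Ruby versions and platforms, so none are shown):

```ruby
require "objspace"

big_string = "x" * 1_000_000
container = [big_string]

# Only the array's own slot and buffer are counted, not the string it points to.
container_size = ObjectSpace.memsize_of(container)

# The string reports its own buffer, so it is far larger than its container.
string_size = ObjectSpace.memsize_of(big_string)

# A Regexp reports its own size in a single number.
regexp_size = ObjectSpace.memsize_of(/a|b|c/)

puts container_size < string_size
```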

Comparing main branch with this PR:

# const name, usage on main (bytes), usage on PR (bytes), diff (main - PR bytes), diff %
REGEX main: 149000, PR: 123672, diff: 25328, -16%
REGEX_ANY main: 55024, PR: 5088, diff: 49936, -90%
REGEX_BASIC main: 60800, PR: 6680, diff: 54120, -89%
REGEX_INCLUDE_TEXT main: 179340, PR: 129996, diff: 49344, -27%
REGEX_PICTO main: 139240, PR: 2924, diff: 136316, -97%
REGEX_PICTO_NO_EMOJI main: 86176, PR: 4824, diff: 81352, -94%
REGEX_TEXT main: 61064, PR: 6760, diff: 54304, -88%
REGEX_VALID main: 399228, PR: 282988, diff: 116240, -29%
REGEX_VALID_INCLUDE_TEXT main: 429568, PR: 289312, diff: 140256, -32%
REGEX_WELL_FORMED main: 368200, PR: 40036, diff: 328164, -89%
REGEX_WELL_FORMED_INCLUDE_TEXT main: 428880, PR: 46360, diff: 382520, -89%
Total main: 2356520, PR: 938640, diff: 1417880, -60%

I didn't check it, but after this change some regexes may perform better. I suspect that with a big union the regex engine has to check every alternative one by one, whereas with a character class containing a range it (probably) only compares against the beginning and end of the range.

data/generate_constants.rb

janlelis · 2021-10-04T07:35:32Z

Hi Radosław, thank you for all the work - the PR looks great and is a huge memory improvement.

I am going to merge this, but I'd like to ask you about two small details:

What do you think about the above suggestion?
Could you create a (super simple) rake task, which triggers the regex build script?

radarek · 2021-10-04T16:14:33Z

> Hi Radosław, thank you for all the work - the PR looks great and is a huge memory improvement.

> I am going to merge this, but I'd like to ask you about two small details:

Before merging, I wanted to ask whether you think the current specs are sufficient to feel comfortable with the changes in this PR. I didn't dig too deep, but I see they check some things and not everything. For example, I had a misspelled constant name EXTENDED_PICTOGRAPHIC_NO_EMOJ (missing I) and the specs didn't catch that.

> What do you think about the above suggestion?

I like it! I have no idea why I didn't think of that :D. It doesn't make a huge difference (from what I checked briefly, it saves just a couple of kilobytes of memory), but there is no reason not to have it. I will create a commit.

> Could you create a (super simple) rake task, which triggers the regex build script?

Absolutely, I'll do it.

Co-authored-by: Jan Lelis <[email protected]>

radarek · 2021-10-04T16:25:17Z

Rakefile

+
+desc "#{gemspec.name} | Generates all regex constants and saves them to lib/unicode/emoji/{generated,generated_native} directories"
+task :generate_constants do
+  load "data/generate_constants.rb", true


Passing true to load "wraps" the loaded code inside a temporary anonymous module. So, for example, when the loaded code does a global include or defines global methods (as our script "generate_constants.rb" does), it won't affect the global (main) object.
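A minimal sketch of the difference, using a throwaway script in place of generate_constants.rb. The constant and method names are invented for the demonstration.

```ruby
require "tempfile"

# Throwaway script that defines a top-level constant and a top-level method.
script = Tempfile.new(["wrap_demo", ".rb"])
script.write(<<~RUBY)
  WRAP_DEMO_CONSTANT = 42

  def wrap_demo_helper
    :helper
  end
RUBY
script.close

# With wrap = true the definitions land in an anonymous module.
load script.path, true
wrapped_leaked_constant = Object.const_defined?(:WRAP_DEMO_CONSTANT)
wrapped_leaked_method = Object.private_method_defined?(:wrap_demo_helper)

# A plain load defines them globally.
load script.path
plain_leaked_constant = Object.const_defined?(:WRAP_DEMO_CONSTANT)
plain_leaked_method = Object.private_method_defined?(:wrap_demo_helper)

puts [wrapped_leaked_constant, wrapped_leaked_method].inspect
puts [plain_leaked_constant, plain_leaked_method].inspect
```

For a rake task this means that running the generator does not leave its helper methods or includes behind in the Rake process.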

janlelis · 2021-10-05T20:16:59Z

Many thanks for the latest adaptations!

> Before merging I was about to ask whether you think that current specs are sufficient to feel comfortable with this PR changes? I didn't dig too much but I see they are checking some stuff but not everything. For example, I had a misspelled constant name EXTENDED_PICTOGRAPHIC_NO_EMOJ (missing I) and it didn't catch that.

Do you remember the details about this? It should have been caught by unicode-emoji/spec/unicode_emoji_spec.rb, lines 410 to 415 in 91a6172:

    describe "REGEX_PICTO" do
      it "matches codepoints with Extended_Pictograph property, but no Emoji property" do
        matches = "U+1F32D 🌭 HOT DOG, U+203C ‼ DOUBLE EXCLAMATION MARK, U+26E8 ⛨ BLACK CROSS ON SHIELD".scan(Unicode::Emoji::REGEX_PICTO_NO_EMOJI)
        assert_equal ["⛨"], matches
      end
    end

radarek · 2021-10-05T21:06:24Z

> Great thanks for the last adaptions!

No problem. I'm happy you merged my PR!

> Do you remember the details about this? (should have been caught by unicode-emoji/spec/unicode_emoji_spec.rb, lines 410 to 415 in 91a6172)

This spec checks a different constant (REGEX_PICTO_NO_EMOJI, not EXTENDED_PICTOGRAPHIC_NO_EMOJI).
Searching for the phrase "EXTENDED_PICTOGRAPHIC_NO_EMOJI" gives no match at all in the spec/ directory.

janlelis · 2021-10-06T19:26:46Z

@radarek Although the constants are useful and (I think) very clean, using them directly is not yet encouraged. I have opened #10 to track this.

jywarren · 2021-11-30T15:58:51Z

Hi, just to clarify, does the major version number bump indicate a breaking change for downstream users of this gem, or can we continue using it as we had before? Thank you! Love the library!

janlelis · 2021-11-30T20:08:22Z

Hi @jywarren, I chose the major version step to indicate that there were fundamental changes in the implementation, so it might behave differently (better memory usage) and might have bugs because of that (doesn't seem so, so far). However, there were no API changes. Thanks for using the library!

Make all regexes constants lazy loaded from pregenerated files

505c78b

Add support for generating regexes with native properties

fbb25d7

radarek force-pushed the lower-memory-overhead branch from 5532e0f to fbb25d7 Compare October 3, 2021 18:15

Use character classes [a-b] for a union of consecutive characters

add3bdd

radarek force-pushed the lower-memory-overhead branch from 1acd3ff to add3bdd Compare October 3, 2021 18:47

janlelis reviewed Oct 4, 2021

janlelis mentioned this pull request Oct 4, 2021

New architecture proposal to reduce memory usage #8

Closed

radarek and others added 2 commits October 4, 2021 18:22

Create single character class for entire list of ords.

f7aec4d

Co-authored-by: Jan Lelis <[email protected]>

Add rake task to generate regex constants

3a8b744

radarek force-pushed the lower-memory-overhead branch from 377f74a to 3a8b744 Compare October 4, 2021 16:22

radarek marked this pull request as ready for review October 4, 2021 16:23

janlelis merged commit 8b107b9 into janlelis:main Oct 5, 2021

jywarren mentioned this pull request Nov 30, 2021

remove unicode-emoji from 2.8.0 to 3.1.0 publiclab/plots2#10532

Merged
