Optimize _expand_named_fields #139

Merged 1 commit on Nov 9, 2023

Luffbee (Contributor) commented Dec 2, 2021

The original re.match does a simple job here; using find and slicing is more efficient.
I found this problem when parsing large files, where my patterns only use simple field names like 'aaa'. (I think simple names are the common case, which should be optimized.) What I did with the large file looks like this:

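import parse

pat = parse.compile("some pattern with {simple_name}")
with open(fname, "r") as f:  # fname: path to the large input file
    for line in f:  # iterate lazily instead of loading every line with readlines()
        res = pat.parse(line)
        # use the res to construct some simple objects
        # ...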

Here are the timing and profiling results from IPython:

# timing (ipython %timeit)

# original code with re.match
6.34 s ± 21.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# this PR with find and slicing
5.02 s ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# profiling (ipython %prun, truncated)

# original code with re.match
         49321473 function calls in 13.133 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.693    0.000   10.139    0.000 parse.py:961(evaluate_result)
   657504    1.189    0.000    3.676    0.000 parse.py:941(_expand_named_fields)
        1    1.098    1.098   13.070   13.070 rate.py:46(from_log)
  3945024    1.038    0.000    1.038    0.000 {method 'match' of 're.Pattern' objects}
  4602528    1.007    0.000    1.388    0.000 re.py:289(_compile)
  1315008    0.921    0.000    2.028    0.000 parse.py:537(__call__)
  3287520    0.710    0.000    2.216    0.000 re.py:188(match)
  3945024    0.603    0.000    0.843    0.000 parse.py:985(<genexpr>)
  8547552    0.593    0.000    0.593    0.000 {built-in method builtins.isinstance}
  3287520    0.554    0.000    0.735    0.000 parse.py:1289(__getitem__)
   657504    0.403    0.000   11.083    0.000 parse.py:886(parse)


# this PR with find and slicing
         36171393 function calls in 10.062 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   657504    2.544    0.000    7.208    0.000 parse.py:966(evaluate_result)
        1    0.974    0.974   10.001   10.001 rate.py:46(from_log)
  1315008    0.917    0.000    2.043    0.000 parse.py:537(__call__)
   657504    0.654    0.000    0.946    0.000 parse.py:941(_expand_named_fields)
  3945024    0.584    0.000    0.801    0.000 parse.py:990(<genexpr>)
  3287520    0.514    0.000    0.707    0.000 parse.py:1294(__getitem__)
   657504    0.481    0.000    0.481    0.000 {method 'match' of 're.Pattern' objects}
   657504    0.389    0.000    8.148    0.000 parse.py:886(parse)
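
To make the numbers above easier to compare, here is a small sketch that derives a few ratios from them. All figures are copied from the %timeit and %prun output in this comment; the variable names are only illustrative.

```python
# Figures copied from the %timeit and %prun output above.
timeit_before, timeit_after = 6.34, 5.02        # seconds per loop (%timeit)
expand_before, expand_after = 3.676, 0.946      # cumtime of _expand_named_fields (%prun)
match_calls_before = 3945024                    # 're.Pattern'.match calls, original code
match_calls_after = 657504                      # 're.Pattern'.match calls, this PR
parse_calls = 657504                            # calls to parse(), one per input line

# Relative reduction in wall-clock time for the whole loop.
wall_saving = 1 - timeit_after / timeit_before

# Speedup of _expand_named_fields itself (cumulative time, under the profiler).
expand_speedup = expand_before / expand_after

# How many regex matches each parse() call triggered before and after.
matches_per_parse_before = match_calls_before / parse_calls
matches_per_parse_after = match_calls_after / parse_calls

print(f"wall-clock saving: {wall_saving:.1%}")
print(f"_expand_named_fields speedup: {expand_speedup:.1f}x")
print(f"regex matches per parse: {matches_per_parse_before:g} -> {matches_per_parse_after:g}")
```

The match count per parse drops from 6 to 1, which is consistent with the per-field re.match calls being removed; the one remaining match per line is presumably the main pattern match itself.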

wimglenn (Collaborator) commented Nov 9, 2023

Seems reasonable. This is not exactly logically equivalent: for example, if the input was '[aaa]', the existing code will raise, but this code will return basename, subkeys as "", "[aaa]". But I don't see any handling around the AttributeError, so I don't think it should cause any issue.
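
The difference described here can be reproduced with a minimal sketch. Both helper names are hypothetical, and the regular expression is an assumption about what the earlier code used; the find-and-slice variant follows the behaviour described in this thread rather than quoting the merged diff.

```python
import re


def split_with_regex(field):
    """Split 'aaa[bbb][ccc]' into ('aaa', '[bbb][ccc]') using re.match.

    Assumed shape of the earlier approach: the first group needs at least one
    non-'[' character, so a field starting with '[' yields no match and
    calling .groups() on None raises AttributeError.
    """
    return re.match(r"([^\[]+)(.*)", field).groups()


def split_with_find(field):
    """Split the same way using str.find and slicing, with no regex."""
    n = field.find("[")
    if n == -1:
        # Simple field name, no subkeys (the common case).
        return field, ""
    return field[:n], field[n:]


# The two agree on ordinary field names.
for name in ("aaa", "aaa[bbb]", "aaa[bbb][ccc]"):
    assert split_with_regex(name) == split_with_find(name)

# They differ when the field starts with '['.
try:
    split_with_regex("[aaa]")
    regex_raised = False
except AttributeError:
    regex_raised = True

print(regex_raised)              # True
print(split_with_find("[aaa]"))  # ('', '[aaa]')
```

In the simple-name case the find variant does a single substring search and returns the field unchanged, which is where the saving in the profile comes from.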

@wimglenn wimglenn merged commit 286bcb1 into r1chardj0n3s:master Nov 9, 2023