Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance? #7

Open
AntiCompositeNumber opened this issue Mar 23, 2020 · 8 comments
Open

Improve performance? #7

AntiCompositeNumber opened this issue Mar 23, 2020 · 8 comments

Comments

@AntiCompositeNumber
Copy link
Owner

AntiCompositeNumber commented Mar 23, 2020

>>> p.sort_stats(pstats.SortKey.TIME).print_stats(30)
Mon Mar 23 02:27:58 2020    sigprobs_profile

         5200087 function calls (5155274 primitive calls) in 495.606 seconds

   Ordered by: internal time
   List reduced from 1999 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      104  402.013    3.866  402.013    3.866 {method 'recv_into' of '_socket.socket' objects}
     2036   85.032    0.042   85.032    0.042 {method 'read' of '_ssl._SSLSocket' objects}
     5979    0.262    0.000    0.262    0.000 {built-in method posix.stat}
     4189    0.183    0.000    0.264    0.000 url.py:210(_encode_invalid_chars)
        1    0.167    0.167  495.195  495.195 sigprobs.py:378(main)
      205    0.150    0.001    0.172    0.001 <frozen importlib._bootstrap_external>:914(get_data)
    91358    0.149    0.000    0.364    0.000 os.py:673(__getitem__)
   108346    0.141    0.000    0.647    0.000 _collections_abc.py:742(__iter__)
     1901    0.141    0.000    0.141    0.000 {method 'write' of '_ssl._SSLSocket' objects}
      110    0.135    0.001    0.137    0.001 {built-in method io.open}
   401706    0.129    0.000    0.171    0.000 {built-in method builtins.isinstance}
   534880    0.127    0.000    0.127    0.000 {method 'lower' of 'str' objects}
     1015    0.116    0.000    0.308    0.000 feedparser.py:471(_parse_headers)
    21322    0.104    0.000    0.312    0.000 parse.py:361(urlparse)
   311422    0.102    0.000    0.102    0.000 {method 'decode' of 'bytes' objects}
    91358    0.088    0.000    0.138    0.000 os.py:751(encode)
    48898    0.086    0.000    0.086    0.000 {method 'match' of 're.Pattern' objects}
     2030    0.085    0.000    0.667    0.000 request.py:2456(getproxies_environment)
     1015    0.084    0.000    0.890    0.001 client.py:203(parse_headers)
   166462    0.083    0.000    0.148    0.000 os.py:755(decode)
    49753    0.080    0.000    0.092    0.000 parse.py:109(_coerce_args)
    24371    0.080    0.000    0.140    0.000 parse.py:412(urlsplit)
     9135    0.077    0.000    0.142    0.000 message.py:462(get)
     5077    0.076    0.000    0.220    0.000 _collections_abc.py:824(update)
    31322    0.076    0.000    0.128    0.000 _policybase.py:293(header_source_parse)
     2030    0.070    0.000    0.577    0.000 feedparser.py:218(_parsegen)
    85260    0.068    0.000    0.140    0.000 os.py:696(__iter__)
166141/166140    0.067    0.000    0.067    0.000 {method 'encode' of 'str' objects}
 6877/985    0.059    0.000    0.496    0.001 utils.py:36(parse_anything)
31440/11160    0.059    0.000    0.109    0.000 cookiejar.py:1214(deepvalues)
>>> p.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(30)
Mon Mar 23 02:27:58 2020    sigprobs_profile

         5200087 function calls (5155274 primitive calls) in 495.606 seconds

   Ordered by: cumulative time
   List reduced from 1999 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    219/1    0.002    0.000  495.638  495.638 {built-in method builtins.exec}
        1    0.000    0.000  495.638  495.638 sigprobs.py:21(<module>)
        1    0.167    0.167  495.195  495.195 sigprobs.py:378(main)
     2140    0.016    0.000  487.096    0.228 socket.py:575(readinto)
      902    0.008    0.000  402.222    0.446 sigprobs.py:60(iter_active_user_sigs)
     1410    0.021    0.000  402.091    0.285 connections.py:648(_read_packet)
      100    0.001    0.000  402.068    4.021 cursors.py:151(execute)
      100    0.001    0.000  402.061    4.021 cursors.py:451(_query)
      100    0.001    0.000  402.058    4.021 connections.py:508(query)
      100    0.001    0.000  402.047    4.020 connections.py:720(_read_query_result)
      100    0.002    0.000  402.046    4.020 connections.py:1086(init_unbuffered_query)
     2820    0.012    0.000  402.045    0.143 connections.py:687(_read_bytes)
     2861    0.006    0.000  402.021    0.141 {method 'read' of '_io.BufferedReader' objects}
      104  402.013    3.866  402.013    3.866 {method 'recv_into' of '_socket.socket' objects}
      886    0.039    0.000   92.566    0.104 sigprobs.py:197(check_sig)
     1015    0.027    0.000   91.829    0.090 sessions.py:463(request)
     1015    0.041    0.000   89.084    0.088 sessions.py:614(send)
     1015    0.034    0.000   88.646    0.087 adapters.py:394(send)
     1015    0.048    0.000   87.597    0.086 connectionpool.py:494(urlopen)
     1015    0.046    0.000   87.008    0.086 connectionpool.py:351(_make_request)
      886    0.012    0.000   86.469    0.098 sigprobs.py:218(get_lint_errors)
      886    0.009    0.000   86.334    0.097 sessions.py:567(post)
     1015    0.012    0.000   86.136    0.085 client.py:1292(getresponse)
     1015    0.031    0.000   86.081    0.085 client.py:299(begin)
     1015    0.035    0.000   85.076    0.084 client.py:266(_read_status)
    33404    0.041    0.000   85.059    0.003 {method 'readline' of '_io.BufferedReader' objects}
     2036    0.012    0.000   85.055    0.042 ssl.py:1041(recv_into)
     2036    0.010    0.000   85.042    0.042 ssl.py:902(read)
     2036   85.032    0.042   85.032    0.042 {method 'read' of '_ssl._SSLSocket' objects}
      129    0.001    0.000    5.505    0.043 sessions.py:534(get)
>>> p.strip_dirs().sort_stats(pstats.SortKey.FILENAME, pstats.SortKey.CUMULATIVE).print_stats()
Mon Mar 23 02:27:58 2020    sigprobs_profile

         5200087 function calls (5155274 primitive calls) in 495.606 seconds

   Ordered by: file name, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)

(...)
        1    0.000    0.000  495.638  495.638 sigprobs.py:21(<module>)
        1    0.167    0.167  495.195  495.195 sigprobs.py:378(main)
      902    0.008    0.000  402.222    0.446 sigprobs.py:60(iter_active_user_sigs)
      886    0.039    0.000   92.566    0.104 sigprobs.py:197(check_sig)
      886    0.012    0.000   86.469    0.098 sigprobs.py:218(get_lint_errors)
      128    0.002    0.000    5.436    0.042 sigprobs.py:317(evaluate_subst)
      886    0.011    0.000    4.599    0.005 sigprobs.py:236(check_links)
      886    0.002    0.000    1.447    0.002 sigprobs.py:345(check_tildes)
      985    0.028    0.000    0.596    0.001 sigprobs.py:255(compare_links)
        1    0.001    0.001    0.085    0.085 sigprobs.py:122(get_site_data)
     2724    0.010    0.000    0.014    0.000 sigprobs.py:189(normal_name)
      886    0.003    0.000    0.003    0.000 sigprobs.py:333(check_fanciness)
        1    0.000    0.000    0.003    0.003 sigprobs.py:39(load_config)
      886    0.001    0.000    0.002    0.000 sigprobs.py:370(check_length)
        1    0.000    0.000    0.000    0.000 sigprobs.py:159(<dictcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:163(<dictcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:168(<setcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:173(<listcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:174(<listcomp>)
(...)
>>> p.strip_dirs().sort_stats(pstats.SortKey.FILENAME, pstats.SortKey.TIME).print_stats()
Mon Mar 23 02:27:58 2020    sigprobs_profile

         5200087 function calls (5155274 primitive calls) in 495.606 seconds

   Ordered by: file name, internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
(...)
        1    0.167    0.167  495.195  495.195 sigprobs.py:378(main)
      886    0.039    0.000   92.566    0.104 sigprobs.py:197(check_sig)
      985    0.028    0.000    0.596    0.001 sigprobs.py:255(compare_links)
      886    0.012    0.000   86.469    0.098 sigprobs.py:218(get_lint_errors)
      886    0.011    0.000    4.599    0.005 sigprobs.py:236(check_links)
     2724    0.010    0.000    0.014    0.000 sigprobs.py:189(normal_name)
      902    0.008    0.000  402.222    0.446 sigprobs.py:60(iter_active_user_sigs)
      886    0.003    0.000    0.003    0.000 sigprobs.py:333(check_fanciness)
      128    0.002    0.000    5.436    0.042 sigprobs.py:317(evaluate_subst)
      886    0.002    0.000    1.447    0.002 sigprobs.py:345(check_tildes)
      886    0.001    0.000    0.002    0.000 sigprobs.py:370(check_length)
        1    0.001    0.001    0.085    0.085 sigprobs.py:122(get_site_data)
        1    0.000    0.000    0.003    0.003 sigprobs.py:39(load_config)
        1    0.000    0.000    0.000    0.000 sigprobs.py:159(<dictcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:163(<dictcomp>)
        1    0.000    0.000  495.638  495.638 sigprobs.py:21(<module>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:168(<setcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:173(<listcomp>)
        1    0.000    0.000    0.000    0.000 sigprobs.py:174(<listcomp>)
(...)
@AntiCompositeNumber
Copy link
Owner Author

The bulk of the time appears to be spent checking lint errors, more specifically waiting for the data to come back. Not tons that can be done, other than setting up a linter locally. And that sounds like a bad idea.

@AntiCompositeNumber
Copy link
Owner Author

Hmm..The parsoid api doesn't take batch requests, but that doesn't mean I can't batch requests. Combine a block of signatures into one newline-delimited string, send that off. If it passes the linter, assume all are good. If not, iterate over them all.

@AntiCompositeNumber
Copy link
Owner Author

Also: Short-circuit on check_fanciness. If there's no fancy markup, there's no potential for lint errors.

@AntiCompositeNumber
Copy link
Owner Author

batched lint checks and short circuiting implemented, testing...

@AntiCompositeNumber
Copy link
Owner Author

AntiCompositeNumber commented Mar 29, 2020

alright, implementing batched lint queries at n=5 results in about a 0.072s savings per signature and a 30% reduction in time spent waiting for lint errors. Short circuiting was not fully implemented, will try again with better implementation.

@AntiCompositeNumber
Copy link
Owner Author

fixing short circuiting reduced 1-sig check time by another .09s, and linter time by .036s, for a combined (batch lint and short circuit) reduction of about 0.161s. Over the 4754 enwiki signatures checked, that adds up to about 13 minutes.
The other source of delay that I've identified is the database query time. Switching back to a buffered cursor could speed this up, it's worth trying at least.

@AntiCompositeNumber
Copy link
Owner Author

AntiCompositeNumber commented Mar 29, 2020

The per-call time for iter_active_sigs fluctuates, so it's hard to tell if there's been a real effect. I'm going to switch to the buffered cursor though, it's likely nicer on the database.

@AntiCompositeNumber
Copy link
Owner Author

>>> p.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(30)
Sun Mar 29 04:53:37 2020    sigprobs_new_ub_profile

         1847135 function calls (1827525 primitive calls) in 143.256 seconds

   Ordered by: cumulative time
   List reduced from 1118 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/1    0.000    0.000  143.262  143.262 {built-in method builtins.exec}
        1    0.000    0.000  143.262  143.262 <string>:1(<module>)
        1    0.009    0.009  143.262  143.262 sigprobs.py:363(handle_args)
        1    0.016    0.016  143.230  143.230 sigprobs.py:284(main)
      752    0.014    0.000  134.366    0.179 socket.py:575(readinto)
      319    0.017    0.000  107.795    0.338 db.py:40(iter_active_user_sigs)
      100    0.011    0.000  107.613    1.076 cursors.py:151(execute)
      100    0.001    0.000  107.585    1.076 cursors.py:324(_query)
      100    0.006    0.000  107.582    1.076 connections.py:508(query)
      100    0.002    0.000  107.536    1.075 connections.py:720(_read_query_result)
      100    0.005    0.000  107.533    1.075 connections.py:1073(read)
      821    0.024    0.000  107.488    0.131 connections.py:648(_read_packet)
     1642    0.015    0.000  107.420    0.065 connections.py:687(_read_bytes)
     1674    0.017    0.000  107.391    0.064 {method 'read' of '_io.BufferedReader' objects}
      103  107.369    1.042  107.369    1.042 {method 'recv_into' of '_socket.socket' objects}
      323    0.019    0.000   34.045    0.105 api.py:39(backoff_retry)
      323    0.025    0.000   33.903    0.105 sessions.py:463(request)
      323    0.049    0.000   31.310    0.097 sessions.py:614(send)
      323    0.031    0.000   30.782    0.095 adapters.py:394(send)
      323    0.061    0.000   29.681    0.092 connectionpool.py:494(urlopen)
      323    0.048    0.000   29.006    0.090 connectionpool.py:351(_make_request)
      323    0.011    0.000   28.101    0.087 client.py:1292(getresponse)
       62    0.009    0.000   28.052    0.452 sigprobs.py:254(batch_check_lint)
      323    0.035    0.000   28.051    0.087 client.py:299(begin)
      237    0.032    0.000   27.975    0.118 sigprobs.py:88(get_lint_errors)
      323    0.045    0.000   26.991    0.084 client.py:266(_read_status)
    10248    0.057    0.000   26.976    0.003 {method 'readline' of '_io.BufferedReader' objects}
      649    0.011    0.000   26.973    0.042 ssl.py:1041(recv_into)
      649    0.010    0.000   26.962    0.042 ssl.py:902(read)
      649   26.952    0.042   26.952    0.042 {method 'read' of '_ssl._SSLSocket' objects}


<pstats.Stats object at 0x7ff3b34b84e0>
>>> p.strip_dirs().sort_stats(pstats.SortKey.TIME).print_stats(30)
Sun Mar 29 04:53:37 2020    sigprobs_new_ub_profile

         1847135 function calls (1827525 primitive calls) in 143.256 seconds

   Ordered by: internal time
   List reduced from 1118 to 30 due to restriction <30>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      103  107.369    1.042  107.369    1.042 {method 'recv_into' of '_socket.socket' objects}
      649   26.952    0.042   26.952    0.042 {method 'read' of '_ssl._SSLSocket' objects}
     1378    0.223    0.000    0.336    0.000 url.py:210(_encode_invalid_chars)
     1647    0.172    0.000    0.172    0.000 {built-in method posix.stat}
      560    0.144    0.000    0.144    0.000 {method 'write' of '_ssl._SSLSocket' objects}
   152133    0.139    0.000    0.180    0.000 {built-in method builtins.isinstance}
    29083    0.116    0.000    0.280    0.000 os.py:673(__getitem__)
    34388    0.115    0.000    0.504    0.000 _collections_abc.py:742(__iter__)
      323    0.114    0.000    0.303    0.001 feedparser.py:471(_parse_headers)
    15178    0.114    0.000    0.114    0.000 {method 'match' of 're.Pattern' objects}
   163666    0.114    0.000    0.114    0.000 {method 'lower' of 'str' objects}
 4872/354    0.109    0.000    0.965    0.003 utils.py:36(parse_anything)
     5818    0.106    0.000    0.294    0.000 parse.py:361(urlparse)
   123637    0.100    0.000    0.100    0.000 {method 'decode' of 'bytes' objects}
      323    0.088    0.000    0.947    0.003 client.py:203(parse_headers)
     1615    0.083    0.000    0.222    0.000 _collections_abc.py:824(update)
      354    0.082    0.000    0.088    0.000 {method 'tokenize' of '_tokenizer.CTokenizer' objects}
  745/396    0.077    0.000    0.625    0.002 builder.py:245(_handle_tag)
     6789    0.076    0.000    0.128    0.000 parse.py:412(urlsplit)
      646    0.075    0.000    0.528    0.001 request.py:2456(getproxies_environment)
    13899    0.073    0.000    0.082    0.000 parse.py:109(_coerce_args)
      549    0.072    0.000    0.136    0.000 __init__.py:293(__init__)
     2907    0.071    0.000    0.149    0.000 message.py:462(get)
     9564    0.070    0.000    0.121    0.000 _policybase.py:293(header_source_parse)
53154/53153    0.069    0.000    0.069    0.000 {method 'encode' of 'str' objects}
    29083    0.066    0.000    0.110    0.000 os.py:751(encode)
      646    0.065    0.000    0.589    0.001 feedparser.py:218(_parsegen)
    52977    0.063    0.000    0.112    0.000 os.py:755(decode)
      323    0.061    0.000   29.681    0.092 connectionpool.py:494(urlopen)
8055/2903    0.058    0.000    0.116    0.000 cookiejar.py:1214(deepvalues)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant