
gh-91432: Replace JUMP+FOR_ITER with FOR_END #70016

Closed
wants to merge 31 commits

Conversation


@sweeneyde sweeneyde commented Apr 10, 2022

#91432

TODO:

  • Get test_sys_settrace passing
  • Get test_trace passing
  • Get test_bdb passing
  • Get test_pdb passing
  • Refactor/comment maybe_call_line_trace in ceval.c
  • issue
  • NEWS
  • DOCS
  • WhatsNew
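
For readers skimming the diff: per the title (and the tracing discussion below), the change replaces the FOR_ITER at the top of a for loop plus the backward JUMP at the bottom with a single FOR_END at the bottom. As a quick way to see the layout being replaced, here is a small sketch that is not part of the PR; opcode names vary by interpreter version (JUMP_ABSOLUTE on 3.10, JUMP_BACKWARD on later 3.11 builds):

import dis

def count_items(it):
    # On a pre-patch interpreter, the loop below disassembles to FOR_ITER at
    # the top of the loop body and a backward jump at the bottom; this PR
    # collapses that pair into a single FOR_END.
    n = 0
    for _ in it:
        n += 1
    return n

dis.dis(count_items)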


sweeneyde commented Apr 10, 2022

Below are some benchmarks. My machine is not the most stable, but I believe there is a consistent, measurable speedup.

PyPerformance:

Slower (9):

  • scimark_sparse_mat_mult: 4.87 ms +- 0.23 ms -> 5.04 ms +- 0.19 ms: 1.04x slower
  • spectral_norm: 96.6 ms +- 1.4 ms -> 100.0 ms +- 1.7 ms: 1.03x slower
  • regex_v8: 23.3 ms +- 0.4 ms -> 24.0 ms +- 0.5 ms: 1.03x slower
  • pathlib: 19.7 ms +- 0.4 ms -> 20.2 ms +- 0.8 ms: 1.03x slower
  • sympy_sum: 184 ms +- 2 ms -> 188 ms +- 6 ms: 1.02x slower
  • telco: 6.46 ms +- 0.12 ms -> 6.56 ms +- 0.19 ms: 1.02x slower
  • sqlalchemy_imperative: 24.5 ms +- 1.0 ms -> 24.8 ms +- 0.7 ms: 1.02x slower
  • regex_dna: 209 ms +- 3 ms -> 212 ms +- 3 ms: 1.01x slower
  • logging_simple: 6.55 us +- 0.12 us -> 6.64 us +- 0.28 us: 1.01x slower

Faster (20):

  • pickle_list: 4.77 us +- 0.17 us -> 4.44 us +- 0.15 us: 1.07x faster
  • logging_silent: 105 ns +- 5 ns -> 98.0 ns +- 2.0 ns: 1.07x faster
  • nbody: 105 ms +- 5 ms -> 99.1 ms +- 3.7 ms: 1.06x faster
  • pickle_dict: 28.6 us +- 0.4 us -> 26.9 us +- 0.5 us: 1.06x faster
  • scimark_lu: 111 ms +- 2 ms -> 105 ms +- 3 ms: 1.05x faster
  • pickle: 12.6 us +- 1.1 us -> 12.0 us +- 1.4 us: 1.05x faster
  • xml_etree_iterparse: 116 ms +- 16 ms -> 110 ms +- 2 ms: 1.05x faster
  • deltablue: 4.27 ms +- 0.18 ms -> 4.12 ms +- 0.09 ms: 1.04x faster
  • regex_effbot: 3.19 ms +- 0.09 ms -> 3.08 ms +- 0.04 ms: 1.04x faster
  • scimark_monte_carlo: 72.9 ms +- 1.5 ms -> 70.4 ms +- 1.2 ms: 1.03x faster
  • django_template: 42.1 ms +- 2.2 ms -> 40.9 ms +- 2.4 ms: 1.03x faster
  • float: 81.6 ms +- 1.5 ms -> 79.7 ms +- 1.3 ms: 1.02x faster
  • pickle_pure_python: 351 us +- 10 us -> 343 us +- 8 us: 1.02x faster
  • chaos: 80.3 ms +- 1.8 ms -> 78.8 ms +- 2.1 ms: 1.02x faster
  • unpickle_pure_python: 263 us +- 6 us -> 259 us +- 7 us: 1.02x faster
  • nqueens: 87.0 ms +- 1.4 ms -> 85.5 ms +- 2.2 ms: 1.02x faster
  • raytrace: 329 ms +- 9 ms -> 324 ms +- 10 ms: 1.01x faster
  • unpickle_list: 5.11 us +- 0.12 us -> 5.05 us +- 0.08 us: 1.01x faster
  • regex_compile: 152 ms +- 3 ms -> 151 ms +- 3 ms: 1.01x faster
  • scimark_sor: 119 ms +- 2 ms -> 118 ms +- 1 ms: 1.01x faster

Benchmark hidden because not significant (29): 2to3, chameleon, crypto_pyaes, dulwich_log, fannkuch, go, hexiom, json_dumps, json_loads, logging_format, mako, meteor_contest, pidigits, pyflate, python_startup, python_startup_no_site, richards, scimark_fft, sqlalchemy_declarative, sqlite_synth, sympy_expand, sympy_integrate, sympy_str, tornado_http, unpack_sequence, unpickle, xml_etree_parse, xml_etree_generate, xml_etree_process

Geometric mean: 1.01x faster

Microbenchmarks:

benchmark code:

from itertools import repeat
from pyperf import Runner, perf_counter

runner = Runner()
def time_this(func):
    runner.bench_time_func(func.__name__, func)
    return func

###############################

@time_this
def range_sum(loops):
    s = 0
    r = iter(range(loops))
    t0 = perf_counter()
    for x in r:
        s += x
    return perf_counter() - t0

@time_this
def list_sum(loops):
    s = 0.0
    r = iter([1.0] * loops)
    t0 = perf_counter()
    for x in r:
        s += x
    return perf_counter() - t0

@time_this
def repeat_sum(loops):
    s = 0.0
    r = repeat(1.0, loops)
    t0 = perf_counter()
    for x in r:
        s += x
    return perf_counter() - t0

###############################

@time_this
def range_all(loops):
    r = iter(range(1, loops + 1))
    t0 = perf_counter()
    for x in r:
        if not x:
            break
    return perf_counter() - t0

@time_this
def list_all(loops):
    r = iter([True] * loops)
    t0 = perf_counter()
    for x in r:
        if not x:
            break
    return perf_counter() - t0

@time_this
def repeat_all(loops):
    r = repeat(True, loops)
    t0 = perf_counter()
    for x in r:
        if not x:
            break
    return perf_counter() - t0
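
(For anyone re-measuring: assuming the usual pyperf workflow, the script above can be run once per interpreter with -o base.json / -o patched.json and the two result files compared with python -m pyperf compare_to base.json patched.json.)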

Faster (6):

  • range_all: 17.5 ns +- 0.1 ns -> 16.6 ns +- 0.3 ns: 1.05x faster
  • list_all: 10.6 ns +- 0.7 ns -> 10.3 ns +- 0.4 ns: 1.03x faster
  • repeat_sum: 15.2 ns +- 0.2 ns -> 14.8 ns +- 0.5 ns: 1.03x faster
  • list_sum: 17.3 ns +- 0.2 ns -> 16.9 ns +- 0.3 ns: 1.03x faster
  • range_sum: 31.5 ns +- 0.3 ns -> 30.9 ns +- 0.3 ns: 1.02x faster
  • repeat_all: 7.98 ns +- 0.06 ns -> 7.90 ns +- 0.10 ns: 1.01x faster

Geometric mean: 1.03x faster

These don't make too much sense (some macro-benchmarks speeding up more than the tightest micro-benchmarks?), so if someone with a stable machine is willing to re-measure, that would be appreciated.

@sweeneyde sweeneyde marked this pull request as ready for review April 10, 2022 23:29
@sweeneyde sweeneyde changed the title from "Replace JUMP_ABSOLUTE/FOR_ITER with FOR_END" to "gh-91432: Replace JUMP+FOR_ITER with FOR_END" Apr 10, 2022
Comment on lines +6760 to +6773
int prev_op = _PyOpcode_Deopt[_Py_OPCODE(code[instr_prev])];
int next_op = _PyOpcode_Deopt[_Py_OPCODE(*frame->prev_instr)];
// Trace before FOR_END, not after, even though a backwards
// jump happens after. However, don't trace on the first FOR_END
// of the for loop, since we're staying on the same line.
// Also don't trace JUMP_NO_INTERRUPT --> SEND.
bool line_number_changed = (line != lastline);
bool first_iteration = (next_op == FOR_END &&
                        prev_op == JUMP_FORWARD &&
                        !line_number_changed);
bool for_loop_end = (next_op == FOR_END && !first_iteration);
bool back_jump = (_PyInterpreterFrame_LASTI(frame) < instr_prev &&
                  next_op != SEND && prev_op != FOR_END);
if (line_number_changed || for_loop_end || back_jump) {

@sweeneyde sweeneyde Apr 11, 2022

I'm not yet totally convinced of the robustness of the first_iteration check. It works on the whole test suite, and I couldn't come up with any failing cases, but it feels fragile.

Something like this correctly traces the backwards jumps because the POP_TOP has line number -1:

def for_loop_same_line_with_jump():           # line 0
    for x in (0, 1, 2): +x if x >= 0 else -x  # line 1

I don't know of any way to get JUMP_FORWARD targeted at FOR_END on the same line, unless it's the first loop iteration.
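
As a side note, here is a minimal sketch (not from the PR; the helper name collect_line_events is made up) for observing which line events fire for a one-line loop body, reusing the example function from this comment:

import sys

def collect_line_events(func):
    # Record the line numbers of "line" trace events raised while func() runs.
    lines = []
    def tracer(frame, event, arg):
        if event == "line":
            lines.append(frame.f_lineno)
        return tracer
    sys.settrace(tracer)
    try:
        func()
    finally:
        sys.settrace(None)
    return lines

def for_loop_same_line_with_jump():
    for x in (0, 1, 2): +x if x >= 0 else -x

print(collect_line_events(for_loop_same_line_with_jump))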

@sweeneyde sweeneyde (Member Author)

I could add something like this if preferred:

#ifdef Py_DEBUG
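        // Debug-only sanity check: the JUMP_FORWARD we arrived from should
        // have been emitted by the loop setup, i.e. be preceded by GET_ITER
        // (ordinary for loop) or by LOAD_FAST of the implicit ".0"
        // comprehension argument.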
        if (first_iteration) {
            int prevprev_op = _PyOpcode_Deopt[_Py_OPCODE(code[instr_prev-1])];
            assert(prevprev_op == GET_ITER || prevprev_op == LOAD_FAST);
            if (prevprev_op == LOAD_FAST) {
                assert(_Py_OPARG(code[instr_prev-1]) == 0);
                PyObject *names = frame->f_code->co_localsplusnames;
                PyObject *name = PyTuple_GET_ITEM(names, 0);
                assert(_PyUnicode_EqualToASCIIString(name, ".0"));
            }
        }
#endif


@ambv ambv closed this Apr 11, 2022
@ambv ambv reopened this Apr 11, 2022
@markshannon
Member

@sweeneyde thanks for trying this, but it looks like we want to make all branches go forward (except JUMP_BACKWARD and JUMP_BACKWARD_NO_INTERRUPT).

faster-cpython/ideas#297 (comment)

@sweeneyde sweeneyde closed this May 20, 2022
@iritkatriel
Member

Is this an older version of the same idea? #46711
