mor1kx does not meet timing on kasli (artix 7) #891

Closed
jordens opened this issue Jan 15, 2018 · 15 comments

Comments

@jordens
Member

jordens commented Jan 15, 2018

There are a few really long paths in mor1kx on Kasli.
This happens with Vivado 2017.2 and 2017.4.
It happens in both mor1kx CPUs (comms and kernel).
We used to run the Spartan 6 mor1kx at 83 MHz on pipistrello to deal with this. It now seems close enough that some tweaking might allow the CPUs to run at 125 MHz on Artix 7.

Max Delay Paths
--------------------------------------------------------------------------------------
Slack (VIOLATED) :        -0.137ns  (required time - arrival time)
  Source:                 mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/spr_sr_reg[5]/C
                            (rising edge-triggered cell FDRE clocked by sys_clk  {rise@0.000ns fall@4.000ns period=8.000ns})
  Destination:            mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata_reg[6]/D
                            (rising edge-triggered cell FDRE clocked by sys_clk  {rise@0.000ns fall@4.000ns period=8.000ns})
  Path Group:             sys_clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            8.000ns  (sys_clk rise@8.000ns - sys_clk rise@0.000ns)
  Data Path Delay:        8.083ns  (logic 2.475ns (30.619%)  route 5.608ns (69.381%))
  Logic Levels:           14  (CARRY4=2 LUT2=2 LUT3=2 LUT5=1 LUT6=6 RAMD64E=1)
  Clock Path Skew:        -0.039ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    1.260ns = ( 9.260 - 8.000 )
    Source Clock Delay      (SCD):    1.365ns
    Clock Pessimism Removal (CPR):    0.066ns
  Clock Uncertainty:      0.047ns  ((TSJ^2 + DJ^2)^1/2) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Discrete Jitter          (DJ):    0.061ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock sys_clk rise edge)    0.000     0.000 r
    BUFGCTRL_X0Y0        BUFG                         0.000     0.000 r  BUFG/O
                         net (fo=10686, routed)       1.365     1.365    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/sys_clk
    SLICE_X54Y87         FDRE                                         r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/spr_sr_reg[5]/C
  -------------------------------------------------------------------    -------------------
    SLICE_X54Y87         FDRE (Prop_fdre_C_Q)         0.433     1.798 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/spr_sr_reg[5]/Q
                         net (fo=61, routed)          0.471     2.268    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/spr_sr_reg[5]
    SLICE_X54Y88         LUT2 (Prop_lut2_I1_O)        0.105     2.373 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/mem_reg_1_i_4/O
                         net (fo=7, routed)           0.545     2.918    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/din[50]
    SLICE_X54Y89         LUT6 (Prop_lut6_I3_O)        0.105     3.023 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/check_way_match[0]_carry_i_3__0/O
                         net (fo=1, routed)           0.000     3.023    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/rdata_reg[11][1]
    SLICE_X54Y89         CARRY4 (Prop_carry4_S[1]_CO[3])
                                                      0.444     3.467 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/check_way_match[0]_carry/CO[3]
                         net (fo=1, routed)           0.000     3.467    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/check_way_match[0]_carry_n_0
    SLICE_X54Y90         CARRY4 (Prop_carry4_CI_CO[2])
                                                      0.191     3.658 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/check_way_match[0]_carry__0/CO[2]
                         net (fo=5, routed)           0.459     4.117    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/CO[0]
    SLICE_X53Y91         LUT5 (Prop_lut5_I0_O)        0.252     4.369 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/access_done_i_5/O
                         net (fo=1, routed)           0.262     4.631    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/access_done_i_5_n_0
    SLICE_X52Y91         LUT6 (Prop_lut6_I0_O)        0.105     4.736 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/access_done_i_4/O
                         net (fo=1, routed)           0.119     4.856    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/write_pending_reg
    SLICE_X52Y91         LUT6 (Prop_lut6_I5_O)        0.105     4.961 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/access_done_i_2/O
                         net (fo=3, routed)           0.376     5.336    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/lsu_ack
    SLICE_X49Y91         LUT2 (Prop_lut2_I1_O)        0.105     5.441 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/atomic_gen.atomic_flag_set_i_4/O
                         net (fo=6, routed)           0.254     5.695    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/atomic_gen.atomic_flag_clear_reg
    SLICE_X51Y91         LUT6 (Prop_lut6_I2_O)        0.105     5.800 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/store_buffer_gen.mor1kx_store_buffer/ctrl_rfd_adr_o[4]_i_4/O
                         net (fo=1, routed)           0.317     6.118    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/p_16_in
    SLICE_X51Y91         LUT6 (Prop_lut6_I3_O)        0.105     6.223 f  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_ctrl_cappuccino/ctrl_rfd_adr_o[4]_i_1/O
                         net (fo=97, routed)          0.324     6.547    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_decode_execute_cappuccino/padv_execute_o
    SLICE_X51Y90         LUT3 (Prop_lut3_I0_O)        0.105     6.652 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_decode_execute_cappuccino/mem_reg_0_63_0_2_i_22/O
                         net (fo=12, routed)          0.404     7.056    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/execute_op_lsu_load_o_reg
    SLICE_X52Y90         LUT3 (Prop_lut3_I1_O)        0.105     7.161 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_execute_ctrl_cappuccino/mem_reg_0_63_0_2_i_9/O
                         net (fo=82, routed)          1.273     8.434    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/mem_reg_128_191_6_8/ADDRA1
    SLICE_X58Y92         RAMD64E (Prop_ramd64e_RADR1_O)
                                                      0.105     8.539 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/mem_reg_128_191_6_8/RAMA/O
                         net (fo=1, routed)           0.804     9.343    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/mem_reg_128_191_6_8_n_0
    SLICE_X59Y90         LUT6 (Prop_lut6_I1_O)        0.105     9.448 r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata[6]_i_1/O
                         net (fo=1, routed)           0.000     9.448    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata0[6]
    SLICE_X59Y90         FDRE                                         r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata_reg[6]/D
  -------------------------------------------------------------------    -------------------

                         (clock sys_clk rise edge)    8.000     8.000 r
    BUFGCTRL_X0Y0        BUFG                         0.000     8.000 r  BUFG/O
                         net (fo=10686, routed)       1.260     9.260    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/sys_clk
    SLICE_X59Y90         FDRE                                         r  mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata_reg[6]/C
                         clock pessimism              0.066     9.326
                         clock uncertainty           -0.047     9.279
    SLICE_X59Y90         FDRE (Setup_fdre_C_D)        0.032     9.311    mor1kx_1/mor1kx_cpu/cappuccino.mor1kx_cpu/mor1kx_lsu_cappuccino/dcache_gen.mor1kx_dcache/tag_ram/rdata_reg[6]
  -------------------------------------------------------------------
                         required time                          9.311
                         arrival time                          -9.448
  -------------------------------------------------------------------
                         slack                                 -0.137
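
The slack figure in the report can be reproduced from the numbers it prints. The following is a minimal sketch in Python that only re-does the report's own arithmetic; the variable names are illustrative and not part of any tool's API. It also estimates the clock frequency at which this particular path would pass if nothing else changed, which comes out at roughly 123 MHz.

```python
import math

# Values copied from the timing report above (all in ns).
period = 8.000
dest_clock_delay = 1.260
clock_pessimism = 0.066
tsj, dj, pe = 0.071, 0.061, 0.000
setup_adjust = 0.032      # FDRE (Setup_fdre_C_D) increment as printed in the report
arrival_time = 9.448

# Clock uncertainty using the formula printed in the report:
# sqrt(TSJ^2 + DJ^2) / 2 + PE
clock_uncertainty = math.sqrt(tsj**2 + dj**2) / 2 + pe

# Required time, built up in the same order as the report's second table.
required_time = (period + dest_clock_delay + clock_pessimism
                 - clock_uncertainty + setup_adjust)
slack = required_time - arrival_time

# Shortest period at which this one path would pass, all else being equal.
min_period = period - slack
max_freq_mhz = 1000.0 / min_period

print(f"clock uncertainty: {clock_uncertainty:.3f} ns")
print(f"required time:     {required_time:.3f} ns")
print(f"slack:             {slack:.3f} ns")
print(f"max frequency:     {max_freq_mhz:.1f} MHz")
```
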
@enjoy-digital
Contributor

I don't think it's related to Artix 7 vs Spartan 6, but just to the CPU frequency.
On Kasli the CPU is running at 125 MHz vs 83.3 MHz on Pipistrello.
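
To put the two frequencies side by side, here is a small sketch that converts them to clock periods. The helper name `period_ns` is made up for illustration. Note that path delays on Spartan 6 and Artix 7 differ, so this only compares the timing budgets, not the achieved delays.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

# Frequencies quoted in the comment above.
for board, freq_mhz in [("Kasli", 125.0), ("Pipistrello", 83.3)]:
    print(f"{board}: {freq_mhz} MHz -> {period_ns(freq_mhz):.3f} ns per cycle")

# Extra time per cycle available at the lower frequency.
extra_budget = period_ns(83.3) - period_ns(125.0)
print(f"extra budget at 83.3 MHz: {extra_budget:.3f} ns")
```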

@jordens
Member Author

jordens commented Jan 15, 2018

Ah, right. I forgot that. I still think that 14 logic levels is too many and that we should be able to make it to 125 MHz.

@enjoy-digital
Contributor

Yes, I was just pointing out the reason. At 125 MHz, Kasli does not seem that far from meeting timing, so it should not be too difficult to fix.

jordens added a commit to m-labs/misoc that referenced this issue Jan 16, 2018
@jordens
Member Author

jordens commented Mar 11, 2018

Currently a frequent offender is the interrupt path (timer especially) to the dcache. Maybe we can add some pipelining at the beginning.

@sbourdeauducq
Member

This patch makes it pass timing on both opticlock and sysu variants:

diff --git a/misoc/cores/mor1kx/core.py b/misoc/cores/mor1kx/core.py
index 832f91a2..0b4803ff 100644
--- a/misoc/cores/mor1kx/core.py
+++ b/misoc/cores/mor1kx/core.py
@@ -25,6 +25,7 @@ class MOR1KX(Module):
             OPTION_DCACHE_WAYS=1,
             OPTION_DCACHE_LIMIT_WIDTH=31,
             FEATURE_TIMER="NONE",
+            FEATURE_PIC="NONE",
             OPTION_PIC_TRIGGER="LEVEL",
             FEATURE_SYSCALL="NONE",
             FEATURE_TRAP="NONE",
@@ -32,10 +33,12 @@ class MOR1KX(Module):
             FEATURE_OVERFLOW="NONE",
             FEATURE_ADDC="ENABLED",
             FEATURE_CMOV="ENABLED",
-            FEATURE_FFL1="ENABLED",
+            FEATURE_FFL1="NONE",
+            FEATURE_ATOMIC="NONE",
+            FEATURE_STORE_BUFFER="NONE",
             OPTION_CPU0="CAPPUCCINO",
-            IBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",
-            DBUS_WB_TYPE="B3_REGISTERED_FEEDBACK",
+            IBUS_WB_TYPE="CLASSIC",
+            DBUS_WB_TYPE="CLASSIC",
         )
         defaults.update(kwargs)
         parameters = {"p_{}".format(k): v for k, v in defaults.items()}
  • We do not use interrupts in ARTIQ, though other MiSoC designs might, so that needs to be parameterizable.
  • We do not use the FFL1 instructions anywhere.
  • We use the atomic instructions (l.lwa, l.swa) but I cannot see why the runtime requires them. @whitequark can we get rid of them?
  • There is a performance penalty associated with disabling the store buffer.
  • The other cores do not support B3 registered feedback, so there is no performance degradation if the CPU also does not support it.
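
The parameterization asked for in the first bullet already falls out of the `defaults.update(kwargs)` pattern visible in the patch. The sketch below isolates that pattern in a standalone function; `build_parameters` is a hypothetical name, only a subset of the options is listed, and `"ENABLED"` as a value for `FEATURE_PIC` is an assumption based on how the other feature flags in the diff are spelled.

```python
def build_parameters(**kwargs):
    """Hypothetical helper mirroring the defaults/kwargs pattern in the patch."""
    # Subset of the defaults proposed in the patch above.
    defaults = dict(
        FEATURE_TIMER="NONE",
        FEATURE_PIC="NONE",
        FEATURE_FFL1="NONE",
        FEATURE_ATOMIC="NONE",
        FEATURE_STORE_BUFFER="NONE",
        IBUS_WB_TYPE="CLASSIC",
        DBUS_WB_TYPE="CLASSIC",
    )
    # Caller-supplied options override the defaults.
    defaults.update(kwargs)
    # Same "p_" prefixing as in the patched core.py.
    return {"p_{}".format(k): v for k, v in defaults.items()}


# A design without interrupts keeps the timing-friendly defaults.
artiq_params = build_parameters()

# Another design could opt back in to the PIC (value is an assumption).
irq_params = build_parameters(FEATURE_PIC="ENABLED")

print(artiq_params["p_FEATURE_PIC"], irq_params["p_FEATURE_PIC"])
```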

@dnadlinger
Collaborator

I wouldn't ditch interrupts, as we need them to implement a sampling profiler, something which both @whitequark and I have been wanting to do for quite some time.

@sbourdeauducq
Member

Unfortunately, it is very difficult to meet timing.

@sbourdeauducq
Member

Where and how would you use interrupts exactly?

@dnadlinger
Collaborator

Sure, meeting timing is obviously more important; read my statement as "I'd really rather we found a way to keep timer interrupts".

To implement a sampling profiler, you would periodically save the current instruction pointer (optionally the complete stack trace) into a global buffer. Later, you'd send the full buffer back to the host PC, which allows you to generate a time profile. On a simple in-order CPU like this, the result should be frighteningly accurate even without something like PEBS. (The timer does introduce some non-determinism of course, so there needs to be enough slack in the RTIO timing.)

This doesn't need the full power of a complex PIC, though, but just a fixed-interval timer. Glancing over the mor1kx source, it seems like the tick_timer is in fact handled entirely separately from the PIC anyway, so FEATURE_PIC="NONE" shouldn't be an issue for that. (Not sure which timer path @jordens was referring to, though.)
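
The host-side half of the profiler described above could look roughly like the following sketch. It assumes the device has already sent back a buffer of sampled instruction pointers; the symbol table, addresses, and function names are all made up for illustration and do not come from ARTIQ.

```python
import bisect
from collections import Counter

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x40000000, "main"),
    (0x40000100, "rtio_output"),
    (0x40000400, "memcpy"),
]


def symbolize(pc, symbols=SYMBOLS):
    """Map a sampled program counter to the nearest preceding symbol."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, pc) - 1
    return symbols[index][1] if index >= 0 else "<unknown>"


def time_profile(samples):
    """Return (function, fraction of samples) pairs, most frequent first."""
    counts = Counter(symbolize(pc) for pc in samples)
    total = sum(counts.values())
    return [(name, n / total) for name, n in counts.most_common()]


# Made-up buffer of instruction pointers captured by the timer interrupt.
samples = [0x40000104, 0x40000110, 0x40000008,
           0x40000404, 0x40000120, 0x40000108]

profile = time_profile(samples)
for name, share in profile:
    print(f"{name:12s} {share:6.1%}")
```

With a fixed sampling interval, the fraction of samples landing in a function approximates the fraction of time spent there, which is why a single fixed-interval timer is enough.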

@jordens
Member Author

jordens commented Mar 28, 2018

@dnadlinger
Collaborator

dnadlinger commented Mar 28, 2018

Ah, right, thanks. There is also the tick_timer directly in mor1kx, but I didn't have a closer look at it yet.

@whitequark
Contributor

whitequark commented Mar 29, 2018

We do not use the FFL1 instructions anywhere.

We do actually have them enabled in the compiler, and I vaguely recall LLVM taking advantage of them for some RTIO-related computation, but these can be easily turned off as LLVM has a (reasonably slow) software fallback.

We use the atomic instructions (l.lwa, l.swa) but I cannot see why the runtime requires them. @whitequark can we get rid of them?

l.lwa and l.swa are not just atomic instructions, but they are LL/SC instructions. Aligned atomic loads are translated into l.lwz and aligned atomic stores into l.sw. l.lwa and l.swa are needed for atomic RMW instructions.
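
As an illustration of why atomic RMW operations need the LL/SC pair, here is a toy Python model of a retry loop. It is only a sketch: the `Memory` class, its single reservation, and the `interfere` hook are assumptions made for the example and do not reproduce the exact OR1K reservation rules.

```python
class Memory:
    """Toy word-addressed memory with a single LL/SC reservation."""

    def __init__(self):
        self.words = {}
        self.reservation = None  # address reserved by the last load_linked

    def load_linked(self, addr):
        """Model of l.lwa: load a word and reserve its address."""
        self.reservation = addr
        return self.words.get(addr, 0)

    def store(self, addr, value):
        """Model of a plain store: write and break a matching reservation."""
        self.words[addr] = value
        if self.reservation == addr:
            self.reservation = None

    def store_conditional(self, addr, value):
        """Model of l.swa: store only if the reservation is still intact."""
        if self.reservation != addr:
            return False
        self.words[addr] = value
        self.reservation = None
        return True


def atomic_fetch_add(mem, addr, delta, interfere=None):
    """Atomic read-modify-write built from an LL/SC retry loop."""
    while True:
        old = mem.load_linked(addr)
        if interfere is not None:
            interfere()        # hypothetical hook simulating another writer
            interfere = None   # interfere only on the first attempt
        if mem.store_conditional(addr, old + delta):
            return old


mem = Memory()
mem.store(0x100, 5)
# The first attempt fails because of the interfering store; the retry succeeds.
previous = atomic_fetch_add(mem, 0x100, 1,
                            interfere=lambda: mem.store(0x100, 40))
print(previous, mem.words[0x100])
```

A plain load followed by a plain store would silently overwrite the interfering write here, which is the case the conditional store is there to catch.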

Whether we can get rid of them depends on the meaning of "get rid of". I've looked into removing them from the Rust libcore before and that was complex enough that I chose to fix their support in LLVM instead (the LLVM handling of atomic RMWs used to be pretty broken). The problem is that libcore has a complete set of atomic operations (store, load, RMW, CAS, etc) that are translated at the time when we're building the Rust stdlib for OR1K, so it's either disabling atomic operations for pointer-sized integers (which breaks a number of crates e.g. log) or translating RMW and CAS operations somehow.

At some point in the future, Rust is switching to so-called MIR-only rlibs, which delay translation to machine code until the final executable is compiled; that will fix this problem at the root. However, I don't know when that's happening, so we can't wait until then.

What I propose then is to disable their support in the CPU but leave it in the compiler. Any runtime use of l.lwa or l.swa will then crash with exception 7, which is easy to recognize. I suppose we could even emulate them if the need arises.

@sbourdeauducq
Member

sbourdeauducq commented Apr 1, 2018

With LM32 (for both kernel and comms) and without any special care: opticlock, sysu and satellite meet timing easily. The DRTIO master fails timing (a bit inexplicably) in the RTIO core, by 24ps for the worst path.

@whitequark
Contributor

whitequark commented Apr 1, 2018 via email

@sbourdeauducq
Member
