improved register layout and tweaked event submission code #636
Comments
We can merge address and write enable instead. |
No, because that will lose precise RTIO exceptions or create horrible race conditions. But we can use bus wait states and bus errors. |
We might be able to do both, getting rid of two CSR writes in many cases:
Good! |
Doing that to the channel register is difficult because DRTIO uses it to look up first the state of the remote channel, and test for underflow, FIFO full, etc. |
Is the bit-twiddling that will be required to compute the data faster than one register write? Note that we can have a second RTIO output function that writes to address zero by not doing the address CSR write. |
Does it do that channel status lookup early to be in parallel with the CPU doing the address/data writes? |
Merging the address into the channel sounds OK. |
The DMA playback engine takes LSB-first data of arbitrary length with byte granularity and zeros the missing MSBs, and DRTIO similarly removes zeros in front of data. The new RTIO writes could be done this way, with a total of 4 bus write accesses (instead of the current 6) if the data is small:
(plus status readout) |
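A minimal Python sketch of the zero-stripping and zero-extending idea described above. It only illustrates the encoding, not the gateware. The function names and the 64-byte maximum width are assumptions made for this example and are not taken from the ARTIQ code base.

```python
MAX_WIDTH_BYTES = 64  # assumed maximum data width, for illustration only


def strip_msb_zeros(value: int, width_bytes: int = MAX_WIDTH_BYTES) -> bytes:
    """Encode an unsigned value LSB-first and drop all-zero most significant bytes."""
    raw = value.to_bytes(width_bytes, "little")
    # In little-endian order the MSBs are at the end, so strip trailing zero bytes.
    return raw.rstrip(b"\x00")


def zero_extend(payload: bytes, width_bytes: int = MAX_WIDTH_BYTES) -> int:
    """Decode an LSB-first payload, treating the missing MSBs as zero."""
    padded = payload.ljust(width_bytes, b"\x00")
    return int.from_bytes(padded, "little")


if __name__ == "__main__":
    short = strip_msb_zeros(0x0100)
    print(len(short), zero_extend(short) == 0x0100)
```

With this scheme a small value needs only as many bytes as its non-zero part, and the receiver restores the full-width word without any length field beyond the payload size.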
The automatic zero-stripping/extending sounds good. |
If I recall correctly, I've tried this and it made no practical difference. |
Since event submission (i.e. writing the |
DRTIO is not using a separate CSR for the timestamp. The (D)RTIO cores are demuxed after the CSRs, see |
CSR pinning of now sounds good and easy to implement (just pin the global to the address of CSR instead of having a global in ksupport). |
@jordens Can you confirm that an 8-bit address is still sufficient for SAWG and all foreseen needs? |
@whitequark What do you think of |
I don't see anything that would need more than an 8-bit address space. |
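As a rough sketch of what merging an 8-bit address into the channel register could look like from software, the snippet below packs both fields into one word so that a single write replaces two. The field layout (address in the low 8 bits, channel above it) and the helper names are assumptions for illustration; the thread does not specify them.

```python
ADDRESS_BITS = 8  # address width discussed in the thread
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1


def pack_target(channel: int, address: int) -> int:
    """Combine channel and address into one register value (assumed layout)."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in 8 bits")
    if channel < 0:
        raise ValueError("channel must be non-negative")
    return (channel << ADDRESS_BITS) | address


def unpack_target(target: int) -> tuple:
    """Split a packed value back into (channel, address)."""
    return target >> ADDRESS_BITS, target & ADDRESS_MASK


if __name__ == "__main__":
    target = pack_target(3, 0x12)
    print(hex(target), unpack_target(target))
```

When both the channel and the address are compile-time constants, the packed value is itself a constant, so no run-time bit manipulation is needed in that case.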
Are bus errors working correctly with mor1kx and do they result in exceptions that can be traced? |
IIRC, bus errors work properly, but only on writes; on reads, error cycles are silently ignored.
|
OK, that would be acceptable. |
This needs to be rechecked before committing to this implementation, though; it was a while ago that I looked at it. |
@sbourdeauducq What do you think about mapping the RTIO registers to OR1K SPRs? I've just thought of a good way to expose those to LLVM; we can add a new address space so that accesses to SPRs are just pointer accesses and most optimizations still apply to them. |
Why does that help, single-cycle access? What kind of interface would that have? Wouldn't that cause problems with |
Yes.
Interface where? To software? More or less the same, just use mfspr/mtspr in Rust and slightly different codegen in ARTIQ Python.
Quite the opposite, it will eliminate some inefficiency from
I think you have more insight into this than me at this moment.
All mature soft CPUs I'm aware of have some sort of coprocessor interfaces.
Well, how much do we want to shave cycles off RTIO submission time? |
Interface to gateware. |
That would shave 4 cycles (32ns) and seems quite complicated to do. |
It seems to me that ARTIQ is pretty heavily invested in mor1kx, and that changing to a different CPU (even if there were a better option, which it seems at present there isn't) would involve a great deal of work anyway. |
At least 32K. |
In my experience this is one of the main pain points for ARTIQ users, and whatever improvements we can squeeze out in a reasonable fashion are really welcome for all users. |
@dhslichter Something you can do to advance this is submit realistic benchmark code that I can profile. There might be inefficiencies that I don't currently expect in different parts of the stack. |
@sbourdeauducq I assume that the proposals for shaving time off RTIO submissions would also shave time off RTIO retrieval (i.e. from reading timestamps from an input FIFO). Is this accurate? |
@whitequark ack, I will send along some code for sideband cooling to give a sense of what we're up against, and to give you something to benchmark with. |
Not really, apart from
Not that much, it's basically a bit of system code, the exception handling/unwinder, and dealing with rustc/LLVM breakage that I cannot imagine will miss the opportunity to manifest itself. A lot of things are portable.
@whitequark instead of SPRs, we can maybe do Wishbone combinatorial feedback for writes; would that work (i.e. no mor1kx bugs, and single-cycle performance) and meet timing? And does it actually improve things - with the write buffer, does the 2-cycle access have an impact on performance at all? And it's just 32ns at most... |
@sbourdeauducq are there modifications that would help with input speeds as well? Ack on the CPU portability comments. |
I don't imagine we're going to migrate to LM32 at this point because of the time investment for the LLVM backend and, more importantly, the lack of features in the CPU core, so the only option is RISC-V.
I've just looked at the RISC-V status in LLVM and it's somewhat underwhelming compared to what I expected. As far as I can tell they have a decently working LLVM backend but very little of it is upstream (10/84 patches), so there's no real advantage over our current OR1K LLVM backend. I cannot imagine the situation with rustc is any better, though rustc needs only a tiny amount of architecture-specific code (it's something around 100 LOC for OR1K, I think). The EH/unwinder changes are trivial (especially after I made them once for OR1K) so that's not a factor. Apart from what you listed, linking also needs to be handled, but that's not major either.
To summarize, whether we should switch to RISC-V should be informed exclusively by what we win by using a different CPU softcore, since that's the only thing that will matter in practice, at least until RISC-V is fully upstream and widely used, which isn't going to happen until the end of 2018 or even 2019.
@sbourdeauducq Worth a try. |
For inputs, in addition to |
@sbourdeauducq ack, thanks. |
Currently, RTIO CSR writes go through the write buffer, and each <=32-bit write takes 2 cycles to complete, which (thanks to the write buffer) is done in parallel with the CPU preparing the next operation. The problem with this scheme is that write error exceptions become asynchronous (raised some instructions after the write), so race conditions can occur. For example, an RTIOUnderflow can potentially be raised after the software has made a decision based on the incorrect information that no underflow has occurred. So to implement this, the write buffer would have to be disabled for RTIO CSRs. This raises two difficult issues:
I'm not sure if saving 1 CSR read and 1 test (dozens of ns) is worth going through those difficulties. |
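The race described above can be illustrated with a toy Python model. This is only a sketch: it does not model mor1kx or the actual write buffer, and `run`, `PROGRAM` and the `write_latency` parameter are hypothetical names introduced for this example. `write_latency` is the number of instructions the CPU executes after a failing write before the error is reported (0 means the error is synchronous).

```python
# Instructions that follow the failing RTIO write in this toy program.
PROGRAM = [
    "assume no underflow, update software state",
    "submit next event",
]


def run(write_latency: int, program: list) -> list:
    """Return the order of steps when the first RTIO write fails."""
    trace = ["write event (will underflow)"]
    for executed, instruction in enumerate(program):
        if executed == write_latency:
            # The error of the earlier write is only reported now.
            trace.append("RTIOUnderflow raised")
            return trace
        trace.append(instruction)
    trace.append("RTIOUnderflow raised")
    return trace


if __name__ == "__main__":
    for latency in (0, 2):
        print(latency, run(latency, PROGRAM))
```

With a latency of 0 the exception is raised before any dependent instruction runs. With a non-zero latency, instructions that assumed success have already executed by the time the exception arrives, which is the race condition in question.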
Maybe the best solution is to dig into the CPU pipeline and implement custom instructions for RTIO operations. But that's significantly more complicated than the other optimizations proposed here. |
Such instructions, on the other hand, could be pretty fast. With optimized assembly (the production of which by the compiler is another issue), producing a square waveform on a TTL would be something like (@whitequark correct me if I am wrong):
So, a total of ~96ns per TTL state change. |
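For scale, the ~96ns figure can be converted into cycles and a maximum square-wave frequency. The sketch below assumes the 8ns cycle implied by "4 cycles (32ns)" earlier in the thread, and assumes one full square-wave period needs two state changes; the names are made up for this example.

```python
NS_PER_CYCLE = 32 / 4  # from "4 cycles (32ns)" earlier in the thread (assumed CPU clock)
NS_PER_STATE_CHANGE = 96  # estimate quoted above


def cycles_per_state_change(ns: float = NS_PER_STATE_CHANGE) -> float:
    """Number of clock cycles spent per TTL state change."""
    return ns / NS_PER_CYCLE


def max_square_wave_mhz(ns: float = NS_PER_STATE_CHANGE) -> float:
    """Highest sustainable square-wave frequency: one period is two state changes."""
    period_ns = 2 * ns
    return 1e3 / period_ns


if __name__ == "__main__":
    print(cycles_per_state_change(), round(max_square_wave_mhz(), 2))
```

Under these assumptions, 96ns corresponds to 12 cycles per state change and a sustained square wave of roughly 5.2 MHz.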
You could emit an |
I know, but I think it's marginally faster than reading the status register and a lot more complicated. |
Funded by NIST. |
Great! |
Done. |
Nice, thanks @sbourdeauducq. We will test here in the next week or two. |
(from sinara-hw/sinara#47)
address and data, merge address and write_enable)
delta-timestamp register or register-pinning of now
maybe IRQs can be used to handle error conditions and perform submission retries