Make PipelineC LUT aware #45

JulianKemmerer · 2021-11-19T03:51:05Z

A whole world of optimizations exists at the LUT level. Especially post PNR.

@suarezvictor was quick to point out that FPGAs have essentially 'free' registers that make pipelining easy. You could even turn on the registers between every single LUT for maximum FMAX.

suarezvictor · 2021-11-19T13:23:00Z

Related paper: https://www.icsi.berkeley.edu/~nweaver/papers/2003-cslow.pdf

bartokon · 2021-11-19T13:47:54Z

Good idea for the next stages of the project, aka ultra fine tuning.

BTW, synth tools should infer regs automatically.
We could pick a LUT type for some FPGA architecture, search for logic gates that can be represented by a 3/2 or 6/1 LUT, and after that extraction place one reg there. But this is too low level imo.
Maybe we could suggest it to pyrtl?

JulianKemmerer · 2021-11-19T14:21:21Z

Yeah, I think suggesting some basic FPGA arch modeling as part of pyrtl - to accompany their ASIC modeling - makes a lot of sense (like you said, pick a LUT-N arch of some sort).

JulianKemmerer · 2021-11-19T14:31:10Z

Ultra fine tuning yes I like that phrase

In context, the lowest level of feedback (which pyrtl can now provide) is: what is the critical path delay? And that's it. The tool can blindly use that single number to try to adjust its pipelining guesses.

Slightly better is: which MAIN function specifically did the critical path occur in? That only applies to designs with multiple MAIN funcs and/or multiple clock domains.

Next there is a currently-broken fine grain mode: which submodule instance exactly was the critical path inside? That allows even more targeted pipelining guess iterations. But it requires interpreting the syn+pnr output and tracing back to the original module in VHDL, which, given all the optimizations and name mangling, is quite hard to do reliably.

And then how fun to think about this ultra fine grain mode of figuring out which post PNR LUTs correspond to which original HDL modules, for deciding where to turn on those free pipelining regs (unless experimenting with that 'turn on all regs' mode).
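The lowest level of feedback described above (a single critical path delay number) can be sketched as a blind search loop. This is only an illustration, not how PipelineC actually searches: `estimate_delay_ns` is a hypothetical callback standing in for whatever timing tool reports the critical path, and `toy_delay` is an invented model used purely so the example runs.

```python
from typing import Callable, Optional


def search_stage_count(estimate_delay_ns: Callable[[int], float],
                       target_period_ns: float,
                       max_stages: int = 64) -> Optional[int]:
    """Return the smallest stage count whose reported critical path
    delay meets the target clock period, or None if none does."""
    for stages in range(1, max_stages + 1):
        # The only feedback used is one number: the critical path delay.
        if estimate_delay_ns(stages) <= target_period_ns:
            return stages
    return None


def toy_delay(stages: int) -> float:
    """Invented model: 12 ns of logic split evenly across the stages,
    plus 0.5 ns of fixed per-stage register overhead."""
    return 12.0 / stages + 0.5


stages_needed = search_stage_count(toy_delay, target_period_ns=4.0)
print(stages_needed)
```

A real tool would replace `toy_delay` with a synthesis or timing-model run, which is why finer-grained feedback (which MAIN function, which instance) is attractive: it reduces how many of those expensive runs are needed.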

JulianKemmerer · 2021-11-19T14:33:35Z

Or well, I suppose at the LUT level you don't need to know what part of the HDL a LUT maps from. If you know the delays across LUTs, etc. and are modeling those paths, you can probably just start turning on regs in the comb. path based on delay alone.
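A minimal sketch of that "delay alone" idea, assuming a single combinational path whose per-LUT delays are already known. The function name and the delay values are made up for illustration, and the model ignores routing delay and register setup/clock-to-out time, which a real flow would have to include.

```python
from typing import List, Sequence


def place_registers(lut_delays_ns: Sequence[float],
                    target_period_ns: float) -> List[int]:
    """Greedily pick LUT indices after which a register is turned on,
    so that no register-to-register segment exceeds the target period."""
    cuts: List[int] = []
    accumulated = 0.0
    for i, delay in enumerate(lut_delays_ns):
        if delay > target_period_ns:
            raise ValueError(f"LUT {i} alone exceeds the target period")
        if accumulated + delay > target_period_ns:
            # Enable the register on the output of the previous LUT.
            cuts.append(i - 1)
            accumulated = 0.0
        accumulated += delay
    return cuts


# Invented per-LUT delays (ns) along one path, with a 3 ns target period.
path = [1.0, 1.5, 2.0, 0.5, 1.0]
print(place_registers(path, target_period_ns=3.0))
```

The greedy pass fills each stage as far as the period allows before cutting. It does not try to balance stage delays, and it treats one path in isolation, so registers on converging paths would still need to be aligned.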

suarezvictor · 2021-11-19T18:07:18Z

To me initially it shouldn't support all kinds of chips, only the ones supported by open source tools (yosys and nextpnr). Having all the data at hand, a better tool can be designed. Then, when it works, it could be ported to commercially supported chips.

bartokon · 2021-12-07T21:50:11Z

I think most FPGAs (right now) support 3/2 LUTs or 6/1 LUTs. Anyway, we could perform the optimization by putting registers after logic that uses a 3-input function with 2 outputs, etc. A Xilinx CLB has 2 slices, and each slice has a mux on its output.

JulianKemmerer · 2021-12-08T01:02:47Z

I think we could take every PipelineC 'raw VHDL' pipelineable primitive, for ex. the simple add operator (you can divide up an N bit add into however many stages you want*).

What I would want to know is something like:
if you want to, e.g., 'cut a 64b add into 3 cycles', how many bits per stage is that, such that we make best use of LUTs?

Maybe we could start at the maximum and work in reverse: what is the upper limit of pipelining, e.g., a 64b adder, i.e. how many bits per stage maps to the highest fmax pipeline?

I did some experimenting once and there are definitely things like, idk, a 7b adder having less delay than a 3b adder or something like that. It's not a simple mapping of bits per stage to delay.

Blah blah let me know if this rant makes sense
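The 'cut a 64b add into 3 cycles' question can be sketched as below. The even split is just arithmetic. The second helper assumes a table of measured adder delays per chunk width, which this thread does not provide, so the numbers shown are invented placeholders and both function names are hypothetical.

```python
from typing import Dict, List


def bits_per_stage(width: int, stages: int) -> List[int]:
    """Split an N-bit add as evenly as possible across the stages."""
    if not 1 <= stages <= width:
        raise ValueError("stages must be between 1 and width")
    base, extra = divmod(width, stages)
    # The first `extra` stages take one more bit each.
    return [base + 1 if i < extra else base for i in range(stages)]


def widest_at_min_delay(delay_ns_by_width: Dict[int, float]) -> int:
    """Pick the widest chunk that still has the lowest measured delay,
    since delay is not a simple function of bits per stage."""
    best = min(delay_ns_by_width.values())
    return max(w for w, d in delay_ns_by_width.items() if d == best)


print(bits_per_stage(64, 3))

# Invented measurements (ns), only to show a non-monotonic mapping.
measured = {2: 1.0, 3: 1.25, 4: 1.0, 7: 1.0, 8: 1.5}
print(widest_at_min_delay(measured))
```

With a real measured table, the widest chunk at the minimum delay gives the upper limit of useful pipelining: splitting any finer adds registers without raising fmax.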

JulianKemmerer · 2022-09-28T19:15:44Z

Remembered this issue

Why not call it the -O3 flag and say 'now your rendered HDL is unreadable/a netlist of LUTs'?

So the PipelineC HDL gets synthesized to LUTs first -> re-imported as a PipelineC dataflow to be pipelined, i.e. each LUT/~prim is a C func, kinda thinking.
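To illustrate the 'each LUT is a function' view, here is a small sketch that models a LUT as a pure function of its truth-table constant, then builds a 1-bit full adder from two 3-input LUTs. It assumes the common convention that input 0 is the least significant bit of the truth-table index. The helper names are made up and this is not PipelineC code.

```python
from typing import Sequence, Tuple


def lut_eval(init: int, inputs: Sequence[int]) -> int:
    """Evaluate a K-input LUT: the inputs form an index into the
    truth-table constant `init` (inputs[0] is the index LSB)."""
    index = 0
    for bit_pos, value in enumerate(inputs):
        index |= (value & 1) << bit_pos
    return (init >> index) & 1


def full_adder(a: int, b: int, cin: int) -> Tuple[int, int]:
    """1-bit full adder from two 3-input LUTs:
    0x96 is 3-input XOR (sum), 0xE8 is 3-input majority (carry)."""
    total = lut_eval(0x96, [a, b, cin])
    carry = lut_eval(0xE8, [a, b, cin])
    return total, carry


print(full_adder(1, 1, 0))
```

Once every primitive is a function like this, a netlist of LUTs is just a dataflow graph of function calls, which is the form a pipelining pass can cut with registers.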

JulianKemmerer · 2022-10-22T20:41:03Z

To be clear, it is possible today to write C functions wrapping raw VHDL that instantiates LUTs,
and from there you could, with some extra work, get the compiler to use various LUTs with or without IO registers to construct pipeline primitives, up to enough to replace PipelineC raw VHDL operators, ex. add two u32's.

JulianKemmerer · 2022-12-06T12:05:59Z

Thanks Bartus:
https://essay.utwente.nl/79103/1/Kruiper_BA_EEMCS.pdf

suarezvictor · 2022-12-06T14:09:26Z

This paper from Bartus is so good.
I have an application that needs 8-bit multipliers; the paper shows how to reach 410MHz using LUTs and pipelining, instead of 257MHz with DSPs.

JulianKemmerer · 2023-01-29T16:16:13Z

Likely part of #46 and #48 too

Once dealing with device specific netlists, might as well also see if the tools provide .sdf output, which should detail the timing of each LUT IIUC.

https://en.wikipedia.org/wiki/Standard_Delay_Format#:~:text=Standard%20Delay%20Format%20(SDF)%20is,verification%20and%20static%20timing%20analysis.

Thanks @suarezvictor for bringing up
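As a rough sketch of what reading such a file could look like, the snippet below pulls `IOPATH` entries out of SDF text with a regular expression and keeps the worst-case value of the first delay triple. The sample text is hand-written for illustration and is not output from any particular tool. The regex only handles plain port names (it skips edge-qualified ports such as `(posedge CLK)`), and the units depend on the file's `TIMESCALE`, which is ignored here.

```python
import re
from typing import List, Tuple

# Matches: (IOPATH <in_port> <out_port> (<min:typ:max or single value>) ...
IOPATH_RE = re.compile(
    r"\(IOPATH\s+([\w\[\].]+)\s+([\w\[\].]+)\s+\(([^()]*)\)"
)


def iopath_delays(sdf_text: str) -> List[Tuple[str, str, float]]:
    """Return (input port, output port, worst-case delay) per IOPATH,
    using only the first delay triple of each entry."""
    results = []
    for src, dst, triple in IOPATH_RE.findall(sdf_text):
        fields = [float(f) for f in triple.split(":") if f.strip()]
        if fields:  # skip entries with an empty delay value
            results.append((src, dst, max(fields)))
    return results


# Hand-written sample in SDF style, not real tool output.
SAMPLE_SDF = """
(CELL (CELLTYPE "LUT4") (INSTANCE lut_a)
  (DELAY (ABSOLUTE
    (IOPATH I0 O (100:120:140) (90:110:130))
    (IOPATH I1 O (95) (95))
  ))
)
"""

print(iopath_delays(SAMPLE_SDF))
```

A proper reader would parse the nested structure so each delay can be tied to its `INSTANCE`, which is what a LUT-level pipelining pass would actually need.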

JulianKemmerer added the enhancement New feature or request label Nov 19, 2021

JulianKemmerer mentioned this issue Nov 24, 2021

Better search for fmax goals #48

Open

JulianKemmerer added the help wanted Extra attention is needed label Sep 1, 2022

JulianKemmerer self-assigned this Sep 1, 2022

This was referenced Nov 11, 2022

Faster timing estimates #46

Open

Raw VHDL interface should be bits per stage not slices #147

Open

JulianKemmerer removed their assignment Feb 3, 2023

JulianKemmerer mentioned this issue Oct 29, 2023

Make tool optimize for fewer pipelining registers #64

Open
