Make PipelineC LUT aware #45

Open
JulianKemmerer opened this issue Nov 19, 2021 · 13 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@JulianKemmerer
Owner

A whole world of optimizations exists at the LUT level, especially post-PNR.

@suarezvictor was quick to point out that FPGAs have essentially 'free' registers that make pipelining easy. You could even turn on the registers between every single LUT for maximum FMAX.

@JulianKemmerer JulianKemmerer added the enhancement New feature or request label Nov 19, 2021
@suarezvictor

Related paper: https://www.icsi.berkeley.edu/~nweaver/papers/2003-cslow.pdf

@bartokon
Contributor

bartokon commented Nov 19, 2021

Good idea for next stages of the project aka ultra fine tuning.

BTW. synth tools should infer regs automatically.
We could pick a LUT size for a given FPGA architecture, search for logic that could be represented by a 3-input/2-output or 6-input/1-output LUT, and then place one reg after each extracted LUT. But this is too low level imo.
Maybe we could suggest it to pyrtl?
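
A minimal sketch of the "search for logic that fits a LUT" idea, assuming a toy netlist format (a dict from gate name to its fanin names, which is hypothetical and not a PipelineC or pyrtl structure). It only checks whether the whole cone behind a node has at most K primary inputs; real LUT mappers enumerate smaller cuts as well.

```python
def cone_inputs(netlist, node):
    """Return the primary inputs feeding `node`.

    Assumption: any name that is not a key of `netlist` is a primary input.
    """
    if node not in netlist:
        return {node}
    leaves = set()
    for fanin in netlist[node]:
        leaves |= cone_inputs(netlist, fanin)
    return leaves


def fits_in_lut(netlist, node, k=6):
    """True if the full cone behind `node` depends on at most k inputs."""
    return len(cone_inputs(netlist, node)) <= k


# Hypothetical example netlist: gate -> fanins.
netlist = {
    "n1": ["a", "b"],
    "n2": ["c", "d"],
    "n3": ["n1", "n2", "e"],  # depends on a, b, c, d, e
    "n4": ["n3", "f", "g"],   # depends on a..g
}

n3_fits = fits_in_lut(netlist, "n3")  # 5 inputs, fits a 6-input LUT
n4_fits = fits_in_lut(netlist, "n4")  # 7 inputs, does not fit
```

Every node that fits would be a candidate location for one register on its output.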

@JulianKemmerer
Owner Author

Yeah, I think suggesting some basic FPGA arch modeling as part of pyrtl, to accompany their ASIC modeling, makes a lot of sense (like you said, pick a LUT-N arch of some sort).

@JulianKemmerer
Owner Author

Ultra fine tuning yes I like that phrase

In context, there is the lowest level of feedback (which pyrtl can now provide): what is the critical path delay? And that's it. The tool can blindly use that single number to try and adjust its pipelining guesses.

Then there is slightly better: which MAIN function specifically did the critical path occur in? That only applies to designs with multiple MAIN funcs and/or multiple clock domains.

Next there is a currently-broken fine grain mode: which submodule instance exactly was the critical path inside? That allows even more targeted pipelining guess iterations. But it requires interpreting the syn+pnr output and tracing back to the original module in VHDL, which, given all the optimizations and name mangling, is quite hard to do reliably.

And then how fun to think about this ultra fine grain mode of figuring out which post-PNR LUTs correspond to which original HDL modules, for deciding where to turn on those free pipelining regs (unless trying that 'turn on all regs' mode as an experiment).
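
The coarsest feedback level described above (a single critical path number) can be sketched as a loop that keeps adding stages until the reported delay meets the target. `report_critical_path_ns` is a hypothetical stand-in for a real synthesis run, and its delay model is an assumption made for illustration, not how PipelineC or pyrtl compute timing.

```python
def report_critical_path_ns(total_comb_delay_ns, stages, reg_overhead_ns=0.5):
    """Hypothetical stand-in for a synthesis/timing run.

    Toy model (an assumption): the logic divides evenly across stages and
    every stage pays a fixed register overhead.
    """
    return total_comb_delay_ns / stages + reg_overhead_ns


def find_stage_count(total_comb_delay_ns, target_period_ns, max_stages=64):
    """Blindly raise the stage count until the one reported number fits."""
    for stages in range(1, max_stages + 1):
        delay = report_critical_path_ns(total_comb_delay_ns, stages)
        if delay <= target_period_ns:
            return stages, delay
    return None  # target not reachable within max_stages


# 20 ns of logic, 4 ns target period (250 MHz).
result = find_stage_count(total_comb_delay_ns=20.0, target_period_ns=4.0)
```

The finer feedback levels would replace the single number with a per-function or per-instance report, so the loop could add stages only where the critical path actually is.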

@JulianKemmerer
Owner Author

Or well, I suppose for LUT level you don't need to know what part of the HDL it maps from. If you know the delays across LUTs etc. and are modeling those paths, you can probably just start turning on regs in the comb. path based on delay alone.
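
A rough sketch of "turning on regs based on delay alone", assuming the path is already known as an ordered list of per-element delays (each entry lumping a LUT and the routing after it, which is a simplification). The greedy rule and the picosecond values are illustrative assumptions.

```python
def place_registers(delays_ps, target_period_ps):
    """Greedy register placement along one combinational path.

    Walks the path and enables the register after element i-1 whenever
    adding element i would exceed the period budget. Returns the indices
    of the elements whose output register gets enabled.
    """
    enabled_after = []
    accumulated = 0
    for i, delay in enumerate(delays_ps):
        if delay > target_period_ps:
            raise ValueError(f"element {i} alone exceeds the target period")
        if accumulated + delay > target_period_ps:
            enabled_after.append(i - 1)  # cut the path before element i
            accumulated = 0
        accumulated += delay
    return enabled_after


# Hypothetical path of six LUT+routing delays, 1500 ps budget per stage.
path_ps = [600, 900, 500, 700, 800, 400]
regs = place_registers(path_ps, target_period_ps=1500)  # [1, 3]
```

Setting the target to the largest single delay gives the 'turn on as many regs as needed for max FMAX' end of the range.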

@suarezvictor

suarezvictor commented Nov 19, 2021 via email

@bartokon
Contributor

bartokon commented Dec 7, 2021

I think most FPGAs (right now) support 3-input/2-output or 6-input/1-output LUTs. Anyway, we could perform the opt by putting registers after each LUT, e.g. after a 3-input function that has 2 outputs. A Xilinx CLB has 2 slices and each slice has a mux on its output.
image

@JulianKemmerer
Owner Author

I think we could take every PipelineC 'raw VHDL' pipelineable primitive, for ex. the simple add operator (you can divide up an N bit add into however many stages you want*).

What I would want to know is something like:
If you want to, ex., 'cut a 64b add into 3 cycles', how many bits per stage is that, such that we make best use of LUTs?

Maybe could start at the maximum and work backwards: what is the upper limit of pipelining, ex., a 64b adder? Ex. how many bits per stage maps to the highest fmax pipeline?

I did some experimenting once and there are definitely things like, idk, a 7b adder having less delay than a 3b adder or something like that. It's not a simple mapping of bits per stage to delay.

Blah blah let me know if this rant makes sense
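
As a baseline for the 'cut a 64b add into 3 cycles' question, here is the naive even split of bits across stages. It assumes delay grows with bits per stage, which the experiment mentioned above suggests is not reliably true on real LUT/carry hardware, so treat it as a starting guess to be corrected by measured delays.

```python
def split_adder_bits(width, stages):
    """Split a `width`-bit add into `stages` chunks as evenly as possible.

    Earlier stages get the extra bit when the width does not divide evenly.
    """
    if not 1 <= stages <= width:
        raise ValueError("stages must be between 1 and width")
    base, extra = divmod(width, stages)
    return [base + 1 if i < extra else base for i in range(stages)]


three_stage = split_adder_bits(64, 3)    # [22, 21, 21]
max_pipelined = split_adder_bits(64, 64)  # one bit per stage
```

A LUT-aware version would pick chunk sizes from a per-device table of measured adder delays rather than from this arithmetic alone.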

@JulianKemmerer JulianKemmerer added the help wanted Extra attention is needed label Sep 1, 2022
@JulianKemmerer JulianKemmerer self-assigned this Sep 1, 2022
@JulianKemmerer
Owner Author

Remembered this issue

why not call it the -O3 flag and say 'now your rendered HDL is unreadable/a netlist of LUTs'?

So the PipelineC HDL gets synthesized to LUTs first -> re-imported as a PipelineC dataflow to be pipelined = each LUT/~prim is a C func, kinda thinking
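
To illustrate the 'each LUT is a func' idea, here is a sketch that models a LUT as a pure function built from its truth table. The bit ordering (input 0 is the least significant address bit) is an assumed convention, and `make_lut` is a hypothetical helper, not part of PipelineC.

```python
def make_lut(init, num_inputs):
    """Return a function that evaluates a LUT with truth table `init`.

    Bit i of `init` is the output when the inputs, read as a binary number
    with inputs[0] as the least significant bit, equal i.
    """
    if not 0 <= init < (1 << (1 << num_inputs)):
        raise ValueError("init does not fit the truth table size")

    def lut(*inputs):
        if len(inputs) != num_inputs:
            raise TypeError(f"expected {num_inputs} inputs")
        index = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
        return (init >> index) & 1

    return lut


# Two 2-input LUTs composed into a half adder, as a re-imported
# netlist of LUT functions might look.
xor2 = make_lut(0b0110, 2)
and2 = make_lut(0b1000, 2)


def half_adder(a, b):
    """Return (sum, carry) built only from LUT functions."""
    return xor2(a, b), and2(a, b)
```

Once every primitive is a function like this, pipelining becomes a question of where to place registers between function calls in the dataflow.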

@JulianKemmerer
Owner Author

To be clear, it is possible today to write C functions wrapping raw VHDL that instantiates LUTs,
and from there you could, with some extra work, get the compiler to use various LUTs with or without IO registers to construct pipeline primitives, up to enough to replace PipelineC raw VHDL operators, ex. add two u32's.

@JulianKemmerer
Owner Author

@suarezvictor

This Bartus' paper is so good
I have an application that needs 8-bit multipliers; the paper shows how to reach 410MHz using LUTs and pipelining, instead of 257MHz with DSPs.

@JulianKemmerer
Owner Author

Likely part of #46 and #48 too

Once dealing with device-specific netlists, might as well also see if the tools provide .sdf output, which should detail the timing of each LUT IIUC

https://en.wikipedia.org/wiki/Standard_Delay_Format#:~:text=Standard%20Delay%20Format%20(SDF)%20is,verification%20and%20static%20timing%20analysis.

Thanks @suarezvictor for bringing it up
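
A hedged sketch of pulling per-cell pin-to-pin delays out of an SDF file. It assumes one construct per line and only reads the first delay triple of each `IOPATH` entry; the sample text and instance name are made up, and real tool output may need a proper parser (and the `TIMESCALE` header to get units).

```python
import re

INSTANCE_RE = re.compile(r"\(INSTANCE\s*([^\s)]*)\)")
IOPATH_RE = re.compile(r"\(IOPATH\s+(\S+)\s+(\S+)\s+\(([^)]*)\)")


def iopath_delays(sdf_text):
    """Return (instance, in_pin, out_pin, worst_delay) for each IOPATH line.

    worst_delay is the largest value in the first min:typ:max triple.
    """
    results = []
    instance = None
    for line in sdf_text.splitlines():
        match = INSTANCE_RE.search(line)
        if match:
            instance = match.group(1)
        match = IOPATH_RE.search(line)
        if match:
            in_pin, out_pin, triple = match.groups()
            values = [float(v) for v in triple.split(":") if v]
            if values:  # skip empty triples such as "()"
                results.append((instance, in_pin, out_pin, max(values)))
    return results


# Made-up sample in SDF syntax, for illustration only.
SAMPLE_SDF = """
(CELL
  (CELLTYPE "LUT6")
  (INSTANCE add_0/lut_3)
  (DELAY
    (ABSOLUTE
      (IOPATH I0 O (0.100:0.124:0.150) (0.100:0.124:0.150))
    )
  )
)
"""

delays = iopath_delays(SAMPLE_SDF)
```

The resulting per-LUT numbers are the kind of input a delay-based register placement pass would need.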


3 participants