
Faster timing estimates #46

Closed
JulianKemmerer opened this issue Nov 20, 2021 · Discussed in #42 · 35 comments

Labels
enhancement New feature or request · help wanted Extra attention is needed

Comments

@JulianKemmerer (Owner)

Discussed in #42

Originally posted by bartokon November 9, 2021
Hi,
I'm wondering if it is possible to create a fake part with fake timings, so the whole place and route could take seconds (cached synthesis for P&R?). That part should have unlimited resources, and each path delay could be, for example, 1.5*logic_delay. Logic delay could be an average across different FPGA families. All the latency-finding algorithms could then iterate much faster...

For example: generate a netlist from the parsed pipeline code, then generate fake timing paths based on logic depth and wire length.
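
As a rough sketch of this proposed fake-timing model (the 1.5x wire factor is the one suggested above; the average LUT delay is an assumed placeholder value):

AVG_LOGIC_DELAY_NS = 0.5   # assumed average LUT delay across FPGA families (placeholder)
WIRE_FACTOR = 1.5          # from the suggestion above: path delay = 1.5 * logic_delay

def fake_path_delay_ns(logic_depth):
    # No real P&R: delay just scales with logic depth plus a wire fudge factor
    return WIRE_FACTOR * AVG_LOGIC_DELAY_NS * logic_depth

print(1e3 / fake_path_delay_ns(10))  # est. fmax in MHz for a path 10 LUTs deep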

@JulianKemmerer JulianKemmerer added the enhancement New feature or request label Nov 20, 2021
@JulianKemmerer (Owner Author)

As seen in the linked discussion, there are routes to faster timing feedback than running the full syn+pnr tools - thanks @bartokon. This likely involves estimation/modeling of some sort. Currently pyrtl is being experimented with as a backend for providing timing models/delay estimates.

There is still work needed to make models like pyrtl match the behavior of FPGA synthesis+pnr tools: handling FPGA device-specific primitives (ex. multipliers), plus the bigger issue of keeping design signal names clear as they are translated through GHDL->Yosys->BLIF->pyrtl.
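
For reference, a minimal sketch of what that pyrtl experiment could look like (this assumes pyrtl's input_from_blif() and TimingAnalysis() APIs; the BLIF filename is a placeholder and this exact flow is untested here):

import pyrtl

# BLIF emitted by GHDL -> Yosys ("design.blif" is a placeholder name)
with open("design.blif") as f:
    pyrtl.input_from_blif(f.read())

ta = pyrtl.TimingAnalysis()
print("critical path delay (ps):", ta.max_length())
print("est. max freq (MHz):", ta.max_freq())
ta.critical_path()  # prints the slowest path; names come through from the BLIF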

@JulianKemmerer (Owner Author)

RapidWright may also be an option for Xilinx FPGAs:
https://www.rapidwright.io/docs/Introduction.html#what-is-rapidwright

@bartokon (Contributor) commented Nov 20, 2021

Hiho, here are the times for the Artix implementation of mypipeline. (Yeah, I left my PC running synthesis for more than 8 hours xD)
artix_times.txt
[plot: fmax vs. pipeline latency for the Artix part]

We need to extract the worst paths that create these flat frequency areas, for example latency 150~300. There is no need to run every latency from 80 to 150 because the frequency change is ~20MHz tops?

@bartokon (Contributor) commented Nov 20, 2021

And this is for Yosys+nextpnr (this graph looks so broken...) on the LFE5UM5G-85F-8BG756C.
[plot: fmax vs. pipeline latency for the LFE5UM5G-85F-8BG756C]
nextpnr_times.txt

@JulianKemmerer (Owner Author)

Wow, how interesting that the nextpnr version came back so ...unpredictable... That doesn't seem like the results I've gotten from nextpnr before...

@JulianKemmerer (Owner Author)

@bartokon I want to have distinct issues for the two parts of the fmax finding problem. See #48 for more discussion like this. Can you share your above plots again there? :)

@suarezvictor commented Apr 9, 2022

I find this paper pretty interesting on how to do static timing estimations: "Timing Driven C-Slow Retiming on RTL for MultiCores on FPGAs" by Tobias Strauch @arduissimo
http://cloudx.cc/papers/c_slow_retiming_rtl_1807.05446.pdf

It has timing models for various primitives as a function of width.

And a related article where the author states "the µLN indicator can be used for fast static timing estimations", see https://www.eetimes.com/introducing-c-slow-retiming-system-hyper-pipelining/2/


@JulianKemmerer JulianKemmerer added the help wanted Extra attention is needed label Sep 1, 2022
@JulianKemmerer JulianKemmerer self-assigned this Sep 1, 2022
@suarezvictor

I'm sending a pull request with a function that estimates delays based on operations and bit sizes
See here: #143

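A rough sketch of the general idea (the functional form and coefficients below are made up for illustration; see #143 for the actual implementation):

import math

# (a, b) per op; delay_ns ~ a + b * log2(width). These numbers are
# placeholders, not fitted values from the PR.
COEFFS = {"+": (0.6, 0.45), "-": (0.6, 0.45), "*": (1.2, 0.9)}

def estimate_delay_ns(op, width_a, width_b):
    a, b = COEFFS[op]
    width = max(width_a, width_b)  # simplification: use the wider operand
    return a + b * math.log2(width)

print(estimate_delay_ns("+", 22, 22))  # ~2.6 ns with these placeholder numbers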

@JulianKemmerer (Owner Author)

Let's continue here @suarezvictor

You ask:

how to continue with this outstanding results? Testing it with a larger project like the raytracer? Implementing a model of simple arithmetic/logical operations but with a multistage pipeline? Making the database larger with trying more bit width combinations? analyzing other FPGA devices?

I encourage you to take the work in the direction you most want to work on. Feel free to experiment with adding things to DEVICE_MODELS.py, etc. - so yeah: trying more designs, trying different operations, trying different FPGA parts, and certainly multistage pipeline modeling ... so let's talk about pipelining...

@JulianKemmerer (Owner Author)

I think getting to 1) models of basic pipelined operations and then 2) models of 'full netlist' timing are generally what we want.

The goal being: instead of asking a synthesis+pnr tool for the timing of a pipelined operation (or entire design), there should be some kind of faster modeling that can be done...

So let's focus on the smaller initial goal from 1): just predicting the timing of pipelined basic operators...

For that I need to explain how pipelining works right now...

@suarezvictor commented Nov 3, 2022

I propose that for any selected work, you provide the data and I try to develop the models. In the case of pipelining a single binary operation (split into smaller widths), the data should specify the operation, all the intermediate and final widths, plus the delay (or maybe, as a start, just the input widths and number of stages).

@JulianKemmerer (Owner Author) commented Nov 3, 2022

Let's consider the basics first:

You can say "pipeline an operation to N stages."

Ex. pipeline a 32b adder into 2 stages (16b per stage)

And these are the kinds of pipeline cache values @suarezvictor is looking for
(operation, operand widths, num stages) -> fmax

However the tool does not really use just a single 'stages' latency number to characterize the pipeline.

Consider a design with three of these adders back to back in series

->  |---ADDR0---| |---ADDR1---| |---ADDR2---|  ->
 
    |------ total delay = 3*one adder delay  ----|

If you ask 'pipeline that to 2 stages', i.e. split it in half - where does the split happen?

->  |---ADDR0---| |---ADDR1---| |---ADDR2---|  ->
                         /\
         first stage     |        second stage

    |------ total delay = 3*one adder delay  ----|

The stage split would occur halfway through ADDR1. So you can't say 'all 32b adders need to be pipelined in two stages'. Instead you need to say that specifically the ADDR1 instance needs a single stage of pipelining, and the stage boundary needs to go exactly there to break the logic into specific chunks.

So pipelining is ultimately not done on the basic operations as 'pipeline to N stages' but instead as 'pipeline the op, breaking it into these pieces/slices'.

So the model I think would be most useful is
(operation, operand widths, pipeline stages pieces/slices) -> fmax

Right now the tool describes the chunks/slices/locations of pipelining by giving each operation instance an associated list of numbers:

[0.5] # Single slice through the middle at the 50% point, making 2 stages
[0.333, 0.333] # Two slices evenly spread, making 3 stages
[0.25] # Make two stages, one with 25% of the logic, the other with 75%

It would not be hard to configure the tool to collect more data by doing a bunch of --coarse runs of fixed-integer 'pipeline to 2, 3, 4, etc. stages'.
ex.

[] # Comb logic, one stage
[0.5] # Two stages
[0.333, 0.333] # Three stages
...

And then you could work with that as
(operation, operand widths, num stages) -> fmax

Maybe the even stage slicing is all that's needed...

Consider an odd example

[0.25, 0.5] # Make 3 stages: two with 25% of the logic, one with 50%

Do we really need to model that case specifically?
I would think the timing would be limited by the size of the largest stage, ex. the big 50%-sized one.
The timing of [0.25, 0.5] should be essentially the same as just [0.5], since the longest stage in both cases is 50% of the delay...

So maybe, ex.

[0.25] # Make two stages, one with 25% of the logic, the other with 75%

The biggest stage is 75%, so it should have the same timing as a design "split evenly" into 1/0.75 = 1.333 stages.


So maybe the big TLDR here is that the timing model can take the form of:
(operation, operand widths, num stages) -> fmax
but num stages has to be able to take a decimal/fractional value like 1.3
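
To make the fractional stage count concrete, a minimal sketch (illustrative Python, not the tool's actual code) of reducing a slice list to the effective number of stages that limits timing:

def effective_num_stages(slices):
    # Each list entry is a stage's fraction of the total logic; the final
    # stage is the implicit remainder. Timing is limited by the largest stage.
    stage_sizes = list(slices) + [1.0 - sum(slices)]
    return 1.0 / max(stage_sizes)

print(effective_num_stages([0.5]))        # 2.0   (two even stages)
print(effective_num_stages([0.25]))       # 1.333 (limited by the 75% stage)
print(effective_num_stages([0.25, 0.5]))  # 2.0   (same limit as [0.5])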

@JulianKemmerer (Owner Author)

you provide the data, and I try to develop the models.

The data is available in the repo under path_delay_cache for a few different FPGA parts.

Any more data needs to be manually collected by running the synthesis+pnr tools to build the cache+models. That is something I can help you set up and run (ex. write an adder function, tell the tool to pipeline it to 0, 1, 2, 3, etc. clock cycles, collect latency->fmax data, repeat for other operators, script it to repeat for all the FPGA parts, repeat for all the widths, etc.)

I personally don't have time to create and maintain these models, so I am leaving it up to whoever wants to contribute to DEVICE_MODELS.py - experimental ground.

@suarezvictor

My proposal is that you feed the model all the data, and then the model can simplify the data before the estimation. In the current state, for example, when you specify the widths of two operands, I simplify by just taking the greater one.
In any case, let's dig into what kind of shortcuts we can do, but we'll implement them INSIDE the model, not before calling it.

@suarezvictor

In regards to who will maintain the models, I'm inclined to do it. I ask you for the data. You can use whatever you already have, or try to create more. Let's agree on the interfaces ;)

@JulianKemmerer (Owner Author) commented Nov 3, 2022

Yeah, I will help create a cache of pipelined operators that can be used to do

(operation, operand widths, num stages) -> fmax
but num stages has to be able to take a decimal/fractional value like 1.3

But making the model is up to you.

@suarezvictor commented Nov 3, 2022

I can live with those real-numbered "num stages". It seems I'd prefer it as a fraction of 1, let's say 1/N (N = number of stages).

@JulianKemmerer (Owner Author) commented Nov 10, 2022

OK, as discussed, there will now be cached files of period values for pipelined built-in functions:
#145 - this should allow some better models and predictions for faster pipelining to be made.

Paths like pipeline_min_period_cache/vivado/xc7a100tcsg324-1/syn/BIN_OP_PLUS_float_8_14_t_float_8_14_t_0.25.delay

mean that, using Vivado synthesis estimates for a 100t part, a plus operation for float_8_14_t + float_8_14_t was sliced into ~4 stages (1/4 = 0.25) and was observed working at the period (in nanoseconds) recorded inside the file (right now = 6.227ns).

Note that this simply records that 'a plus operation sliced that way was seen in a design working at frequency=F (period=P)'. It is not based on specific measurements of the specific operator, but is instead a ~guess based on the design the operator exists in.
More accurate cache results can be obtained by running tests where the entire design is just a single operator. As such, the cache will build in accuracy over time - or targeted --coarse --sweep's can be done on test designs of individual operators.
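
A sketch of reading one of those cache entries back into the (operation, operand widths, num stages) model form (filename layout inferred from the example above; illustrative Python, not the tool's actual parsing code):

import os
import re

def parse_cache_entry(path):
    # e.g. BIN_OP_PLUS_float_8_14_t_float_8_14_t_0.25.delay
    name = os.path.basename(path)
    m = re.match(r"(BIN_OP_[A-Z]+)_(\w+?_t)_(\w+?_t)_([\d.]+)\.delay$", name)
    op, left_t, right_t, frac = m.groups()
    with open(path) as f:
        period_ns = float(f.read())  # file body holds the observed min period in ns
    return op, left_t, right_t, float(frac), period_ns

op, lt, rt, frac, period = parse_cache_entry(
    "pipeline_min_period_cache/vivado/xc7a100tcsg324-1/syn/"
    "BIN_OP_PLUS_float_8_14_t_float_8_14_t_0.25.delay")
print(op, lt, rt, 1.0 / frac, "stages ->", 1e3 / period, "MHz")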

@JulianKemmerer (Owner Author)

The #48 issue will also very much be interested in that data as it builds up.

@JulianKemmerer (Owner Author)

@suarezvictor the first data points for the pipeline period cache have been uploaded:

https://github.com/JulianKemmerer/PipelineC/tree/master/pipeline_min_period_cache/vivado/xc7a100tcsg324-1/syn

That cache is what results from building the full RT 1080p design targeting 148MHz... so in theory, if we can model this cache, we could get to the RT final pipeline faster - have to think about how best to use the models...

@suarezvictor

Seems like good progress.
Do you think the data is reliable?

BIN_OP_MINUS_int13_t_int16_t_0.31.delay 6.172
BIN_OP_MINUS_int16_t_int13_t_0.89.delay 7.877

BIN_OP_MINUS_int22_t_int22_t_0.34.delay 4.976
BIN_OP_MINUS_int22_t_int22_t_0.71.delay 4.976

@JulianKemmerer (Owner Author)

It will increase in reliability over time as more data points are used to update the cache. (It likely isn't very accurate to start.)

Or if more specific tests are intentionally done (ex. a design of just a single BIN_OP_MINUS_int22_t_int22_t set to --coarse --sweep over some range).

@suarezvictor

let's assume it's useful data

@suarezvictor

I may not understand the data. When we have, let's say, a 22+22 bit add, and you register a 0.5 factor, does it mean you made a 2-stage pipeline of 11 bits each? If you used 11 bits, we'll need that number. We need to know what the synthesizer was asked and what it returned. Maybe if you can illustrate an example with VHDL code it would be even clearer.

@JulianKemmerer (Owner Author)

register a 0.5 factor, does it mean you made a 2 stage pipeline of 11 bit each

Yes.

If you used 11 bit, we'll need that number.

You can obtain the bits-per-stage number by multiplying - ex. 22 * 0.5 = 11 (round up as needed for fractions).

what the synthesizer was asked and what it returned

I am not sure how to get that information for you (beyond humans looking at the log files).

@suarezvictor

How do we round up if the operands don't have the same size?
I think that if you generated code for the synthesizer using N and M bits, that should be in the file. If not, can you write pseudocode for how to get back to the sizes of all operands? I'm a bit hesitant to do operations on the data that have to exactly match the algorithms in other parts of the code, since they may change.

@JulianKemmerer (Owner Author)

how to round up if the operands don't have the same size?

Round up using the maximum size of the two operands; for u32+u16 you would do 32 * the pipeline divisor fraction.
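
In rough pseudocode (illustrative only, not the tool's internal algorithm):

import math

def bits_per_stage(width_a, width_b, fraction):
    # Widest operand times the slice fraction, rounded up
    return math.ceil(max(width_a, width_b) * fraction)

print(bits_per_stage(22, 22, 0.5))   # -> 11
print(bits_per_stage(32, 16, 0.25))  # -> 8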

generated code for the synthesizer using N and M bits, that should be in the file

I agree. It has been something I've been wanting to improve. I'll move it from my todo list to an issue soon.

Generally, for reference, these questions about exactly what is seen at the lowest level might be getting towards 'what LUTs is this?' #45

@JulianKemmerer (Owner Author)

Also @suarezvictor, I encourage you: if you want good data for a specific operator, write a little demo function of just that operator and do ex. --coarse --sweep --start 1 --stop 16, and that will populate the cache with the best possible information.

@suarezvictor

I'll use the algorithm of taking the widest operand and rounding upwards with the factor, until we can do it in a cleaner way (i.e. just saving the actual operand sizes).

@JulianKemmerer (Owner Author)

Re: stuff like building our own ~nextpnr - or just a timing estimate tool, etc.

Very cool to see ECP5 delays documented here from prjtrellis: https://github.com/YosysHQ/prjtrellis-db/tree/master/ECP5/timing

@suarezvictor

I don't get it... 5 years ago?

@JulianKemmerer (Owner Author)

I was noting that that prjtrellis database repo directory is where all the ECP5 timing data is. So if people are still following this issue about timing estimates, they might be able to make a custom ECP5 timing estimation tool that is faster than running full ghdl+yosys+nextpnr iterations.

@suarezvictor

You mean, before PNR?

@JulianKemmerer (Owner Author)

In theory, yeah - but it's almost recreating PNR in difficulty - it's still a big project to undertake. I was just noting that I know where the raw timing data lives now, giving some more scope to the problem.

Repository owner locked and limited conversation to collaborators Nov 16, 2024
@JulianKemmerer JulianKemmerer converted this issue into discussion #217 Nov 16, 2024
