cost: add keep_hierarchy pass with min_cost argument #4344

widlarizer · 2024-04-18T10:29:25Z

Flattening modules improves QoR in practice, but it also makes Yosys take much longer to run.

This PR creates cost.cc with linear cost models for almost all internal cell types to estimate the size of a module after techmapping. To get most of these numbers, I used a modified version of the test_cell command, see emil/gather-cell-size.

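This PR also adds the keep_hierarchy pass, which marks all selected modules with the keep_hierarchy attribute. It takes an optional integer -min_cost argument (called -max_cost in earlier revisions of this PR): when given, only modules whose estimated cost exceeds that threshold are marked, as in the log below.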

6. Executing keep_hierarchy pass.
Marking top.cpu (module too big: 8880 > 100).
Marking top.cpu.alu (module too big: 2068 > 100).
Marking top.cpu.brancher (module too big: 515 > 100).
Marking top.cpu.datamem (module too big: 1235 > 100).
Marking top.cpu.imm_decoder (module too big: 226 > 100).
Marking top.cpu.regfile (module too big: 4631 > 100).
<suppressed ~635 debug messages>

7. Executing FLATTEN pass (flatten design).
Keeping top.cpu.regfile (found keep_hierarchy attribute).
Keeping top.cpu.imm_decoder (found keep_hierarchy attribute).
Keeping top.cpu.datamem (found keep_hierarchy attribute).
Keeping top.cpu.brancher (found keep_hierarchy attribute).
Keeping top.cpu.alu (found keep_hierarchy attribute).
Deleting now unused module top.cpu.controller.
Deleting now unused module top.cpu.writeback.
<suppressed ~2 debug messages>

Effects on runtime and QoR with OpenROAD: TBD
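
A minimal, self-contained sketch of the marking rule that the log above suggests: sum an estimated cost over a module and all of its submodules, and mark the module when the sum exceeds the threshold. The `Module` struct, `estimate_cost` and `mark_big_modules` are hypothetical stand-ins, not the Yosys API or the code in this PR, and the strict `>` comparison is taken only from the "8880 > 100" log lines.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a module: the cost of its own cells plus the
// names of the modules it instantiates.
struct Module {
    unsigned int own_cost = 0;
    std::vector<std::string> submodules;
    bool keep_hierarchy = false;
};

using Design = std::map<std::string, Module>;

// Cost of a module including all submodules, recursively.
// Assumes the instantiation graph has no cycles.
unsigned int estimate_cost(const Design &design, const std::string &name)
{
    const Module &mod = design.at(name);
    unsigned int total = mod.own_cost;
    for (const auto &sub : mod.submodules)
        total += estimate_cost(design, sub);
    return total;
}

// Mark every module whose estimated cost exceeds the threshold.
// Returns the number of modules that were marked.
unsigned int mark_big_modules(Design &design, unsigned int threshold)
{
    unsigned int marked = 0;
    for (auto &entry : design) {
        if (estimate_cost(design, entry.first) > threshold) {
            entry.second.keep_hierarchy = true;
            marked++;
        }
    }
    return marked;
}
```

With this rule a small module such as a controller stays unmarked and is flattened into its parent, while a large ALU or register file keeps its boundary.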

whitequark · 2024-04-18T10:31:41Z

> This PR creates cost.cc with linear cost models for almost all internal cell types to estimate the size of a module after techmapping.

I'm really interested in your methodology here--can you explain it in detail?

widlarizer · 2024-04-18T10:31:51Z

- squash before merge
- add mul, div

widlarizer · 2024-04-18T10:36:33Z

> I'm really interested in your methodology here--can you explain it in detail?

@whitequark Sure: I used test_cell to generate cells of random sizes and to techmap them. Then I parsed the dump and stat output to get the gate count and the input/output widths. I then stared at plots of this data, looking for upper bounds on gate count with respect to the output port width, the sum of all port widths, and the largest input port width. This worked better than I expected. For cells unsupported by test_cell, I looked at their techmap rules or their semantics. There's a TODO comment for the ones where I'm not sure whether I should expect them at the point right before flattening, namely FSM and memory cells.
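
As a rough illustration of the "look for upper bounds" step: given (width, gate count) samples for one cell type, the smallest integer coefficient `c` for which `gates <= c * width` holds on every sample is the maximum of `ceil(gates / width)`. The `Sample` struct and the function name are hypothetical. According to the comment above, the PR's coefficients were chosen by inspecting plots, not computed like this.

```cpp
#include <algorithm>
#include <vector>

// One techmapped test cell: the width it is scaled by and the resulting size.
struct Sample {
    unsigned int width; // e.g. output port width, or sum of all port widths
    unsigned int gates; // gate count after techmapping
};

// Smallest integer coefficient c with gates <= c * width for all samples.
unsigned int upper_bound_coefficient(const std::vector<Sample> &samples)
{
    unsigned int coef = 0;
    for (const auto &s : samples) {
        if (s.width == 0)
            continue; // nothing to scale by; skip degenerate samples
        unsigned int needed = (s.gates + s.width - 1) / s.width; // ceil division
        coef = std::max(coef, needed);
    }
    return coef;
}
```

A bound derived this way follows the single worst sample, so it can jump when a techmap change adds a constant offset for small widths.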

nakengelhardt · 2024-04-19T07:45:43Z

kernel/cost.cc

+
+static unsigned int y_coef(RTLIL::IdString type)
+{
+	// clang-format off


Ew. This is convincing me more and more that we should not use clang-format...

widlarizer (Apr 19, 2024): It's possible that clang-format doesn't fit our repository. This really should have been a switch/case, but ID($...) isn't a constant value; if it were, I wouldn't have had to do this. Personally, I'm very used to hitting Ctrl+Shift+I in VS Code to format the entire file I'm working on. For shared files I intend to get used to formatting modified lines only, which VS Code lets me bind to a shortcut.

whitequark (Apr 19, 2024): You can use type.in(...), which may work better.

nakengelhardt · 2024-04-19T07:56:17Z

kernel/cost.h

+			{ ID($_DFF_P_),  1 },
+			{ ID($_DFF_N_),  1 },


We should probably add all of the DFF and latch types here, but I don't know what a reasonable cost estimate for them would be (1 seems off).

povik (Apr 19, 2024): We can look at the sky130hd and asap7 areas for those and the other cells.

widlarizer (Apr 19, 2024): I'd say that's out of scope. Want me to make an issue? I added these only because I noticed stat.cc was patching them in on its side, so I consolidated that here.

nakengelhardt (Apr 20, 2024): What is using these costs? Is there any information about where they come from and what they are supposed to model?

widlarizer (Apr 20, 2024): stat.cc uses CMOS transistor estimates of these (16). Nothing uses the default gate count estimates of these. My logic was: as primitives to be techmapped to, they have a default gate count of 1.

povik (Apr 22, 2024):

> by being primitives to be techmapped to, they have a default gate count of 1

I am not sure I follow. Wouldn't that make the cost 1 for all the primitives in the list?

phsauter (Jun 12, 2024): I think it may be appropriate to use a model suitable for pre-synthesis estimates, something like:

- buffers and inverters have no cost, as they aren't full 'functions'
- simple 2-input gates (AND, NAND, OR, NOR, MUX...) have a cost of 1
- complex 2-input gates (XOR, XNOR) have a cost of 2
- larger gates cost what they would cost if you implemented them with 2-input gates

This is the same approach used in Zimmermann's thesis; you need to scale it to whatever base you use (times four if you count transistors).

Sequential elements (all of them) should likely be substantially more expensive, because it is (usually) harder to optimize them away and, depending on your technology, there are additional costs related to such a cell (special placement considerations, it may use an additional metal layer internally, it may need decap cells nearby, more flops mean more clock tree, etc.).

In my opinion it may even make sense to split combinational and sequential cost, especially considering the application in keep_hierarchy, since you may want to keep modules which aren't big but have a lot of state (e.g. register files).
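
A minimal sketch of the pre-synthesis model proposed above, with combinational and sequential cost tracked separately. The `GateKind` enum, the `Cost` struct and the value of `kFlopCost` are assumptions made for illustration; the comment gives no concrete number for sequential elements, and none of this is code from the PR.

```cpp
// Hypothetical gate kinds covering the categories named in the comment.
enum class GateKind { Buf, Not, And, Nand, Or, Nor, Mux, Xor, Xnor, Flop };

// Combinational and sequential cost are kept apart so a caller can, for
// example, keep modules that are small but hold a lot of state.
struct Cost {
    unsigned int combinational = 0;
    unsigned int sequential = 0;
};

// Assumption: placeholder value, chosen only to be "substantially more
// expensive" than a simple gate.
constexpr unsigned int kFlopCost = 8;

Cost gate_cost(GateKind kind, unsigned int inputs)
{
    Cost cost;
    // An n-input gate decomposes into (n - 1) two-input gates.
    unsigned int two_input_gates = inputs > 1 ? inputs - 1 : 0;
    switch (kind) {
    case GateKind::Buf:
    case GateKind::Not:
        break; // buffers and inverters are free
    case GateKind::And:
    case GateKind::Nand:
    case GateKind::Or:
    case GateKind::Nor:
        cost.combinational = two_input_gates; // simple gates cost 1 each
        break;
    case GateKind::Mux:
        cost.combinational = 1; // a single 2:1 mux
        break;
    case GateKind::Xor:
    case GateKind::Xnor:
        cost.combinational = 2 * two_input_gates; // complex gates cost 2 each
        break;
    case GateKind::Flop:
        cost.sequential = kFlopCost;
        break;
    }
    return cost;
}
```

Multiplying the combinational figure by four would give the transistor-count base mentioned in the comment.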

whitequark · 2024-04-19T21:58:24Z

kernel/cost.cc

+    } else if (// shift
+        type == ID($shift) ||
+        type == ID($shiftx)) {
+        return 8;


Why is this twice the cost of $shr? I'm very unconvinced that the cost model is sound.

widlarizer (Apr 20, 2024): I'll inspect some techmapped examples of these gates on Monday.

widlarizer (Jul 9, 2024): Running read_rtlil shift.il; techmap; opt; stat on this shift.il:

module \shr
  wire width 10 input 1 \A
  wire width 10 input 2 \B
  wire width 10 output 3 \Y
  cell $shr \UUT
    parameter \A_SIGNED 1
    parameter \A_WIDTH 10
    parameter \B_SIGNED 1
    parameter \B_WIDTH 10
    parameter \Y_WIDTH 10
    connect \A \A
    connect \B \B
    connect \Y \Y
  end
end
module \shift
  wire width 10 input 1 \A
  wire width 10 input 2 \B
  wire width 10 output 3 \Y
  cell $shift \UUT
    parameter \A_SIGNED 1
    parameter \A_WIDTH 10
    parameter \B_SIGNED 1
    parameter \B_WIDTH 10
    parameter \Y_WIDTH 10
    connect \A \A
    connect \B \B
    connect \Y \Y
  end
end

yields

=== shift ===
Number of cells: 122

=== shr ===
Number of cells: 64

This shows that 2x factor (122 vs. 64 cells). To the extent that this Yosys behavior is correct, so is the model.

whitequark left a review comment (changes requested, Apr 21, 2024)

I'm quite interested in the functionality of this PR, but in order to be convinced that it's a good addition it will require the following:

- A definition of what the costs mean, as well as a clear statement of who they are suitable for and who they are not.
- A clear methodology for calculating the cost, which is not a part of some random script in someone else's repository, but a part of Yosys itself. This includes both:
  - A description of the methodology in prose.
  - An executable that calculates the costs according to it.

widlarizer · 2024-04-21T12:58:07Z

> An executable that calculates the costs according to it

This is pretty intense for the current sole use case, which is a heuristic for flattening only modules that aren't huge. For what it's worth, if these were all set to sum*1 except for mul and div, it would probably achieve similar behavior (I can try this out now that I have OpenROAD benchmarks running locally). I can't anticipate other possible uses, so I used realistic coefficients, which I arrived at by staring at those plots. But the model isn't supposed to perfectly capture the simplest possible upper bound. A simple executable might even fail, at random, to find a reasonable coefficient because of constants when something changes in techmap down the line.

widlarizer · 2024-05-07T12:44:38Z

Sorry for the spam @zachjs, I accidentally committed changes in ast that I only used to play around

widlarizer · 2024-05-07T17:38:26Z

@whitequark test_cell is now capable of checking whether the cost is a correct post techmap gate count upper bound. This means the coefficients aren't generated programmatically, but at least they are verified, at least for cells covered by test_cell functionality. Use case example: test_cell -noeval -nosat -check_cost -bloat 4 all. I have also split it away from existing default_gate_cost beyond a simple check. Let me know what you think.

- check why some cells (comparison, $sop) are failing and whether it's a constant offset deal
- find a cost model for $lut
- add statistics result (share of each cell type's instances that are larger than was estimated, and by how much was the worst/typical offender, in relative and absolute numbers)

Open questions:

- can it be used in testing, with its rather intense run time to generate a significant number of cells?

whitequark · 2024-05-07T17:50:58Z

> test_cell is now capable of checking whether the cost is a correct post techmap gate count upper bound. This means the coefficients aren't generated programmatically, but at least they are verified, at least for cells covered by test_cell functionality.

That is much better! I'll take a closer look a bit later.

povik · 2024-05-07T17:59:18Z

> find a cost model for $lut

I propose we supply a conservative upper bound based on the width of the $lut cell alone; that is, we shouldn't bother inspecting the LUT parameter. Usually $lut cells are not entered by the user but are only a product of LUT-mapping passes, so there isn't much of a use case where someone would apply the cost model to them. For that reason, let's go with what's easiest but correct.
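
One possible width-only bound, as a sketch: any k-input LUT can be built as a binary tree of 2:1 muxes that selects among its 2^k table entries, which takes 2^k - 1 muxes regardless of the table contents. Counting each mux as one gate and the function name are assumptions made here. The thread does not say which bound the PR ended up using.

```cpp
#include <cstdint>

// Conservative, contents-independent bound for a LUT with `width` inputs:
// a full tree of 2:1 muxes over 2^width table entries has 2^width - 1 muxes.
std::uint64_t lut_cost_upper_bound(unsigned int width)
{
    if (width >= 64)
        return UINT64_MAX; // saturate instead of overflowing the shift
    return (std::uint64_t{1} << width) - 1;
}
```

For a 4-input LUT this gives 15, and for a 6-input LUT it gives 63. The bound grows exponentially with width, which is acceptable for the small LUT sizes that mapping passes normally produce.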

widlarizer · 2024-05-13T15:43:21Z

Current status and intended usage are below. I think I'll have to leave it as is for now: it's good enough as a heuristic, and I should move on to more pressing topics.

test_cell -noeval -nosat -bloat 4 -check_cost \$not \$pos \$neg \$and \$or \$xor \$xnor \$reduce_and \$reduce_or \$reduce_xor \$reduce_xnor \$reduce_bool \$shl \$shr \$sshl \$sshr \$shift \$shiftx \$lt \$le \$eq \$ne \$ge \$gt \$add \$sub \$logic_not \$logic_and \$logic_or
Warning: Cell type $shl failed in 3.0% cases with worst offender being by 12 (120.0%)
Warning: Cell type $sshl failed in 2.0% cases with worst offender being by 11 (110.0%)

test_cell -noeval -nosat -bloat 1 -check_cost \$mux \$bmux \$demux
Warning: Cell type $demux failed in 8.0% cases with worst offender being by 30 (23.4%)

test_cell -noeval -nosat -check_cost -bloat 2 \$mul \$div \$mod \$divfloor \$modfloor 
Warning: Cell type $div failed in 1.0% cases with worst offender being by 66 (6.7%)
Warning: Cell type $modfloor failed in 3.0% cases with worst offender being by 433 (33.8%)

test_cell -noeval -nosat -bloat 2 -check_cost \$alu \$lcu \$fa
Warning: Cell type $alu failed in 20.0% cases with worst offender being by 13 (10.2%)

test_cell -noeval -nosat -noopt -bloat 2 -check_cost \$lut \$sop
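
The warnings above report, per cell type, the share of test cells whose techmapped size exceeded the estimate, plus the worst excess in absolute and relative terms. A minimal sketch of such a summary follows. The struct and function names are hypothetical, and two details are assumptions because the thread does not specify them: the percentage is taken relative to the estimate, and the "worst offender" is the instance with the largest absolute excess.

```cpp
#include <vector>

// One checked cell instance: the model's estimate and the real gate count.
struct Check {
    unsigned int estimated;
    unsigned int actual;
};

struct CostStats {
    double failure_rate_percent = 0.0; // share of instances above the estimate
    unsigned int worst_excess = 0;     // largest (actual - estimated)
    double worst_excess_percent = 0.0; // that excess relative to its estimate
};

CostStats summarize(const std::vector<Check> &checks)
{
    CostStats stats;
    if (checks.empty())
        return stats;
    unsigned int failures = 0;
    for (const auto &c : checks) {
        if (c.actual <= c.estimated)
            continue; // the estimate held as an upper bound
        failures++;
        unsigned int excess = c.actual - c.estimated;
        if (excess > stats.worst_excess) {
            stats.worst_excess = excess;
            stats.worst_excess_percent =
                c.estimated ? 100.0 * excess / c.estimated : 0.0;
        }
    }
    stats.failure_rate_percent = 100.0 * failures / checks.size();
    return stats;
}
```

Under these assumptions, an instance estimated at 10 gates that maps to 22 would be reported as an excess of 12 (120.0%).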

nakengelhardt · 2024-06-03T12:21:21Z

passes/hierarchy/keep_hierarchy.cc

+
+		CellCosts costs(design);
+
+		for (auto module : design->selected_modules()) {


There are two options to consider for operations on modules: selected_modules() vs. selected_whole_modules_warn(). I think in this case it makes sense to operate on partial modules (this can be useful, e.g., if you want to set keep_hierarchy on any module that contains a particular type of cell), but I wanted to get everyone else's opinion too.

widlarizer (Jul 8, 2024): The best thing to do would be to warn on partially selected modules but still add them, which selected_whole_modules_warn doesn't do, so I'll leave it as is.

nakengelhardt · 2024-06-03T12:35:26Z

kernel/cost.h

+	// Get the cell cost for a cell based on its parameters.
+	// This cost is an *approximate* upper bound for the number of gates that
+	// the cell will get mapped to with "opt -fast; techmap"
+	// The intended usage is for flattening heuristics and similar situations
+	unsigned int get(RTLIL::Cell *cell);
+	// Sum up the cell costs of all cells in the module
+	// and all its submodules recursively
+	unsigned int get(RTLIL::Module *mod);


I would have expected CellCosts to become a base class with the get functions virtual (pure, or maybe the costs for default_gate_cost()?), and the heuristic a derived class with a name like NumInternalGatesEstimate that'll make it immediately obvious at the point of use what's happening. Then the cmos and default costs are just other variants of cost models, rather than static functions that are for some reason defined in a class that now does something unrelated.

widlarizer (Jul 8, 2024): My get methods operate on cells; the previously implemented estimation dicts operate on types. They don't share an interface. The dicts are provided via static functions that only use CellCosts as a namespace. I can move them out to make this distinction more explicit. I previously did something like what you describe but moved away from it.
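
For reference, a self-contained sketch of the class layout suggested in the review comment: an abstract cost model with a virtual `get`, a table-based variant, and a parameter-dependent estimate. `Cell`, `CostModel` and `TableCost` are hypothetical names, and `NumInternalGatesEstimate` is the name floated in the review. The coefficient 8 for $shift/$shiftx comes from the diff quoted earlier, while the default of 4 is a placeholder. This is not the design the PR adopted.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for RTLIL::Cell: a type name and an output width.
struct Cell {
    std::string type;
    unsigned int y_width;
};

// Abstract cost model: every variant answers "what does this cell cost?".
class CostModel {
public:
    virtual ~CostModel() = default;
    virtual unsigned int get(const Cell &cell) const = 0;
};

// Per-type lookup, in the spirit of the existing per-type cost dicts.
class TableCost : public CostModel {
    std::map<std::string, unsigned int> table_;
public:
    explicit TableCost(std::map<std::string, unsigned int> table)
        : table_(std::move(table)) {}
    unsigned int get(const Cell &cell) const override
    {
        auto it = table_.find(cell.type);
        return it == table_.end() ? 0 : it->second; // unknown types cost 0 here
    }
};

// Parameter-dependent estimate: a linear model in the output width.
class NumInternalGatesEstimate : public CostModel {
public:
    unsigned int get(const Cell &cell) const override
    {
        // 8 for $shift/$shiftx as in the quoted diff; 4 is a placeholder default.
        unsigned int coef =
            (cell.type == "$shift" || cell.type == "$shiftx") ? 8 : 4;
        return coef * cell.y_width;
    }
};
```

The reply above points at the friction in this layout: a table keyed by type ignores cell parameters, so forcing both variants behind one per-cell interface hides a real difference between them.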

phsauter · 2024-06-12T20:58:16Z

I don't know how far you want to go with this, but you could also compare your approach against theoretical approaches for calculating the cost of arithmetic operations.
For example, Zimmermann's thesis, on page 54 of the PDF (page 92 of the booklet), gives the sizes of different adder architectures in a few different ways (equation, gate representation, and area in a specific technology).
Currently Yosys uses Brent-Kung (PPA-BK) for its adders.

It is also important to note that if you set demanding timing goals and give ABC an aggressive script, it will start to deviate from this architecture and go towards the performance characteristics of others. So depending on how you use ABC, the final result for large operations and timing critical (deep) paths can deviate drastically from what you may observe with a more relaxed ABC script.

I don't think this is a huge problem since the intended use-case is likely to get rid of really small modules, so they are less likely to have these big arithmetic operations anyway.

I would recommend adding a note to the help message of keep_hierarchy clarifying this use case and stating that outside of it the estimate may no longer be accurate.

nakengelhardt reviewed Apr 19, 2024

whitequark reviewed Apr 19, 2024

whitequark requested changes Apr 21, 2024

widlarizer requested a review from zachjs as a code owner May 7, 2024 12:39

widlarizer removed the request for review from zachjs May 7, 2024 12:44

widlarizer force-pushed the emil/keep_hierarchy branch from 68b1008 to 9e67d5c Compare May 9, 2024 16:36

widlarizer added the status-needs-review and merge-after-jf labels May 24, 2024

nakengelhardt requested changes Jun 3, 2024

widlarizer removed the merge-after-jf label Jun 3, 2024

widlarizer force-pushed the emil/keep_hierarchy branch from 7e47a07 to f04137d Compare July 8, 2024 16:04

nakengelhardt approved these changes Jul 18, 2024

cost: add model for techmapped cell count, keep_hierarchy pass with -min_cost parameter

4b29f64

widlarizer force-pushed the emil/keep_hierarchy branch from 7bb763b to 4b29f64 Compare July 29, 2024 08:26

widlarizer changed the title ~~cost: add keep_hierarchy pass with max_cost argument~~ cost: add keep_hierarchy pass with min_cost argument Jul 29, 2024

widlarizer merged commit 92cac63 into YosysHQ:main Jul 29, 2024
21 checks passed

widlarizer mentioned this pull request Sep 11, 2024

Selectively flatten by size #3562

Open
