
Add recursive partitioning ternary tree (RPTT) #1055

Merged 7 commits into Xilinx:master on Sep 5, 2024

Conversation

@diriLin (Contributor) commented Sep 4, 2024

@eddieh-xlnx (Collaborator) commented:

This PR brings in the recursive partitioning ternary tree (RPTT) technique as described in:

@inproceedings{zang2024parallel,
  title={An Open-Source Fast Parallel Routing Approach for Commercial FPGAs},
  author={Zang, Xinshi and Lin, Wenhao and Lin, Shiju and Liu, Jinwei and Young, Evangeline FY},
  booktitle={Proceedings of the Great Lakes Symposium on VLSI 2024},
  year={2024}
}

This technique is introduced via two new classes: CUFR (which extends RWRoute) and PartialCUFR (which extends PartialRouter).
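To give a flavor of the partitioning scheme, here is a minimal sketch (with invented class and field names, not RapidWright's actual API): the device is bisected at a cutline, nets entirely on one side go to the left or right child, and nets crossing the cutline go to the middle child. Left and right subtrees touch disjoint routing resources and can thus be routed in parallel, with each middle set handled afterwards.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of building a recursive partitioning ternary tree.
// All names here are illustrative; the real implementation lives in CUFR.
public class RpttSketch {

    // A net reduced to its bounding box along the cut axis.
    static class Net {
        final int lo, hi;
        Net(int lo, int hi) { this.lo = lo; this.hi = hi; }
    }

    static class TreeNode {
        final List<Net> nets;          // nets routed at this node
        TreeNode left, right, middle;  // ternary children
        TreeNode(List<Net> nets) { this.nets = nets; }
    }

    // Recursively bisect the interval [lo, hi) and partition nets around the cut.
    static TreeNode build(List<Net> nets, int lo, int hi, int minNets) {
        if (nets.size() <= minNets || hi - lo <= 1) {
            return new TreeNode(nets); // leaf: too few nets (or too small a region) to split
        }
        int cut = (lo + hi) / 2;
        List<Net> l = new ArrayList<>(), r = new ArrayList<>(), m = new ArrayList<>();
        for (Net n : nets) {
            if (n.hi < cut) l.add(n);       // entirely left of the cutline
            else if (n.lo >= cut) r.add(n); // entirely right of it
            else m.add(n);                  // crosses it: goes to the middle child
        }
        TreeNode node = new TreeNode(new ArrayList<>());
        node.left = build(l, lo, cut, minNets);
        node.right = build(r, cut, hi, minNets);
        node.middle = new TreeNode(m);      // the real tree partitions this further too
        return node;
    }
}
```

In the real technique the middle child is itself recursively partitioned (typically along the other axis); the sketch stops one level short for brevity.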

First up are the end-to-end wall clock results for PartialCUFR on all 28 benchmarks using the FPGA24 Routing Contest infrastructure on a 16-core 32-thread machine:
[Figure: normalized end-to-end wall-clock time per benchmark]

Runtime is normalized to the baseline RWRoute wall-clock time, and benchmarks are sorted in ascending order of that time. Lower normalized values mean faster wall-clock time, and values below 1.0 represent a speedup over RWRoute; most benchmarks stay under 1.0. Two series are shown: RPTT-only and RPTT-with-HUS.

For RPTT-only, two benchmarks (mlcad_d181*) slightly exceed 1.0 -- further investigation shows that these designs are likely sensitive to net ordering. Forcing single-threaded RWRoute to use the same net ordering gives RPTT-only a normalized value of 0.91 (hence a speedup).

For the largest designs (mlcad_d181*, boom_soc_v2), where HUS noticeably kicks in, significant runtime improvements are seen. However, HUS also activates on both corundum_* runs, where it appears to hurt performance.

Here's another figure of the CPU time, normalized again to the baseline result:
[Figure: normalized CPU time per benchmark]
In general, hovering around 1.0 shows that CUFR does not spend more CPU time overall than sequential RWRoute; it spreads that work across threads to finish sooner. With HUS, it also does less work outright.

Note that these numbers are for the end-to-end result, which includes reading the FPGA Interchange Format benchmarks and writing them (with routed results) all back out again.

Geomean summary:

  • RPTT-only: 1.6x end-to-end speedup
  • RPTT+HUS: 1.9x end-to-end speedup
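As a quick illustration of how such a summary is derived (the values below are made up, not the PR's measurements): the geometric-mean speedup is the reciprocal of the geometric mean of the per-benchmark normalized runtimes.

```java
// Fictional example of computing a geomean speedup from normalized wall times.
public class GeomeanSpeedup {
    static double geomean(double[] xs) {
        double logSum = 0.0;
        for (double x : xs) logSum += Math.log(x);
        return Math.exp(logSum / xs.length);
    }

    public static void main(String[] args) {
        // Runtime relative to RWRoute for each benchmark (invented numbers).
        double[] normalized = {0.5, 0.8, 0.4};
        double speedup = 1.0 / geomean(normalized);
        System.out.printf("geomean speedup: %.2fx%n", speedup); // prints: geomean speedup: 1.84x
    }
}
```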

A few other bits of note:

  • A neat effect of this technique is that CUFR is deterministic -- regardless of whether one thread is used or many.
  • The routing result is not expected to be the same as for RWRoute due to nets being routed in a different order. Thus it's possible (as we saw for mlcad_d181*) that the routing problem becomes harder or easier, regardless of whether it is being solved in a parallel fashion.
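The determinism property can be pictured with a toy fork/join traversal (invented names, not RapidWright code): sibling subtrees share no mutable state, and their results are always merged in a fixed order (left, then right, then this node's boundary nets), so the outcome is independent of how many worker threads happen to run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Toy illustration of a deterministic parallel schedule over a partition tree.
public class DeterministicSchedule {

    static class Node {
        final String label;        // stands in for the nets routed at this node
        final Node left, right;
        Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    static class RouteTask extends RecursiveTask<List<String>> {
        private final Node node;
        RouteTask(Node node) { this.node = node; }

        @Override
        protected List<String> compute() {
            if (node == null) return new ArrayList<>();
            RouteTask l = new RouteTask(node.left);
            RouteTask r = new RouteTask(node.right);
            l.fork();                                     // left/right may run concurrently...
            List<String> rightOut = r.compute();
            List<String> out = new ArrayList<>(l.join()); // ...but are merged left-first,
            out.addAll(rightOut);                         // then right,
            out.add(node.label);                          // then this node's own nets
            return out;
        }
    }

    public static void main(String[] args) {
        Node tree = new Node("root", new Node("L", null, null), new Node("R", null, null));
        List<String> seq = new ForkJoinPool(1).invoke(new RouteTask(tree));
        List<String> par = new ForkJoinPool(4).invoke(new RouteTask(tree));
        System.out.println(seq.equals(par) + " " + seq); // prints: true [L, R, root]
    }
}
```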

@eddieh-xlnx eddieh-xlnx merged commit b032fce into Xilinx:master Sep 5, 2024
15 checks passed