
Add recursive partitioning ternary tree (RPTT) #1049

Closed · wants to merge 7 commits

Conversation

diriLin (Contributor) commented Aug 14, 2024

No description provided.

eddieh-xlnx (Collaborator) commented Aug 14, 2024

This PR brings in the Recursive Partition Ternary Tree technique as described in:

@inproceedings{zang2024parallel,
  title={An Open-Source Fast Parallel Routing Approach for Commercial FPGAs},
  author={Zang, Xinshi and Lin, Wenhao and Lin, Shiju and Liu, Jinwei and Young, Evangeline FY},
  booktitle={Proceedings of the Great Lakes Symposium on VLSI 2024},
  year={2024}
}

This technique is introduced via two new classes: CUFR (which extends RWRoute) and PartialCUFR (which extends PartialRouter).
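Since the PR description doesn't restate the algorithm itself, here is a rough sketch of the RPTT idea from the cited paper: nets are recursively split by alternating cutlines, with nets falling entirely on one side going to that side's subtree and nets straddling the cut forming a third child, yielding a ternary tree whose left/right subtrees touch disjoint device regions and can therefore be routed in parallel. Everything below (class names, the ForkJoin scheduling, the leaf/depth cutoffs) is a hypothetical illustration under those assumptions, not the PR's actual CUFR code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

public class RpttSketch {

    /** Hypothetical stand-in for a net: only its routing bounding box matters here. */
    record Net(int xMin, int xMax, int yMin, int yMax) {}

    static class RpttNode extends RecursiveAction {
        static final int LEAF_SIZE = 8;   // stop partitioning below this many nets
        static final int MAX_DEPTH = 8;   // guard against nets that straddle every cut

        final List<Net> nets;             // nets owned by this tree node
        final int x0, x1, y0, y1;         // device region covered by this node
        final boolean cutVertical;        // cut direction alternates per level
        final int depth;

        RpttNode(List<Net> nets, int x0, int x1, int y0, int y1,
                 boolean cutVertical, int depth) {
            this.nets = nets;
            this.x0 = x0; this.x1 = x1; this.y0 = y0; this.y1 = y1;
            this.cutVertical = cutVertical;
            this.depth = depth;
        }

        @Override
        protected void compute() {
            if (nets.size() <= LEAF_SIZE || depth >= MAX_DEPTH) {
                route(nets);
                return;
            }
            int cut = cutVertical ? (x0 + x1) / 2 : (y0 + y1) / 2;
            List<Net> side0 = new ArrayList<>();    // entirely left of / below the cut
            List<Net> side1 = new ArrayList<>();    // entirely right of / above the cut
            List<Net> crossing = new ArrayList<>(); // straddles the cutline
            for (Net n : nets) {
                int min = cutVertical ? n.xMin() : n.yMin();
                int max = cutVertical ? n.xMax() : n.yMax();
                if (max < cut) side0.add(n);
                else if (min >= cut) side1.add(n);
                else crossing.add(n);
            }
            // The two sides occupy disjoint device regions, so their subtrees can be
            // routed in parallel without contending for routing resources.
            RpttNode a = cutVertical
                    ? new RpttNode(side0, x0, cut, y0, y1, false, depth + 1)
                    : new RpttNode(side0, x0, x1, y0, cut, true, depth + 1);
            RpttNode b = cutVertical
                    ? new RpttNode(side1, cut, x1, y0, y1, false, depth + 1)
                    : new RpttNode(side1, x0, x1, cut, y1, true, depth + 1);
            invokeAll(a, b);
            // Crossing nets form the third (ternary) child: same region, perpendicular
            // cut, routed after both sides in this simplified schedule.
            if (!crossing.isEmpty()) {
                new RpttNode(crossing, x0, x1, y0, y1, !cutVertical, depth + 1).compute();
            }
        }

        /** Placeholder for actual connection routing. A fixed in-node order plus the
         *  fixed tree shape is what makes the overall schedule deterministic,
         *  regardless of how many threads execute it. */
        void route(List<Net> batch) { }
    }

    public static void main(String[] args) {
        List<Net> nets = List.of(
                new Net(0, 3, 0, 3), new Net(6, 9, 5, 8), new Net(2, 7, 1, 4));
        ForkJoinPool.commonPool().invoke(
                new RpttNode(new ArrayList<>(nets), 0, 10, 0, 10, true, 0));
    }
}
```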

First up are the end-to-end wall-clock results for PartialCUFR on all 28 benchmarks, using the FPGA24 Routing Contest infrastructure on a 16-core/32-thread machine:
[Figure: end-to-end wall-clock runtime of PartialCUFR, normalized to baseline RWRoute, across all 28 benchmarks]

Runtime is normalized to the baseline RWRoute wall time and sorted in ascending order of that time. Lower normalized numbers represent faster wall-clock time, and values below 1.0 represent a speedup over RWRoute. Most of the benchmarks stay under 1.0, representing a speedup. Two lines are shown: RPTT-only and RPTT-with-HUS.

For RPTT-only, two benchmarks (mlcad_d181*) slightly exceed 1.0 -- further investigation shows that these designs are likely sensitive to net ordering. Forcing single-threaded RWRoute to use the same net ordering gives RPTT-only a normalized value of 0.91 (hence a speedup).

For the largest designs (mlcad_d181*, boom_soc_v2), where HUS noticeably kicks in, significant runtime improvements are seen. HUS also activates for both corundum_* runs, though, where it appears to hurt performance.

Here's another figure showing CPU time, again normalized to the baseline result:
[Figure: end-to-end CPU time of PartialCUFR, normalized to baseline RWRoute]
In general, hovering around the 1.0 value shows that CUFR does not spend more CPU time overall than sequential RWRoute; it just spreads that work across more threads to finish sooner. With HUS, this extends to doing less work, too.

Note that these numbers are for the end-to-end result, which includes reading the FPGA Interchange Format benchmarks and writing them (with routed results) all back out again.

Geomean summary:

  • RPTT-only: 1.6x end-to-end speedup
  • RPTT+HUS: 1.9x end-to-end speedup
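
For clarity, the geomean figures above are the geometric mean of the per-benchmark speedups (baseline wall time divided by CUFR wall time). A minimal sketch with made-up numbers, purely to show the computation:

```java
public class GeomeanSketch {
    public static void main(String[] args) {
        // Illustrative per-benchmark speedups (baseline wall time / CUFR wall time);
        // these are made-up values, not the PR's measured data.
        double[] speedups = {1.4, 2.1, 1.8, 1.2, 1.9};
        double logSum = 0;
        for (double s : speedups) logSum += Math.log(s);
        double geomean = Math.exp(logSum / speedups.length);
        System.out.printf("geomean speedup: %.2fx%n", geomean); // ~1.65x
    }
}
```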

A few other bits of note:

  • A neat effect of this technique is that CUFR is deterministic -- regardless of whether one thread is used or many.
  • The routing result is not expected to be the same as for RWRoute due to nets being routed in a different order. Thus it's possible (as we saw for mlcad_d181*) that the routing problem becomes harder or easier, regardless of whether it is being solved in a parallel fashion.


eddieh-xlnx (Collaborator) commented Aug 14, 2024

> Can we use a smaller design or do anything to reduce the memory footprint?

It turns out ThreadLocal can incur memory "leaks", since the only reliable way for its values to get garbage collected (without killing the thread) is for the owning thread to call ThreadLocal.remove(). Even allowing the ThreadLocal itself to be GC-ed does not guarantee its values will be GC-ed. However, once we know all overlaps have been resolved and routing is done -- the point at which we no longer need to re-use ConnectionStates -- there is no practical way to cycle through all the threads to call remove().

Switched to a ConcurrentHashMap instead -- this only gets called once per routed connection, so the performance impact should not be noticeable.

Here, the memory leak was because we had a non-timing-driven RouteNodeGraph (created by earlier tests) followed by a RouteNodeGraphTimingDriven (from later tests), both of which held onto a great many nodes.
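
As a rough illustration of the switch (the type and field names here are hypothetical, not the PR's actual code): a value cached in a ThreadLocal can realistically only be freed by its owning thread, whereas a ConcurrentHashMap keyed by thread can be cleared by any thread once routing finishes. The one-lookup-per-connection access pattern is why the map's extra synchronization cost is expected to be negligible.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ConnectionStateCache {

    /** Hypothetical per-thread scratch state; stands in for the PR's ConnectionState. */
    static class State { /* per-thread routing scratch data */ }

    // Leak-prone variant: each value lives until its owning thread calls remove()
    // (or dies), and there is no practical way to make every pool thread do that
    // once routing has finished.
    private static final ThreadLocal<State> PER_THREAD =
            ThreadLocal.withInitial(State::new);

    // Replacement variant: one map lookup per routed connection, and the whole map
    // can be dropped by any thread once all overlaps are resolved.
    private static final Map<Thread, State> STATES = new ConcurrentHashMap<>();

    static State get() {
        return STATES.computeIfAbsent(Thread.currentThread(), t -> new State());
    }

    /** Unlike ThreadLocal.remove(), this can be called from any thread once
     *  routing completes, releasing all per-thread state at once. */
    static void clear() {
        STATES.clear();
    }
}
```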

eddieh-xlnx changed the base branch from master to 2024.1.2 on August 15, 2024 at 20:48
eddieh-xlnx requested review from clavin-xlnx and removed the review request for clavin-xlnx on August 15, 2024 at 21:19
clavin-xlnx deleted the Xilinx:2024.1.2 branch on September 4, 2024 at 17:53
clavin-xlnx closed this on September 4, 2024
eddieh-xlnx (Collaborator) commented

Looks like this got closed because the target branch was merged. @diriLin, is it possible for you to re-open it? (Otherwise, you'll have to start a new PR.)

diriLin (Contributor, Author) commented Sep 4, 2024

> Looks like this got closed because the target branch was merged. @diriLin, is it possible for you to re-open it? (Otherwise, you'll have to start a new PR.)

@eddieh-xlnx It seems that I cannot re-open this PR because the base branch has been deleted. I will create a new PR instead.
