[VTA][Chisel,de10nano] Chisel fixes and de10nano support #4986
Hi @tmoreau89, @liangfu, please review the PR and let me know if anything needs attention.
@vegaluisjose can you also review this PR?
Thank you @pasqoc for this awesome PR, and extensive fixes to the Chisel codebase. @liangfu, @vegaluisjose and I will help review the PR.
You are welcome!
Thank you @pasqoc for the excellent contribution, and for supporting the de10 end-to-end demo!
One question: have you tested that the unit tests pass in tsim with the PynqConfig?
vta/hardware/chisel/Makefile
Outdated
@@ -109,7 +133,7 @@ else
lib_path = $(vta_dir)/$(BUILD_NAME)/$(VTA_LIBNAME).so
endif

default: lint lib
what was the reason for removing lint here?
As I explained in the commit message, currently lint messes up indentation and requires manual fixes.
It would be better in my opinion to perform lint manually after large code changes only.
Oh I just saw your newest commits. @vegaluisjose set up the Chisel linter, he can chime in on what the best course of action is moving forward. It would be nice to keep linting the code for future submissions.
I don't mind if you leave it there.
I personally do make lib only but then I forget, type make and indentation goes berserk, just annoying :-)
With the --test argument in sbt scalafmt, the linter would not change the code base; see PR #4555.
But unfortunately it does not work, I invite you to try yourself.
sbt scalafmt --test does change the code.
In my case it changes 6 scala files, for instance:
diff --git a/vta/hardware/chisel/src/main/scala/core/LoadUop.scala b/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
index 31e0b56bd..f99ac4948 100644
--- a/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
+++ b/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
@@ -118,12 +118,11 @@ class LoadUop(debug: Boolean = false)(implicit p: Parameters) extends Module {
state := sReadCmd
xlen := xrem
xrem := 0.U
- }
- .otherwise {
- state := sReadCmd
- xlen := xmax - 1.U
- xrem := xrem - xmax
- }
+ }.otherwise {
+ state := sReadCmd
+ xlen := xmax - 1.U
+ xrem := xrem - xmax
+ }
}
}
}
It is indeed an odd indentation choice, but I would be in favor of enabling lint even if the output is odd to have uniformity/consistency across the Chisel codebase. @liangfu @vegaluisjose any input on why the indentation of the } looks funky?
I really don't mind if lint stays or not.
Given your preference I am going to put it back.
But there is definitely a bug in either scalafmt or the way sbt calls it.
I tried to change the scalafmt configuration to fix the indentation but without success...
I think it's better to switch the Scala linter from scalafmt to scalastyle. The former focuses on changing the code format to a predefined style (it removed the --test argument lately, I think), and the latter focuses on checking style errors with no intention of changing the code base. I can put up an update to switch the linter for Scala, if you would like @tmoreau89 @vegaluisjose.
Hey @liangfu ,
that sounds like a plan. Let's do it in a separate PR. I saw the chisel template is also using that.
@vegaluisjose @liangfu this makes me realize that we may want to run CI testing for different FPGA parameterizations in TSIM, e.g. DE10Nano, Pynq, F1. This might consume quite a bit of compute cycles, so they would just be done on unit tests.
I have tested unit tests, conv2d, and deploy classification for tsim and de10 for Chisel with De10Config only.
I have produced the following table while testing the various targets under the same workloads.
Agree, as long as we preserve at least one TSIM based integration test in the CI.
@pasqoc Thanks for your contribution.
I did a quick review. Overall, I think this is a great PR that brings a lot of useful features.
Please give me a cycle to test this.
TOP = VTA
TOP_TEST = Test
BUILD_NAME = build
USE_TRACE = 0
USE_TRACE_FST = 0
I think we might need a comment here, to notify future users that USE_TRACE would default to use VCD as output, and USE_TRACE_FST would not take effect if USE_TRACE is not enabled.
Makes sense, although the logic is fairly simple and self-explanatory.
I did not see any comments in the Makefile for any of the configuration variables so I did not want to start adding ones.
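For readers skimming the thread, the interaction between the two flags discussed above can be summarized in a small sketch. It is written in Python rather than Make purely for illustration; the function name `trace_format` is hypothetical and mirrors the behaviour described in the review comment, not the actual Makefile rules.

```python
def trace_format(use_trace: int, use_trace_fst: int):
    """Return the trace format implied by the two Makefile flags.

    Assumption: this mirrors the behaviour described in the review
    comment (VCD by default, FST only when tracing is enabled), and is
    not a copy of the real Makefile logic.
    """
    if not use_trace:
        # USE_TRACE_FST has no effect unless USE_TRACE is enabled.
        return None
    return "fst" if use_trace_fst else "vcd"
```

With both flags left at their default of 0, no trace is produced at all.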
}.elsewhen((io.vme_rd.data.fire() || isZeroPad) &&
set === (tp.tensorLength - 1).U &&
tag === (tp.numMemBlock - 1).U)
{
I think the linter might remind you to move the bracket to the previous line.
vta/src/de10nano/de10nano_mgr.h
Outdated
#ifndef VTA_DE10NANO_DE10NANO_MGR_H_
#define VTA_DE10NANO_DE10NANO_MGR_H_

extern "C" {
I suggest taking the following format
#ifdef __cplusplus
extern "C" {
#endif
// ...
#ifdef __cplusplus
}
#endif
In addition, I think this is a C++ header file (please correct me if I misunderstood), so it's better to include the C++ variants of the standard library headers, such as #include <cstdint>
This style is somewhat redundant nowadays, as the compiler already knows what to do.
You are right, this is a C++ header file and chances are it will not be used in a C only context.
I can make it pure C++ if you like.
vta/src/de10nano/de10nano_mgr.h
Outdated
// Register definition and address map taken from cv_5v4.pdf,
// Cyclone V Hard Processor System Technical Reference Manual,
// chapter 5: FPGA Manager.
struct de10nano_mgr {
Please use CamelCase for class names.
Sure, I don't mind either way.
from tvm import rpc
from vta import get_bitstream_path, download_bitstream, program_fpga, reconfig_runtime

host = os.environ.get("VTA_PYNQ_RPC_HOST", "de10nano")
@tmoreau89 Shall we rename this environment variable for targets other than PYNQ?
I was also confused when I first browsed the code.
I would have expected something like:
VTA_RPC_HOST = {pynq|de10nano|ultra96|etc}
VTA_RPC_PORT = 9091
I think I've been using this variable to be able to target and program different FPGAs. In that spirit, I would use VTA_DE10_RPC_HOST for this test. Your bashrc could contain multiple hosts, including VTA_PYNQ_RPC_HOST, VTA_DE10_RPC_HOST, VTA_ULTRA96_RPC_HOST, etc.
Makes perfect sense, perhaps coupled with your idea of folding fpga target specific info in environment.py.
Right now, introducing VTA_DE10_RPC_HOST everywhere VTA_PYNQ_RPC_HOST is used, including matrix_multiply.py for instance, would add a lot of boilerplate code.
I see; upon second thought it would be cleaner to use VTA_RPC_HOST environment variables, given that some of our unit tests assume that we're targeting the pynq boards.
I think that's totally acceptable as long as we update documentation too to reflect that change in requirements. Grep-ing for VTA_PYNQ_RPC_HOST should do the trick.
Agree!
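As a rough illustration of where this thread ended up, a test script could resolve the board address from the single VTA_RPC_HOST / VTA_RPC_PORT pair as sketched below. The helper name `rpc_endpoint` and its `env` parameter are hypothetical; the default host "de10nano" and port 9091 are simply the values quoted earlier in the discussion.

```python
import os


def rpc_endpoint(env=None, default_host="de10nano", default_port=9091):
    """Return (host, port) of the board's RPC server.

    `env` defaults to os.environ; passing a plain dict keeps the lookup
    easy to test. Helper name and defaults are illustrative only.
    """
    env = os.environ if env is None else env
    host = env.get("VTA_RPC_HOST", default_host)
    # Environment values are strings, so normalise the port to int.
    port = int(env.get("VTA_RPC_PORT", default_port))
    return host, port
```

A benchmarking script that cycles through several boards would then only need to export a different VTA_RPC_HOST before each run.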
Last commit addresses previous comments.
General question.
It looks like at the moment the cpp lint test is failing: https://ci.tvm.ai/blue/organizations/jenkins/tvm/detail/PR-4986/5/pipeline You may have to go in and address the lint errors in
Right now the build is failing in runtime.cc, not sure why.
Any chance the Docker image may be using python2 instead of python3?
I was running into the same issue with the
Just removed them, but f strings are so much better .....
This is an error on that cmake file actually because everything should be python3 by now. In fact, f-strings are used in other parts of the python codebase. This is a good find.
Do you want me to remove python from
I think it would be nice to do that in a separate PR.
Hey @pasqoc, I just ran/built everything on both Linux and Mac, but it seems to be failing consistently in
Perhaps I am missing something? I built everything directly from your
Agree with @vegaluisjose on the separate PR; if you submit it, we'll work to merge it quickly so you can rebase against master and get those changes in.
One thing I forgot to mention (and I did not change the deploy_classification.py code in the branch) is that currently you must avoid using the timer when doing inference.
I suggest changing the timer object to avoid performing the "warming run" job and to enable running a single job.
Resnet18_v1 should only take 31M cycles for one run and around 5 minutes, so you are running 12 jobs in around an hour.
I can add the changes to the timer if you want me to, and then changing num = rep = 1 in deploy_classification.py should make the test pass.
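The arithmetic behind the "around an hour" estimate can be sketched as follows. The values num = 4 and rep = 3 are an assumption chosen only because they multiply to the 12 jobs mentioned above; the thread itself states just the total. The extra warm-up run is ignored here, and both function names are hypothetical.

```python
def simulated_minutes(num, rep, minutes_per_run):
    """Wall-clock minutes for num * rep timed runs (warm-up run ignored)."""
    return num * rep * minutes_per_run


def cycles_per_second(cycles, minutes):
    """Simulation throughput implied by a cycle count and a run time."""
    return cycles / (minutes * 60)


# Assumed timer settings: 4 * 3 = 12 jobs at about 5 minutes each.
total = simulated_minutes(num=4, rep=3, minutes_per_run=5)
# With num = rep = 1 only a single 5-minute job is simulated.
single = simulated_minutes(num=1, rep=1, minutes_per_run=5)
```

Under the quoted figures (31M cycles in roughly 5 minutes), `cycles_per_second(31_000_000, 5)` works out to a little over 100k simulated cycles per second.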
Hey @pasqoc, you are right, that makes the test work. It works for me without changing the number of repetitions. Thoughts on this? Here is the trace
@@ -62,20 +62,40 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
val tag = Reg(UInt(8.W))
val set = Reg(UInt(8.W))

// Dynamically adjust the size of DMA transfers to avoid crossing page boundaries.
final val ADAPTIVE_DMA_XFER_ENABLE = true
Just curious -- is there a case where this would be false? If this is fixing the wraparound AXI bug, then we should make it the default and update all the code that depends on it. For example:
val xfer_init_bytes = if (ADAPTIVE_DMA_XFER_ENABLE) xmax_bytes - xfer_init_addr % xmax_bytes else xmax_bytes
will be replaced by
val xfer_init_bytes = xmax_bytes - xfer_init_addr % xmax_bytes
Same in TensorUtil.scala
There is a small cost that the fix adds in terms of timing, which could be mitigated with a refactoring of the FSM.
I was not sure whether the wraparound problem is a systematic limitation or not, so I decided to add it in a parametric way just in case another platform does not exhibit the issue.
I left the static constant there but one could drive it from the Configs.scala file if needed.
The idea is to turn it off when trying another platform and, if successful, set the parameter to false in a config file.
Does it make sense?
I see. AFAIK the problem is systematic across all platforms, so we might want to fix it right away. Also, I believe de10-nano is AXI3, therefore other platforms using AXI4 should also work with this because of AXI's "backwards compatibility."
Later, we can optimize for performance if we would like to.
Well, this is a limitation, not a feature though, unless you are talking about bug backward compatibility.
Indeed, you may very well have this limitation removed in future versions of AXI or maybe be using other interconnects that do not have it, and you may choose to take advantage of better performance.
That said, I don't mind taking it out, it will save two lines of scala code ;-)
Yeah, I think we can remove it here and in TensorUtil.scala.
Ok, will do shortly!
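To make the `xmax_bytes - xfer_init_addr % xmax_bytes` expression from this thread concrete, here is a small software model of the idea. It is only a sketch in Python, not the Chisel FSM: the function name `split_dma_transfer` is hypothetical, and it assumes the boundary that must not be crossed falls on multiples of `xmax_bytes`.

```python
def split_dma_transfer(addr, nbytes, xmax_bytes):
    """Split [addr, addr + nbytes) into chunks that never cross a
    multiple of xmax_bytes.

    Software model only; the hardware adjusts the initial and stride
    transfer sizes inside the load/store state machines.
    """
    chunks = []
    while nbytes > 0:
        # Bytes available before the next xmax_bytes boundary; this is
        # the same expression as xfer_init_bytes in the review comment.
        room = xmax_bytes - addr % xmax_bytes
        size = min(room, nbytes)
        chunks.append((addr, size))
        addr += size
        nbytes -= size
    return chunks


# A 0x40-byte transfer starting 0x10 bytes below a 0x1000 boundary is
# issued as a short initial chunk followed by an aligned remainder.
example = split_dma_transfer(0x0FF0, 0x40, 0x1000)
```

An already aligned transfer that fits within one window comes back as a single chunk, so the adjustment only changes behaviour when a boundary would otherwise be crossed.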
Issue: The Cyclone V FPGA on board of the DE10-Nano can only be programmed using the JTAG port, which is a limiting option for users. Solution: Add support for the remote programming of the FPGA implementing the FPGA programming manager protocol published in the Cyclone V user manual. * Added file de10nano_mgr.h implementing an FPGA manager class that supports handling of control and status registers as well as a push-button option to program the FPGA. The class can be easily extended to include more registers if needed. * Used an instance of the FPGA manager to implement function VTAProgram also warning users when incompatible bitstream files are used. * Registered VTAProgram as a global function and modified the program_bitstream python class to use it.
Issue: The de10nano target has incomplete, non-working support for runtime reconfiguration, bitstream programming, and examples of usage. Solution: Complete runtime support for the de10nano target. * Modified VTA.cmake to comment out a default override for VTA_MAX_XFER to 21 bit wide. * Modified VTA.cmake to add needed de10nano include dirs. * Modified relevant files to support de10nano same way as other targets for VTA runtime reconfiguration and FPGA programming. * Added test_program_rpc.py example as a runtime FPGA programming example. Note that unlike the pynq target no bitstream is either downloaded or programmed when the bitstream argument is set to None. * Cosmetic changes to vta config files.
Issue: The LoadUop FSM incorrectly advances the address of the next uop to read from DRAM when the DRAM data valid bit is deasserted and asserted at the end of a read. This is caused by a mismatch in the logic of the state and output portions of the FSM. This is one of two issues that were gating the correct operation of VTA on the DE10-Nano target. Solution: Modify the logic of the output section of the FSM to include a check on the DRAM read valid bit or fold the output assignment into the state section. * Folded the assignment of the next uop address in the state section of the FSM.
Issue: In the DE10-Nano target and possibly in others, DMA transfers that cross the boundaries of memory pages result in incorrect reads and writes from and to DRAM. When this happens depending on different input values, VTA loads and stores exhibit incorrect results for DMA pulses at the end of a transfer. This is one of two issues that were gating the DE10-Nano target from functioning correctly, but may affect other Chisel based targets. Solution: Add support for dynamically adjustable DMA transfer sizes in load and store operations. For a more elegant and modular implementation the feature can be enabled at compile time with a static constant that can be passed as a configuration option. * Modified the load and store finite state machines to dynamically adjust the size of initial and stride DMA transfers. The feature is enabled by default by virtue of the static constant ADAPTIVE_DMA_XFER_ENABLE.
Issue: Cross reference between FSIM, TSIM, and Chisel based FPGA traces is an invaluable instrument that enables fast analysis on FSIM, and analysis/debug on TSIM and FPGA, especially for complex flows like conv2d or full inferences. Currently this cannot be done easily since a suitable reference is missing. The clock cycle event counter cannot be used since it is undefined in FSIM and not reliable between TSIM and FPGA because of different latencies. Solution: Introduce a new event counter that preserves a program order across FSIM, TSIM, FPGA. We propose adding the accumulator write event counter in the Chisel EventCounter class and a simple instrumentation in the FSIM runtime code. Note that this technique enabled finding the Chisel issues reported in the PR, which would have been otherwise far more difficult. * Added the acc_wr_count event counter and changed interfaces accordingly.
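One way to picture how a program-order counter helps with cross-referencing is the sketch below. The trace representation (a dict from acc_wr_count to the value written) and the function name `first_divergence` are assumptions made for illustration; the PR does not specify a trace format.

```python
def first_divergence(ref_trace, dut_trace):
    """Return the first acc_wr_count at which two traces disagree, or None.

    Each trace maps an accumulator-write count to the value written at
    that point (hypothetical format). Because the counter follows
    program order rather than clock cycles, entries from FSIM, TSIM and
    FPGA runs can be compared key by key.
    """
    for count in sorted(ref_trace.keys() & dut_trace.keys()):
        if ref_trace[count] != dut_trace[count]:
            return count
    return None


# Hypothetical traces: the second run diverges at the write numbered 1.
fsim = {0: 10, 1: 20, 2: 30}
tsim = {0: 10, 1: 21, 2: 30}
mismatch = first_divergence(fsim, tsim)
```

The returned count points at the earliest accumulator write to inspect in both traces, which is not possible with a cycle counter that differs between simulators.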
* Use CamelCase class names. * Use C++ style C include header files. * Add comments to Chisel makefile.
* Reorder C and C++ includes in de10nano_mgr.h. * Restore lint as default target in Chisel Makefile.
Issue: With more FPGA targets coming online the initial method of using individual environment variables to specify target IP and port does not scale well. Solution: Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed every time a different target is used. For instance in a script used to benchmark all targets. * Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT with VTA_RPC_HOST and VTA_RPC_PORT, respectively.
Rebased and made small changes after running new linter.
LGTM
Ok
Thanks @pasqoc for the work, the changes LGTM!
Thanks @pasqoc @liangfu @vegaluisjose for the work and the reviews; the PR has been merged!
* [VTA][de10nano] Enable user defined target frequency. Issue: The VTA target frequency on the DE10-Nano is hardcoded to 50MHz unnecessarily limiting performance. Solution: Add a PLL to the FPGA sub-system along with support for the selection of a user specified frequency at build time. The board successfully builds and runs at 100MHz. * Added a PLL in the soc_system.tcl platform designer generator script. * Modified the Makefile to automatically set the target frequency from that specified in the pkg_config.py file. * Modified the Makefile to generate a bitstream with an RBF format that enables programming of the FPGA directly from the on-board processor. Specifically, the RBF is generated in FastParallel32 mode with compression, which corresponds to the default MSEL switch setting on the board, i.e. 01010. * Added a false path override to file set_clocks.sdc to turn off unconstrained path warnings on the VTA pulse LED. * [VTA][TSIM] Add more debug and tracing options. * Modified Makefile to change default config to DefaultDe10Config. * Added option in Makefile to produce more detailed tracing for extra observability in debugging complex scenarios. * Added option in Makefile to produce traces in FST format which are 2 orders of magnitude smaller, although much slower to generate. * Added option in Makefile to build the simulator with GCC address sanitizer. * Modified Makefile to not lint the scala code by default avoiding unintended wrong indentation. Linting should be better performed manually on a per-need basis. * [VTA][de10nano] Enable remote programming of FPGA. Issue: The Cyclone V FPGA on board of the DE10-Nano can only be programmed using the JTAG port, which is a limiting option for users. Solution: Add support for the remote programming of the FPGA implementing the FPGA programming manager protocol published in the Cyclone V user manual.
* Added file de10nano_mgr.h implementing an FPGA manager class that supports handling of control and status registers as well as a push-button option to program the FPGA. The class can be easily extended to include more registers if needed. * Used an instance of the FPGA manager to implement function VTAProgram also warning users when incompatible bitstream files are used. * Registered VTAProgram as a global function and modified the program_bitstream python class to use it. * [VTA][de10nano] Enhance de10nano runtime support. Issue: The de10nano target has incomplete, non-working support for runtime reconfiguration, bitstream programming, and examples of usage. Solution: Complete runtime support for the de10nano target. * Modified VTA.cmake to comment out a default override for VTA_MAX_XFER to 21 bit wide. * Modified VTA.cmake to add needed de10nano include dirs. * Modified relevant files to support de10nano same way as other targets for VTA runtime reconfiguration and FPGA programming. * Added test_program_rpc.py example as a runtime FPGA programming example. Note that unlike the pynq target no bitstream is either downloaded or programmed when the bitstream argument is set to None. * Cosmetic changes to vta config files. * [VTA][Chisel] LoadUop FSM bug fix. Issue: The LoadUop FSM incorrectly advances the address of the next uop to read from DRAM when the DRAM data valid bit is deasserted and asserted at the end of a read. This is caused by a mismatch in the logic of the state and output portions of the FSM. This is one of two issues that were gating the correct operation of VTA on the DE10-Nano target. Solution: Modify the logic of the output section of the FSM to include a check on the DRAM read valid bit or fold the output assignment into the state section. * Folded the assignment of the next uop address in the state section of the FSM. * [VTA][Chisel] Dynamically adjust DMA transfer size.
Issue: In the DE10-Nano target and possibly in others, DMA transfers that cross the boundaries of memory pages result in incorrect reads and writes from and to DRAM. When this happens depending on different input values, VTA loads and stores exhibit incorrect results for DMA pulses at the end of a transfer. This is one of two issues that were gating the DE10-Nano target from functioning correctly, but may affect other Chisel based targets. Solution: Add support for dynamically adjustable DMA transfer sizes in load and store operations. For a more elegant and modular implementation the feature can be enabled at compile time with a static constant that can be passed as a configuration option. * Modified the load and store finite state machines to dynamically adjust the size of initial and stride DMA transfers. The feature is enabled by default by virtue of the static constant ADAPTIVE_DMA_XFER_ENABLE. * [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug. Issue: Cross reference between FSIM, TSIM, and Chisel based FPGA traces is an invaluable instrument that enables fast analysis on FSIM, and analysis/debug on TSIM and FPGA, especially for complex flows like conv2d or full inferences. Currently this cannot be done easily since a suitable reference is missing. The clock cycle event counter cannot be used since it is undefined in FSIM and not reliable between TSIM and FPGA because of different latencies. Solution: Introduce a new event counter that preserves a program order across FSIM, TSIM, FPGA. We propose adding the accumulator write event counter in the Chisel EventCounter class and a simple instrumentation in the FSIM runtime code. Note that this technique enabled finding the Chisel issues reported in the PR, which would have been otherwise far more difficult. * Added the acc_wr_count event counter and changed interfaces accordingly. * [VTA][de10nano] Comply with linting rules. * [VTA] Appease make lint. * [VTA] Disable pylint import not top level error.
* [VTA][Chisel,de10nano] Linting changes. * Use CamelCase class names. * Use C++ style C include header files. * Add comments to Chisel makefile. * [VTA][de10nano] * Reorder C and C++ includes in de10nano_mgr.h. * Restore lint as default target in Chisel Makefile. * [VTA][de10nano] Do not use f string in pkg_config.py. * [VTA][de10nano] Remove overlooked f strings in pkg_config.py. * [VTA][de10nano] Fixed typo. * [VTA][TSIM] Check if gcc has align-new. * [VTA][Chisel] Make adaptive DMA transfer default. * [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*. Issue: With more FPGA targets coming online the initial method of using individual environment variables to specify target IP and port does not scale well. Solution: Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed every time a different target is used. For instance in a script used to benchmark all targets. * Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT with VTA_RPC_HOST and VTA_RPC_PORT, respectively. * [VTA][Chisel] Comply with new linter.
* [VTA][de10nano] Enable user defined target frequency. Issue: The VTA target frequency on the DE10-Nano is hardcoded to 50MHz unnecessarily limiting performance. Solution: Add a PLL to the FPGA sub-system along with support for the selection of a user specified frequency at build time. The board successfully builds and runs at 100MHz. * Added a PLL in the soc_system.tcl platform designer generator script. * Modified the Makefile to automatically set the target frequency from that specified in the pkg_config.py file. * Modified the Makefile to generate a bitstream with an RBF format that enables programming of the FPGA directly from the on-board processor. Specifically, the RBF is generated in FastParallel32 mode with compression, which corresponds to the default MSEL switch setting on the board, i.e. 01010. * Added a false path override to file set_clocks.sdc to turn off unconstrained path warnings on the VTA pulse LED. * [VTA][TSIM] Add more debug and tracing options. * Modified Makefile to change default config to DafaultDe10Config. * Added option in Makefile to produce more detailed tracing for extra observability in debugging complex scenarios. * Added option in Makefile to produce traces in FST format which are 2 orders of magnitude smaller, although much slower to generate. * Added option in Makefile to build the simulator with GCC address sanitizer. * Modified Makefile to not lint the scala code by default avoiding unintended wrong indentation. Linting should be better performed manually on a per-need basis. * [VTA][de10nano] Enable remote programming of FPGA. Issue: The Cyclone V FPGA on board of the DE10-Nano can only be programmed using the JTAG port, which is a limiting option for users. Solution: Add support for the remote programming of the FPGA implementing the FPGA programming manager protocol published in the Cyclone V user manual. 
* [VTA][de10nano] Enable user defined target frequency.
  Issue: The VTA target frequency on the DE10-Nano is hardcoded to 50MHz, unnecessarily limiting performance.
  Solution: Add a PLL to the FPGA sub-system along with support for selecting a user-specified frequency at build time. The board successfully builds and runs at 100MHz.
  * Added a PLL in the soc_system.tcl platform designer generator script.
  * Modified the Makefile to automatically set the target frequency from that specified in the pkg_config.py file.
  * Modified the Makefile to generate a bitstream in RBF format, which enables programming of the FPGA directly from the on-board processor. Specifically, the RBF is generated in FastParallel32 mode with compression, which corresponds to the default MSEL switch setting on the board, i.e. 01010.
  * Added a false path override to file set_clocks.sdc to turn off unconstrained path warnings on the VTA pulse LED.
* [VTA][TSIM] Add more debug and tracing options.
  * Modified Makefile to change default config to DefaultDe10Config.
  * Added option in Makefile to produce more detailed tracing for extra observability in debugging complex scenarios.
  * Added option in Makefile to produce traces in FST format, which are 2 orders of magnitude smaller, although much slower to generate.
  * Added option in Makefile to build the simulator with GCC address sanitizer.
  * Modified Makefile to not lint the Scala code by default, avoiding unintended wrong indentation. Linting is better performed manually on a per-need basis.
* [VTA][de10nano] Enable remote programming of FPGA.
  Issue: The Cyclone V FPGA on board the DE10-Nano can only be programmed using the JTAG port, which is a limiting option for users.
  Solution: Add support for remote programming of the FPGA by implementing the FPGA programming manager protocol published in the Cyclone V user manual.
  * Added file de10nano_mgr.h implementing an FPGA manager class that supports handling of control and status registers as well as a push-button option to program the FPGA. The class can be easily extended to include more registers if needed.
  * Used an instance of the FPGA manager to implement function VTAProgram, which also warns users when incompatible bitstream files are used.
  * Registered VTAProgram as a global function and modified the program_bitstream python class to use it.
* [VTA][de10nano] Enhance de10nano runtime support.
  Issue: The de10nano target has incomplete, non-working support for runtime reconfiguration and bitstream programming, and lacks usage examples.
  Solution: Complete runtime support for the de10nano target.
  * Modified VTA.cmake to comment out a default override for VTA_MAX_XFER to 21 bit wide.
  * Modified VTA.cmake to add needed de10nano include dirs.
  * Modified relevant files to support de10nano in the same way as other targets for VTA runtime reconfiguration and FPGA programming.
  * Added test_program_rpc.py as a runtime FPGA programming example. Note that, unlike the pynq target, no bitstream is either downloaded or programmed when the bitstream argument is set to None.
  * Cosmetic changes to vta config files.
* [VTA][Chisel] LoadUop FSM bug fix.
  Issue: The LoadUop FSM incorrectly advances the address of the next uop to read from DRAM when the DRAM data valid bit is deasserted and asserted at the end of a read. This is caused by a mismatch in the logic of the state and output portions of the FSM. This is one of two issues that were gating the correct operation of VTA on the DE10-Nano target.
  Solution: Modify the logic of the output section of the FSM to include a check on the DRAM read valid bit, or fold the output assignment into the state section.
  * Folded the assignment of the next uop address into the state section of the FSM.
* [VTA][Chisel] Dynamically adjust DMA transfer size.
  Issue: In the DE10-Nano target, and possibly in others, DMA transfers that cross the boundaries of memory pages result in incorrect reads and writes from and to DRAM. When this happens, depending on the input values, VTA loads and stores exhibit incorrect results for DMA pulses at the end of a transfer. This is one of two issues that were gating the DE10-Nano target from functioning correctly, but it may affect other Chisel-based targets.
  Solution: Add support for dynamically adjustable DMA transfer sizes in load and store operations. For a more elegant and modular implementation, the feature can be enabled at compile time with a static constant that can be passed as a configuration option.
  * Modified the load and store finite state machines to dynamically adjust the size of initial and stride DMA transfers. The feature is enabled by default by virtue of the static constant ADAPTIVE_DMA_XFER_ENABLE.
* [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug.
  Issue: Cross-referencing FSIM, TSIM, and Chisel-based FPGA traces is an invaluable instrument that enables fast analysis on FSIM, and analysis/debug on TSIM and FPGA, especially for complex flows like conv2d or full inferences. Currently this cannot be done easily since a suitable reference is missing. The clock cycle event counter cannot be used since it is undefined in FSIM and not reliable between TSIM and FPGA because of different latencies.
  Solution: Introduce a new event counter that preserves program order across FSIM, TSIM, and FPGA. We propose adding the accumulator write event counter to the Chisel EventCounter class and a simple instrumentation in the FSIM runtime code. Note that this technique enabled finding the Chisel issues reported in the PR, which would otherwise have been far more difficult.
  * Added the acc_wr_count event counter and changed interfaces accordingly.
* [VTA][de10nano] Comply with linting rules.
* [VTA] Appease make lint.
* [VTA] Disable pylint import not top level error.
* [VTA][Chisel,de10nano] Linting changes.
  * Use CamelCase class names.
  * Use C++ style C include header files.
  * Add comments to Chisel makefile.
* [VTA][de10nano]
  * Reorder C and C++ includes in de10nano_mgr.h.
  * Restore lint as default target in Chisel Makefile.
* [VTA][de10nano] Do not use f string in pkg_config.py.
* [VTA][de10nano] Remove overlooked f strings in pkg_config.py.
* [VTA][de10nano] Fixed typo.
* [VTA][TSIM] Check if gcc has align-new.
* [VTA][Chisel] Make adaptive DMA transfer default.
* [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*.
  Issue: With more FPGA targets coming online, the initial method of using individual environment variables to specify target IP and port does not scale well.
  Solution: Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed every time a different target is used, for instance in a script used to benchmark all targets.
  * Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT with VTA_RPC_HOST and VTA_RPC_PORT, respectively.
* [VTA][Chisel] Comply with new linter.
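As an illustration of the [VTA][RPC] entry above, a script that targets several boards might read the single VTA_RPC_HOST / VTA_RPC_PORT pair roughly as follows. This is only a sketch: the helper name `rpc_endpoint` and the fallback host and port are hypothetical placeholders, not values taken from the PR.

```python
import os


def rpc_endpoint(env=None, default_host="127.0.0.1", default_port=9091):
    """Return the (host, port) of the board selected through VTA_RPC_*.

    The helper name and both default values are illustrative assumptions.
    """
    # Fall back to the real process environment when no mapping is supplied.
    env = os.environ if env is None else env
    host = env.get("VTA_RPC_HOST", default_host)
    # Environment variables are strings, so the port is converted explicitly.
    port = int(env.get("VTA_RPC_PORT", default_port))
    return host, port
```

Switching targets then only requires exporting a different VTA_RPC_HOST and VTA_RPC_PORT before each run, rather than maintaining one variable pair per board.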
This PR provides fixes to the VTA Chisel implementation, as well as support and enhancements for the tsim and de10nano targets.
With these fixes in, the deploy classification tutorial now runs correctly on the de10nano for Resnet18 and Resnet34 workloads, matching the results obtained when running with cpu, fsim, and tsim targets.
A summary of the PR contributions is reported below; more details can be found in the individual commits.
Bug fixes:
Corrupted DRAM stores and loads when crossing page boundaries.
Mismatched LoadUop state and output FSM logic.
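One way to picture the page-boundary fix is a small software model in which each DMA burst is shortened so that it never runs past the end of the current memory page. The sketch below is a Python model, not the Chisel code from this PR; the function name, the 4096-byte page size, and the 256-byte maximum burst are assumptions chosen for illustration.

```python
def split_dma_transfer(addr, nbytes, max_burst=256, page_size=4096):
    """Split [addr, addr + nbytes) into bursts that stay within one page.

    Returns a list of (start_address, length) tuples. The page size and
    maximum burst length are illustrative defaults, not values from the PR.
    """
    if nbytes < 0 or max_burst <= 0 or page_size <= 0:
        raise ValueError("nbytes must be >= 0; max_burst and page_size must be > 0")
    bursts = []
    while nbytes > 0:
        # Bytes left before the next page boundary, counted from addr.
        to_page_end = page_size - (addr % page_size)
        # Adapt the burst length: never exceed the remaining data, the
        # maximum burst size, or the space left in the current page.
        length = min(nbytes, max_burst, to_page_end)
        bursts.append((addr, length))
        addr += length
        nbytes -= length
    return bursts
```

For example, a 300-byte transfer starting at address 4000 would be issued as a 96-byte burst up to the page boundary at 4096, followed by a 204-byte burst in the next page.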
Enhancements:
Added de10nano host FPGA programming.
Enabled de10nano user defined target frequency, tested at 100MHz.
Improved FSIM/TSIM/FPGA xref debug.
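To show how an accumulator write counter can serve as a common reference, the sketch below compares two traces keyed by that counter and reports the first point where they disagree. It is a hypothetical post-processing helper, not part of the PR: the record format of (acc_wr_count, value) pairs and the function name are assumptions.

```python
def first_divergence(ref_trace, dut_trace):
    """Return the first acc_wr_count at which two traces disagree, or None.

    Each trace is an iterable of (acc_wr_count, value) pairs, e.g. one from
    FSIM and one from TSIM. This record format is an assumption made for
    illustration.
    """
    ref = dict(ref_trace)
    dut = dict(dut_trace)
    # Walk the union of counters in program order; a counter missing from
    # one trace is treated as a mismatch.
    for count in sorted(ref.keys() | dut.keys()):
        if ref.get(count) != dut.get(count):
            return count
    return None
```

Because the counter follows program order rather than clock cycles, the same comparison applies whether the second trace comes from TSIM or from the FPGA.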