Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[VTA][Chisel,de10nano] Chisel fixes and de10nano support #4986

Merged
merged 19 commits into from
Mar 9, 2020

Conversation

pasqoc
Copy link
Contributor

@pasqoc pasqoc commented Mar 4, 2020

This PR provides fixes to the VTA Chisel implementation, as well as support and enhancements for the tsim and de10nano targets.

With fixes in, the deploy classification tutorial now runs correctly on the de10nano for Resnet18 and Resnet34 workloads, matching the results obtained when running with cpu, fsim, and tsim targets.

A summary of the PR contributions is reported below, more details can be found in the individual commits.

Bug fixes:

  • Corrupted DRAM stores and loads when crossing page boundaries.

  • Mismatched LoadUop state and output FSM logic.

Enhancements:

  • Added de10nano host FPGA programming.

  • Enabled de10nano user defined target frequency, tested at 100MHz.

  • Improved FSIM/TSIM/FPGA xref debug.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 4, 2020

Hi @tmoreau89, @liangfu, please review the PR and let me know of anything.
Finally deploy classification returns a cat on the de10-nano ... ;-)

@tmoreau89
Copy link
Contributor

@vegaluisjose can you also review this PR?

@tmoreau89
Copy link
Contributor

Thank you @pasqoc for this awesome PR, and extensive fixes to the Chisel codebase. @liangfu , @vegaluisjose and I will help review the PR.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 4, 2020

You are welcome!
Sorry I forgot to lint, doing next.

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @pasqoc for the excellent contribution; and supporting de10 end to end demo!

One question, have you tested that the unit tests pass in tsim with the PynqConfig?

@@ -109,7 +133,7 @@ else
lib_path = $(vta_dir)/$(BUILD_NAME)/$(VTA_LIBNAME).so
endif

default: lint lib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the reason for removing lint here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I explained in the commit message, currently lint messes up indentation and requires manual fixes.
It would be better in my opinion to perform lint manually after large code changes only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just saw your newest commits. @vegaluisjose set up the Chisel linter, he can chime in on what the best course of action is moving forward. It would be nice to keep linting the code for future submissions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind if you leave it there.
I personally do make lib only but then I forget, type make and indentation goes berserk, just annoying :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With --test argument in sbt scalafmt, the linter would not change the code base, see PR #4555 .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But unfortunately it does not work, I invite you to try yourself.
sbt scalafmt --test does change the code.
In my case it changes 6 scala files, for instance:

diff --git a/vta/hardware/chisel/src/main/scala/core/LoadUop.scala b/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
index 31e0b56bd..f99ac4948 100644
--- a/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
+++ b/vta/hardware/chisel/src/main/scala/core/LoadUop.scala
@@ -118,12 +118,11 @@ class LoadUop(debug: Boolean = false)(implicit p: Parameters) extends Module {
               state := sReadCmd
               xlen := xrem
               xrem := 0.U
-            }
-            .otherwise {
-              state := sReadCmd
-              xlen := xmax - 1.U
-              xrem := xrem - xmax
-            }
+            }.otherwise {
+                state := sReadCmd
+                xlen := xmax - 1.U
+                xrem := xrem - xmax
+              }
           }
         }
       }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is indeed an odd indentation choice, but I would be in favor of enabling lint even if the output is odd to have uniformity/consistency across the Chisel codebase. @liangfu @vegaluisjose any input on why the indentation of the } looks funky?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't mind if lint stays or not.
Given your preference I am going to put it back.
But there is definitely a bug in either scalafmt or the way sbt calls it.
I tried to change the scalafmt configuration to fix the indentation but without success...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to replace scala linter from scalafmt to scalatyle, the former focus on changing the code format to a predefined style (it removed --test argument lately I think), and the later focus on checking style errors with no intention in changing the code base. I can put an update to switch the linter for scala, if you would like @tmoreau89 @vegaluisjose .

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @liangfu ,

that sounds like a plan. Let's do it in a separate PR. I saw the chisel template is also using that.

vta/hardware/chisel/Makefile Show resolved Hide resolved
@tmoreau89
Copy link
Contributor

@vegaluisjose @liangfu this makes me realize that we may want to run CI testing for different FPGA parameterizations in TSIM, e.g. DE10Nano, Pynq, F1. This might consume quite a bit of compute cycles, so they would just be done on unit tests.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

I have tested unit tests, conv2d, and deploy classification for tsim and de10 for Chisel with De10Config only.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

I have produced the following table while testing the various targets under same workloads.
Where you see nan it typically means segfault (Resnet50 and Resnet101 and Pynq for Resnet34).
DE10-Nano has a total of 1GB of DRAM with 512MB dedicated to CMA.
Pynq uses the HLS build.
FSIM does computation.
Both DE10-Nano and Pynq run at 100MHz.
Can you explain the differences in match 4 and 5 and with cpu?

net target graph[s] offload[s] inference[s] match 1 match 2 match 3 match 4 match 5
0 resnet18_v1 cpu 37.51 7.47 0.16 tabby cat tiger cat Egyptian cat welcome mat carton
1 resnet18_v1 fsim 14.94 1.58 1.01 tabby cat tiger cat Egyptian cat welcome mat plastic bag
2 resnet18_v1 tsim 15.22 1.7 227.73 tabby cat tiger cat Egyptian cat welcome mat plastic bag
3 resnet18_v1 de10nano 15.08 2.34 0.76 tabby cat tiger cat Egyptian cat welcome mat plastic bag
4 resnet18_v1 pynq 15.93 2.45 0.61 tabby cat tiger cat Egyptian cat welcome mat plastic bag
5 resnet18_v2 cpu 45.61 8.7 0.18 tabby cat tiger cat Egyptian cat radiator plastic bag
6 resnet18_v2 fsim 16.6 1.95 1.08 Egyptian cat tabby cat tiger cat window screen shower curtain
7 resnet18_v2 tsim 17.22 2.17 290.37 Egyptian cat tabby cat tiger cat window screen shower curtain
8 resnet18_v2 de10nano 16.38 2.47 0.77 Egyptian cat tabby cat tiger cat window screen shower curtain
9 resnet18_v2 pynq 16.52 2.56 0.63 Egyptian cat tabby cat tiger cat window screen shower curtain
10 resnet34_v1 cpu 38.31 7.56 0.33 tabby cat Egyptian cat tiger cat welcome mat carton
11 resnet34_v1 fsim 17.84 2.09 2.18 tabby cat tiger cat Egyptian cat window screen carton
12 resnet34_v1 tsim 18.54 1.92 506.25 tabby cat tiger cat Egyptian cat window screen carton
13 resnet34_v1 de10nano 18.26 2.97 1.13 tabby cat tiger cat Egyptian cat window screen carton
14 resnet34_v1 pynq 20.27 3.03 0 nan nan nan nan nan
15 resnet34_v2 cpu 42.49 8.42 0.32 tabby cat tiger cat Egyptian cat welcome mat catamount
16 resnet34_v2 fsim 20.22 2.3 2.14 tabby cat Egyptian cat tiger cat bathroom tissue paper towel
17 resnet34_v2 tsim 19.98 2.37 466.59 tabby cat Egyptian cat tiger cat paper towel bathroom tissue
18 resnet34_v2 de10nano 19.67 3.21 1.13 tabby cat Egyptian cat tiger cat paper towel wash-hand basin
19 resnet34_v2 pynq 21.75 3.32 0 nan nan nan nan nan
20 resnet50_v2 cpu 27.34 6.45 0.41 Egyptian cat tabby cat tiger cat welcome mat carton
21 resnet50_v2 fsim 0 0 0 nan nan nan nan nan
22 resnet50_v2 tsim 0 0 0 nan nan nan nan nan
23 resnet50_v2 de10nano 0 0 0 nan nan nan nan nan
24 resnet50_v2 pynq 0 0 0 nan nan nan nan nan
25 resnet101_v2 cpu 27.95 6.52 0.78 tabby cat tiger cat Egyptian cat carton catamount
26 resnet101_v2 fsim 0 0 0 nan nan nan nan nan
27 resnet101_v2 tsim 0 0 0 nan nan nan nan nan
28 resnet101_v2 de10nano 0 0 0 nan nan nan nan nan
29 resnet101_v2 pynq 0 0 0 nan nan nan nan nan

@liangfu
Copy link
Member

liangfu commented Mar 5, 2020

This makes me realize that we may want to run CI testing for different FPGA parameterizations in TSIM, e.g. DE10Nano, Pynq, F1. This might consume quite a bit of compute cycles, so they would just be done on unit tests.

Agree, as long as we preserve at least one TSIM based integration test in the CI.

Copy link
Member

@liangfu liangfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pasqoc Thanks for your contribution.

I did a quick review. Overall, I think this is a great PR that brings a lot of useful features.
Please give me a cycle to test this.

TOP = VTA
TOP_TEST = Test
BUILD_NAME = build
USE_TRACE = 0
USE_TRACE_FST = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might need a comment here, to notify future users that USE_TRACE would default to use VCD as output, and USE_TRACE_FST would not take effect if USE_TRACE is not enabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, although the logic is fairly simple and self-explanatory.
I did not see any comments in the Makefile for any of the configuration variables so I did not want to start adding ones.

}.elsewhen((io.vme_rd.data.fire() || isZeroPad) &&
set === (tp.tensorLength - 1).U &&
tag === (tp.numMemBlock - 1).U)
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the linter might remind you to move the bracket to the previous line.

#ifndef VTA_DE10NANO_DE10NANO_MGR_H_
#define VTA_DE10NANO_DE10NANO_MGR_H_

extern "C" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest taking the following format

#ifdef __cplusplus
extern "C" {
#endif 

// ...

#ifdef __cplusplus
}
#endif 

In addition, I think this is a C++ header file (please correct me if I understood), it's better to include C++ variant of the standard library, like #include <cstdint>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This style is kind of nowadays redundant as the compiler knows already what to do.
You are right, this is a C++ header file and chances are it will not be used in a C only context.
I can make it pure C++ if you like.

// Register definition and address map taken from cv_5v4.pdf,
// Cyclone V Hard Processor System Technical Reference Manual,
// chapter 5: FPGA Manager.
struct de10nano_mgr {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take CamelCase for class names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I don't mind either way.

from tvm import rpc
from vta import get_bitstream_path, download_bitstream, program_fpga, reconfig_runtime

host = os.environ.get("VTA_PYNQ_RPC_HOST", "de10nano")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmoreau89 Shall we rename this environment variable for targets other than PYNQ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was also confused when I first browsed the code.
I would have expected something like:
VTA_RPC_HOST = {pynq|de10nano|ultra96|etc}
VTA_RPC_PORT = 9091

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've been using this variable to be able to target and program different FPGAs. For in that spririt, I would use VTA_DE10_RPC_HOST for this test. Your bashrc could contain multiple hosts including VTA_PYNQ_RPC_HOST, VTA_DE10_RPC_HOST, VTA_ULTRA96_RPC_HOST etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes perfect sense, perhaps coupled with your idea of folding fpga target specific info in environment.py.
Right now introducing VTA_DE10_RPC_HOST anywhere VTA_PYNQ_RPC_HOST, including matrix_multiply.py for instance would add a lot of boiler plate code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see; upon second though it would be cleaner to use VTA_RPC_HOST environment variables given that some of our unit tests assume that we're targeting the pynq boards.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's totally acceptable as long as we update documentation too to reflect that change in requirements. Grep-ing for VTA_PYNQ_RPC_HOST should do the trick

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree!

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

Last commit addresses previous comments.
sbt scalafmt --test is still open since it changes code.
Please let me know what you think.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

General question.
I see the CI jenkins merge-hook build fails for reasons other than the PR itself (not right now, I need to change include order to pass lint after the last commit).
Is it a requirement to have this build pass?

@tmoreau89
Copy link
Contributor

It looks like at the moment the cpp lint test is failing: https://ci.tvm.ai/blue/organizations/jenkins/tvm/detail/PR-4986/5/pipeline

You may have to go in and address the lint errors in vta/src/de10nano/de10nano_mgr.h

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

Right now the build is failing in runtime.cc, not sure why.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 5, 2020

Any chance the docker image maybe using python2 instead of python3.
Not running docker so not able to reproduce, but it seems the f string is not recognized.

@vegaluisjose
Copy link
Member

I was running into the same issue with the f-strings in your code. I think the problem is that we are running python2 in the cmake maybe @tmoreau89 ?

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

Just removed them, but f strings are so much better .....
Any particular reason why python 2 is still used?

@vegaluisjose
Copy link
Member

This is an error on that cmake file actually because everything should be python3 by now. In fact, f-strings are used in other parts of the python codebase. This is a good find.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

Do you want me to remove python from
find_program(PYTHON NAMES python python3 python3.6)
in cmake/modules/VTA.cmake?

@vegaluisjose
Copy link
Member

I think it would be nice to do that in a separate PR.

@vegaluisjose
Copy link
Member

Hey @pasqoc ,

I just ran/build everything in both Linux/Mac but it seems to be failing consistently in resnet18v1. I am getting the following

[traced 408M cycles]

Execution statistics:
        cycle_count     :         31427003

resnet18_v1 prediction for sample 0
        #1: grocery store, grocery, food market, market
        #2: scale, weighing machine
        #3: banana
        #4: punching bag, punch bag, punching ball, punchball
        #5: cleaver, meat cleaver, chopper
Traceback (most recent call last):

  File "deploy_classification.py", line 290, in <module>
    assert(cat_detected)

AssertionError

Perhaps I am missing something? I built everything directly from your de10-nano branch

@tmoreau89
Copy link
Contributor

Agree with @vegaluisjose on the separate PR; if you submit it, we'll work to merge it quickly so you can rebase against master and get those changes in

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

One think I forgot to mention (and I did not change the deploy_classification.py code in the branch) is that currently you must avoid using the timer when doing inference.
There is some interaction I have not debugged yet that makes the runs started by timer subsequent to the first one fail for all chisel based targets.
Right now you must call the model once directly, i.e. disable the timer object and call m.run() instead of timer() on line 255.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

I suggest changing the timer object to avoid performing the "warming run" job and enable running a single job.
This is also what you may want to do if you add resnet18 tsim to CI and don't want the test to run for ages.

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

Resnet18_v1 should only take 31M cycles for one run and around 5 minutes, so you are running 12 jobs in around an hour.
Also if you want to cross reference the results reported on the table I have used a tabby cat image directly downloaded from ImageNet because I wanted to be sure to get the correct expected result.
Using the default cat image may give you a tiger cat instead ...

@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 6, 2020

I can add the changes to timer if you want me to, and then changing num = rep = 1 in deploy_classification.py should make the test pass.

@vegaluisjose
Copy link
Member

Hey @pasqoc ,

You are right, that makes the test work. It works for me without changing the number of repetitions.

Thoughts on this timer() @tmoreau89 and @liangfu ?

Here is the trace

resnet18_v1 inference graph built in 8.82s!
File synset.txt exists, skip.
File cat.png exists, skip.
[traced 1M cycles]
[traced 2M cycles]
[traced 3M cycles]
[traced 4M cycles]
[traced 5M cycles]
[traced 6M cycles]
[traced 7M cycles]
[traced 8M cycles]
[traced 9M cycles]
[traced 10M cycles]
[traced 11M cycles]
[traced 12M cycles]
[traced 13M cycles]
[traced 14M cycles]
[traced 15M cycles]
[traced 16M cycles]
[traced 17M cycles]
[traced 18M cycles]
[traced 19M cycles]
[traced 20M cycles]
[traced 21M cycles]
[traced 22M cycles]
[traced 23M cycles]
[traced 24M cycles]
[traced 25M cycles]
[traced 26M cycles]
[traced 27M cycles]
[traced 28M cycles]
[traced 29M cycles]
[traced 30M cycles]
[traced 31M cycles]

Execution statistics:
        cycle_count     :          2417775

resnet18_v1 prediction for sample 0
        #1: tiger cat
        #2: Egyptian cat
        #3: tabby, tabby cat
        #4: lynx, catamount
        #5: weasel

@@ -62,20 +62,40 @@ class TensorStore(tensorType: String = "none", debug: Boolean = false)(
val tag = Reg(UInt(8.W))
val set = Reg(UInt(8.W))

// Dynamically adjust the size of DMA transfers to avoid crossing page boundaries.
final val ADAPTIVE_DMA_XFER_ENABLE = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious -- is there a way this is false?. If this is fixing the wraparound AXI bug, then we should make it default and update all the code that depend on it. For example:

val xfer_init_bytes = if (ADAPTIVE_DMA_XFER_ENABLE) xmax_bytes - xfer_init_addr % xmax_bytes else xmax_bytes

will be replaced by

val xfer_init_bytes =  xmax_bytes - xfer_init_addr % xmax_bytes

Same in TensorUtil.scala

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is small cost that the fix adds in terms of timing, which could be mitigated with a refactoring of the FSM.
I was not sure whether the wraparound problem is a systematic limitation or not so I decided to add it in a parametric way just in case another platform does no exhibits the issue.
I left the static constant there but one could drive it from the Configs.scala file if needed.
The idea is to turn it off when trying another platform and if successful set the parameter to false in a config file.
Does it make sense?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, AFAIK the problem is systematic across all platforms, so we might want to fix it right away. Also, I believe de10-nano is AXI3, therefore other platforms using AXI4 should also work with this because AXI's "backwards compatibility."

Later, we can optimize for performance if we would like to.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, this is a limitation, not a feature though, unless you are talking about bug backward compatibility.
Indeed, you may very well have this limitation removed in future versions of AXI or maybe be using other interconnects that do not have it, and you may choose to take advantage of better performance.
That said, I don't mind taking it out, it will save two lines of scala code ;-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think we can remove it here and in TensorUtil.scala.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, will do shortly!

pasqoc added 17 commits March 8, 2020 19:50
Issue:
The Cyclone V FPGA on board of the DE10-Nano can only be programmed
using the JTAG port, which is a limiting option for users.

Solution:
Add support for the remote programming of the FPGA implementing
the FPGA programming manager protocol published in the Cyclone V
user manual.

* Added file de10nano_mgr.h implementing an FPGA manager class
  that supports handling of control and status registers as well
  as a push-button option to program the FPGA. The class can be
  easily extended to include more registers if needed.

* Used an instance of the FPGA manager to implement function
  VTAProgram also warning users when incompatible bitstream
  files are used.

* Registered VTAProgram as a global function and modified
  the program_bitstream python class to use it.
Issue:
The de10nano target has incomplete, non-working support
for runtime reconfiguration, bitstream programming, and
examples of usage.

Solution:
Complete runtime support for the de10nano target.

* Modified VTA.cmake to comment out a default override for
  VTA_MAX_XFER to 21 bit wide.

* Modified VTA.cmake to add needed de10nano include dirs.

* Modified relevant files to support de10nano same way as
  other targets for VTA runtime reconfiguration and FPGA
  programming.

* Added test_program_rpc.py example as a runtime FPGA
  programming example. Note that unlike the pynq target
  no bitstream is either downloaded or programmed when
  the bitstream argument is set to None.

* Cosmetic changes to vta config files.
Issue:
The LoadUop FSM incorrectly advances the address of the next
uop to read from DRAM when the DRAM data valid bit is deasserted
and asserted at the end of a read. This is caused by a mismatch
in the logic of the state and output portions of the FSM.
This is one of two issues that was gating the correct operation
of VTA on the DE10-Nano target.

Solution:
Modify the logic of the output section of the FSM to include
a check on the DRAM read valid bit or fold the output assignemnt
into the state section.

* Folded the assignemnt of the next uop address in the state
  section of the FSM.
Issue:
In the DE10-Nano target and possibly in others, DMA transfers that
cross the boundaries of memory pages result in incorrect reads and
writes from and to DRAM. When this happens depending on different
input values, VTA loads and stores exhibit incorrect results for
DMA pulses at the end of a transfer. This is one of two issues that
were gating the DE10-Nano target from functioning correctly, but may
affect other Chisel based targets.

Solution:
Add support for dynamically adjustble DMA transfer sizes in load
and store operations. For a more elegant and modular implementation
the feature can be enabled at compile time with a static constant
that can be passed as a configuration option.

* Modified the load and store finite state machines to dynamically
  adjust the size of initial and stride DMA transfers. The feature
  is enabled by default by virtue of the static constant
  ADAPTIVE_DMA_XFER_ENABLE.
Issue:
Cross reference between FSIM, TSIM, and Chisel based FPGA traces
is an invaluable instrument that enables fast analysis on FSIM,
and analysis/debug on TSIM and FPGA, especially for complex flows
like conv2d or full inferences. Currently this cannot be done
easily since a suitable reference is missing. The clock cycle
event counter cannot be used since it is undefined in FSIM and
not reliable between TSIM and FPGA because of different latencies.

Solution:
Introduce a new event counter that preserves a program order across
FSIM, TSIM, FPGA. We propose adding the accumulator write event
counter in the Chisel EventCounter class and a simple instrumentation
in the FSIM runtime code. Note that this technique enabled finding the
Chisel issues reportes in the PR, which would have been otherwise
far more difficult.

* Added the acc_wr_count event counter and changed interfaces
  accordingly.
* Use CamelCase class names.

* Use C++ style C include header files.

* Add comments to Chisel makefile.
* Reorder C and C++ includes in de10nano_mgr.h.

* Restore lint as default target in Chisel Makefile.
Issue:
With more FPGA targets coming online the initial method of
using individual environment variables to specify target IP and port
does not scale well.

Solution:
Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed
every time a different target is used. For instance in a script
used to benchmark all targets.

* Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT
  with VTA_RPC_HOST and VTA_RPC_PORT, respectively.
@pasqoc
Copy link
Contributor Author

pasqoc commented Mar 9, 2020

Rebased and made small changes after running new linter.
New linter works great, thanks @liangfu !
Thanks @tmoreau89 , happy to contribute to such an awesome project!

Copy link
Member

@liangfu liangfu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor Author

@pasqoc pasqoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

Copy link
Contributor

@tmoreau89 tmoreau89 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @pasqoc for the work, the changes LGTM!

@tmoreau89 tmoreau89 merged commit 5b4cf5d into apache:master Mar 9, 2020
@tmoreau89
Copy link
Contributor

Thanks @pasqoc @liangfu @vegaluisjose for the work and the reviews; the PR has been merged!

@pasqoc pasqoc deleted the de10-nano branch March 9, 2020 17:19
tqchen pushed a commit to tqchen/tvm that referenced this pull request Mar 29, 2020
* [VTA][de10nano] Enable user defined target frequency.

Issue:
The VTA target frequency on the DE10-Nano is hardcoded to 50MHz
unnecessarily limiting performance.

Solution:
Add a PLL to the FPGA sub-system along with support for the
selection of a user specified frequency at build time. The board
successfully builds and runs at 100MHz.

* Added a PLL in the soc_system.tcl platform designer generator
  script.

* Modified the Makefile to automatically set the target frequency
  from that specified in the pkg_config.py file.

* Modified the Makefile to generate a bitstream with an RBF
  format that enables programming of the FPGA directly from
  the on-board processor. Specifically, the RBF is generated in
  FastParallel32 mode with compression, which corresponds to the
  default MSEL switch setting on the board, i.e. 01010.

* Added a false path override to file set_clocks.sdc to turn off
  unconstrained path warnings on the VTA pulse LED.

* [VTA][TSIM] Add more debug and tracing options.

* Modified Makefile to change default config to DafaultDe10Config.

* Added option in Makefile to produce more detailed tracing
  for extra observability in debugging complex scenarios.

* Added option in Makefile to produce traces in FST format which
  are 2 orders of magnitude smaller, although much slower to
  generate.

* Added option in Makefile to build the simulator with GCC address
  sanitizer.

* Modified Makefile to not lint the scala code by default avoiding
  unintended wrong indentation. Linting should be better performed
  manually on a per-need basis.

* [VTA][de10nano] Enable remote programming of FPGA.

Issue:
The Cyclone V FPGA on board of the DE10-Nano can only be programmed
using the JTAG port, which is a limiting option for users.

Solution:
Add support for the remote programming of the FPGA implementing
the FPGA programming manager protocol published in the Cyclone V
user manual.

* Added file de10nano_mgr.h implementing an FPGA manager class
  that supports handling of control and status registers as well
  as a push-button option to program the FPGA. The class can be
  easily extended to include more registers if needed.

* Used an instance of the FPGA manager to implement function
  VTAProgram also warning users when incompatible bitstream
  files are used.

* Registered VTAProgram as a global function and modified
  the program_bitstream python class to use it.

* [VTA][de10nano] Enhance de10nano runtime support.

Issue:
The de10nano target has incomplete, non-working support
for runtime reconfiguration, bitstream programming, and
examples of usage.

Solution:
Complete runtime support for the de10nano target.

* Modified VTA.cmake to comment out a default override for
  VTA_MAX_XFER to 21 bit wide.

* Modified VTA.cmake to add needed de10nano include dirs.

* Modified relevant files to support de10nano same way as
  other targets for VTA runtime reconfiguration and FPGA
  programming.

* Added test_program_rpc.py example as a runtime FPGA
  programming example. Note that unlike the pynq target
  no bitstream is either downloaded or programmed when
  the bitstream argument is set to None.

* Cosmetic changes to vta config files.

* [VTA][Chisel] LoadUop FSM bug fix.

Issue:
The LoadUop FSM incorrectly advances the address of the next
uop to read from DRAM when the DRAM data valid bit is deasserted
and asserted at the end of a read. This is caused by a mismatch
in the logic of the state and output portions of the FSM.
This is one of two issues that was gating the correct operation
of VTA on the DE10-Nano target.

Solution:
Modify the logic of the output section of the FSM to include
a check on the DRAM read valid bit or fold the output assignemnt
into the state section.

* Folded the assignemnt of the next uop address in the state
  section of the FSM.

* [VTA][Chisel] Dynamically adjust DMA tranfer size.

Issue:
In the DE10-Nano target and possibly in others, DMA transfers that
cross the boundaries of memory pages result in incorrect reads and
writes from and to DRAM. When this happens depending on different
input values, VTA loads and stores exhibit incorrect results for
DMA pulses at the end of a transfer. This is one of two issues that
were gating the DE10-Nano target from functioning correctly, but may
affect other Chisel based targets.

Solution:
Add support for dynamically adjustble DMA transfer sizes in load
and store operations. For a more elegant and modular implementation
the feature can be enabled at compile time with a static constant
that can be passed as a configuration option.

* Modified the load and store finite state machines to dynamically
  adjust the size of initial and stride DMA transfers. The feature
  is enabled by default by virtue of the static constant
  ADAPTIVE_DMA_XFER_ENABLE.

* [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug.

Issue:
Cross reference between FSIM, TSIM, and Chisel based FPGA traces
is an invaluable instrument that enables fast analysis on FSIM,
and analysis/debug on TSIM and FPGA, especially for complex flows
like conv2d or full inferences. Currently this cannot be done
easily since a suitable reference is missing. The clock cycle
event counter cannot be used since it is undefined in FSIM and
not reliable between TSIM and FPGA because of different latencies.

Solution:
Introduce a new event counter that preserves a program order across
FSIM, TSIM, FPGA. We propose adding the accumulator write event
counter in the Chisel EventCounter class and a simple instrumentation
in the FSIM runtime code. Note that this technique enabled finding the
Chisel issues reportes in the PR, which would have been otherwise
far more difficult.

* Added the acc_wr_count event counter and changed interfaces
  accordingly.

* [VTA][de10nano] Comply with linting rules.

* [VTA] Appease make lint.

* [VTA] Disable pylint import not top level error.

* [VTA][Chisel,de10nano] Linting changes.

* Use CamelCase class names.

* Use C++ style C include header files.

* Add comments to Chisel makefile.

* [VTA][de10nano]

* Reorder C and C++ includes in de10nano_mgr.h.

* Restore lint as default target in Chisel Makefile.

* [VTA][de10nano] Do not use f string in pkg_config.py.

* [VTA][de10nano] Remove overlooked f strings in pkg_config.py.

* [VTA][de10nano] Fixed typo.

* [VTA][TSIM] Check if gcc has align-new.

* [VTA][Chisel] Make adaptive DMA transfer default.

* [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*.

Issue:
With more FPGA targets coming online the initial method of
using individual environment variables to specify target IP and port
does not scale well.

Solution:
Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed
every time a different target is used. For instance in a script
used to benchmark all targets.

* Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT
  with VTA_RPC_HOST and VTA_RPC_PORT, respectively.

* [VTA][Chisel] Comply with new linter.
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Apr 16, 2020
* [VTA][de10nano] Enable user defined target frequency.

Issue:
The VTA target frequency on the DE10-Nano is hardcoded to 50MHz
unnecessarily limiting performance.

Solution:
Add a PLL to the FPGA sub-system along with support for the
selection of a user specified frequency at build time. The board
successfully builds and runs at 100MHz.

* Added a PLL in the soc_system.tcl platform designer generator
  script.

* Modified the Makefile to automatically set the target frequency
  from that specified in the pkg_config.py file.

* Modified the Makefile to generate a bitstream with an RBF
  format that enables programming of the FPGA directly from
  the on-board processor. Specifically, the RBF is generated in
  FastParallel32 mode with compression, which corresponds to the
  default MSEL switch setting on the board, i.e. 01010.

* Added a false path override to file set_clocks.sdc to turn off
  unconstrained path warnings on the VTA pulse LED.

* [VTA][TSIM] Add more debug and tracing options.

* Modified Makefile to change default config to DafaultDe10Config.

* Added option in Makefile to produce more detailed tracing
  for extra observability in debugging complex scenarios.

* Added option in Makefile to produce traces in FST format which
  are 2 orders of magnitude smaller, although much slower to
  generate.

* Added option in Makefile to build the simulator with GCC address
  sanitizer.

* Modified Makefile to not lint the scala code by default avoiding
  unintended wrong indentation. Linting should be better performed
  manually on a per-need basis.

* [VTA][de10nano] Enable remote programming of FPGA.

Issue:
The Cyclone V FPGA on board of the DE10-Nano can only be programmed
using the JTAG port, which is a limiting option for users.

Solution:
Add support for the remote programming of the FPGA implementing
the FPGA programming manager protocol published in the Cyclone V
user manual.

* Added file de10nano_mgr.h implementing an FPGA manager class
  that supports handling of control and status registers as well
  as a push-button option to program the FPGA. The class can be
  easily extended to include more registers if needed.

* Used an instance of the FPGA manager to implement function
  VTAProgram also warning users when incompatible bitstream
  files are used.

* Registered VTAProgram as a global function and modified
  the program_bitstream python class to use it.

* [VTA][de10nano] Enhance de10nano runtime support.

Issue:
The de10nano target has incomplete, non-working support
for runtime reconfiguration, bitstream programming, and
examples of usage.

Solution:
Complete runtime support for the de10nano target.

* Modified VTA.cmake to comment out a default override for
  VTA_MAX_XFER to 21 bit wide.

* Modified VTA.cmake to add needed de10nano include dirs.

* Modified relevant files to support de10nano same way as
  other targets for VTA runtime reconfiguration and FPGA
  programming.

* Added test_program_rpc.py example as a runtime FPGA
  programming example. Note that unlike the pynq target
  no bitstream is either downloaded or programmed when
  the bitstream argument is set to None.

* Cosmetic changes to vta config files.

* [VTA][Chisel] LoadUop FSM bug fix.

Issue:
The LoadUop FSM incorrectly advances the address of the next
uop to read from DRAM when the DRAM data valid bit is deasserted
and asserted at the end of a read. This is caused by a mismatch
in the logic of the state and output portions of the FSM.
This is one of two issues that was gating the correct operation
of VTA on the DE10-Nano target.

Solution:
Modify the logic of the output section of the FSM to include
a check on the DRAM read valid bit or fold the output assignemnt
into the state section.

* Folded the assignemnt of the next uop address in the state
  section of the FSM.

* [VTA][Chisel] Dynamically adjust DMA tranfer size.

Issue:
In the DE10-Nano target and possibly in others, DMA transfers that
cross the boundaries of memory pages result in incorrect reads and
writes from and to DRAM. When this happens depending on different
input values, VTA loads and stores exhibit incorrect results for
DMA pulses at the end of a transfer. This is one of two issues that
were gating the DE10-Nano target from functioning correctly, but may
affect other Chisel based targets.

Solution:
Add support for dynamically adjustble DMA transfer sizes in load
and store operations. For a more elegant and modular implementation
the feature can be enabled at compile time with a static constant
that can be passed as a configuration option.

* Modified the load and store finite state machines to dynamically
  adjust the size of initial and stride DMA transfers. The feature
  is enabled by default by virtue of the static constant
  ADAPTIVE_DMA_XFER_ENABLE.

* [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug.

Issue:
Cross reference between FSIM, TSIM, and Chisel based FPGA traces
is an invaluable instrument that enables fast analysis on FSIM,
and analysis/debug on TSIM and FPGA, especially for complex flows
like conv2d or full inferences. Currently this cannot be done
easily since a suitable reference is missing. The clock cycle
event counter cannot be used since it is undefined in FSIM and
not reliable between TSIM and FPGA because of different latencies.

Solution:
Introduce a new event counter that preserves a program order across
FSIM, TSIM, FPGA. We propose adding the accumulator write event
counter in the Chisel EventCounter class and a simple instrumentation
in the FSIM runtime code. Note that this technique enabled finding the
Chisel issues reportes in the PR, which would have been otherwise
far more difficult.

* Added the acc_wr_count event counter and changed interfaces
  accordingly.

* [VTA][de10nano] Comply with linting rules.

* [VTA] Appease make lint.

* [VTA] Disable pylint import not top level error.

* [VTA][Chisel,de10nano] Linting changes.

* Use CamelCase class names.

* Use C++ style C include header files.

* Add comments to Chisel makefile.

* [VTA][de10nano]

* Reorder C and C++ includes in de10nano_mgr.h.

* Restore lint as default target in Chisel Makefile.

* [VTA][de10nano] Do not use f string in pkg_config.py.

* [VTA][de10nano] Remove overlooked f strings in pkg_config.py.

* [VTA][de10nano] Fixed typo.

* [VTA][TSIM] Check if gcc has align-new.

* [VTA][Chisel] Make adaptive DMA transfer default.

* [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*.

Issue:
With more FPGA targets coming online the initial method of
using individual environment variables to specify target IP and port
does not scale well.

Solution:
Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed
every time a different target is used. For instance in a script
used to benchmark all targets.

* Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT
  with VTA_RPC_HOST and VTA_RPC_PORT, respectively.

* [VTA][Chisel] Comply with new linter.
zhiics pushed a commit to neo-ai/tvm that referenced this pull request Apr 17, 2020
* [VTA][de10nano] Enable user defined target frequency.

Issue:
The VTA target frequency on the DE10-Nano is hardcoded to 50MHz
unnecessarily limiting performance.

Solution:
Add a PLL to the FPGA sub-system along with support for the
selection of a user specified frequency at build time. The board
successfully builds and runs at 100MHz.

* Added a PLL in the soc_system.tcl platform designer generator
  script.

* Modified the Makefile to automatically set the target frequency
  from that specified in the pkg_config.py file.

* Modified the Makefile to generate a bitstream with an RBF
  format that enables programming of the FPGA directly from
  the on-board processor. Specifically, the RBF is generated in
  FastParallel32 mode with compression, which corresponds to the
  default MSEL switch setting on the board, i.e. 01010.

* Added a false path override to file set_clocks.sdc to turn off
  unconstrained path warnings on the VTA pulse LED.

* [VTA][TSIM] Add more debug and tracing options.

* Modified Makefile to change default config to DafaultDe10Config.

* Added option in Makefile to produce more detailed tracing
  for extra observability in debugging complex scenarios.

* Added option in Makefile to produce traces in FST format which
  are 2 orders of magnitude smaller, although much slower to
  generate.

* Added option in Makefile to build the simulator with GCC address
  sanitizer.

* Modified Makefile to not lint the scala code by default avoiding
  unintended wrong indentation. Linting should be better performed
  manually on a per-need basis.

* [VTA][de10nano] Enable remote programming of FPGA.

Issue:
The Cyclone V FPGA on board of the DE10-Nano can only be programmed
using the JTAG port, which is a limiting option for users.

Solution:
Add support for the remote programming of the FPGA implementing
the FPGA programming manager protocol published in the Cyclone V
user manual.

* Added file de10nano_mgr.h implementing an FPGA manager class
  that supports handling of control and status registers as well
  as a push-button option to program the FPGA. The class can be
  easily extended to include more registers if needed.

* Used an instance of the FPGA manager to implement function
  VTAProgram also warning users when incompatible bitstream
  files are used.

* Registered VTAProgram as a global function and modified
  the program_bitstream python class to use it.

* [VTA][de10nano] Enhance de10nano runtime support.

Issue:
The de10nano target has incomplete, non-working support
for runtime reconfiguration, bitstream programming, and
examples of usage.

Solution:
Complete runtime support for the de10nano target.

* Modified VTA.cmake to comment out a default override for
  VTA_MAX_XFER to 21 bit wide.

* Modified VTA.cmake to add needed de10nano include dirs.

* Modified relevant files to support de10nano same way as
  other targets for VTA runtime reconfiguration and FPGA
  programming.

* Added test_program_rpc.py example as a runtime FPGA
  programming example. Note that unlike the pynq target
  no bitstream is either downloaded or programmed when
  the bitstream argument is set to None.

* Cosmetic changes to vta config files.

* [VTA][Chisel] LoadUop FSM bug fix.

Issue:
The LoadUop FSM incorrectly advances the address of the next
uop to read from DRAM when the DRAM data valid bit is deasserted
and asserted at the end of a read. This is caused by a mismatch
in the logic of the state and output portions of the FSM.
This is one of two issues that was gating the correct operation
of VTA on the DE10-Nano target.

Solution:
Modify the logic of the output section of the FSM to include
a check on the DRAM read valid bit or fold the output assignemnt
into the state section.

* Folded the assignemnt of the next uop address in the state
  section of the FSM.

* [VTA][Chisel] Dynamically adjust DMA tranfer size.

Issue:
In the DE10-Nano target and possibly in others, DMA transfers that
cross the boundaries of memory pages result in incorrect reads and
writes from and to DRAM. When this happens depending on different
input values, VTA loads and stores exhibit incorrect results for
DMA pulses at the end of a transfer. This is one of two issues that
were gating the DE10-Nano target from functioning correctly, but may
affect other Chisel based targets.

Solution:
Add support for dynamically adjustble DMA transfer sizes in load
and store operations. For a more elegant and modular implementation
the feature can be enabled at compile time with a static constant
that can be passed as a configuration option.

* Modified the load and store finite state machines to dynamically
  adjust the size of initial and stride DMA transfers. The feature
  is enabled by default by virtue of the static constant
  ADAPTIVE_DMA_XFER_ENABLE.

* [VTA][Chisel] Improve FSIM/TSIM/FPGA xref debug.

Issue:
Cross reference between FSIM, TSIM, and Chisel based FPGA traces
is an invaluable instrument that enables fast analysis on FSIM,
and analysis/debug on TSIM and FPGA, especially for complex flows
like conv2d or full inferences. Currently this cannot be done
easily since a suitable reference is missing. The clock cycle
event counter cannot be used since it is undefined in FSIM and
not reliable between TSIM and FPGA because of different latencies.

Solution:
Introduce a new event counter that preserves a program order across
FSIM, TSIM, FPGA. We propose adding the accumulator write event
counter in the Chisel EventCounter class and a simple instrumentation
in the FSIM runtime code. Note that this technique enabled finding the
Chisel issues reportes in the PR, which would have been otherwise
far more difficult.

* Added the acc_wr_count event counter and changed interfaces
  accordingly.

* [VTA][de10nano] Comply with linting rules.

* [VTA] Appease make lint.

* [VTA] Disable pylint import not top level error.

* [VTA][Chisel,de10nano] Linting changes.

* Use CamelCase class names.

* Use C++ style C include header files.

* Add comments to Chisel makefile.

* [VTA][de10nano]

* Reorder C and C++ includes in de10nano_mgr.h.

* Restore lint as default target in Chisel Makefile.

* [VTA][de10nano] Do not use f string in pkg_config.py.

* [VTA][de10nano] Remove overlooked f strings in pkg_config.py.

* [VTA][de10nano] Fixed typo.

* [VTA][TSIM] Check if gcc has align-new.

* [VTA][Chisel] Make adaptive DMA transfer default.

* [VTA][RPC] Renamed VTA_PYNQ_RPC_* to VTA_RPC_*.

Issue:
With more FPGA targets coming online the initial method of
using individual environment variables to specify target IP and port
does not scale well.

Solution:
Use a single VTA_RPC_HOST, VTA_RPC_PORT pair to be changed
every time a different target is used. For instance in a script
used to benchmark all targets.

* Replaced every instance of VTA_PYNQ_RPC_HOST and VTA_PYNQ_RPC_PORT
  with VTA_RPC_HOST and VTA_RPC_PORT, respectively.

* [VTA][Chisel] Comply with new linter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants