Low DPDK throughput in NFV benchmark #665

Closed
eugeneia opened this issue Nov 18, 2015 · 25 comments

@eugeneia (Member):

program/snabbnfv/packetblaster_bench.sh performs badly due to DPDK dropping packets. See #588.

@eugeneia added the bug label on Nov 18, 2015
@lukego (Member) commented Nov 18, 2015

@eugeneia Could you please update the issue text to say exactly what we know about the differences between the fast and slow test environments and how to run both tests? (In the discussion on the other issue I lose track a bit of exactly what code is being tested and what its results are.)

Standard question whenever a mysterious 30% performance difference appears: Can the NUMA affinity have changed? (Could check e.g. in htop which node's cores are being used in each run of the benchmark.)
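
For reference, a rough sketch of how to check that from the shell (using the PCI address from the benchmark; the pgrep pattern is only an example and may need adjusting):

$ cat /sys/bus/pci/devices/0000:01:00.0/numa_node   # NUMA node the NIC is attached to
$ numactl --hardware                                # which cores belong to which node
$ taskset -cp $(pgrep -f snabbnfv | head -n1)       # cores the snabb process is allowed to run on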

@eugeneia (Member Author):

@lukego I started testing from a clean slate; to be honest, I didn't know anything for certain, which is why I did not want to pollute the issue with previous speculations.

I am quite sure that it's unrelated to NUMA affinity, as nothing changed there and the other benchmarks are not affected.

So here we go, fresh start. There are two different versions of DPDK being tested:

  • “legacy” - the version we used previously in the opaque image, and the one we are now using for the CI
  • v2.1.0-snabb - the current version of DPDK with our patches applied.

I also tested on two different machines:

  • grindelwald - Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
  • davos - Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz

Finally I tested with two different versions of QEMU (2.4.0 and 2.1.0). Now to the results:

  1. On grindelwald, it performs worse (~25%) when using v2.1.0-snabb instead of legacy DPDK (4 Mpps vs 3 Mpps), but QEMU versions do not make a difference
  2. On davos using legacy DPDK, it performs worse (~30%) when using QEMU 2.4 instead of 2.1 (2.4 Mpps vs 3.5 Mpps)

So while there is a performance hit when upgrading to DPDK 2.1, the reason for the decrease in performance on davos seems to be indeed the new QEMU (2.4.0) version.

@lukego On a side note, legacy DPDK does indeed seem to work with 1000 byte packets.

@lukego (Member) commented Nov 19, 2015

Can you tell me how to reproduce please?

We should not apply any patches to DPDK.

9000 byte packets are what I believe won't work in the older DPDK.

@eugeneia (Member Author):

Good news, I have found the culprit: QEMU is the bottleneck, not DPDK as I thought. I have built a Docker image identical to what the CI uses, except that it contains QEMU https://github.com/SnabbCo/qemu/tree/v2.1.0-vhostuser plus the patch which increases the hard-coded Virtio vring size (snabbco/qemu@7a94322).

(I actually reproduced the wrong QEMU version before, which is why I couldn't reproduce on grindelwald.)

This image should be an almost exact replica of the opaque blobs and QEMU we used in bench_env times (except for a slight difference in the QEMU source for which I couldn't find the patch/context):

https://hub.docker.com/r/eugeneia/snabb-nfv-test-legacy/

You can reproduce like so:

$ docker pull eugeneia/snabb-nfv-test-legacy
$ SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-legacy \
  SNABB_PCI_INTEL0=0000:01:00.0 \
  SNABB_PCI_INTEL1=0000:01:00.1 \
  scripts/dock.sh program/snabbnfv/packetblaster_bench.sh

This yielded expected performance (5.3Mpps on grindelwald) and no DPDK packet loss when I ran it.

@kbara (Contributor) commented Nov 20, 2015

Nice work. :-)

@lukego (Member) commented Nov 23, 2015

Great detective work, Max!

The Docker workflow really does make it easy to reproduce tests! I ran this on chur and the results seem consistent when taking into account the slower CPU:

[luke@chur:~/git/snabbswitch/src]$ scripts/dock.sh "(cd ..; make -j)"
...
[luke@chur:~/git/snabbswitch/src]$ SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-legacy   SNABB_PCI_INTEL0=0000:01:00.0   SNABB_PCI_INTEL1=0000:01:00.1   scripts/dock.sh program/snabbnfv/packetblaster_bench.sh
...
Processed 100.0 million packets in 23.90 seconds (6400003840 bytes; 2.14 Gbps)
Made 891,403 breaths: 112.18 packets per breath; 26.81us per breath
Rate(Mpps): 4.184

Next I would really like to get in control of the patches. Specifically, I would like to migrate over to testing with the latest releases of QEMU and DPDK without any patches applied. That is the intended target software environment. If the performance does not match expectations then we would dig in to look for the root cause and try to address it in the snabb code rather than by adding/reviving patches. Is this easy to set up?

Patches can really take on a life of their own :-). I have a little bit of Nix envy here: Nix makes the exact versions/patches being tested completely transparent, even down to kernels and libc both on the host and inside the VMs, whereas Docker seems to make this very opaque. Dockerhub makes it extremely easy to reproduce the test environment but the summary page doesn't provide any insight into what code is actually in the container.
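
(One partial workaround, only a sketch: the image can at least be poked at from the command line to see how it was built and what it contains, assuming e.g. qemu-system-x86_64 is on the image's PATH:)

$ docker history eugeneia/snabb-nfv-test-legacy                                 # layer-by-layer build steps
$ docker run --rm eugeneia/snabb-nfv-test-legacy qemu-system-x86_64 --version   # QEMU shipped in the image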

(There is actually one QEMU patch that I do still recommend applying, "f393aea Add G_IO_HUP handler for socket chardev", but that should not be needed for the DPDK benchmark. It's only to allow the snabb process to restart without restarting QEMU.)

@eugeneia (Member Author):

> Next I would really like to get in control of the patches. Specifically I would like to migrate over to testing with the latest releases of QEMU and DPDK without any patches applied.

I am in the process of building a patch-less “vanilla” image.

@eugeneia (Member Author):

Here is the “vanilla” image containing the latest stable QEMU and DPDK (and kernel 3.19 instead of 3.13, because DPDK requires 3.14+): https://hub.docker.com/r/eugeneia/snabb-nfv-test-vanilla/

@lukego (Member) commented Nov 24, 2015

Awesome, thanks! I am really impressed with the Docker workflow you have cooked up; I think the effort has already paid off in terms of time saved on manually reproducing test environments.

Here is a braindump on the general theme of compatibility and performance with different software versions from our upstream Snabb Switch perspective:

The main thing is to make the latest upstream versions work and prevent them from breaking/regressing in the future. Then over time we build up a long trail of compatibility with consecutive versions. Supporting older and/or patched versions is less important unless specially required for some reason.

We want to take responsibility for making all the components work well together: Snabb, QEMU, DPDK/Linux guests, etc. If there is a problem then we need to find a solution even if that involves creating workarounds in Snabb Switch, making temporary patches to QEMU/DPDK, and working with upstream communities to merge fixes so that we can drop patches. In this sense it is better when a problem is caused by Snabb Switch, where it should be easy to fix, rather than by a change in another project that we need to somehow deal with.

On performance issues like packet drops it can be that we need to take a "holistic" view of the relationships between all of the components rather than looking at one in isolation. For example, if a new QEMU version leads to packet loss then this doesn't necessarily mean that there is a bug in QEMU but rather that the interactions between the components have changed. In particular, QEMU is not involved in packet processing at all (this is done directly in shared memory between Snabb & DPDK) so it cannot be a direct processing bottleneck, but it is involved in negotiating the sizes and features of the shared memory rings and this can indirectly affect performance.

Likewise even between the processing components it is hard to point a finger at one and say that it is the problem. If the DPDK guest is dropping packets then all we know is that it is receiving faster than it is transmitting (and the difference is the dropped packets). This could be due to local DPDK behavior, or subtle Virtio-net behavior that gives the receive ring larger capacity than the transmit ring, or subtle Snabb Switch behavior that bursts packets onto the receive ring faster than it takes them off the transmit ring, and so on.

The most practical method I know of for diagnosing holistic performance problems is to bisect, i.e. to identify two software versions that are as close as possible but with one being "good" and one being "bad". In this example it seems like we have "good" behavior from the fully patched QEMU 2.1 and "bad" behavior from the unpatched QEMU 2.4.1. The next problem is to isolate this more narrowly: e.g. does the problem appear when we drop one of our QEMU patches (like the vring size increase), or did it appear in QEMU 2.2 or 2.3 or 2.4, and so on.

I have a fantasy that our build infrastructure could make it easy to generate a test matrix, e.g. by running commands like:

for snabb in v2015.08 v2015.09 v2015.10 v2015.11; do
  for qemu in 2.1-snabb 2.1 2.2 2.3 2.4; do
    for dpdk in 1.7-snabb 1.8 1.9 2.0 2.1; do
      run-test > result-$snabb-$qemu-$dpdk.txt
    done
  done
done

I am not sure if this is practical with the Docker-based test bootstrapping? If so that would be interesting!
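
A rough sketch of how that might map onto the existing dock.sh workflow, assuming we built and pushed one image per QEMU/DPDK combination (the image tags below are hypothetical, and SNABB_PCI_INTEL0/1 are assumed to be set in the environment):

for qemu in 2.1-snabb 2.1 2.2 2.3 2.4; do
  for dpdk in 1.7-snabb 1.8 1.9 2.0 2.1; do
    # hypothetical per-combination image tag
    SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-qemu$qemu-dpdk$dpdk \
      scripts/dock.sh program/snabbnfv/packetblaster_bench.sh \
      > result-qemu$qemu-dpdk$dpdk.txt
  done
done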

I believe that this is a built-in feature of Hydra where build/test parameters can be specified as e.g. enums and all combinations can be tested (seen in a blog post on Hydra). I am keen to dig into this on the side in case it would be a nice solution for us in the future.

@lukego (Member) commented Nov 24, 2015

@eugeneia Running the "vanilla" test I don't see traffic passing so it seems like we have a compatibility issue between Snabb master + QEMU 2.4.1 + DPDK 2.1. Do you agree? If so I can create a separate ticket for that.

@eugeneia (Member Author):

No, I tested it and it works for me (albeit with decreased performance, since both the DPDK and QEMU “regressions” kick in):

chur$ SNABB_PCI_INTEL0=0000:03:00.0 \
      SNABB_PCI_INTEL1=0000:03:00.1 \
      SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-vanilla \
      scripts/dock.sh program/snabbnfv/packetblaster_bench.sh
[...]
Rate(Mpps): 2.076

Edit: I have only tested on chur.

@eugeneia self-assigned this on Nov 24, 2015
@eugeneia (Member Author):

My guess is that it's these two patches that are impacting the throughput:

And maybe this one (I am not sure what it does): virtualopensystems/dpdk@dae0a7f (“[virtio] Initialize the queues even if VIRTIO_NET_F_CTRL_VQ is not negotiated”)

Will try to verify this guess.

@lukego (Member) commented Nov 24, 2015

Is there an easy way to confirm that based on the Docker workflow? This seems worth confirming before investing serious time in making a fix.

@eugeneia (Member Author):

I did some tests, results in relative numbers:

I could confirm that virtualopensystems/dpdk@dae0a7f is unrelated to performance. We already knew that virtualopensystems/dpdk@7807fbb accounts for 20%, so I am thinking I misapplied snabbco/qemu@7a94322 (the code has changed a bit since then; this is my adaptation: eugeneia/qemu@101ec94). Does that make sense?

@eugeneia (Member Author):

Regarding an easy way: I scratched my head a bit, but I ended up just branching snabbswitch-docker and building new images. The image building process takes a while, but at least if you decide to share you can just docker push it. Also: it's convenient to compare results between runs where the only difference is SNABB_TEST_IMAGE.
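
E.g., a sketch using the two images from this thread (with SNABB_PCI_INTEL0/1 set as in the earlier runs):

$ SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-legacy \
  scripts/dock.sh program/snabbnfv/packetblaster_bench.sh | tee result-legacy.txt
$ SNABB_TEST_IMAGE=eugeneia/snabb-nfv-test-vanilla \
  scripts/dock.sh program/snabbnfv/packetblaster_bench.sh | tee result-vanilla.txt
$ grep 'Rate(Mpps)' result-legacy.txt result-vanilla.txt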

@eugeneia (Member Author):

> We already knew that virtualopensystems/dpdk@7807fbb accounts for 20%

I take that back: Removing the last DPDK patch from the equation does not affect performance. That leaves us with:

I might be approaching this from the wrong direction (e.g. eliminating patches that don't help instead of “bisecting” to the first bad commits) but since we have only ~3 relevant patches and probably thousands of commits from QEMU and DPDK upstream (which I probably don't understand)...

@eugeneia (Member Author):

OK, I am now reasonably certain that DPDK is the component we need to patch / focus on.

I have run another test using a vanilla/legacy hybrid image, with “legacy DPDK” and vanilla QEMU + eugeneia/qemu@101ec94:

There is maybe one detail that could invalidate the result: kernel versions (legacy doesn't compile with 3.19, so it uses 3.13; vanilla doesn't compile with 3.13, so it uses 3.19). I guess I could apply eugeneia/dpdk@75f58c6 to vanilla to be really sure.

Anyway, my takeaway from this is that we need to focus on DPDK and find out where the performance decrease comes from. If I understand correctly, l2fwd is just an example program and has been mostly untouched since 2013, so maybe it needs to be updated to keep up with DPDK development.

@eugeneia (Member Author):

Latest insights on this issue:

The DPDK 2.1 l2fwd application negotiates two additional Virtio-net features, “Indirect Descriptors” and “Mergeable RX buffers”, and this triggers lower performance in this benchmark environment.

@lukego (Member) commented Jan 12, 2016

@eugeneia do you have a quick tip for how I could run the snabbnfv in benchmark mode (-B) and see the output/result from the docker container? (The docker containers seem opaque to me, I never remember how to see what is going on inside. Sorry, I think you have explained before :))

@eugeneia (Member Author):

SNABB_PCI0=[...] DOCKERFLAGS=-t scripts/dock.sh program/snabbnfv/packetblaster_bench.sh
or
SNABB_PCI0=[...] DOCKERFLAGS=-t scripts/dock.sh ./snabb snabbnfv -B ...

DOCKERFLAGS=-t is useful because it enables SIGTERM etc.

See src/doc/testing.md for other variables like SNABB_TEST_IMAGE etc.

@lukego (Member) commented Jan 12, 2016

My goal is to run the full benchmark (packetblaster+qemu+snabb) but to control the snabbnfv traffic arguments and see its output. Does one of those commands let me do that? (How?)

@eugeneia (Member Author):

No, not really. 😞 You could edit this line in between runs. Suboptimal, I know...

@lukego (Member) commented Jan 27, 2016

Have been thinking about indirect descriptors a bit more. I am starting to think that they are an expensive feature that should be avoided.

Direct descriptors only require the device/hypervisor/snabbnfv to make one L3 cache access to access packet payload.

Indirect descriptors require two L3 cache accesses: first to resolve the address of the payload and second to actually access it. These L3 accesses are dependent on each other so the CPU won't be able to parallelize them (second can't start until the address is provided by the first).

This makes me think that indirect descriptors will generally have higher per-packet overhead than direct descriptors. This would be visible in the DPDK l2fwd benchmark (high packet rate) but not with Linux kernel VMs (low packet rate, bottleneck is checksum offload).
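
To put rough numbers on that (these are assumed ballpark figures, not measurements): if an L3 access costs somewhere around 40ns and the two accesses cannot be overlapped, then

  direct:   1 x 40ns ≈ 40ns of dependent descriptor latency per packet
  indirect: 2 x 40ns ≈ 80ns of dependent descriptor latency per packet

and at the ~5 Mpps we see in this benchmark the whole per-packet budget is only 200ns, so the extra dependent access alone could plausibly account for something like 20% of it.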

Could be that this can be resolved with clever assembler code in #719 to access multiple packets in parallel but I am not sure. (Could also be that I am mis-analysing the situation entirely.)

Just flagging to @nnikolaev @dpino @wingo that you may be able to expect better efficiency with direct descriptors rather than indirect ones but I am not sure yet. Ideas/input/data welcome. (My understanding is that indirect descriptors are mostly useful for working around the impractically small vring size that is hard-coded in QEMU but the existing CI benchmarks show that it is possible to achieve good performance even with such small vrings.)

@lukego (Member) commented Jan 27, 2016

See also the excellent test suite walkthrough that @eugeneia wrote. At the bottom you can see the much higher results when testing with an older version of DPDK l2fwd that did not use indirect descriptors. (Hope we can improve the situation for both.)

@eugeneia (Member Author):

Closing because #1001 landed.
