Optimized "blitter" routine written in assembler [wip] #719

Open · wants to merge 4 commits into master
Conversation

@lukego (Member) commented on Jan 19, 2016:

This is an experiment towards doubling the performance of virtio-net copies (#710) with an optimized blitter routine (#711) written in AVX assembler (compatible with Sandy Bridge onwards).

Caveats and notes:

  • The blitter is used only for the virtio-net "DMA" routines, i.e. guest memory data copies.
  • This version only supports copies whose length is a multiple of 32 bytes (see the sketch after this list).
  • It works with the DPDK-VM performance test suite (packet sizes there are always a multiple of 32).
  • It does not work with the Linux-VM functional test suite (packets of all shapes and sizes).
  • There could easily be undetected bugs that invalidate all of these results; I won't be confident until the full test suite runs.
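
To make the "DMA via lib.blit" idea concrete, here is a minimal Lua sketch of what the caller side can look like. The names dma_copy and blit.copy are my illustrative assumptions; only lib.blit and blit.barrier() are named in this branch.

    -- Hypothetical caller-side sketch: route guest "DMA" copies through
    -- lib.blit instead of calling ffi.copy() directly.
    local bit  = require("bit")
    local blit = require("lib.blit")   -- module introduced by this branch

    local function dma_copy (dst, src, len)
       -- This WIP blitter only handles lengths that are multiples of 32 bytes,
       -- so round the length up (assumes padded buffers make the extra bytes harmless).
       local rounded = bit.band(len + 31, bit.bnot(31))
       blit.copy(dst, src, rounded)    -- copy() name is an assumption
    end

    -- Only after all queued copies are done may the guest see the packets:
    -- blit.barrier()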

To sanity-check the performance I tried three scenarios: the master branch (baseline), a hack that skips the data copies entirely (maximum possible speedup), and the actual assembler blitter code.

Testing with 128-byte packets on lugano-1 I see these results:

  • 5.6 Mpps baseline from the master branch.
  • 8.1 Mpps (+45%) maximum when copies are skipped.
  • 7.0 Mpps (+25%) with the assembler blitter.

I interpret this to mean that the copy performance is more than doubled, i.e. with this optimization we are achieving more than half of the maximum possible speedup. The back-of-envelope below shows where that reading comes from.
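
Back-of-envelope, assuming the only difference between the three scenarios is the cost of the data copies:

  • 5.6 Mpps ≈ 179 ns per packet (baseline).
  • 8.1 Mpps ≈ 123 ns per packet with copies skipped, so the copies currently cost ≈ 55 ns per packet.
  • 7.0 Mpps ≈ 143 ns per packet with the blitter, so the copies now cost ≈ 19 ns per packet.

That is roughly a 2.8x reduction in copy cost, recovering ≈ 36 ns of the ≈ 55 ns of available headroom, i.e. about two thirds of the maximum possible speedup.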

The challenges I see now are:

  • Get the full test suite running without losing this performance boost.
  • Use the PMU to understand what is really going on and how dependable this speedup will be.
  • Try some further optimization ideas, e.g. streaming multiple packets in parallel.

We likely also need to update the DPDK-VM benchmark to test with a more interesting mix of packet sizes, so that we don't accidentally optimize for the "packet size is always a power of 2" special case that we would never see in real life :-).

Commit messages:

This is a simple placeholder implementation for an optimized bit-blitting API.
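
A placeholder along those lines can be as small as the following self-contained sketch (my reconstruction, not the actual file from this branch; the function names copy and barrier are assumptions, with barrier() being the only operation named elsewhere in this PR):

    -- lib/blit.lua (sketch): trivial implementation behind the blit API shape,
    -- so an optimized assembler version can later be dropped in behind the
    -- same calls. copy() delegates to ffi.copy() and barrier() is a no-op.
    local ffi = require("ffi")

    local blit = {}

    function blit.copy (dst, src, len)
       ffi.copy(dst, src, len)
    end

    function blit.barrier ()
       -- nothing to queue or flush in the placeholder version
    end

    return blit
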
Update the vhost-user code to perform Snabb<->VM memory copies via the lib.blit module. This allows experimental optimizations to be made with local changes to the blit module.

This essentially separates "virtio vring processing" and "virtio memory copies" into two separate problems that can be profiled and optimized independently.

This is work in progress: care must be taken not to let the guest see that packets are available until the blit.barrier() operation has been executed, and I think this will require moving the ring index updates.
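
The ordering constraint can be sketched like this: with a batching blitter, copies are queued, barrier() executes them, and only afterwards may the used-ring index be published. The queueing module below is runnable; the transmit-path helpers in the comments are hypothetical names, not the real vhost-user code.

    local ffi = require("ffi")

    local queue = {}                        -- pending copies: {dst, src, len}

    local function copy (dst, src, len)
       queue[#queue+1] = {dst, src, len}    -- defer the copy
    end

    local function barrier ()
       for _, c in ipairs(queue) do
          ffi.copy(c[1], c[2], c[3])        -- stands in for the AVX blitter kernel
       end
       queue = {}
    end

    -- Transmit path (hypothetical helper names):
    --   for each descriptor: copy(guest_buf, packet_data, len)
    --   barrier()                  -- data really is in guest memory now
    --   update_used_ring_index()   -- only now may the guest see the packets
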
Work in progress / not complete:
- Rounds all copies up to a multiple of 32 bytes.
- Fails the NFV benchmark test.
The lib.blit API is now implemented by an assembler routine that
batches copies together.

This is a work in progress due to one major restriction: the copy length has to be a multiple of 32 bytes. The inner copy loop currently looks like this:
| xor rax, rax
|->copy:
| vmovdqu ymm0, [rsi+rax]
| vmovdqu [rdi+rax], ymm0
wingo (Contributor) commented on the copy loop above:
You might want to experiment with unrolling this manually. I got some significant speedups by having more loads in flight.

lukego (Member, Author) replied:
It is quite delicate :-). I started out with an unrolled version of the inner loop and then found that the looping version delivered the same performance. There have been other very innocent-looking code variations that were much slower, though. I want to use the PMU to explore these differences.

I would like to try unrolling the outer loop, though, to see if copying several packets in parallel could help.

wingo (Contributor) replied:
Yes, delicate indeed :) One thing to try: instead of doing load, store, load, store, do load, load, store, store. That was what worked best for me. Good luck :)
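
In the style of the excerpt above, that reordering would look roughly like this when the loop body is unrolled by two (a sketch only; the register usage follows the excerpt and the rest of the loop is assumed):

|// Unrolled by two, loads grouped before stores (sketch; loop tail assumed):
| vmovdqu ymm0, [rsi+rax]
| vmovdqu ymm1, [rsi+rax+32]
| vmovdqu [rdi+rax], ymm0
| vmovdqu [rdi+rax+32], ymm1
| add rax, 64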
