
Performance Regression with async-std > 1.6.2 #892

Open

kydos opened this issue Oct 6, 2020 · 7 comments

kydos commented Oct 6, 2020

Hello Everyone,

First of all, thanks very much for the great work on async-std. We are making heavy use of this framework in zenoh and have noticed a major performance drop when upgrading from 1.6.2. When I say major, I mean that in some cases our throughput is halved.

We have identified that the performance issue is introduced on the publishing side. To highlight the huge difference between the CPU time taken by async-std 1.6.5 and that taken by 1.6.2, we have made some flame graphs by collecting perf data while running our throughput performance test.

The exact command used to collect perf data is included below and the code was compiled in release mode:

$ perf record --call-graph dwarf,16384 -e cpu-clock -F 997 ./target/release/examples/zn_pub_thr 8
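
For reference, the recorded perf.data can be turned into an SVG flame graph with the standard FlameGraph pipeline (assuming Brendan Gregg's stackcollapse-perf.pl and flamegraph.pl scripts are on the PATH):

$ perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg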

The resulting flame graphs are available here for 1.6.2 and here for 1.6.5.

The zenoh GitHub repository is https://github.com/eclipse-zenoh/zenoh/tree/rust-master

As you will see from the flame graphs, <core::future::from_generator::GenFuture as core::future::future::Future>::poll takes very little time on 1.6.2 and almost 50% of the time on 1.6.5.

I know that there have been changes in the scheduler, so maybe we need to change something on our side. In any case, any insight will be extremely welcome.

Thanks very much in advance!

Keep the Good Hacking!

Licenser commented Oct 6, 2020

Quick question: what kind of CPU configuration are you running this on? On SMP or Ryzen systems, 1.6.5 suffers from cache invalidation when tasks are moved between different core clusters.
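
If it helps, the relevant topology details (CPU model, sockets, cores, NUMA nodes) can be captured on Linux with something like:

$ lscpu | grep -E 'Model name|Socket|Core|NUMA'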

kydos commented Oct 6, 2020

@Licenser thanks very much for the prompt response. The flame graphs were made on an Intel Skylake machine running the latest Ubuntu Linux.

Let me know if you need other info or want us to run some other tests.

yoshuawuyts (Contributor) commented:

One change that comes to mind is that we're no longer inlining TcpStream futures because of #889; this was required to fix a critical failure caused by a dependency issuing a breaking change in a minor version update. If you're testing TCP at all, this may be relevant.

I'm not sure what the right solution is here, but perhaps switching to async-net inside our network types may help resolve this.
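
For what it's worth, a minimal sketch that isolates the async-std TcpStream write path could look like the following; the address, payload size, and iteration count are placeholders rather than values from zenoh. Building it once against 1.6.2 and once against 1.6.5 should show whether the regression reproduces outside zenoh:

use async_std::net::TcpStream;
use async_std::prelude::*;

#[async_std::main]
async fn main() -> std::io::Result<()> {
    // Placeholder endpoint; any TCP sink that drains the data will do.
    let mut stream = TcpStream::connect("127.0.0.1:7447").await?;
    let payload = [0u8; 8];
    for _ in 0..10_000_000u64 {
        // Every write goes through the TcpStream futures that are no longer inlined.
        stream.write_all(&payload).await?;
    }
    Ok(())
}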

kydos commented Oct 6, 2020

@yoshuawuyts that could be the issue, as we are heavily using TcpStream. BTW, it would also seem that there is some additional overhead in ConcurrentQueue::pop. In any case, the flame graphs reveal that the TcpStream sending-side performance has indeed degraded.

kydos commented Nov 3, 2020

Hello everyone, any updates on this issue? We would be happy to help by systematically testing async-std performance before each release; we can do that by running zenoh on our 10 Gbps testbed.

ghost commented Nov 3, 2020

Looking at the flame graphs, it seems that with v1.6.2 everything runs inside the main thread, while with v1.6.5 half of the program is in the main thread and the other half is on executor threads. Am I reading that right?

It would be worth trying to see what happens if the benchmark is run with ASYNC_STD_THREAD_COUNT=1 and if the body of the main function is wrapped in a spawn() like so:

#[async_std::main]
async fn main() {
    async_std::task::spawn(async {
        // code goes here...
    })
    .await
}
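
For example, with the benchmark binary from the perf command above, a single-threaded run would look like:

$ ASYNC_STD_THREAD_COUNT=1 ./target/release/examples/zn_pub_thr 8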

@kydos
Copy link
Author

kydos commented Nov 3, 2020

Hello @stjepang, that is what we thought at first, but looking more carefully we spotted that the other thread is doing very little work. What it seems to us is that something that represented a marginal overhead in 1.6.2 has grown to show up as a much wider portion of the flame graph.

In any case, we'll try running with a single thread and will let you know how it goes. Thanks for the suggestion!
