Scaling performance #2074
There appear to be 17 threads running on an 8-core machine. I can see why there would be a main thread plus 8 for the number of CPUs, but there's an additional 8. I might be reading the instrument wrong, though; perhaps it has been sampled twice. They do have different TIDs though.
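For what it's worth, a minimal sketch of the arithmetic (hypothetical; it assumes the extra 8 threads come from a second pool also sized to the CPU count, which is not confirmed):

```rust
use std::thread;

fn main() {
    // Logical CPUs reported by the OS (8 on the machine in question).
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(8);

    // Hypothetical: one pool for I/O and one for worker tasks, each sized to
    // the CPU count, plus the main thread.
    let expected = 1 + cpus + cpus;
    println!("expected thread count: {}", expected); // 17 on an 8-core box
}
```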
I just ran #2072 on the fast computer. After two runs:
Half, then nothing. This doesn't feel like a performance problem if it just stops.
performance stats
Here's a second run. Both runs panic at the end.
Panic Backtrace
Network tests
So the current branch seems to be stopping at exactly 948 not held (this has happened twice).
Data in and out
I logged this.
Here's the log.
Notes
TCP connections
I also logged all the TCP connections every second. It looks like there are many seconds between messages on a port.
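As a point of reference, a minimal sketch of that kind of per-second connection logging (hypothetical; it shells out to `ss -tn`, which may not be the tool that was actually used):

```rust
use std::process::Command;
use std::thread;
use std::time::Duration;

fn main() {
    loop {
        // Dump current TCP connections with numeric addresses (no DNS lookups).
        let output = Command::new("ss")
            .args(["-t", "-n"])
            .output()
            .expect("failed to run ss");

        let listing = String::from_utf8_lossy(&output.stdout);
        // One header line, the rest are connections; a real run would append to a file.
        println!("connections: {}", listing.lines().count().saturating_sub(1));

        thread::sleep(Duration::from_secs(1));
    }
}
```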
Run: 99 nodes, 20 instances

Run: 99 nodes, 30 instances
Direct messages are timing out at about 500 nodes.
That just continues like that for the whole run. It doesn't look like memory or the network is backing up. I've captured some cassettes and connection/memory data. I'm going to try to profile the cassette locally.
I've made this issue to track the work on increasing the performance of nodes as we scale.
Hypothesis
The current thought is that there is a performance issue in sim2h that is preventing the stress test from getting past ~2000 nodes.
What I've found so far
I do not think there is actually a performance issue in the sim2h code.
This is how the threads are being used:
Here is the analysis:
This is the case for current develop and sim2h-futures4.
I think either sim2h is not under a large workload or it's busy waiting for locks.
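To illustrate the lock-contention half of that hypothesis, here is a minimal sketch (not sim2h code): when every worker has to take the same coarse lock, the threads spend most of their time parked, and CPU utilization stays low even though work is queued.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

fn main() {
    // One shared piece of state guarded by a single coarse lock.
    let state = Arc::new(Mutex::new(0u64));

    let workers: Vec<_> = (0..8)
        .map(|_| {
            let state = Arc::clone(&state);
            thread::spawn(move || {
                for _ in 0..100 {
                    // Every worker serializes on the same mutex, so only one
                    // core does useful work at a time; the rest show up in a
                    // profiler as blocked, not as high CPU usage.
                    let mut guard = state.lock().unwrap();
                    *guard += 1;
                    thread::sleep(Duration::from_millis(5)); // "work" done under the lock
                }
            })
        })
        .collect();

    for w in workers {
        w.join().unwrap();
    }
    println!("final count: {}", state.lock().unwrap());
}
```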
Issues with the test metric
The metric we are using (essentially checking that each Agent ID entry is held by at least one other node) appears to give large variance when run multiple times in a row.
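For clarity, roughly what that check amounts to, sketched with hypothetical types (the real test harness is not shown here): an entry counts as held if some node other than its author reports holding it.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical snapshot: entry id -> (authoring node, nodes reporting they hold it).
type Snapshot = HashMap<String, (String, HashSet<String>)>;

/// Count entries that are held by at least one node other than their author.
fn count_held(snapshot: &Snapshot) -> (usize, usize) {
    let mut held = 0;
    let mut not_held = 0;
    for (_entry, (author, holders)) in snapshot {
        if holders.iter().any(|node| node != author) {
            held += 1;
        } else {
            not_held += 1;
        }
    }
    (held, not_held)
}

fn main() {
    let mut snapshot = Snapshot::new();
    snapshot.insert(
        "agent-entry-1".into(),
        ("node-a".into(), HashSet::from(["node-a".into(), "node-b".into()])),
    );
    snapshot.insert(
        "agent-entry-2".into(),
        ("node-c".into(), HashSet::from(["node-c".into()])),
    );
    let (held, not_held) = count_held(&snapshot);
    println!("held: {held}, not held: {not_held}");
}
```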
Issues with perf and flamegraphs
These are great tools for finding hotspots within a program, but they do not tell you much about CPU utilization. They can also average over large peaks of CPU usage. FlameScope can help with the second issue but not the first.
It is possible to get thread stack traces from perf, but I haven't found a nice way to do it. Instead I'm using a different profiler.
What I'm actually seeing
The number of entries held drops by a few hundred on the first run, but then does not go down by much on the second run through.
This is a recent run of 96 nodes with 20 instances, on the slower Ubuntu computer. I have seen similar behaviour all day.
What to try next
I am using the USE method (checking Utilization, Saturation, and Errors for each resource) to try to find the issue, but so far I have not found high utilization or saturation in memory or CPU, and I have not seen any errors.
If we can prove that the workload is making it to the connections, we know the problem is in sim2h (probably a logic or locking issue); otherwise the issue might be that the conductors are the slow point.
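One way to check that, sketched with hypothetical instrumentation (not the actual sim2h code): bump an atomic counter wherever inbound messages are handled and log the per-second rate; if the rate at the connections stays high while progress stalls, the bottleneck is inside sim2h.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

fn main() {
    // Counter bumped by the (hypothetical) connection handler for every inbound message.
    let inbound = Arc::new(AtomicU64::new(0));

    // Reporter thread: print and reset the counter once per second.
    let report = Arc::clone(&inbound);
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(1));
        let n = report.swap(0, Ordering::Relaxed);
        println!("inbound messages in the last second: {}", n);
    });

    // Stand-in for the real message loop: each handled message increments the counter.
    for _ in 0..20_000 {
        inbound.fetch_add(1, Ordering::Relaxed);
        thread::sleep(Duration::from_micros(100));
    }
}
```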