
opal_progress: check timer only once per 8 calls #4697

Merged

Conversation

@yosefe (Contributor) commented Jan 10, 2018

This PR improves osu_bw and osu_mbw_mr performance of UCX on ARM (30-50%) and x86_64 (10-20%) architectures.
@shamisp @hppritcha
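
For readers new to the thread, a minimal sketch of the idea in the PR title (hypothetical names; this is not the actual opal_progress() source): pay for the timer read only on every 8th call, and keep the common per-call path to a counter increment.

```c
#include <stdint.h>
#include <time.h>

/* Stand-ins for the real OPAL timer/event calls; purely illustrative. */
static uint64_t read_timer_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void poll_event_library(void)
{
    /* in Open MPI this would be the libevent poll */
}

static void progress_sketch(void)
{
    static uint32_t call_count = 0;        /* counter; an atomic in the original code */
    static uint64_t next_poll = 0;

    /* Read the clock only once per 8 calls. */
    if ((call_count++ & 0x7) == 0) {
        uint64_t now = read_timer_ticks();
        if (now >= next_poll) {
            poll_event_library();
            next_poll = now + 10000000ull; /* illustrative ~10 ms delta */
        }
    }

    /* ... run registered progress callbacks on every call ... */
}

int main(void)
{
    for (int i = 0; i < 64; i++) {
        progress_sketch();
    }
    return 0;
}
```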

@jsquyres (Member) commented Jan 10, 2018

@bwbarrett @bosilca @hjelmn This is an interesting PR -- it basically switches atomics to a static volatile. Thoughts?

Based on the comment in the code, let me ask a crazy question: is there value in removing volatile? I.e., a) decrease the performance penalty even more, because b) we don't really care if the number is not wholly accurate.

@artpol84 (Contributor):

bot:mellanox:retest

@hjelmn (Member) commented Jan 10, 2018

My only issue with this is the situation where there is a low-priority event and the calling code enters MPI only very infrequently. I have thought about making this a simple counter, but I keep running into how to force it to trigger on every call in that case. The timer handles this case.

I know this is not a common case but it is a situation worth thinking about.

@hjelmn (Member) commented Jan 10, 2018

@jsquyres The variable doesn't even need to be volatile for this to work. In either case the low-priority calls will still trigger, but their timing will be non-deterministic. I don't think this is an issue, though. The atomic just ensures that it triggers exactly once every 8 calls.
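
For illustration (using C11 atomics here rather than the OPAL atomics layer, so this is not the patch itself): a fetch-add makes exactly one caller in every 8 take the slow path, while a plain increment still fires periodically but, with multiple threads, increments can be lost and the cadence drifts.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_uint atomic_calls;  /* fetch-add: exactly one caller in 8 triggers  */
static unsigned    plain_calls;   /* plain ++: still fires, but cadence may drift */

static bool trigger_atomic(void) { return (atomic_fetch_add(&atomic_calls, 1u) & 7u) == 0; }
static bool trigger_plain(void)  { return (plain_calls++ & 7u) == 0; }

int main(void)
{
    int a = 0, p = 0;
    for (int i = 0; i < 80; i++) {
        a += trigger_atomic();
        p += trigger_plain();
    }
    /* single-threaded both report 10; under contention only the atomic is exact */
    printf("atomic triggers: %d, plain triggers: %d\n", a, p);
    return 0;
}
```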

@yosefe (Contributor, Author) commented Jan 10, 2018

@hjelmn regarding the infrequent calls to MPI - since we are doing the event loop every event_progress_delta anyway, it didn't seem like a problem. If something really requires deterministic progress, it should register itself with opal_progress. Do you know which events we have in that libevent loop?
In MXM and UCX we used a progress thread which wakes up once in a while and signals the main thread that timers should be dispatched, by incrementing a shared counter (which can be sampled quickly).

@jsquyres removing volatile seemed a bit too dangerous since that variable really can be accessed by multiple threads.
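
A minimal sketch of the progress-thread pattern yosefe describes above (a helper thread periodically bumps a shared counter so the hot path only compares two integers instead of reading the clock); the names and the 10 ms interval are illustrative, not MXM/UCX code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static atomic_uint timer_epoch;           /* bumped by the helper thread        */
static atomic_bool keep_running = true;

static void *ticker(void *arg)
{
    (void)arg;
    while (atomic_load(&keep_running)) {
        usleep(10000);                     /* ~10 ms between wakeups             */
        atomic_fetch_add(&timer_epoch, 1); /* signal: timers may need dispatch   */
    }
    return NULL;
}

static void progress_hot_path(void)
{
    static unsigned seen_epoch = 0;
    unsigned cur = atomic_load_explicit(&timer_epoch, memory_order_relaxed);
    if (cur != seen_epoch) {
        seen_epoch = cur;
        /* dispatch timers / poll the event library here */
    }
    /* ... fast-path progress on every call ... */
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, ticker, NULL);
    for (int i = 0; i < 1000000; i++) {
        progress_hot_path();
    }
    atomic_store(&keep_running, false);
    pthread_join(t, NULL);
    return 0;
}
```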

@hjelmn (Member) commented Jan 10, 2018

@yosefe Removing volatile will likely not change much besides timing. It likely will not even change the performance. I don't think it really matters either way.

Looking at what I wrote, it isn't completely correct. For low-priority events it is assumed they can be delayed almost indefinitely, and that if they need to be processed the user should be calling into MPI. The same isn't necessarily true of libevent events. I would have to take a look and see if there is a problem. As is, I think I am OK with this change, though. I can see how there might be an improvement from removing the atomic, but not necessarily from removing the timer call. The overhead of the rdtsc instruction is ~40 cycles. What is the overhead of the aarch64 mrs instruction?
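
For reference, the architectural counters under discussion can be read directly, roughly like this (a sketch, not OPAL's opal/include/opal/sys timer code): rdtsc on x86_64, and the generic-timer count register cntvct_el0 on aarch64, which is read with an mrs instruction (its frequency comes from cntfrq_el0).

```c
#include <stdint.h>
#include <stdio.h>

static inline uint64_t cycle_counter(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t cnt;
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(cnt));
    return cnt;
#else
    return 0; /* no architectural counter handled in this sketch */
#endif
}

int main(void)
{
    uint64_t a = cycle_counter();
    uint64_t b = cycle_counter();
    printf("back-to-back counter reads: %llu ticks apart\n",
           (unsigned long long)(b - a));
    return 0;
}
```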

@hjelmn (Member) commented Jan 10, 2018

I plan to test this change with some RMA-MT benchmarks and see how it affects other code paths.

@yosefe (Contributor, Author) commented Jan 10, 2018

@hjelmn
on x86_64 - ~40 cycles is quite a lot, and there is additional overhead from the clock_gettime() library functions. It also seems that rdtsc slows down the CPU execution pipeline. We observe a noticeable message-rate improvement when removing this call.
on aarch64 - on the machines we tested, clock_gettime(CLOCK_MONOTONIC) was translated to a system call. This is probably a problem in its own right, but nevertheless removing the timer read improves both aarch64 and x86_64.
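
One way to check whether clock_gettime(CLOCK_MONOTONIC) is served by the vDSO or ends up as a real system call on a given machine is to time it in a tight loop (a rough sketch; running it under strace -c will also show the syscall count).

```c
#include <stdio.h>
#include <time.h>

int main(void)
{
    enum { ITERS = 1000000 };
    struct timespec t0, t1, tmp;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++) {
        clock_gettime(CLOCK_MONOTONIC, &tmp);   /* the call being measured */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("avg clock_gettime(CLOCK_MONOTONIC): %.1f ns/call\n", ns / ITERS);
    return 0;
}
```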

@bwbarrett (Member) left a comment

Nack; this patch cannot go in. It will adversely impact TCP by not entering the event library every time we call opal_progress().

@artpol84 (Contributor) commented Jan 11, 2018

What if there were two options for doing the progress, and the PML or BTL could select between them?
One would be the current behavior and the other the proposed one.
If some component requests the more aggressive libevent poll, that would override what the others have requested.
In the UCX case only the PML would be involved, and it would request the "light" progress; in the OB1 case OB1 would ask the underlying BTLs about their requirements, and BTL/TCP, for example, would require the "heavy" version of progress, which OB1 would propagate up.
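
A rough sketch of the selection mechanism proposed here (every name below is hypothetical; these symbols do not exist in Open MPI): a component that needs the event library polled on every call swaps in the heavy variant, overriding the light default.

```c
#include <stdio.h>

/* Hypothetical names; illustrative only. */
typedef void (*progress_fn_t)(void);

static void progress_light(void) { puts("light: skip the event-library poll most calls"); }
static void progress_heavy(void) { puts("heavy: poll the event library on every call");   }

static progress_fn_t current_progress = progress_light;

/* A component such as BTL/TCP would request the heavy variant; such a
   request overrides the lighter default chosen by e.g. PML/UCX. */
static void request_heavy_progress(void)
{
    current_progress = progress_heavy;
}

int main(void)
{
    current_progress();          /* light by default          */
    request_heavy_progress();
    current_progress();          /* heavy once TCP is in play */
    return 0;
}
```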

@bwbarrett (Member):

That's essentially what the old code did; happy to see a PR for an improved version, but not polling the event library every time when there are consumers (ie, it is polled every iteration) is not an acceptable change.

@artpol84 (Contributor):

The old code is slower than the new one by 5-10%, so what I suggest is to have two opal_progress functions and use a pointer to select the proper one. Using a pointer may introduce some overhead; I can't estimate how much from my experience.

@bwbarrett (Member):

I'd have some concerns about the pointer swap (ie, want to see the code) for how we turn on / off fast polling of the event library for the TCP BTL. But I think a pointer could work. And I wouldn't be concerned about the pointer overhead for the TCP BTL, so if it works for UCX, it's probably a good plan.

@rhc54 (Contributor) commented Jan 11, 2018

Someone also needs to check the other BTLs, plus the Portals and OFI MTLs - I don't know what else might be impacted, but a thorough check of non-UCX paths seems in order before proceeding.

@bwbarrett (Member):

@rhc54, sure, but if it makes UCX faster than before, it almost certainly will for others. I actually have no objections to the proposal from a general BTL impact (it should only help). The two places it will hurt are the TCP BTL (because you could go 7 extra entrances to the MPI library before calling into the event library to actually send/receive the packet) and out-of-band at time periods other than INIT and FINALIZE.

@rhc54 (Contributor) commented Jan 11, 2018

I agree that it should be okay, but we've been burned before by making that assumption - best to ensure the right people know about it and check.

@matcabral @tkordenbrock Just want to ensure you take a look and are okay with this.

@hjelmn (Member) commented Jan 11, 2018

We could condition this path on the event users counter being 0. The TCP BTL increments that counter when it is needed. If the counter is non-zero we can always check the time delta.
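
A sketch of that gating, which is roughly the shape the patch ended up taking (names are illustrative stand-ins, not the real OPAL symbols): with active event users the clock is consulted on every call, otherwise only once per 8 calls.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-ins; not the real OPAL symbols. */
static int num_event_users;              /* components like BTL/TCP bump this */

static uint64_t ticks_now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static bool time_for_event_poll(void)
{
    static uint32_t calls = 0;
    static uint64_t next_poll = 0;

    /* With active event users, consult the clock on every call;
       otherwise only once per 8 calls. */
    if (num_event_users == 0 && (calls++ & 0x7) != 0) {
        return false;
    }
    uint64_t now = ticks_now();
    if (now < next_poll) {
        return false;
    }
    next_poll = now + 10000000ull;       /* illustrative 10 ms delta */
    return true;
}

int main(void)
{
    num_event_users = 1;                 /* e.g. TCP BTL active */
    return time_for_event_poll() ? 0 : 1;
}
```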

@matcabral (Contributor):

From code inspection, I think this could indeed improve the PSM2 MTL (most likely OFI also). Strictly from the timer standpoint, in the past I saw lots of time spent in rdtsc, so lowering that would be beneficial. However, I can't say what impact reducing the event-library calls may have; I can run a few tests.

@shamisp (Contributor) commented Jan 11, 2018

A few comments.
The @artpol84 proposal sounds very reasonable. Whenever possible we should avoid poking the clock.

On a separate issue:
ompi should use the architectural timer (on architectures where it is available) and not clock_gettime(), especially in a performance-critical path like opal_progress. We had exactly the same issue before and it was fixed some time ago by @bosilca. I think it was broken again, and apparently for a reason: #3003

On aarch64 I would recommend going back to the arch timer: https://github.com/open-mpi/ompi/blob/master/opal/include/opal/sys/arm64/timer.h#L24

@tkordenbrock (Member):

I don't think this will have a negative impact on any of the Portals4 components. I'll run some benchmarks today.

@yosefe (Contributor, Author) commented Jan 16, 2018

@bwbarrett @hjelmn added a check for num_event_users and removed volatile;
this makes TCP check events on every call.

Attached are performance results of osu_bw/osu_latency on a single node, Xeon(R) CPU E5-2680 v4 @ 2.40GHz:
intranode_removed_timer.xlsx

@jsquyres (Member):

Per the 2018-01-16 webex, we all agree that the idea of this PR is good. We just can't kill TCP performance in doing so. Mellanox will work on updating this PR, but can't commit to a timeframe.

@yosefe (Contributor, Author) commented Jan 16, 2018

@jsquyres IMHO the latest update should avoid killing TCP performance.

@jladd-mlnx (Member):

@bwbarrett Yossi added a check for num_event_users which preserves the current behavior for TCP. We believe this resolves your concerns. Please confirm.

@bwbarrett (Member) left a comment

Merge the two commits into one and add a better description of the change (in particular, the why of the change) and I'm happy.

@bwbarrett (Member):

@jladd-mlnx & @yosefe, my bad, I missed what Yossi was saying with his update yesterday. I'm happy with the code changes; just want some commit comment and number of commit changes and I'll approve.

Reading the system clock on every call to opal_progress() is an
expensive operation on most architectures, and it can negatively affect
performance, for example in message-rate benchmarks.

We change opal_progress() to read the clock once per 8 calls, unless
there are active users of the event mechanism.

Signed-off-by: Yossi Itigin <[email protected]>
@yosefe force-pushed the topic/opal-progress-avoid-checking-timer branch from e713a28 to 7cee603 on January 16, 2018 17:19
@yosefe (Contributor, Author) commented Jan 16, 2018

@bwbarrett done

@shamisp (Contributor) commented Jan 16, 2018

Looks good!
@bwbarrett @jsquyres - what about addressing the more generic concern related to the usage of slow timers? Eventually it will bite us in other places.

@hjelmn (Member) commented Jan 16, 2018

@shamisp We should be using the fastest available timers internally. The timers used by opal_progress() are the ones in opal/include/opal/sys/*/timer.h. The only thing that should be using clock_gettime() is MPI_Wtime().

@hjelmn (Member) commented Jan 16, 2018

Hmm, except when using builtin atomics! That is a problem... I think.

@shamisp (Contributor) commented Jan 16, 2018

From what I see, if clock_gettime is available it will be used everywhere, including progress.
Even for MPI_Wtime() I would suggest using the architectural timer. Last time the issue was the accuracy of some timer implementations. If that is the case, those timers should fall back to clock_gettime; otherwise use the architectural timers, not the other way around.

Should we open a separate issue for this?

@hjelmn (Member) commented Jan 16, 2018

@shamisp Nope. See: https://github.com/open-mpi/ompi/blob/master/opal/runtime/opal_progress.c#L193

We use the native timer if it is available. If there are cases where we have the native timer and we are not using it, that needs to be fixed.

As for MPI_Wtime(): there was a long discussion on this some time ago. The problem is that no one has implemented the proper function to get the rdtsc timer frequency in Open MPI, so we decided to fall back on clock_gettime(). If someone wants to open a pull request to get the correct frequency we can revisit the issue.

@bosilca (Member) commented Jan 16, 2018

@shamisp MPI_Wtime requires a monotonic timer. We fall back on clock_gettime only if the architecture timer is not monotonic and the user has not turned off the monotonic requirement.

@hjelmn (Member) commented Jan 16, 2018

@bosilca That is what we used to do until we ran into the frequency issue. We now only use clock_gettime(). See: https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c . I would love to see a fix, but we need to implement the code to get the Intel rdtsc frequency.
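
For context, this is roughly the shape of an MPI_Wtime() built on clock_gettime(CLOCK_MONOTONIC), which is what the thread says ompi/mpi/c/wtime.c does today (a sketch, not the actual source):

```c
#include <stdio.h>
#include <time.h>

/* Sketch of a wall-clock timer on top of clock_gettime(CLOCK_MONOTONIC). */
static double wtime_sketch(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1.0e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    printf("t = %.9f s\n", wtime_sketch());
    return 0;
}
```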

@shamisp (Contributor) commented Jan 16, 2018

@hjelmn (Member) commented Jan 16, 2018

@shamisp On x86_64 (and aarch64) OPAL_PROGRESS_ONLY_USEC_NATIVE is false, so we call L196, not L194 (of opal_progress.c). This path does not call clock_gettime().

@shamisp (Contributor) commented Jan 16, 2018

@hjelmn both are mapped to clock_gettime whenever it is available.

@hjelmn (Member) commented Jan 16, 2018

@shamisp That is a bug if it is using clock_gettime(). It is supposed to be using the native timer :-/.

@hjelmn (Member) commented Jan 16, 2018

@shamisp (Contributor) commented Jan 16, 2018

Hmm... it seems like it is supposed to take the right path, but that is not what has been reported in the performance measurements. I probably have to take additional steps to confirm it.

As for https://github.com/open-mpi/ompi/blob/master/ompi/mpi/c/wtime.c
Why would we not switch back to the architectural timer for non-x86 platforms?

@hjelmn (Member) commented Jan 16, 2018

@shamisp We took the easy path for now. I don't think any of my users care about the performance of MPI_Wtime(). Outside of benchmarks, I don't think anyone cares.

This will probably change if we add a new timer to MPI. There is a proposal from the tools working group to add a call to get cycles.

@jsquyres (Member):

@shamisp That being said, pull requests are always appreciated... 😄

@shamisp (Contributor) commented Jan 16, 2018

@hjelmn within certain builds (like static), the vDSO will fall back to a system call, so the impact on performance and measurements might be quite substantial.

Why can we not go back to what it was before, and just have it disabled if a monotonic timer is not there? Essentially, use the same path that opal is supposed to take (in theory?).

@bosilca (Member) commented Jan 16, 2018

variable clock rate?

@shamisp (Contributor) commented Jan 16, 2018

@bosilca ifdef (x86_64) ?

@hjelmn (Member) commented Jan 16, 2018

@shamisp Would love to see the PR :). Someone with cycles for this needs to do the work. I don't, as my users don't care. Too many other things to do.

@shamisp (Contributor) commented Jan 16, 2018

@hjelmn are you okay with a patch where only x86_64 takes the clock_gettime() path and the rest falls back to the original flow?

@jsquyres (Member):

"Someone with cycles..." Hah! Very punny. 👏

@shamisp (Contributor) commented Jan 16, 2018 via email

@rhc54 (Contributor) commented Jan 17, 2018

I'd rather we not do that until someone quantifies the impact the change would have on x86 systems. We spent a ton of time debating this before, and had all organizations that wanted to participate spend time benchmarking the options before deciding on the current course. Frankly, I'm a tad annoyed to suddenly find us resetting back to square one.

@matcabral Would you have time to re-run the benchmarks with the proposed change? I personally consider this low-priority given all that has already transpired, but would appreciate it if you took a look as time permits.

@rhc54 (Contributor) commented Jan 17, 2018

My bad - I'm told that this proposed change will only take everyone but x86 down a currently unused code path. I got lost in the exchange and thought something else was being proposed.

Objection removed - feel free.

@yosefe merged commit 79ca1c4 into open-mpi:master on Jan 17, 2018
@yosefe deleted the topic/opal-progress-avoid-checking-timer branch on January 17, 2018 08:48
@matcabral (Contributor):

I will still test the new patch ;)

@hjelmn (Member) commented Jan 17, 2018

@shamisp Go ahead and open the PR. I will review.

@shamisp (Contributor) commented Jan 18, 2018

@rhc54 - correct, the intent is to leave the x86 flow as it is and redirect all the rest to the arch timer, if it is available.
@hjelmn - I will submit a patch.

Thanks!
