Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize atomic operations on UP (single-VCPU) #28

Open
nyh opened this issue Aug 5, 2013 · 0 comments
Open

Optimize atomic operations on UP (single-VCPU) #28

nyh opened this issue Aug 5, 2013 · 0 comments

Comments

@nyh
Copy link
Contributor

nyh commented Aug 5, 2013

The run time of mutex_lock() and mutex_unlock() is dominated by a single instruction, "lock xadd", which is generated by std::atomic::fetch_add().

On a single VCPU, the "lock" prefix isn't needed. Because the host is SMP it cannot ignore this prefix, but when the guest has a single VCPU, we know this prefix is not necessary. If we drop the "lock" prefix and use the ordinary increment instruction, the mutex becomes much faster - an uncontended lock/unlock pair drops from 22ns to just 9ns. When mutexes are heavily used (e.g., in memcached they take as much as 20% of the run time), this can bring a noticable improvement.

What we should do is to remember where in the code we have the "lock" prefix (the single byte 0xf0), and when booting on a single vcpu, replace them by "nop" (0x90). Linux also has such a mechanism (see asm/alternative.h) - "LOCK_PREFIX" generates the "lock" instruction but also saves in a ".smp_locks" section the address of this lock, and any time the number of cpus grows beyond 1 or shrinks to 1, the code iterates over these locates and changes them to 0x90 or 0xf0.

Doing the above is easy if we implemented our own "fetch_add" and "compare_exchange" operation. However, currently we use C++11's std::atomic and it will be a shame to lose its advantages (like working on any processors, not just x86). Perhaps there's a solution, though: uses the GCC builtins __atomic_fetch_add and friends (see http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html). So if we re-implement those, it can be enough. I tried to redefine this function and got some strange compilation errors, but maybe by re-"#define"-ing it before including , or some other ugly trick, we can force our own implementation.

A different approach we can consider (though it will probably be more complex) is to remove the lock prefix from all code in a certain function or section. This will be hard and risky, though - we need to understand where instructions begin and end, and what is code and what is not code. It will be safer if we can limit this transformation to single functions (such as lockfree_mutex_lock()) which are known not to be problematic in this regard.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants