
Core model was changed::even more high performance #17

Open
wants to merge 28 commits into base: next

Conversation

cmpxchg16

  1. Changing the core model so there are no locks at all in the Scheduler layer
  2. Adding optimizations
  3. Adding some examples
  4. Adding bug fixes

cmpxchg16 and others added 27 commits April 28, 2013 09:42
2. add destructor to SSLStream - call close
3. bug fix in SSL handling: call flush in case of a successful SSL_write that is smaller than the threshold
2. rename simplefileserver to simplehttpfileserver
3. add SSL to simplehttpfileserver
4. empty README
2. change echoserver example to be echoserver
@kevincai
Contributor

The change request improves performance at the cost of losing multi-thread safety. Instead of changing the core scheduler model, it would be better to provide a compile-time switch that can turn off multi-thread safety and gain the performance benefits.

@cmpxchg16
Author

It doesn't lose multi-thread safety even though the core model was changed to work without locks, because:

  1. Each Scheduler/IOManager runs in its own native thread and does not spawn native threads itself, so all task execution can happen without any locks: each Scheduler runs ONLY its own tasks and doesn't submit tasks to other native threads.
  2. For context switching between network IO and disk IO, I am not using the WorkerPool, for the same reason of avoiding wasted locks; I just switch between them inside the same Scheduler, so again the Scheduler handles ONLY its own tasks.
  3. I changed accept so that any Scheduler/native thread can subscribe to accept (operating systems handle accept from multiple threads in a thread-safe way internally), and the read/write/timeout handling is submitted to a specific native thread/Scheduler.
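
A minimal, generic sketch of the accept model described in point 3 (an illustration, not Mordor's actual code): one shared listening socket, several native threads each running their own epoll loop and calling accept() independently; the kernel hands each new connection to exactly one caller, and that thread then handles the connection's IO entirely on its own, with no locks.

```cpp
// Minimal sketch (not Mordor's code): each native thread runs its own epoll
// loop, accepts on a shared listening socket, and serves the connections it
// accepted without touching any other thread's state.
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <thread>
#include <vector>

static int make_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(fd, SOMAXCONN);
    return fd;
}

// Per-thread "scheduler": owns its own epoll instance and its own connections.
static void scheduler_thread(int listen_fd) {
    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);   // every thread watches the listener

    epoll_event events[64];
    char buf[4096];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                // accept() on a shared socket is safe from multiple threads;
                // the kernel hands each connection to exactly one caller.
                int conn = accept(listen_fd, nullptr, nullptr);
                if (conn < 0) continue;             // another thread won this one
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = conn;
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);  // stays on this thread
            } else {
                // Read/write/timeout handling runs only on the thread that
                // accepted the connection, so no locks are needed.
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) { close(fd); continue; }
                write(fd, buf, static_cast<size_t>(r));    // trivial echo
            }
        }
    }
}

int main() {
    int listen_fd = make_listener(8080);            // port chosen arbitrarily
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        threads.emplace_back(scheduler_thread, listen_fd);
    for (auto& t : threads) t.join();
}
```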

@ianupright

These changes seem like the right way to go to me. I'm guessing even more atomic operations and multi-threaded concurrency overhead could be removed, improving performance even more. I guess what we are sacrificing in such a model is the ability for many fibers to be distributed easily and evenly across a worker pool of threads?

@cmpxchg16
Author

We get an approximately uniform distribution because the native thread that catches the accept is the one whose fibers handle that connection's network/disk tasks.
Work stealing between native threads can be ignored, because if one native thread is idle it means your system is not under load, so that case is not interesting.

@ianupright

For the typical web-service application, I agree, this would be the case. However, if you have something a little more complex, a web service that spawns off hundreds of fibers, each needing a certain amount of IO, with a mixture of IO and CPU processing, then it may get a little more complicated. Knowing how or when to migrate which fibers to which threads is no longer obvious. However, with some higher-level scheduling, this could be handled. Ideally, you should only pay for multithreaded concurrency at the points where you need it, instead of paying for it everywhere.

@ianupright

If fibers can automatically and magically migrate between threads, however, that can create other problems. Then you are forced to deal with multithreaded concurrency issues in cases where you may not want the extra design complexity, nor the (sometimes substantial) additional concurrency overhead. So I think controlling the movement of fibers between threads at the application level is what makes sense to me.

@kevincai
Contributor

kevincai commented Aug 4, 2013

A few comments:

  1. The Scheduler is changed to a single-thread model so that the mutex can be removed safely. This is an extreme case of the original design, in which the Scheduler runs with only one native thread. I would rather have a compile-time flag that switches the Scheduler to this model when the application is only interested in the single-thread model and performance is critical.
  2. This changes the original design purpose of the Scheduler, which treats native threads as an execution pool. A single-thread scheduler pushes the multi-thread scheduling problem up to the application level: each application needs a scheduler of schedulers in order to leverage multiple threads and balance tasks. In other words, it is not very application-friendly when used in a multi-thread environment.
  3. StackPool should not be implemented in the Scheduler. It should be implemented independently and should be easily replaceable by a user-provided pool, just like the allocator in STL.
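
A rough sketch of what point 3 could look like (the names StackAllocator, MallocStackAllocator, and PooledStackAllocator are hypothetical, not Mordor's API): the stack source becomes a small interface, with a pooled implementation as just one interchangeable plug-in, in the same spirit as an STL allocator.

```cpp
// Hypothetical sketch of a pluggable stack allocator. The Scheduler would only
// see the interface; the pooled version is one interchangeable implementation.
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

struct StackAllocator {
    virtual ~StackAllocator() = default;
    virtual void* allocate(std::size_t size) = 0;
    virtual void deallocate(void* stack, std::size_t size) = 0;
};

// Simplest possible implementation: straight malloc/free.
struct MallocStackAllocator : StackAllocator {
    void* allocate(std::size_t size) override { return std::malloc(size); }
    void deallocate(void* stack, std::size_t) override { std::free(stack); }
};

// Pooled implementation: keeps freed stacks of one fixed size for reuse.
// The mutex makes it shareable; a per-Scheduler instance could drop it entirely.
class PooledStackAllocator : public StackAllocator {
public:
    explicit PooledStackAllocator(std::size_t stackSize) : stackSize_(stackSize) {}
    ~PooledStackAllocator() override {
        for (void* s : pool_) std::free(s);
    }
    void* allocate(std::size_t size) override {
        if (size == stackSize_) {
            std::lock_guard<std::mutex> lock(mutex_);
            if (!pool_.empty()) {
                void* s = pool_.back();
                pool_.pop_back();
                return s;
            }
        }
        return std::malloc(size);
    }
    void deallocate(void* stack, std::size_t size) override {
        if (size == stackSize_) {
            std::lock_guard<std::mutex> lock(mutex_);
            pool_.push_back(stack);
            return;
        }
        std::free(stack);
    }
private:
    std::size_t stackSize_;
    std::mutex mutex_;
    std::vector<void*> pool_;
};

// A Scheduler could then take the allocator as a constructor argument and call
// allocate()/deallocate() when fibers are created and destroyed.
```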

@cmpxchg16
Author

I agree with your comment on StackPool.
I also want to develop a specific stack manager to gain more performance on Linux,
because the mmap/munmap implementation involves a lot of VMAs; for a long-running process that is a performance killer, and a very lightweight implementation at the kernel level could boost performance.
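
For context, a guarded fiber stack allocated with mmap typically looks like the sketch below (a generic illustration, not Mordor's code). Each such stack is its own mapping, and the PROT_NONE guard page splits it into two VMAs, so thousands of fibers mean thousands of VMAs for the kernel to manage, which is the overhead being described.

```cpp
// Generic illustration of an mmap-based fiber stack with a guard page (Linux).
// Each stack is a separate mapping, and the PROT_NONE guard page splits it into
// two VMAs, so every fiber adds kernel bookkeeping that mmap/munmap must update.
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

struct FiberStack {
    void* base = nullptr;   // start of the whole mapping (guard page first)
    std::size_t size = 0;   // usable stack size, excluding the guard page
};

FiberStack allocate_stack(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    // One extra page at the low end to trap stack overflows.
    void* p = mmap(nullptr, size + page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return {};
    mprotect(p, page, PROT_NONE);              // guard page: splits the VMA in two
    return {p, size};
}

void free_stack(const FiberStack& s) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    munmap(s.base, s.size + page);             // removes the VMAs again
}
```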

@cmpxchg16 cmpxchg16 closed this Aug 4, 2013
@cmpxchg16
Author

OOPS...

@cmpxchg16 cmpxchg16 reopened this Aug 4, 2013
2. change the default size of buffered stream to 4K
3. add simple implementation to transfer stream
@mtanski

mtanski commented Jan 2, 2014

I've done some tests against your branch, cmpxchg16, and against my own branch. My branch contains changes to port Mordor to C++11, but also includes a change that uses malloc instead of mmap for stack allocation. Generally, system mallocs (or better yet, tcmalloc / jemalloc) perform thread-aware caching of freed allocations, which basically means we get pooled stacks for free. We also avoid some potential performance penalties of using mmap, where the kernel has to change the process's VMAs.

It strikes me that much of the work to do stack pooling internally is moot if we make that small change. There's also no need to remove the built-in multi-core model. I'm sure I could get better performance on Linux by using _setjmp fibers (avoiding the signal mask change), but I didn't, for the sake of not changing things too much.
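
The mmap-to-malloc change being described amounts to something like the sketch below (a rough illustration, not the actual patch): replace the per-stack mapping with a plain heap allocation, and let the allocator's per-thread caching of freed blocks act as the stack pool.

```cpp
// Rough sketch of swapping an mmap-based fiber stack for a heap allocation.
// tcmalloc/jemalloc and most system mallocs cache freed blocks per thread, so
// recycled fiber stacks come out of that cache instead of going back to the
// kernel, and no VMAs are created or destroyed per fiber. The trade-off: no
// guard page, so a stack overflow is no longer trapped by the MMU.
#include <cstddef>
#include <cstdlib>

void* allocate_stack(std::size_t size) {
    return std::malloc(size);   // served from the allocator's (often thread-local) cache
}

void free_stack(void* stack) {
    std::free(stack);           // returned to the cache, not munmap'ed
}
```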

Here are my results. I used the same testing methodology as outlined in your post.

This is a test on an Ubuntu 13.04 VM inside OS X.
The hardware is a 13" Early 2013 MacBook Pro: SSD drive, 8 GB of RAM, two HT
i7 cores. The VM gets 4 virtual cores.

Mordor C++11 (my branch)

Lifting the server siege... done.
Transactions: 34628 hits
Availability: 100.00 %
Elapsed time: 9.95 secs
Data transferred: 43.33 MB
Response time: 0.12 secs
Transaction rate: 3480.20 trans/sec
Throughput: 4.35 MB/sec
Concurrency: 426.04
Successful transactions: 34628
Failed transactions: 0
Longest transaction: 7.18
Shortest transaction: 0.00

cmpxchg16 Mordor

The server is now under siege...
Lifting the server siege... done.
Transactions: 40459 hits
Availability: 100.00 %
Elapsed time: 9.26 secs
Data transferred: 50.62 MB
Response time: 0.11 secs
Transaction rate: 4369.22 trans/sec
Throughput: 5.47 MB/sec
Concurrency: 485.08
Successful transactions: 40459
Failed transactions: 0
Longest transaction: 1.72
Shortest transaction: 0.00

Mordor without mmap stack / but malloc

Transactions: 44371 hits
Availability: 100.00 %
Elapsed time: 9.95 secs
Data transferred: 55.52 MB
Response time: 0.11 secs
Transaction rate: 4459.40 trans/sec
Throughput: 5.58 MB/sec
Concurrency: 486.28
Successful transactions: 44371
Failed transactions: 0
Longest transaction: 1.24
Shortest transaction: 0.00

Mordor C++11 malloc + tcmalloc

Transactions: 43552 hits
Availability: 100.00 %
Elapsed time: 9.57 secs
Data transferred: 54.49 MB
Response time: 0.11 secs
Transaction rate: 4550.89 trans/sec
Throughput: 5.69 MB/sec
Concurrency: 485.36
Successful transactions: 43552
Failed transactions: 0
Longest transaction: 1.23
Shortest transaction: 0.00

@mtanski

mtanski commented Jan 2, 2014

For comparison, here is stock Mordor (without C++11) from Cody's branch. Same as before, best run out of 3.

Transactions: 33561 hits
Availability: 100.00 %
Elapsed time: 9.76 secs
Data transferred: 41.99 MB
Response time: 0.13 secs
Transaction rate: 3438.63 trans/sec
Throughput: 4.30 MB/sec
Concurrency: 438.10
Successful transactions: 33561
Failed transactions: 0
Longest transaction: 7.43
Shortest transaction: 0.00
