feat: change thread scheduling method in ThreadPool class #2648

Open

wants to merge 24 commits into base: unstable

Conversation

QlQlqiqi
Contributor

@QlQlqiqi QlQlqiqi commented May 12, 2024

The logic is based on the function WriteThread::AwaitState in RocksDB (link).

Before:

  1. All workers and the main thread that pushes tasks into the queue wait on the same lock, which can cause very intense contention.
  2. When a worker finishes a task, it tries to acquire the lock again for a new task via the await function. Under intense contention the worker is likely to be put to sleep, and sleeping and waking up cost a lot of time.

After:

  1. This is a standard producer-consumer model, so we can use a lock-free list to relieve the intense contention.
  2. When a worker wakes up, it tries to fetch tasks. If it finds none, it loops for a while to wait for new ones. Under high throughput the wait for new tasks is very short, so this loop does NOT cause serious blocking. To further reduce the blocking time, the loop has 3 levels (see the sketch after this list):
    2.1. Level 1: spin-wait.
    2.2. Level 2: a longer yield loop. The worker may yield the CPU when certain conditions are met, and a credit value stores the probability of entering the level-2 loop.
    2.3. Level 3: wait for new tasks via the await function.
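As a rough illustration, a worker's fetch path with these three levels might look like the sketch below. It borrows the names newest_node_ and rsignal_ and the defaults above from this PR, but the simplified structure (no yield_credit adaptation, no timer tasks) is an assumption, not the actual implementation:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

struct Node { Node* link_older = nullptr; /* task payload omitted */ };

std::atomic<Node*> newest_node_{nullptr};
std::mutex mu_;
std::condition_variable rsignal_;
std::atomic<bool> should_stop_{false};

Node* FetchBatch() {
  // Level 1: short spin loop (default 200 tries; the PR spins with
  // AsmVolatilePause() between probes).
  for (uint32_t tries = 0; tries < 200; ++tries) {
    if (newest_node_.load(std::memory_order_acquire) != nullptr) {
      if (Node* last = newest_node_.exchange(nullptr)) return last;
    }
  }
  // Level 2: bounded yield loop (max_yield_usec_, default 100 us).
  auto deadline = std::chrono::steady_clock::now() + std::chrono::microseconds(100);
  while (std::chrono::steady_clock::now() < deadline) {
    if (Node* last = newest_node_.exchange(nullptr)) return last;
    std::this_thread::yield();
  }
  // Level 3: block on the condition variable until a producer signals.
  std::unique_lock<std::mutex> lock(mu_);
  rsignal_.wait(lock, [] {
    return newest_node_.load(std::memory_order_acquire) != nullptr ||
           should_stop_.load(std::memory_order_relaxed);
  });
  return newest_node_.exchange(nullptr);
}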

params

  1. the count of the level-1 loop:
    default: 200. Too large a value may cause high CPU load; too small a value may make the spin futile.
  2. queue_slow_size_:
    default: std::min(worker_num, 100). When the number of tasks in the queue exceeds it, the main thread calling Schedule calls std::this_thread::yield().
  3. max_queue_size_:
    default: max_queue_size. When the number of tasks in the queue exceeds it, the main thread calling Schedule calls std::this_thread::yield() until the number of queued tasks drops below the threshold.
  4. max_yield_usec_:
    default: 100. The maximum duration (in microseconds) of the level-2 loop.
  5. slow_yield_usec_:
    default: 3. If a std::this_thread::yield() call takes longer than this threshold (in microseconds), the credit data (yield_credit) may be updated.
  6. kMaxSlowYieldsWhileSpinning:
    default: 3. Once condition (5) above has been hit this many times, the credit data is updated.
  7. sampling_base:
    default: 256. It ensures the probability of entering the level-2 loop is not lower than 1/sampling_base.
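For example, the last three parameters cooperate roughly as follows. This is a hypothetical sketch modeled on RocksDB's AwaitState (the exact decision logic in this PR may differ): the worker enters the level-2 loop whenever yield_credit is non-negative, and otherwise only with probability about 1/sampling_base, so the credit can recover when conditions change:

#include <atomic>
#include <cstdint>
#include <random>

// Hypothetical: decide whether to enter the level-2 yield loop.
bool ShouldEnterLevel2(std::atomic<int32_t>& yield_credit,
                       uint32_t sampling_base = 256) {
  if (yield_credit.load(std::memory_order_relaxed) >= 0) {
    return true;  // yielding has recently paid off
  }
  // Sample so level 2 is still entered with probability >= 1/sampling_base.
  thread_local std::mt19937 rng{std::random_device{}()};
  return rng() % sampling_base == 0;
}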

Summary by CodeRabbit

  • Refactor

    • Major restructuring of the thread pool's internal architecture for improved task handling, worker management, and synchronization.
    • Significant changes to the ThreadPool and Worker classes for better performance and scalability.
  • New Features

    • Introduced methods for task execution, delayed scheduling, and synchronization within the thread pool.
  • Improvements

    • Enhanced logic for managing multiple links and initializing workers.
    • Updated task scheduling mechanisms to support better performance and reliability.
    • Increased thread and pool sizes for improved resource utilization and performance.

@QlQlqiqi QlQlqiqi changed the title change thread scheduling method in ThreadPool class feat: change thread scheduling method in ThreadPool class May 12, 2024
@github-actions github-actions bot added the ✏️ Feature New feature or request label May 12, 2024
@QlQlqiqi QlQlqiqi requested a review from AlexStocks May 12, 2024 06:43
// 1. loop for short time
for (uint32_t tries = 0; tries < 200; ++tries) {
if (newest_node_.load(std::memory_order_acquire) != nullptr) {
last = newest_node_.exchange(nullptr);
Collaborator

Here the thread that arrives first takes the entire linked list for itself and then consumes it sequentially, which can cause large latency fluctuations. I suggest distributing tasks as evenly as possible among the workers in the thread pool. After all, Pika's read/write path runs entirely on its own threads, which differs a lot from RocksDB's threading model (in RocksDB it is the application threads that contend on each writer), so this part may need more consideration.

Contributor Author

I thought about this. There are roughly two ways to take only a limited number of tasks at a time (see the sketch after this list):
1. one lock-free list per worker, with new tasks added to these lists randomly or by round-robin traversal;
2. keep a single lock-free list, but give it a small capacity, say 10, so that one worker takes at most 10 tasks at a time.
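A minimal sketch of approach 1 (multiple lock-free list heads with producers distributing tasks across them). The names LinkOne and newest_node_ follow this PR, while the fixed list count and the round-robin index are illustrative assumptions:

#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct Node { Node* link_older = nullptr; /* task payload omitted */ };

constexpr size_t kNumLinks = 4;  // assumed: e.g. one list per worker (group)
std::array<std::atomic<Node*>, kNumLinks> newest_node_{};  // heads start null
std::atomic<uint64_t> task_idx_{0};

// Push one node onto a list head; returns true if the list was empty before.
bool LinkOne(Node* node, std::atomic<Node*>* newest_node) {
  Node* nodes = newest_node->load(std::memory_order_relaxed);
  while (true) {
    node->link_older = nodes;
    if (newest_node->compare_exchange_weak(nodes, node)) {
      return nodes == nullptr;
    }
  }
}

// Producers round-robin new tasks across the lists, so a consumer that
// exchanges one head away takes only that list's backlog, not everything.
void Schedule(Node* node) {
  auto idx = task_idx_.fetch_add(1, std::memory_order_relaxed);
  LinkOne(node, &newest_node_[idx % kNumLinks]);
}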

Contributor Author

The second approach can be tested directly; for the first approach, see my new branch: https://github.com/QlQlqiqi/pika/tree/change-thread-shedule-with-mutil-list

Contributor Author

In my tests the two approaches are roughly equal in speed; with proper parameter tuning there should be a larger gap.

@AlexStocks
Collaborator

@CodiumAI-Agent /review

@CodiumAI-Agent

PR Review 🔍

⏱️ Estimated effort to review [1-5]

4, because the PR involves significant changes to the threading and scheduling logic in a thread pool implementation, which is a complex area requiring careful review to ensure thread safety, performance, and correctness. The changes include the introduction of lock-free structures and a detailed scheduling mechanism which needs thorough testing and validation.

🧪 Relevant tests

No

⚡ Possible issues

Possible Bug: The use of std::this_thread::yield() in the Schedule method may not be sufficient to handle high contention gracefully. This could lead to performance degradation under heavy load.

Thread Safety Concern: The lock-free implementation using std::atomic and manual memory management (e.g., new and delete for Node objects) must be carefully reviewed to ensure there are no race conditions or memory leaks.

🔒 Security concerns

No

Code feedback:
relevant file: src/net/src/thread_pool.cc
suggestion:

Consider using a condition variable instead of busy-waiting with std::this_thread::yield() in the Schedule method. This change could improve CPU usage and responsiveness under high load. [important]

relevant line: std::this_thread::yield();

relevant file: src/net/src/thread_pool.cc
suggestion:

Implement proper memory management for Node objects to prevent memory leaks. Consider using smart pointers (e.g., std::unique_ptr) instead of raw pointers for automatic memory management. [important]

relevant line: auto node = new Node(func, arg);

relevant file: src/net/src/thread_pool.cc
suggestion:

To prevent potential race conditions with the lock-free list, ensure that all operations on std::atomic pointers in LinkOne and related functions are correctly ordered using memory order tags. [important]

relevant line: if (newest_node->compare_exchange_weak(nodes, node)) {

relevant file: src/net/src/thread_pool.cc
suggestion:

Add error handling for the scenario where new Node fails to allocate memory, which could lead to a null pointer dereference. [medium]

relevant line: auto node = new Node(func, arg);

@AlexStocks
Collaborator

@CodiumAI-Agent /improve

@CodiumAI-Agent

PR Code Suggestions ✨

Possible bug
Add exception handling for memory allocation to prevent leaks

Ensure proper exception handling for dynamic memory allocation when creating new Node
objects to prevent memory leaks and ensure program stability.

src/net/src/thread_pool.cc [112-113]

-auto node = new Node(func, arg);
-LinkOne(node, &newest_node_);
+Node* node = nullptr;
+try {
+  node = new Node(func, arg);
+  LinkOne(node, &newest_node_);
+} catch (const std::bad_alloc& e) {
+  // Handle memory allocation failure, e.g., log error, clean up, and throw if necessary
+  delete node;
+  throw;
+}
 
Suggestion importance[1-10]: 9

Why: Adding exception handling for dynamic memory allocation is crucial to prevent memory leaks and ensure program stability. This is a significant improvement addressing a potential bug.

Best practice
Use smart pointers for automatic memory management

Replace the manual memory management with smart pointers to avoid manual deletion and
improve code safety and readability.

src/net/src/thread_pool.cc [112-113]

-auto node = new Node(func, arg);
-LinkOne(node, &newest_node_);
+auto node = std::make_unique<Node>(func, arg);
+LinkOne(node.get(), &newest_node_);
 
Suggestion importance[1-10]: 8

Why: Using smart pointers enhances code safety and readability by automating memory management, reducing the risk of memory leaks and manual deletion errors. This is a best practice for modern C++.

Enhancement
Use a thread-safe queue to simplify task scheduling

Consider using a thread-safe queue or priority queue that encapsulates the synchronization
logic internally, rather than manually managing the synchronization and node linking in
ThreadPool::Schedule.

src/net/src/thread_pool.cc [112-113]

-auto node = new Node(func, arg);
-LinkOne(node, &newest_node_);
+task_queue.push(std::make_unique<Node>(func, arg));
 
Suggestion importance[1-10]: 8

Why: Encapsulating synchronization logic within a thread-safe queue can simplify the code and reduce the potential for synchronization errors. This enhancement improves code maintainability and readability.

Performance
Replace busy waiting with a sleep mechanism to improve CPU efficiency

Consider using a more efficient locking mechanism or lock-free data structures for the
ThreadPool::Schedule method to avoid potential performance bottlenecks due to the use of
std::this_thread::yield() in a busy loop. This can lead to thread starvation and
inefficient CPU usage.

src/net/src/thread_pool.cc [103-109]

 while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
-  std::this_thread::yield();
+  std::this_thread::sleep_for(std::chrono::milliseconds(1));
 }
 if (node_cnt_.load(std::memory_order_relaxed) >= queue_slow_size_) {
-  std::this_thread::yield();
+  std::this_thread::sleep_for(std::chrono::milliseconds(1));
 }
 
Suggestion importance[1-10]: 7

Why: The suggestion to replace busy waiting with a sleep mechanism can improve CPU efficiency and reduce thread starvation. However, it may introduce latency in task scheduling, so the impact on performance should be carefully evaluated.



coderabbitai bot commented Jun 7, 2024

Walkthrough

The recent updates to the ThreadPool class involve a substantial overhaul aimed at enhancing task management, worker dynamics, and synchronization. Key modifications include the removal of the Task struct, significant changes to the Worker class, and the introduction of new methods and member variables in the ThreadPool class. These enhancements are designed to optimize task handling, improve resource allocation, and ensure more effective synchronization among threads.

Changes

  • conf/pika.conf: Enhanced configuration with new parameters for caching, thread management, and compaction strategies; increased thread and pool sizes.
  • src/net/include/thread_pool.h: Major restructuring of ThreadPool and Worker classes; removal of Task struct; new structs and method signatures added.
  • src/net/src/thread_pool.cc: Refactoring of WorkerMain, runInThread, and task scheduling methods to improve efficiency and control flow.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant ThreadPool
    participant Worker
    Client->>ThreadPool: Schedule(task)
    ThreadPool->>Worker: Notify worker
    Worker-->>ThreadPool: Fetch task
    Worker->>Worker: Execute task
    Worker-->>ThreadPool: Task complete

Poem

In the code where threads entwine,
A pool revamped, now so divine.
Workers buzz with tasks anew,
Synchronized in perfect queue.
Tasks handled with care and might,
ThreadPool’s future, oh so bright! 🐇✨



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Outside diff range and nitpick comments (2)
tests/integration/start_codis.sh (1)

Line range hint 38-38: Add error handling after the cd command to ensure script robustness.

- cd ../codis
+ cd ../codis || exit
src/net/src/thread_pool.cc (1)

Line range hint 102-283: Replace busy waiting with more efficient waiting mechanisms and ensure robust memory management to prevent leaks and ensure thread safety.

- while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
-   std::this_thread::yield();
- }
+ std::unique_lock<std::mutex> lock(mu_);
+ rsignal_.wait(lock, [this] { return node_cnt_.load() < max_queue_size_; });

- auto node = new Node(func, arg);
+ auto node = std::make_unique<Node>(func, arg);
- LinkOne(node, &newest_node_);
+ LinkOne(node.get(), &newest_node_);
- node.release();  // Transfer ownership to the linked list
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5240e45 and 0d1b00f.

Files selected for processing (6)
  • src/net/include/likely.h (1 hunks)
  • src/net/include/net_define.h (1 hunks)
  • src/net/include/random.h (1 hunks)
  • src/net/include/thread_pool.h (3 hunks)
  • src/net/src/thread_pool.cc (5 hunks)
  • tests/integration/start_codis.sh (2 hunks)
Files skipped from review due to trivial changes (2)
  • src/net/include/likely.h
  • src/net/include/net_define.h
Additional context used
Shellcheck
tests/integration/start_codis.sh

[warning] 38-38: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)

#rm -rf /tmp/codis
#rm -rf codis_data_1
#rm -rf codis_data_2
# sugguest clear these before starting codis

Correct the typo in the comment.

- # sugguest clear these before starting codis
+ # suggest clear these before starting codis

Comment on lines 72 to 115
private:
struct Node {
Node* link_older = nullptr;
Node* link_newer = nullptr;

// true if task is TimeTask
bool is_time_task;
TimeTask task;

Node(TaskFunc func, void* arg) : is_time_task(false), task(0, func, arg) {}
Node(uint64_t exec_time, TaskFunc func, void* arg) : is_time_task(true), task(exec_time, func, arg) {}

inline void Exec() { task.func(task.arg); }
inline Node* Next() { return link_newer; }
};

static inline void AsmVolatilePause() {
#if defined(__i386__) || defined(__x86_64__)
asm volatile("pause");
#elif defined(__aarch64__)
asm volatile("wfe");
#elif defined(__powerpc64__)
asm volatile("or 27,27,27");
#endif
// it's okay for other platforms to be no-ops
}

Node* CreateMissingNewerLinks(Node* head);
bool LinkOne(Node* node, std::atomic<Node*>* newest_node);

std::atomic<Node*> newest_node_;
std::atomic<int> node_cnt_; // for task
std::atomic<Node*> time_newest_node_;
std::atomic<int> time_node_cnt_; // for time task

const int queue_slow_size_; // default value: min(worker_num_ * 10, max_queue_size_)
size_t max_queue_size_;

const uint64_t max_yield_usec_;
const uint64_t slow_yield_usec_;

AdaptationContext adp_ctx;

size_t worker_num_;

Ensure proper memory management for Node objects and correct ordering of atomic operations to prevent race conditions.

- Node* link_older = nullptr;
- Node* link_newer = nullptr;
+ std::atomic<Node*> link_older{nullptr};
+ std::atomic<Node*> link_newer{nullptr};

- std::atomic<Node*> newest_node_;
- std::atomic<int> node_cnt_;
+ std::atomic<Node*> newest_node_{nullptr};
+ std::atomic<int> node_cnt_{0};

- if (newest_node->compare_exchange_weak(nodes, node)) {
+ if (newest_node->compare_exchange_weak(nodes, node, std::memory_order_release, std::memory_order_relaxed)) {

@QlQlqiqi
Contributor Author

QlQlqiqi commented Jun 18, 2024

These are all tests with one lock-free list per worker (described in the discussion above); code: https://github.com/QlQlqiqi/pika/tree/change-thread-shedule-with-mutil-list

binlog and cache disabled

redis-benchmark -h localhost -p 9221 -t get,set -n 5000000 -r 10000000000000 -d 512 -c 300 --threads 150

Tuned parameters:

kMaxSlowYieldsWhileSpinning, slow_yield_usec_, max_yield_usec_, command type

6 6 500 get
Summary:
  throughput summary: 121294.45 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.452     0.016     1.911     5.975    11.775   137.087

6 6 500 set
Summary:
  throughput summary: 22597.34 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
       13.178     0.080     6.711    30.463    54.175  2711.551

3 3 100 get
Summary:
  throughput summary: 122249.38 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.424     0.016     1.743     6.375    14.431   151.551

3 3 100 set
Summary:
  throughput summary: 20026.92 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
       14.806     0.064     5.903    29.727    52.831  3000.319

By tuning these parameters, get's p99 and set's QPS can be reduced somewhat, but get's QPS and set's p99 rise. The parameters depend on machine performance; my local test machine is mediocre, and raising the level-2 wait time improves the results moderately.

Alternatively, set performance can be traded for get performance.

If yield_credit is kept permanently positive (i.e., the level-2 wait is always entered), get's QPS rises and its p99 drops; comparison below:

Summary:
  throughput summary: 122249.38 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.424     0.016     1.743     6.375    14.431   151.551

Summary:
  throughput summary: 131887.84 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.256     0.024     1.879     5.055     8.855   159.615

Because the level-2 wait is always entered, the get test almost never triggers cv.wait, so time is saved. However, this makes the already slower set workload spend its time waiting for tasks instead of processing data, and because set executes slowly, cv.wait may still be triggered after the level-2 wait, which makes it slow.

@QlQlqiqi
Contributor Author

QlQlqiqi commented Jun 18, 2024

Call the current branch v1 and the one-lock-free-list-per-worker branch v2.

v1's QPS is lower than v2's, but so is its p99. The reason: gperftools shows that a large share of v2's time is spent in AsmVolatilePause, which means that with multiple lock-free lists waiting becomes less efficient, raising p99; with a single list, contention is too high and QPS drops too. So a middle ground is needed, e.g. 5 lock-free lists for 30 workers, which should lower p99 without lowering QPS.

In v1 all workers share one list; in v2 each worker has its own list.
Here are get tests of v1 and v2 with a worker num of 2:

v1:
Summary:
  throughput summary: 82895.37 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.580     0.024     2.815     9.487    17.535   309.503

v2:
Summary:
  throughput summary: 91702.74 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.245     0.016     2.887     7.719     9.167   203.263

@QlQlqiqi
Contributor Author

QlQlqiqi commented Jun 19, 2024

As discussed above, multiple lock-free lists weaken the success rate of the level-3 wait, so we can try mapping a small group of workers to each lock-free list:

get: 30 workers in total, one list per 2 workers
Summary:
  throughput summary: 133697.00 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        2.213     0.016     1.615     6.279    10.711   173.951


get: 30 workers in total, one list per 3 workers
Summary:
  throughput summary: 86218.79 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        3.450     0.024     2.071    11.311    16.927   152.191

As you can see, on my machine contention is best balanced at one lock-free list per 2 workers.
https://github.com/QlQlqiqi/pika/tree/change-thread-shedule-with-mutil-list-per-worker

@QlQlqiqi
Contributor Author

Test summary:

binlog and cache disabled, 30 workers
Test command: redis-benchmark -h localhost -p 9221 -t get,set -n 5000000 -r 10000000000000 -d 512 -c 300 --threads 150

  1. All workers share one list: contention is too high, and the CPU spends more time in the std::this_thread::yield() calls inside Schedule, because too many tasks are pending in the queue and the producer thread must yield; the resulting context switching adds overhead. And while the CPU is busy switching threads, tasks already taken off the list do not get executed, so waiting efficiency drops and the level-1 wait (AsmVolatilePause()) time is wasted.
  2. One list per nworkers_per_link_ workers: the value should be chosen by machine performance and workload. Too large, and waiting efficiency drops because contention pressure is too high; too small, and the probability of fetching tasks from any one list drops, which also lowers waiting efficiency. On my machine (AMD 7900X, Ubuntu 22.04.3, Debug build of pika, DDR5-6400 memory) a value of 2 is best.

As for the program's other parameters, such as kMaxSlowYieldsWhileSpinning, slow_yield_usec_, and max_yield_usec_, changing them also adjusts how long the waiting logic runs, so the waiting efficiency can be tuned to the machine and the workload.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

Outside diff range and nitpick comments (1)
tests/integration/start_codis.sh (1)

Line range hint 38-38: Add error handling for the cd command.

The script should handle the possibility of the cd command failing, which would prevent errors in subsequent commands if the directory does not exist or is not accessible.

- cd ../codis
+ cd ../codis || exit
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 0d1b00f and 9dab5c4.

Files selected for processing (6)
  • src/net/include/likely.h (1 hunks)
  • src/net/include/net_define.h (1 hunks)
  • src/net/include/random.h (1 hunks)
  • src/net/include/thread_pool.h (4 hunks)
  • src/net/src/thread_pool.cc (5 hunks)
  • tests/integration/start_codis.sh (2 hunks)
Files skipped from review due to trivial changes (2)
  • src/net/include/likely.h
  • src/net/include/net_define.h
Additional context used
Shellcheck
tests/integration/start_codis.sh

[warning] 38-38: Use 'cd ... || exit' or 'cd ... || return' in case cd fails. (SC2164)

Additional comments not posted (1)
tests/integration/start_codis.sh (1)

3-3: Correct the typo in the comment.

The word "sugguest" should be corrected to "suggest".

- # sugguest clear these before starting codis
+ # suggest clear these before starting codis

Likely invalid or redundant comment.

Comment on lines +73 to +81
static __thread Random* tls_instance;
static __thread std::aligned_storage<sizeof(Random)>::type tls_instance_bytes;

auto rv = tls_instance;
if (UNLIKELY(rv == nullptr)) {
size_t seed = std::hash<std::thread::id>()(std::this_thread::get_id());
rv = new (&tls_instance_bytes) Random((uint32_t)seed);
tls_instance = rv;
}

Use smart pointers for thread-local storage.

To manage memory more safely and efficiently, consider using smart pointers instead of raw pointers for the thread-local instance of the Random class.

- static __thread Random* tls_instance;
- static __thread std::aligned_storage<sizeof(Random)>::type tls_instance_bytes;
+ static thread_local std::unique_ptr<Random> tls_instance;

- auto rv = tls_instance;
- if (UNLIKELY(rv == nullptr)) {
-   size_t seed = std::hash<std::thread::id>()(std::this_thread::get_id());
-   rv = new (&tls_instance_bytes) Random((uint32_t)seed);
-   tls_instance = rv;
- }
+ if (!tls_instance) {
+   size_t seed = std::hash<std::thread::id>()(std::this_thread::get_id());
+   tls_instance = std::make_unique<Random>((uint32_t)seed);
+ }
+ return tls_instance.get();

@@ -0,0 +1,88 @@
// Copyright (c) 2024-present, Qihoo, Inc. All rights reserved.

Add a license declaration at the start of the file.

As suggested in previous comments, a license declaration should be consistently present at the start of each file.

+ // Copyright (c) 2024-present, OpenAtom Foundation. All rights reserved.
+ // This source code is licensed under the BSD-style license found in the
+ // LICENSE file in the root directory of this source tree. An additional grant
+ // of patent rights can be found in the PATENTS file in the same directory.

Comment on lines 82 to 110
struct Node {
Node* link_older = nullptr;
Node* link_newer = nullptr;

size_t worker_num_;
// true if task is TimeTask
bool is_time_task;
TimeTask task;

Node(TaskFunc func, void* arg) : is_time_task(false), task(0, func, arg) {}
Node(uint64_t exec_time, TaskFunc func, void* arg) : is_time_task(true), task(exec_time, func, arg) {}

inline void Exec() { task.func(task.arg); }
inline Node* Next() { return link_newer; }
};

static inline void AsmVolatilePause() {
#if defined(__i386__) || defined(__x86_64__)
asm volatile("pause");
#elif defined(__aarch64__)
asm volatile("wfe");
#elif defined(__powerpc64__)
asm volatile("or 27,27,27");
#endif
// it's okay for other platforms to be no-ops
}

Node* CreateMissingNewerLinks(Node* head, int* cnt);
bool LinkOne(Node* node, std::atomic<Node*>* newest_node);


Ensure proper memory management for Node objects and correct ordering of atomic operations.

To prevent memory leaks and race conditions, consider using smart pointers for Node objects and ensure correct memory order tags for atomic operations.

- Node* link_older = nullptr;
- Node* link_newer = nullptr;
+ std::atomic<Node*> link_older{nullptr};
+ std::atomic<Node*> link_newer{nullptr};

- std::atomic<Node*> newest_node_;
- std::atomic<int> node_cnt_;
+ std::atomic<Node*> newest_node_{nullptr};
+ std::atomic<int> node_cnt_{0};

- if (newest_node->compare_exchange_weak(nodes, node)) {
+ if (newest_node->compare_exchange_weak(nodes, node, std::memory_order_release, std::memory_order_relaxed)) {

Committable suggestion was skipped due to low confidence.

@QlQlqiqi
Contributor Author

QlQlqiqi commented Jun 27, 2024

A timer task can block the worker thread, because the worker always waits for a fixed duration; if new tasks arrive during that period, the worker keeps waiting anyway. So while waiting we check whether a new task has arrived; if so, the timer tasks already taken off the list are pushed back onto it and the normal task logic runs, so the worker thread is not blocked. (There is currently no limit on how many timer tasks will be re-pushed onto the list.)

          lock.lock();
          // if task is coming now, do task immediately
          auto res = rsignal.wait_for(lock, std::chrono::microseconds(exec_time - unow), [this, &newest_node]() {
            return newest_node.load(std::memory_order_relaxed) != nullptr || UNLIKELY(should_stop());
          });
          lock.unlock();
          if (res) {
            // re-push the timer tasks
            ReDelaySchedule(time_first);
            goto retry;
          }
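For context, ReDelaySchedule is presumably the inverse step: walking the timer nodes that were already detached from the list and linking them back so they are not lost. A hypothetical sketch, assuming the Node fields and LinkOne helper shown elsewhere in this PR:

// Hypothetical sketch: re-push detached timer nodes onto the lock-free time
// list so the worker can bail out and run newly arrived normal tasks first.
void ReDelaySchedule(Node* time_first, std::atomic<Node*>* time_newest_node) {
  while (time_first != nullptr) {
    Node* next = time_first->Next();   // save before relinking
    time_first->link_newer = nullptr;  // rebuilt later by CreateMissingNewerLinks
    LinkOne(time_first, time_newest_node);
    time_first = next;
  }
}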

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 9dab5c4 and 8694da0.

Files selected for processing (2)
  • src/net/include/thread_pool.h (4 hunks)
  • src/net/src/thread_pool.cc (5 hunks)
Additional comments not posted (10)
src/net/include/thread_pool.h (7)

12-12: Consider sorting imports for better readability.

Imports like <vector> are typically grouped and sorted with other standard library includes. This can improve readability and maintainability of the code.


20-20: Modernize function pointer declaration.

Using using for type aliasing improves readability and is more in line with modern C++ standards.


34-38: Introduce a constructor for Arg struct.

The explicit constructor initializes arg and idx, ensuring that objects of this struct are always in a valid state when created.


51-52: Initialization of Worker class members.

The members idx_ and arg_ are correctly initialized in the constructor, ensuring that each worker has a unique index and associated argument.


72-73: Method to run tasks in a thread is well encapsulated.

The runInThread method allows for tasks to be executed in a specific thread based on the index, which is a good design for load balancing across workers.


75-79: Introduction of AdaptationContext struct.

This struct is used to manage adaptive thresholds for task execution, which is crucial for optimizing performance under different load conditions.


100-109: Platform-specific pause operations are correctly implemented.

The use of assembly instructions for pausing the CPU is appropriate for reducing CPU usage during spin-wait loops.

src/net/src/thread_pool.cc (3)

17-20: Correct implementation of WorkerMain.

The method correctly casts the argument and calls runInThread with the appropriate worker index, which is essential for correct task execution.


93-95: Proper cleanup in stop_thread_pool.

Signaling all condition variables before joining and deleting worker threads is a good practice to ensure clean shutdown.


77-78: Worker initialization in the thread pool.

Workers are dynamically allocated and started correctly. However, ensure that there is a corresponding deallocation to prevent memory leaks.

Verification successful

Worker initialization in the thread pool.

The Worker objects are properly deallocated, as evidenced by the delete worker statements found in the codebase.

  • src/net/src/thread_pool.cc: delete worker;
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify memory management for workers.

# Test: Search for worker deallocation.
rg --type py 'delete worker'

Length of output: 30



Script:

#!/bin/bash
# Description: Verify memory management for workers.

# Test: Search for worker deallocation in C++ files.
rg --type cpp 'delete worker'

Length of output: 227

int idx;
};

explicit Worker(ThreadPool* tp, int idx = 0) : start_(false), thread_pool_(tp), idx_(idx), arg_(tp, idx){};

Ensure thread safety during worker initialization.

The initialization of Worker objects includes setting start_ to false, which is correct. However, consider using std::atomic for start_ to ensure thread safety during checks and updates.

};

explicit ThreadPool(size_t worker_num, size_t max_queue_size, std::string thread_pool_name = "ThreadPool");
explicit ThreadPool(size_t worker_num, size_t max_queue_size, std::string thread_pool_name = "ThreadPool");

ThreadPool constructor lacks explicit member initializations.

The constructor should explicitly initialize all its members to ensure that all objects are in a predictable state. This is particularly important for synchronization primitives like mutexes and condition variables.

- explicit ThreadPool(size_t worker_num, size_t max_queue_size, std::string thread_pool_name = "ThreadPool");
+ explicit ThreadPool(size_t worker_num, size_t max_queue_size, std::string thread_pool_name = "ThreadPool") : worker_num_(worker_num), max_queue_size_(max_queue_size), thread_pool_name_(std::move(thread_pool_name)), running_(false), should_stop_(false) {}

Comment on lines +82 to +95
struct Node {
Node* link_older = nullptr;
Node* link_newer = nullptr;

size_t worker_num_;
// true if task is TimeTask
bool is_time_task;
TimeTask task;

Node(TaskFunc func, void* arg) : is_time_task(false), task(0, func, arg) {}
Node(uint64_t exec_time, TaskFunc func, void* arg) : is_time_task(true), task(exec_time, func, arg) {}

inline void Exec() { task.func(task.arg); }
inline Node* Next() { return link_newer; }
};

Memory management for Node objects needs attention.

The Node struct handles tasks but does not use smart pointers for managing dynamic memory, which can lead to memory leaks.

- Node* link_older = nullptr;
- Node* link_newer = nullptr;
+ std::unique_ptr<Node> link_older;
+ std::unique_ptr<Node> link_newer;

Comment on lines +111 to +112
Node* CreateMissingNewerLinks(Node* head, int* cnt);
bool LinkOne(Node* node, std::atomic<Node*>* newest_node);

Tip

Codebase Verification

Link management methods need thorough testing.

The methods CreateMissingNewerLinks and LinkOne are critical for maintaining the integrity of the task list. Ensure these methods are covered by unit tests to prevent issues in task scheduling.

  • No unit tests found for CreateMissingNewerLinks.
  • No unit tests found for LinkOne.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify unit tests for link management methods.

# Test: Search for tests related to `CreateMissingNewerLinks` and `LinkOne`.
rg --type py 'test_CreateMissingNewerLinks|test_LinkOne'

Length of output: 58



Script:

#!/bin/bash
# Description: Verify unit tests for link management methods.

# Test: Search for tests related to `CreateMissingNewerLinks` and `LinkOne` in all test files.
rg 'CreateMissingNewerLinks|LinkOne' --type-add 'test:*.{cpp,h,py}' --type test

Length of output: 257

Comment on lines 160 to 313

if (newest_node.load(std::memory_order_acquire) != nullptr) {
last = newest_node.exchange(nullptr);
if (last != nullptr) {
would_spin_again = true;
// success
break;
}
}
if (UNLIKELY(time_newest_node.load(std::memory_order_acquire) != nullptr)) {
time_last = time_newest_node.exchange(nullptr);
if (time_last != nullptr) {
would_spin_again = true;
// success
break;
}
}

auto now = std::chrono::steady_clock::now();
if (now == iter_begin || now - iter_begin >= std::chrono::microseconds(slow_yield_usec_)) {
++slow_yield_count;
if (slow_yield_count >= kMaxSlowYieldsWhileSpinning) {
update_ctx = true;
break;
}
}
iter_begin = now;
}
}

// update percentage of next loop 2
if (update_ctx) {
auto v = yield_credit.load(std::memory_order_relaxed);
v = v - (v / 1024) + (would_spin_again ? 1 : -1) * 131072;
yield_credit.store(v, std::memory_order_relaxed);
}

if (!would_spin_again) {
// 3. wait for new task
continue;
}
}

if (!queue_.empty()) {
auto [func, arg] = queue_.front();
queue_.pop();
wsignal_.notify_one();
lock.unlock();
(*func)(arg);
exec:
// do all normal tasks older than this task pointed last
if (LIKELY(last != nullptr)) {
int cnt = 1;
auto first = CreateMissingNewerLinks(last, &cnt);
// node_cnt_ -= cnt;
assert(!first->is_time_task);
do {
first->Exec();
tmp = first;
first = first->Next();
node_cnt_--;
delete tmp;
} while (first != nullptr);
}

// do all time tasks older than this task pointed time_last
if (UNLIKELY(time_last != nullptr)) {
int cnt = 1;
auto time_first = CreateMissingNewerLinks(time_last, &cnt);
do {
// time task may block normal task
auto now = std::chrono::system_clock::now();
uint64_t unow = std::chrono::duration_cast<std::chrono::microseconds>(now.time_since_epoch()).count();

auto [exec_time, func, arg] = time_first->task;
assert(time_first->is_time_task);
if (unow >= exec_time) {
time_first->Exec();
} else {
lock.lock();
// if task is coming now, do task immediately
auto res = rsignal.wait_for(lock, std::chrono::microseconds(exec_time - unow), [this, &newest_node]() {
return newest_node.load(std::memory_order_relaxed) != nullptr || UNLIKELY(should_stop());
});
lock.unlock();
if (res) {
// re-push the timer tasks
ReDelaySchedule(time_first);
goto retry;
}
time_first->Exec();
}
tmp = time_first;
time_first = time_first->Next();
time_node_cnt_--;
delete tmp;
} while (time_first != nullptr);
}
goto retry;
}

Complex task handling logic in runInThread.

The method handles both normal and time-based tasks with appropriate checks and synchronization. However, the complexity of this method could benefit from further comments or refactoring to improve readability.

Comment on lines 115 to 129
// stop until the size of tasks queue is not greater than max_queue_size_
while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
std::this_thread::yield();
}
// slow like above
if (node_cnt_.load(std::memory_order_relaxed) >= queue_slow_size_) {
std::this_thread::yield();
}

if (!should_stop()) {
queue_.emplace(func, arg);
rsignal_.notify_one();
if (LIKELY(!should_stop())) {
auto node = new Node(func, arg);
auto idx = ++task_idx_;
LinkOne(node, &newest_node_[idx % nlinks_]);
node_cnt_++;
rsignal_[idx % nlinks_].notify_one();

Task scheduling needs optimization to reduce busy-waiting.

The use of std::this_thread::yield() in a loop can lead to high CPU usage. Consider using condition variables or other synchronization mechanisms to reduce busy-waiting.

- while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
-     std::this_thread::yield();
+ std::unique_lock lock(mu_);
+ rsignal_.wait(lock, [this]{ return node_cnt_.load(std::memory_order_relaxed) < max_queue_size_; });

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8694da0 and 055a479.

Files selected for processing (2)
  • src/net/include/thread_pool.h (4 hunks)
  • src/net/src/thread_pool.cc (5 hunks)
Additional comments not posted (20)
src/net/include/thread_pool.h (10)

34-38: Ensure correct initialization of Arg struct.

The Arg struct is correctly initialized with the provided arguments.


40-40: Ensure thread safety during worker initialization.

The initialization of Worker objects includes setting start_ to false, which is correct. However, consider using std::atomic for start_ to ensure thread safety during checks and updates.


55-55: ThreadPool constructor lacks explicit member initializations.

The constructor should explicitly initialize all its members to ensure that all objects are in a predictable state. This is particularly important for synchronization primitives like mutexes and condition variables.


72-72: Ensure proper memory management for Node objects and correct ordering of atomic operations.

To prevent memory leaks and race conditions, consider using smart pointers for Node objects and ensure correct memory order tags for atomic operations.


111-112: Link management methods need thorough testing.

The methods CreateMissingNewerLinks and LinkOne are critical for maintaining the integrity of the task list. Ensure these methods are covered by unit tests to prevent issues in task scheduling.


72-73: Complex task handling logic in runInThread.

The method handles both normal and time-based tasks with appropriate checks and synchronization. However, the complexity of this method could benefit from further comments or refactoring to improve readability.


97-99: LGTM!

The ReDelaySchedule method correctly handles re-scheduling of timer tasks.


100-109: LGTM!

The AsmVolatilePause method correctly uses architecture-specific instructions for pausing execution.


111-112: LGTM!

The CreateMissingNewerLinks method correctly handles link creation in the task list.


112-113: LGTM!

The LinkOne method correctly handles node linking in the task list.

src/net/src/thread_pool.cc (10)

17-20: LGTM!

The WorkerMain method correctly handles the worker's main execution loop with the updated argument.


26-26: Error handling in thread creation needs improvement.

When pthread_create fails, consider logging the error or throwing an exception to help with debugging and fault tolerance.


Line range hint 36-46: LGTM!

The stop method correctly handles stopping the worker thread.


48-70: ThreadPool constructor lacks explicit member initializations.

The constructor should explicitly initialize all its members to ensure that all objects are in a predictable state. This is particularly important for synchronization primitives like mutexes and condition variables.


Line range hint 76-85: LGTM!

The start_thread_pool method correctly handles starting the worker threads.


Line range hint 92-108: LGTM!

The stop_thread_pool method correctly handles stopping the worker threads and clearing the worker list.


115-129: Task scheduling needs optimization to reduce busy-waiting.

The use of std::this_thread::yield() in a loop can lead to high CPU usage. Consider using condition variables or other synchronization mechanisms to reduce busy-waiting.


140-146: LGTM!

The DelaySchedule method correctly handles scheduling delayed tasks in the thread pool.


160-313: Complex task handling logic in runInThread.

The method handles both normal and time-based tasks with appropriate checks and synchronization. However, the complexity of this method could benefit from further comments or refactoring to improve readability.


316-350: LGTM!

The ReDelaySchedule, CreateMissingNewerLinks, and LinkOne methods correctly handle their respective tasks.

@QlQlqiqi
Contributor Author

My test results are as follows; command: redis-benchmark -h localhost -p 9221 -t get,set -n 10000000 -r 10000000000000 -d 512 -c 32 --threads 16

binlog and cache disabled, worker num 30
queue_slow_size_ is: (std::max(10UL, std::min(worker_num * max_queue_size / 100, max_queue_size)))

unstable branch, nworkers_per_link_ = 1:
====== SET ======
Summary:
  throughput summary: 31475.25 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.008     0.080     0.895     1.815     2.991    42.175
====== GET ======
Summary:
  throughput summary: 70437.91 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.443     0.040     0.359     1.039     1.703    21.743

original pika branch:
====== SET ======
Summary:
  throughput summary: 29344.70 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.082     0.072     0.999     1.887     2.583    38.271
====== GET ======
Summary:
  throughput summary: 34520.49 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.918     0.040     0.711     2.575     3.831    72.255

Conclusions

  1. For set, both QPS and p99 improve.
  2. For get, QPS improves slightly, but p99 regresses.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 055a479 and 30e2a96.

Files selected for processing (2)
  • conf/pika.conf (1 hunks)
  • src/net/src/thread_pool.cc (5 hunks)
Additional comments not posted (16)
src/net/src/thread_pool.cc (7)

19-22: LGTM! The change enhances the function's flexibility.

The function now accepts a structured argument, improving control over task execution.


50-72: LGTM! The constructor refactoring enhances adaptability.

The changes introduce a more complex internal state, improving the pool's adaptability to varying workloads.


81-81: LGTM! The dynamic calculation of the maximum queue size is a good enhancement.

This change improves the pool's adaptability to varying workloads.


96-98: LGTM! The notification mechanism for stopping worker threads is a good enhancement.

This change ensures that all worker threads are correctly stopped.


145-150: LGTM! The notification mechanism for delayed tasks is a good enhancement.

This change ensures that worker threads are correctly notified of new tasks.


164-202: LGTM! The retry mechanism ensures all tasks are executed.

This change improves the thread pool's responsiveness and throughput.


205-215: LGTM! The notification mechanism for re-queued tasks is a good enhancement.

This change ensures that worker threads are correctly notified of re-queued tasks.

conf/pika.conf (9)

24-24: LGTM! Increasing the number of threads enhances concurrency.

This change allows for greater concurrency in handling requests.


28-28: LGTM! Increasing the thread pool size enhances concurrency.

This change indicates a substantial increase in the number of threads available for user requests.


567-567: Verify the decrease in cache max memory.

Decreasing the cache max memory may affect performance. Ensure that this change is intentional and based on performance testing.


718-718: LGTM! Changing the compaction strategy may improve performance.

This change may enhance performance during data writes and reads.


180-180: LGTM! Increasing the number of log files enhances log retention.

This change may improve log retention and analysis.


375-375: LGTM! Increasing the number of background jobs enhances concurrency.

This change allows for more concurrent background operations.


381-381: LGTM! Changing the number of background flushes may improve performance.

This change may enhance performance during data writes.


390-390: LGTM! Changing the number of background compactions may improve performance.

This change may enhance performance during data writes.


676-676: Verify the increase in write buffer size.

Increasing the write buffer size may improve write performance but could generate heavier IO load when flushing from buffer to disk. Ensure that this change is intentional and based on performance testing.

return nullptr;
}

int ThreadPool::Worker::start() {
  if (!start_.load()) {
-   if (pthread_create(&thread_id_, nullptr, &WorkerMain, thread_pool_) != 0) {
+   if (pthread_create(&thread_id_, nullptr, &WorkerMain, &arg_) != 0) {

Consider improving error handling in thread creation.

When pthread_create fails, consider logging the error or throwing an exception to help with debugging and fault tolerance.

- if (pthread_create(&thread_id_, nullptr, &WorkerMain, &arg_) != 0) {
+ if (int err = pthread_create(&thread_id_, nullptr, &WorkerMain, &arg_); err != 0) {
+     throw std::runtime_error(std::string("Failed to create thread: ") + std::strerror(err));  // needs <cstring>, <stdexcept>, <string>
  }
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- if (pthread_create(&thread_id_, nullptr, &WorkerMain, &arg_) != 0) {
+ if (int err = pthread_create(&thread_id_, nullptr, &WorkerMain, &arg_); err != 0) {
+   throw std::runtime_error(std::string("Failed to create thread: ") + std::strerror(err));
  }

Comment on lines +118 to +127
  node_cnt_++;
  // stop until the size of tasks queue is not greater than max_queue_size_
  while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
    std::this_thread::yield();
    // pstd::SleepForMicroseconds(1);
  }
  // slow like above
  if (node_cnt_.load(std::memory_order_relaxed) >= queue_slow_size_) {
    std::this_thread::yield();
  }

Consider using condition variables to reduce busy-waiting.

The use of std::this_thread::yield() in a loop can lead to high CPU usage. Consider using condition variables or other synchronization mechanisms to reduce busy-waiting.

- while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
-   std::this_thread::yield();
-   // pstd::SleepForMicroseconds(1);
- }
+ std::unique_lock lock(mu_);
+ rsignal_.wait(lock, [this]{ return node_cnt_.load(std::memory_order_relaxed) < max_queue_size_; });
Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
  node_cnt_++;
  // stop until the size of tasks queue is not greater than max_queue_size_
- while (node_cnt_.load(std::memory_order_relaxed) >= max_queue_size_) {
-   std::this_thread::yield();
-   // pstd::SleepForMicroseconds(1);
- }
+ std::unique_lock lock(mu_);
+ rsignal_.wait(lock, [this]{ return node_cnt_.load(std::memory_order_relaxed) < max_queue_size_; });
  // slow like above
  if (node_cnt_.load(std::memory_order_relaxed) >= queue_slow_size_) {
    std::this_thread::yield();
  }

@luky116 luky116 removed the 3.5.5 label Aug 7, 2024
@QlQlqiqi
Contributor Author

QlQlqiqi commented Sep 6, 2024

OS: Ubuntu 22.04.3
CPU: 7900X
memory: 16G * 2
The following conf was used:

# 360 dba pika conf pika3.5.2
port : 9221
thread-num : 8
log-path : ./log/
loglevel : info
db-path : ./db/
write-buffer-size : 256M
timeout : 30
#requirepass : 06154eee364854d5
#masterauth : 06154eee364854d5
#userpass : 06154eee364854d5360
#userblacklist : bgsave,dumpoff,client
dump-prefix : pika-
dump-expire : 1
pidfile : .pika.pid
daemonize : yes
dump-path : ./dump/block_cache_size
maxclients : 20000
target-file-size-base : 20971520
expire-logs-days : 7
expire-logs-nums : 300
root-connection-num : 10
slowlog-log-slower-than : 100000
binlog-file-size : 104857600
compression : snappy
db-sync-path : ./dbsync
db-sync-speed : 60
slowlog-write-errorlog : yes
small-compaction-threshold : 5000
max-write-buffer-size : 20737418240
max-cache-files : 8000
replication-num : 0
consensus-level : 0
max-cache-statistic-keys : 0
thread-pool-size : 50
slowlog-write-errorlog : yes
default-slot-num : 1024
instance-mode : classic
databases : 1
sync-thread-num : 1
arena-block-size : 33554432
max-background-jobs : 12
max-background-flushes : 3
max-background-compactions : 9
rate-limiter-bandwidth : 1099511627776
db-instance-num : 1
block-size : 4096
#block-cache : 5368709120
block-cache : 4294967296
max-subcompactions : 8

#cache-maxmemory : 5368709120
cache-lfu-decay-time : 1
cache-maxmemory-samples : 5
cache-maxmemory-policy : 1
cache-num : 8
cache-model : 0
zset-cache-field-num-per-key : 512
zset-cache-start-direction : 0
cache-type :

share-block-cache : yes
throttle-bytes-per-second : 102400000
max-rsync-parallel-num : 4
write-binlog : no
slotmigrate : no

# Pika automatic compact compact strategy, a complement to rocksdb compact.
# Trigger the compact background task periodically according to `compact-interval`
# Can choose `full-compact` or `obd-compact`.
# obd-compact https://github.com/OpenAtomFoundation/pika/issues/2255
compaction-strategy : obd-compact

# For OBD_Compact 
# According to the number of sst files in rocksdb, 
# compact every `compact-every-num-of-files` file.
compact-every-num-of-files : 50

# For OBD_Compact 
# In another search, if the file creation time is 
# greater than `force-compact-file-age-seconds`, 
# a compaction of the upper and lower boundaries 
# of the file will be performed at the same time 
# `compact-every-num-of-files` -1
force-compact-file-age-seconds : 300

# For OBD_Compact 
# According to the number of sst files in rocksdb, 
# compact every `compact-every-num-of-files` file.
force-compact-min-delete-ratio : 10

# For OBD_Compact 
# According to the number of sst files in rocksdb, 
# compact every `compact-every-num-of-files` file.
dont-compact-sst-created-in-seconds : 20

# For OBD_Compact 
# According to the number of sst files in rocksdb, 
# compact every `compact-every-num-of-files` file.
best-delete-min-ratio : 40

The test command and results are as follows:

redis-benchmark -h localhost -p 9221 -t set,get -n 10000000 -r 10000000000000 -d 512 -c 64 --threads 32

Current branch
====== SET ======
Summary:
  throughput summary: 41356.49 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.536     0.088     1.423     2.639     3.655    47.583
====== GET ======
Summary:
  throughput summary: 77099.22 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        0.819     0.040     0.751     1.519     2.359    39.071

pika unstable branch
====== SET ======
Summary:
  throughput summary: 35311.73 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.799     0.064     1.671     3.191     4.223    41.215
====== GET ======
Summary:
  throughput summary: 46570.11 requests per second
  latency summary (msec):
          avg       min       p50       p95       p99       max
        1.359     0.032     1.199     2.975     4.007    38.079

@baixin01
Collaborator

@cheniujh review

@QlQlqiqi
Contributor Author

The P99 issue of this PR is resolved in conjunction with #2816.

@Issues-translate-bot

Bot detected the issue body's language is not English, translate it automatically.


This pr's p99 issue is resolved in conjunction with #2816.

Labels
4.0.1 ✏️ Feature New feature or request
7 participants