bug: high cpu and memory usage #9015

monkeyDluffy6017 · 2023-03-06T11:05:13Z

Current Behavior

scenario 1

A reload or a service discovery update replaces the upstream object, and the health checker is rebuilt when the next request comes in.

Line 102 in 69df734

    
           if healthcheck_parent.checker and healthcheck_parent.checker_upstream == upstream then

With a large number of concurrent requests and a small number of upstreams, the following can happen.
Requests a, b, and c all access the same upstream. Because there is an ngx.sleep call in healthcheck.new, all three may reach position 1 before any of them finishes. Request a continues and successfully creates the checker. Request b then continues to position 2; since it uses the same upstream object as request a, healthcheck_parent.checker is not nil, so request b executes cancel_clean_handler, which sets the corresponding clean function to nil. Request b continues to position 3, where ngx.sleep is called inside add_target. Request c now starts executing; when it reaches position 2, healthcheck_parent.checker is not nil, so it also executes cancel_clean_handler.

At this point request c returns 500: the corresponding clean function has already been set to nil by request b, so calling it raises an error.

apisix/apisix/core/config_util.lua

Line 92 in 1acee1b

f(item)
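The double cancel described above can be reproduced in isolation. The following is a minimal sketch in plain Lua, assuming a simplified handler registry; `add_clean_handler` and `cancel_clean_handler` here are hypothetical stand-ins written for illustration, not the APISIX source.

```lua
-- Simplified model of a clean-handler registry (hypothetical, for illustration only).
local function add_clean_handler(item, func)
    local idx = #item.clean_handlers + 1
    item.clean_handlers[idx] = func
    return idx
end

local function cancel_clean_handler(item, idx, fire)
    local f = item.clean_handlers[idx]
    item.clean_handlers[idx] = nil
    if fire then
        f(item)   -- fails when f is nil, i.e. when the handler was already cancelled
    end
end

local released = 0
local parent = { clean_handlers = {} }   -- models healthcheck_parent
local idx = add_clean_handler(parent, function() released = released + 1 end)

-- Request b cancels the handler first and fires it: this succeeds.
local ok_b = pcall(cancel_clean_handler, parent, idx, true)

-- Request c cancels the same index: the slot is already nil, so the call fails.
local ok_c, err_c = pcall(cancel_clean_handler, parent, idx, true)

print("request b ok:", ok_b)
print("request c ok:", ok_c, err_c)
print("handlers fired:", released)
```

In this model the second cancel fails with an "attempt to call a nil value" error, which matches the kind of failure shown in the error logs of this issue.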

The checker created at position 1 can then never be released, and it registers a timer that keeps performing JSON decoding:

https://github.com/api7/lua-resty-healthcheck/blob/master/lib/resty/healthcheck.lua#L217

If the QPS is high, thousands of checkers are created that can never be freed, causing abnormal CPU and memory usage.

scenario 2

When concurrent requests reach position 1 at the same time, each of them has already created a checker. The extra checkers cannot be released later, which again results in abnormal CPU and memory usage.
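The interleaving can be modelled with plain Lua coroutines. This is only a sketch under stated assumptions: `new_checker` is a hypothetical stand-in for healthcheck.new, `coroutine.yield()` stands in for the yield caused by ngx.sleep, and `parent` stands in for healthcheck_parent.

```lua
-- Hypothetical stand-in for healthcheck.new: it yields once (like ngx.sleep)
-- before the checker object exists.
local created = 0
local function new_checker()
    coroutine.yield()              -- other requests get to run here
    created = created + 1
    return { id = created }
end

local parent = {}                  -- models healthcheck_parent

local function create_checker(upstream)
    if parent.checker and parent.checker_upstream == upstream then
        return parent.checker      -- reuse the existing checker
    end
    local checker = new_checker()  -- "position 1"
    parent.checker = checker
    parent.checker_upstream = upstream
    return checker
end

local upstream = {}
local requests = {}

-- Three requests pass the reuse check and then yield inside new_checker.
for i = 1, 3 do
    requests[i] = coroutine.create(function()
        return create_checker(upstream)
    end)
    coroutine.resume(requests[i])
end

-- Each request resumes and finishes building its own checker.
for i = 1, 3 do
    coroutine.resume(requests[i])
end

print("checkers created:", created)
print("checker kept by parent:", parent.checker.id)
```

In this model every request builds its own checker, but the parent keeps a reference only to the last one, so the others have no owner left to release them.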

Conclusion

Currently, in concurrent scenarios, a reload or a release can create many health checkers, causing abnormal CPU and memory usage.
We need to ensure that only one health checker is created for an upstream.
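One possible way to enforce this is an in-progress flag on the upstream, set before the first yield and cleared once the checker is stored. The sketch below is an assumption for illustration, not necessarily what the eventual fix does; `new_checker`, `parent`, and `is_creating_checker` are hypothetical names, and `coroutine.yield()` stands in for ngx.sleep.

```lua
-- Hypothetical stand-in for healthcheck.new, yielding once like ngx.sleep.
local created = 0
local function new_checker()
    coroutine.yield()
    created = created + 1
    return { id = created }
end

local parent = {}                       -- models healthcheck_parent

local function create_checker(upstream)
    if parent.checker and parent.checker_upstream == upstream then
        return parent.checker
    end
    if upstream.is_creating_checker then
        return nil                      -- another request is already building it
    end
    upstream.is_creating_checker = true -- set before the first yield
    local checker = new_checker()
    parent.checker = checker
    parent.checker_upstream = upstream
    upstream.is_creating_checker = nil
    return checker
end

local upstream = {}
local requests = {}

-- Only the first request gets past the flag; the others return nil at once.
for i = 1, 3 do
    requests[i] = coroutine.create(function()
        return create_checker(upstream)
    end)
    coroutine.resume(requests[i])
end

-- Resume whichever requests are still suspended inside new_checker.
for i = 1, 3 do
    if coroutine.status(requests[i]) == "suspended" then
        coroutine.resume(requests[i])
    end
end

print("checkers created:", created)
```

In this sketch, requests that lose the race get nil and would have to proceed without a checker for that request. A real implementation would also need to clear the flag if checker creation fails, which is not handled here.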

Expected Behavior

CPU and memory usage should remain normal after a reload or a service discovery update.

Error Logs

/usr/local/apisix/apisix/core/config_util.lua:79: attempt to call local 'f' (a nil value)
config_util.lua:73: cancel_clean_handler(): item.clean_handlers is nil when cancel_clean_handler

Steps to Reproduce

One upstream with dozens of nodes
High concurrency (4000+ qps)
Active health check
Reload

Environment

APISIX version (run apisix version): 2.13.1
Operating system (run uname -a): centos 7.6
OpenResty / Nginx version (run openresty -V or nginx -V): 1.19.3.1
etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info):
APISIX Dashboard version, if relevant:
Plugin runner version, for issues related to plugin runners:
LuaRocks version, for installation issues (run luarocks --version):


Concurrent requests after a reload or an update of the upstream nodes cause high CPU and memory usage. The checker created by healthcheck.new in create_checker won't be released if the program crashes after cancel_clean_handler has failed.

monkeyDluffy6017 mentioned this issue Mar 6, 2023

fix: the problem of high cpu and memory usage (#9015) #9016

Merged

5 tasks

spacewander closed this as completed in #9016 Mar 23, 2023

spacewander pushed a commit that referenced this issue Mar 23, 2023

fix: the problem of high cpu and memory usage (#9015) (#9016)

8197e03

monkeyDluffy6017 mentioned this issue Jul 18, 2023

bug: %100 cpu usage of worker process caused by healthcheck impl with error (Failed to release lock) #9775

Closed
