Skip to content

Latest commit

 

History

History
88 lines (79 loc) · 7 KB

记一次请求丢失的问题定位过程.md

File metadata and controls

88 lines (79 loc) · 7 KB

记一次请求丢失的问题定位过程

一、背景

在频繁的执行http请求的时候,发现web界面报系统繁忙,在console端发现请求时发送出去了,但是服务端没有收到任何请求。

二、排查haproxy

因为我们的负载均衡是采用的开源组件haproxy,首先想到的肯定是看下入口请求有没有到来。 看了haproxy的日志,发现也没有收到请求。

三、排查console框架

因为我们的console请求从界面到server,中间还通过了我们的一个自研框架consoleframework, 该框架会根据请求的url,从组件silvan中查询该请求对应的server地址,然后将请求发送到该server上。 从consoleframework的日志来看,是成功收到了请求并且转发出去了的。

通过以上两个排查,初步怀疑会不会是haproxy处理并发的能力不行,在大量请求的情况下处理不过来?

四、haproxy测试

通过对上篇文章haproxy的实践中,可以得出,haproxy是不会丢失请求的,在日志中都会明确的显示是成功或者失败。 但是,haproxy的日志只会在server端回复client时才会打印,下面试验下:

(25219) accepted ('30.7.3.126', 40638)
recv client req, {url: http://30.7.3.124:8081/test_haproxy, params: NestedMultiDict([]), body: {"ts": "hhhh"}, client_addr: 30.7.3.126}
(25219) accepted ('30.7.3.126', 40644)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:33:55] "OPTIONS / HTTP/1.0" 200 142 0.000668
(25219) accepted ('30.7.3.126', 40668)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:00] "OPTIONS / HTTP/1.0" 200 142 0.000495
(25219) accepted ('30.7.3.126', 40670)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:00] "OPTIONS / HTTP/1.0" 200 142 0.000534
(25219) accepted ('30.7.3.126', 40708)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:10] "OPTIONS / HTTP/1.0" 200 142 0.000689
(25219) accepted ('30.7.3.126', 40780)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:25] "OPTIONS / HTTP/1.0" 200 142 0.000648
(25219) accepted ('30.7.3.126', 40808)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:30] "OPTIONS / HTTP/1.0" 200 142 0.001167
(25219) accepted ('30.7.3.126', 40810)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:30] "OPTIONS / HTTP/1.0" 200 142 0.000638
(25219) accepted ('30.7.3.126', 40858)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:40] "OPTIONS / HTTP/1.0" 200 142 0.000460
(25219) accepted ('30.7.3.126', 40930)
recv client req, {url: http://30.7.3.126:28081/, params: NestedMultiDict([]), body: , client_addr: 30.7.3.126}
30.7.3.126 - - [20/Jun/2019 20:34:55] "OPTIONS / HTTP/1.0" 200 142 0.000716
30.7.3.126,30.7.3.126 - - [20/Jun/2019 20:34:59] "POST /test_haproxy HTTP/1.1" 200 142 65.004437

这是server端收到client的请求,但是在haproxy中,还没有日志。 知道20/Jun/2019 20:34:59时间点的时候,才回复client,我们看下haproxy的日志:

2019-06-20 20:33:41.992 localhost haproxy [pid:13414]info  30.7.3.126:48353 [20/Jun/2019:20:33:41.903] cms_8098 b_def_cms_8098/controller2 1/0/88 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:33:48.046 localhost haproxy [pid:17464]info  30.7.3.126:48379 [20/Jun/2019:20:33:47.713] cms_8098 b_def_cms_8098/controller2 1/0/334 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:33:53.866 localhost haproxy [pid:19047]info  30.7.3.126:48407 [20/Jun/2019:20:33:53.782] cms_8098 b_def_cms_8098/controller3 1/0/83 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:33:59.774 localhost haproxy [pid:17464]info  30.7.3.126:48445 [20/Jun/2019:20:33:59.689] cms_8098 b_def_cms_8098/controller3 1/0/85 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:05.703 localhost haproxy [pid:17464]info  30.7.3.126:48475 [20/Jun/2019:20:34:05.608] cms_8098 b_def_cms_8098/controller1 1/0/94 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:11.768 localhost haproxy [pid:17464]info  30.7.3.126:48499 [20/Jun/2019:20:34:11.693] cms_8098 b_def_cms_8098/controller2 1/0/74 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:17.639 localhost haproxy [pid:13414]info  30.7.3.126:48527 [20/Jun/2019:20:34:17.527] cms_8098 b_def_cms_8098/controller3 1/2/109 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:23.725 localhost haproxy [pid:13414]info  30.7.3.126:48549 [20/Jun/2019:20:34:23.631] cms_8098 b_def_cms_8098/controller1 1/0/93 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:29.577 localhost haproxy [pid:19047]info  30.7.3.126:48585 [20/Jun/2019:20:34:29.499] cms_8098 b_def_cms_8098/controller1 1/0/77 4518 -- 1/0/0/0/0 0/0
2019-06-20 20:34:35.699 localhost haproxy [pid:19047]info  30.7.3.126:48617 [20/Jun/2019:20:34:35.585] cms_8098 b_def_cms_8098/controller2 1/0/111 4518 -- 1/0/0/0/0 0/0
2019-06-20 20:34:41.778 localhost haproxy [pid:19047]info  30.7.3.126:48651 [20/Jun/2019:20:34:41.699] cms_8098 b_def_cms_8098/controller3 1/5/79 4518 -- 1/0/0/0/0 0/0
2019-06-20 20:34:47.559 localhost haproxy [pid:19047]info  30.7.3.126:48679 [20/Jun/2019:20:34:47.490] cms_8098 b_def_cms_8098/controller1 1/0/68 4518 -- 1/0/0/0/0 0/0
2019-06-20 20:34:53.404 localhost haproxy [pid:19047]info  30.7.3.126:48701 [20/Jun/2019:20:34:53.305] cms_8098 b_def_cms_8098/controller2 1/3/87 4518 -- 1/0/0/0/0 0/0
2019-06-20 20:34:59.289 localhost haproxy [pid:17464]info  30.7.3.126:48737 [20/Jun/2019:20:34:59.189] cms_8098 b_def_cms_8098/controller3 1/0/100 4518 -- 0/0/0/0/0 0/0
2019-06-20 20:34:59.435 localhost haproxy [pid:19047]info  30.7.3.126:55079 [20/Jun/2019:20:33:54.428] ts_8081 b_def_ts_8081/controller1 0/0/0/65006/65006 200 142 - - ---- 0/0/0/0/0 0/0 "POST /test_haproxy HTTP/1.1"

可以看到haproxy确实是在server回复client的时候才输出的日志。

那么我们进一步怀疑,会不会是server端处理时间较长,达到console的超时设置,从而被自动取消了请求了呢?

五、检查console超时设置

consoleframework的透传文件中

<properties>
        <property name="timeout" value="60"/>
        <property name="service" value="csbs"/>
    </properties>

可以看到console的请求连接超时是设置了60s,再看下之前的界面错误,发现所有的错误time都是1分钟,status为canceled, 通过抓包debug,进一步发现status为0。 这里是对状态为0的解释:https://stackoverflow.com/questions/3825581/does-an-http-status-code-of-0-have-any-meaning

综上,我们可以得出结论,根本原因不在haproxy,还是服务端的性能问题,有些接口响应特别慢导致的。