Fix Resty logging, timeout and retry configuration #28

anandswaminathan · 2019-06-14T20:45:54Z

Fixes the incorrect logging from the Resty library.
Removes retries from PATCH and POST requests, as the Flink POST request is not idempotent.
Uses separate timeouts for GET and POST requests.

Sample logs in JSON format:

{"json":{"app_name":"taskfailurejob","ns":"default","phase":"SubmittingJob","src":"api.go:147"},"level":"info","msg":"RESTY 2019/06/14 13:48:18 ERROR Get http://localhost:8001/api/v1/namespaces/default/services/taskfailurejob-949b65b0:8081/proxy/overview: dial tcp [::1]:8001: connect: connection refused, Attempt 3","ts":"2019-06-14T13:48:18-07:00"}
{"json":{"app_name":"taskfailurejob","ns":"default","phase":"SubmittingJob","src":"api.go:147"},"level":"info","msg":"RESTY 2019/06/14 13:48:18 ERROR Get http://localhost:8001/api/v1/namespaces/flink-operator/services/wordcount-operator-example-c931dcc2:8081/proxy/overview: dial tcp [::1]:8001: connect: connection refused, Attempt 3","ts":"2019-06-14T13:48:18-07:00"}

glaksh100 · 2019-06-14T21:19:15Z

pkg/controller/flink/client/api.go

 	var resp *resty.Response
 	var err error
 	if method == httpGet {
-		resp, err = c.client.R().Get(url)
+		client.SetTimeout(httpGetTimeOut).SetRetryCount(retryCount)


Out of curiosity, why two separate timeouts?

More discussion here: #9

The main reason is that this is a blocking call, and GET calls generally finish in milliseconds. Only the POST request in Flink causes issues under low CPU, and POST is not idempotent.

The problem with using the same timeout for GET calls (say, a 1 minute timeout) is that when the cluster is starting and pods are not ready, all requests time out because there is no job manager to serve them. This just keeps all the workers blocked. It is better to give up and poll again in the next resync period.
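
A minimal sketch of this policy, written against the Go standard library rather than Resty so that it runs on its own. The constant values, the names `policyFor`, `send` and `refusing`, and the URL are all assumptions for illustration; the values used in the PR are not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// Illustrative values only, not the constants from the PR.
const (
	getTimeout  = 5 * time.Second
	postTimeout = 30 * time.Second
	getRetries  = 2
)

// policyFor returns the timeout and the total number of attempts for a method.
// Only GET is retried; POST and PATCH are attempted once because they are
// not idempotent and a repeat could submit a job twice.
func policyFor(method string) (time.Duration, int) {
	if method == http.MethodGet {
		return getTimeout, 1 + getRetries
	}
	return postTimeout, 1
}

// send issues a body-less request and retries only on transport errors.
// On success the caller is responsible for closing resp.Body.
func send(rt http.RoundTripper, method, url string) (*http.Response, error) {
	timeout, attempts := policyFor(method)
	client := &http.Client{Transport: rt, Timeout: timeout}
	var lastErr error
	for i := 0; i < attempts; i++ {
		req, err := http.NewRequest(method, url, nil)
		if err != nil {
			return nil, err
		}
		resp, err := client.Do(req)
		if err == nil {
			return resp, nil
		}
		lastErr = err
	}
	return nil, lastErr
}

// refusing is a stand-in transport that fails every request and counts
// calls, simulating a job manager that is not up yet.
type refusing struct{ calls int }

func (r *refusing) RoundTrip(*http.Request) (*http.Response, error) {
	r.calls++
	return nil, errors.New("connection refused")
}

func main() {
	for _, m := range []string{http.MethodGet, http.MethodPost} {
		rt := &refusing{}
		_, err := send(rt, m, "http://jobmanager.invalid/overview")
		fmt.Printf("%s: attempts=%d failed=%v\n", m, rt.calls, err != nil)
	}
}
```

With these values a failing GET is attempted three times and a failing POST once. Keeping the GET timeout short bounds how long a worker can stay blocked while the job manager is unavailable.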

Cool, thanks for the context :) The difference in timeouts makes sense. Given that this would depend on some cluster-specific resources, should we make this configurable (with some sensible defaults, as you have here)?

@glaksh100 It's actually the other way around. You will see all kinds of behavior based on application configuration and resources. The intention is to arrive at a value that does not affect the operator's cycles. Also, if there is a better number for the timeout, it's better to find and set that value.

mwylde · 2019-06-17T20:38:06Z

👍

Fix Resty logging, timeout and retry configuration

8d7e2a7

anandswaminathan requested review from glaksh100, kumare3 and mwylde as code owners June 14, 2019 20:45

glaksh100 reviewed Jun 14, 2019

anandswaminathan merged commit 93f0e01 into master Jun 17, 2019

anandswaminathan mentioned this pull request Jun 17, 2019

Increase flink api timeout to avoid submitting duplicate jobs #9

Closed
