
Getting 'Error finding Wavefront Alert' sometimes #16

Closed · stephenchu opened this issue Nov 2, 2017 · 8 comments

@stephenchu commented Nov 2, 2017

Occasionally I am getting Error finding Wavefront Alert errors:

wavefront_alert.1498186013614: Refreshing state... (ID: 1498186013614)
wavefront_alert.1494626662446: Refreshing state... (ID: 1494626662446)
wavefront_alert.1475626739625: Refreshing state... (ID: 1475626739625)
Error refreshing state: 11 error(s) occurred:

* wavefront_alert.1489083522148: 1 error(s) occurred:

* wavefront_alert.1489083522148: wavefront_alert.1489083522148: Error finding Wavefront Alert 1489083522148. Post https://metrics.wavefront.com/api/v2/search/alert: http: server closed idle connection
* wavefront_alert.1509396643007: 1 error(s) occurred:

* wavefront_alert.1509396643007: wavefront_alert.1509396643007: Error finding Wavefront Alert 1509396643007. Post https://metrics.wavefront.com/api/v2/search/alert: http: server closed idle connection
* wavefront_alert.1487887261795: 1 error(s) occurred:

* wavefront_alert.1487887261795: wavefront_alert.1487887261795: Error finding Wavefront Alert 1487887261795. Post https://metrics.wavefront.com/api/v2/search/alert: http: server closed idle connection
* wavefront_alert.1458691979355: 1 error(s) occurred:

Is it the case that WF's search API is not that reliable? Or is it that the alerts.Find() call here is not waiting long enough for results to be returned?

Is there any way to turn on "debug" logging for the requests/responses made by the provider, so I can see more information?

Thanks!

@louism517 (Contributor)

Hi,

I suspect this is some sort of low-level transport problem, perhaps caused by a mismatch between keep-alives on the client side and the idle timeout on the server.

The link you posted is actually for Events, not Alerts, but the underlying client is the same.

I think this could be fixed by disabling keep-alives on the HTTP transport (https://golang.org/pkg/net/http/#Transport) so that each HTTP request uses a separate TCP connection instead of trying to reuse them; a sketch of the idea is below. I'm happy to give this a go.
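
For illustration, a minimal sketch of that change, assuming we can control the *http.Client handed to the underlying Wavefront client (the actual wiring in the provider may differ, and the helper name here is made up):

```go
package main

import (
	"net/http"
	"time"
)

// newWavefrontHTTPClient builds a client whose transport has keep-alives
// disabled, so every request opens a fresh TCP connection rather than
// reusing one the server may have already closed.
func newWavefrontHTTPClient() *http.Client {
	return &http.Client{
		Timeout: 30 * time.Second,
		Transport: &http.Transport{
			DisableKeepAlives: true,
		},
	}
}
```

The trade-off is a little extra connection-setup overhead per request, which should be negligible at a few hundred alerts per run.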

@louism517 (Contributor)

Just some further questions:

  • Roughly how many alerts are you managing?
  • Does this happen every time you run a terraform apply, or are you otherwise able to reproduce the issue?
  • Do some of the alerts get through, or do you see issues for every alert?
  • Are you happy to build your own version of the plugin to test any fixes?

We are discussing how we can implement debug logging. The facility exists on the Wavefront client; we just need to figure out the best way to enable it via Terraform. One rough idea is sketched below.
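
To sketch one possible approach (illustrative only -- these names are not the provider's actual API): wrap whatever transport the Wavefront client uses in a logging http.RoundTripper, and let Terraform surface the output when run with TF_LOG=DEBUG:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
)

// debugTransport wraps another RoundTripper and logs each outgoing request
// and incoming response in full.
type debugTransport struct {
	next http.RoundTripper
}

func (d *debugTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if dump, err := httputil.DumpRequestOut(req, true); err == nil {
		log.Printf("[DEBUG] wavefront request:\n%s", dump)
	}
	resp, err := d.next.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	if dump, dumpErr := httputil.DumpResponse(resp, true); dumpErr == nil {
		log.Printf("[DEBUG] wavefront response:\n%s", dump)
	}
	return resp, nil
}

// newDebugHTTPClient is a hypothetical constructor showing how the wrapper
// would be plugged in.
func newDebugHTTPClient() *http.Client {
	return &http.Client{Transport: &debugTransport{next: http.DefaultTransport}}
}
```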

@stephenchu (Author)

  • ~400 alerts
  • It doesn't happen on every apply; it happens randomly in either plan or apply. I also tried playing with the -parallelism=N setting, but I recall it even happened when I tuned it down to N=3. At the same time, I have also seen it run successfully with N=20.
  • Most alerts do go through; usually only a handful (or fewer) fail when it happens.
  • Better: let me give you a rough Dockerfile so you can simulate it in the same environment where I see the errors. Inside the container is where I run the terraform {plan,apply} that results in this error.

NOTE: I tried using a different WF CLI client to see if I am actually being rate-limited by WF, but it didn't seem to fail -- although the wf alert describe <alert-id> command I used together with GNU parallel with -j 5 (N=5) was hitting a different WF API endpoint (rather than the search endpoint your terraform-provider-wavefront is using).

Gist: https://gist.github.com/stephenchu/07df8b63971e72d3ea54cd9ae44182d0

Sorry - it probably won't work out-of-the-box on all of these, but I'm sure you can figure it all out.

TIA

@stephenchu (Author) commented Nov 14, 2017

I seem to be getting this error 9 out of 10 times now somehow... :-(

Looks related: golang/go#22158

@louism517 (Contributor)

Hi @stephenchu,

Apologies for the delay, we are working to allocate some time in our internal sprint to look into this.

@stephenchu (Author)

No problem at all. I'm already grateful for your attention on this matter.

One thought on how it could be fixed: if this was indeed due to the aforementioned Go bug, then it is a bug exposed when doing an HTTP POST. In this case, though, the "POST to search by alert id" could also be done as a "GET to describe by alert id", which should be enough to sidestep the bug (a rough sketch is at the end of this comment).

But of course, if you think turning off keep-alives is a better and cheaper fix, I have no problem with that.
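
To illustrate what I mean, a very rough sketch of the GET-by-id approach, assuming the standard Wavefront endpoint GET /api/v2/alert/{id} (in the provider this would presumably go through the go-wavefront client rather than raw HTTP like this):

```go
package main

import (
	"fmt"
	"net/http"
)

// alertExists is a hypothetical helper: instead of POSTing to
// /api/v2/search/alert, it fetches the alert directly by id with a GET.
func alertExists(client *http.Client, baseURL, token, alertID string) (bool, error) {
	req, err := http.NewRequest(http.MethodGet,
		fmt.Sprintf("%s/api/v2/alert/%s", baseURL, alertID), nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()

	switch resp.StatusCode {
	case http.StatusOK:
		return true, nil
	case http.StatusNotFound:
		return false, nil
	default:
		return false, fmt.Errorf("unexpected status %d for alert %s", resp.StatusCode, alertID)
	}
}
```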

@louism517 (Contributor)

Hi @stephenchu,

I'm afraid I'm unable to reproduce this issue. I have successfully applied, planned and destroyed 400 test alerts several times without hitting the problem.

I'm sure you're aware, but there was a known issue with parallel requests made against the Wavefront API, which they have recently fixed (see this commit). Before that fix, you should have been running updates with -parallelism=1.

All of my repro attempts have been made with no explicit parallelism setting. I'm thinking/hoping that the Wavefront API fix may have made this problem go away.

Could you confirm whether you are still seeing this problem? If so, there is still a chance your cluster has not been upgraded yet; in that case, could you try running with -parallelism=1?

@stephenchu (Author)

I am happy to confirm that I somehow magically no longer see the problem... I guess you were right that something about my WF account/cluster changed under the hood recently. I am closing this now.

Thanks a bunch for your time!
