Add support for retrying output writes, using independent threads #298

Merged
sparrc merged 1 commit into master from cache-retry on Oct 21, 2015

Conversation

@sparrc (Contributor) commented Oct 21, 2015

Fixes #285

@sparrc (Contributor, Author) commented Oct 21, 2015

@daviesalex this change will add support for retrying failed writes.

Initially I've decided to implement it a bit more simply than a circular shared buffer: it spins off a thread for each "batch" of points and retries the write flush_retries times if it fails.
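A rough sketch of that approach in Go (the type and field names here are illustrative stand-ins, not the actual code in this PR):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// Point and Output are stand-ins for Telegraf's real types.
type Point struct{}

type Output interface {
	Write(points []Point) error
}

type Agent struct {
	output        Output
	FlushRetries  int           // the flush_retries setting
	FlushInterval time.Duration // wait between retry attempts
}

// flush spins off one goroutine per batch of points and retries the write
// up to FlushRetries times before giving the batch up.
func (a *Agent) flush(batch []Point) {
	go func() {
		for attempt := 1; attempt <= a.FlushRetries; attempt++ {
			if err := a.output.Write(batch); err == nil {
				return
			}
			log.Printf("write of %d points failed (attempt %d of %d)",
				len(batch), attempt, a.FlushRetries)
			if attempt < a.FlushRetries {
				time.Sleep(a.FlushInterval)
			}
		}
		log.Printf("dropping batch of %d points after %d failed attempts",
			len(batch), a.FlushRetries)
	}()
}

// failingOutput simulates an InfluxDB endpoint that is down.
type failingOutput struct{}

func (failingOutput) Write([]Point) error { return errors.New("connection refused") }

func main() {
	a := &Agent{output: failingOutput{}, FlushRetries: 3, FlushInterval: 100 * time.Millisecond}
	a.flush(make([]Point, 10))
	time.Sleep(time.Second) // give the background retries time to finish before exiting
}
```

The point is that each batch lives in its own goroutine, so a slow or failing write never blocks collection or flushing of later batches.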

sparrc merged commit dfc5986 into master on Oct 21, 2015
sparrc deleted the cache-retry branch on October 21, 2015 at 17:28
@daviesalex commented

@sparrc looks great; we will test this out at scale this week and report back.

@daviesalex commented

I'm not 100% sure which issue is the best place to comment on this; apologies if I have picked the wrong one. But, as promised, we tested this. Summary:

  • This works as you would expect
  • If you are at scale, you need to set the flush_interval to be the same as jitter_interval
  • Even with that, you are at risk of microbursts "on the second"

We deleted all our data and pushed a change to 1,000 hosts with flush_interval at 60s and jitter_interval at 30s. See the graph below for the moment we brought this config up (before the obvious change, the old config; after the change, 1,000 of our several thousand hosts had moved to the new config).

[graph: inbound write traffic before and after the config change]

These 1,000 hosts have pretty accurate timesync (<25µs) and are a small number of cut-through (i.e. very fast) switches away from the InfluxDB test nodes, so this is an extreme example of the problem, but we saw extreme microbursts (causing drops on Intel 10G NIC cards).

We have not yet done network inspection of capture files, but I strongly suspect we are going to see a <100µs microburst once per second. This isn't affecting InfluxDB too much, but it is giving the network hardware in between a bad day.

We have two further suggestions (in addition to making it a fixed-size buffer, per #285):

  1. We will be adjusting our config to make the two intervals the same, but I suggest that when the snap-to-a-second config is in place you add a non-configurable random sleep of between 0 and 1 second (measured in microseconds). This will prevent users who don't realize they have microbursts from experiencing the worst of the problem. (A sketch of both suggestions follows this list.)
  2. Any failures should back off exponentially up to some sane maximum. I've not dug into the code to see what the behaviour in HEAD is, but when we started InfluxDB up after taking it down with just 1k agents, the microburst of them all retrying their writes was >10G (for a fraction of a second) and was sufficient to make a commercially supported load balancer running haproxy segfault, and 10G Intel NICs on appliances rated for 14 million packets per second drop frames. I don't think we got to 14 million packets in a second, but we probably got over a few million in a few µs.
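As a rough sketch of what we mean by both suggestions (the helper names flushOffset and backoff are made up for illustration, not anything in Telegraf):

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// flushOffset returns a random delay in [0, 1s), at microsecond resolution,
// to spread flushes that would otherwise all land on the same second.
func flushOffset() time.Duration {
	return time.Duration(rand.Intn(1000000)) * time.Microsecond
}

// backoff returns the wait before retry attempt n (0-based): it doubles each
// time, starting at base and capped at maxWait, so a fleet of agents does not
// hammer a recovering server in lockstep.
func backoff(n int, base, maxWait time.Duration) time.Duration {
	d := base << uint(n)
	if d <= 0 || d > maxWait { // d <= 0 guards against shift overflow
		return maxWait
	}
	return d
}

func main() {
	rand.Seed(time.Now().UnixNano()) // must differ per process, see discussion below
	fmt.Println("this flush offset:", flushOffset())
	for n := 0; n < 8; n++ {
		fmt.Printf("retry %d waits %v\n", n, backoff(n, time.Second, time.Minute))
	}
}
```

The offset spreads the per-second flushes across the second, and the cap keeps the worst-case retry wait bounded.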

We are going to leave this configuration running for the weekend and will do more testing on Monday. Happy to do any specific testing you can suggest.

Thanks!

-Alex

@daviesalex commented

A further graph of real-world usage; this shows Rx traffic for three nodes (all writes going to one, the yellow line).

Box 1 - old client
Box 2 - new client, flush_interval=60, jitter_interval=30 on 1k clients
Box 3 - new client, flush_interval=60, jitter_interval=60 on 1k clients
Box 4 - InfluxDB fell over with shard errors (separate issue) and started rejecting a small % of writes

The spikiness in boxes 3/4 we believe is caused by clients bursting at the same time on a specific second. Our metric data is collected once per 10 seconds on the server, so we need to do more analysis to really dig into this, which will happen on Monday.

[graph: Rx traffic for the three nodes across the four periods described above]

@sparrc (Contributor, Author) commented Oct 23, 2015

Thank you @daviesalex, this is very good information. My first reaction was that we must be choosing a random number at 1s resolution, but we're actually choosing it at 1ns resolution. Most likely the problem is that each Telegraf binary is using the same seed from the rand package. I should be able to fix this by setting the random seed to the current time in nanoseconds; this should give you the same behavior as you were getting before, where the flush time depended on the Telegraf start time.
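Roughly what I have in mind, as a sketch rather than the actual patch:

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Seed the package-level RNG from the clock at startup so each Telegraf
// instance draws a different jitter sequence. Without this, every process
// using math/rand's default seed picks the same "random" jitter and still
// flushes in lockstep.
func init() {
	rand.Seed(time.Now().UnixNano())
}

// jitter returns a random sleep in [0, max), different per process once the
// seed above is in place.
func jitter(max time.Duration) time.Duration {
	if max <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(max)))
}

func main() {
	fmt.Println("this instance's jitter:", jitter(30*time.Second))
}
```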

From what I can tell, adding another random sleep of between 0 and 1 second will only result in all Telegraf instances sleeping for the same "random" amount, and the microbursts will remain.

I also see what you mean about the backoff on retries. The current implementation has the batches of points retrying independently. This means that if you have an InfluxDB server down for more than 2 flush intervals, then each Telegraf instance will have 3 batches of points backed up, and will be trying to flush those 3 batches on the same interval. Having the persistent buffer will fix this.

@sparrc (Contributor, Author) commented Oct 23, 2015

PS @daviesalex can we send you some InfluxDB swag to say "thanks" for this? If you send your t-shirt size and address to [email protected] we can get a package out to you :-)
