
[exporter/prometheusremotewrite] hookup WAL #7304

Conversation

Aneurysm9 (Member)

Description: Enables a write-ahead log for the PRW exporter to provide durable temporary storage of metric data in flight.

odeke-em and others added 5 commits September 11, 2021 03:19
This change completes the WAL capability for the Prometheus Remote-Write exporter,
allowing recovery of data when prior write requests had not yet been exported.

This wires up a Write-Ahead-Log (WAL) that was implemented in
PR open-telemetry/opentelemetry-collector#3597, which
was split off from PR open-telemetry/opentelemetry-collector#3017.

Note: there is a rare condition in which the same data may be sent more than once.
It can happen if the RemoteWrite endpoint has already received the prior data, but
the process gets a Ctrl+C or kill signal before the WAL is truncated. The window is
very small, so hitting it would require precise timing.

Replaces PR open-telemetry/opentelemetry-collector#3017
Updates PR open-telemetry/opentelemetry-collector#3597
Fixes open-telemetry#4751
Fixes open-telemetry/wg-prometheus#9
Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL

Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
@Aneurysm9 force-pushed the exporter-prometheusremotewrite-hookup-WAL branch from a4b9205 to 875897b on January 21, 2022 22:25
Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL

Signed-off-by: Anthony J Mirabella <[email protected]>
@Aneurysm9 force-pushed the exporter-prometheusremotewrite-hookup-WAL branch from 6cadf74 to c055725 on January 21, 2022 23:32
Signed-off-by: Anthony J Mirabella <[email protected]>
@dashpole (Contributor) left a comment

lgtm overall

exporter/prometheusremotewriteexporter/wal.go
go func() {
    prwe.continuallyPopWALThenExport(ctx, func() { close(waitUntilStartedCh) })
}()
<-waitUntilStartedCh
Contributor:

does the waitUntilStartedCh actually do anything here? It looks like signalStart is called after just creating a timer.

Member Author:

Not sure about the purpose of this. It feels like an attempt to ensure that the scheduler has started work on the just-launched goroutine, even when GOMAXPROCS == 1, while avoiding the use of a sync.Once inside the for loop.
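
As an aside, a minimal self-contained sketch of the pattern being discussed (names such as runLoop and startedCh are illustrative, not the exporter's actual API): closing a channel lets the caller block until the goroutine has at least been scheduled, but it proves nothing beyond that.

package main

import (
    "fmt"
    "time"
)

// runLoop stands in for the WAL-draining goroutine: it signals that it has
// started by closing startedCh, then waits on its timer.
func runLoop(startedCh chan<- struct{}, stopCh <-chan struct{}) {
    timer := time.NewTimer(time.Minute)
    defer timer.Stop()

    // Closing the channel unblocks anyone waiting on <-startedCh. Note that
    // this only shows the goroutine ran far enough to create the timer; it
    // does not guarantee any WAL entries have been processed yet.
    close(startedCh)

    for {
        select {
        case <-stopCh:
            return
        case <-timer.C:
            fmt.Println("timer fired; a batch would be exported here")
            timer.Reset(time.Minute)
        }
    }
}

func main() {
    startedCh := make(chan struct{})
    stopCh := make(chan struct{})

    go runLoop(startedCh, stopCh)

    // Block until the goroutine has at least begun running, even when
    // GOMAXPROCS == 1 and the scheduler would otherwise keep executing main.
    <-startedCh
    fmt.Println("run loop started")
    close(stopCh)
}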

…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
@Aneurysm9 added the "ready to merge" label (Code review completed; ready to merge by maintainers) on Jan 28, 2022
exporter/prometheusremotewriteexporter/README.md
RemoteWriteQueue: RemoteWriteQueue{NumConsumers: 1},
WAL: &WALConfig{
    Directory:         tempDir,
    TruncateFrequency: 60 * time.Microsecond,
Member:

What are the implications of this? Wouldn't this be too hard on the CPU used in the CI machines? Perhaps 1ms wouldn't be too bad in this case?

Member:

Or will it just truncate once?

Member Author:

I've removed this as it doesn't seem to affect the tests. I expect truncation is gated on the batch size rather than the frequency.
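
For comparison, a hedged sketch of a less aggressive test configuration, assuming the Config, RemoteWriteQueue, and WALConfig fields that appear in the diff above; with a one-minute frequency, truncation in a short-lived test is effectively driven by batch size rather than the timer.

package prometheusremotewriteexporter

import (
    "testing"
    "time"
)

// TestWALConfigSketch builds a WAL-enabled exporter config with a
// conservative truncate frequency so CI machines are not kept busy with
// near-continuous truncation.
func TestWALConfigSketch(t *testing.T) {
    cfg := &Config{
        RemoteWriteQueue: RemoteWriteQueue{NumConsumers: 1},
        WAL: &WALConfig{
            Directory:         t.TempDir(),
            TruncateFrequency: time.Minute,
        },
    }
    if cfg.WAL.TruncateFrequency < time.Second {
        t.Fatalf("truncate frequency %v is likely too aggressive for CI", cfg.WAL.TruncateFrequency)
    }
}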

exporter/prometheusremotewriteexporter/exporter_test.go
exporter/prometheusremotewriteexporter/wal.go
case <-timer.C:
    shouldExport = true
default:
    shouldExport = len(reqL) >= maxCountPerUpload
Member:

When will this be called? Won't this select always fall into the timer.C?

Member Author:

Most often I'd expect this to hit the default case, actually. Unless the timer is very short I would expect multiple metrics to be processed from the WAL between timer expirations. The default timer length is 1min and readPrompbFromWAL() looks like it shouldn't block for more than about 8s at the outside.
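
A small self-contained illustration of why the default case dominates (maxCountPerUpload is borrowed from the diff; the surrounding batching loop is invented for the example): a select with a default case never blocks, so the timer branch is only taken once the timer has actually fired.

package main

import (
    "fmt"
    "time"
)

func main() {
    const maxCountPerUpload = 3
    timer := time.NewTimer(time.Minute)
    defer timer.Stop()

    var reqL []string
    for _, req := range []string{"a", "b", "c", "d"} {
        reqL = append(reqL, req)

        var shouldExport bool
        select {
        case <-timer.C:
            // Only taken if the timer has already fired; with a one-minute
            // timer and fast WAL reads this branch is rarely chosen.
            shouldExport = true
        default:
            // The common path: export as soon as enough requests are batched.
            shouldExport = len(reqL) >= maxCountPerUpload
        }

        if shouldExport {
            fmt.Printf("exporting a batch of %d requests\n", len(reqL))
            reqL = reqL[:0]
        }
    }
}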

    return nil
}

if errL := prwe.exportSink(ctx, reqL); len(errL) != 0 {
Member:

Shouldn't the lock include the export sink? A concurrent call to this would potentially send the same "front" twice.

Member Author:

I don't think locking is a concern here, but walking back from here does have me concerned that an error returned here would stop the goroutine that reads from the WAL, but not the exporter, and there doesn't appear to be a mechanism to restart reading from the WAL.

I need to spend some more time with this.
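
A minimal sketch, entirely illustrative rather than the exporter's actual code, of the shape the later "Ensure WAL processor keeps running after error" commit points at: export errors are surfaced but the drain loop keeps going instead of returning and leaving the WAL unread.

package main

import (
    "context"
    "errors"
    "fmt"
)

// drainOnce stands in for reading one batch from the WAL and exporting it;
// the first attempt fails to simulate a transient remote-write error.
func drainOnce(ctx context.Context, attempt int) error {
    if attempt == 0 {
        return errors.New("transient remote-write failure")
    }
    fmt.Println("batch exported")
    return nil
}

// run keeps the WAL-draining loop alive across export errors: an error is
// logged and the loop continues, rather than stopping the goroutine while
// the exporter itself keeps running.
func run(ctx context.Context, iterations int) {
    for attempt := 0; attempt < iterations; attempt++ {
        if ctx.Err() != nil {
            return
        }
        if err := drainOnce(ctx, attempt); err != nil {
            fmt.Println("export failed, will retry on next pass:", err)
            continue
        }
    }
}

func main() {
    run(context.Background(), 3)
}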

@Aneurysm9 removed the "ready to merge" label (Code review completed; ready to merge by maintainers) on Jan 28, 2022
@Aneurysm9 marked this pull request as draft on January 28, 2022 20:44
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
…p-WAL' into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
…p-WAL' into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
Signed-off-by: Anthony J Mirabella <[email protected]>
Signed-off-by: Anthony J Mirabella <[email protected]>
@Aneurysm9 marked this pull request as ready for review on February 28, 2022 18:38
Signed-off-by: Anthony J Mirabella <[email protected]>
…tor-contrib into exporter-prometheusremotewrite-hookup-WAL
CHANGELOG.md
@codeboten (Contributor) left a comment

There are a few go.mod/go.sum files changed in this PR that don't appear to be related to the change; please revert those.

@jpkrohling enabled auto-merge (squash) on March 7, 2022 18:47
@jpkrohling merged commit fc80f17 into open-telemetry:main on Mar 7, 2022
@Aneurysm9 deleted the exporter-prometheusremotewrite-hookup-WAL branch on March 7, 2022 19:54
foxlegend pushed a commit to foxlegend/opentelemetry-collector-contrib that referenced this pull request May 16, 2022
* exporter/prometheusremotewrite: glue up and use Write-Ahead-Log

This change completes the WAL capability for the Prometheus Remote-Write exporter,
allowing recovery of data when prior write requests had not yet been exported.

This wires up a Write-Ahead-Log (WAL) that was implemented in
PR open-telemetry/opentelemetry-collector#3597, which
was split off from PR open-telemetry/opentelemetry-collector#3017.

Note: there is a rare condition in which the same data may be sent more than once.
It can happen if the RemoteWrite endpoint has already received the prior data, but
the process gets a Ctrl+C or kill signal before the WAL is truncated. The window is
very small, so hitting it would require precise timing.

Replaces PR open-telemetry/opentelemetry-collector#3017
Updates PR open-telemetry/opentelemetry-collector#3597
Fixes open-telemetry#4751
Fixes open-telemetry/wg-prometheus#9

* fix go.mod

Signed-off-by: Anthony J Mirabella <[email protected]>

* exporter/prw: WALConfig should be exported to allow programmatic manipulation

Signed-off-by: Anthony J Mirabella <[email protected]>

* Add CHANGELOG entry

Signed-off-by: Anthony J Mirabella <[email protected]>

* fix lint error

Signed-off-by: Anthony J Mirabella <[email protected]>

* mod tidy

Signed-off-by: Anthony J Mirabella <[email protected]>

* tidy up WAL logic and comments

Signed-off-by: Anthony J Mirabella <[email protected]>

* Handle error from closing WAL

Signed-off-by: Anthony J Mirabella <[email protected]>

* Ensure WAL processor keeps running after error

Signed-off-by: Anthony J Mirabella <[email protected]>

* prwe/WAL: refactor WAL run loop

Signed-off-by: Anthony J Mirabella <[email protected]>

* make gotidy

Signed-off-by: Anthony J Mirabella <[email protected]>

* lint

Signed-off-by: Anthony J Mirabella <[email protected]>

* fix data races

Signed-off-by: Anthony J Mirabella <[email protected]>

* Ensure locking around entire WAL persistence routine

Signed-off-by: Anthony J Mirabella <[email protected]>

* Address PR feedback

Signed-off-by: Anthony J Mirabella <[email protected]>

* fix lint issues

Signed-off-by: Anthony J Mirabella <[email protected]>

* update read index inside WAL reader routine

Signed-off-by: Anthony J Mirabella <[email protected]>

* Update CHANGELOG.md

* Undo unrelated go.mod changes

Signed-off-by: Anthony J Mirabella <[email protected]>

Co-authored-by: Emmanuel T Odeke <[email protected]>
Co-authored-by: Alex Boten <[email protected]>
Co-authored-by: Juraci Paixão Kröhling <[email protected]>