Proxy: Query goroutine leak when store.response-timeout is set #7618

Merged: 1 commit, Aug 13, 2024

@cincinnat cincinnat commented Aug 9, 2024

time.AfterFunc() returns a time.Timer object whose C field is nil, according to the documentation. A goroutine that reads from a nil channel blocks forever, which leads to a goroutine leak on random slow queries in Thanos.

This goroutine leak is most apparent on busy services running with query.promql-engine=thanos, where goroutines tend to get stuck in batches because of the engine's wide use of sync.Once.

  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

@cincinnat cincinnat force-pushed the query-goroutine-leak branch 4 times, most recently from 910a242 to a4b9301 Compare August 9, 2024 13:10

@MichaHoffmann MichaHoffmann left a comment

Lgtm, thank you!

@saswatamcode

@cincinnat could you kindly rebase on latest main? We had a CI issue, which seems to be fixed. Want to merge this on green 🙂

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <[email protected]>
@saswatamcode saswatamcode merged commit 4050c73 into thanos-io:main Aug 13, 2024
19 of 20 checks passed
@cincinnat cincinnat deleted the query-goroutine-leak branch August 13, 2024 07:39
saswatamcode pushed a commit to saswatamcode/thanos that referenced this pull request Aug 13, 2024
…nos-io#7618)

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <[email protected]>
saswatamcode added a commit that referenced this pull request Aug 13, 2024
* Proxy: Query goroutine leak when `store.response-timeout` is set (#7618)

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <[email protected]>

* pkg/clientconfig: fix TLS configs with only CA (#7634)

065e3dd introduced a regression: TLS configurations for Thanos Ruler
query and alerting with only a CA file failed to load.

For instance, the following snippet is a valid query configuration:

```
- static_configs:
  - prometheus.example.com:9090
  scheme: https
  http_config:
    tls_config:
      ca_file: /etc/ssl/cert.pem
```

The test fixtures (CA, certificate and key files) are copied from
prometheus/common and are valid until 2072.

Signed-off-by: Simon Pasquier <[email protected]>

* Cut patch release v0.36.1

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix failing e2e test (#7620)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Mikhail Nozdrachev <[email protected]>
Signed-off-by: Simon Pasquier <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Co-authored-by: Mikhail Nozdrachev <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Co-authored-by: Harry John <[email protected]>
saswatamcode added a commit that referenced this pull request Aug 14, 2024
* CHANGELOG: Mark 0.36 as in progress

Signed-off-by: Michael Hoffmann <[email protected]>

* Cut release candidate v0.36.0-rc.0 (#7490)

Signed-off-by: Michael Hoffmann <[email protected]>

* Cut release candidate 0.36.0 rc.1 (#7510)

* *: fix server grpc histograms (#7493)

Signed-off-by: Michael Hoffmann <[email protected]>

* Close endpoints after the gRPC server has terminated (#7509)

Endpoints are currently closed as soon as we receive a SIGTERM or SIGINT.
This causes in-flight queries to get cancelled since outgoing connections
get closed instantly.

This commit moves the endpoints.Close call after the grpc server shutdown
to make sure connections are available as long as the server is running.

Signed-off-by: Filip Petkovski <[email protected]>

* Cut release candidate v0.36.0-rc.1

Signed-off-by: Michael Hoffmann <[email protected]>

---------

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Co-authored-by: Filip Petkovski <[email protected]>

* Cut release v0.36.0 (#7578)

Signed-off-by: Michael Hoffmann <[email protected]>

* Cut patch release `v0.36.1` (#7636)

* Proxy: Query goroutine leak when `store.response-timeout` is set (#7618)

time.AfterFunc() returns a time.Timer object whose C field is nil,
according to the documentation. A goroutine blocks forever on reading
from a `nil` channel, leading to a goroutine leak on random slow
queries.

Signed-off-by: Mikhail Nozdrachev <[email protected]>

* pkg/clientconfig: fix TLS configs with only CA (#7634)

065e3dd introduced a regression: TLS configurations for Thanos Ruler
query and alerting with only a CA file failed to load.

For instance, the following snippet is a valid query configuration:

```
- static_configs:
  - prometheus.example.com:9090
  scheme: https
  http_config:
    tls_config:
      ca_file: /etc/ssl/cert.pem
```

The test fixtures (CA, certificate and key files) are copied from
prometheus/common and are valid until 2072.

Signed-off-by: Simon Pasquier <[email protected]>

* Cut patch release v0.36.1

Signed-off-by: Saswata Mukherjee <[email protected]>

* Fix failing e2e test (#7620)

Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>

---------

Signed-off-by: Mikhail Nozdrachev <[email protected]>
Signed-off-by: Simon Pasquier <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Co-authored-by: Mikhail Nozdrachev <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Co-authored-by: Harry John <[email protected]>

---------

Signed-off-by: Michael Hoffmann <[email protected]>
Signed-off-by: Filip Petkovski <[email protected]>
Signed-off-by: Mikhail Nozdrachev <[email protected]>
Signed-off-by: Simon Pasquier <[email protected]>
Signed-off-by: Saswata Mukherjee <[email protected]>
Signed-off-by: 🌲 Harry 🌊 John 🏔 <[email protected]>
Co-authored-by: Michael Hoffmann <[email protected]>
Co-authored-by: Filip Petkovski <[email protected]>
Co-authored-by: Mikhail Nozdrachev <[email protected]>
Co-authored-by: Simon Pasquier <[email protected]>
Co-authored-by: Harry John <[email protected]>
hczhu-db pushed a commit to databricks/thanos that referenced this pull request Aug 22, 2024