Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(subscriber): use monotonic Instants for all timestamps #288

Merged
merged 3 commits into from
Feb 17, 2022

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Feb 17, 2022

Motivation

Currently, the console-subscriber crate records all timestamps as
SystemTimes. This is because they are eventually sent over the wire as
protobuf Timestamps, which can be constructed from a SystemTime.
They cannot be constructed from Instants, because Instant is opaque,
and does not expose access to the underlying OS time.

However, using SystemTime is not really correct for our use case. We
only use timestamps for calculating durations; we only have to serialize
them because some durations are calculated in the console UI rather than
in-process. We don't need timestamps that are globally consistent with
a shared timebase, but we do need monotonicity --- using SystemTime
leaves us vulnerable to clock skew, if (for example), an NTP clock skew
adjustment causes the system clock to run backwards far enough that a
poll appears to end "before" it started (as in issue #286). If we were
using monotonic Instants, all polls should always have positive
durations, but with SystemTimes, this isn't necessarily the case.

Furthermore, Instant::now() may have less performance overhead than
SystemTime::now(), at least on some platforms.

Solution

This branch changes console-subscriber to always take timestamps using
Instant::now() rather than using SystemTime::now(), and store all
timestamps as Instants. In order to convert these Instants into
SystemTimes that can be sent over the wire, we construct a reference
TimeAnchor, consisting of a paired Instant and SystemTime recorded
at the same time when the ConsoleLayer is constructed. We can then
construct "system times" that are monotonic, by calculating the duration
between a given Instant and the anchor Instant, and adding that
duration to the anchor SystemTime. These are not real system
timestamps, as they will never run backwards if the system clock is
adjusted; they are relative only to the base process start time as
recorded by the anchor. However, they are monotonic, and all durations
calculated from them will be reasonable.

This is part of the change I proposed in #254. I'm not going to close
that issue yet, though, as it also described potentially switching to
use the quanta crate rather than std::time::Instant to reduce the
overhead of recording monotonic timestamps.

Fixes #286

## Motivation

Currently, the `console-subscriber` crate records all timestamps as
`SystemTime`s. This is because they are eventually sent over the wire as
protobuf `Timestamp`s, which can be constructed from a `SystemTime`.
They cannot be constructed from `Instant`s, because `Instant` is opaque,
and does not expose access to the underlying OS time.

However, using `SystemTime` is not really correct for our use case. We
only use timestamps for calculating durations; we only have to serialize
them because some durations are calculated in the console UI rather than
in-process. We *don't* need timestamps that are globally consistent with
a shared timebase, but we *do* need monotonicity --- using `SystemTime`
leaves us vulnerable to clock skew, if (for example), an NTP clock skew
adjustment causes the system clock to run backwards far enough that a
poll appears to end "before" it started (as in issue #286). If we were
using monotonic `Instant`s, all polls should always have positive
durations, but with `SystemTime`s, this isn't necessarily the case.

Furthermore, `Instant::now()` may have less performance overhead than
`SystemTime::now()`, at least on some platforms.

## Solution

This branch changes `console-subscriber` to always take timestamps using
`Instant::now()` rather than using `SystemTime::now()`, and store all
timestamps as `Instant`s. In order to convert these `Instant`s into
`SystemTime`s that can be sent over the wire, we construct a reference
`TimeAnchor`, consisting of a paired `Instant` and `SystemTime` recorded
at the same time when the `ConsoleLayer` is constructed. We can then
construct "system times" that are monotonic, by calculating the duration
between a given `Instant` and the anchor `Instant`, and adding that
duration to the anchor `SystemTime`. These are not *real* system
timestamps, as they will never run backwards if the system clock is
adjusted; they are relative only to the base process start time as
recorded by the anchor. However, they *are* monotonic, and all durations
calculated from them will be reasonable.

This is part of the change I proposed in #254. I'm not going to close
that issue yet, though, as it also described potentially switching to
use the `quanta` crate rather than `std::time::Instant` to reduce the
overhead of recording monotonic timestamps.

Fixes #286
hawkw added a commit that referenced this pull request Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

While I was making this change, I also did a little refactoring. I
factored out the code that's shared between constructing the initial
update and subsequent updates into methods that are called with an
`Include` enum, so we can reuse them when building the first update for
a subscription and when building subsequent updates in `publish`. IMO
this makes the code much more readable.

Fixes #266
Depends on #288
Signed-off-by: Eliza Weisman <[email protected]>
@hawkw hawkw enabled auto-merge (squash) February 17, 2022 18:34
hawkw added a commit that referenced this pull request Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

While I was making this change, I also did a little refactoring. I
factored out the code that's shared between constructing the initial
update and subsequent updates into methods that are called with an
`Include` enum, so we can reuse them when building the first update for
a subscription and when building subsequent updates in `publish`. IMO
this makes the code much more readable.

Fixes #266
Depends on #288
@hawkw hawkw merged commit abc0830 into main Feb 17, 2022
@hawkw hawkw deleted the eliza/no-system-time branch February 17, 2022 18:39
hawkw added a commit that referenced this pull request Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

Fixes #266
Depends on #288
hawkw added a commit that referenced this pull request Feb 17, 2022
## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

Fixes #266
Depends on #288
hawkw added a commit that referenced this pull request Feb 17, 2022
* refac(subscriber): factor out update construction

This branch refactors how the aggregator task builds update messages. In
particular, it factors out the shared code between building the initial
update for a subscription and building the subsequent updates in
`publish` into methods that are called with an `Include` enum to
determine whether the update should include all tasks/resources/ops, or
just the ones since the last update. This makes the code in
`add_instrument_subscription` and in `publish` much easier to read.

This is just the refactor part from #289 split out into a separate
branch; this should make no functional change. If we decide that #289 is
the correct way to fix the bug, we can rebase that branch onto this, so
that it _just_ includes the bugfix. This way, we can land the refactor
anyway, because I think it's nice.

* fix(subscriber): record timestamps for updates last

## Motivation

Currently, when constructing an update message to send over the wire, a
timestamp is taken first, and then the protobuf data is constructed.
This can lead to issues where the "now" timestamp is actually _before_
timestamps present in the stats sent in that update, since the stats for
a particular task/resource/async op might be updated on another thread
after taking the update's "now" timestamp. This results in issues like
#266.

## Solution

There's no actual reason to take those timestamps *before* we assemble
the update. This branch changes the aggregator to build all the various
data updates in an update message, and *then* record the update's "now"
timestamp. Any timestamps for tasks/resources/async ops that are
recorded after the update's "now" timestamp will now be included in the
*next* update.

Fixes #266
Depends on #288
hawkw added a commit that referenced this pull request Feb 18, 2022
## 0.1.3  (2022-02-18)

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
hawkw added a commit that referenced this pull request Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276)
   ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
hawkw added a commit that referenced this pull request Feb 18, 2022
#### Features

*  add `Builder::filter_env_var` builder parameter (#276)
   ([dbdb149](dbdb149), closes [#206](206))

#### Bug Fixes

*  record timestamps for updates last (#289) ([703f1aa](703f1aa),
   closes [#266](266))
*  use monotonic `Instant`s for all timestamps (#288)
   ([abc0830](abc0830), closes [#286](286))
*  bail rather than panic when encountering clock skew (#287)
   ([24db8c6](24db8c6), closes [#286](286))
*  fix compilation on targets without 64-bit atomics (#282)
   ([5590fdb](5590fdb), closes [#279](279))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

the current poll must have started before it ended
2 participants