Commit

Merge pull request #12 from metrics-rs/lots_of_fixins
General fixes: core::arch, benchmarking, libc vs mach, etc.
tobz authored Apr 12, 2020
2 parents c5b3200 + 1353aa5 commit 7d592b9
Showing 16 changed files with 372 additions and 317 deletions.
17 changes: 0 additions & 17 deletions .rustfmt.toml

This file was deleted.

28 changes: 19 additions & 9 deletions Cargo.toml
@@ -1,7 +1,7 @@
[package]
name = "quanta"
version = "0.4.2-alpha.0"
authors = ["Toby Lawrence <tlawrence@nuclearfurnace.com>"]
authors = ["Toby Lawrence <toby@nuclearfurnace.com>"]
edition = "2018"

license = "MIT"
@@ -16,18 +16,28 @@ readme = "README.md"

keywords = ["rdtsc", "timing", "nanosecond"]

[[bench]]
name = "timing"
harness = false

[features]
asm = ["tsc"]
tsc = []
metrics = ["metrics-core"]

[dependencies]
libc = "^0.2"
metrics-core = { version = "^0.5", optional = true }

[target.'cfg(windows)'.dependencies]
winapi = { version = "0.3.6", features = ["profileapi"] }
[target.'cfg(target_os = "linux")'.dependencies]
libc = "^0.2"

[target.'cfg(target_os = "macos")'.dependencies]
mach = "^0.3"

[target.'cfg(target_os = "ios")'.dependencies]
mach = "^0.3"

[target.'cfg(target_os = "windows")'.dependencies]
winapi = { version = "^0.3", features = ["profileapi"] }

[dev-dependencies.clocksource]
version = "^0.4"
features = []
[dev-dependencies]
criterion = "^0.3"
clocksource = "^0.5"
44 changes: 11 additions & 33 deletions README.md
@@ -28,46 +28,24 @@ __quanta__ is a high-speed timing library, useful for getting the current time _
The API documentation of this library can be found at [docs.rs/quanta](https://docs.rs/quanta/).

## general features
- time in nanoseconds
- super fast! (see the benchmarks)
- high-precision mode!
- monotonic time in nanoseconds or raw cycles
- extremely low overhead where possible
- optimized for instruction-level accuracy in measurements
- mockable!
- cross-platform! (we target Linux, Windows, macOS, Solaris, \*BSD)
- cross-platform!
- fun, science-y name!

## performance

quanta provides high-speed access to the native system timing facilities and, with optimized assembly turned off, is generally on par with the standard library and external crates:
## platform / architecture support

test bench::time_clocksource_counter ... bench: 30,060 ns/iter (+/- 2,051)
test bench::time_clocksource_counter_delta ... bench: 74,790 ns/iter (+/- 2,897)
test bench::time_clocksource_time ... bench: 30,439 ns/iter (+/- 2,571)
test bench::time_clocksource_time_delta ... bench: 60,429 ns/iter (+/- 5,393)
test bench::time_hotmic_now ... bench: 30,202 ns/iter (+/- 1,643)
test bench::time_hotmic_now_delta ... bench: 59,499 ns/iter (+/- 5,829)
test bench::time_hotmic_raw ... bench: 29,371 ns/iter (+/- 2,110)
test bench::time_hotmic_raw_delta ... bench: 66,385 ns/iter (+/- 2,904)
test bench::time_instant_delta ... bench: 64,285 ns/iter (+/- 3,311)
test bench::time_instant_now ... bench: 18,603 ns/iter (+/- 1,116)
For platforms, we have tier 1 support for Linux, Windows, and macOS/iOS. Platforms such as Solaris or the various BSDs have tier 2.5 support: `quanta` should work on them by virtue of depending on `libc`, but we don't test or build on these platforms at all.

The non-delta tests represent the time it takes to take a single time measurement, while the delta tests represent the time to take two measurements and calculate the delta. We can see that without using the optimized assembly features that both `quanta` and `clocksource` provide, taking single measurements is slower than [`Instant::now`] but generally consumes the same amount of time overall to take the measurements and calculate the delta, around 60-65ns.
Architecture-wise, x86/x86-64 and SSE2 are required for the optimized TSC codepath. This is handled transparently via compile-time target features, so you must build with the appropriate compiler flags to specify the CPU features where your binary will run, as runtime detection is not supported.
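To illustrate what "handled transparently via compile-time target features" can look like, here is a minimal sketch using only the standard `cfg!` macro. The function name `timing_path` and the labels it returns are hypothetical and are not part of `quanta`; the sketch only shows that the choice is fixed when the binary is compiled, not when it runs.

```rust
/// Reports which timing path this binary was compiled to use.
/// Hypothetical helper for illustration only; not a `quanta` API.
fn timing_path() -> &'static str {
    // `cfg!` is evaluated at compile time, so the branch below is decided
    // by the target and the enabled target features, never at runtime.
    if cfg!(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "sse2"
    )) {
        "tsc"
    } else {
        "os-clock"
    }
}

fn main() {
    println!("compiled timing path: {}", timing_path());
}
```

SSE2 is enabled by default on x86-64 targets. For other targets, CPU features can be selected at build time with flags such as `RUSTFLAGS="-C target-cpu=native"` or `-C target-feature=+sse2`.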

Using optimized assembly, things can be much faster:

test bench::time_clocksource_counter ... bench: 11,424 ns/iter (+/- 848)
test bench::time_clocksource_counter_delta ... bench: 36,813 ns/iter (+/- 2,047)
test bench::time_clocksource_time ... bench: 25,499 ns/iter (+/- 2,101)
test bench::time_clocksource_time_delta ... bench: 50,761 ns/iter (+/- 3,114)
test bench::time_hotmic_now ... bench: 18,918 ns/iter (+/- 1,591)
test bench::time_hotmic_now_delta ... bench: 38,367 ns/iter (+/- 2,134)
test bench::time_hotmic_raw ... bench: 10,984 ns/iter (+/- 814)
test bench::time_hotmic_raw_delta ... bench: 29,635 ns/iter (+/- 1,685)
test bench::time_instant_delta ... bench: 63,968 ns/iter (+/- 3,805)
test bench::time_instant_now ... bench: 18,096 ns/iter (+/- 1,381)
## performance

Both `quanta` and `clocksource` provide a way for the caller to get the "raw" measurement from the underlying time source, which is an unrefined value that needs to be scaled by a reference time source to end up as a meaningful value. This is provided for taking measurements in tight loops where the deltas can be calculated after the fact. For `clocksource`, the `counter` mode is the raw value, and `time` mode is the `Instant::now` equivalent. For `quanta`, `raw` mode and `now` are as described above.
Accessing the TSC on a modern x86 processor has an extremely low overhead of roughly 11ns, and `quanta` provides the thinnest possible layer over this. Using the native time facilities, such as `clock_gettime(CLOCK_MONOTONIC)` on Linux, you may expect to see closer to 17-18ns of overhead.

We can see that both `quanta` and `clocksource` are measurably faster than `Instant::now` both in taking the discrete measurements and computing the delta. `quanta`, however, edges out `clocksource`.
Measurements have not been taken for non-x86-based architectures/platforms.
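For a rough sense of the native-clock overhead on a platform without published numbers, a std-only sketch like the one below can help. It is not the crate's `criterion` benchmark, and the helper name `avg_now_overhead` is made up for this example; it simply averages the cost of many `Instant::now()` calls, so results will vary with the machine and the build profile.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Average cost of one `Instant::now()` call, measured over `iters` calls.
/// Hypothetical helper for illustration; not part of `quanta`.
fn avg_now_overhead(iters: u32) -> Duration {
    assert!(iters > 0, "iters must be non-zero");
    let start = Instant::now();
    for _ in 0..iters {
        // `black_box` keeps the optimizer from discarding the call.
        black_box(Instant::now());
    }
    // Total elapsed time divided by the number of calls.
    start.elapsed() / iters
}

fn main() {
    let per_call = avg_now_overhead(1_000_000);
    println!("Instant::now() costs about {:?} per call", per_call);
}
```

Build with `--release` before reading anything into the number; a debug build mostly measures unoptimized loop overhead.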

## why use this over stdlib or clocksource?

@@ -76,7 +54,7 @@ The performance alone is enough to choose this over the stdlib timing facilities
When compared to `clocksource`, though, we have a few extra features that can make the difference:

- `Clock` can be mocked, allowing you to easily control the passage of time in your tests
- `Clock` provides a `start` and `end` method which, in optimized `asm` mode, can replace calls to `raw` and provide more accuracy in the measurement of the code in between
- `Clock` provides `start` and `end` as replacements for `raw`, which are optimized for instruction-level accuracy, avoiding instruction reordering that might taint measurements
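The mocking idea in the list above can be sketched with the standard library alone. Everything here (`SketchClock`, `Source`, the shared `AtomicU64` handle) is a hypothetical stand-in and does not reproduce `quanta`'s actual `Clock` type or its signatures; it only shows how a clock can read either a real time source or a value the test controls.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::time::Instant;

/// Where a `SketchClock` reads its time from (illustrative only).
enum Source {
    /// Real time, reported as nanoseconds since the clock was created.
    Real(Instant),
    /// Mocked time, stored as nanoseconds in a shared counter.
    Mock(Arc<AtomicU64>),
}

/// A hypothetical mockable clock; not `quanta`'s `Clock`.
struct SketchClock {
    source: Source,
}

impl SketchClock {
    fn real() -> Self {
        SketchClock { source: Source::Real(Instant::now()) }
    }

    /// Returns the clock plus a handle the test uses to move time.
    fn mock() -> (Self, Arc<AtomicU64>) {
        let handle = Arc::new(AtomicU64::new(0));
        let clock = SketchClock { source: Source::Mock(Arc::clone(&handle)) };
        (clock, handle)
    }

    /// Current time in nanoseconds.
    fn now(&self) -> u64 {
        match &self.source {
            Source::Real(origin) => origin.elapsed().as_nanos() as u64,
            Source::Mock(nanos) => nanos.load(Ordering::Relaxed),
        }
    }
}

fn main() {
    let (clock, handle) = SketchClock::mock();
    let before = clock.now();
    // The test, not the wall clock, decides how much time passes.
    handle.fetch_add(1_500, Ordering::Relaxed);
    let after = clock.now();
    println!("mocked delta: {}ns", after - before);
}
```

Code under test takes the clock as a parameter, so a test can hand it the mocked variant and assert on exact deltas instead of sleeping.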

## license

8 changes: 1 addition & 7 deletions azure-pipelines.yml
@@ -2,14 +2,8 @@ trigger: ["master"]
pr: ["master"]

jobs:
# Check the crate formatting.
- template: ci/azure-rustfmt.yml

# Actually test the crate.
- template: ci/azure-test-stable.yml

# Test it to make sure it still works on our minimum version.
- template: ci/azure-test-minimum.yaml

# Now test it against nightly w/ ASM support.
- template: ci/azure-test-nightly.yml
- template: ci/azure-bench-stable.yml