Merge pull request #12 from metrics-rs/lots_of_fixins

General fixes: core::arch, benchmarking, libc vs mach, etc.
metrics-rs · Apr 12, 2020 · 7d592b9
2 parents c5b3200 + 1353aa5
commit 7d592b9

Showing 16 changed files with 372 additions and 317 deletions.
diff --git a/.rustfmt.toml b/.rustfmt.toml
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,7 +1,7 @@
 [package]
 name = "quanta"
 version = "0.4.2-alpha.0"
-authors = ["Toby Lawrence <tlawrence@nuclearfurnace.com>"]
+authors = ["Toby Lawrence <toby@nuclearfurnace.com>"]
 edition = "2018"
 
 license = "MIT"
@@ -16,18 +16,28 @@ readme = "README.md"
 
 keywords = ["rdtsc", "timing", "nanosecond"]
 
+[[bench]]
+name = "timing"
+harness = false
+
 [features]
-asm = ["tsc"]
-tsc = []
 metrics = ["metrics-core"]
 
 [dependencies]
-libc = "^0.2"
 metrics-core = { version = "^0.5", optional = true }
 
-[target.'cfg(windows)'.dependencies]
-winapi = { version = "0.3.6", features = ["profileapi"] }
+[target.'cfg(target_os = "linux")'.dependencies]
+libc = "^0.2"
+
+[target.'cfg(target_os = "macos")'.dependencies]
+mach = "^0.3"
+
+[target.'cfg(target_os = "ios")'.dependencies]
+mach = "^0.3"
+
+[target.'cfg(target_os = "windows")'.dependencies]
+winapi = { version = "^0.3", features = ["profileapi"] }
 
-[dev-dependencies.clocksource]
-version = "^0.4"
-features = []
+[dev-dependencies]
+criterion = "^0.3"
+clocksource = "^0.5"
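
Note on the per-target dependency tables above: the same `target_os` predicates can be mirrored in source with `cfg!`. What follows is a minimal sketch, not code from this PR. `backend_for_target` is a hypothetical helper that only reports which platform crate the manifest would pull in for the current build target.

```rust
/// Hypothetical helper (not part of quanta): returns the name of the
/// platform crate that the manifest's target tables would select.
pub fn backend_for_target() -> &'static str {
    if cfg!(target_os = "linux") {
        "libc"
    } else if cfg!(any(target_os = "macos", target_os = "ios")) {
        "mach"
    } else if cfg!(target_os = "windows") {
        "winapi"
    } else {
        // No target-specific dependency is declared for other platforms.
        "none"
    }
}
```

Because `cfg!` is resolved at compile time, only one branch survives in the built binary, which matches how Cargo resolves the `[target.'cfg(...)'.dependencies]` tables.
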
diff --git a/README.md b/README.md
@@ -28,46 +28,24 @@ __quanta__ is a high-speed timing library, useful for getting the current time _
 The API documentation of this library can be found at [docs.rs/quanta](https://docs.rs/quanta/).
 
 ## general features
-- time in nanoseconds
-- super fast! (see the benchmarks)
-- high-precision mode!
+- monotonic time in nanoseconds or raw cycles
+- extremely low overhead where possible
+- optimized for instruction-level accuracy in measurements
 - mockable!
-- cross-platform! (we target Linux, Windows, macOS, Solaris, \*BSD)
+- cross-platform!
 - fun, science-y name!
 
-## performance
-
-quanta provides high-speed access to the native system timing facilities and in general, with optimized assembly turned off, is generally on par with the standard library and external crates:
+## platform / architecture support
 
-    test bench::time_clocksource_counter       ... bench:      30,060 ns/iter (+/- 2,051)
-    test bench::time_clocksource_counter_delta ... bench:      74,790 ns/iter (+/- 2,897)
-    test bench::time_clocksource_time          ... bench:      30,439 ns/iter (+/- 2,571)
-    test bench::time_clocksource_time_delta    ... bench:      60,429 ns/iter (+/- 5,393)
-    test bench::time_hotmic_now                ... bench:      30,202 ns/iter (+/- 1,643)
-    test bench::time_hotmic_now_delta          ... bench:      59,499 ns/iter (+/- 5,829)
-    test bench::time_hotmic_raw                ... bench:      29,371 ns/iter (+/- 2,110)
-    test bench::time_hotmic_raw_delta          ... bench:      66,385 ns/iter (+/- 2,904)
-    test bench::time_instant_delta             ... bench:      64,285 ns/iter (+/- 3,311)
-    test bench::time_instant_now               ... bench:      18,603 ns/iter (+/- 1,116)
+For platforms, we have tier 1 support for Linux, Windows, and macOS/iOS.  Platforms such as Solaris or various BSDs have tier 2.5 support: `quanta` should work on them by virtue of depending on `libc`, but we don't test or build on these platforms at all.
 
-The non-delta tests represent the time it takes to take a single time measurement, while the delta tests represent the time to take two measurements and calculate the delta.  We can see that without using the optimized assembly features that both `quanta` and `clocksource` provide, taking single measurements is slower than [`Instant::now`] but generally consumes the same amount of time overall to take the measurements and calculate the delta, around 60-65ns.
+Architecture-wise, x86/x86-64 and SSE2 are required for the optimized TSC codepath.  This is handled transparently via compile-time target features, so you must build with the appropriate compiler flags to specify the CPU features where your binary will run, as runtime detection is not supported.
 
-Using optimized assembly, things can be much faster:
-
-    test bench::time_clocksource_counter       ... bench:      11,424 ns/iter (+/- 848)
-    test bench::time_clocksource_counter_delta ... bench:      36,813 ns/iter (+/- 2,047)
-    test bench::time_clocksource_time          ... bench:      25,499 ns/iter (+/- 2,101)
-    test bench::time_clocksource_time_delta    ... bench:      50,761 ns/iter (+/- 3,114)
-    test bench::time_hotmic_now                ... bench:      18,918 ns/iter (+/- 1,591)
-    test bench::time_hotmic_now_delta          ... bench:      38,367 ns/iter (+/- 2,134)
-    test bench::time_hotmic_raw                ... bench:      10,984 ns/iter (+/- 814)
-    test bench::time_hotmic_raw_delta          ... bench:      29,635 ns/iter (+/- 1,685)
-    test bench::time_instant_delta             ... bench:      63,968 ns/iter (+/- 3,805)
-    test bench::time_instant_now               ... bench:      18,096 ns/iter (+/- 1,381)
+## performance
 
-Both `quanta` and `clocksource` provide a way for the caller to get the "raw" measurement from the underlying time source, which is an unrefined value that needs to be scaled by a reference time source to end up as a meaningful value.  This is provided for taking measurements in tight loops where the deltas can be calculated after the fact.  For `clocksource`, the `counter` mode is the raw value, and `time` mode is the `Instant::now` equivalent.  For `quanta`, `raw` mode and `now` are as described above.
+Accessing the TSC on a modern x86 processor has an extremely low overhead of roughly 11ns, and `quanta` provides the thinnest possible layer over this.  Using the native time facilities, such as `clock_gettime(CLOCK_MONOTONIC)` on Linux, you may expect to see closer to 17-18ns of overhead.
 
-We can see that both `quanta` and `clocksource` are measurably faster than `Instant::now` both in taking the discrete measurements and computing the delta.  `quanta`, however, edges out `clocksource`.
+Measurements have not been taken for non-x86-based architectures/platforms.
 
 ## why use this over stdlib or clocksource?
 
@@ -76,7 +54,7 @@ The performance alone is enough to choose this over the stdlib timing facilities
 When compared to `clocksource`, though, we have a few extra features that can make the difference:
 
 - `Clock` can be mocked, allowing you to easily control the passage of time in your tests
-- `Clock` provides a `start` and `end` method which, in optimized `asm` mode, can replace calls to `raw` and provide more accuracy in the measurement of the code in between
+- `Clock` provides `start` and `end` as replacements for `raw`, which are optimized for instruction-level accuracy, avoiding instruction reordering that might taint measurements
 
 ## license
 

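Note on the README changes above: the new text describes a TSC codepath selected by compile-time target features, and `start`/`end` calls that guard against instruction reordering. The sketch below illustrates that general technique under stated assumptions. It is not quanta's implementation: `fenced_counter` and `measure` are hypothetical names, and the choice of an `LFENCE` on both sides of `RDTSC` is an assumption made for illustration.

```rust
/// Reads a raw, unscaled counter value.
/// On x86-64 builds where SSE2 is enabled at compile time, this reads the TSC
/// with an LFENCE before and after, which is intended to keep neighbouring
/// instructions from being reordered across the read.
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
pub fn fenced_counter() -> u64 {
    use core::arch::x86_64::{_mm_lfence, _rdtsc};
    // SAFETY: this function is only compiled when the build target enables
    // SSE2, which provides LFENCE; RDTSC is present on x86-64 CPUs.
    unsafe {
        _mm_lfence();
        let ticks = _rdtsc();
        _mm_lfence();
        ticks
    }
}

/// Fallback for every other target: nanoseconds elapsed since the first call,
/// taken from the standard library's monotonic clock.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse2")))]
pub fn fenced_counter() -> u64 {
    use std::sync::OnceLock;
    use std::time::Instant;
    static ORIGIN: OnceLock<Instant> = OnceLock::new();
    ORIGIN.get_or_init(Instant::now).elapsed().as_nanos() as u64
}

/// Runs `f` between two counter reads and returns its result together with
/// the raw counter delta (cycles on the TSC path, nanoseconds on the fallback).
pub fn measure<T>(f: impl FnOnce() -> T) -> (T, u64) {
    let start = fenced_counter();
    let value = f();
    let end = fenced_counter();
    (value, end.wrapping_sub(start))
}
```

The `#[cfg(...)]` attributes pick exactly one `fenced_counter` at build time, so there is no runtime feature check, which is the trade-off the README text describes. The delta on the TSC path is in raw cycles and would still need scaling against a reference clock to become nanoseconds.
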
diff --git a/azure-pipelines.yml b/azure-pipelines.yml
@@ -2,14 +2,8 @@ trigger: ["master"]
 pr: ["master"]
 
 jobs:
-# Check the crate formatting.
 - template: ci/azure-rustfmt.yml
-
-# Actually test the crate.
 - template: ci/azure-test-stable.yml
-
-# Test it to make sure it still works on our minimum version.
 - template: ci/azure-test-minimum.yaml
-
-# Now test it against nightly w/ ASM support.
 - template: ci/azure-test-nightly.yml
+- template: ci/azure-bench-stable.yml
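
Note on the benchmark additions: with `harness = false` on the `[[bench]]` target, the bench file must supply its own entry point, which Criterion normally generates through its macros. As a rough, std-only sketch of the kind of per-call overhead such a benchmark reports, the hypothetical helper below averages the cost of `Instant::now()` over a loop. It is not the contents of the crate's `timing` bench, and it lacks the warm-up and statistical analysis Criterion performs.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical helper: average wall-clock cost, in nanoseconds, of one
/// `Instant::now()` call, measured over `iters` back-to-back calls.
pub fn instant_now_ns_per_call(iters: u32) -> f64 {
    assert!(iters > 0, "need at least one iteration");
    let begin = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from discarding the call's result.
        black_box(Instant::now());
    }
    begin.elapsed().as_nanos() as f64 / f64::from(iters)
}
```

A single run of a loop like this is noisy, so treat the number as an order-of-magnitude check rather than a replacement for the Criterion results.
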