Your Swiss Army knife for swiftly processing any amount of data. Implemented for industrial and educational purposes alike.
This codebase is well-documented and commented, in an effort to expose the wonderful algorithms of data analysis to the masses.
We're ever expanding, but for now the following are implemented.
- Standard deviation, both for generic slices and clusters.
- Fast median and mean for large datasets with a limited set of possible values (clusters)
- O(n) (linear-time) algorithms, both for arbitrary generic lists (any type of number) and clusters:
  - percentile
  - median
  - standard deviation
  - mean
- Ordinary least squares for linear and polynomial regression
- Naive (O(n²)) Theil-Sen estimator for both linear and polynomial (O(n^m), where m is the degree + 1) regression (see the sketch after this list)
- Exponential/growth and power regression, with correct handling of negatives (most other applications silently ignore them)
- "best fit" method if you don't know which regression model to use
- (binary) A basic plotting feature to preview the equation in relation to the input data
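To make the naive Theil-Sen item above concrete, here is a minimal, self-contained sketch of the O(n²) linear case: the slope is the median of all pairwise slopes, and the intercept is the median of the residuals. The names `theil_sen_linear` and `median` are illustrative only, not this crate's API, and the median below sorts in O(n log n) where this library uses O(n) selection instead.

```rust
/// Sketch of the naive O(n²) Theil-Sen linear estimator.
/// Returns (slope, intercept). Illustrative only.
fn theil_sen_linear(points: &[(f64, f64)]) -> (f64, f64) {
    // Slope: median of the slopes between all O(n²) pairs of points.
    let mut slopes = Vec::new();
    for (i, &(x1, y1)) in points.iter().enumerate() {
        for &(x2, y2) in &points[i + 1..] {
            if x1 != x2 {
                slopes.push((y2 - y1) / (x2 - x1));
            }
        }
    }
    let slope = median(&mut slopes);
    // Intercept: median of the residuals y - slope * x.
    let mut residuals: Vec<f64> = points.iter().map(|&(x, y)| y - slope * x).collect();
    let intercept = median(&mut residuals);
    (slope, intercept)
}

/// Median via sorting, O(n log n). This crate uses O(n) selection instead.
fn median(values: &mut [f64]) -> f64 {
    values.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mid = values.len() / 2;
    if values.len() % 2 == 0 {
        (values[mid - 1] + values[mid]) / 2.0
    } else {
        values[mid]
    }
}

fn main() {
    let points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.5)];
    let (slope, intercept) = theil_sen_linear(&points);
    println!("y ≈ {slope:.2}x + {intercept:.2}");
}
```

Because medians are robust, a few wild outliers barely move the fitted line, which is the main draw of Theil-Sen over OLS.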
This application can be used as a library (with optional cargo features), as an interactive CLI program, or by piping data to it through standard input.
It accepts any comma- or space-separated values. Scientific notation is supported. This is minimalistic by design, as other programs may be used to produce or modify the data before it's processed by us.
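For illustration, parsing that format needs little more than splitting on commas and whitespace; Rust's `FromStr` for `f64` already accepts scientific notation. This is a hypothetical sketch, not the crate's actual parser, and `parse_values` is an invented name.

```rust
/// Parse comma/space separated numbers. Scientific notation comes for
/// free from f64's FromStr implementation. Illustrative sketch only.
fn parse_values(input: &str) -> Result<Vec<f64>, std::num::ParseFloatError> {
    input
        .split(|c: char| c == ',' || c.is_whitespace())
        // `"1.5, 2"` yields an empty piece between ',' and ' '; drop it.
        .filter(|part| !part.is_empty())
        .map(str::parse)
        .collect()
}

fn main() {
    let values = parse_values("1.5, 2 3e-2,4.2e3").unwrap();
    assert_eq!(values, [1.5, 2.0, 0.03, 4200.0]);
}
```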
Using the subcommand `completion`, std-dev automatically generates shell completions for your shell and tries to put them in the appropriate location.
When using Bash or Zsh, you should run std-dev as root, as we need root privileges to write to their completion directories.
Alternatively, use the `--print` option to write the completion file yourself.
When using this as a library, I recommend disabling all features except `base` (`std-dev = { version = "0.1", default-features = false, features = ["base"] }`) and enabling those you need.
- `bin` (default, binary feature): Enables the binary to compile.
- `prettier` (default, binary feature): Makes the binary output prettier. Includes colours and prompts for interactive use.
- `completion` (default, binary feature): Enables the ability to generate shell completions.
- `regression` (default, library and binary feature): Enables all regression estimators. This requires `nalgebra`, which provides linear algebra.
- `ols` (default, library feature): Enables the use of OLS, which is the "default" estimator. This also enables polynomial Theil-Sen for degrees > 2 & polynomial regression in `best_fit` functions.
- `arbitrary-precision` (default, library feature): Uses arbitrary-precision algebra for polynomial regression of degree > 10.
- `percentile-rand` (default, base, library feature): Enables the recommended `pivot_fn` for percentile-related functions.
- `simplify-fraction` (default, base, library feature): Fractions are simplified. Relaxes the requirements for fraction input and implements `Eq` & `Ord` for fractions.
- `generic-impls` (default, base, library feature): Makes `mean`, `standard_deviation`, and percentile resolving generic over numbers. This enables you to use numerical types from other libraries without hassle.
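As a sketch of what "generic over numbers" enables in practice, one `mean` function can serve any numeric type. The `ToF64` trait below is hypothetical; std-dev's real trait bounds differ.

```rust
// Hypothetical conversion trait; std-dev's actual bounds differ.
trait ToF64: Copy {
    fn to_f64(self) -> f64;
}
impl ToF64 for i32 {
    fn to_f64(self) -> f64 { self as f64 }
}
impl ToF64 for f32 {
    fn to_f64(self) -> f64 { self as f64 }
}

// One mean implementation for every type implementing the trait.
fn mean<T: ToF64>(values: &[T]) -> f64 {
    values.iter().map(|v| v.to_f64()).sum::<f64>() / values.len() as f64
}

fn main() {
    assert_eq!(mean(&[1i32, 2, 3]), 2.0);
    assert_eq!(mean(&[1.0f32, 2.5, 4.0]), 2.5);
}
```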
Documentation of the main branch can be found at doc.icelk.dev.
To document with information on which cargo features enable the code, set the environment variable `RUSTDOCFLAGS` to `--cfg docsrs` (e.g. in Fish, `set -x RUSTDOCFLAGS "--cfg docsrs"`) and then run `cargo +nightly doc`.
This library aims to be as fast as possible while maintaining easily readable code.
Since all algorithms now run in linear time, clusters are less essential than they once were, but they remain an interesting feature. If you already have clustered data, this feature is a great fit.
When using the clusters feature (turning your list into a `ClusterList`), calculations are done per unique value.
Say you have a dataset of infant height, in centimeters.
That's probably only going to be some 40 different values, but potentially millions of entries.
Using clusters, all that data is only processed as O(40), not O(millions). (I know that notation isn't right, but you get my point.)
Creating this cluster involves adding all the values to a map. This takes O(n) time, but is very slow compared to all the other algorithms.
After creation, most operations in this library execute in O(m) time, where m is the count of unique values.
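To make the cluster idea concrete, here is a minimal, self-contained sketch, independent of the crate's actual `ClusterList` API (all names below are illustrative): build the value-to-count map once in O(n), then compute statistics over the unique values only.

```rust
use std::collections::HashMap;

fn main() {
    // A million entries, but only 40 unique values
    // (e.g. infant heights in whole centimeters).
    let data: Vec<u32> = (0..1_000_000).map(|i| 45 + i % 40).collect();

    // O(n): collapse the data into a value -> count map. This is the
    // slow, one-time step mentioned above.
    let mut clusters: HashMap<u32, usize> = HashMap::new();
    for &value in &data {
        *clusters.entry(value).or_insert(0) += 1;
    }

    // O(m): weighted mean over the m unique values only.
    let total: usize = clusters.values().sum();
    let mean: f64 = clusters
        .iter()
        .map(|(&value, &count)| value as f64 * count as f64)
        .sum::<f64>()
        / total as f64;

    println!("{} unique values, mean = {mean}", clusters.len());
}
```

The same count-weighted approach extends to standard deviation and percentiles, which is why every subsequent operation on clustered data only costs O(m).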