Best algorithm for random integers #9
128 bit keys are hard... The defaults in the library are pretty much the best I was able to achieve for random, well distributed 128 bit data on a few different devices I had on hand. That said, it varies considerably based upon:
Are you able to provide any more details on these? In particular, a type definition would help if the keys are embedded in a struct or tuple. I'll try to replicate your case and see if there's a better combo.

I do know that voracious_sort does a little better on 128 bit keys most of the time, presumably because of the way its byte lookups for keys have been unrolled. I wasn't able to entirely eliminate the cost of the additional flexibility around keys that rdst provides, so voracious_sort may be better here if it's compatible with your dataset.

If you have fairly large structs / tuples containing the keys, you could try the Regions sort in this library via a custom tuner. I never got the performance I expected out of it for plain integer slices, but for larger structs it does theoretically minimize the expensive copying of data a little. Ska sort might also be good here on the single-threaded side; these also sometimes come out on top when the data is unevenly distributed.

For a custom tuner for very large 128 bit key datasets, you'll likely want the same pattern as the StandardTuner:
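As a rough sketch of that level-based dispatch pattern, here is a self-contained version. Note that `Algorithm`, `TuningParams`, and `Tuner` below are local stand-ins mirroring the shapes of rdst's `rdst::tuner` types so the example compiles on its own; consult the StandardTuner source for the real dispatch logic.

```rust
// Local stand-ins mirroring rdst's `rdst::tuner` types (not the real API),
// so this sketch is self-contained.
#[derive(Debug, PartialEq)]
enum Algorithm {
    Scanning,
    Ska,
}

struct TuningParams {
    level: usize,
    total_levels: usize,
}

trait Tuner {
    fn pick_algorithm(&self, p: &TuningParams, counts: &[usize]) -> Algorithm;
}

struct LargeKeyTuner;

impl Tuner for LargeKeyTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        // At the level covering the most significant byte of a 16-level
        // (128 bit) key, pick a multi-threaded algorithm; Ska below it.
        if p.level == p.total_levels - 1 {
            Algorithm::Scanning
        } else {
            Algorithm::Ska
        }
    }
}

fn main() {
    let t = LargeKeyTuner;
    let top = TuningParams { level: 15, total_levels: 16 };
    let low = TuningParams { level: 3, total_levels: 16 };
    assert_eq!(t.pick_algorithm(&top, &[]), Algorithm::Scanning);
    assert_eq!(t.pick_algorithm(&low, &[]), Algorithm::Ska);
    println!("ok");
}
```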
For that first byte, there are three multi-threaded choices:
You can see more detail on each algorithm in the comments at the top of their respective source files. For larger structs, you may theoretically have better luck with Regions sort and Ska sort, as they are more efficient memory-wise. Let me know if you can provide any more details on your situation and I can try to test out a few options!
Ah, I just followed the trail of GitHub issues. This tuning is worth trying:
Also worth trying:
For larger item counts, I think Regions will become useful. For smaller item counts on my M1 MBP, Ska sort actually wins when used at all levels, I presume because the structs are a little larger and Ska sort does less copying.

Full benchmark for testing:

```rust
use block_pseudorand::block_rand;
use criterion::{
    black_box, criterion_group, criterion_main, AxisScale, BatchSize, BenchmarkId, Criterion,
    PlotConfiguration, Throughput,
};
use rdst::tuner::{Algorithm, Tuner, TuningParams};
use rdst::{RadixKey, RadixSort};
use std::time::Duration;

const TOTAL_ITEMS: usize = 160_000_000;

struct MyTuner;

impl Tuner for MyTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        match p.level {
            15 => Algorithm::Scanning,
            // Also worth trying, particularly if you increase the count
            // 15 => Algorithm::Regions,
            _ => Algorithm::Ska,
        }
    }
}

#[derive(Debug, Clone, Copy)]
pub struct ComplexStruct {
    pub key: [u64; 2],
    pub val: u64,
}

impl RadixKey for ComplexStruct {
    const LEVELS: usize = 16;

    #[inline]
    fn get_level(&self, level: usize) -> u8 {
        // Levels 0-7 come from key[1], levels 8-15 from key[0].
        (self.key[1 - level / 8] >> ((level % 8) * 8)) as u8
    }
}

fn gen_input(n: usize) -> Vec<ComplexStruct> {
    let data: Vec<u64> = block_rand(n);
    let data_2: Vec<u64> = block_rand(n);
    let data_3: Vec<u64> = block_rand(n);

    (0..n)
        .map(|i| ComplexStruct {
            key: [data[i], data_2[i]],
            val: data_3[i],
        })
        .collect()
}

fn full_sort_complex(c: &mut Criterion) {
    let mut input_sets: Vec<(&str, Vec<ComplexStruct>)> =
        vec![("random", gen_input(TOTAL_ITEMS))];

    let tests: Vec<(&str, Box<dyn Fn(Vec<ComplexStruct>)>)> = vec![
        (
            "rdst_custom",
            Box::new(|mut input| {
                input.radix_sort_builder().with_tuner(&MyTuner {}).sort();
                black_box(input);
            }),
        ),
        (
            "rdst_standard",
            Box::new(|mut input| {
                input.radix_sort_builder().sort();
                black_box(input);
            }),
        ),
    ];

    let mut group = c.benchmark_group("complex_sort");
    group.sample_size(10);
    group.measurement_time(Duration::from_secs(5));
    group.warm_up_time(Duration::from_secs(5));
    group.plot_config(PlotConfiguration::default().summary_scale(AxisScale::Logarithmic));

    for set in input_sets.iter_mut() {
        let l = set.1.len();
        group.throughput(Throughput::Elements(l as u64));

        for t in tests.iter() {
            let name = format!("{} {}", set.0, t.0);
            group.bench_with_input(BenchmarkId::new(name, l), set, |bench, set| {
                bench.iter_batched(|| set.1.clone(), &*t.1, BatchSize::SmallInput);
            });
        }
    }

    group.finish();
}

criterion_group!(complex_sort, full_sort_complex);
criterion_main!(complex_sort);
```

And in `Cargo.toml`:

```toml
[[bench]]
name = "complex_sort"
harness = false
```

Run with:
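Given the `[[bench]]` entry above, the usual Criterion invocation (standard Cargo behavior, nothing rdst-specific) would be:

```shell
cargo bench --bench complex_sort
```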
With StandardTuner and 160M items on an M1 MBP:
With that custom tuner:
That structure of tuning specifically around levels is convenient if you are tuning for a single target. But once you go deeper into supporting varying sizes of input array, you'll also want to branch on the number of items being sorted.
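A sketch of what branching on item count could look like, again using local stand-ins for rdst's tuner types so it compiles on its own. The `input_len` field name is an assumption here; check rdst's `TuningParams` documentation for the actual field, and the 50M cutoff is an arbitrary illustration, not a measured threshold.

```rust
// Stand-ins mirroring rdst's `Algorithm`, `TuningParams`, and `Tuner`
// (not the real API). `input_len` is an assumed field name.
#[derive(Debug, PartialEq)]
enum Algorithm {
    Regions,
    Ska,
}

struct TuningParams {
    level: usize,
    total_levels: usize,
    input_len: usize,
}

trait Tuner {
    fn pick_algorithm(&self, p: &TuningParams, counts: &[usize]) -> Algorithm;
}

struct SizeAwareTuner;

impl Tuner for SizeAwareTuner {
    fn pick_algorithm(&self, p: &TuningParams, _counts: &[usize]) -> Algorithm {
        if p.input_len < 50_000_000 {
            // Smaller inputs: Ska at every level, per the M1 MBP observation.
            Algorithm::Ska
        } else if p.level == p.total_levels - 1 {
            // Very large inputs: go multi-threaded on the top byte.
            Algorithm::Regions
        } else {
            Algorithm::Ska
        }
    }
}

fn main() {
    let t = SizeAwareTuner;
    let small = TuningParams { level: 15, total_levels: 16, input_len: 1_000_000 };
    let big = TuningParams { level: 15, total_levels: 16, input_len: 160_000_000 };
    assert_eq!(t.pick_algorithm(&small, &[]), Algorithm::Ska);
    assert_eq!(t.pick_algorithm(&big, &[]), Algorithm::Regions);
    println!("ok");
}
```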
Based on your experience and testing, do you have a suggested best tuning for sorting random 128-bit keys?