
# Performance

PpHd edited this page Jun 1, 2024 · 149 revisions

C/C++ container library benchmark:

The benchmark code is available here.

For the following results, the best result within 5% has been selected: the benchmark was run more than 10 times, and the selected value is the lower bound of the longest run of measurements that all lie within 5% of each other.

## C libraries

| Library | List | Array | Tree | Dict S | Dict B | Sort |
|---|---|---|---|---|---|---|
| M*LIB | 572 ms | 312 ms | 296 ms | 32 ms | 704 ms | 888 ms |
| M*LIB+ | 148 ms | 316 ms | 300 ms | 36 ms | 532 ms | 884 ms |
| GLIB | 592 ms | 1356 ms | 2056 ms | 424 ms | 728 ms | 1300 ms |
| KLIB | 812 ms | 312 ms | 436 ms | 56 ms | 756 ms | 784 ms |
| CollectionC | 720 ms | 1624 ms | 1284 ms | 452 ms | 1200 ms | 2696 ms |
| LIBDYNAMIC | NA | 1092 ms | NA | 120 ms | 676 ms | NA |
| TOMMY DS | 624 ms | 752 ms | 1380 ms | 184 ms | 584 ms | NA |
| UTHASH | 564 ms | 328 ms | NA | 420 ms | 1276 ms | 1276 ms |
| QLIBC | TO | 1476 ms | 3196 ms | 1872 ms | NA | NA |
| LIBSRT | NA | 2192 ms | NA | NA | NA | 1252 ms |
| CMC | NA | 872 ms | 1404 ms | 328 ms | NA | NA |
| CTL | 592 ms | 312 ms | 1204 ms | 460 ms | NA | 1256 ms |
| STC | 584 ms | 324 ms | 908 ms | 104 ms | 640 ms | 1304 ms |
| POTTERY | 572 ms | 1680 ms | 488 ms ? | 60 ms | 908 ms | 916 ms |
| CC | 542 ms | 720 ms | NA | 113 ms | 1336 ms | NA |
| VERSTABLE | NA | NA | NA | 80 ms | 620 ms | NA |

## C++ libraries

| Library | List | Array | Tree | Dict S | Dict B | Sort |
|---|---|---|---|---|---|---|
| STL | 616 ms | 772 ms | 1228 ms | 424 ms | 1028 ms | 740 ms |
| QT | 596 ms | 1172 ms | 1300 ms | 456 ms | 840 ms | 816 ms |
| DENSE HMAP | NA | NA | NA | 48 ms | 804 ms | NA |
| FLAT HMAP | NA | NA | NA | 92 ms | 808 ms | NA |
| EMILIB | NA | NA | NA | 44 ms | 644 ms | NA |
| EMHASH | NA | NA | NA | 36 ms | 521 ms | NA |
| HOPSCOTCHMAP | NA | NA | NA | 80 ms | 844 ms | NA |
| RIGTORP HMAP | NA | NA | NA | 40 ms | 960 ms | NA |
| BOOST UNORDERED | NA | NA | NA | 38 ms | 426 ms | NA |

For these results, the best of 3 tries was kept:

| Library | Queue MPMC |
|---|---|
| M*LIB CONCURRENT+DEQUEUE | 579 ms |
| M*LIB BUFFER | 375 ms |
| M*LIB MPMC Queue | 36 ms |
| LIBLFDS MPMC | 83 ms |
| CONCURRENT QUEUE | 279 ms |
| BOOST LOCK FREE | 338 ms |
| RIGTORP MPMC Queue | 225 ms |

The versions used are:

| Component | Version |
|---|---|
| CPU | Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz |
| GCC | 10.1 |
| M*LIB | 743959ebde33e9a5452799e02c245cee02ebada2 |
| Qt | 5.7.1 |
| GLIB | 2.50.3 |
| KLIB | 928581a78413bed4efa956731b35b18a638f20f3 |
| CollectionsC | 775447bfa379417382b2db8454f19039f4d3deb4 |
| ConcurrentQueue | 38e6a6f0185a98c3aaf2a95aa109ba041221d527 |
| Emilib | 0d90f4c0cd44813f6898b417bf3f4ef574a5b136 |
| EMHASH | 7e51e4cce5ae053c3e000466163aebd620e1004d |
| Flat HMAP | 2c4687431f978f02a3780e24b8b701d22aa32d9c |
| Rigtorp HMAP | 2f91f0cf97e9a5b6881a8cdf37bf5fbd470b3ee6 |
| Rigtorp MPMC Q | b74a9f06469ec268b82eca3b5fe28dd58f312d06 |
| Hopscotch HMAP | 59e03db96365fa22de26729a412d632e0b06df3b |
| LibCollections | c259a9a781fa03a4a02b6b7d5118fd202574c347 |
| LibDynamic | d680409177b23ee649c0521c4a1ef647478c1f2f |
| LibSRT | e3c8a7091f42dbb3a055252bee6bccb171a6eddc |
| LibLFDS | 7.1.1 |
| QLIBC | 25d5f5ce44ec4c863edbeaecdcb4d3c05dcf3aa7 |
| RapidString | 2a3ef51e29f37c6dbc62b46c8d39f058689d40c9 |
| SDS | fb463145c9c245636feb28b5aac0fc897e16f67e |
| Dense HMAP | f93c0c69e959c1c77611d0ba8d107aa971338811 |
| SparsePP | 5dc50f795b9783ce5778c03b6cad5d1c7d0d29ea |
| Tommy DS | 8a99599168ef7e56276ed3dbdeee40a27b039da7 |
| UTHASH | 28334219b5327937cb05a593dc2d6ba622ce5bcb |
| BOOST | 1.84 |
| C Macro Collections | v0.23.1 |
| CTL | b787374d672c97e173164fcb6068dced0e1a3619 |
| STC | 9f1a51593ae7dd45db46f39ac18901ad175af18e |
| POTTERY | e6ac5f45feb6119521e1b79c65b2062d5ebbceb0 |
| CC | 3cbef63a2608e82bdd8c1cde04a7ec62cbd69f7c |
| VERSTABLE | afbe27c7914aad2d936c3569128588799b3f0913 |

Notes:

  • Tree means "Totally Sorted Tree": a Red-Black Tree, AVL Tree, B-Tree, or equivalent.
  • Dict S: dictionary with an 8-byte type.
  • Dict B: dictionary with a 256-byte type.
  • M*LIB+: M*LIB with the mempool option activated (replacing malloc).
  • M*LIB's Dict S uses open addressing.
  • M*LIB's Dict B uses separate chaining.
  • Qt5 doesn't have a container that is a set offering total ordering, so QMap was tested instead.
  • GLib uses its own memory allocator (gslice).
  • Tommy DS doesn't have an insert_or_replace function, so plain insert is benchmarked instead (two equal keys can both end up in the dictionary), which is faster.
  • QLibC hash functions support only const char pointers as keys.
  • A result marked ? means that the benchmark computed a wrong result, so its time is unreliable.

## Code generation analysis

Let's compare two equivalent pieces of code: one using M*LIB, the other using a plain C array. The code is inspired by this question.

The classic C version:

  /* struct Pixel is a 3-byte POD: */
  struct Pixel { unsigned char r, g, b; };

  const int dimension = 999;
  struct Pixel * pixels = malloc(sizeof(struct Pixel) * dimension * dimension);
  for(int i = 0 ; i < dimension * dimension; ++i)
    {
      pixels[i].r = 255;
      pixels[i].g = 0;
      pixels[i].b = 0;
    }
  free(pixels);

The equivalent M*lib version:

  /* array_pixel_t is presumably generated beforehand with something like:
     ARRAY_DEF(array_pixel, struct Pixel, M_POD_OPLIST) */
  const int dimension = 999;
  array_pixel_t pixels;
  array_pixel_init(pixels);
  array_pixel_reserve(pixels, dimension * dimension);
  for(int i = 0; i < dimension * dimension; ++i)
    {
      const struct Pixel p = { 255, 0, 0 };
      array_pixel_push_back(pixels, p);
    }
  array_pixel_clear(pixels);

Let's compare the timings (gcc -O2 -march=native):

| Code | Time |
|---|---|
| classic C | 856 ms |
| M*LIB | 604 ms |

The M*LIB version is even faster! And M*LIB performs more work on the data (the push_back operator may reallocate the array if needed). How is this possible?

Let's dig into the generated code. For the classic version:

  movl    $2994003, %edi
  call    malloc
  movq    %rax, %rdi
  leaq    2994003(%rax), %rdx
  .L11:
  movb    $-1, (%rax)
  addq    $3, %rax
  movb    $0, -2(%rax)
  movb    $0, -1(%rax)
  cmpq    %rdx, %rax
  jne     .L11
  call    free

This is a faithful, simple translation of the C code: the inner loop is simple and efficient. Now let's look at the M*LIB version:

  movl    $2994003, %edi
  call    malloc
  movq    %rax, %rdi
  testq   %rax, %rax
  je      .L9
  xorl    %eax, %eax
  .L2:
  leaq    (%rdi,%rax), %rdx
  movl    $255, %ecx
  addq    $3, %rax
  movw    %cx, (%rdx)
  movb    $0, 2(%rdx)
  cmpq    $2994003, %rax
  jne     .L2
  call    free

Despite containing much more code (the push_back operator has to check whether it needs to reallocate the array before pushing the data), all of the extra code has been optimized out by the compiler! The inner loop has been compiled into a simple, efficient form, with an optimization (a two-byte store via movw) that is absent from the naive version; this explains the timing difference.

There is no doubt that the naive version could be made as fast as the M*LIB one with a little rework of the original C code. The point, however, is that using M*LIB is as fast as using a plain array.

## Mempool benchmark

Benchmark from http://natsys-lab.blogspot.fr/2015/09/fast-memory-pool-allocators-boost-nginx.html, using the M*LIB mempool allocators, run on a Core 2 Duo:

           small object size:    24
             big object size:    240
            huge object size:    8305

          mallocfree (Small):    1763ms
  mallocfree w/ free (Small):    1149ms
            mallocfree (Big):    858ms
    mallocfree w/ free (Big):    541ms

         GLib slices (Small):    2228ms
 GLib slices w/ free (Small):    1345ms
           GLib slices (Big):    725ms
   GLib slices w/ free (Big):    381ms

         boost::pool (Small):    965ms
 boost::pool w/ free (Small):    562ms
           boost::pool (Big):    526ms
   boost::pool w/ free (Big):    470ms
    boost::pool cr. & destr.:    264ms

            ngx_pool (Small):    806ms
              ngx_pool (Big):    445ms
      ngx_pool w/ free (Mix):    2767ms
       ngx_pool cr. & destr.:    216ms

            tfw_pool (Small):    858ms
    tfw_pool w/ free (Small):    290ms
              tfw_pool (Big):    312ms
      tfw_pool w/ free (Big):    111ms
      tfw_pool w/ free (Mix):    302ms
       tfw_pool cr. & destr.:    99ms

       M*LIB mempool (Small):    465ms
         M*LIB mempool (Big):    353ms
 M*LIB mempool w/ free (Mix):    275ms

M*LIB performs very well in this test.

## UDB Benchmark

UDB Benchmark is an integer / string benchmark for associative arrays (hash tables, red-black trees, etc.) developed by the author of KLIB. It can be found here. More information about the benchmark is available here.

The results for N=5000000 (the default value) on an i5-4690K are:

| Library | Int time | Str time |
|---|---|---|
| google_dense | 0.220 | 1.110 |
| mlib-oa | 0.150 | 0.870 |
| sgi_map | 2.820 | 6.880 |
| boost_hash | 1.230 | 1.640 |
| uthash | 1.160 | 1.520 |
| unordered_map_gcc | 0.700 | 1.300 |
| khash | 0.170 | 0.680 |
| NP_rbtree | 2.790 | 5.310 |
| glib_hash | 0.960 | 1.050 |
| mlib | 0.280 | 0.610 |
| stx_btree | 1.230 | 4.440 |
| kbtree | 1.490 | 4.900 |
| glib_tree | 5.280 | 6.240 |
| sgi_hash_map | 0.860 | 1.380 |
| google_sparse | 0.670 | 2.010 |

## String Benchmark

This benchmark is taken from the Bstring benchmark.

This table shows the number of operations per second on a Core i5-4690K, normalized against the STL:

| Operation | GCC STL | CBString | LIBSRT | M*LIB | SDS | POTTERY |
|---|---|---|---|---|---|---|
| empty constructor | 100 | 21 | 12 | 175 | 13 | 19 |
| non-empty constructor | 100 | 87 | 30 | 125 | 52 | 23 |
| small non-empty constructor | 100 | 12 | 4 | 119 | 7 | 9 |
| char * assignment | 100 | 85 | 46 | 264 | 57 | 52 |
| char extraction | 100 | 100 | 31 | 88 | 100 | 87 |
| scan | 100 | 184 | 259 | 425 | 467 | 476 |
| concatenation | 100 | 95 | 18 | 282 | 44 | 8 |
| replace | 100 | 72 | NA | 171 | 6 | NA |

And the master benchmark, which consists of creating 2,000,000 strings based on random integers, concatenating them into a single string, and replacing 2 sub-strings in it:

| Library | Time |
|---|---|
| M*LIB | 0.09 s |
| STL | 0.21 s |
| BSTRLIB | 0.48 s |
| POTTERY | 4.27 s |
| SDS | 7.92 s |

Notes:

  • GCC 10.2
  • M*LIB e31a8dc407c3d63181218e693efa08955e1f8924
  • Better String f0ff1e808102a42cdc7204a4bb6fe231a24c4546
  • LIBSRT eee28e6dfc23f76c7b8f76f32ef68418619064be
  • SDS fb463145c9c245636feb28b5aac0fc897e16f67e
  • POTTERY 32378c84428625428a392d0e7b4c0d1c888552cb
  • The operations benchmark may be unfair to non-inline libraries (like Better String / LIBSRT / SDS), as operations on constant strings can be optimized very heavily with an inline implementation (like STL & M*LIB).