performance

C/C++ container library benchmark :

Benchmark code is available here

For the following results, the best result within 5% has been selected: more than 10 tries of the bench have been launch, then the selected value is the minimum bound of the longest segments of all values within 5%.

## C libraries

Type	List	Array	Tree	Dict S	Dict B	Sort
M*LIB	572 ms	312 ms	296 ms	32 ms	704 ms	888 ms
M*LIB+	148 ms	316 ms	300 ms	36 ms	532 ms	884 ms
GLIB	592 ms	1356 ms	2056 ms	424 ms	728 ms	1300 ms
KLIB	812 ms	312 ms	436 ms	56 ms	756 ms	784 ms
CollectionC	720 ms	1624 ms	1284 ms	452 ms	1200 ms	2696 ms
LIBDYNAMIC	NA	1092 ms	NA	120 ms	676 ms	NA
TOMMY DS	624 ms	752 ms	1380 ms	184 ms	584 ms	NA
UTHASH	564 ms	328 ms	NA	420 ms	1276 ms	1276 ms
QLIBC TO	1476 ms	3196 ms	1872 ms	NA	NA	NA
LIBSRT	NA	2192 ms	NA	NA	NA	1252 ms
CMC	NA	872 ms	1404 ms	328 ms	NA	NA
CTL	592 ms	312 ms	1204 ms	460 ms	NA	1256 ms
STC	584 ms	324 ms	908 ms	104 ms	640 ms	1304 ms
POTTERY	572 ms	1680 ms	488 ms ?	60 ms	908 ms	916 ms
CC	542 ms	720 ms	NA	113 ms	1336 ms	NA
VERSTABLE	NA	NA	NA	80 ms	620 ms	NA

## C++ libraries

Type	List	Array	Tree	Dict S	Dict B	Sort
STL	616 ms	772 ms	1228 ms	424 ms	1028 ms	740 ms
QT	596 ms	1172 ms	1300 ms	456 ms	840 ms	816 ms
DENSE HMAP	NA	NA	NA	48 ms	804 ms	NA
FLAT HMAP	NA	NA	NA	92 ms	808 ms	NA
EMILIB	NA	NA	NA	44 ms	644 ms	NA
EMHASH	NA	NA	NA	36 ms	521 ms	NA
HOPSCOTCHMAP	NA	NA	NA	80 ms	844 ms	NA
RIGTORP HMAP	NA	NA	NA	40 ms	960 ms	NA
BOOST UNORDERED	NA	NA	NA	38 ms	426 ms	NA

For theses results, the best one of 3 tries was kept:

Type	Queue MPMC
M*LIB CONCURRENT+DEQUEUE	579 ms
M*LIB BUFFER	375 ms
M*LIB MPMC Queue	36 ms
LIBLFDS MPMC	83 ms
CONCURRENT QUEUE	279 ms
BOOST LOCK FREE	338 ms
RIGTORP MPMC Queue	225 ms

The used versions are:

COMPONENT	VERSION
CPU	Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
GCC	10.1
M*LIB	743959ebde33e9a5452799e02c245cee02ebada2
Qt	5.7.1
GLIB	2.50.3
KLIB	928581a78413bed4efa956731b35b18a638f20f3
CollectionsC	775447bfa379417382b2db8454f19039f4d3deb4
ConcurrentQueue	38e6a6f0185a98c3aaf2a95aa109ba041221d527
Emilib	0d90f4c0cd44813f6898b417bf3f4ef574a5b136
EMHASH	7e51e4cce5ae053c3e000466163aebd620e1004d
Flast HMAP	2c4687431f978f02a3780e24b8b701d22aa32d9c
Rigtorp HMAP	2f91f0cf97e9a5b6881a8cdf37bf5fbd470b3ee6
Rigtorp MPMC Q	b74a9f06469ec268b82eca3b5fe28dd58f312d06
Hopscotch HMAP	59e03db96365fa22de26729a412d632e0b06df3b
LibCollections	c259a9a781fa03a4a02b6b7d5118fd202574c347
LibDynamic	d680409177b23ee649c0521c4a1ef647478c1f2f
LibSRT	e3c8a7091f42dbb3a055252bee6bccb171a6eddc
LibLFDS	7.1.1
QLIBC	25d5f5ce44ec4c863edbeaecdcb4d3c05dcf3aa7
RapidString	2a3ef51e29f37c6dbc62b46c8d39f058689d40c9
SDS	fb463145c9c245636feb28b5aac0fc897e16f67e
Dense HMAP	f93c0c69e959c1c77611d0ba8d107aa971338811
SparsePP	5dc50f795b9783ce5778c03b6cad5d1c7d0d29ea
Tommy DS	8a99599168ef7e56276ed3dbdeee40a27b039da7
UTHASH	28334219b5327937cb05a593dc2d6ba622ce5bcb
BOOST	1.84
C Macro Collections	v0.23.1
CTL	b787374d672c97e173164fcb6068dced0e1a3619
STC	9f1a51593ae7dd45db46f39ac18901ad175af18e
POTTERY	e6ac5f45feb6119521e1b79c65b2062d5ebbceb0
CC	3cbef63a2608e82bdd8c1cde04a7ec62cbd69f7c
VERSTABLE	afbe27c7914aad2d936c3569128588799b3f0913

Notes:

Tree means "Totally Sorted Tree": Red-Black Tree, AVL Tree, B-Tree or equivalent.
Dict S: Dictionary with 8 bytes type,
Dict B: Dictionary with 256 bytes type .
M*LIB+: M*LIB with mempool option activated (replacing malloc).
M*LIB's Dict S used Open Addressing
M*LIB's Dict B used separate chaining.
Qt5 doesn't have a container that is a set offering total ordering. As such, it was QMap which was tested instead.
Glib uses its own memory allocator (gslice).
Tommy DS doesn't have a insert_or_replace function, as such, it is insert alone which is benched (two equal keys can both be appended in the dictionary), which is faster.
QLibC hash functions support only const char pointer as keys.
A result marked ? means that the computation of the benchmark was wrong, and so the bench time is unreliable.

Code generation analysis:

Let's compare two equivalent code. One with M*Lib, the other one with standard classic array. The code is inspired by this question.

The classic C version:

  const int dimension = 999;
  struct Pixel * pixels = malloc(sizeof(struct Pixel) * dimension * dimension);
  for(int i = 0 ; i < dimension * dimension; ++i)
    {
      pixels[i].r = 255;
      pixels[i].g = 0;
      pixels[i].b = 0;
    }
  free(pixels);

The equivalent M*lib version:

  const int dimension = 999;
  array_pixel_t pixels;
  array_pixel_init(pixels);
  array_pixel_reserve(pixels, dimension * dimension);
  for(int i = 0; i < dimension * dimension; ++i)
    {
      const struct Pixel p = { 255, 0, 0};
      array_pixel_push_back (pixels, p);
    }
  array_pixel_clear(pixels);

Let's compare the timings (gcc -O2 -march=native)

Code	Time
classic C	856ms
M*Lib	604ms

M*LIB timing is even faster! And M*Lib performs much more work with the data (the push_back operator may realloc the data if needed). How is it possible?

Let's deep into the generated code. For the classic version:

movl	$2994003, %edi
call	malloc
movq	%rax, %rdi
leaq	2994003(%rax), %rdx
.L11:
movb	$-1, (%rax)
addq	$3, %rax
movb	$0, -2(%rax)
movb	$0, -1(%rax)
cmpq	%rdx, %rax
jne	.L11
call	free

This is a fair and simple translation of the C code: the inner loop is simple and efficient. Now let's look for the M_*lib version:

movl	$2994003, %edi
call	malloc
movq	%rax, %rdi
testq	%rax, %rax
je	.L9
xorl	%eax, %eax
.L2:
leaq	(%rdi,%rax), %rdx
movl	$255, %ecx
addq	$3, %rax
movw	%cx, (%rdx)
movb	$0, 2(%rdx)
cmpq	$2994003, %rax
jne	.L2
call	free

Despite having a lot more of codes (due to the push_back operator checking that it doesn't have to realloc the array to be able to push the data), all this code have been optimized out by the compiler! The inner loop have been translated into a simple and efficient version with a nice optimization that is not present in the naive version that explains the timing difference.

There is no doubt however that the naive version can be as fast as the M*LIB one with a little rework of the original C code. However, there is no doubt however than using M*LIB is as fast as plain array.

Mempool benchmark

Benchmark from http://natsys-lab.blogspot.fr/2015/09/fast-memory-pool-allocators-boost-nginx.html with use of M*LIB mempool allocators on a core 2 duo:

           small object size:    24
             big object size:    240
            huge object size:    8305

          mallocfree (Small):    1763ms
  mallocfree w/ free (Small):    1149ms
            mallocfree (Big):    858ms
    mallocfree w/ free (Big):    541ms

         GLib slices (Small):    2228ms
 GLib slices w/ free (Small):    1345ms
           GLib slices (Big):    725ms
   GLib slices w/ free (Big):    381ms

         boost::pool (Small):    965ms
 boost::pool w/ free (Small):    562ms
           boost::pool (Big):    526ms
   boost::pool w/ free (Big):    470ms
    boost::pool cr. & destr.:    264ms

            ngx_pool (Small):    806ms
              ngx_pool (Big):    445ms
      ngx_pool w/ free (Mix):    2767ms
       ngx_pool cr. & destr.:    216ms

            tfw_pool (Small):    858ms
    tfw_pool w/ free (Small):    290ms
              tfw_pool (Big):    312ms
      tfw_pool w/ free (Big):    111ms
      tfw_pool w/ free (Mix):    302ms
       tfw_pool cr. & destr.:    99ms

       M*LIB mempool (Small):    465ms
         M*LIB mempool (Big):    353ms
 M*LIB mempool w/ free (Mix):    275ms

M*LIB performs very good on this test.

UDB Benchmark

UDB Benchmark is an integer / string benchmark for associated arrays (hash tables, Red/Black Tree, etc.) developed by the author of KLIB. It can be found here. More information about the benchmark is available here.

The results for N=5000000 (default value) for a i5-4690K are:

LIBRARY	INT TIME	STR TIME
google_dense	0.220	1.110
mlib-oa	0.150	0.870
sgi_map	2.820	6.880
boost_hash	1.230	1.640
uthash	1.160	1.520
unordered_map_gcc	0.700	1.300
khash	0.170	0.680
NP_rbtree	2.790	5.310
glib_hash	0.960	1.050
mlib	0.280	0.610
stx_btree	1.230	4.440
kbtree	1.490	4.900
glib_tree	5.280	6.240
sgi_hash_map	0.860	1.380
google_sparse	0.670	2.010

String Benchmark

This benchmark is taken from Bstring benchmark

This table show the number of operations per second on a core i5 4690K normalized with the STL:

Operations	GCC STL	CBString	LIBSRT	M*LIB	SDS	POTTERY
empty constructor	100	21	12	175	13	19
non-empty constructor	100	87	30	125	52	23
small non-empty constructor	100	12	4	119	7	9
char * assignment	100	85	46	264	57	52
char extraction	100	100	31	88	100	87
scan	100	184	259	425	467	476
concatenation	100	95	18	282	44	8
replace	100	72	NA	171	6	NA

And the master benchmark which consists in creating 2000000 strings based on random integers, concatenating them into a single string and replacing 2 sub-strings in it:

Library	Time
M*LIB	0.09s
STL	0.21s
BSTRLIB	0.48s
POTTERY	4.27s
SDS	7.92s

Notes:

GCC 10.2
M*LIB e31a8dc407c3d63181218e693efa08955e1f8924
Better String f0ff1e808102a42cdc7204a4bb6fe231a24c4546
LIBSRT eee28e6dfc23f76c7b8f76f32ef68418619064be
SDS fb463145c9c245636feb28b5aac0fc897e16f67e
POTTERY 32378c84428625428a392d0e7b4c0d1c888552cb
The operations-benchmark may be unfair for non-inline libraries (like Better String µ/ LIBSRT / SDS) as operations on constants strings can be optimized very heavily with inline implementation (like STL & M*LIB).

Provide feedback

Saved searches