
[wip] snabbmark: Add preliminary "byteops" benchmark #755

Closed
wants to merge 1 commit

Conversation

@lukego (Member) commented Feb 13, 2016

Fun weekend hack...

This branch adds a new command, snabbmark byteops, that measures byte-oriented operations with diverse parameters and produces a comprehensive CSV file. The intention is to systematically measure and compare the performance of operations like memcpy and checksum at different levels of the cache hierarchy, with different alignments, and with different distributions of input sizes. This is in the same spirit as #688 and #744.
TLDR: Full CSV output for 10 runs on lugano-1. (45K rows.)

The parameters tested are:

  • Function: memcpy, cksum, cksumavx2. (This program also seems to "fuzz" out a problem in cksumsse2 that needs to be looked into!)
  • Displacement: maximum value for random displacement of source/destination arrays in memory. Uses different values intended to exercise L1/L2/L3/DRAM.
  • Distribution of input sizes: fixed 64, 256, or 1500 bytes; uniform over 0..10240; loguniform (proportionally more smaller values) over 0..10240. (See the sketch after this list.)
  • Individually, the alignment of src/dst/len values to 1/2/4/8/16/32/48/64 byte boundaries.
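
As a rough illustration of the "loguniform" case, here is a minimal R sketch of a draw that is uniform in log-space over 1..10240 (the function name and bounds are illustrative; the benchmark itself generates lengths in Lua):

# Uniform in log-space: proportionally more small values than runif().
loguniform <- function(n, lo = 1, hi = 10240) {
  round(exp(runif(n, log(lo), log(hi))))
}
summary(loguniform(10000))  # median lands well below (lo + hi) / 2
hist(loguniform(10000), breaks = 50)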

The resulting CSV file includes:

  • nbatch: Aggregate number of iterations measured using the same parameters.
  • nbytes: Aggregate bytes for all iterations in the batch.
  • nanos: Nanoseconds elapsed to process the whole batch.
  • cycles, ref_cycles, instructions, l1-hits, l2-hits, l3-hits, l3-misses, branch-misses: Performance counter readings for the batch.
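
From these aggregates one can derive throughput metrics. A minimal R sketch, assuming the CSV has been loaded into a data frame (the file name and the derived column names are mine, not the benchmark's):

d <- read.csv(file = 'byteops.csv', sep = ';')  # hypothetical file name
d$cyc.per.byte <- d$cycle / d$nbytes   # average CPU cycles spent per byte
d$gb.per.sec   <- d$nbytes / d$nanos   # bytes per nanosecond equals GB/s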

Here is some example output:

@@ cpu       ;name      ;nbatch    ;nbytes    ;disp      ;lendist   ;lenalign  ;dstalign  ;srcalign  ;nanos     ;cycle     ;refcycle  ;instr     ;l1-hit    ;l2-hit    ;l3-hit    ;l3-miss   ;br-miss   
@@ E5-1650 v3;cksum     ;10000     ;50635943  ;10475520  ;uniform   ;1         ;1         ;1         ;90059546  ;313139680 ;0         ;404778783 ;177018645 ;156826    ;0         ;0         ;143315    
@@ E5-1650 v3;cksumavx2 ;10000     ;51742863  ;10475520  ;uniform   ;1         ;1         ;1         ;3405091   ;9773415   ;0         ;17941408  ;484452    ;353159    ;0         ;0         ;29045     
@@ E5-1650 v3;memcpy    ;10000     ;50969828  ;10475520  ;uniform   ;1         ;1         ;1         ;15737622  ;19974604  ;0         ;504955    ;225133    ;67525     ;0         ;0         ;4150      
@@ E5-1650 v3;cksum     ;10000     ;11307167  ;0         ;loguniform;1         ;1         ;1         ;19444333  ;65850861  ;0         ;63286809  ;41024066  ;2504      ;0         ;0         ;25362     
@@ E5-1650 v3;cksumavx2 ;10000     ;11008282  ;0         ;loguniform;1         ;1         ;1         ;10229210  ;2220052   ;0         ;4907108   ;882071    ;1550      ;0         ;0         ;29783     
@@ E5-1650 v3;memcpy    ;10000     ;11375385  ;0         ;loguniform;1         ;1         ;1         ;922086    ;1070289   ;0         ;505070    ;454292    ;600       ;0         ;0         ;16904     

Is there anybody brave enough to try and analyze this data? :-)

This benchmark comprehensively measures the performance of various
byte-oriented operations (copy, checksum) with many different
variations (implementation; distribution of input sizes; alignment of
src/dst/len; etc).

This is working but messy first-cut code.
@mention-bot commented:

By analyzing the blame information on this pull request, we identified @eugeneia and @hb9cwp to be potential reviewers.

@lukego (Member Author) commented Feb 14, 2016

@petebristow what do you think about this in the context of #692?

I am thinking that this PR basically steals that idea but keeps everything local inside snabbmark. That defers the discussion about a more generalized benchmarking library/framework until we have more code and experience and are ready to refactor based on lessons learned.

I can imagine having additional functions for benchmarking apps and app networks alongside this one for benchmarking byte-oriented functions. (This one can also be improved a bit, e.g. to recognize that the checksum tests ignore the destination argument and so prune those permutations out of the space.) These could initially share code via subroutines in snabbmark.lua.

@lukego (Member Author) commented Feb 16, 2016

Baby steps...

I asked my mum about this kind of data analysis (she's good with statistics) and the keyword she gave me was SST (Total Sum of Squares). This led me to the Khan Academy videos on Analysis of Variance (Inferential Statistics). Looks promising.
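
Here is a toy R illustration of the idea (my own sketch, not from the videos): the total sum of squares decomposes into a between-group part and a within-group part, and ANOVA asks whether the between-group part is surprisingly large relative to the within-group noise.

x <- c(1, 2, 3, 11, 12, 13)
g <- factor(c('a', 'a', 'a', 'b', 'b', 'b'))
sst <- sum((x - mean(x))^2)                                 # total
ssw <- sum(tapply(x, g, function(v) sum((v - mean(v))^2)))  # within groups
ssb <- sst - ssw                                            # between groups
c(SST = sst, SSB = ssb, SSW = ssw)  # 154 = 150 + 4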

"A little knowledge is a dangerous thing"... early days yet. Please pipe up if you are interested in these things :).

@lukego (Member Author) commented Feb 23, 2016

Closing this PR for now. I will reopen when I have something new to show.

@lukego (Member Author) commented Feb 25, 2016

My mum tells me that she ran an analysis of variance over a few variables. It says that the benchmarks are not sensitive to the alignment of the source operand, but they are sensitive to the combination of operation (cksum, cksumavx2, memcpy) and displacement (cache level).

Here is a picture of the latter:

[Image: pastedgraphic-1, a plot of the operation and displacement interaction]

This seems to be saying that the base checksum is always slow, the AVX2 checksum varies by a factor of 2 depending on L1/L2/L3/DRAM, and the memcpy operation is dramatically faster in L1 cache (up to 12KB working set size).
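
A plot along these lines can be reproduced in base R. A minimal sketch, assuming the data4.csv columns used later in this thread (name, disp, byte.cyc):

d <- read.csv(file = 'data4.csv', sep = ';')
interaction.plot(x.factor = factor(d$disp), trace.factor = d$name,
                 response = d$byte.cyc, xlab = 'displacement',
                 ylab = 'byte.cyc (mean)', trace.label = 'operation')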

I find it encouraging that this kind of thing can be detected purely numerically without any explanation of what the data actually represents.

Everything with a grain of salt at this stage, but still... feels like progress.

@lukego (Member Author) commented Feb 26, 2016

So, seriously, is R the coolest thing in the universe? Probably.

I very easily loaded the CSV file into R:

> d <- read.csv(file='data4.csv', sep=';')
> df <- data.frame(d$byte.cyc, d$disp, d$lendist, d$l1.hit, d$srcalign, d$lenalign, d$name)
> d$disp <- as.factor(d$disp)

and then, completely by magic, it is able to just tell me the answers to all of these big questions I have been wondering about. Like:

> summary(aov(d.byte.cyc ~ d.srcalign * d.lenalign * d.disp, data=df))
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       1     86      86    4.364 0.0367 *  
d.lenalign                       1    125     125    6.343 0.0118 *  
d.disp                           4 139577   34894 1773.645 <2e-16 ***
d.srcalign:d.lenalign            1      0       0    0.003 0.9551    
d.srcalign:d.disp                4     78      20    0.993 0.4098    
d.lenalign:d.disp                4     20       5    0.250 0.9095    
d.srcalign:d.lenalign:d.disp     4      0       0    0.004 1.0000    
Residuals                    46060 906175      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
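
(Aside on the formula: in R, d.srcalign * d.lenalign * d.disp expands to the three main effects plus every two- and three-way interaction, which is exactly the seven effect rows above the residuals.)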

which says to me that:

  1. The maximum displacement of data in memory (disp) makes a big difference. This is effectively the working set size i.e. which level of the memory hierarchy will be serving the requests.
  2. The source and destination data alignments don't matter: not individually, not in combination with each other, and not in combination with the memory hierarchy. (This is expected on Haswell: it would be interesting to compare with a Sandy Bridge machine where various SIMD instructions carry penalties for unaligned access.)

Holy smokes! Imagine taking hours of benchmarking numbers and getting a detailed analysis of them in one second. Mind: blown.

This seems like exactly the tool that I need to move forward with micro-optimizations like the asm blitter in #719 where I want to understand how robust the optimization is to different workloads.

EDIT: Actually it is saying that the alignments do matter a little bit... I said they don't matter because their effect appears much smaller than the effect of disp.

@lukego (Member Author) commented Feb 26, 2016

ahem :)

I think the conclusion above is okay but the details are wrong: I didn't properly declare that some more of the numeric columns in the CSV file are "factors". Correcting that, we get a similar table:

> summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       7    134      19    0.965  0.455    
d.disp                           4 139577   34894 1762.442 <2e-16 ***
d.lenalign                       7    167      24    1.207  0.295    
d.srcalign:d.disp               28    146       5    0.263  1.000    
d.srcalign:d.lenalign           49      0       0    0.000  1.000    
d.disp:d.lenalign               28     41       1    0.073  1.000    
d.srcalign:d.disp:d.lenalign   196      1       0    0.000  1.000    
Residuals                    45760 905995      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

where column Df (degrees of freedom) is now consistent with the number of possible values. This analysis took more like a minute to run on my little Chromebook.
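
A quick sanity check of the Df values (a factor's main-effect Df is its number of levels minus one):

nlevels(d$srcalign) - 1  # 7: eight alignment values (1..64)
nlevels(d$disp) - 1      # 4: five displacement settings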

Grain, salt, etc.

@lukego (Member Author) commented Feb 26, 2016

If anybody wants to try then here is the full script:

#!/usr/bin/env Rscript
print('loading data4.csv file')
d <- read.csv(file='data4.csv', sep=';')
# Columns that are "factors" in the experiment.
# (R would auto-detect if the values were non-numeric.)
d$srcalign <- as.factor(d$srcalign)
d$lenalign <- as.factor(d$lenalign)
d$disp     <- as.factor(d$disp)
# Create a "data frame" with some columns to look at
df <- data.frame(d$byte.cyc, d$disp, d$srcalign, d$lenalign, d$name)
# run the analysis of variance
print('running analysis of variance')
summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))

which expects to find data4.csv in the same directory.

The expected output:

[luke@lugano-1:~/git/r]$ ./byteops 
[1] "loading data4.csv file"
[1] "running analysis of variance"
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       7    134      19    0.965  0.455    
d.disp                           4 139577   34894 1762.442 <2e-16 ***
d.lenalign                       7    167      24    1.207  0.295    
d.srcalign:d.disp               28    146       5    0.263  1.000    
d.srcalign:d.lenalign           49      0       0    0.000  1.000    
d.disp:d.lenalign               28     41       1    0.073  1.000    
d.srcalign:d.disp:d.lenalign   196      1       0    0.000  1.000    
Residuals                    45760 905995      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

@lukego (Member Author) commented Feb 26, 2016

Handy link: R Tutorial, the best one I found.

@lukego (Member Author) commented Feb 28, 2016

Here is a Google Doc summarizing some more thoughts that my mum shared about this data set. I haven't fully digested them yet.
