
[wip] snabbmark: Add preliminary "byteops" benchmark #755

Closed
wants to merge 1 commit

Conversation

@lukego (Member) commented Feb 13, 2016

Fun weekend hack...

This branch adds a new command, snabbmark byteops, that measures byte-oriented operations with diverse parameters and produces a comprehensive CSV file. The intention is to systematically measure and compare the performance of operations like memcpy and checksum at different levels of the cache hierarchy, with different alignments, and with different distributions of input sizes. This is in the same spirit as #688 and #744.
TLDR: Full CSV output for 10 runs on lugano-1. (45K rows.)

The parameters tested are:

  • Function: memcpy, cksum, cksumavx2. (This program also seems to "fuzz" out a problem in cksumsse2 that needs to be looked into!)
  • Displacement: maximum value for random displacement of source/destination arrays in memory. Uses different values intended to exercise L1/L2/L3/DRAM.
  • Distribution of input sizes: fixed 64, 256, or 1500 bytes; uniform over 0..10240; loguniform (proportionally more smaller values) over 0..10240. (See the sketch after this list.)
  • Individually, the alignment of src/dst/len values to 1/2/4/8/16/32/48/64 byte boundaries.
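
As a rough illustration of the "loguniform" case, here is a minimal R sketch of a draw that is uniform in log-space over 1..10240 (the function name and bounds are illustrative; the benchmark itself generates lengths in Lua):

# Uniform in log-space: proportionally more small values than runif().
loguniform <- function(n, lo = 1, hi = 10240) {
  round(exp(runif(n, log(lo), log(hi))))
}
summary(loguniform(10000))  # median lands well below (lo + hi) / 2
hist(loguniform(10000), breaks = 50)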

The resulting CSV file includes:

  • nbatch: Aggregate number of iterations measured using the same parameters.
  • nbytes: Aggregate bytes for all iterations in the batch.
  • nanos: Nanoseconds elapsed to process the whole batch.
  • cycles, ref_cycles, instructions, l1-hits, l2-hits, l3-hits, l3-misses, branch-misses: Performance counter readings for the batch.
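
From these aggregates one can derive throughput metrics. A minimal R sketch, assuming the CSV has been loaded into a data frame (the file name and the derived column names are mine, not the benchmark's):

d <- read.csv(file = 'byteops.csv', sep = ';')  # hypothetical file name
d$cyc.per.byte <- d$cycle / d$nbytes   # average CPU cycles spent per byte
d$gb.per.sec   <- d$nbytes / d$nanos   # bytes per nanosecond equals GB/s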

Here is some example output:

@@ cpu       ;name      ;nbatch    ;nbytes    ;disp      ;lendist   ;lenalign  ;dstalign  ;srcalign  ;nanos     ;cycle     ;refcycle  ;instr     ;l1-hit    ;l2-hit    ;l3-hit    ;l3-miss   ;br-miss   
@@ E5-1650 v3;cksum     ;10000     ;50635943  ;10475520  ;uniform   ;1         ;1         ;1         ;90059546  ;313139680 ;0         ;404778783 ;177018645 ;156826    ;0         ;0         ;143315    
@@ E5-1650 v3;cksumavx2 ;10000     ;51742863  ;10475520  ;uniform   ;1         ;1         ;1         ;3405091   ;9773415   ;0         ;17941408  ;484452    ;353159    ;0         ;0         ;29045     
@@ E5-1650 v3;memcpy    ;10000     ;50969828  ;10475520  ;uniform   ;1         ;1         ;1         ;15737622  ;19974604  ;0         ;504955    ;225133    ;67525     ;0         ;0         ;4150      
@@ E5-1650 v3;cksum     ;10000     ;11307167  ;0         ;loguniform;1         ;1         ;1         ;19444333  ;65850861  ;0         ;63286809  ;41024066  ;2504      ;0         ;0         ;25362     
@@ E5-1650 v3;cksumavx2 ;10000     ;11008282  ;0         ;loguniform;1         ;1         ;1         ;10229210  ;2220052   ;0         ;4907108   ;882071    ;1550      ;0         ;0         ;29783     
@@ E5-1650 v3;memcpy    ;10000     ;11375385  ;0         ;loguniform;1         ;1         ;1         ;922086    ;1070289   ;0         ;505070    ;454292    ;600       ;0         ;0         ;16904     

Is there anybody brave enough to try and analyze this data? :-)

This benchmark comprehensively measures the performance of various
byte-oriented operations (copy, checksum) with many different
variations (implementation; distribution of input sizes; alignment of
src/dst/len; etc).

This is working but messy first-cut code.
@mention-bot commented:

By analyzing the blame information on this pull request, we identified @eugeneia and @hb9cwp to be potential reviewers.

@lukego (Member Author) commented Feb 14, 2016

@petebristow what do you think about this in the context of #692?

I am thinking that this PR basically steals that idea but keeps everything local inside snabbmark. That defers the discussion about a more generalized benchmarking library/framework until we have more code and experience and are ready to refactor based on lessons learned.

I can imagine having additional functions for benchmarking apps and app networks alongside this one for benchmarking byte-oriented functions. (This one can also be improved a bit, e.g. to recognize that the checksum tests ignore the destination argument and so prune those permutations out of the space.) These could initially share code via subroutines in snabbmark.lua.

@lukego (Member Author) commented Feb 16, 2016

Baby steps...

I asked my mum about this kind of data analysis (she's good with statistics) and the keyword she gave me was SST (Total Sum of Squares). This led me to the Khan Academy videos on Analysis of Variance (Inferential Statistics). Looks promising.
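
Here is a toy R illustration of the idea (my own sketch, not from the videos): the total sum of squares decomposes into a between-group part and a within-group part, and ANOVA asks whether the between-group part is surprisingly large relative to the within-group noise.

x <- c(1, 2, 3, 11, 12, 13)
g <- factor(c('a', 'a', 'a', 'b', 'b', 'b'))
sst <- sum((x - mean(x))^2)                                 # total
ssw <- sum(tapply(x, g, function(v) sum((v - mean(v))^2)))  # within groups
ssb <- sst - ssw                                            # between groups
c(SST = sst, SSB = ssb, SSW = ssw)  # 154 = 150 + 4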

"A little knowledge is a dangerous thing"... early days yet. Please pipe up if you are interested in these things :).

@lukego (Member Author) commented Feb 23, 2016

Closing this PR for now. I will reopen when I have something new to show.

@lukego (Member Author) commented Feb 25, 2016

My mum tells me that she ran an analysis of variance over a few variables. It says that the benchmarks are not sensitive to the alignment of the source operand, but they are sensitive to the combination of operation (cksum, cksumavx2, memcpy) and displacement (cache level).

Here is a picture of the latter:

[Image: pastedgraphic-1, a plot of the operation and displacement interaction]

This seems to be saying that the base checksum is always slow, the AVX2 checksum varies by a factor of 2 depending on L1/L2/L3/DRAM, and the memcpy operation is dramatically faster in L1 cache (up to 12KB working set size).
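
A plot along these lines can be reproduced in base R. A minimal sketch, assuming the data4.csv columns used later in this thread (name, disp, byte.cyc):

d <- read.csv(file = 'data4.csv', sep = ';')
interaction.plot(x.factor = factor(d$disp), trace.factor = d$name,
                 response = d$byte.cyc, xlab = 'displacement',
                 ylab = 'byte.cyc (mean)', trace.label = 'operation')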

I find it encouraging that this kind of thing can be detected purely numerically without any explanation of what the data actually represents.

Everything with a grain of salt at this stage, but still... feels like progress.

@lukego (Member Author) commented Feb 26, 2016

So, seriously, is R the coolest thing in the universe? Probably.

I very easily loaded the CSV file into R:

> d <- read.csv(file='data4.csv', sep=';')
> df <- data.frame(d$byte.cyc, d$disp, d$lendist, d$l1.hit, d$srcalign, d$lenalign, d$name)
> d$disp <- as.factor(d$disp)

and then, completely by magic, it is able to just tell me the answers to all of these big questions I have been wondering about. Like:

> summary(aov(d.byte.cyc ~ d.srcalign * d.lenalign * d.disp, data=df))
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       1     86      86    4.364 0.0367 *  
d.lenalign                       1    125     125    6.343 0.0118 *  
d.disp                           4 139577   34894 1773.645 <2e-16 ***
d.srcalign:d.lenalign            1      0       0    0.003 0.9551    
d.srcalign:d.disp                4     78      20    0.993 0.4098    
d.lenalign:d.disp                4     20       5    0.250 0.9095    
d.srcalign:d.lenalign:d.disp     4      0       0    0.004 1.0000    
Residuals                    46060 906175      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
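
(Aside on the formula: in R, d.srcalign * d.lenalign * d.disp expands to the three main effects plus every two- and three-way interaction, which is exactly the seven effect rows above the residuals.)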

which says to me that:

  1. The maximum displacement of data in memory (disp) makes a big difference. This is effectively the working set size i.e. which level of the memory hierarchy will be serving the requests.
  2. The source and destination data alignments don't matter: not individually, not in combination with each other, and not in combination with the memory hierarchy. (This is expected on Haswell: it would be interesting to compare with a Sandy Bridge machine where various SIMD instructions carry penalties for unaligned access.)

Holy smokes! Imagine taking hours of benchmarking numbers and getting a detailed analysis of them in one second. Mind: blown.

This seems like exactly the tool that I need to move forward with micro-optimizations like the asm blitter in #719 where I want to understand how robust the optimization is to different workloads.

EDIT: Actually it is saying that the alignments do matter a little bit... I said they don't matter because their effect appears much smaller than the effect of disp.

@lukego (Member Author) commented Feb 26, 2016

ahem :)

I think the conclusion above is okay but the details are wrong: I didn't properly declare that some more of the numeric columns in the CSV file are "factors". Correcting that, we get a similar table:

> summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       7    134      19    0.965  0.455    
d.disp                           4 139577   34894 1762.442 <2e-16 ***
d.lenalign                       7    167      24    1.207  0.295    
d.srcalign:d.disp               28    146       5    0.263  1.000    
d.srcalign:d.lenalign           49      0       0    0.000  1.000    
d.disp:d.lenalign               28     41       1    0.073  1.000    
d.srcalign:d.disp:d.lenalign   196      1       0    0.000  1.000    
Residuals                    45760 905995      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

where column Df (degrees of freedom) is now consistent with the number of possible values. This analysis took more like a minute to run on my little Chromebook.
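
A quick sanity check of the Df values (a factor's main-effect Df is its number of levels minus one):

nlevels(d$srcalign) - 1  # 7: eight alignment values (1..64)
nlevels(d$disp) - 1      # 4: five displacement settings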

Grain, salt, etc.

@lukego (Member Author) commented Feb 26, 2016

If anybody wants to try then here is the full script:

#!/usr/bin/env Rscript
print('loading data4.csv file')
d <- read.csv(file='data4.csv', sep=';')
# Columns that are "factors" in the experiment.
# (R would auto-detect if the values were non-numeric.)
d$srcalign <- as.factor(d$srcalign)
d$lenalign <- as.factor(d$lenalign)
d$disp     <- as.factor(d$disp)
# Create a "data frame" with some columns to look at
df <- data.frame(d$byte.cyc, d$disp, d$srcalign, d$lenalign, d$name)
# run the analysis of variance
print('running analysis of variance')
summary(aov(d.byte.cyc ~ d.srcalign * d.disp * d.lenalign, data=df))

which expects to find data4.csv in the same directory.

The expected output:

[luke@lugano-1:~/git/r]$ ./byteops 
[1] "loading data4.csv file"
[1] "running analysis of variance"
                                Df Sum Sq Mean Sq  F value Pr(>F)    
d.srcalign                       7    134      19    0.965  0.455    
d.disp                           4 139577   34894 1762.442 <2e-16 ***
d.lenalign                       7    167      24    1.207  0.295    
d.srcalign:d.disp               28    146       5    0.263  1.000    
d.srcalign:d.lenalign           49      0       0    0.000  1.000    
d.disp:d.lenalign               28     41       1    0.073  1.000    
d.srcalign:d.disp:d.lenalign   196      1       0    0.000  1.000    
Residuals                    45760 905995      20                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

@lukego (Member Author) commented Feb 26, 2016

Handy link: R Tutorial, the best one I found.

@lukego (Member Author) commented Feb 28, 2016

Here is a Google Doc summarizing some more thoughts that my mum shared about this data set. I haven't fully digested them yet.
