Suggestions for the slideshow for day 1 of the ESiWACE3 training #1

wmotion opened this issue Sep 20, 2024 · 0 comments

Slides for the training "GPU Optimization with Kernel Tuner", day 1 (12 Sept 2024)

On this issue

  1. I am following up on the invitation, extended during the training, to provide feedback. The permalink to the document is https://github.com/KernelTuner/kernel_tuner_tutorial/blob/d33bd6db03958667a6a71e89750b710f0e70203e/slides/2024_VSC_ESiWACE3/GPU%20Optimization%20with%20Kernel%20Tuner%20-%20Day%201.pdf.
  2. Purpose of the notes. These notes propose possible improvements for attendees who are (growing) experts in a topical field, have been exposed to computer science mostly empirically, and whose working memory does not hold all the prior knowledge the training assumes. These learners will also naturally come back to the same slides as study material for whatever was not clear on the fly during the training. This persona is, I presume, close enough to the profile of the training attendees. I would expect these notes to prompt a slide revision rather than individual answers to every point.
  3. Stage of development of these notes. I wrote them as I was studying the slides after the training. I opened this issue in the repo before completing my own study, precisely to track and highlight where something can prove unclear on first contact. In this way, the suggestions should generalise to a wider range of learners with varying familiarity with the training topic. Lastly, I might revisit this document a few more times as staircase wit pushes me to do so; for a heavier engagement in the revision work we may discuss a collaboration on other terms.
  4. Scope of remarks. Most of the remarks concern word choices. Some also concern the order of the slides and the order of topics within a single slide. Some questions are rhetorical in the sense that they express doubts that arose while reading the slides. Lastly, many things in the slides work just fine; I do not mention them for lack of time, not of appreciation. This bias is admittedly unfair to the instructors.
  5. Content organisation. The sectioning below mirrors the sections of the slide set. The low-level headings give the number of the slide in question.
  6. Notes are shared as is, like the slides themselves. I hope every little helps, although realistically you may have to triage which little actually helps.

Introduction

6 Learning objectives

  1. "Integrate the tuning results into your application": I do not think the training day covers this topic.

Introduction to Auto-Tuning and Kernel Tuner

8 Auto-tuning GPU applications

  1. I would arrange the bullet list in a more tangible order, for example from the smallest unit of programming (the thread) to the largest combinations of relevant notions.
  2. 'Loop unrolling factors': delete factors if you don't cover such factors later. The word factor is prone to ambiguity too, because different disciplines use it differently.
  3. Take note that the phrase design space morphs into search space further on. Learners should be aware of the change of meaning. Managing that difference carefully may also offer opportunities for useful distinctions. Until we speak of optimisation, I would stick to design space for auto-tuning problems.

9 Manual optimization versus auto-tuning

  1. Take note of this first use of template, a word that can have several meanings in computing. There should be no confusion for the learners. I assume that here 'templated' means having generic parameters that are tunable. This issue returns below.
  2. Describe or illustrate briefly what a code generator is. The term is technical, and illustrations (in words) can be useful. Wikipedia does not help single out a simple meaning for it.
  3. Take note of the first use of benchmark. Benchmarking is easily understood as a comparison against a benchmark: a touchstone, a reference case, some ground truth. Here you seem to use it to mean 'measuring'. The issue returns below.

10 Large search space of kernel configurations

  1. Please add in the title (either the slide or the graph) how many configurations have been tested here.

13 Kernel Tuner is:

  1. A hint that optimisation methods are discussed later in the day would give a sense of perspective.
  2. The notion of development-time does not fit this region of text clearly: is that a time when no compiling occurs? Is it possible to tune empirically before computation?
  3. Software based is a meagre description here and perhaps obvious: this line is an opportunity to hint at more to come (written in Python, and so on...)
  4. Easy to integrate is not explained in these slides, I think. Fine to mention this property in this one list, though.
  5. Discrete parameters made me wonder whether I should watch out for non-discrete parameters in numerical computing. Or do you mean that the tunable parameters must be given as a collection of discrete items (lists, tuples, and so forth)? Are we talking maths or computer science here? See the sketch below for my reading.
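
On that last point, my understanding is that the tunable parameters are indeed passed as collections of discrete values (Python lists), whatever the mathematical nature of the underlying quantity. A sketch, with hypothetical parameter names:

```python
# Tunable parameters are enumerated as discrete values in Python lists,
# even for quantities one might think of as continuous.
tune_params = {
    "block_size_x": [32, 64, 128, 256, 512, 1024],
    "tile_size_x": [1, 2, 4, 8],
}
```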

14 Kernel Tuner architecture

  1. Take note of the use of the word function here. Troubles of ambiguity ahead.
  2. The backend cuda-python is not mentioned again after this slide. Is it redundant here (then delete it) or is it omitted afterwards (then touch upon it again)?

15 Minimal example

  1. This slide deserves to be copied into the next section, to support the explanation there within close range.
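
For ease of reference while reading the notes below, here is roughly what such a minimal example looks like (adapted from the Kernel Tuner documentation; details may differ from the slide):

```python
import numpy as np
import kernel_tuner

kernel_string = """
__global__ void vector_add(float *c, float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

size = 10000000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)
args = [c, a, b, n]

# block_size_x doubles as the thread block dimension in the x-direction
tune_params = {"block_size_x": [32, 64, 128, 256, 512, 1024]}

results, env = kernel_tuner.tune_kernel("vector_add", kernel_string,
                                        size, args, tune_params)
```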

16 What Kernel Tuner does

  1. Whose instance? Instance is a potentially very ambiguous word as soon as it intersects with OOP terminology. Is 'parameter configuration' a valid alternative?
  2. Take note of benchmark occurring. (I will not underline all occurrences of it.)

17 Installation...

  1. See slide 14. This is the first place where cuda-python is missing.
  2. You could add a pointer here about finding other GPUs elsewhere, so that attendees can work out the hands-on exercises after the training.

Integrating Kernel Tuner with your Code and User-Defined Metrics

22 Kernel Tuner compiles...

  1. Kernel Tuner is written in Python, but you also show some C++ code on slide 79. If this alternative to Python exists, this slide is the place to prepare the learners.

23 Specifying Kernel source code

  1. There is no information about the most frequent positional argument, tune_params, and that surely deserves room.
  2. Take note of a second usage of the word function.
  3. Explain (hint at) what a templating engine is, for the uninitiated.
  4. Instead of option I suggest optional argument.
  5. The lang option should be formatted consistently, settling on either the code font or the text font.
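
To make point 1 concrete, and assuming I read the API correctly, the kernel source can be given in several forms and lang is an optional argument; a sketch:

```python
# The kernel source can be a code string (as in the minimal example),
# a filename, or a function that generates code. The language is normally
# inferred from the code; the optional argument lang can force it.
results, env = kernel_tuner.tune_kernel(
    "vector_add",
    "vector_add_kernel.cu",  # a filename this time, not a code string
    size, args, tune_params,
    lang="CUDA",
)
```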

24 Kernel compilation

  1. Template expansion: do you mean templates as in C++, a template in a generic sense, or precisely the templating that KT offers? Expansion could also be explained to the uninitiated with a synonym (substitution, for example).
  2. Take note that host code is ambiguous. Which code? The code the kernel was originally part of? The Python code the kernel gets embedded into because we use KT? Or both? This issue returns elsewhere, although I will not signal all occurrences.

26 Kernel argument types

  1. First sentence. It sounds strange that Python cannot reconstruct something of NumPy's. Rephrase, and perhaps clarify 'reconstruct'.
  2. Last sentence. Is the performance recommendation general because it makes KT run robustly, or because GPUs work best that way regardless?
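
For what it is worth, this is the argument preparation I understand the slide to ask for: NumPy types that match the C types of the kernel signature exactly, scalars included:

```python
import numpy as np

# Kernel signature: __global__ void vector_add(float *c, float *a, float *b, int n)
size = 1000000
a = np.random.randn(size).astype(np.float32)  # float* expects float32, not float64
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)  # scalars must be NumPy scalars, not plain Python ints

args = [c, a, b, n]  # order must match the kernel signature
```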

27 Summary

  1. This summary contains information that has not been presented before; hence it is not a summary. Would guidelines be a better name?
  2. linkage sounds abruptly technical. Do you have a better term or paraphrase for it?
  3. on-the-fly wrapper generation is obscure. Rephrase as something like KT gives limited support for wrapping templates on the fly with [specify what]?
  4. The ambiguity of template and host code returns when explaining CuPy.
  5. Whose templates, by the way?
  6. Note that, after saying (correctly) "isolate device code from host code to simplify separate compilation", host code has to be defined precisely each time to avoid confusion.

Grid and thread block dimensions

29 CUDA Thread Hierarchy

  1. Linguistic ambiguity in fixed-sized blocks. Don't we in fact tune their size (see slides 30, 32)? Are they fixed as machine presets by the vendor? Are they fixed at compile time by the user? Hint: use 'set' instead of 'fix'?

31 Why do thread block....

  1. Opt for writing either threads block or thread block consistently throughout the document.
  2. SMs. If their independence is full, it sounds odd that they contain "thread block slots", which relate to the super-unit of the SMs themselves.
  3. What do you actually mean by independence? In the sequel you play around with the threads: perhaps with their independence too? The wording has a bearing on the content to come.
  4. What should a learner think of when reading slots?

32 Specifying thread block dimensions...

  1. params is a new variable name; is this a reserved KT word or a generic placeholder for something else that is left unspecified?
  2. Can you make a little space in the slide to illustrate the usage of block_size_name?
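
Regarding point 2, and assuming the optional argument is the one I know from the documentation (block_size_names, plural), the usage looks like this sketch with hypothetical names:

```python
# A kernel that calls its block dimensions something other than the
# default block_size_x/y/z can declare its own names
# (kernel_string, size, args as in the minimal example above):
tune_params = {"BLOCKDIM_X": [32, 64, 128, 256]}

results, env = kernel_tuner.tune_kernel(
    "my_kernel", kernel_string, size, args, tune_params,
    block_size_names=["BLOCKDIM_X"],  # replaces the default "block_size_x"
)
```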

33 Specify thread block...

  1. What kind of access action do you have in mind? I feel you mean that you 'set' them up here, not that you 'get' them.
  2. compile-time sizes multi-dimensional data is cryptic and perhaps ungrammatical. Did you mean that the compiler reserves the shared memory based on the block dimensions, or something like that? A sketch of my reading follows this list.
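
My own reading, sketched below: because the tunable parameters are inserted as #define constants, they are compile-time constants, and only such constants can size a (multi-dimensional) shared-memory array. Bounds checks omitted:

```python
kernel_string = """
__global__ void copy_tile(float *out, const float *in, int width) {
    // block_size_x and block_size_y are #define'd by Kernel Tuner, hence
    // compile-time constants: they can size multi-dimensional shared memory.
    __shared__ float tile[block_size_y][block_size_x];
    int x = blockIdx.x * block_size_x + threadIdx.x;
    int y = blockIdx.y * block_size_y + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();
    out[y * width + x] = tile[threadIdx.y][threadIdx.x];
}
"""
```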

36 Specifying grid dimensions

  1. problem_size is an important positional argument and is not explained clearly here. "Describes the dimensions across which threads are created" is very generic: the drawing also suggests that problem_size is the size of the grid used to process the input data, which is more tangible. Please rephrase.
  2. Make the distinction between the input data size and problem_size clear (see also the sketch after the notes on slide 38 below).

37 Grid divisor lists

  1. This topic is unclear.
  2. Since this is an optional argument, you should first mention what its default value is and what it means. As a learner, I wonder why I would ever need to disable this optional feature in the first place.
  3. I would use optional argument instead of optional parameter for global consistency.
  4. This slide breaks the reading flow because it splits apart two slides on the same topic, problem_size. Please swap it with slide 38.

38 problem_size

  1. 'use strings' sounds like a shorthand for defining a dictionary key. This uncertainty is compounded by the fact that tune_params has not been introduced in/around slide 23.
  2. The last code line illustrates grid_div_x but is unclear for the same reasons as slide 37; the sketch below shows how I eventually understood the divisor lists.
  3. What is reduction.py? What makes this new entity relevant to this slide? Unclear for the learner.
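
For the record, this is how I eventually understood problem_size and the grid divisor lists (a sketch; tile_size_x is a hypothetical tunable parameter, and kernel_string, args, tune_params are as in the minimal example above):

```python
# problem_size: the number of points the thread grid must cover, per dimension
problem_size = (1024, 512)

# By default, grid_size_x = ceil(problem_size[0] / block_size_x).
# A grid divisor list generalises the divisor: the grid dimension becomes
# problem_size divided by the product of the named parameters' values.
results, env = kernel_tuner.tune_kernel(
    "my_kernel", kernel_string, problem_size, args, tune_params,
    grid_div_x=["block_size_x", "tile_size_x"],  # e.g. ceil(1024 / (128 * 4))
    grid_div_y=["block_size_y"],
)
```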

User-defined metrics

41 User-defined metrics

  1. Code. What is the variable (dictionary key, I presume) p? Is it a shorthand that the instructors use just in these slides, or always in KT? Is it a shorthand that the learners can use too, without further caution? This returns later in slide 90.
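
For comparison, the pattern as I understand it from the documentation: p is just the name of the lambda's parameter, bound to a dictionary holding the tunable parameters and the measured quantities, so any identifier would work:

```python
from collections import OrderedDict

metrics = OrderedDict()  # the documentation asks for an ordered mapping
# KT calls each metric with a dict (here named p) containing the tunable
# parameters, the measured time (in ms), and earlier computed metrics.
metrics["GFLOP/s"] = lambda p: (size / 1e9) / (p["time"] / 1000.0)

results, env = kernel_tuner.tune_kernel(
    "vector_add", kernel_string, size, args, tune_params,
    metrics=metrics,
)
```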

GPU Optimization and the Search Space

44 Title slide

  1. Search-Space Restrictions is fairer to the content and gives the title more balance.
  2. Recalling the remark on slide 10, I would still speak of design space restrictions since no search is in order here, because we are still optimizing the GPU performance by code manipulation.
  3. Are these operations really optimisation in the sense that will be invoked soon? For example, are we minimising a cost function? Is accelerating maybe a better term?

45 GPU code optimization

  1. I suggest this order for the bullet points: 2-1-3-4. The notion of tunability is new and needs to be introduced as some kind of ease of tuning things.
  2. Note the use of parameters here as distinct from the arguments of the KT calls. While conventional, this distinction helps separate two contexts in these slides.

46 Overview of GPU Optimizations

  1. The link to the paper https://dl.acm.org/doi/full/10.1145/3570638 would be timely here; you only give it later, on the second training day.

48 (Partial) Loop Unrolling

  1. The partial in the slide title is not explained in the slide body, so remove it if it is not essential; it creates a false expectation of detail.
  2. Code generator and compiler appear to be two different things, even though compilers are about code generation. Possible confusion should be forestalled; see also slide 9.
  3. What is the practical difference between not allowed and disabled in the explanation of #pragma unroll <value>?
  4. It is clear that KT inserts parameters with #define. But should this be remembered because it has been noted before (where?), or because it is important information in itself? In the latter case, slide 33 already touches upon #define in KT, so closing the topic of parameter passing there makes the story line more compact.
  5. Understanding the handling of loop_unroll_factor_ still requires substantial reflection. To connect with the previous lines, it must be a variable of type integer, so it ought to be declared earlier. But the example on the following slide does not show an assignment line, which defeats that expectation. There, the loop variable k has been appended to the name loop_unroll_factor_, but this construction could be presented in a more orderly fashion. This topic needs rephrasing.
  6. And what is unrolling in the first place? You rightly focus on the why and how, but the what-question matters to the learners too; a sketch of my understanding follows this list.
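
For completeness, my sketch of the what and of the loop_unroll_factor_ mechanism. The special handling of parameter names starting with loop_unroll_factor_ is the documented behaviour as I understand it, but do verify:

```python
# What unrolling is: the compiler replicates the loop body so that fewer
# iterations (and loop-condition checks) are executed. Conceptually,
#     for (int k = 0; k < 8; k++) { sum += a[k]; }
# unrolled by a factor 2 becomes
#     for (int k = 0; k < 8; k += 2) { sum += a[k]; sum += a[k + 1]; }

kernel_string = """
__global__ void row_sum(float *out, const float *a, int n) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    #pragma unroll loop_unroll_factor_k
    for (int k = 0; k < n; k++) {
        sum += a[row * n + k];
    }
    out[row] = sum;
}
"""

# KT substitutes the value into the pragma; the value 0 removes the
# pragma altogether, leaving the unrolling decision to the compiler.
tune_params = {"block_size_x": [64, 128], "loop_unroll_factor_k": [0, 2, 4, 8]}
```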

49 Partial loop unrolling

  1. This loop unrolling sounds as though it is really partial, because here the word is not between parentheses in the slide title. But this partial quality has not been explained or indicated either here or in slide 48.
  2. Bring the note on loop_unroll_factor_k to the top of the slide, or to the previous slide, for compactness?
  3. This slide, unlike the following one, does not show a baseline code and a modified code. That would be useful here too.

50 Reducing register usage

  1. Why? I doubt that a limited resource limits occupancy. I suggest saturate occupancy instead.
  2. How? Compiling is the operation of creating executable files from source code, while keeping values in registers is an operation of memory management. Doing the one "rather than" the other sounds like mixing apples and pears: how the one is an alternative to the other is unclear in the first place. Do you perhaps mean that constant values are better set at compile time than left as tunable parameters in a template? Would that imply the message 'do not tune too much'? A sketch of this reading follows the list.
  3. How? Note the ambiguity of templates. Do you mean the templates of KT, which normally work with tunable parameters, or templates of another kind and origin? (Incidentally, to repeat as a hangover cure: a clear terminology seems to be that the tunable parameters are an argument of KT methods.)
  4. How? Limiting and disabling loop unrolling runs counter to the speed-up technique shown on the previous slides. So that technique has an important counterproductive effect, which could have been highlighted early, on slide 48, as one more bullet point.
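
A sketch of the reading proposed in point 2, purely my interpretation: a value passed as a kernel argument occupies a register, whereas the same value fixed at compile time (for instance as a single-valued tunable parameter that KT turns into a #define) does not:

```python
# Run-time variant: factor arrives as an argument and lives in a register.
kernel_runtime = """
__global__ void scale(float *x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = factor * x[i];
}
"""

# Compile-time variant: FACTOR is #define'd by KT, so the constant can be
# folded into the instructions instead of occupying a register.
kernel_compiletime = """
__global__ void scale(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = FACTOR * x[i];
}
"""
tune_params = {"block_size_x": [128, 256], "FACTOR": ["2.0f"]}
```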

51 Reducing register usage

  1. block_per_sm is a new, unexplained variable.
  2. You could refer to these two code snippets as 'baseline' and 'variant', for example, and use this glossary everywhere else to direct the learners' attention to where the focus is.

52 Varying work per thread

  1. Title. Is the proposed strategy about varying (up and down, as in the slide title) or only increasing (up, as in the slide body)? In the latter case there is no need to make the title vaguer. The stress question is whether one should increase the work per thread to exploit data reuse and locality; the correct terminology follows from the answer.

53 Varying work per thread

  1. Introduce which calculation we are working with here. To me this looks like calculating a single entry of a product matrix.
  2. Introduce why #pragma unroll is relevant here, all the more because it is a technique already explained (although you did not say what it does).
  3. The name global_size_x in the picture is new. Previously the same object was named problem_size[0]. Please link the two notions or use consistent terminology.
  4. Explain that we increase the work per thread by borrowing values from other blocks, if I understood the code snippet correctly.
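
If I read the snippet correctly, the canonical KT way to increase the work per thread is a tiling parameter that also divides the grid, so each thread processes several elements; a sketch with the hypothetical name tile_size_x:

```python
tune_params = {
    "block_size_x": [32, 64, 128],
    "tile_size_x": [1, 2, 4, 8],  # elements handled per thread
}

# With tile_size_x elements per thread, fewer blocks are needed:
# the grid shrinks by the same factor.
results, env = kernel_tuner.tune_kernel(
    "vector_add", kernel_string, size, args, tune_params,
    grid_div_x=["block_size_x", "tile_size_x"],
)
```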

54 Vectorization

  1. The what question is lacking. Add one line saying what a vector is in computing. I presume that presenting it from the hardware perspective is better. The term has widely different interpretations across disciplines.
  2. Why? Explain with a short phrase what memory throughput is. Specifying the units of measurement can help to the same effect.
  3. How? The points are short and cryptic. The how-question can cover two kinds of answers of interest to the learners: (1) how the user writes the code to reap that benefit; (2) how the computer implements that technique. It would be nice if the instructors hinted at both. This suggestion on the how-question applies to the previous slides on the other techniques as well.
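
On the user-side how, a sketch of what I presume the slide alludes to: CUDA's built-in vector types, which make one memory transaction move 128 bits instead of 32:

```python
kernel_string = """
// Each thread loads/stores a float4, i.e. four floats in one transaction.
__global__ void vector_add_vec4(float4 *c, const float4 *a,
                                const float4 *b, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 va = a[i];
        float4 vb = b[i];
        c[i] = make_float4(va.x + vb.x, va.y + vb.y,
                           va.z + vb.z, va.w + vb.w);
    }
}
"""
```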

Output verification

56 Programming tunable application

  1. Source is ambiguous; I presume you mean the source code. Is this a viable rephrasing? Tunable source code lets you maintain many different versions of the same program in a single document, or Tunable parameters let you maintain many different versions of the same program in a single source code.

57 Output verification

  1. First inform the reader that answer is an optional argument of KT.
  2. Reference can have a special meaning in programming and create confusion (for example, when the instructors speak freely about coding). As alternatives I suggest baseline, touchstone, yardstick.
  3. Benchmarking becomes ambiguous and confusing here; a recurring issue. Do you understand benchmarking as the operation of assessing the correctness of a calculation or of measuring its performance? The phrase before benchmarking muddles what happens when.
  4. Why is KT running the kernel just once here? Take note that benchmark as in ground-truth value returns in slide 65 (on caching).
  5. Note that answer is structured like the set of kernel arguments and cannot be compared with an output as such, unless you say earlier that the baseline results are passed as items of answer.
  6. Are the kernel arguments that don't need verification always the input data? In that case you could simply say that we use None for the kernel input arguments.
  7. In the nested list, the order should be 1, 2, 4, 3, so that all items about answer come one after another and the subtopic is rounded off neatly.

58 Simple answer example

  1. answer=answer works but can cause confusion, surely when the instructors speak during the training. I propose something like answer=baseline, where the optional argument and the variable have different names. This guideline is already applied, to clean effect, in slide 62.
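
Concretely, combining the points from slides 57 and 58 with the naming I propose (a sketch; args = [c, a, b, n] as in the minimal example above, so only the output is verified):

```python
# Reference output computed once on the CPU; None marks arguments
# that need no verification (here: the inputs a, b and the scalar n).
baseline = [a + b, None, None, None]

results, env = kernel_tuner.tune_kernel(
    "vector_add", kernel_string, size, args, tune_params,
    answer=baseline,  # optional argument and variable now have distinct names
)
```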

Search space restrictions

61 Dependent parameters example

  1. Title. What do dependent parameters depend on here? What explains the dependence in point? Recall that dependent variable has a mathematical ring to it that the learners are probably more familiar with.
  2. Consider simplifying the last sentence into blocks do not contain more than 512 threads each?
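
For concreteness, a sketch of how I understand such a dependence to be expressed (restrictions prune the search space before anything is compiled or measured; kernel_string, problem_size, args as above):

```python
tune_params = {
    "block_size_x": [16, 32, 64, 128],
    "block_size_y": [1, 2, 4, 8, 16],
}

# Configurations violating the expression are discarded up front:
# blocks do not contain more than 512 threads each.
restrictions = ["block_size_x * block_size_y <= 512"]

results, env = kernel_tuner.tune_kernel(
    "matmul", kernel_string, problem_size, args, tune_params,
    restrictions=restrictions,
)
```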

Caching tuning results

64 Caching

  1. Recall that the learners may be more familiar with the other meaning of cache, as in memory hierarchies. Please reassure them that caching = saving = backing up here.
  2. I suggest the consistent usage of 'optional argument' for option.
  3. The ambiguity of benchmarking strikes again here, since it could mean either the comparison with a reference simulation (validation language) or the measurement of performance in executing a configuration of the kernel (profiling-like language). At this point in the slide set, we have spoken of both meanings.
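
In code, the caching-as-saving reading looks like this sketch:

```python
# KT appends every measured configuration to a JSON file; on a rerun it
# skips configurations already present (caching = saving here, not the
# memory-hierarchy kind of cache).
results, env = kernel_tuner.tune_kernel(
    "vector_add", kernel_string, size, args, tune_params,
    cache="vector_add_cache.json",
)
```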

68 Performance portability

  1. 'The property that...' is unfinished writing. I suggest something like: An application with similar performance on different hardware is said to be portable.
  2. Application is a new salient word showing up here. You have already used code or program in a similar sense, I think. So, unless there is a purpose in introducing this new term in a special sense, I would stick to code or program.
  3. In 'we select a kernel based on...' I have the impression that you merely select a kernel configuration rather than pick a new kernel.
  4. A word is missing (added in parentheses here): "can be done (at) compile-time or run-time".
  5. 'earlier obtained tuning results' -> 'tuning results obtained earlier'.

69 store_results

  1. Function is better described as a 'method' in this Python-savvy context.
  2. Write store_results() with empty parentheses to underline that this is a method, as already done in slide 23, for example.
  3. Because of a muddled order of presentation, it is unclear that env is an optional argument of KT, while that new item could be usefully introduced here.
  4. Take note that problem_size returns here. Defining it more clearly above pays off in the long run.
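
For reference, my reconstruction of the call from the documentation (treat the exact signature as an assumption and check the API docs):

```python
from kernel_tuner.integration import store_results

results, env = kernel_tuner.tune_kernel("vector_add", kernel_string,
                                        size, args, tune_params)

# store the best configurations per GPU in a results file for later use
store_results("vector_add_results.json", "vector_add", kernel_string,
              tune_params, size, results, env, top=3)
```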

70 Compile-time kernel selection

  1. Are you actually selecting a kernel here? It looks like you are primarily selecting the GPU you run the kernel on and, hence, just an optional argument or a configuration parameter of the same tunable kernel.
  2. Host and application are both prone to ambiguity. What is the host application here? (a) the Python script calling KT? (b) the original code from which one has to isolate the kernel each time, according to the best practice you suggested from the start?

71 Compile-time kernel selection example

  1. create_device_targets() is a new method that deserves an introduction, which is currently lacking.
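
My understanding, to be verified against the documentation: the method turns a stored results file into a C header with the best configuration per GPU, roughly:

```python
from kernel_tuner.integration import create_device_targets

# generate a header that #defines the best parameters for each GPU
# found in the stored tuning results (signature assumed, please verify)
create_device_targets("vector_add_targets.h", "vector_add_results.json")
```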

76 Compile-time kernel selection example

  1. First appearance of make. Many users may know what a Makefile does but not how it works, and some may compile the code from the command line without make. Please consider a new line of text that explains whether the snippet shows that make is necessary (because it does something irreplaceable) or that we need to pass some specific flags (options, by the way) to the nvcc compiler.

77 Run-time kernel selection

  1. That the programming language of the host application can vary is breaking news here. If the host application is KT, then its language is just Python, as said earlier in slide 22. If the host application is where the kernel was embedded originally, its language should no longer be relevant, because you recommend moving the kernel into a Python script. The story line of the slides seems to have a hiccup here. Please repair.

78 Run-time kernel selection in Python

  1. Comment on (and/or highlight) which code lines are the same as in the compile-time selection, and which are specifically edited to implement the run-time selection instead.
  2. "Python host application" is clear here, and is an example of the clarity that can be achieved elsewhere.
  3. Spend a line to explain the new feature, kernelbuilder, that you import.
  4. In the PythonKernel() call, do the names belong to the arguments or to argument instances? To present the argument names as in the API manual, you could use a commented line before the call. As a learner, I want this example to continue at the same level of detail as the other snippet at the top of the page.

79 Run-time kernel selection in C++

  1. Using KT from C++ has not been announced earlier and arrives as a completely new possibility. If I got this point right, this capability should be announced earlier among the general features of KT, while here you should explain in a few lines what one should take notice of in the slide.
  2. Who/what "uses the Kernel Launcher header-only C++ library"?

Optimization strategies

81 Large search space of ...

  1. Same remark on the graph as in slide 10. That design space now becomes a search space, in my view.

82 Optimization strategies in Kernel Tuner

  1. This slide could be merged with slide 85 below. This looks like an open-ended list, while the implementations in KT, as expected from the title, are set and given, I presume.

83 Speeding up auto-tuning

  1. Is the time on the ordinate the time that each optimization strategy took to find the best-performing configuration already found by the brute-force approach?
  2. In other words, do the GFLOP/s refer to (a) the best-performing configuration picked by each optimisation strategy, or (b) the time consumed by the optimisation strategy to find the brute-force configuration? The question arises insofar as optimisation strategies approximate the outcome of brute-force auto-tuning.

84 Your mileage may vary

  1. Please unravel GEMM (general matrix-matrix multiplication).

85 How to use a search strategy

  1. basinhopping should be split into basin hopping.
  2. Is basin hopping a local optimisation method as I glimpse here, or a global optimisation method as I glimpse in slide 82?
  3. Writing tune_kernel() with empty parentheses will show that it is a method.
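
For illustration, the usage as I understand it (the strategy name and option are those I recall from the documentation; verify the spelling):

```python
results, env = kernel_tuner.tune_kernel(
    "vector_add", kernel_string, size, args, tune_params,
    strategy="basinhopping",               # KT's one-word spelling
    strategy_options={"max_fevals": 100},  # budget of configurations to try
)
```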

Observers

87 Observers introduction

  1. Whose behavior?
  2. Is benchmarking here tuning or validating? A known linguistic vulnerability.

88 Observer base class

  1. The same potential ambiguity with benchmarking occurs in def get_results(self). I am afraid this linguistic vulnerability may affect the software documentation.
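
To make the discussion tangible, a toy subclass along the lines of the documented interface (hook names as I recall them; RunCounter is my own hypothetical example):

```python
from kernel_tuner.observers import BenchmarkObserver

class RunCounter(BenchmarkObserver):
    """Counts how often each configuration is benchmarked (= measured)."""

    def __init__(self):
        self.runs = 0

    def after_finish(self):
        # called after every benchmark run of the current configuration
        self.runs += 1

    def get_results(self):
        # called once per configuration; the dict is merged into the results
        results = {"benchmark_runs": self.runs}
        self.runs = 0
        return results

# usage: kernel_tuner.tune_kernel(..., observers=[RunCounter()])
```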

90 NVMLObserver example

  1. See remark on p in slide 41.

Closing Remarks

94 Learning objectives

  1. Same remarks as slide 6.