This repository contains the figures, tables and source code in the ICS'24 paper: "Accelerated Auto-Tuning of GPU Kernels for Tensor Computations".

HPCRL/Ansor-AF-DS

ICS24_Ansor_AF_DS

Introduction

This repository contains the figures, table data, and source code for the ICS'24 paper "Accelerated Auto-Tuning of GPU Kernels for Tensor Computations".

1. Benchmarks

.
├── benchmarks

Benchmarks for re-collecting the data. The folder contains:

  • bench_roller evaluates the top-50 performance of Roller; TVM and the CUDA Toolkit (> 12.0) are required.
  • [Ansor] source code of Ansor v0.9
  • [Ansor-AF] source code of Ansor-AF
  • [Ansor-DS] source code of Ansor-DS
  • [Ansor-AF-DS] source code of Ansor-AF-DS
  • test contains the scripts for re-collecting the data of Ansor, Ansor-AF, Ansor-DS, and Ansor-AF-DS. Please use the bash script to run the benchmarks.

Create Conda environment

To build TVM from source and install the necessary Python packages, follow these steps:

Create conda environment:

conda create -n ansor python=3.10
conda activate ansor
conda install -c conda-forge xgboost=1.5.0 numpy decorator attrs tornado psutil cloudpickle pandas scipy pytest

The conda environment settings follow the official TVM documentation.

Benchmark script setting and explanation

Use the following command to run tests:

bash run_tests_times_conv.sh conv2d cuda num_of_runs num_sm num_shared_mem network num_trials num_init_states threshold pz_num

For example:

bash run_tests_times_conv.sh conv2d cuda 1 128 48 yolo 5 64 0.6 0

This command runs the YOLO conv2d benchmark on CUDA once (num_of_runs = 1) with the following parameters:

  • num_sm = 128: 128 Streaming Multiprocessors (SMs)
  • num_shared_mem = 48: 48 KB of shared memory
  • num_trials = 5: 5 start points
  • num_init_states = 64: 64 initial configurations for building the model
  • threshold = 0.6
  • pz_num = 0: problem size 0 (omit this argument to test all problem sizes)
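
Because the script takes its arguments by position, it is easy to swap two of them. The sketch below shows one way to compose the command from named values. The function name conv_bench_cmd is hypothetical and not part of the repository, and the function only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): compose the positional
# argument list of run_tests_times_conv.sh from named values.
conv_bench_cmd() {
  local num_of_runs=$1 num_sm=$2 num_shared_mem=$3 network=$4
  local num_trials=$5 num_init_states=$6 threshold=$7 pz_num=${8:-}
  local cmd="bash run_tests_times_conv.sh conv2d cuda"
  cmd="$cmd $num_of_runs $num_sm $num_shared_mem $network"
  cmd="$cmd $num_trials $num_init_states $threshold"
  # pz_num is optional; leaving it out tests all problem sizes.
  if [ -n "$pz_num" ]; then
    cmd="$cmd $pz_num"
  fi
  printf '%s\n' "$cmd"
}

# Prints the example command shown above (YOLO, problem size 0).
conv_bench_cmd 1 128 48 yolo 5 64 0.6 0
```

To run the composed command instead of printing it, the output can be passed to bash, for example with `conv_bench_cmd 1 128 48 yolo 5 64 0.6 0 | bash`.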

Benchmark Ansor-AF-DS

Before proceeding, please make sure that both CUDA and LLVM are installed on your system. You can verify this by running the following commands in your terminal:

llvm-config --version
nvcc --version
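
The two version commands fail with "command not found" when a tool is missing. As a minimal sketch, the same check can be scripted so that all missing tools are reported at once. The helper name require_tool is hypothetical, and cmake is included only because the build steps in this README invoke it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a tool is on PATH.
# Returns non-zero if the tool is missing.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Count how many of the tools used by the build are unavailable.
missing=0
for tool in llvm-config nvcc cmake; do
  require_tool "$tool" || missing=$((missing + 1))
done
echo "missing tools: $missing"
```

A count of zero only means the tools are on PATH; it does not check their versions.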

Build Ansor-AF-DS first:

git clone git@github.com:HPCRL/Ansor-AF-DS.git --recursive
cd Ansor-AF-DS/benchmarks/Ansor_AF_DS

export TVM_HOME=$PWD && export PYTHONPATH=$TVM_HOME/python
mkdir -p build && cd ./build

cp "$TVM_HOME/cmake/config.cmake" ./

sed -i 's/set(USE_CUDA OFF)/set(USE_CUDA ON)/' config.cmake
sed -i 's/set(USE_LLVM OFF)/set(USE_LLVM ON)/' config.cmake

cmake ..
make -j8
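
Note that sed makes no change and reports no error when its pattern does not match, so a config.cmake with different default lines would silently keep CUDA or LLVM disabled. The sketch below is one way to confirm the options before running cmake. The helper name option_enabled is hypothetical, and the check assumes the options appear in the literal form set(OPTION ON) that the sed commands above produce.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file contains "set(<option> ON)".
option_enabled() {
  local file=$1 option=$2
  grep -qF "set(${option} ON)" "$file"
}

# Intended to be run from the build directory after the sed commands.
if [ -f config.cmake ]; then
  for option in USE_CUDA USE_LLVM; do
    if option_enabled config.cmake "$option"; then
      echo "$option is ON"
    else
      echo "$option is not ON; edit config.cmake by hand" >&2
    fi
  done
fi
```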

Then return to the benchmarks folder and run the tests. The following settings are for an NVIDIA RTX 4090; refer to the parameter explanation above and adjust them for your GPU.

cd ../../
bash run_tests_times_conv.sh conv2d cuda 3 128 48 yolo 5 64 0.6
bash run_tests_times_conv.sh conv2d cuda 3 128 48 resnet 5 64 0.6
bash run_tests_times_mm.sh matmul cuda 3 128 48 5 64 0.6

Benchmark Ansor

Build TVM first:

cd Ansor-AF-DS/benchmarks/Ansor

export TVM_HOME=$PWD && export PYTHONPATH=$TVM_HOME/python
mkdir -p build && cd ./build

cp "$TVM_HOME/cmake/config.cmake" ./

sed -i 's/set(USE_CUDA OFF)/set(USE_CUDA ON)/' config.cmake
sed -i 's/set(USE_LLVM OFF)/set(USE_LLVM ON)/' config.cmake

cmake ..
make -j8

Then return to the benchmarks folder and run the tests. The following settings are for an NVIDIA RTX 4090; refer to the parameter explanation above and adjust them for your GPU.

cd ../../
bash run_tests_times_mm.sh matmul cuda 3 128 48 1000 64
bash run_tests_times_conv.sh conv2d cuda 3 128 48 yolo 1000 64 0.6
bash run_tests_times_conv.sh conv2d cuda 3 128 48 resnet 1000 64 0.6

2. Reproduce the variability data

.
├── cal_var

This folder contains the script and data for calculating the variability of Ansor-AF-DS (after 2 minutes and after 1000 trials) and of Ansor (after 1000 trials).

Calculate the variability

python3 calc_var.py

3. Reproduce the figures

.
├── figures

This folder contains the scripts for reproducing the figures in the paper.

Draw all the figures

bash plot.sh

Scatter plot

python3 plot_scatter.py

cuDNN vs. Ansor

python3 cudnn-ansor3090.py 
python3 cudnn-ansor4090.py 

Ablation 1

python3 plot_stack_ablation1.py

Ablation 2

python3 plot_stack_ablation2.py

Performance plot scripts

python3 plot_all_perf_stack3090.py
python3 plot_all_perf_stack4090.py

Variability plot scripts

python3 plot_var_perf_3090.py 
python3 plot_var_perf_4090.py
