Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add GPU autoscheduler #6856

Merged
merged 72 commits into from
Apr 4, 2023
Merged

Add GPU autoscheduler #6856

merged 72 commits into from
Apr 4, 2023

Conversation

aekul
Copy link
Contributor

@aekul aekul commented Jul 19, 2022

This is a draft PR to merge the anderson2021 GPU autoscheduler (#5602). It includes the code (there is some overlap with adams2019, but it will probably be a substantial amount of work to deduplicate them), some tests, utility scripts for generating data/statistics/etc. (these can be removed if desired), and baseline weights (trained on a V100).

@steven-johnson
Copy link
Contributor

Step 1: please run run-clang-format.sh and run-clang-tidy.sh on these and fix the issues :-)

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there really is a lot of overlap with Adams2019 here. Would it be tractable to merge these changes into Adams2019, or at least have shared code between them?

.unroll(r, 8);
A.in().compute_at(prod, r).vectorize(_0).unroll(_1);
B.in().compute_at(prod, r).vectorize(_0).unroll(_1);
if (!auto_schedule) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere: as of #6838, you no longer should examine the auto_schedule GeneratorParam directly; instead, please call the using_autoscheduler() method.


$(BIN)/cost_model/%.a: $(BIN)/cost_model.generator
@mkdir -p $(@D)
$^ -g $* -o $(BIN)/cost_model -f $* target=$(HL_TARGET)-no_runtime auto_schedule=false enable_debug_output=$(ENABLE_DEBUG_OUTPUT) -e stmt,static_library,h,assembly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere: auto_schedule=false is no longer supported and should be removed.

$(BIN)/%/demo.a: $(GENERATOR_BIN)/demo.generator $(BIN)/libautoschedule_anderson2021.$(SHARED_EXT)
@mkdir -p $(@D)
HL_WEIGHTS_DIR=$(SRC)/baseline.weights \
$(GENERATOR_BIN)/demo.generator -g demo -o $(@D) -f demo target=$* auto_schedule=true -p $(BIN)/libautoschedule_anderson2021.$(SHARED_EXT) -s Anderson2021
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and elsewhere: auto_schedule=true -s Anderson2021 is no longer supported syntax. Instead, specify autoscheduler=Anderson2021 to replace both of these.

@aekul
Copy link
Contributor Author

aekul commented Aug 18, 2022

Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.

Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.

@steven-johnson
Copy link
Contributor

Some of the files (e.g. ASLog.h/.cpp, PerfectHashMap.h, and a few others) are identical and could be merged. Would autoschedulers/common be the right place for them? Looks like ASLog.h/.cpp are already there.

Yep!

Unfortunately, most of the other files (e.g. Autoschedule.h/cpp, LoopNest.h/cpp, etc.) have diverged considerably and it would be a significant amount of work to merge them.

No worries then.

@aekul
Copy link
Contributor Author

aekul commented Feb 16, 2023

@steven-johnson The buildbots are green now

Copy link
Contributor

@steven-johnson steven-johnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty great, thanks for doing all this work to get this landed! Just a handful of mostly style nits and such, with a few relatively minor things that need addressing (and a few things that would be nice for someone to work on as a followup).

IntrusivePtr<State> optimal_schedule(int beam_size);
};

#ifdef HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API is now removed in Halide main, so this should just be removed here and elsewhere. (We can do this as a followup PR if you like.)

That said, here are the (legacy) env vars you can still use when HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API
is defined:

HL_BEAM_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: since HALIDE_ALLOW_LEGACY_AUTOSCHEDULER_API is now removed in Halide main, most of these env vars are probably no longer actually in use (settings would be controlled by GeneratorParams instead), so this should probably get pruned (subsequent PR is fine)


typedef PerfectHashMap<FunctionDAG::Node::Stage, ScheduleFeatures> StageMapOfScheduleFeatures;

void find_and_apply_schedule(FunctionDAG &dag, const std::vector<Function> &outputs, const Anderson2021Params &params, const Target &target, CostModel *cost_model, int beam_size, StageMapOfScheduleFeatures *schedule_features);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Style nit: this is an awfully long line

"Incorrect size for pipeline features");
int num_stages = 0;
for (const auto &n : dag.nodes) {
if (!n.is_input) num_stages += (int)n.stages.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

style nit, here and elsewhere: generally, Halide always encloses if clauses in braces, even for a single line:

if (...) {
   ...
}

*(cost_ptrs(i)) = dst(i);
if (std::isnan(dst(i))) {
any_nans = true;
aslog(0) << "Prediction " << i << " is NaN. True runtime is " << true_runtimes(i) << "\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, we prefer the autoschedulers to be 'quiet' by default -- do we really want/need these to be aslog(0) (vs aslog(1) etc)?

StageMap<StageMap<bool>> descendants;
root->get_stages_computed_in_each_compute_root_loop(descendants);

aslog(0) << "BEGIN compute locations\n";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be aslog(0)?

namespace Internal {
namespace Autoscheduler {

#define MAX_THREADS_PER_BLOCK 1024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefer constexpr int for consts like this

#include <cstdint>
#include <vector>

using std::vector;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never, ever, ever put a using into a .h file in the global namespace.

# benchmarked serially.
BATCH_SIZE=80
EPOCHS=200
NUM_GPUS=$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, definitely, NVidia is not the only GPU vendor out there :-)

echo "Predict only mode: ON"
fi

DEFAULT_SAMPLES_DIR_NAME="${SAMPLES_DIR:-autotuned_samples}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hard-to-parse error messages are not good, please fix :-)

}

cursor = 0;
cost_per_stage_ptrs.clear();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have been wrong, but it seems that the buildbots are not running the GPU autoscheduler related tests?
For example, I pulled this branch and tried running anderson2021_test_apps_autoscheduler (i.e., test/autoschedulers/anderson2021/test.cpp) but it instantly fails, complaining that access to cost_per_stage_ptrs is out of bound. I commented this line (cost_per_stage_ptrs.clear();) and it seems to work again.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have been wrong, but it seems that the buildbots are not running the GPU autoscheduler related tests?

That's correct and as-intended, since this branch hasn't landed yet. The buildbots usually aren't configured to test major new features before they have landed. This PR is close, but has a number of nits that need fixing first.

Comment on lines 220 to 232
// Make the vectorized dimension of the inner loop 32 (or as
// close as possible)
int64_t inner_extent = std::min(c->size[vectorized_loop_index], (int64_t)32);

if (c->stage->index == 0) {
vector<int64_t> tiling(c->node->dimensions, 1);

// Split into parallelized and serial
c = c->parallelize_in_tiles(tiling, loop_nest, params, target, true, false);

if (vectorized_loop_index >= 0) {
tiling[vectorized_loop_index] = inner_extent;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that it is not checked for whether vectorized_loop_index == -1 here in line 222 (and it seems that -1 is a legal value here because of the subsequent check in line 230), and leads to a crash when running test/autoschedulers/anderson2021/test.cpp. I changed

int64_t inner_extent = std::min(c->size[vectorized_loop_index], (int64_t)32);

to

int64_t inner_extent = vectorized_loop_index != -1 ? std::min(c->size[vectorized_loop_index], (int64_t)32) : (int64_t)32;

to fix this but I am not sure if this is right.

@aekul
Copy link
Contributor Author

aekul commented Apr 4, 2023

@steven-johnson Thanks for the review! Made all the requested changes.

Thanks to all the others who left comments/suggestions too.

@aekul aekul merged commit df354c5 into halide:main Apr 4, 2023
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be a duplicate of apps/images/rgb_small.png

@@ -6,10 +6,10 @@ using namespace Halide;

class ConvRelu : public Halide::Generator<ConvRelu> {
public:
Input<Buffer<float, 4>> input{"input"};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes look unintentional (same for the file below)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed here: #7475

@steven-johnson steven-johnson added the release_notes For changes that may warrant a note in README for official releases. label Apr 4, 2023
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
Add Anderson2021 GPU autoscheduler
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buildbot_test_everything Buildbots should run all available tests on this PR (unless build_test_nothing is set). release_notes For changes that may warrant a note in README for official releases.
Projects
None yet
Development

Successfully merging this pull request may close these issues.