-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC]FamilySeer: A new search method for Auto-scheduler #57
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,212 @@ | ||
- Feature Name: (FamilySeer: A new search method for Auto-scheduler) | ||
- Start Date: (2021-01-07) | ||
- RFC PR: [apache/tvm-rfcs#57](https://github.com/apache/tvm-rfcs/pull/57) | ||
- GitHub Issue: [apache/tvm#9875](https://github.com/apache/tvm/pull/9875) | ||
|
||
# Summary | ||
[summary]: #summary | ||
|
||
We propose FamilySeer, a new search method that optimizes search efficiency and quality of the Auto-scheduler. We introduce several features: | ||
|
||
- FamilySeer exploits the subgraph similarity to form a collection of subgraph families and constructs cost models at subgraph family basis to improve cost model accuracy. | ||
- We enable subgraphs within each family to share the search results within each tuning iteration, avoiding costly code measurements on real hardware and thus accelerating the search process to converge to optimal results. | ||
- We also make some general optimizations like enabling parallel measurement on single node with multiple GPUs and training the cost model on GPU. | ||
|
||
# Motivation | ||
[motivation]: #motivation | ||
|
||
Auto-scheduler (Ansor) uses code sketch and optimization rules to generate a large search space. The search space defined by Ansor has shown great opportunities and therefore the search quality and the search efficiency are determined by how we search the space. | ||
|
||
Ansor utilizes improved cost model and task scheduler to help explore the search space. The cost model analyzes and finds high-performance code transformations in the search space and the task scheduler allocates the time budget to different computation graphs. However, we find serval drawbacks to this approach: | ||
|
||
The accuracy of the cost model determines the search quality, but Ansor uses monolithic cost model to predict different computation graphs (subgraphs), resulting in an accuracy loss during tuning. | ||
|
||
The task scheduler allocates most of the time budget to subgraphs with most improving potential (i.e., those with the highest latency). This approach works well at the beginning of the autotuning. However, as the potential subgraph gradually reaches its peak performance with adequate time budget, other subgraphs have little time budget to reach its peak performance. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IMHO, the solution to this problem is to improve the task scheduler strategy. Since the task scheduler predicts the peak performance of a task, as long as this prediction is accurate enough, it shouldn't spend more time on the task that has achieved the peak performance. |
||
|
||
The search process will at the end take a dozen of hours. This motivates us to find better way to explore the search space. | ||
|
||
# Guide-level explanation | ||
[guide-level-explanation]: #guide-level-explanation | ||
|
||
We integrate our search method into Auto-Scheduler. Therefore, users only need to change some of the parameters to enable our search method. | ||
|
||
We use the code below in [Auto-scheduling a Neural Network for NVIDIA GPU](https://tvm.apache.org/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html#begin-tuning) as an example: | ||
|
||
```python | ||
#... | ||
|
||
# load all task and into tuner | ||
tuner = auto_scheduler.TaskScheduler(tasks, task_weights) | ||
|
||
#define tuning option for tuner | ||
tune_option = auto_scheduler.TuningOptions( | ||
num_measure_trials=200, # change this to 20000 to achieve the best performance | ||
runner=measure_ctx.runner, | ||
measure_callbacks=[auto_scheduler.RecordToFile(log_file)], | ||
) | ||
|
||
# start tuning | ||
#tuner.tune(tune_option) #add new parameter to tune function | ||
tuner.tune(tune_option,search_policy="sketch.xgb.family_op") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The policy name could be more informative. |
||
|
||
``` | ||
|
||
The `tuner` loads the `tune_option` into the `tune` function. There are several parameters in the `tune` function (Refer to class [Taskscheduler](https://tvm.apache.org/docs/reference/api/python/auto_scheduler.html?highlight=taskscheduler#tvm.auto_scheduler.TaskScheduler)). Users can enable our method by changing the `search_policy` parameter to `sketch.xgb.family_<family_algorithm>`. We currently provide two family algorithms as an option: `op` refers to classifying subgraphs based on the core operation, and `hash` refers to classifying subgraphs based on operation sequence. We recommend using `op` to achieve better performance. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By this description, In addition, since you recommend to use |
||
|
||
# Reference-level explanation | ||
[reference-level-explanation]: #reference-level-explanation | ||
|
||
Our search method consists of three steps: | ||
|
||
1. Identifying similar subgraphs | ||
```python | ||
def make_family_group( | ||
tasks, | ||
search_policy, | ||
): | ||
"""identify each subgraphs and group them into subgraph family. | ||
""" | ||
if search_policy == "default": | ||
search_policy = "sketch.xgb" | ||
|
||
if isinstance(search_policy, str): | ||
policy = search_policy.split(".") | ||
if len(policy) == 2: | ||
return {} | ||
elif len(policy) == 3: | ||
_, _, model_group = policy | ||
_, class_type = model_group.split("_") | ||
else: | ||
raise ValueError("Invalid search policy: " + search_policy) | ||
|
||
family_group = {} | ||
if class_type == "op": | ||
for idx, task in enumerate(tasks): | ||
task_layers = task.desc.split('_') | ||
if task_layers[1] not in family_group: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems to me that this is a bit too ad-hoc:
|
||
family_group[task_layers[1]] = [] | ||
family_group[task_layers[1]].append(idx) | ||
else: | ||
family_group[task_layers[1]].append(idx) | ||
|
||
elif class_type == "hash": | ||
for idx, task in enumerate(tasks): | ||
first = task.workload_key.find("[\"") + 2 | ||
end = task.workload_key.find("\",") | ||
task_hash = task.workload_key[first:end] | ||
if task_hash not in family_group: | ||
family_group[task_hash] = [] | ||
family_group[task_hash].append(idx) | ||
else: | ||
family_group[task_hash].append(idx) | ||
|
||
elif class_type == "ind": | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is |
||
for idx, task in enumerate(tasks): | ||
if task.workload_key not in family_group: | ||
family_group[task.workload_key] = [] | ||
family_group[task.workload_key].append(idx) | ||
else: | ||
family_group[task.workload_key].append(idx) | ||
|
||
if family_group is not None: | ||
for key, value in family_group.items(): | ||
print("family group :", key, "---", value) | ||
|
||
return family_group | ||
``` | ||
|
||
We use static analyzing to classify the subgraphs(tasks) based on their attributes. | ||
|
||
2. Constructing family cost model | ||
```python | ||
elif "family" in model_group: | ||
# generate cost model for each family | ||
cost_model_pool = [] | ||
for _,group in family_group.items(): | ||
if model_type == "xgb": | ||
cost_model = XGBModel( | ||
num_warmup_sample=len(group) * num_measures_per_round, | ||
model_file=load_model_file, | ||
adapative_training=adapative_training, | ||
) | ||
if load_model_file and os.path.isfile(load_model_file): | ||
logger.info("TaskScheduler: Load pretrained model...") | ||
cost_model.load(load_model_file) | ||
elif load_log_file: | ||
logger.info("TaskScheduler: Reload measured states and train the model...") | ||
cost_model.update_from_file(load_log_file) | ||
elif model_type == "random": | ||
cost_model = RandomModel() | ||
else: | ||
raise ValueError("Invalid search policy: " + search_policy) | ||
cost_model_pool.append(cost_model) | ||
|
||
#bind each subgraph(task) with its family cost model | ||
search_policies = [] | ||
for task_idx,task in enumerate(tasks): | ||
for group_idx,group in enumerate(family_group.values()): | ||
if task_idx in group: | ||
search_policies.append( | ||
SketchPolicy( | ||
task, | ||
cost_model_pool[group_idx], | ||
params=search_policy_params, | ||
verbose=verbose, | ||
init_search_callbacks=init_search_callbacks, | ||
) | ||
) | ||
|
||
``` | ||
|
||
After identifying similar subgraphs, We return `family_group` (a list of subgraph families) and build a cost model for each subgraph family. | ||
|
||
3.Foresee tuning | ||
|
||
```python | ||
def _tune_family_task(self, task_idx_groups,skip_measures_per_round): | ||
"""Tune the select family task for one round. | ||
""" | ||
for task_idx in task_idx_groups: | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC, in this case once TaskScheduler picks a family, all tasks in this family will be tuned for one round? If so, it seems not that efficient as some tasks may not need that many trials. Can we still let task scheduler pick one task at a time, but we just use different cost model based on the family it belongs to? |
||
# Run pre-tune callbacks | ||
for callback in self.callbacks: | ||
callback.pre_tune(self, task_idx) | ||
|
||
measure_inputs, measure_results = self.search_policies[task_idx].continue_search_one_round( | ||
skip_measures_per_round, self.measurer | ||
) | ||
|
||
…… | ||
``` | ||
|
||
The foresee tuning takes `task_idx_groups` (A list of subgraph families) and `skip_measures_per_round` as inputs and tunes all the subgraphs inside the list. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is |
||
|
||
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
When searching on a larger search space (such as larger batch size), FamilySeer performs similarly or sometimes worse than Auto-scheduler. This is because a larger search space requires more time before the cost model can provide an accurate prediction. Deploying an inaccurate cost model on Foresee tuning may result in spending time budget on non-improvement code transformations. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems to me that this can be improved somehow. For example, at the warmup stage (e.g., the first N trials), all cost models share the same training data from all tasks so that you have the same behavior as the current auto-scheduler. Afterward, you apply different data to each cost model to benefit from the task groups. |
||
|
||
# Rationale and alternatives | ||
[rationale-and-alternatives]: #rationale-and-alternatives | ||
|
||
Auto-Scheduler generates a large enough search space, so searching the space with efficiency is important. With FamilySeer, Users can search for the same optimal code under less time budget. We hope that our search method can be an alternative option for those who expect to obtain better optimal code under a limited time budget. | ||
|
||
# Prior art | ||
[prior-art]: #prior-art | ||
|
||
Please refer to [this paper](https://arxiv.org/abs/2201.00194). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not a "prior" art lol |
||
|
||
# Unresolved questions | ||
[unresolved-questions]: #unresolved-questions | ||
|
||
Our search method is up for [discussion](https://discuss.tvm.apache.org/t/rfc-familyseer-a-new-search-method-for-auto-scheduler/11877). | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is not an unresolved question. Unresolved questions should be the technical difficulty or the drawbacks of the proposed approach. |
||
|
||
# Future possibilities | ||
[future-possibilities]: #future-possibilities | ||
|
||
1. Dynamic subgraph family anlaysis | ||
|
||
FamilySeer currently relies on static analysis to identify subgraphs, which might result in misjudgments on some of the subgraphs. We are looking for an alternative method to identify subgraphs dynamically while maintaining the same time budget. | ||
|
||
2. Advanced Foresee tuning | ||
|
||
Auto-tuning is the procedure of looking for the only best optimal code for deep learning network. Many less performed codes have to be evaluated to build an accurate cost model. By accurately analyzing the subgraph similarity, we can draw a relationship map between each subgraph and focus on building highly accurate cost model for the most related subgraphs. Once an accurate cost model has been built, We can predict optimal code for other subgraphs instead of searching iteratively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The link you put is the pull request but not the tracking issue. Please open a new issue to track the PR progress and put the link here.