Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC]FamilySeer: A new search method for Auto-scheduler #57

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
212 changes: 212 additions & 0 deletions rfcs/0057-FamilySeer.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
- Feature Name: (FamilySeer: A new search method for Auto-scheduler)
- Start Date: (2021-01-07)
- RFC PR: [apache/tvm-rfcs#57](https://github.com/apache/tvm-rfcs/pull/57)
- GitHub Issue: [apache/tvm#9875](https://github.com/apache/tvm/pull/9875)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The link you put is the pull request but not the tracking issue. Please open a new issue to track the PR progress and put the link here.


# Summary
[summary]: #summary

We propose FamilySeer, a new search method that optimizes search efficiency and quality of the Auto-scheduler. We introduce several features:

- FamilySeer exploits the subgraph similarity to form a collection of subgraph families and constructs cost models at subgraph family basis to improve cost model accuracy.
- We enable subgraphs within each family to share the search results within each tuning iteration, avoiding costly code measurements on real hardware and thus accelerating the search process to converge to optimal results.
- We also make some general optimizations like enabling parallel measurement on single node with multiple GPUs and training the cost model on GPU.

# Motivation
[motivation]: #motivation

Auto-scheduler (Ansor) uses code sketch and optimization rules to generate a large search space. The search space defined by Ansor has shown great opportunities and therefore the search quality and the search efficiency are determined by how we search the space.

Ansor utilizes improved cost model and task scheduler to help explore the search space. The cost model analyzes and finds high-performance code transformations in the search space and the task scheduler allocates the time budget to different computation graphs. However, we find serval drawbacks to this approach:

The accuracy of the cost model determines the search quality, but Ansor uses monolithic cost model to predict different computation graphs (subgraphs), resulting in an accuracy loss during tuning.

The task scheduler allocates most of the time budget to subgraphs with most improving potential (i.e., those with the highest latency). This approach works well at the beginning of the autotuning. However, as the potential subgraph gradually reaches its peak performance with adequate time budget, other subgraphs have little time budget to reach its peak performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, the solution to this problem is to improve the task scheduler strategy. Since the task scheduler predicts the peak performance of a task, as long as this prediction is accurate enough, it shouldn't spend more time on the task that has achieved the peak performance.


The search process will at the end take a dozen of hours. This motivates us to find better way to explore the search space.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

We integrate our search method into Auto-Scheduler. Therefore, users only need to change some of the parameters to enable our search method.

We use the code below in [Auto-scheduling a Neural Network for NVIDIA GPU](https://tvm.apache.org/docs/how_to/tune_with_autoscheduler/tune_network_cuda.html#begin-tuning) as an example:

```python
#...

# load all task and into tuner
tuner = auto_scheduler.TaskScheduler(tasks, task_weights)

#define tuning option for tuner
tune_option = auto_scheduler.TuningOptions(
num_measure_trials=200, # change this to 20000 to achieve the best performance
runner=measure_ctx.runner,
measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)

# start tuning
#tuner.tune(tune_option) #add new parameter to tune function
tuner.tune(tune_option,search_policy="sketch.xgb.family_op")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The policy name could be more informative. family_op is not a common term after all.


```

The `tuner` loads the `tune_option` into the `tune` function. There are several parameters in the `tune` function (Refer to class [Taskscheduler](https://tvm.apache.org/docs/reference/api/python/auto_scheduler.html?highlight=taskscheduler#tvm.auto_scheduler.TaskScheduler)). Users can enable our method by changing the `search_policy` parameter to `sketch.xgb.family_<family_algorithm>`. We currently provide two family algorithms as an option: `op` refers to classifying subgraphs based on the core operation, and `hash` refers to classifying subgraphs based on operation sequence. We recommend using `op` to achieve better performance.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By this description, op seems not an accurate term. IIUC, both op and hash target to subgraphs. The only different is the way to determine the similarity, so again, it would be good to have a better naming.

In addition, since you recommend to use op for better performance, could you elaborate the need of hash?


# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

Our search method consists of three steps:

1. Identifying similar subgraphs
```python
def make_family_group(
tasks,
search_policy,
):
"""identify each subgraphs and group them into subgraph family.
"""
if search_policy == "default":
search_policy = "sketch.xgb"

if isinstance(search_policy, str):
policy = search_policy.split(".")
if len(policy) == 2:
return {}
elif len(policy) == 3:
_, _, model_group = policy
_, class_type = model_group.split("_")
else:
raise ValueError("Invalid search policy: " + search_policy)

family_group = {}
if class_type == "op":
for idx, task in enumerate(tasks):
task_layers = task.desc.split('_')
if task_layers[1] not in family_group:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this is a bit too ad-hoc:

  1. It relies on the task.desc, which is supposed to be used only for user reference and the format isn't guaranteed.
  2. It identifies the second op (e.g., task_layers[1]) to be the anchor op, but this may not 100% hold.

family_group[task_layers[1]] = []
family_group[task_layers[1]].append(idx)
else:
family_group[task_layers[1]].append(idx)

elif class_type == "hash":
for idx, task in enumerate(tasks):
first = task.workload_key.find("[\"") + 2
end = task.workload_key.find("\",")
task_hash = task.workload_key[first:end]
if task_hash not in family_group:
family_group[task_hash] = []
family_group[task_hash].append(idx)
else:
family_group[task_hash].append(idx)

elif class_type == "ind":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is ind the third version? I didn't find it in the previous section.

for idx, task in enumerate(tasks):
if task.workload_key not in family_group:
family_group[task.workload_key] = []
family_group[task.workload_key].append(idx)
else:
family_group[task.workload_key].append(idx)

if family_group is not None:
for key, value in family_group.items():
print("family group :", key, "---", value)

return family_group
```

We use static analyzing to classify the subgraphs(tasks) based on their attributes.

2. Constructing family cost model
```python
elif "family" in model_group:
# generate cost model for each family
cost_model_pool = []
for _,group in family_group.items():
if model_type == "xgb":
cost_model = XGBModel(
num_warmup_sample=len(group) * num_measures_per_round,
model_file=load_model_file,
adapative_training=adapative_training,
)
if load_model_file and os.path.isfile(load_model_file):
logger.info("TaskScheduler: Load pretrained model...")
cost_model.load(load_model_file)
elif load_log_file:
logger.info("TaskScheduler: Reload measured states and train the model...")
cost_model.update_from_file(load_log_file)
elif model_type == "random":
cost_model = RandomModel()
else:
raise ValueError("Invalid search policy: " + search_policy)
cost_model_pool.append(cost_model)

#bind each subgraph(task) with its family cost model
search_policies = []
for task_idx,task in enumerate(tasks):
for group_idx,group in enumerate(family_group.values()):
if task_idx in group:
search_policies.append(
SketchPolicy(
task,
cost_model_pool[group_idx],
params=search_policy_params,
verbose=verbose,
init_search_callbacks=init_search_callbacks,
)
)

```

After identifying similar subgraphs, We return `family_group` (a list of subgraph families) and build a cost model for each subgraph family.

3.Foresee tuning

```python
def _tune_family_task(self, task_idx_groups,skip_measures_per_round):
"""Tune the select family task for one round.
"""
for task_idx in task_idx_groups:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, in this case once TaskScheduler picks a family, all tasks in this family will be tuned for one round? If so, it seems not that efficient as some tasks may not need that many trials. Can we still let task scheduler pick one task at a time, but we just use different cost model based on the family it belongs to?

# Run pre-tune callbacks
for callback in self.callbacks:
callback.pre_tune(self, task_idx)

measure_inputs, measure_results = self.search_policies[task_idx].continue_search_one_round(
skip_measures_per_round, self.measurer
)

……
```

The foresee tuning takes `task_idx_groups` (A list of subgraph families) and `skip_measures_per_round` as inputs and tunes all the subgraphs inside the list.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is skip_measures_per_round?


# Drawbacks
[drawbacks]: #drawbacks

When searching on a larger search space (such as larger batch size), FamilySeer performs similarly or sometimes worse than Auto-scheduler. This is because a larger search space requires more time before the cost model can provide an accurate prediction. Deploying an inaccurate cost model on Foresee tuning may result in spending time budget on non-improvement code transformations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this can be improved somehow. For example, at the warmup stage (e.g., the first N trials), all cost models share the same training data from all tasks so that you have the same behavior as the current auto-scheduler. Afterward, you apply different data to each cost model to benefit from the task groups.


# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

Auto-Scheduler generates a large enough search space, so searching the space with efficiency is important. With FamilySeer, Users can search for the same optimal code under less time budget. We hope that our search method can be an alternative option for those who expect to obtain better optimal code under a limited time budget.

# Prior art
[prior-art]: #prior-art

Please refer to [this paper](https://arxiv.org/abs/2201.00194).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a "prior" art lol
You should just put this link to the summary section. And for this section, you should put other related work (e.g., Ansor, FlexTensor, AutoTVM, etc).


# Unresolved questions
[unresolved-questions]: #unresolved-questions

Our search method is up for [discussion](https://discuss.tvm.apache.org/t/rfc-familyseer-a-new-search-method-for-auto-scheduler/11877).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not an unresolved question. Unresolved questions should be the technical difficulty or the drawbacks of the proposed approach.


# Future possibilities
[future-possibilities]: #future-possibilities

1. Dynamic subgraph family anlaysis

FamilySeer currently relies on static analysis to identify subgraphs, which might result in misjudgments on some of the subgraphs. We are looking for an alternative method to identify subgraphs dynamically while maintaining the same time budget.

2. Advanced Foresee tuning

Auto-tuning is the procedure of looking for the only best optimal code for deep learning network. Many less performed codes have to be evaluated to build an accurate cost model. By accurately analyzing the subgraph similarity, we can draw a relationship map between each subgraph and focus on building highly accurate cost model for the most related subgraphs. Once an accurate cost model has been built, We can predict optimal code for other subgraphs instead of searching iteratively.