This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO Benchmark Fixes and New Features #3925

Merged
merged 31 commits into from
Jul 26, 2021
Changes from 29 commits
9fe6529
fix parallel run bug
xiaowu0162 Jul 11, 2021
c22740c
handle None returned by tuner
xiaowu0162 Jul 11, 2021
7714dcc
cast params to int for random forest
xiaowu0162 Jul 11, 2021
b2a0967
split nnismall
xiaowu0162 Jul 11, 2021
7044c3a
improve docs
xiaowu0162 Jul 12, 2021
76500d5
search space
xiaowu0162 Jul 12, 2021
2ec6393
doc refactor
xiaowu0162 Jul 13, 2021
2eff6f7
doc update
xiaowu0162 Jul 13, 2021
ae6640a
Revert "search space"
xiaowu0162 Jul 13, 2021
79bc92f
Revert "Revert "search space""
xiaowu0162 Jul 13, 2021
764c0d7
Revert "doc update"
xiaowu0162 Jul 13, 2021
ad90fb1
Revert "doc refactor"
xiaowu0162 Jul 13, 2021
d1d1fb8
doc refactor
xiaowu0162 Jul 13, 2021
2dbe783
doc update
xiaowu0162 Jul 13, 2021
a05b2a8
doc update
xiaowu0162 Jul 13, 2021
0699dd0
debug
xiaowu0162 Jul 13, 2021
646d1c9
doc update
xiaowu0162 Jul 13, 2021
80b97aa
MLP skeleton
xiaowu0162 Jul 13, 2021
50c2e18
config
xiaowu0162 Jul 13, 2021
7ea2b8c
mlp search space
xiaowu0162 Jul 13, 2021
3bc845d
mlp search space
xiaowu0162 Jul 14, 2021
25c6c6e
doc update
xiaowu0162 Jul 14, 2021
75f6149
doc update
xiaowu0162 Jul 14, 2021
7c3758f
doc fix
xiaowu0162 Jul 14, 2021
f8b6905
doc fix
xiaowu0162 Jul 14, 2021
2735a4f
add doc for MLP
xiaowu0162 Jul 14, 2021
ce41669
Update hpo_benchmark_stats.rst
xiaowu0162 Jul 14, 2021
4876ff0
update search space for MLP
xiaowu0162 Jul 15, 2021
efcde76
doc fix
xiaowu0162 Jul 15, 2021
dc144da
Update hpo_benchmark.rst
xiaowu0162 Jul 19, 2021
df4b1ab
Update hpo_benchmark.rst
xiaowu0162 Jul 19, 2021
348 changes: 145 additions & 203 deletions docs/en_US/hpo_benchmark.rst

Large diffs are not rendered by default.

205 changes: 205 additions & 0 deletions docs/en_US/hpo_benchmark_stats.rst
@@ -0,0 +1,205 @@
HPO Benchmark Example Statistics
================================

A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark with the random forest search space on the following 8 tuners: "TPE",
"Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". For convenience of reference, we also list
the search space we experimented on here. Note that the way in which the search space is written may significantly affect
hyperparameter optimization performance, and we plan to conduct further experiments on how well NNI built-in tuners adapt
to different search space formulations using this benchmarking tool.

.. code-block:: json

   {
       "n_estimators": {"_type":"randint", "_value": [8, 512]},
       "max_depth": {"_type":"choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
       "min_samples_leaf": {"_type":"randint", "_value": [1, 8]},
       "min_samples_split": {"_type":"randint", "_value": [2, 16]},
       "max_leaf_nodes": {"_type":"randint", "_value": [0, 4096]}
   }
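
To make the search space concrete, the following minimal Python sketch draws one random configuration from it. The helper ``sample_configuration`` is hypothetical and is not part of NNI; it also assumes that ``randint`` includes the lower bound and excludes the upper bound, and it makes no attempt to interpret special values such as ``0``.

```python
import random

# Search space copied from the JSON block above.
SEARCH_SPACE = {
    "n_estimators": {"_type": "randint", "_value": [8, 512]},
    "max_depth": {"_type": "choice", "_value": [4, 8, 16, 32, 64, 128, 256, 0]},
    "min_samples_leaf": {"_type": "randint", "_value": [1, 8]},
    "min_samples_split": {"_type": "randint", "_value": [2, 16]},
    "max_leaf_nodes": {"_type": "randint", "_value": [0, 4096]},
}


def sample_configuration(space, rng):
    """Draw one configuration (hypothetical helper, not an NNI API)."""
    config = {}
    for name, spec in space.items():
        kind, value = spec["_type"], spec["_value"]
        if kind == "randint":
            lower, upper = value
            # Assumption: lower bound inclusive, upper bound exclusive.
            config[name] = rng.randrange(lower, upper)
        elif kind == "choice":
            config[name] = rng.choice(value)
        else:
            raise ValueError(f"unsupported _type: {kind}")
    return config


rng = random.Random(0)  # fixed seed so the draw is repeatable
config = sample_configuration(SEARCH_SPACE, rng)
print(config)
```

A real tuner proposes configurations adaptively rather than uniformly at random; the sketch only illustrates which values each entry of the search space admits.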

As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark on
one tuner. For a more detailed description of the tasks, please check
``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class
classification tasks, the metrics "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.

After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``.
Since the file is large, we only show the following screenshot and summarize other important statistics instead.

.. image:: ../img/hpo_benchmark/performances.png
   :target: ../img/hpo_benchmark/performances.png
   :alt:

When the results are parsed, the tuners are also ranked based on their final performance. The following three tables show
the average ranking of the tuners for each metric (rmse, auc, logloss).

In addition, each tuner's average ranking on every metric is collected in a fourth table, which is another view of the same data.
This information can be found in ``results[time]/reports/rankings.txt``.
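
As a rough illustration of how an average ranking can be derived from final scores, here is a minimal Python sketch. It is not the benchmark's actual report code: the function ``average_rankings`` is hypothetical, the scores are made up, and the tie rule (tied tuners share the mean of the ranks they span) is an assumption.

```python
from collections import defaultdict


def average_rankings(scores, higher_is_better=True):
    """Map {task: {tuner: final_score}} to {tuner: mean rank}; rank 1 is best.

    Assumption: tied tuners share the mean of the ranks they occupy.
    """
    rank_lists = defaultdict(list)
    for task_scores in scores.values():
        ordered = sorted(task_scores.items(), key=lambda kv: kv[1],
                         reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Extend j to the last tuner whose score ties with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2
            for tuner, _ in ordered[i:j + 1]:
                rank_lists[tuner].append(shared_rank)
            i = j + 1
    return {tuner: sum(r) / len(r) for tuner, r in rank_lists.items()}


# Made-up scores for illustration only (higher is better, as with "auc").
example_scores = {
    "task_a": {"TPE": 0.90, "Random": 0.85, "Anneal": 0.90},
    "task_b": {"TPE": 0.70, "Random": 0.80, "Anneal": 0.75},
}
rankings = average_rankings(example_scores)
print(rankings)
```

For metrics where lower values are better, such as "rmse" or "logloss", pass ``higher_is_better=False``.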

Average rankings for metric rmse (for regression tasks). We found that Anneal performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - Anneal
     - 3.75
   * - Random
     - 4.00
   * - Evolution
     - 4.44
   * - DNGOTuner
     - 4.44
   * - SMAC
     - 4.56
   * - TPE
     - 4.94
   * - GPTuner
     - 4.94
   * - MetisTuner
     - 4.94

Average rankings for metric auc (for classification tasks). We found that SMAC performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - SMAC
     - 3.67
   * - GPTuner
     - 4.00
   * - Evolution
     - 4.22
   * - Anneal
     - 4.39
   * - MetisTuner
     - 4.39
   * - TPE
     - 4.67
   * - Random
     - 5.33
   * - DNGOTuner
     - 5.33

Average rankings for metric logloss (for classification tasks). We found that Random performs the best among all NNI built-in tuners.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - Average Ranking
   * - Random
     - 3.36
   * - DNGOTuner
     - 3.50
   * - SMAC
     - 3.93
   * - GPTuner
     - 4.64
   * - TPE
     - 4.71
   * - Anneal
     - 4.93
   * - Evolution
     - 5.00
   * - MetisTuner
     - 5.93

To view the same data in another way, for each tuner, we present the average rankings on the different types of metrics. From the table, we can see that, for example, the DNGOTuner performs better on tasks whose metric is "logloss" than on tasks whose metric is "auc". We hope this information can, to some extent, guide the choice of tuner given some knowledge of the task type.

.. list-table::
   :header-rows: 1

   * - Tuner Name
     - rmse
     - auc
     - logloss
   * - TPE
     - 4.94
     - 4.67
     - 4.71
   * - Random
     - 4.00
     - 5.33
     - 3.36
   * - Anneal
     - 3.75
     - 4.39
     - 4.93
   * - Evolution
     - 4.44
     - 4.22
     - 5.00
   * - GPTuner
     - 4.94
     - 4.00
     - 4.64
   * - MetisTuner
     - 4.94
     - 4.39
     - 5.93
   * - SMAC
     - 4.56
     - 3.67
     - 3.93
   * - DNGOTuner
     - 4.44
     - 5.33
     - 3.50

Besides these reports, our script also generates two graphs for each fold of each task: one shows the best score each tuner has obtained up to trial x, and the other shows the score each tuner obtains in trial x. These two graphs give some information about how the tuners are "converging" to their final solution. We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
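
The relationship between the two kinds of graphs can be sketched in a few lines of Python: the "best score until trial x" curve is the running maximum (or minimum, for metrics where lower is better) of the per-trial scores. The function ``best_so_far`` and the scores below are hypothetical and only illustrate the idea; they are not taken from the benchmark script.

```python
def best_so_far(trial_scores, higher_is_better=True):
    """Return the best score seen up to and including each trial."""
    pick = max if higher_is_better else min
    curve = []
    for score in trial_scores:
        # The first trial starts the curve; later trials can only improve it.
        curve.append(score if not curve else pick(curve[-1], score))
    return curve


# Made-up per-trial scores for a single tuner (second kind of graph).
trial_scores = [0.81, 0.92, 0.88, 0.95, 0.90]

# Running best (first kind of graph).
curve = best_so_far(trial_scores)
print(curve)  # [0.81, 0.92, 0.92, 0.95, 0.95]
```

A tuner has converged, in the sense used here, once this curve stops changing.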

.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
   :target: ../img/hpo_benchmark/car_fold1_1.jpg
   :alt:


.. image:: ../img/hpo_benchmark/car_fold1_2.jpg
   :target: ../img/hpo_benchmark/car_fold1_2.jpg
   :alt:

The previous two graphs are generated for fold 1 of the task "car". In the first graph, we observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials). Its best score improved three times over the entire experiment. In the second graph, we observe that the scores of most tuners fluctuate between 0.8 and 1 throughout the experiment. However, the Anneal tuner (green line) appears less stable (it shows more fluctuations), while the GPTuner has a more stable pattern. One possible interpretation is that the Anneal tuner explores more aggressively than the GPTuner, so its scores vary more from trial to trial. Regardless, although this pattern can to some extent hint at a tuner's position on the explore-exploit tradeoff, it is not a comprehensive evaluation of a tuner's effectiveness.

.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
   :target: ../img/hpo_benchmark/christine_fold0_1.jpg
   :alt:


.. image:: ../img/hpo_benchmark/christine_fold0_2.jpg
   :target: ../img/hpo_benchmark/christine_fold0_2.jpg
   :alt:


.. image:: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :target: ../img/hpo_benchmark/cnae-9_fold0_1.jpg
   :alt:


.. image:: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :target: ../img/hpo_benchmark/cnae-9_fold0_2.jpg
   :alt:


.. image:: ../img/hpo_benchmark/credit-g_fold1_1.jpg
   :target: ../img/hpo_benchmark/credit-g_fold1_1.jpg
   :alt:


.. image:: ../img/hpo_benchmark/credit-g_fold1_2.jpg
   :target: ../img/hpo_benchmark/credit-g_fold1_2.jpg
   :alt:


.. image:: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :target: ../img/hpo_benchmark/titanic_2_fold1_1.jpg
   :alt:


.. image:: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :target: ../img/hpo_benchmark/titanic_2_fold1_2.jpg
   :alt:
2 changes: 1 addition & 1 deletion docs/en_US/hyperparameter_tune.rst
@@ -25,4 +25,4 @@ according to their needs.
     WebUI <Tutorial/WebUI>
     How to Debug <Tutorial/HowToDebug>
     Advanced <hpo_advanced>
-    Benchmark for Tuners <hpo_benchmark>
+    HPO Benchmarks <hpo_benchmark>
@@ -0,0 +1,29 @@
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: Australian
  openml_task_id: 146818

- name: blood-transfusion
  openml_task_id: 10101

- name: christine
  openml_task_id: 168908

- name: credit-g
  openml_task_id: 31

- name: kc1
  openml_task_id: 3917

- name: kr-vs-kp
  openml_task_id: 3

- name: phoneme
  openml_task_id: 9952

- name: sylvine
  openml_task_id: 168912
@@ -0,0 +1,29 @@
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: car
  openml_task_id: 146821

- name: cnae-9
  openml_task_id: 9981

- name: dilbert
  openml_task_id: 168909

- name: fabert
  openml_task_id: 168910

- name: jasmine
  openml_task_id: 168911

- name: mfeat-factors
  openml_task_id: 12

- name: segment
  openml_task_id: 146822

- name: vehicle
  openml_task_id: 53
@@ -0,0 +1,30 @@
---
- name: __defaults__
  folds: 2
  cores: 2
  max_runtime_seconds: 300

- name: cholesterol
  openml_task_id: 2295

- name: liver-disorders
  openml_task_id: 52948

- name: kin8nm
  openml_task_id: 2280

- name: cpu_small
  openml_task_id: 4883

- name: titanic_2
  openml_task_id: 211993

- name: boston
  openml_task_id: 4857

- name: stock
  openml_task_id: 2311

- name: space_ga
  openml_task_id: 4835
