This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

HPO Benchmark Fixes and New Features #3925

Merged · 31 commits · Jul 26, 2021
Changes from 6 commits

Commits (31)
9fe6529 fix parallel run bug (xiaowu0162, Jul 11, 2021)
c22740c handle None returned by tuner (xiaowu0162, Jul 11, 2021)
7714dcc cast params to int for random forest (xiaowu0162, Jul 11, 2021)
b2a0967 split nnismall (xiaowu0162, Jul 11, 2021)
7044c3a improve docs (xiaowu0162, Jul 12, 2021)
76500d5 search space (xiaowu0162, Jul 12, 2021)
2ec6393 doc refactor (xiaowu0162, Jul 13, 2021)
2eff6f7 doc update (xiaowu0162, Jul 13, 2021)
ae6640a Revert "search space" (xiaowu0162, Jul 13, 2021)
79bc92f Revert "Revert "search space"" (xiaowu0162, Jul 13, 2021)
764c0d7 Revert "doc update" (xiaowu0162, Jul 13, 2021)
ad90fb1 Revert "doc refactor" (xiaowu0162, Jul 13, 2021)
d1d1fb8 doc refactor (xiaowu0162, Jul 13, 2021)
2dbe783 doc update (xiaowu0162, Jul 13, 2021)
a05b2a8 doc update (xiaowu0162, Jul 13, 2021)
0699dd0 debug (xiaowu0162, Jul 13, 2021)
646d1c9 doc update (xiaowu0162, Jul 13, 2021)
80b97aa MLP skeleton (xiaowu0162, Jul 13, 2021)
50c2e18 config (xiaowu0162, Jul 13, 2021)
7ea2b8c mlp search space (xiaowu0162, Jul 13, 2021)
3bc845d mlp search space (xiaowu0162, Jul 14, 2021)
25c6c6e doc update (xiaowu0162, Jul 14, 2021)
75f6149 doc update (xiaowu0162, Jul 14, 2021)
7c3758f doc fix (xiaowu0162, Jul 14, 2021)
f8b6905 doc fix (xiaowu0162, Jul 14, 2021)
2735a4f add doc for MLP (xiaowu0162, Jul 14, 2021)
ce41669 Update hpo_benchmark_stats.rst (xiaowu0162, Jul 14, 2021)
4876ff0 update search space for MLP (xiaowu0162, Jul 15, 2021)
efcde76 doc fix (xiaowu0162, Jul 15, 2021)
dc144da Update hpo_benchmark.rst (xiaowu0162, Jul 19, 2021)
df4b1ab Update hpo_benchmark.rst (xiaowu0162, Jul 19, 2021)
43 changes: 24 additions & 19 deletions docs/en_US/hpo_benchmark.rst
@@ -8,11 +8,11 @@ Terminology
^^^^^^^^^^^


* **task**\ : a task can be thought of as (dataset, evaluator). It gives out a dataset containing (train, valid, test), and based on the received predictions, the evaluator evaluates a given metric (e.g., mse for regression, f1 for classification).
* **benchmark**\ : a benchmark is a set of tasks, along with other external constraints such as time and resource.
* **framework**\ : given a task, a framework conceives answers to the proposed regression or classification problem and produces predictions. Note that the automlbenchmark framework does not pose any restrictions on the hypothesis space of a framework. In our implementation in this folder, each framework is a tuple (tuner, architecture), where architecture provides the hypothesis space (and search space for tuner), and tuner determines the strategy of hyperparameter optimization.
* **task**\ : a task can be thought of as a tuple (dataset, metric). It provides train and test datasets to the frameworks. Then, based on the returned predictions on the test set, the task evaluates the metric (e.g., mse for regression, f1 for classification) and reports the score.
* **benchmark**\ : a benchmark is a set of tasks, along with other external constraints such as time limits.
* **framework**\ : given a task, a framework solves the proposed regression or classification problem using the train data and produces predictions on the test set. The automlbenchmark framework does not pose any restrictions on the hypothesis space of a framework. In our implementation, each framework is a tuple (tuner, architecture), where the architecture provides the hypothesis space and the tuner optimizes its hyperparameters. To solve a task, we let the tuner continuously tune the hyperparameters (using the cross-validation score on the train data as feedback) until the time or trial limit is reached; the architecture is then retrained on the entire train set using the best set of hyperparameters found (see the sketch below).
* **tuner**\ : a tuner or advisor defined in the hpo folder, or a custom tuner provided by the user.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). In our implementation, the architecture calls tuner multiple times to obtain possible hyperparameter configurations, and produces the final prediction for a task. See ``./nni/extensions/NNI/architectures`` for examples.
* **architecture**\ : an architecture is a specific method for solving the tasks, along with a set of hyperparameters to optimize (i.e., the search space). See ``./nni/extensions/NNI/architectures`` for examples.

Note: currently, the only architecture supported is random forest. The architecture implementation and search space definition can be found in ``./nni/extensions/NNI/architectures/run_random_forest.py``. The tasks in benchmarks "nnivalid" and "nnismall" are suitable to solve with random forests.
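To make the (tuner, architecture) split and the tune-then-retrain loop above concrete, the following is a minimal sketch assuming a random-forest architecture with an NNI-style search space. The function name ``solve_task``, the trial limit, the search space ranges, and the direct calls to the tuner's methods are illustrative assumptions rather than the benchmark's actual code; see ``run_random_forest.py`` for the real implementation.

.. code-block:: python

   from sklearn.ensemble import RandomForestClassifier
   from sklearn.model_selection import cross_val_score

   # Illustrative NNI-style search space for a random forest architecture;
   # the actual space is defined in run_random_forest.py and may differ.
   SEARCH_SPACE = {
       "n_estimators": {"_type": "randint", "_value": [8, 512]},
       "max_depth": {"_type": "randint", "_value": [2, 64]},
       "min_samples_leaf": {"_type": "randint", "_value": [1, 8]},
   }

   def solve_task(tuner, X_train, y_train, X_test, n_trials=100):
       """Hypothetical tune-then-retrain loop for one classification task."""
       tuner.update_search_space(SEARCH_SPACE)
       best_score, best_params = float("-inf"), None
       for trial_id in range(n_trials):                  # trial limit; a time limit also works
           params = tuner.generate_parameters(trial_id)
           if params is None:                            # some tuners may stop proposing candidates
               break
           params = {k: int(v) for k, v in params.items()}  # random forest expects integer params
           score = cross_val_score(RandomForestClassifier(**params), X_train, y_train, cv=5).mean()
           tuner.receive_trial_result(trial_id, params, score)  # cross-validation score as feedback
           if score > best_score:
               best_score, best_params = score, params
       if best_params is None:
           raise RuntimeError("the tuner produced no valid configuration")
       # Retrain on the entire train set with the best hyperparameters found.
       final_model = RandomForestClassifier(**best_params).fit(X_train, y_train)
       return final_model.predict(X_test)                # the task scores these predictions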

@@ -32,47 +32,52 @@ Run predefined benchmarks on existing tuners

./runbenchmark_nni.sh [tuner-names]

This script runs the benchmark 'nnivalid', which consists of a regression task, a binary classification task, and a multi-class classification task. After the script finishes, you can find a summary of the results in the folder results_[time]/reports/. To run on other predefined benchmarks, change the ``benchmark`` variable in ``runbenchmark_nni.sh``. Some benchmarks are defined in ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks``\ , and others are defined in ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``. One example of larger benchmarks is "nnismall", which consists of 8 regression tasks, 8 binary classification tasks, and 8 multi-class classification tasks.
This script runs the benchmark 'nnivalid', which consists of a regression task, a binary classification task, and a multi-class classification task. After the script finishes, you can find a summary of the results in the folder results_[time]/reports/. To run on other predefined benchmarks, change the ``benchmark`` variable in ``runbenchmark_nni.sh``. Some benchmarks are defined in ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks``\ , and others are defined in ``/examples/trials/benchmarking/automlbenchmark/automlbenchmark/resources/benchmarks/``. One example of larger benchmarks is "nnismall", which consists of 8 regression tasks, 8 binary classification tasks, and 8 multi-class classification tasks. We also provide three separate 8-task benchmarks "nnismall-regression", "nnismall-binary", and "nnismall-multiclass" corresponding to the three types of tasks in nnismall.

By default, the script runs the benchmark on all embedded tuners in NNI. If provided a list of tuners in [tuner-names], it only runs the tuners in the list. Currently, the following tuner names are supported: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner", "Hyperband", "BOHB". It is also possible to evaluate custom tuners. See the next sections for details.
By default, the script runs the benchmark on all built-in tuners in NNI. If a list of tuners is given in [tuner-names], only those tuners are run. Currently, the following tuner names are supported: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner", "Hyperband", "BOHB". It is also possible to run the benchmark on custom tuners. See the next sections for details.

By default, the script runs the specified tuners against the specified benchmark one by one. To run all the experiments simultaneously in the background, set the "serialize" flag to false in ``runbenchmark_nni.sh``.
By default, the script runs the specified tuners against the specified benchmark one by one. To run the experiment for all tuners simultaneously in the background, set the "serialize" flag to false in ``runbenchmark_nni.sh``.

Note: the SMAC tuner, DNGO tuner, and the BOHB advisor has to be manually installed before any experiments can be run on it. Please refer to `this page <https://nni.readthedocs.io/en/stable/Tuner/BuiltinTuner.html?highlight=nni>`_ for more details on installing SMAC and BOHB.
Note: the SMAC tuner, DNGO tuner, and the BOHB advisor have to be manually installed before running benchmarks on them. Please refer to `this page <https://nni.readthedocs.io/en/stable/Tuner/BuiltinTuner.html?highlight=nni>`_ for more details on installation.

Run customized benchmarks on existing tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run customized benchmarks, add a benchmark_name.yaml file in the folder ``./nni/benchmarks``\ , and change the ``benchmark`` variable in ``runbenchmark_nni.sh``. See ``./automlbenchmark/resources/benchmarks/`` for some examples of defining a custom benchmark.
You can design your own benchmarks and evaluate the performance of NNI tuners on them. To run customized benchmarks, add a benchmark_name.yaml file in the folder ``./nni/benchmarks``\ , and change the ``benchmark`` variable in ``runbenchmark_nni.sh``. See ``./automlbenchmark/resources/benchmarks/`` for some examples of defining a custom benchmark.

Run benchmarks on custom tuners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.Tuner`` and correctly implements the required APIs. For more information on implementing a custom tuner, please refer to `here <https://nni.readthedocs.io/en/stable/Tuner/CustomizeTuner.html>`_. Next, perform the following steps:
You may also use the benchmark to compare a custom tuner written by yourself with the NNI built-in tuners. To use custom tuners, first make sure that the tuner inherits from ``nni.tuner.Tuner`` and correctly implements the required APIs. For more information on implementing a custom tuner, please refer to `here <https://nni.readthedocs.io/en/stable/Tuner/CustomizeTuner.html>`_. Next, perform the following steps:


#. Install the custom tuner with command ``nnictl algo register``. Check `this document <https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html>`_ for details.
#. Install the custom tuner via the command ``nnictl algo register``. Check `this document <https://nni.readthedocs.io/en/stable/Tutorial/Nnictl.html>`_ for details.
#. In ``./nni/frameworks.yaml``\ , add a new framework extending the base framework NNI. Make sure that the parameter ``tuner_type`` corresponds to the "builtinName" of tuner installed in step 1.
#. Run the following command

.. code-block:: bash

./runbenchmark_nni.sh new-tuner-builtinName

The benchmark will automatically find and match the tuner newly added to your NNI installation.
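For reference, a custom tuner only needs to subclass ``nni.tuner.Tuner`` and implement a few methods. Below is a minimal sketch of a random-search-style tuner; the class name and the search-space handling are illustrative assumptions, and only the ``randint`` and ``choice`` parameter types are covered here.

.. code-block:: python

   import random

   from nni.tuner import Tuner

   class MyRandomTuner(Tuner):
       """Hypothetical minimal custom tuner. After implementing it, register it with
       ``nnictl algo register`` and use its builtinName as ``tuner_type`` in ./nni/frameworks.yaml."""

       def update_search_space(self, search_space):
           # Called with the architecture's search space before tuning starts.
           self.search_space = search_space

       def generate_parameters(self, parameter_id, **kwargs):
           # Return one hyperparameter configuration per trial.
           params = {}
           for name, spec in self.search_space.items():
               if spec["_type"] == "randint":
                   low, high = spec["_value"]
                   params[name] = random.randrange(low, high)
               elif spec["_type"] == "choice":
                   params[name] = random.choice(spec["_value"])
           return params

       def receive_trial_result(self, parameter_id, parameters, value, **kwargs):
           # A smarter tuner would use the reported score; random search ignores it.
           pass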

A Benchmark Example
^^^^^^^^^^^^^^^^^^^

As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". As some of the tasks contains a considerable amount of training data, it took about 2 days to run the whole benchmark on one tuner using a single CPU core. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class classification tasks, the metric "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.
As an example, we ran the "nnismall" benchmark on the following 8 tuners: "TPE", "Random", "Anneal", "Evolution", "SMAC", "GPTuner", "MetisTuner", "DNGOTuner". As some of the tasks contain a considerable amount of training data, it took about 2 days to run the whole benchmark on one tuner. For a more detailed description of the tasks, please check ``/examples/trials/benchmarking/automlbenchmark/nni/benchmarks/nnismall_description.txt``. For binary and multi-class classification tasks, the metrics "auc" and "logloss" were used for evaluation, while for regression, "r2" and "rmse" were used.
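These metrics correspond to the standard scikit-learn implementations, roughly as in the sketch below (the helper names are illustrative). Note that "auc" and "r2" are higher-is-better, while "logloss" and "rmse" are lower-is-better.

.. code-block:: python

   import numpy as np
   from sklearn.metrics import log_loss, mean_squared_error, r2_score, roc_auc_score

   def classification_scores(y_true, y_proba):
       """auc (higher is better) and logloss (lower is better); y_proba holds class probabilities."""
       return roc_auc_score(y_true, y_proba[:, 1]), log_loss(y_true, y_proba)

   def regression_scores(y_true, y_pred):
       """r2 (higher is better) and rmse (lower is better)."""
       return r2_score(y_true, y_pred), np.sqrt(mean_squared_error(y_true, y_pred))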

After the script finishes, the final scores of each tuner are summarized in the file ``results[time]/reports/performances.txt``. Since the file is large, we only show the following screenshot and summarize other important statistics instead.

.. image:: ../img/hpo_benchmark/performances.png
:target: ../img/hpo_benchmark/performances.png
:alt:

In addition, when the results are parsed, the tuners are ranked based on their final performance. ``results[time]/reports/rankings.txt`` presents the average ranking of the tuners for each metric (logloss, rmse, auc). Here we present the data in the first three tables. Also, for every tuner, their performance for each type of metric is summarized (another view of the same data). We present this statistics in the fourth table.
When the results are parsed, the tuners are also ranked based on their final performance. The following three tables show the average ranking of the tuners for each metric (logloss, rmse, auc).

Average rankings for metric rmse:

Also, for every tuner, its performance on each type of metric is summarized (another view of the same data). We present these statistics in the fourth table. Note that this information can be found in ``results[time]/reports/rankings.txt``.

Average rankings for metric rmse (for regression tasks). We found that Anneal performs the best among all NNI built-in tuners.

.. list-table::
:header-rows: 1
@@ -96,7 +101,7 @@ Average rankings for metric rmse:
* - MetisTuner
- 4.94

Average rankings for metric auc:
Average rankings for metric auc (for classification tasks). We found that SMAC performs the best among all NNI built-in tuners.

.. list-table::
:header-rows: 1
@@ -120,7 +125,7 @@ Average rankings for metric auc:
* - DNGOTuner
- 5.33

Average rankings for metric logloss:
Average rankings for metric logloss (for classification tasks). We found that Random performs the best among all NNI built-in tuners.

.. list-table::
:header-rows: 1
@@ -144,7 +149,7 @@ Average rankings for metric logloss:
* - MetisTuner
- 5.93

Average rankings for tuners:
To view the same data in another way, for each tuner we present its average ranking on each type of metric. From this table we can see that, for example, the DNGOTuner performs better on the tasks evaluated with "logloss" than on the tasks evaluated with "auc". We hope this information can, to some extent, guide the choice of tuner given some knowledge of the task type.

.. list-table::
:header-rows: 1
@@ -186,7 +191,7 @@ Average rankings for tuners:
- 5.33
- 3.50

Besides these reports, our script also generates two graphs for each fold of each task. The first graph presents the best score seen by each tuner until trial x, and the second graph shows the scores of each tuner in trial x. These two graphs can give some information regarding how the tuners are "converging". We found that for "nnismall", tuners on the random forest model with search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to the final solution after 40 to 60 trials. As there are too much graphs to incldue in a single report (96 graphs in total), we only present 10 graphs here.
Besides these reports, our script also generates two graphs for each fold of each task: one graph presents the best score received by each tuner up to trial x, and the other shows the score that each tuner receives in trial x. These two graphs give some information about how the tuners are "converging" to their final solution. We found that for "nnismall", tuners on the random forest model with the search space defined in ``/examples/trials/benchmarking/automlbenchmark/nni/extensions/NNI/architectures/run_random_forest.py`` generally converge to their final solution after 40 to 60 trials. As there are too many graphs to include in a single report (96 graphs in total), we only present 10 graphs here.
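As a side note, the "best score until trial x" curve in the first kind of graph is simply a running maximum over the per-trial scores shown in the second kind of graph (a running minimum for lower-is-better metrics). A small sketch with made-up scores:

.. code-block:: python

   import numpy as np

   trial_scores = np.array([0.81, 0.85, 0.83, 0.90, 0.88, 0.92])  # per-trial scores (second graph)
   best_so_far = np.maximum.accumulate(trial_scores)              # best score up to trial x (first graph)
   print(best_so_far)  # [0.81 0.85 0.85 0.9  0.9  0.92]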

.. image:: ../img/hpo_benchmark/car_fold1_1.jpg
:target: ../img/hpo_benchmark/car_fold1_1.jpg
@@ -197,7 +202,7 @@ Besides these reports, our script also generates two graphs for each fold of each
:target: ../img/hpo_benchmark/car_fold1_2.jpg
:alt:

For example, the previous two graphs are generated for fold 1 of the task "car". In the first graph, we can observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials). Its score improved three times in the entire experiment. In the second graph, we observe that most tuners have their score flucturate between 0.8 and 1 throughout the experiment duration. However, it seems that the Anneal tuner (green line) is more unstable (having more fluctuations) while the GPTuner has a more stable pattern. Regardless, although this pattern can to some extent be interpreted as a tuner's position on the explore-exploit tradeoff, it cannot be used for a comprehensive evaluation of a tuner's effectiveness.
The previous two graphs were generated for fold 1 of the task "car". In the first graph, we observe that most tuners find a relatively good solution within 40 trials. In this experiment, among all tuners, the DNGOTuner converges fastest to the best solution (within 10 trials); its best score improved three times over the entire experiment. In the second graph, we observe that most tuners have their scores fluctuate between 0.8 and 1 throughout the experiment. However, the Anneal tuner (green line) appears more unstable (with more fluctuations), while the GPTuner shows a more stable pattern. This may be interpreted as the Anneal tuner exploring more aggressively than the GPTuner, so its scores vary more across trials. Regardless, although this pattern can to some extent hint at a tuner's position on the explore-exploit tradeoff, it is not a comprehensive evaluation of a tuner's effectiveness.

.. image:: ../img/hpo_benchmark/christine_fold0_1.jpg
:target: ../img/hpo_benchmark/christine_fold0_1.jpg
@@ -0,0 +1,29 @@
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300

- name: Australian
openml_task_id: 146818

- name: blood-transfusion
openml_task_id: 10101

- name: christine
openml_task_id: 168908

- name: credit-g
openml_task_id: 31

- name: kc1
openml_task_id: 3917

- name: kr-vs-kp
openml_task_id: 3

- name: phoneme
openml_task_id: 9952

- name: sylvine
openml_task_id: 168912
@@ -0,0 +1,29 @@
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300

- name: car
openml_task_id: 146821

- name: cnae-9
openml_task_id: 9981

- name: dilbert
openml_task_id: 168909

- name: fabert
openml_task_id: 168910

- name: jasmine
openml_task_id: 168911

- name: mfeat-factors
openml_task_id: 12

- name: segment
openml_task_id: 146822

- name: vehicle
openml_task_id: 53
@@ -0,0 +1,30 @@
---
- name: __defaults__
folds: 2
cores: 2
max_runtime_seconds: 300

- name: cholesterol
openml_task_id: 2295

- name: liver-disorders
openml_task_id: 52948

- name: kin8nm
openml_task_id: 2280

- name: cpu_small
openml_task_id: 4883

- name: titanic_2
openml_task_id: 211993

- name: boston
openml_task_id: 4857

- name: stock
openml_task_id: 2311

- name: space_ga
openml_task_id: 4835
