
[hotfix] Fix the issue mentioned in #352 #354

Conversation

nabenabe0928 (Collaborator) commented Dec 6, 2021

Types of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • Have you checked to ensure there aren't other open Pull Requests for the same update/change?
  • Have you added an explanation of what your changes do and why you'd like us to include them?
  • Have you written new tests for your core changes, as applicable?
  • Have you successfully run tests with your changes locally?

Description

This is the hotfix for issue #352.
Since sklearn's task-type inference does not handle continuous targets properly, I added a feature that filters continuous targets (i.e. regression tasks) out of the multiclass category.
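As a rough illustration of the filtering idea (only a sketch: the helper name infer_output_type and the unique-value threshold below are hypothetical and are not taken from the actual diff in base_dataset.py):

import numpy as np
from sklearn.utils.multiclass import type_of_target


def infer_output_type(y: np.ndarray, max_classes: int = 100) -> str:
    """Reinterpret sklearn's "multiclass" as "continuous" for regression-like targets."""
    target_type = type_of_target(y)
    if target_type == "multiclass" and np.issubdtype(y.dtype, np.number):
        # Integer-coded targets with many distinct values are most likely regression targets.
        if np.unique(y).size > max_classes:
            return "continuous"
    return target_type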

Motivation and Context

See issue #352.

How has this been tested?

This is not covered by a proper pytest test, but I checked it locally with the following script.
Note that data.csv is available in issue #352.

from autoPyTorch.api.tabular_regression import TabularRegressionTask
import pandas as pd
from sklearn.model_selection import train_test_split


input_cols = [
    "P1",
    "P5p1",
    "P6p2",
    "P11p4",
    "P14p9",
    "P15p1",
    "P15p3",
    "P16p2",
    "P18p2",
    "P27p4",
    "H2p2",
    "H8p2",
    "H10p1",
    "H13p1",
    "H18pA",
    "H40p4",
]
output = ["price"]

if __name__ == '__main__':
    # read dataframe
    df = pd.read_csv("data.csv")

    # split data
    X = df[input_cols]
    y = df[output_cols]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.8, random_state=1
    )

    # train model
    api = TabularRegressionTask(n_jobs=7)
    api.search(
        X_train=X_train,
        y_train=y_train,
        X_test=X_test.copy(),
        y_test=y_test.copy(),
        optimize_metric="r2",
        total_walltime_limit=120,
    )

Note that, with the same script, the original code yields the following error:

Traceback (most recent call last):
  File "bug_test.py", line 39, in <module>
    api.search(
  File "/home/Auto-PyTorch/autoPyTorch/api/tabular_regression.py", line 300, in search
    return self._search(
  File "/home/Auto-PyTorch/autoPyTorch/api/base_task.py", line 876, in _search
    raise ValueError("Incompatible dataset entered for current task,"
ValueError: Incompatible dataset entered for current task,expected dataset to have task type :tabular_regression got :tabular_classification

nabenabe0928 and others added 14 commits November 21, 2021 09:43
…oml#334)

* [feat] Support statistics print by adding results manager object

* [refactor] Make SearchResults extract run_history at __init__

Since the search results should not be kept around eternally,
I made this class take run_history in __init__ so that
the extraction can be called implicitly inside.
With this change, calling the extraction from outside is not recommended.
However, you can still call it from outside, and to prevent a mix-up of
the environment, self.clear() will be called.

* [fix] Separate those changes into PR#336

* [fix] Fix so that test_loss includes all the metrics

* [enhance] Strengthen the test for sprint and SearchResults

* [fix] Fix an issue in documentation

* [enhance] Increase the coverage

* [refactor] Separate the test for results_manager to organize the structure

* [test] Add the test for get_incumbent_Result

* [test] Remove the previous test_get_incumbent and see the coverage

* [fix] [test] Fix reversion of metric and strengthen the test cases

* [fix] Fix flake8 issues and increase coverage

* [fix] Address Ravin's comments

* [enhance] Increase the coverage

* [fix] Fix a flake8 issue
* Create release workflow and CITATION.cff  and update README, setup.py

* fix bug in pypy token

* fix documentation formatting

* TODO for docker image

* accept suggestions from shuhei

* add further options for disable_file_output documentation

* remove  from release.yml
* [doc] Add workflow of the AutoPytorch

* [doc] Address Ravin's comment
* set verbose=False in catboost

* fix flake
* fix formatting in docs

* Update examples/40_advanced/example_resampling_strategy.py
* [feat] Add an object that realizes the perf over time viz

* [fix] Modify TODOs and add comments to avoid complications

* [refactor] [feat] Format visualizer API and integrate this feature into BaseTask

* [refactor] Separate a shared raise error process as a function

* [refactor] Gather params in Dataclass to look smarter

* [refactor] Merge extraction from history to the result manager

Since this feature was added in a previous PR, we now rely on it
to extract the history.
To handle the ordering issue caused by sorting by start time, I added a
sort-by-end-time feature.

* [feat] Merge the viz in the latest version

* [fix] Fix nan --> worst val so that we can always handle by number

* [fix] Fix mypy issues

* [test] Add test for get_start_time

* [test] Add test for order by end time

* [test] Add tests for ensemble results

* [test] Add tests for merging ensemble results and run history

* [test] Add the tests in the case of ensemble_results is None

* [fix] Switch datetime to timestamp in tests so they pass universally

Since the mapping from timestamp to datetime varies across machines,
the tests failed in the previous version.
In this version, we changed the datetime in the tests to a fixed
timestamp so that the tests pass universally.

* [fix] Fix status_msg --> status_type because it does not need to be str

* [fix] Change the name for homogeneity

* [fix] Fix based on the file name change

* [test] Add tests for set_plot_args

* [test] Add tests for plot_perf_over_time in BaseTask

* [refactor] Replace redundant lines by pytest parametrization

* [test] Add tests for _get_perf_and_time

* [fix] Remove viz attribute based on Ravin's comment

* [fix] Fix doc-string based on Ravin's comments

* [refactor] Hide color label settings extraction in dataclass

Since this process made the method in BaseTask redundant, as Ravin
pointed out, I made it a method of the dataclass so that we can easily
fetch this information.
Note that since the color and label information always depends on the
optimization results, we always need to pass the metric results to ensure
we only get the related keys.

* [test] Add tests for color label dicts extraction

* [test] Add tests for checking if plt.show is called or not

* [refactor] Address Ravin's comments and add TODO for the refactoring

* [refactor] Change KeyError in EnsembleResults to empty

Since it is inconvenient not to be able to instantiate EnsembleResults
when we do not have any histories,
I changed the functionality so that we can still instantiate it even when
the results are empty.
In this case, we get empty arrays, which also matches developers'
intuition.

* [refactor] Prohibit external updates to make objects more robust

* [fix] Remove a member variable _opt_scores since it is confusing

Since opt_scores were taken from the cost in run_history while metric_dict
was taken from additional_info, it was confusing which one should be
referred to for what. By removing this member, we can always refer to
additional_info when fetching information, and metrics are always available
as raw values. Although I changed a lot, the functionality did not change,
and it is now easier to add other functionalities.

* [example] Add an example of how to plot performance over time

* [fix] Fix unexpected train loss when using cross validation

* [fix] Remove __main__ from the example based on Ravin's comment

* [fix] Move results_xxx to utils from API

* [enhance] Change example for the plot over time to save fig

Since plt.show() does not work in some environments,
I changed the example to save the figure so that everyone can run at least this example.
* cleanup of simple_imputer

* Fixed doc and typo

* Fixed docs

* Made changes, added test

* Fixed init statement

* Fixed docs

* Flake'd
codecov bot commented Dec 6, 2021

Codecov Report

Merging #354 (8545155) into development (8f9e9f6) will increase coverage by 0.01%.
The diff coverage is 80.00%.


@@               Coverage Diff               @@
##           development     #354      +/-   ##
===============================================
+ Coverage        82.62%   82.64%   +0.01%     
===============================================
  Files              154      154              
  Lines             9055     9076      +21     
  Branches          1592     1598       +6     
===============================================
+ Hits              7482     7501      +19     
- Misses            1106     1107       +1     
- Partials           467      468       +1     
Impacted Files | Coverage Δ
autoPyTorch/datasets/base_dataset.py | 80.00% <80.00%> (-0.80%) ⬇️
autoPyTorch/api/base_task.py | 83.60% <0.00%> (ø)
autoPyTorch/utils/results_visualizer.py | 99.05% <0.00%> (+0.05%) ⬆️
...peline/components/training/trainer/base_trainer.py | 96.82% <0.00%> (+1.05%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

nabenabe0928 added the "first priority (PRs to be checked as a priority)" label on Dec 6, 2021
Since we use BaseDataset without labels when we instantiate it for
prediction, we remove this error raise.
nabenabe0928 (Collaborator, Author) commented:
This change is the core of this PR, but since the original author no longer responds to the issue, we are closing this PR for now.

Note that the issue happened because sklearn's task-type inference is incorrect for this case: if the target contains only integers, it will treat the target as multiclass even if it is really continuous.
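As a minimal illustration of that behavior (not part of this PR's tests; the price values below are made up):

import numpy as np
from sklearn.utils.multiclass import type_of_target

prices = np.array([120000, 250000, 310000, 99000, 405000])  # integer-valued regression target
print(type_of_target(prices))        # "multiclass", although the task is regression
print(type_of_target(prices + 0.5))  # "continuous", once the values are non-integral floats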
