Skip to content

Commit

Permalink
Cocktail fixes time debug (#286)
Browse files Browse the repository at this point in the history
* preprocess inside data validator

* add time debug statements

* Add fixes for categorical data

* add fit_ensemble

* add arlind fix for swa and se

* fix bug in trainer choice fit

* fix ensemble bug

* Correct bug in cleanup

* Cleanup for removing time debug statements

* ablation for adversarial

* shuffle false in dataloader

* drop last false in dataloader

* fix bug for validation set, and cutout and cutmix

* shuffle = False

* Shake Shake updates (#287)

* To test locally

* fix bug in trainer choice fit

* fix ensemble bug

* Correct bug in cleanup

* To test locally

* Cleanup for removing time debug statements

* ablation for adversarial

* shuffle false in dataloader

* drop last false in dataloader

* fix bug for validation set, and cutout and cutmix

* To test locally

* shuffle = False

* To test locally

* updates to search space

* updates to search space

* update branch with search space

* undo search space update

* fix bug in shake shake flag

* limit to shake-even

* restrict to even even

* Add even even and others for shake-drop also

* fix bug in passing alpha beta method

* restrict to only even even

* fix silly bug:

* remove imputer and ordinal encoder for categorical transformer in feature validator

* Address comments from shuhei

* fix issues with ensemble fitting post hoc

* Address comments on the PR

* Fix flake and mypy errors

* Address comments from PR #286

* fix bug in embedding

* Update autoPyTorch/api/tabular_classification.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/datasets/base_dataset.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/datasets/base_dataset.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/pipeline/components/training/trainer/base_trainer.py

Co-authored-by: nabenabe0928 <[email protected]>

* Address comments from shuhei

* adress comments from shuhei

* fix flake and mypy

* Update autoPyTorch/pipeline/components/training/trainer/RowCutMixTrainer.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/pipeline/tabular_classification.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/pipeline/components/setup/network_backbone/utils.py

Co-authored-by: nabenabe0928 <[email protected]>

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* increase threads_per_worker

* fix bug in rowcutmix

* Enhancement for the tabular validator. (#291)

* Initial try at an enhancement for the tabular validator

* Adding a few type annotations

* Fixing bugs in implementation

* Adding wrongly deleted code part during rebase

* Fix bug in _get_args

* Fix bug in _get_args

* Addressing Shuhei's comments

* Address Shuhei's comments

* Refactoring code

* Refactoring code

* Typos fix and additional comments

* Replace nan in categoricals with simple imputer

* Remove unused function

* add comment

* Update autoPyTorch/data/tabular_feature_validator.py

Co-authored-by: nabenabe0928 <[email protected]>

* Update autoPyTorch/data/tabular_feature_validator.py

Co-authored-by: nabenabe0928 <[email protected]>

* Adding unit test for only nall columns in the tabular feature categorical evaluator

* fix bug in remove all nan columns

* Bug fix for making tests run by arlind

* fix flake errors in feature validator

* made typing code uniform

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* address comments from shuhei

* address comments from shuhei (2)

Co-authored-by: Ravin Kohli <[email protected]>
Co-authored-by: Ravin Kohli <[email protected]>
Co-authored-by: nabenabe0928 <[email protected]>

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* resolve code issues with new versions

* Address comments from shuhei

* make run_traditional_ml function

* implement suggestion from shuhei and fix bug in rowcutmixtrainer

* fix return type docstring

* add better documentation and fix bug in shake_drop_get_bl

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* add test for comparator and other improvements based on PR comments

* fix bug in test

* [fix] Fix the condition in the raising error of all_nan_columns

* [refactor] Unite name conventions of numpy array and pandas dataframe

* [doc] Add the description about the tabular feature transformation

* [doc] Add the description of the tabular feature transformation

* address comments from arlind

* address comments from arlind

* change to as_tensor and address comments from arlind

* correct description for functions in data module

Co-authored-by: nabenabe0928 <[email protected]>
Co-authored-by: Arlind Kadra <[email protected]>
Co-authored-by: nabenabe0928 <[email protected]>
  • Loading branch information
4 people authored Oct 20, 2021
1 parent d37d4a5 commit 23466f0
Show file tree
Hide file tree
Showing 35 changed files with 1,130 additions and 527 deletions.
337 changes: 280 additions & 57 deletions autoPyTorch/api/base_task.py

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions autoPyTorch/api/tabular_classification.py
Original file line number Diff line number Diff line change
Expand Up @@ -275,6 +275,8 @@ def search(
y_test=y_test,
dataset_name=dataset_name)

if self.dataset is None:
raise ValueError("`dataset` in {} must be initialized, but got None".format(self.__class__.__name__))
return self._search(
dataset=self.dataset,
optimize_metric=optimize_metric,
Expand Down
2 changes: 2 additions & 0 deletions autoPyTorch/api/tabular_regression.py
Original file line number Diff line number Diff line change
Expand Up @@ -261,6 +261,8 @@ def search(
y_test=y_test,
dataset_name=dataset_name)

if self.dataset is None:
raise ValueError("`dataset` in {} must be initialized, but got None".format(self.__class__.__name__))
return self._search(
dataset=self.dataset,
optimize_metric=optimize_metric,
Expand Down
83 changes: 56 additions & 27 deletions autoPyTorch/data/base_feature_validator.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import logging
import typing
from typing import List, Optional, Set, Tuple, Union

import numpy as np

Expand All @@ -12,8 +12,8 @@
from autoPyTorch.utils.logging_ import PicklableClientLogger


SUPPORTED_FEAT_TYPES = typing.Union[
typing.List,
SUPPORTED_FEAT_TYPES = Union[
List,
pd.DataFrame,
np.ndarray,
scipy.sparse.bsr_matrix,
Expand All @@ -35,60 +35,61 @@ class BaseFeatureValidator(BaseEstimator):
List of the column types found by this estimator during fit.
data_type (str):
Class name of the data type provided during fit.
encoder (typing.Optional[BaseEstimator])
encoder (Optional[BaseEstimator])
Host a encoder object if the data requires transformation (for example,
if provided a categorical column in a pandas DataFrame)
enc_columns (typing.List[str])
enc_columns (List[str])
List of columns that were encoded.
"""
def __init__(self,
logger: typing.Optional[typing.Union[PicklableClientLogger, logging.Logger
]] = None,
logger: Optional[Union[PicklableClientLogger, logging.Logger
]
] = None,
) -> None:
# Register types to detect unsupported data format changes
self.feat_type = None # type: typing.Optional[typing.List[str]]
self.data_type = None # type: typing.Optional[type]
self.dtypes = [] # type: typing.List[str]
self.column_order = [] # type: typing.List[str]
self.feat_type: Optional[List[str]] = None
self.data_type: Optional[type] = None
self.dtypes: List[str] = []
self.column_order: List[str] = []

self.encoder = None # type: typing.Optional[BaseEstimator]
self.enc_columns = [] # type: typing.List[str]
self.encoder: Optional[BaseEstimator] = None
self.enc_columns: List[str] = []

self.logger: typing.Union[
self.logger: Union[
PicklableClientLogger, logging.Logger
] = logger if logger is not None else logging.getLogger(__name__)

# Required for dataset properties
self.num_features = None # type: typing.Optional[int]
self.categories = [] # type: typing.List[typing.List[int]]
self.categorical_columns: typing.List[int] = []
self.numerical_columns: typing.List[int] = []
# column identifiers may be integers or strings
self.null_columns: typing.Set[str] = set()
self.num_features: Optional[int] = None
self.categories: List[List[int]] = []
self.categorical_columns: List[int] = []
self.numerical_columns: List[int] = []

self.all_nan_columns: Optional[Set[Union[int, str]]] = None

self._is_fitted = False

def fit(
self,
X_train: SUPPORTED_FEAT_TYPES,
X_test: typing.Optional[SUPPORTED_FEAT_TYPES] = None,
X_test: Optional[SUPPORTED_FEAT_TYPES] = None,
) -> BaseEstimator:
"""
Validates and fit a categorical encoder (if needed) to the features.
The supported data types are List, numpy arrays and pandas DataFrames.
CSR sparse data types are also supported
Arguments:
Args:
X_train (SUPPORTED_FEAT_TYPES):
A set of features that are going to be validated (type and dimensionality
checks) and a encoder fitted in the case the data needs encoding
X_test (typing.Optional[SUPPORTED_FEAT_TYPES]):
X_test (Optional[SUPPORTED_FEAT_TYPES]):
A hold out set of data used for checking
"""

# If a list was provided, it will be converted to pandas
if isinstance(X_train, list):
X_train, X_test = self.list_to_dataframe(X_train, X_test)
X_train, X_test = self.list_to_pandas(X_train, X_test)

self._check_data(X_train)

Expand All @@ -114,14 +115,15 @@ def _fit(
X: SUPPORTED_FEAT_TYPES,
) -> BaseEstimator:
"""
Arguments:
Args:
X (SUPPORTED_FEAT_TYPES):
A set of features that are going to be validated (type and dimensionality
checks) and a encoder fitted in the case the data needs encoding
Returns:
self:
The fitted base estimator
"""

raise NotImplementedError()

def _check_data(
Expand All @@ -131,19 +133,20 @@ def _check_data(
"""
Feature dimensionality and data type checks
Arguments:
Args:
X (SUPPORTED_FEAT_TYPES):
A set of features that are going to be validated (type and dimensionality
checks) and a encoder fitted in the case the data needs encoding
"""

raise NotImplementedError()

def transform(
self,
X: SUPPORTED_FEAT_TYPES,
) -> np.ndarray:
"""
Arguments:
Args:
X_train (SUPPORTED_FEAT_TYPES):
A set of features, whose categorical features are going to be
transformed
Expand All @@ -152,4 +155,30 @@ def transform(
np.ndarray:
The transformed array
"""

raise NotImplementedError()

def list_to_pandas(
self,
X_train: SUPPORTED_FEAT_TYPES,
X_test: Optional[SUPPORTED_FEAT_TYPES] = None,
) -> Tuple[pd.DataFrame, Optional[pd.DataFrame]]:
"""
Converts a list to a pandas DataFrame. In this process, column types are inferred.
If test data is provided, we proactively match it to train data
Args:
X_train (SUPPORTED_FEAT_TYPES):
A set of features that are going to be validated (type and dimensionality
checks) and a encoder fitted in the case the data needs encoding
X_test (Optional[SUPPORTED_FEAT_TYPES]):
A hold out set of data used for checking
Returns:
pd.DataFrame:
transformed train data from list to pandas DataFrame
pd.DataFrame:
transformed test data from list to pandas DataFrame
"""

raise NotImplementedError()
52 changes: 27 additions & 25 deletions autoPyTorch/data/base_target_validator.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
import logging
import typing
from typing import List, Optional, Union, cast

import numpy as np

Expand All @@ -12,8 +12,8 @@
from autoPyTorch.utils.logging_ import PicklableClientLogger


SUPPORTED_TARGET_TYPES = typing.Union[
typing.List,
SUPPORTED_TARGET_TYPES = Union[
List,
pd.Series,
pd.DataFrame,
np.ndarray,
Expand All @@ -35,48 +35,50 @@ class BaseTargetValidator(BaseEstimator):
is_classification (bool):
A bool that indicates if the validator should operate in classification mode.
During classification, the targets are encoded.
encoder (typing.Optional[BaseEstimator]):
encoder (Optional[BaseEstimator]):
Host a encoder object if the data requires transformation (for example,
if provided a categorical column in a pandas DataFrame)
enc_columns (typing.List[str])
enc_columns (List[str])
List of columns that where encoded
"""
def __init__(self,
is_classification: bool = False,
logger: typing.Optional[typing.Union[PicklableClientLogger, logging.Logger
]] = None,
logger: Optional[Union[PicklableClientLogger,
logging.Logger
]
] = None,
) -> None:
self.is_classification = is_classification

self.data_type = None # type: typing.Optional[type]
self.data_type: Optional[type] = None

self.encoder = None # type: typing.Optional[BaseEstimator]
self.encoder: Optional[BaseEstimator] = None

self.out_dimensionality = None # type: typing.Optional[int]
self.type_of_target = None # type: typing.Optional[str]
self.out_dimensionality: Optional[int] = None
self.type_of_target: Optional[str] = None

self.logger: typing.Union[
self.logger: Union[
PicklableClientLogger, logging.Logger
] = logger if logger is not None else logging.getLogger(__name__)

# Store the dtype for remapping to correct type
self.dtype = None # type: typing.Optional[type]
self.dtype: Optional[type] = None

self._is_fitted = False

def fit(
self,
y_train: SUPPORTED_TARGET_TYPES,
y_test: typing.Optional[SUPPORTED_TARGET_TYPES] = None,
y_test: Optional[SUPPORTED_TARGET_TYPES] = None,
) -> BaseEstimator:
"""
Validates and fit a categorical encoder (if needed) to the targets
The supported data types are List, numpy arrays and pandas DataFrames.
Arguments:
Args:
y_train (SUPPORTED_TARGET_TYPES)
A set of targets set aside for training
y_test (typing.Union[SUPPORTED_TARGET_TYPES])
y_test (Union[SUPPORTED_TARGET_TYPES])
A hold out set of data used of the targets. It is also used to fit the
categories of the encoder.
"""
Expand All @@ -95,8 +97,8 @@ def fit(
np.shape(y_test)
))
if isinstance(y_train, pd.DataFrame):
y_train = typing.cast(pd.DataFrame, y_train)
y_test = typing.cast(pd.DataFrame, y_test)
y_train = cast(pd.DataFrame, y_train)
y_test = cast(pd.DataFrame, y_test)
if y_train.columns.tolist() != y_test.columns.tolist():
raise ValueError(
"Train and test targets must both have the same columns, yet "
Expand Down Expand Up @@ -127,24 +129,24 @@ def fit(
def _fit(
self,
y_train: SUPPORTED_TARGET_TYPES,
y_test: typing.Optional[SUPPORTED_TARGET_TYPES] = None,
y_test: Optional[SUPPORTED_TARGET_TYPES] = None,
) -> BaseEstimator:
"""
Arguments:
Args:
y_train (SUPPORTED_TARGET_TYPES)
The labels of the current task. They are going to be encoded in case
of classification
y_test (typing.Optional[SUPPORTED_TARGET_TYPES])
y_test (Optional[SUPPORTED_TARGET_TYPES])
A holdout set of labels
"""
raise NotImplementedError()

def transform(
self,
y: typing.Union[SUPPORTED_TARGET_TYPES],
y: Union[SUPPORTED_TARGET_TYPES],
) -> np.ndarray:
"""
Arguments:
Args:
y (SUPPORTED_TARGET_TYPES)
A set of targets that are going to be encoded if the current task
is classification
Expand All @@ -161,8 +163,8 @@ def inverse_transform(
"""
Revert any encoding transformation done on a target array
Arguments:
y (typing.Union[np.ndarray, pd.DataFrame, pd.Series]):
Args:
y (Union[np.ndarray, pd.DataFrame, pd.Series]):
Target array to be transformed back to original form before encoding
Returns:
np.ndarray:
Expand Down
4 changes: 2 additions & 2 deletions autoPyTorch/data/base_validator.py
Original file line number Diff line number Diff line change
Expand Up @@ -58,7 +58,7 @@ def fit(
+ Checks for dimensionality as well as missing values are performed.
+ If performing a classification task, the data is going to be encoded
Arguments:
Args:
X_train (SUPPORTED_FEAT_TYPES):
A set of features that are going to be validated (type and dimensionality
checks). If this data contains categorical columns, an encoder is going to
Expand Down Expand Up @@ -102,7 +102,7 @@ def transform(
"""
Transform the given target or features to a numpy array
Arguments:
Args:
X (SUPPORTED_FEAT_TYPES):
A set of features to transform
y (typing.Optional[SUPPORTED_TARGET_TYPES]):
Expand Down
Loading

0 comments on commit 23466f0

Please sign in to comment.