
[ADD] Calculate memory of dataset after one hot encoding (pytorch embedding) #437

Conversation

ravinkohli (Contributor):

This PR aims to improve the approximate memory-usage estimate of a dataset by considering the dataset after it has been transformed with one-hot encoding. Based on our experiments (reg cocktails ablation study), we have observed that categorical columns with high cardinality explode in size when one-hot encoded. Moreover, even with the addition of PyTorch embeddings (which removes the need to one-hot encode all categorical columns), we observe that excessive memory is used while building the neural network.
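
To make the estimate concrete, here is a minimal sketch of the idea (the helper name estimate_post_encoding_nbytes, the threshold value, and the single shared dtype are illustrative assumptions, not autoPyTorch's actual API): a categorical column below the embedding threshold contributes num_cat * itemsize bytes per row once one-hot encoded, while a column routed to a PyTorch embedding keeps a single column.

    import numpy as np

    # Illustrative threshold; the PR uses MIN_CATEGORIES_FOR_EMBEDDING_MAX.
    MIN_CATEGORIES_FOR_EMBEDDING_MAX = 7

    def estimate_post_encoding_nbytes(arr, categorical_columns, n_categories_per_cat_column):
        """Hypothetical helper: rough memory estimate after encoding.

        Columns with fewer than MIN_CATEGORIES_FOR_EMBEDDING_MAX categories
        are one-hot encoded (num_cat output columns); the rest go to an
        embedding layer and keep a single column.
        """
        itemsize = arr.dtype.itemsize
        n_rows, n_cols = arr.shape
        multipliers = []
        for col in range(n_cols):
            if col not in categorical_columns:
                multipliers.append(itemsize)  # numerical column: unchanged
        for num_cat in n_categories_per_cat_column:
            if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
                multipliers.append(num_cat * itemsize)  # one-hot: num_cat columns
            else:
                multipliers.append(itemsize)  # embedded: one column of codes
        return n_rows * sum(multipliers)

    # e.g. 1000 rows, 3 columns; columns 1 and 2 are categorical with 4 and
    # 1200 categories: column 1 is one-hot encoded, column 2 is embedded.
    X = np.zeros((1000, 3), dtype=np.float32)
    print(estimate_post_encoding_nbytes(X, [1, 2], [4, 1200]))  # 1000 * (4 + 16 + 4) = 24000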

@ravinkohli ravinkohli changed the base branch from reg_cocktails_apt1.0+reg_cocktails_pytorch_embedding to reg_cocktails-pytorch_embedding June 15, 2022 13:21
@ravinkohli ravinkohli force-pushed the reg_cocktails_apt1.0+reg_cocktails_pytorch_embedding_debug branch from c2a98c9 to f2f5f72 Compare June 15, 2022 13:23
Comment on lines 59 to 60

    port=X['logger_port'
    ] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
Contributor:
Suggested change:

    - port=X['logger_port'
    - ] if 'logger_port' in X else logging.handlers.DEFAULT_TCP_LOGGING_PORT,
    + port=X.get('logger_port', logging.handlers.DEFAULT_TCP_LOGGING_PORT),

Contributor Author:

Actually, we don't need this code to be merged; I'll remove it.

Comment on lines +493 to +494

    else:
        multipliers.append(arr_dtypes[col].itemsize)
Contributor:

What happens in one-hot encoding when num_cat is larger than MIN_CATEGORIES_FOR_EMBEDDING_MAX?

Contributor Author:

They are not one-hot encoded; instead, they are sent to the PyTorch embedding module, where the one-hot encoding happens implicitly.
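
A minimal sketch of that path, assuming a plain torch.nn.Embedding (the numbers are illustrative): the column's integer codes index directly into the embedding table, so the num_cat-wide one-hot matrix is never materialised in memory.

    import torch
    import torch.nn as nn

    num_cat = 10_000  # cardinality of the high-cardinality column
    embedding = nn.Embedding(num_embeddings=num_cat, embedding_dim=32)

    codes = torch.randint(0, num_cat, (8,))  # a batch of 8 category ids
    dense = embedding(codes)                 # shape (8, 32), not (8, 10000)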

autoPyTorch/data/utils.py (outdated comment, resolved)
        raise ValueError(err_msg)
    for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX:
            multipliers.append(num_cat * arr_dtypes[col].itemsize)
Contributor:

Is it already guaranteed that all columns are non-object?
Otherwise, we should check it.

Contributor Author:

Yes, it's guaranteed that all columns are non-object; moreover, they are also guaranteed to be NumPy arrays, as this code runs after we have transformed the data with the tabular feature validator.

autoPyTorch/data/utils.py (two outdated comments, resolved)
    if len(categorical_columns) > 0:
        if n_categories_per_cat_column is None:
            raise ValueError(err_msg)
        for col, num_cat in zip(categorical_columns, n_categories_per_cat_column):
Contributor:

We could use sum(...) here, same as below. (optional)
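
For illustration, the suggested refactor might look like this (a sketch reusing the names from the diff snippet above, not the merged code):

    n_bytes_categorical = sum(
        num_cat * arr_dtypes[col].itemsize
        if num_cat < MIN_CATEGORIES_FOR_EMBEDDING_MAX
        else arr_dtypes[col].itemsize
        for col, num_cat in zip(categorical_columns, n_categories_per_cat_column)
    )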

@theodorju (Collaborator) left a comment:

As discussed in the meeting, I reviewed the changes. Everything looks good to me, I'm just adding a minor suggestion as a comment.

@ravinkohli ravinkohli merged commit 95a5969 into reg_cocktails-pytorch_embedding Jul 16, 2022
ravinkohli added a commit that referenced this pull request Aug 16, 2022
…edding) (#437)

* add updates for apt1.0+reg_cocktails

* debug loggers for checking data and network memory usage

* add support for pandas, test for data passing, remove debug loggers

* remove unwanted changes

* :

* Adjust formula to account for embedding columns

* Apply suggestions from code review

Co-authored-by: nabenabe0928 <[email protected]>

* remove unwanted additions

* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py

Co-authored-by: nabenabe0928 <[email protected]>
ravinkohli added a commit that referenced this pull request Oct 25, 2022
…edding) (#437)

(same commit message as above)
@ravinkohli ravinkohli deleted the reg_cocktails_apt1.0+reg_cocktails_pytorch_embedding_debug branch March 16, 2023 13:02