Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Vectorize" ColumnGeneratorCachedByIndex #80

Open
MischaPanch opened this issue Feb 27, 2024 · 2 comments
Open

"Vectorize" ColumnGeneratorCachedByIndex #80

MischaPanch opened this issue Feb 27, 2024 · 2 comments

Comments

@MischaPanch
Copy link
Collaborator

The ColumnGeneratorCachedByIndex is recommended for new cached column generators, but it can be significantly slower than the not-recommended way of first creating a ColumnGenerator and then adding cache by wrapping with IndexCachedColumnGenerator.

The reason is that IndexCachedColumnGenerator will find all non-cached values and then process them at once (i.e., batch-wise), whereas the ColumnGeneratorCachedByIndex will always loop through all values. Thus, for an initial filling of the cache this can be much slower.

Not sure what to do here - one would need to redesign the ColumnGeneratorCachedByIndex to not use _generate_value, but that's a breaking change. Another way would be to write a new class a la VectorizedColumnGeneratorCachedByIndex, but I honestly feel like batch-wise processing of missing values should be the default behavior

@opcode81
Copy link
Owner

You are assuming that a vectorised computation mechanism for the value generation exists. In many applications, this won't be the case, and the reason for "ColumnGeneratorCachedByIndex is encouraged" is that it provides a more friendly interface for the case where the computation mechanism operates on a per item/row basis, i.e. creating one value given one row of input.

I think what we need to change is primarily the recommendation, i.e. it should be qualified.

@MischaPanch
Copy link
Collaborator Author

Ok, I guess that's reasonable, since using the other way works well with vectorized columngens. Do you want to adjust the docstring? I feel like you usually don't like my style of writing too much ^^

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants