
[Datasets] MultiHotEncoder doesn't work with Arrow datasets #31353

Closed
bveeramani opened this issue Dec 28, 2022 · 2 comments · Fixed by #31365
Assignees: bveeramani
Labels: bug (Something that is supposed to be working; but isn't), data (Ray Data-related issues), P1 (Issue that should be fixed within a few weeks)

Comments

bveeramani (Member) commented Dec 28, 2022

What happened + What you expected to happen

Tried to transform a dataset with MultiHotEncoder, but got a TypeError:

Traceback (most recent call last):
  File "/Users/balaji/Documents/GitHub/ray/temp.py", line 7, in <module>
    encoder.fit_transform(dataset)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessor.py", line 120, in fit_transform
    self.fit(dataset)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessor.py", line 105, in fit
    return self._fit(dataset)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessors/encoder.py", line 318, in _fit
    self.stats_ = _get_unique_value_indices(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessors/encoder.py", line 550, in _get_unique_value_indices
    value_counts = dataset.map_batches(get_pd_value_counts, batch_format="pandas")
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 659, in map_batches
    return Dataset(plan, self._epoch, self._lazy)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 223, in __init__
    self._plan.execute(allow_clear_input_blocks=False)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 321, in execute
    blocks, stage_info = stage(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/plan.py", line 688, in __call__
    blocks = compute._apply(
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 154, in _apply
    raise e from None
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 138, in _apply
    results = map_bar.fetch_until_complete(refs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/progress_bar.py", line 75, in fetch_until_complete
    for ref, result in zip(done, ray.get(done)):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/_private/worker.py", line 2347, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TypeError): ray::_map_block_split() (pid=82040, ip=127.0.0.1)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/_internal/compute.py", line 459, in _map_block_split
    for new_block in block_fn(blocks, *fn_args, **fn_kwargs):
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 637, in transform
    yield from process_next_batch(batch)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/dataset.py", line 601, in process_next_batch
    batch = batch_fn(batch, *fn_args, **fn_kwargs)
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessors/encoder.py", line 543, in get_pd_value_counts
    result[col] = get_pd_value_counts_per_column(df[col])
  File "/Users/balaji/Documents/GitHub/ray/python/ray/data/preprocessors/encoder.py", line 536, in get_pd_value_counts_per_column
    return Counter(col.value_counts(dropna=False).to_dict())
  File "/Users/balaji/Documents/GitHub/ray/.venv/lib/python3.10/site-packages/pandas/core/series.py", line 1895, in to_dict
    return into_c((k, maybe_box_native(v)) for k, v in self.items())
TypeError: unhashable type: 'numpy.ndarray'
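
Reading the bottom of the traceback: the fit step builds Counter(col.value_counts(dropna=False).to_dict()), and to_dict() ends up keying a dict by the column's values, which here are numpy.ndarray objects (the Arrow list column arrives in the pandas batch as ndarray cells). Arrays are not hashable, hence the TypeError. A minimal sketch of that underlying Python behaviour, outside of Ray:

import numpy as np

# numpy arrays are not hashable, so a dict (and hence a Counter) keyed by
# them raises the same TypeError seen in the traceback above.
key = np.array(["spam", "ham", "eggs"])
try:
    {key: 1}
except TypeError as err:
    print(err)  # unhashable type: 'numpy.ndarray'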

Versions / Dependencies

ray: f40ac95
pyarrow: 10.0.1

Reproduction script

import ray
from ray.data.preprocessors import MultiHotEncoder

dataset = ray.data.from_items([{"column": ["spam", "ham", "eggs"]}])
print(dataset)
encoder = MultiHotEncoder(columns=["column"])
encoder.fit_transform(dataset)
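
For context, the batch conversion that feeds the encoder can be mimicked with pyarrow alone. This is a hedged sketch (not part of the original report) showing that an Arrow list<string> column comes back from a pandas conversion as ndarray cells, which is what later breaks hashing:

import pyarrow as pa

# Converting an Arrow list<string> column to pandas yields one numpy.ndarray
# per cell rather than a Python list.
table = pa.table({"column": [["spam", "ham", "eggs"]]})
df = table.to_pandas()
print(type(df["column"].iloc[0]))  # expected: <class 'numpy.ndarray'>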

Issue Severity

Medium: It is a significant difficulty but I can work around it.

bveeramani added the bug, P1, air, and data labels on Dec 28, 2022
bveeramani self-assigned this on Dec 28, 2022
amogkam (Contributor) commented Dec 31, 2022

Does from_items return an Arrow dataset? I thought it returned a simple dataset?

amogkam (Contributor) commented Dec 31, 2022

Ah, never mind. It seems that if the item is a dictionary, it is treated as an Arrow block: https://sourcegraph.com/github.com/ray-project/ray/-/blob/python/ray/data/_internal/delegating_block_builder.py?L19-20.

@clarkzinzow Should dictionaries be treated as pandas blocks instead?
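
To make that distinction concrete, here is a sketch (an addition, under the assumption that Dataset.dataset_format() is available at this commit; the exact API may differ across versions): from_items with dict items goes through the delegating block builder and produces an Arrow block, while plain Python objects stay in a simple block.

import ray

arrow_ds = ray.data.from_items([{"column": ["spam", "ham", "eggs"]}])
simple_ds = ray.data.from_items(["spam", "ham", "eggs"])

# Expected (assumption): dict items build an Arrow block, plain items a simple block.
print(arrow_ds.dataset_format())   # "arrow"
print(simple_ds.dataset_format())  # "simple"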
