[Datasets] GroupedDataset.map_groups() Doesn't Support Callable Classes #26244

Closed
thatcort opened this issue Jul 1, 2022 · 1 comment
Labels: bug (something that is supposed to be working, but isn't); data (Ray Data-related issues); P2 (important issue, but not time-critical)


thatcort commented Jul 1, 2022

What happened + What you expected to happen

The GroupedDataset method map_groups() is missing support for callable classes.

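Passing a callable class (the class itself, not an instance) to map_groups() with compute='actors' fails with the error below. The last frame shows the class being called directly with the group, so the group is handed to __init__() instead of __call__():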

Map Progress (1 actors 1 pending):  88%|████████▊ | 7/8 [00:36<00:04,  4.69s/it]
2022-06-30 20:43:20,980 ERROR worker.py:95 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::BlockWorker.map_block_nosplit() (pid=73870, ip=172.28.0.2, repr=<ray.data.impl.compute.BlockWorker object at 0x7ffb4136d5d0>)
  File "/usr/local/lib/python3.7/dist-packages/ray/data/impl/compute.py", line 185, in map_block_nosplit
    return _map_block_nosplit(block, fn, input_files)
  File "/usr/local/lib/python3.7/dist-packages/ray/data/impl/compute.py", line 341, in _map_block_nosplit
    for new_block in fn(block):
  File "/usr/local/lib/python3.7/dist-packages/ray/data/dataset.py", line 355, in transform
    applied = fn(view)
  File "/usr/local/lib/python3.7/dist-packages/ray/data/grouped_dataset.py", line 259, in group_fn
    applied = fn(group)
TypeError: __init__() takes 1 positional argument but 2 were given
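
The final frame of the traceback can be reproduced without Ray. The sketch below is only an illustration: `Splitter` and `group_fn` are hypothetical stand-ins, and `group_fn` mirrors nothing more than the `applied = fn(group)` line shown above.

```python
class Splitter:
    """Hypothetical stand-in for a callable class with a no-argument constructor."""

    def __init__(self):
        self.calls = 0

    def __call__(self, group):
        self.calls += 1
        return [row.upper() for row in group]


def group_fn(fn, group):
    # Mirrors the `applied = fn(group)` frame from the traceback.
    return fn(group)


group = ["a", "b"]

# Passing the class: fn(group) becomes Splitter(group), so __init__ receives
# an unexpected extra positional argument and raises TypeError.
try:
    group_fn(Splitter, group)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Passing an instance: fn(group) dispatches to __call__ as intended.
result = group_fn(Splitter(), group)
```

In other words, the class is never instantiated before being applied to the group, which is consistent with map_groups() lacking the callable-class handling.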

https://ray-distributed.slack.com/archives/C01DLHZHRBJ/p1656626491260859?thread_ts=1656616666.457479&cid=C01DLHZHRBJ

Versions / Dependencies

This is using Ray 1.13.0 and PyArrow 6.0.1 on Python 3.7.13.

Reproduction script

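import pyarrow as pa
import ray
import spacy

class SentenceSplitter:
  def __init__(self):
    self.nlp = spacy.load("en_core_web_sm")

  def __call__(self, batch):
    # Each group holds the rows for one docId; split the first row's text into sentences.
    doc = self.nlp(batch['text'][0].as_py())
    sents = [sent.text for sent in doc.sents]
    docIds = [batch['docId'][0].as_py()] * len(sents)
    return pa.Table.from_pydict({'docId': docIds, 'text': sents})

ds = ray.data.read_parquet(fileName)  # fileName: path to a Parquet file with docId and text columns
# Passing the class itself (not an instance) triggers the TypeError above.
ds = ds.groupby('docId').map_groups(SentenceSplitter, compute='actors', batch_format="pyarrow")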
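
Until callable classes are supported, one possible workaround is to pass a plain function that builds the instance on first use. The sketch below is an assumption, not a confirmed fix: `lazy_callable` and `CountingSplitter` are hypothetical names, and the example runs in plain Python rather than under Ray. Whether the cached instance survives across blocks depends on how Ray serializes and reuses the function on each worker, which has not been verified here.

```python
def lazy_callable(cls, *args, **kwargs):
    """Hypothetical helper: wrap a callable class in a plain function.

    The instance is built on the first call and reused afterwards.
    """
    instance = None

    def fn(group):
        nonlocal instance
        if instance is None:
            instance = cls(*args, **kwargs)  # expensive setup happens once here
        return instance(group)

    return fn


class CountingSplitter:
    """Stand-in for SentenceSplitter; counts constructions instead of loading a model."""

    constructed = 0

    def __init__(self):
        CountingSplitter.constructed += 1

    def __call__(self, group):
        return [text.split(". ") for text in group]


split_fn = lazy_callable(CountingSplitter)
first = split_fn(["One. Two"])
second = split_fn(["Three"])
```

With such a wrapper, the call would become something like map_groups(lazy_callable(SentenceSplitter), ...), so that map_groups() only ever sees an ordinary function.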

Issue Severity

No response

thatcort added the bug and triage labels Jul 1, 2022
clarkzinzow added the P2 and data labels and removed the triage label Jul 8, 2022
jianoaix changed the title from "[data] GroupedDataset.map_groups() Doesn't Support Callable Classes" to "[Datasets] GroupedDataset.map_groups() Doesn't Support Callable Classes" Aug 29, 2022
ShulinChen self-assigned this Sep 19, 2022
