Closes #221 | Add Dataloader NUS SMS Corpus #596

Open · wants to merge 2 commits into master
Conversation

@akhdanfadh (Collaborator)

Closes #221

I implemented one config per language/subset. Thus, configs will look like this: nus_sms_corpus_eng_source, nus_sms_corpus_cmn_seacrowd_ssp, etc. When testing, pass nus_sms_corpus_<subset> to the --subset_id parameter.
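For illustration, here is a minimal sketch of how such per-subset configs can be generated. This is a hypothetical sketch, not a copy of this PR's script: the SEACrowdConfig field names and the version string here are assumptions based on the SEACrowd template.

    # Hypothetical sketch: one source + one seacrowd_ssp config per language.
    from seacrowd.utils.configs import SEACrowdConfig

    _DATASETNAME = "nus_sms_corpus"
    _LANGUAGES = ["eng", "cmn"]

    BUILDER_CONFIGS = [
        SEACrowdConfig(
            name=f"{_DATASETNAME}_{lang}_{schema}",  # e.g. nus_sms_corpus_eng_source
            version="1.0.0",
            description=f"NUS SMS Corpus {lang} {schema} schema",
            schema=schema,
            subset_id=f"{_DATASETNAME}_{lang}",
        )
        for lang in _LANGUAGES
        for schema in ("source", "seacrowd_ssp")
    ]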

Checkbox

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in the dataloader script (a rough skeleton is sketched after this checklist).
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
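For orientation, a rough skeleton of how the three required methods fit together in a datasets builder is shown below. This is a hypothetical sketch, not this PR's actual nus_sms_corpus.py; the description, URL, and feature names are placeholders.

    import datasets

    class MyDataset(datasets.GeneratorBasedBuilder):
        # one source + one seacrowd config per subset, e.g. as sketched above
        BUILDER_CONFIGS = []

        def _info(self):
            # declare the schema; every generated example must match it exactly
            return datasets.DatasetInfo(
                description="placeholder description",
                features=datasets.Features(
                    {"id": datasets.Value("string"), "text": datasets.Value("string")}
                ),
            )

        def _split_generators(self, dl_manager):
            path = dl_manager.download_and_extract("https://example.com/data.txt")
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN, gen_kwargs={"filepath": path}
                )
            ]

        def _generate_examples(self, filepath):
            with open(filepath, encoding="utf-8") as f:
                for idx, line in enumerate(f):
                    yield idx, {"id": str(idx), "text": line.strip()}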

@raileymontalan (Collaborator) left a comment:

Hi @akhdanfadh,

  • Please run make check_file to fix the small spacing issues.
  • I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.

@jensan-1 (Collaborator) left a comment:

Hello @akhdanfadh. Tested and LGTM here. Please respond to the comments by @raileymontalan; at a minimum, make sure to run the make check_file command so all the spacing problems are cleared.

@akhdanfadh (Collaborator, Author)

I've run make check_file, please double-check.

> I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.

@raileymontalan could you give your test result?

@raileymontalan (Collaborator)

> I've run make check_file, please double-check.
>
> > I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.
>
> @raileymontalan could you give your test result?

Hi @akhdanfadh, I am using a MacBook, so the issue could be related to this. Please see the error message here:

(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub % python -m tests.test_seacrowd seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py --subset_id="nus_sms_corpus_eng"
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py', schema=None, subset_id='nus_sms_corpus_eng', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
INFO:__main__:self.SUBSET_ID: nus_sms_corpus_eng
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.nus_sms_corpus.nus_sms_corpus
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SELF_SUPERVISED_PRETRAINING: 'SSP'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SSP'}
INFO:__main__:schemas_to_check: {'SSP'}
INFO:__main__:Checking load_dataset with config name nus_sms_corpus_eng_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for nus_sms_corpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 0 examples [00:01, ? examples/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: '$'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
    self.dataset_source = datasets.load_dataset(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

----------------------------------------------------------------------
Ran 1 test in 3.052s

FAILED (errors=1)

@akhdanfadh (Collaborator, Author)

@raileymontalan I'm not sure about the MacBook issue, since I was able to test the code on both Ubuntu and macOS (see image below). Since the error is a KeyError, I'm guessing it's about the Python version itself(?), or something in your environment.

[image: screenshot of passing test runs]

@holylovenia (Contributor)

Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.

@holylovenia (Contributor)

Hi @raileymontalan, a friendly reminder to review once you have the time. 👍

@holylovenia (Contributor)

Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @akhdanfadh

@sabilmakbar (Collaborator) commented May 31, 2024:

> Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.

Do you have different versions of datasets on the Mac vs. the server? That was probably the case.

On my end, the generated data has the $ key generated recursively at every level, which is a bit unexpected given the feature list.

[image: screenshot of the generated data with nested $ keys]

Probably adding a condition so the $ column is created only if element.text is available (not None) is the best workaround for now.

Comment on lines +169 to +183:

    def xml_element_to_dict(self, element: ET.Element) -> Dict:
        """Converts an xml element to a dictionary."""
        element_dict = {}

        # add text with key '$', attributes with '@' prefix
        element_dict["$"] = element.text
        for attrib, value in element.attrib.items():
            element_dict[f"@{attrib}"] = value

        # recursively
        for child in element:
            child_dict = self.xml_element_to_dict(child)
            element_dict[child.tag] = child_dict

        return element_dict
@sabilmakbar (Collaborator) commented May 31, 2024:

something like this:

Suggested change:
    # before
    def xml_element_to_dict(self, element: ET.Element) -> Dict:
        """Converts an xml element to a dictionary."""
        element_dict = {}
        # add text with key '$', attributes with '@' prefix
        element_dict["$"] = element.text
        for attrib, value in element.attrib.items():
            element_dict[f"@{attrib}"] = value
        # recursively
        for child in element:
            child_dict = self.xml_element_to_dict(child)
            element_dict[child.tag] = child_dict
        return element_dict

    # after
    def xml_element_to_dict(self, element: ET.Element, root=True) -> Dict:
        """Converts an xml element to a dictionary."""
        element_dict = {}
        # add text with key '$', attributes with '@' prefix
        if element.text:  # avoid appending None text, which would alter the schema
            element_dict["$"] = element.text
        for attrib, value in element.attrib.items():
            element_dict[f"@{attrib}"] = value
        # recursively
        for child in element:
            child_dict = self.xml_element_to_dict(child, root=False)
            element_dict[child.tag] = child_dict
        return element_dict

@akhdanfadh (Collaborator, Author)

@sabilmakbar How about, as a simple but ugly workaround, we just always add the $ attribute to each element, along the lines of the sketch below?
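A minimal sketch of that alternative (hypothetical code, written as a standalone function rather than the PR's method): always emit the $ key, falling back to an empty string, so every element produces the same key set.

    import xml.etree.ElementTree as ET
    from typing import Dict

    def xml_element_to_dict(element: ET.Element) -> Dict:
        """Convert an XML element to a dict whose key set never varies."""
        # Always include '$'; use "" when element.text is None so sibling
        # elements with and without text share the same nested schema.
        element_dict = {"$": element.text or ""}
        for attrib, value in element.attrib.items():
            element_dict[f"@{attrib}"] = value
        for child in element:
            element_dict[child.tag] = xml_element_to_dict(child)
        return element_dict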

@akhdanfadh (Collaborator, Author)

I actually can't test anything because everything seems to be working on my end, on both Mac and Ubuntu. So I guess I need to pass this to someone else.

(Collaborator)

Hmm, probably the issue wasn't about the platform but about the datasets versions. If I remember correctly, newer datasets versions assert that the schema generated by _generate_examples matches the one defined in _info; a minimal repro of that failure mode is sketched below.
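If that is the cause, a repro would look something like this (hypothetical field names; only datasets.Features and Value are assumed): declaring a nested '$' field and then encoding an example that omits it reproduces the KeyError: '$' from the traceback above.

    from datasets import Features, Value

    # The declared schema has a nested '$' text field, mirroring the
    # XML-to-dict conversion in this dataloader.
    features = Features({"message": {"$": Value("string"), "@id": Value("string")}})

    features.encode_example({"message": {"$": "hello", "@id": "0"}})  # fine

    # If element.text was None and '$' was therefore dropped from the record,
    # encoding fails on datasets versions that enforce the declared schema.
    features.encode_example({"message": {"@id": "1"}})  # raises KeyError: '$'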

@sabilmakbar (Collaborator)

Update: it works for the eng subset, but I'm still looking into the cause for the cmn subset.

@raileymontalan (Collaborator)

> > Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.
>
> Do you have different versions of datasets on the Mac vs. the server? That was probably the case.
>
> On my end, the generated data has the $ key generated recursively at every level, which is a bit unexpected given the feature list. [image]
>
> Probably adding a condition so the $ column is created only if element.text is available (not None) is the best workaround for now.

I'm still getting the same issues as before when testing on Mac. My datasets version is 2.16.1.

@holylovenia (Contributor)

Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

PS: If the issue still persists on macOS and we cannot find a workaround, should we just wrap it up and add a note in the _DESCRIPTION that it's only usable on Linux?

cc: @raileymontalan @sabilmakbar

@akhdanfadh (Collaborator, Author)

Hi @holylovenia, I would love to continue with the remaining tasks. It's not an OS problem AFAIK for now. I'll try testing different package versions some time later and see. Cheers!
