Closes #221 | Add Dataloader NUS SMS Corpus #596
base: master
Conversation
Hi @akhdanfadh,
- Please run `make check_file` to fix the small spacing issues.
- I am getting the error `KeyError: '$'` when trying to load the dataset. Please advise.
Hello @akhdanfadh.
Tested and LGTM here. Please respond to the comments by @raileymontalan; at least make sure to run the `make check_file` command to ensure all the spacing problems are cleared.
I've run the
@raileymontalan, could you give your test result?
Hi @akhdanfadh, I am using a MacBook, so the issue could be related to this. Please see the error message here:
@raileymontalan I'm not sure about the MacBook issue, since I was able to test the code on both Ubuntu and macOS (see image below). Since the error is a KeyError, I'm guessing it is about Python itself(?), or something in your environment.
Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.
Hi @raileymontalan, a friendly reminder to review once you have the time. 👍
Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then. cc: @akhdanfadh
On my end, the data generated has the key. Probably adding additional conditions when creating it would help:
```python
def xml_element_to_dict(self, element: ET.Element) -> Dict:
    """Converts an xml element to a dictionary."""
    element_dict = {}

    # add text with key '$', attributes with '@' prefix
    element_dict["$"] = element.text
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value

    # recursively
    for child in element:
        child_dict = self.xml_element_to_dict(child)
        element_dict[child.tag] = child_dict

    return element_dict
```
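For reference, here is a minimal standalone sketch of how this helper behaves (the `self` parameter is dropped so it can run outside the builder class; the sample XML is invented for illustration):

```python
import xml.etree.ElementTree as ET
from typing import Dict

def xml_element_to_dict(element: ET.Element) -> Dict:
    """Standalone version of the PR helper: text under '$', attributes under '@'."""
    element_dict = {}
    # add text with key '$', attributes with '@' prefix
    element_dict["$"] = element.text
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value
    # recursively convert child elements
    for child in element:
        element_dict[child.tag] = xml_element_to_dict(child)
    return element_dict

root = ET.fromstring('<message id="1"><text>hello</text></message>')
print(xml_element_to_dict(root))
# {'$': None, '@id': '1', 'text': {'$': 'hello'}}
```

Note that an element with no text still gets a `'$'` key, but its value is `None`.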
something like this:

Current:
```python
def xml_element_to_dict(self, element: ET.Element) -> Dict:
    """Converts an xml element to a dictionary."""
    element_dict = {}
    # add text with key '$', attributes with '@' prefix
    element_dict["$"] = element.text
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value
    # recursively
    for child in element:
        child_dict = self.xml_element_to_dict(child)
        element_dict[child.tag] = child_dict
    return element_dict
```

Suggested:
```python
def xml_element_to_dict(self, element: ET.Element, root=True) -> Dict:
    """Converts an xml element to a dictionary."""
    element_dict = {}
    # add text with key '$', attributes with '@' prefix
    if element.text:  # avoiding appending None text which will alter the schema
        element_dict["$"] = element.text
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value
    # recursively
    for child in element:
        child_dict = self.xml_element_to_dict(child, root=False)
        element_dict[child.tag] = child_dict
    return element_dict
```
@sabilmakbar How about, as a simple but ugly workaround, we just add a `$` attribute to each key?
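One way to read that workaround (a minimal sketch, with an assumed empty-string fallback that is not spelled out in the comment): always emit the `'$'` key so every element's dictionary has the same key set.

```python
import xml.etree.ElementTree as ET

def xml_element_to_dict(element):
    """Workaround sketch: '$' is always present, defaulting to "" when there is no text."""
    # always include '$' so every produced dict shares the same key set
    element_dict = {"$": element.text or ""}
    for attrib, value in element.attrib.items():
        element_dict[f"@{attrib}"] = value
    for child in element:
        element_dict[child.tag] = xml_element_to_dict(child)
    return element_dict

print(xml_element_to_dict(ET.fromstring("<t/>")))  # {'$': ''}
```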
I actually can't test anything because everything seems to be working on my end, both mac and ubuntu. So I guess I need to pass this to someone.
Hmm, probably the issue wasn't about the platform, but about the `datasets` version. If I remember correctly, newer `datasets` versions need assertions of the schema generated from `_generate_examples` vs the one defined in `_info`.
update: works for
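The mismatch being described can be sketched in isolation (the records and declared key set below are hypothetical, just to illustrate why dropping `'$'` for text-less elements would clash with a fixed schema):

```python
# Hypothetical records, as _generate_examples might yield them when
# '$' is only set for elements that actually have text
record_a = {"$": "hello", "@id": "1"}  # element with text
record_b = {"@id": "2"}                # element without text: '$' was dropped

declared = {"$", "@id"}  # key set the features declared in _info() would expect
for name, rec in (("record_a", record_a), ("record_b", record_b)):
    missing = declared - set(rec)
    print(name, "missing:", missing if missing else "none")
```

`record_b` is missing `'$'`, which is the kind of schema disagreement a stricter `datasets` version would reject.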
Still getting the same issues as before when testing on Mac. My
Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪 Thanks again! PS: If the issue still persists on macOS and we cannot find a workaround, should we just wrap it up and add a note that it's only usable for Linux in the
Hi @holylovenia, I would love to continue the remaining tasks. It is not an OS problem AFAIK for now. I'll try testing different package versions some time later and see. Cheers!
Closes #221

I implemented one config per language/subset. Thus, configs will look like this: `nus_sms_corpus_eng_source`, `nus_sms_corpus_cmn_seacrowd_ssp`, etc. When testing, pass `nus_sms_corpus_<subset>` to the `--subset_id` parameter.

Checkbox

- Dataloader script at `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Dataloader runs with the `datasets.load_dataset` function.
- Tests pass with `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.