Split pragmatics into presuppositions and scalar implicatures #2938

Open
wants to merge 18 commits into
base: main
from

Conversation

raileymontalan
Contributor

No description provided.

@raileymontalan raileymontalan marked this pull request as draft August 16, 2024 09:14
@raileymontalan raileymontalan marked this pull request as ready for review September 6, 2024 14:44
@raileymontalan
Contributor Author

Hi @weiqipedia, for your info.

Collaborator

@yifanmai yifanmai left a comment

Looks good overall. Note that you have to change schema_bhasa.yaml to reflect changes (but that can be done in a separate pull request).

src/helm/benchmark/scenarios/bhasa_scenario.py (Outdated)
instruction=instruction.format(row["choices_translated"]),
)
# Split "True or False" into ["True", "or", "False"]
choices = row["choices"].split()
Collaborator

Optional: For English, you can do row["choices"].split(" or ")

src/helm/benchmark/scenarios/bhasa_scenario.py (Outdated)
)
# Split "True or False" into ["True", "or", "False"]
choices = row["choices"].split()
choices_translated = row["choices_translated"].split()
Collaborator

Does this work consistently across every (supported) language?

Contributor

That's a good question! For now we only have Indonesian (and Tamil), and splitting on whitespace and taking the first and third elements of the list works for both languages. Just FYI, this will not work for Thai because Thai does not put spaces between words, so we would have to use something closer to your suggestion of " or " (but we will not be adding Thai any time soon).
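
The difference between the two splitting approaches discussed above can be sketched as follows. This is a minimal illustration, not the code in the pull request: `split_choices` is a hypothetical helper, and the Indonesian string "Benar atau Salah" is an illustrative value rather than a row taken from the dataset.

```python
from typing import List


def split_choices(choices: str, separator: str) -> List[str]:
    """Split a choices string on an explicit, language-specific separator.

    Hypothetical helper: `separator` is the connective with its surrounding
    spaces, e.g. " or " for English.
    """
    parts = [part.strip() for part in choices.split(separator)]
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"Expected two choices separated by {separator!r}, got {choices!r}")
    return parts


# Whitespace splitting keeps the connective, so the caller has to pick
# elements 0 and 2 and silently depends on each choice being one word.
whitespace_parts = "True or False".split()

# Splitting on the connective returns only the two choices.
english = split_choices("True or False", " or ")
indonesian = split_choices("Benar atau Salah", " atau ")
```

Splitting on the connective also makes a malformed row fail loudly instead of producing the wrong pair of labels.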


export HF_HOME=/mnt/fs-arf-01/railey4/cache
export HF_DATASETS_CACHE=/mnt/fs-arf-01/railey4/cache
export HF_TOKEN=<redacted>
Collaborator

Careful with exposing secrets to the public. You should invalidate this token and avoid adding other tokens to the pull request.
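
One way to keep the token out of the repository is to read it from the environment at run time. The sketch below assumes the token has been exported in the user's shell under the name `HF_TOKEN`; `read_hf_token` is a hypothetical helper and not part of HELM or the Hugging Face libraries.

```python
import os
from typing import Mapping


def read_hf_token(env: Mapping[str, str] = os.environ) -> str:
    """Return the Hugging Face token from the environment, or fail clearly.

    Hypothetical helper: the token is never written into a tracked file.
    """
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it in your shell instead of committing it.")
    return token
```

A committed script can then call `read_hf_token()` and contains no secret itself.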

Collaborator

If you'd like to add bash scripts to the repository, could you:

  1. put this in the scripts/bhasa or scripts/aisingapore folder, and
  2. add comments to the script that explain its purpose?

@@ -606,14 +607,14 @@ def get_lindsea_pragmatics_pragmatic_reasoning_single_spec(language="id") -> Run
scenario_spec=scenario_spec,
adapter_spec=adapter_spec,
metric_specs=get_exact_match_metric_specs(),
-groups=["bhasa_linguistic", f"lindsea_pragmatics_pragmatic_reasoning_single_{language}"],
+groups=["bhasa_linguistic", f"lindsea_pragmatics_presuppositions_{subset}_{language}"],
Collaborator

At least one of these strings has to match the group name in schema_bhasa.yaml, which is currently "lindsea_pragmatics_presuppositions_id". I'd suggest doing:

groups=["bhasa_linguistic", f"lindsea_pragmatics_presuppositions_{language}", f"lindsea_pragmatics_presuppositions_{subset}_{language}"],
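
The suggested group list can be sketched as a small function. `build_presupposition_groups` is a hypothetical name, and the subset value "single" is an assumption used only for illustration.

```python
from typing import List


def build_presupposition_groups(subset: str, language: str) -> List[str]:
    """Build the run groups, keeping a name that matches schema_bhasa.yaml."""
    return [
        "bhasa_linguistic",
        # Language-level group: the name the schema currently defines.
        f"lindsea_pragmatics_presuppositions_{language}",
        # Subset-level group: finer-grained, not yet in the schema.
        f"lindsea_pragmatics_presuppositions_{subset}_{language}",
    ]


# "single" is an assumed subset name for illustration.
groups = build_presupposition_groups("single", "id")
```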

if self.language not in self.prompts.keys():
raise (Exception(f"Unsupported language {self.language} - supported languages are {self.prompts.keys()}"))
else:
self.prompt_components = self.prompts[self.language]

def download_dataset(self, output_path: str):
BASE_URL = "https://raw.githubusercontent.com/aisingapore/BHASA/main/lindsea/"
Collaborator

Optional: You can pin this to a specific commit hash so that future changes to the repository won't cause this scenario to change, e.g.

BASE_URL = "https://raw.githubusercontent.com/aisingapore/BHASA/10e34008e8142bef400cf8ffab15b2b6aaf3aa7f/lindsea/"
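
A minimal sketch of the pinning idea, keeping the commit hash in its own constant so it can be updated in one place. The constant name `BHASA_COMMIT`, the helper `dataset_url`, and the file name `example.csv` are hypothetical; the hash is the one quoted in the comment above.

```python
# Commit hash from the review comment; updating it is a deliberate, reviewable change.
BHASA_COMMIT = "10e34008e8142bef400cf8ffab15b2b6aaf3aa7f"
BASE_URL = f"https://raw.githubusercontent.com/aisingapore/BHASA/{BHASA_COMMIT}/lindsea/"


def dataset_url(file_name: str) -> str:
    """Return the pinned raw URL for a file under lindsea/ (hypothetical helper)."""
    return BASE_URL + file_name
```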

if self.language not in self.prompts.keys():
raise (Exception(f"Unsupported language {self.language} - supported languages are {self.prompts.keys()}"))
else:
self.prompt_componets = self.prompts[self.language]
Collaborator

prompt_componets is misspelled - it should be prompt_components

question = self.prompt_components["single_question"]
instruction = self.prompt_components["single_instruction"]

passage = "{question}\nPernyataan: {text}\n{instruction}".format(
Collaborator

Move Pernyataan into prompt components?

instruction = self.prompt_components["pair_instruction"]
label = self.prompt_components[str(row["label"])]

passage = "Situasi: {premise}\n{question}\nPernyataan: {conclusion}\n{instruction}".format(
Collaborator

Move Situasi into prompt components.
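
Moving the hard-coded Indonesian labels into the per-language prompt components could look roughly like this. The keys `situation_label` and `statement_label` and the function `build_pair_passage` are hypothetical names chosen for illustration; the actual keys in the scenario may differ.

```python
from typing import Dict

# Hypothetical per-language components; only the two labels are shown.
PROMPT_COMPONENTS_ID: Dict[str, str] = {
    "situation_label": "Situasi",
    "statement_label": "Pernyataan",
}


def build_pair_passage(
    components: Dict[str, str],
    premise: str,
    question: str,
    conclusion: str,
    instruction: str,
) -> str:
    """Format a passage without any language-specific literal in the template."""
    template = "{situation_label}: {premise}\n{question}\n{statement_label}: {conclusion}\n{instruction}"
    return template.format(
        situation_label=components["situation_label"],
        statement_label=components["statement_label"],
        premise=premise,
        question=question,
        conclusion=conclusion,
        instruction=instruction,
    )
```

With the labels in the components dictionary, adding another language only needs a new dictionary entry and no change to the template.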

question = self.prompt_components["single_question"]
instruction = self.prompt_components["single_instruction"]

passage = "{question}\nPernyataan: {text}\n{instruction}".format(
Collaborator

Move Pernyataan into prompt components.

@@ -171,7 +171,7 @@ def __init__(self, language: str):
super().__init__()
self.language = language
self.splits = {"train": TRAIN_SPLIT, "test": TEST_SPLIT}
-self.map = {
+self.prompts = {
Collaborator

Rename to self.language_to_prompt_components.

Same below.
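
A sketch of the lookup with the suggested name, written as a module-level mapping for brevity. The dictionary contents are placeholders, `get_prompt_components` is a hypothetical helper, and raising `ValueError` instead of a bare `Exception` is a swapped-in choice that the review did not ask for.

```python
from typing import Dict

# Placeholder contents; the real mapping holds the full prompt components.
language_to_prompt_components: Dict[str, Dict[str, str]] = {
    "id": {"statement_label": "Pernyataan"},
}


def get_prompt_components(language: str) -> Dict[str, str]:
    """Look up the prompt components for a language, with a clear error."""
    if language not in language_to_prompt_components:
        supported = sorted(language_to_prompt_components)
        raise ValueError(f"Unsupported language {language} - supported languages are {supported}")
    return language_to_prompt_components[language]
```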

3 participants