
Running the code on custom dataset without the FEVER DB file #2

Open
vnik18 opened this issue Jun 14, 2021 · 11 comments
vnik18 commented Jun 14, 2021

Hi,

Is it possible to run the masker-corrector module of this code (in src/error_correction/modelling/error_correction_module.py) without using the FEVER sqlite3 database file?

I have my own dataset with the evidence text already retrieved, so I am hoping to skip the step of retrieving information from the FEVER database. By any chance, are there any intermediate output files, generated after text has been retrieved from the FEVER database, that I could look at?

Thank you!

j6mes (Owner) commented Jun 14, 2021

Hi, the intermediate outputs from the maskers (with IR-selected evidence) have been added to the Google Drive folder. With IR evidence, they don't actually need the FEVER database, but the dataset loader opens the database connection anyway. An easy fix is to comment out line 27 in the mask_based_correction_reader file. I'll make a change soon so the database is only loaded when needed.
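A cleaner version of that workaround is to open the connection lazily instead of commenting out the line. The sketch below is illustrative only; the class and method names here are hypothetical and not the repo's actual ones.

```python
import sqlite3


class MaskBasedCorrectionReader:
    """Sketch of a reader that only opens the FEVER database when
    gold evidence must be looked up (hypothetical names)."""

    def __init__(self, db_path=None):
        self.db_path = db_path
        self._db = None  # connection is created lazily

    @property
    def db(self):
        # Open the sqlite3 connection only on first use, so datasets
        # that already carry inline evidence never touch the FEVER DB.
        if self._db is None:
            if self.db_path is None:
                raise ValueError("FEVER DB required but no path given")
            self._db = sqlite3.connect(self.db_path)
        return self._db

    def evidence_for(self, instance):
        # IR-selected evidence is stored inline in the JSONL record;
        # fall back to the database only when it is missing.
        if "pipeline_text" in instance:
            return instance["pipeline_text"]
        return self._lookup_in_db(instance["evidence"])

    def _lookup_in_db(self, evidence):
        raise NotImplementedError  # would query self.db
```

With this shape, instances that already carry `pipeline_text` never trigger a database connection at all.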

vnik18 (Author) commented Jun 14, 2021

@j6mes Okay. Consider this example from the file heuristic_gold_dev_genre_50_2.jsonl:

{
  "mutated": "Exercise is bad for heart health.",
  "original": "Exercise is good for heart health.",
  "mutation": "substitute_similar",
  "claim_id": 3518,
  "original_id": 3517,
  "sentence_id": 1542,
  "verdict": "REFUTES",
  "evidence": [{"annotation_id": 11271, "verdict_id": 14203, "page": "Heart", "line": 19}],
  "pipeline_text": [
    ["Physical exercise", "non-pharmaceutical sleep aid to treat diseases such as insomnia , help promote or maintain positive self-esteem , improve mental health , maintain steady digestion and treat constipation and gas , regulate fertility health , and augment an individual 's sex appeal or body image , which has been found to"],
    ["Physical exercise", "be linked with higher levels of self-esteem . Childhood obesity is a growing global concern , and physical exercise may help decrease some of the effects of childhood and adult obesity . Some care providers call exercise the `` miracle '' or `` wonder '' drug -- alluding to the"]
  ],
  "original_claim": "Exercise is bad for heart health .",
  "master_explanation": [2, 3, 4]
}

Is the evidence text stored in the field pipeline_text of the intermediate masker output file?

If yes, does this mean that the fields claim_id, original_id, sentence_id and evidence (which includes annotation_id, verdict_id, page, line) are not useful once the evidence text has been extracted and written into the pipeline_text field?
Can these fields be removed or replaced with dummy values?

j6mes (Owner) commented Jun 14, 2021

If pipeline_evidence is set (a list of 2-tuples of (page name, text)), the evidence field isn't used by the dataset loader.

vnik18 (Author) commented Jun 14, 2021

@j6mes Do you mean pipeline_text and not pipeline_evidence? In the above example, it is already in the form of a list of lists with 2 elements each: page name and text, so I will try using the same format for my own data.
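For anyone else preparing custom data in this shape, a minimal record could be assembled like this. The field values are placeholders, and the ID fields are filled with dummies on the assumption (per this thread) that only the claim, mask, and pipeline_text fields are actually consumed:

```python
import json

# Minimal sketch of a custom instance in the same format as
# heuristic_gold_dev_genre_50_2.jsonl. The ID fields are dummy values;
# the evidence text lives inline in pipeline_text.
instance = {
    "mutated": "Exercise is bad for heart health.",
    "original": "Exercise is good for heart health.",
    "mutation": "substitute_similar",
    "claim_id": 0,
    "original_id": 0,
    "sentence_id": 0,
    "verdict": "REFUTES",
    "evidence": [],
    "pipeline_text": [
        # list of [page name, evidence text] pairs
        ["Heart", "Regular exercise is good for heart health ."],
    ],
    "original_claim": "Exercise is bad for heart health .",
    "master_explanation": [2],  # token indices to mask in original_claim
}

# JSONL: one json.dumps-encoded instance per line
with open("custom_dev.jsonl", "w") as f:
    f.write(json.dumps(instance) + "\n")
```

Here master_explanation is [2] so that the token "bad" in original_claim would be replaced by [MASK].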

Also, what about the fields claim_id, original_id and sentence_id? Are they used by the dataset loader?

j6mes (Owner) commented Jun 15, 2021

Yes, I meant pipeline_text. I think any extra values are just passed through to the metadata field and ignored by the model.

vnik18 (Author) commented Jun 15, 2021

Thank you!

vnik18 closed this as completed Jun 15, 2021
vnik18 (Author) commented Jun 25, 2021

@j6mes Hi, I have a couple of questions about the data format of your model. In the following example,

{
  "prediction": "correction: Penguin Books revolutionized publishing in the 1940s.",
  "actual": "correction: Penguin Books revolutionized publishing in the 1920s.",
  "metadata": {
    "source": "Penguin Books [MASK] publishing in the [MASK] .",
    "target": "Penguin Books revolutionized publishing in the 1920s .",
    "evidence": "title: Penguin Books context: Penguin Books is a British publishing house . Penguin revolutionised publishing in the 1930s through its inexpensive paperbacks , sold through Woolworths ### title: Penguin Books context: '' , now the `` Big Five '' . Penguin Books is a British publishing house . It was founded in 1935 by Sir Allen Lane as a line of the publishers The Bodley Head , only becoming a separate company the following year . Penguin revolutionised publishing in the",
    "mutation_type": "substitute_similar",
    "veracity": "REFUTES"
  }
}

What does the field "actual: correction" mean? I assumed it would be the correct statement that the model should have generated, but instead it is the mutated, incorrect version of the correct statement.

Also, the 'source' field contains the masked sentence that is input to the model. But the 'target' field contains the incorrect, mutated sentence and not the correct sentence that the model is supposed to learn to generate. In this case, does the model never see the correct version of the mutated/masked statement, except in the evidence?

Thank you.
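As an aside, the evidence string in the example above joins passages with " ### " separators, each formatted as "title: &lt;page&gt; context: &lt;text&gt;". Assuming that layout is consistent, a rough parser back into (title, text) pairs might look like:

```python
def parse_evidence(evidence: str):
    """Split a flattened evidence string back into (title, text) pairs.

    Assumes the 'title: ... context: ...' layout seen in the example
    above, with passages joined by ' ### '.
    """
    pairs = []
    for chunk in evidence.split(" ### "):
        # Each chunk looks like 'title: <page> context: <sentence...>'
        title_part, _, context = chunk.partition(" context: ")
        title = title_part.removeprefix("title: ")
        pairs.append((title.strip(), context.strip()))
    return pairs
```

Note this splits on the first " context: " in each chunk, so a page title containing that substring would break the parse.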

j6mes (Owner) commented Jun 25, 2021

There are a few caveats to this. For the distant supervision objective, it's assumed that the model doesn't have access to the reference correction; instead, it tries to recover the input sentence as an auto-encoder. For scoring, we have to use the info in the metadata to compare what was predicted against what the claim was before correction. I'll see if I can make this clearer in the documentation. I had to do a lot of cleaning before making the repo public, and perhaps there's an easier way I can present all this info and ensure that it's consistent.

vnik18 (Author) commented Jun 25, 2021

@j6mes I see. So does this mean that both the "actual: correction" field and 'target' field from the above example contain the incorrect, mutated version of the input statement? If I have access to the correct reference statement, can I provide it as input to the model as part of training? If yes, how could I do that?

vnik18 reopened this Jun 25, 2021
j6mes (Owner) commented Jun 25, 2021

There's a supervised version as well, which doesn't use any masking (see finetune_supervised.sh and finetune_supervised_pipeline.sh). If you want to mix supervision and masks, you could either train a supervised model first and then fine-tune on masks, or combine the supervised and mask-based readers to make a reader that understands both tasks: https://github.com/j6mes/2021-acl-factual-error-correction/blob/main/src/error_correction/modelling/reader/supervised_correction_reader.py

vnik18 (Author) commented Jun 28, 2021

@j6mes Thank you for replying. I will look into the supervised version. Regarding the mask_based_correction_reader.py file, I have a question about this code snippet:

claim_tokens = instance["original_claim"].split()
masked_claim = (
    instance["master_explanation"]
    if "master_explanation" in instance
    else instance["claim_tokens"]
)
a = {
    "source": " ".join(
        token if idx not in masked_claim else "[MASK]"
        for idx, token in enumerate(claim_tokens)
    ),
    "target": " ".join(claim_tokens),
}

Both the source and the target fields of the training data are coming from the variable instance['original_claim'], which in turn contains the mutated version of the input sentence.

So it seems that the model being trained in the masked version never has access to the correct reference sentence. In such a case, could you please clarify how it could make a correction to a masked input sentence at test time? Would it just use information from the evidence for this?
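For reference, running the quoted reader logic on the example record from earlier in this thread makes the auto-encoding pair concrete: the source is the mutated claim with the master_explanation token indices masked, and the target is that same mutated claim, so the reference correction never appears in the training pair.

```python
# Example record from earlier in the thread (only the fields the
# quoted snippet reads).
instance = {
    "original_claim": "Exercise is bad for heart health .",
    "master_explanation": [2, 3, 4],
}

claim_tokens = instance["original_claim"].split()
masked_claim = instance["master_explanation"]  # token indices to mask

# Same construction as the quoted reader code.
source = " ".join(
    token if idx not in masked_claim else "[MASK]"
    for idx, token in enumerate(claim_tokens)
)
target = " ".join(claim_tokens)

print(source)  # Exercise is [MASK] [MASK] [MASK] health .
print(target)  # Exercise is bad for heart health .
```

Both sides come from the mutated claim; at test time, filling the [MASK] slots with something other than the original tokens can only be driven by the evidence the model conditions on.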
