
[Bug Fixed] Difficulties to reproduce the KnowEdit results in the survey paper #390

Closed
StarLooo opened this issue Oct 17, 2024 · 88 comments
Labels: bug (Something isn't working)

@StarLooo

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. concern about the generation config
    I noticed that the generation-related code in EasyEdit does not explicitly set the generation_config (do_sample, temperature, top_p, etc.). This may result in the default generation strategy not being greedy decoding, potentially affecting the reproducibility of the results.
@StarLooo
Author

  1. perhaps a bug in the summary_metrics function
    According to the discussions in issue "the computing methods of Edit succ, Portability, Locality, Fluency" #279, the summarization of all_metrics can be elaborated as below (based on my understanding, taking the post-edit metrics on Wikidata_recent as an example):
    for the Edit Succ.: compute the average rewrite_acc of all edit samples.
    for the Portability: first compute the average Subject_Aliasing_acc&reasoning_acc&Logical_Generalization_acc of each edit sample; then average the three sub-scores to get the Portability of each edit sample; finally compute the average Portability of all edit samples.
    for the Locality: first compute the average Relation_Specificity_acc&Forgetfulness_acc of each edit sample; then average the two sub-scores to get the Locality of each edit sample; finally compute the average Locality of all edit samples.
    for the Fluency: compute the average ngram_entropy of all edit samples.
    If the above summarization method is correct, the implementation of the summary_metrics function has a minor bug that causes ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. This is due to this line of code: mean_metrics[eval][key][lkey] = np.mean(metrics). When summarizing Portability, this line tries to compute np.mean() over a list of sub-lists with variable lengths (the number of Portability evaluation samples differs between edits), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code first computes the average sub-score of each edit and then uses mean_metrics[eval][key][lkey] = np.mean(metrics) to summarize.
    Besides, I find that the summary_metrics function does not summarize the Fluency metric, and the magnitude of ngram_entropy is far lower than the reported Fluency metric (5.x–6.x vs ~500).
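
    To make the failure and the fix concrete, here is a minimal, self-contained sketch of the averaging scheme I mean (the nested result format and the numbers are made up for illustration, not real EasyEdit output):

import numpy as np

# Two edits with ragged portability sub-lists (different numbers of eval samples).
all_metrics = [
    {"post": {"portability": {"Subject_Aliasing_acc": [1.0, 0.5],
                              "reasoning_acc": [0.0]}}},
    {"post": {"portability": {"Subject_Aliasing_acc": [1.0],
                              "reasoning_acc": [1.0, 1.0, 0.0]}}},
]

eval, key = "post", "portability"
mean_metrics = {eval: {key: {}}}
for lkey in ["Subject_Aliasing_acc", "reasoning_acc"]:
    # Averaging each edit's sub-list first avoids the "inhomogeneous shape"
    # ValueError that np.mean raises on a ragged list of lists.
    metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics
               if lkey in metric[eval][key]]
    mean_metrics[eval][key][lkey] = np.mean(metrics)

print(mean_metrics)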

@StarLooo
Author

  1. the meaning of pre-edit metrics
    I'm uncertain about the meaning of the pre-edit metrics. My speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate, in relative terms, how much more or less portable the edit makes the model; and the pre-edit Locality is meaningless. However, I don't understand the meaning of pre-edit rewrite_acc, how to compute it, and why it can be greater than 0 before we conduct any editing.

@StarLooo
Author

  1. cannot reproduce the KnowEdit results when using AdaLoRA or ROME
    I tried to reproduce the KnowEdit results in your survey paper following your tutorial, with the same hyper-parameters and command. I use Llama2-7b-chat as the base model and AdaLoRA/ROME as the edit method. For AdaLoRA, it is very strange that I always get post_rewrite_acc=1.0 on Wikibio/Wikidata_recent/Wikidata_counterfact, and the values of the other metrics also differ from the paper. For ROME, I met the same error as "RuntimeError: probability tensor contains either inf, nan or element < 0" #290 when editing Wikidata_recent, and avoided this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact also differ from the survey paper, with post_rewrite_acc=0.9857. I'm not sure whether this is due to the fp16 setting mentioned in #290, and I'll try again with fp16=false.

If you need more details on the experiments, such as the results.json or log files, please let me know and I'll upload them. I'm looking forward to your reply! Thanks!

@XeeKee
Collaborator

XeeKee commented Oct 17, 2024

Thank you very much for your attention to EasyEdit. We will address your issue shortly.

@zxlzr
Contributor

zxlzr commented Oct 17, 2024

  1. can not reproduce the KnowEdit results when using AdaLoRA or ROME
    I try to reproduce the KnowEdit results in your survey paper following your tutorial with the same hyper-parameters and command. I use Llama2-7b-chat as the base and AdaLoRA/ROME as the edit method. For AdaLoRA, it's very strange that I always get a post_rewrite_acc=1.0 in Wikibio/Wikidata_recent/Wikidata_counterfact, and the values on other metrics are also different from the paper. For ROME, I meet the same error as RuntimeError: probability tensor contains either inf, nan or element < 0 #290 when editing Wikidata_recent, and escape from this error when running Wikidata_counterfact. However, the ROME results on Wikidata_counterfact are also different from the survey paper with post_rewrite_acc=0.9857. I'm not sure it's due to the fp16 setting as mentioned in RuntimeError: probability tensor contains either inf, nan or element < 0 #290 and I'll try it agian using fp16=false.

If you need more details on the experiments such as the results.json or log file, Please tell me and I'll upload them. I'm looking forward to your reply! Thanks!

Apologies for the issue regarding the results. EasyEdit has been under maintenance to improve its editing performance, and we have optimized the EasyEdit code, which has led to better results (possibly an improvement for AdaLoRA). Due to limited computational resources recently, we will update the paper as soon as possible (aiming for 1-2 weeks, and we will notify you). As for the other issues, we will address them as soon as possible.

@zxlzr added the "question (Further information is requested)" label on Oct 17, 2024
@XeeKee
Collaborator

XeeKee commented Oct 17, 2024

Hello! This EasyEdit framework and your survey paper "A Comprehensive Study of Knowledge Editing for Large Language Models" are really valuable works. However, I still had some difficulties reproducing the results on the KnowEdit benchmark, and below are some of my questions:

  1. concern about the generation config
    I noticed that in the EasyEdit code, the generation-related code does not explicitly specify the specific generation_config settings (including do_sample, temperature, top_p, etc.). This may result in the default generation method not using greedy decoding, potentially affecting the reproducibility of the results.

You can check the code here:

def test_prediction_acc(model, tok, hparams, prompts, targets, device, locality=False, vanilla_generation=False):
We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.
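
For illustration, explicitly pinning greedy decoding could look like the following sketch (the model path is a placeholder, and this is not EasyEdit's actual loading code):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder path
model = AutoModelForCausalLM.from_pretrained(model_name)
tok = AutoTokenizer.from_pretrained(model_name)

inputs = tok("The capital of France is", return_tensors="pt")
# do_sample=False forces greedy decoding even if the checkpoint ships a
# generation_config.json that enables sampling by default.
out = model.generate(**inputs, do_sample=False, max_new_tokens=16)
print(tok.decode(out[0], skip_special_tokens=True))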

@XeeKee
Collaborator

XeeKee commented Oct 17, 2024

  1. perhaps bug in the summary_metrics function
    According to the discussions in this issue the computing methods of Edit succ, Portability, Locality, Fluency #279 , the summary method of all_metrics can be elaborated as below (based on my understanding, take the post eidt metrics on Wikidata_recent for example) :
    for the Edit Succ.: compute the average rewrite_acc of all edit samples.
    for the Portability: first compute the average Subject_Aliasing_acc&reasoning_acc&Logical_Generalization_acc of each edit sample; then average the three sub-scores to get the Portability of each edit sample; finally compute the average Portability of all edit samples.
    for the Locality: first compute the average Relation_Specificity_acc&Forgetfulness_acc of each edit sample; then average the two sub-scores to get the Locality of each edit sample; finally compute the average Locality of all edit samples.
    for the Fluency: compute the average ngram_entropy of all edit samples.
    If the above summarization method is correct, the implementation of the summary_metrics function may have a minor bug, but it will cause ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. This is due to this line of code: mean_metrics[eval][key][lkey] = np.mean(metrics). when summarizing Portability, this line of code will try to compute np.mean() over a list of sub-lists with variable length (considering the number of Portability evaluation samples is not the same for each edit), which is not allowed. I think the correct modification is to change the original code metrics = [metric[eval][key][lkey] for metric in all_metrics if lkey in metric[eval][key].keys()] into metrics = [np.mean(metric[eval][key][lkey]) for metric in all_metrics if lkey in metric[eval][key].keys()]. The modified code will first compute the average sub-score of each edit and then use mean_metrics[eval][key][lkey] = np.mean(metrics) to summary.
    Besides, I find that the summary_metrics function do not summary the Fluency metric and the magnitude of ngram_entropy is too low compared to the reported Fluency metric (5.x~6.x vs ~500).

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.
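
Roughly speaking, the raw score is an n-gram entropy of the generated text; here is a toy sketch (simplified whitespace tokenization and uniform n-gram weighting, which may differ from our exact implementation) showing why the raw value is single-digit while the reported number is in the hundreds:

import numpy as np
from collections import Counter

def ngram_entropy(text, ns=(2, 3)):
    # Average n-gram entropy of the generated continuation.
    tokens = text.split()
    entropies = []
    for n in ns:
        counts = np.array(list(Counter(zip(*(tokens[i:] for i in range(n)))).values()),
                          dtype=float)
        probs = counts / counts.sum()
        entropies.append(float(-(probs * np.log2(probs)).sum()))
    return float(np.mean(entropies))

raw = ngram_entropy("The president of the United States lives in the White House.")
print(raw, raw * 100)  # raw entropy is single-digit; the reported Fluency is raw * 100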

@XeeKee
Collaborator

XeeKee commented Oct 17, 2024

  1. the meaning of pre-edit metrics
    I'm uncertain about the meaning of the pre-edit metrics. According to my speculation: we can compare the Fluency before and after the edit to evaluate whether the edit affects the generation ability of the model; we can also compare the Portability before and after the edit to evaluate whether the edit can be more or less portable relatively; and the Locality of pre-edit is meaningless. However, I don't understand the meaning of pre-edit rewrite_acc, how to compute it and why it can be greater than 0 before we conduct any editing.

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency
    This metric evaluates whether the smoothness of the output has changed after the model performs an edit.

  2. Portability
    This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.

  3. Locality
    This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.

  4. Rewrite Accuracy (rewrite_acc)
    This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.
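
A toy illustration with schematic word-level "tokens" (not the real Llama tokenizer):

def token_acc(pred, target):
    # Fraction of target tokens matched at the same position.
    return sum(p == t for p, t in zip(pred, target)) / len(target)

target_new = ["Trump", "Biden"]
print(token_acc(["Joe", "Biden"], target_new))    # 0.5 -> pre-edit accuracy can be > 0
print(token_acc(["Trump", "Biden"], target_new))  # 1.0 after a successful edit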

@XeeKee
Collaborator

XeeKee commented Oct 17, 2024

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise.
My WeChat ID: xi1786594371

@littlefive5
Collaborator

littlefive5 commented Oct 17, 2024

Hello,
I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework.
Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements.
The results you are currently getting are correct.
I am in the process of running additional experiments to gather the correct data.
Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

As to the error you met in ROME, I will try to reproduce it and will let you know.

Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email ([email protected]) or WeChat (yaoxiaoyun12).

@StarLooo
Author

You can check the code here:

def test_prediction_acc(model, tok, hparams, prompts, targets, device, locality=False, vanilla_generation=False):

. We did not pass in any additional parameters. According to https://huggingface.co/docs/transformers/generation_strategies, Hugging Face uses greedy decoding by default. Additionally, I noticed that on our local server, the model's generation_config does not include parameters for top_p and temperature. Therefore, I believe our code is reproducible. Of course, your issue has also made us aware that this part of the code is not perfect, and we will set the generation parameters to greedy decoding in the near future.

I got it: with the default do_sample=False, EasyEdit uses greedy decoding by default, but as you mentioned, if the user's local generation_config.json contains specific generation hyper-parameters, it may cause warnings or an unexpected generation strategy. Thanks for your timely reply!

@StarLooo
Author

Clarification of Misunderstanding

There has been some misunderstanding regarding our evaluation metrics. Here's a breakdown:

  1. Fluency
    This metric evaluates whether the smoothness of the output has changed after the model performs an edit.
  2. Portability
    This assesses the robustness of the knowledge model after the edit, checking how well the model retains its overall stability.
  3. Locality
    This metric examines whether the edit affects unrelated knowledge, ensuring that the changes remain localized.
  4. Rewrite Accuracy (rewrite_acc)
    This measures the reliability of the edit.

Why Rewrite Accuracy Can Be Greater Than 0 Before Editing

For example, suppose we want to change the name of the President of the United States from "Joe Biden" to "Trump Biden." When calculating the accuracy, the target_new is "Trump Biden," but the original response was "Joe Biden." Since both names share the surname "Biden," the token-level accuracy for that part will match. Therefore, even before the edit, the accuracy may not be 0.

Thanks for your clarification! I previously mistakenly thought that rewrite_acc was a metric based on exact match. By the way, why was token-level accuracy chosen instead of exact match when designing the Edit Succ. metric? Was it for a smoother measurement?

@StarLooo
Author

Thank you very much for pointing out this issue. There is indeed a small problem here because the method of calculating the average can lead to slight discrepancies in the results. For the Fluency metric, we multiplied the results by 100 for better presentation.

Thanks for your timely reply! So, is my understanding of the averaging method, the bug report for the summary_metrics function, and my attempted modification correct?

@StarLooo
Author

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.
Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email ([email protected]) or WeChat (yaoxiaoyun12).

Thanks for your continuous efforts in updating, and I'm looking forward to your new progress. I will also try to read through the relevant code and obtain the latest, correct results with the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier.
Besides, can you further explain or locate which updates (bug fixes, code refactoring, ...) in EasyEdit led to the change in editing performance?

@StarLooo
Author

Thank you very much for your interest in EasyEdit. If it's convenient, could you add me on WeChat? That way, we can communicate promptly if any issues arise. My WeChat ID: xi1786594371

Thanks for your enthusiasm. If further discussion is needed, I will contact you via WeChat.

@StarLooo
Author

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just found a small bug in the eval function of the updated run_knowedit_llama2.py: before appending np.mean(case_list) to Portability_list (and, for robustness, Locality_list), we should first check whether case_list is empty; otherwise np.mean(case_list) will raise RuntimeWarning: Mean of empty slice and produce nan in Portability_list for some datasets such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.
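
A minimal sketch of the guard I mean (the per-edit sub-score lists are invented for illustration):

import numpy as np

Portability_list = []
# Per-edit portability sub-scores; the second edit has no portability samples.
for case_list in ([1.0, 0.5], [], [0.0]):
    if len(case_list) > 0:
        Portability_list.append(np.mean(case_list))
    # An empty case_list is skipped instead of letting np.mean produce nan.

print(np.mean(Portability_list))  # averaged only over the edits that have samples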

@littlefive5
Collaborator

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

I just find a little bug in the eval function in the updated run_knowedit_llama2.py, before appending np.mean(case_list) to the Portability_list (and Locality_list for robustness), we should first check whether the case_list is empty, or the np.mean(case_list) will cause RuntimeWarning: Mean of empty slice and produce nan in Portability_list for some dataset such as Wikidata_counterfact. I guess the reason is that some edit samples do not have a related portability evaluation sample.

You're right, I have updated the code here.

@littlefive5
Collaborator

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.
Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email ([email protected]) or WeChat (yaoxiaoyun12).

Thanks for your continuous efforts in updating and looking forward to your new progress. I will also try to read through the relevant codes and get the latest and correct results based on the current EasyEdit. I just took a quick look at the updated run_knowedit_llama2.py, and the eval function seems to be largely consistent with the modified summary_metrics method I mentioned earlier. Also, can you explain more or locate what updates (bug fix, code refactoring...) in EasyEdit lead to the edit performance change?

Hello, the main update in the code is the loss. We previously followed ROME's code to compute the loss, which uses the last token's representation to calculate the loss on the target sequence (the FT-L setting). We then updated it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.
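
For readers following along, here is a minimal sketch of the FT-M-style conditional loss as I would paraphrase it under standard Hugging Face conventions (mask the prompt positions with -100 and take cross-entropy only on the target tokens); it is not a verbatim excerpt from EasyEdit:

import torch

def ft_m_style_loss(model, tok, prompt, target, device="cpu"):
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(device)
    target_ids = tok(target, add_special_tokens=False,
                     return_tensors="pt").input_ids.to(device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # only target tokens contribute to the loss
    return model(input_ids=input_ids, labels=labels).loss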

@StarLooo
Author

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

Hello! I'm pleased to inform you that I have retested some of these settings (AdaLoRA, ROME, and FT-L on ZSRE) and obtained metrics quite similar to your updated results. Again, looking forward to your next update, and I greatly appreciate your ongoing efforts.

@StarLooo
Author

Hello, the main update in the code is the loss. We previously followed the ROME's code to compute the loss, which uses the last token representation to calculate the loss on the target sequence (in the FT-L setting). We then update it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.

Thanks for your explanation; I'll check the relevant code and papers for more details. I noticed that although the Edit Succ. of AdaLoRA increases after this update (from 69.86 to 100.0 on ZSRE), its Locality also decreases a lot (from 72.21 to 35.16 on ZSRE). Is this phenomenon explainable or predictable?

@littlefive5
Collaborator

littlefive5 commented Oct 20, 2024

Hello, the main update in the code is the loss. We previously followed the ROME's code to compute the loss, which uses the last token representation to calculate the loss on the target sequence (in the FT-L setting). We then update it to use the conditional output to compute the loss (FT-M). We have changed the loss computation for both the FT and AdaLoRA methods.

Thanks for your explanation, I'll check the relevant codes and papers for more details. I noticed that though the Edit Succ. of AdaLoRA increases (from 69.86 to 100.0 on ZSRE) after this update, the locality of it also decreases a lot (from 72.21 to 35.16 on ZSRE). Is this phenomenon explainable or predictable?

From my point of view, the original loss failed to learn the updated knowledge and therefore left the model largely unchanged. But it is truly an interesting phenomenon, as FT-L and FT-M do not show this trend. Maybe we need more tests here.

@littlefive5
Collaborator

littlefive5 commented Oct 20, 2024

Hello, I have updated the run_llama_knowedit.py script to include the evaluation of results within the eval function for the KnowEdit framework. Regarding the results presented in our paper, we apologize for any inaccuracies. We recognize that the data in the published version on arXiv became outdated following recent bug fixes and computational improvements. The results you currently get is right. I am in the process of running additional experiments to gather the correct data. Once the new results are finalized, I will promptly update this issue thread with the latest findings and ensure that the arXiv listing is updated accordingly.

DataSet Metric AdaLoRA ROME MEMIT FT-L
WikiData_recent Edit Succ. 100.00 97.40 97.05 55.74
WikiData_recent Portability 65.51 40.21 41.49 40.78
WikiData_recent Locality 53.71 37.33 35.57 43.70
WikiData_recent Fluency 577.26 578.72 573.72 524.83
ZsRE Edit Succ. 100.00 97.22 95.33 53.93
ZsRE Portability 58.03 51.99 52.70 45.64
ZsRE Locality 35.16 65.66 63.49 38.07
ZsRE Fluency 548.08 571.60 563.31 492.77
WikiBio Edit Succ. 100.00 96.05 95.07 66.33
WikiBio Locality 38.49 67.99 68.77 40.27
WikiBio Fluency 617.70 617.33 617.22 606.58
WikiData_counterfact Edit Succ. 100.00 98.63 98.02 45.15
WikiData_counterfact Portability 73.04 42.81 45.72 33.57
WikiData_counterfact Locality 64.36 28.08 25.32 50.50
WikiData_counterfact Fluency 580.29 582.46 575.96 527.58
As to the error you met in ROME, I will try to reproduce it and will let you know.

Thank you once again for bringing this matter to our attention. Should you have any additional questions or require further assistance, please feel free to reach out via email ([email protected]) or WeChat (yaoxiaoyun12).

Hello,
I have updated these results and the arxiv paper will be updated this week.
If you meet other problems or bugs, leave a message. Thanks again for your help!

@StarLooo
Author

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidatacounterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55

As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

@StarLooo
Author

From my point of view, the original loss failed to learn the updated knowledge and left the model unchanged that much. But it's truly an interesting phenomenon as the FT-L and FT-M do not show this trend. Maybe we need more tests here.

Certainly, this is an interesting phenomenon worthy of study.

@StarLooo
Author

While reviewing the code, I came across a confusing section in the collate_fn method of the KnowEditDataset class within KnowEdit.py. Specifically, in this line of code it seems that relation_specificity and forgetfulness are mutually exclusive when constructing the batch. To my understanding, the collate_fn method has no effect when we do a normal single edit with the default settings, and perhaps also no effect in the sequential editing setting, but it can indeed influence the batch editing setting. I don't understand the purpose of this line of code.

@littlefive5
Collaborator

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidatacounterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55
As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

Quite weird, I will double-check my results to see if I made a mistake when I typed the results in the table.

@littlefive5
Collaborator

While reviewing the code, I came across a confusing section in the collate_fn method of the KnowEditDataset class within KnowEdit.py. Specifically, In this line of code: it seems that the relation_specificity and forgetfulness are mutually exclusive when constructing the batch. Although to my understanding, the collate_fn method has no effect when we do a normal single edit just using the default setting, and perhaps also has no effect on the sequential edit setting, but it can indeed influence the batch edit setting. I don't understand the usage of this line of code.

The collate function is not used; I will delete it. I construct the input here.

@littlefive5
Collaborator

Hello, I have updated these results and the arxiv paper will be updated this week. If you meet other problems or bugs, leave a message. Thanks again for your help!

Thanks for your timely results update. I have also conducted experiments based on the latest code, and here are my results:

Dataset Metric AdaLoRA ROME FT-L FT-M
ZSRE Edit Succ. 100.0 96.61 53.93 99.99
Portability 56.52 52.65 45.67 60.31
Locality 38.05 65.26 38.05 45.29
Fluency 563.57 573.44 493.01 552.26
Wikidatacounterfact Edit Succ. 100.0 98.58 45.15 100.0
Portability 69.89 43.72 33.60 74.37
Locality 34.02 67.62 25.67 36.59
Fluency 584.68 584.05 528.28 575.55
As you can see, the results on the ZSRE are quite similar to yours; however, there are notable differences on the Wikidata_counterfact, particularly in the Locality metric.

Is it ok for you to share the results.json? Maybe I can check where the discrepancy comes from?

@StarLooo
Author

Is it ok for you to share the results.json? Maybe I can check where the discrepancy comes from?

Yes, here are the AdaLoRA and ROME results.json files on Wikidata_counterfact:
LoRA_counterfact_7b-chat_results.json
ROME_counterfact_7b-chat_results.json

@littlefive5
Collaborator

littlefive5 commented Nov 4, 2024

Yes, I remember that before I updated the environment, ROME's results were also good, but it seems that after updating the environment the results are not as good.
Can you provide me with your current environment, including the torch and python versions?

Also, how do you compute the covariance of Wikipedia?

@StarLooo
Author

StarLooo commented Nov 4, 2024

Yes, I remember before I update the enironment, the ROME's result is also good, but it seems that after I update the environment, the results is not that good. Can you provide me your current environment, including the torch and python verison.

Also, how do you compute the covirance of wikipedia?

  1. Here is my current environment, exported via pip freeze:
    myrequirements.txt

  2. Sorry, but I do not understand the meaning of "the covariance of Wikipedia". What is it used for, and which parts of the code are related to it?

@StarLooo
Author

StarLooo commented Nov 4, 2024

Yes, I remember before I update the enironment, the ROME's result is also good, but it seems that after I update the environment, the results is not that good. Can you provide me your current environment, including the torch and python verison.
Also, how do you compute the covirance of wikipedia?

  1. Here is my current environment derived by pip freeze
    myrequirements.txt
  2. Sorry, but I do not understand the meaning of "the covirance (covariance) of wikipedia". What's the usage of it and which part of codes are related to it?

In fact, I am just a beginner in the field of knowledge editing. I happened to be reading the ROME paper today, and I have just reached the relevant section ("Here W is the original matrix, C = KK^T is a constant that we pre-cache by estimating the uncentered covariance of k from a sample of Wikipedia text"). :)

@littlefive5
Collaborator

Well, I can reproduce your results using the provided environment!
So the reason is the environment.

@StarLooo
Author

StarLooo commented Nov 4, 2024

Well, I can reproduce your results using the provided environments! So the reason here is the environment.

Can we determine which package has the greatest impact? I guess the transformers and peft packages may be the reason. Controlling one variable at a time, combined with a "binary search" over versions, may help us find the package(s).

@littlefive5
Collaborator

Yeah, peft influences the performance of LoRA. I changed it from 0.7.0 to 0.12.0 and can now get your results.
For ROME, I will check again.

@littlefive5
Collaborator

littlefive5 commented Nov 4, 2024

I finally found the reason for the difference in ROME and MEMIT.
I guess that in your code, when loading the LlamaTokenizer in editor.py, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT we need use_fast=True (the default here), because we need to keep the leading space for this step of the computation in compute_v in ROME:
[screenshot]

When it is set to False, the tokenizer removes the space, which leads to an incorrect computation:
[screenshot]

Hence, when it is set to True, the computation is correct:
[screenshot]

So we will add a check here to make sure the tokenizer is set correctly for the different knowledge editing methods. I think your results should be correct, and I will update the results soon.
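
A rough sketch of what such a per-method check might look like (the method list and flag choice are illustrative assumptions, not the final implementation):

from transformers import AutoTokenizer

def load_tokenizer(model_name, alg_name):
    # Locate-then-edit methods such as ROME/MEMIT rely on the tokenizer keeping
    # the leading space before the target string, so prefer the fast tokenizer
    # for them; other methods keep the previous slow-tokenizer behaviour.
    use_fast = alg_name.upper() in {"ROME", "MEMIT"}
    tok = AutoTokenizer.from_pretrained(model_name, use_fast=use_fast)
    tok.pad_token_id = tok.eos_token_id
    return tok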

Thanks for your help, I really need to sleep now🤣.

@StarLooo
Author

StarLooo commented Nov 4, 2024

Well, I can reproduce your results using the provided environments! So the reason here is the environment.

I can also reproduce the AdaLoRA result on Wikidata_counterfact in a new environment following your requirement.txt, but the ROME result is still different; I'll check the latest cause you just mentioned. Anyhow, wish you a good sleep!

@StarLooo
Author

StarLooo commented Nov 4, 2024

I finally found the reason for the difference in ROME and MEMIT. I guess that in your code at the editor.py, when loading the LlamaTokenizer, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT, we need to set use_fast=True, which is the default setting here, as we need to keep the blank for this step of computing: image in compute_v in ROME.

While we set it as false, the tokenizer would remove the blank here, which leads to the false computing as: image

Hence, when it is set as True, the computing is correct: image

So, we will add a discrimination here to make sure the tokenizer is set correctly for different knowledge editing methods. I think your results should be correct, and I will update the results soon.

Thanks for your help, I really need to sleep now🤣.

The potentially bad news is that in my code, the Llama tokenizer is also initialized with use_fast=False in editor.py:

            elif 'llama' in self.model_name.lower():
                self.model = AutoModelForCausalLM.from_pretrained(self.model_name, **model_kwargs)
                self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)
                self.tok.pad_token_id = self.tok.eos_token_id

So I do not know whether it is the only reason for these differences, and I'll try ROME with use_fast=True tonight.
Another question: when I rerun with use_fast=True or change the package versions, should I remove the original pre-edit files?

@StarLooo
Author

StarLooo commented Nov 4, 2024

To help you locate potential bugs, these .py files should be helpful (I may have added some comments to help myself read and check the code, but as far as I remember I did not change anything essential):
some_of_my_py_files.zip

@littlefive5
Collaborator

littlefive5 commented Nov 4, 2024

I finally found the reason for the difference in ROME and MEMIT. I guess that in your code at the editor.py, when loading the LlamaTokenizer, the code is

self.tok = AutoTokenizer.from_pretrained(self.model_name)

while in the current code, it is

self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)

Well, in ROME and MEMIT, we need to set use_fast=True, which is the default setting here, as we need to keep the blank for this step of computing: image in compute_v in ROME.
While we set it as false, the tokenizer would remove the blank here, which leads to the false computing as: image
Hence, when it is set as True, the computing is correct: image
So, we will add a discrimination here to make sure the tokenizer is set correctly for different knowledge editing methods. I think your results should be correct, and I will update the results soon.
Thanks for your help, I really need to sleep now🤣.

Perhaps bad information is that in my codes, the initialization of llama tokenizer is also set with use_fast=False in editor.py:

            elif 'llama' in self.model_name.lower():
                self.model = AutoModelForCausalLM.from_pretrained(self.model_name, **model_kwargs)
                self.tok = AutoTokenizer.from_pretrained(self.model_name, use_fast=False)
                self.tok.pad_token_id = self.tok.eos_token_id

So, I do not know if it is the only reason for these differences, and I'll try ROME by setting use_fast=True this night. Another question is that when I try to rerun using use_fast=True or change the package version, should I remove the original pre-edit files?

You do not need to remove the pre_edit files.
I need your ROME log file and your new output.

@StarLooo
Author

StarLooo commented Nov 5, 2024

You donot need to remove the pre_edit. I need your rome log file and your new output.

I neglected to rerun the experiment last night. Today, I plan to assess whether setting use_fast=True yields a different outcome. Upon obtaining the new results, I will promptly reach out to you.

UPDATE: Due to limited computation resources, I still need more time to check ROME.

@littlefive5
Collaborator

I can get your ROME results when I make sure the tokenizer correctly builds the prompt in compute_v.
I think you can just check whether your code adds the leading space here: if it does, the results are good, but if it fails to build the prompt correctly, the results are not as good.

@StarLooo
Author

StarLooo commented Nov 5, 2024

I can get your ROME results when I make sure the tokenizer can correctly build the prompt in compute_v. I think you can just check whether your code adds the blank here. If it's setting true, the results is great but if it failed to build the prompt, the results is not that goo.

That is encouraging news. Let me give a short summary of the discussion here.
After:

  1. updating the environment to newer package versions, especially the peft package
  2. making sure left padding is used in test_prediction_acc (see the sketch after this list)
  3. making sure the tokenizer correctly builds the prompt in compute_v of ROME

we can finally get the same results on FT-L, AdaLoRA, and ROME as I reported.
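
For step 2, a minimal sketch of what left padding looks like for batched prompts (the model path is a placeholder):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")  # placeholder path
tok.pad_token = tok.eos_token
tok.padding_side = "left"  # pad on the left so every prompt ends right before the predicted tokens
batch = tok(["The capital of France is", "Windsor Castle is located in"],
            padding=True, return_tensors="pt")
print(batch.input_ids.shape)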

@littlefive5
Collaborator

Yes

@StarLooo
Author

StarLooo commented Nov 6, 2024

Hello, I think the bug in ROME has nothing to do with the fast tokenizer; here is my experiment:

First, I changed this code in compute_v for easier debugging, from

    rewriting_prompts, kl_prompts = [
        context.format(request["prompt"]) + tok.decode(target_ids[:-1])
        for context in context_templates
    ], ["{} is a"]
    all_prompts = rewriting_prompts + kl_prompts

to

kl_prompts = ["{} is a"]
rewriting_prompts = []
for context in context_templates:
    part_1 = context.format(request["prompt"])
    part_2 = tok.decode(target_ids[:-1])
    rewriting_prompt = context.format(request["prompt"]) + tok.decode(target_ids[:-1])
    print(f"use fast tokenizer? {tok.is_fast}")
    print(f"target_ids[:-1]:\n{target_ids[:-1]}")
    print(f"context:\n#{context}#")
    print(f"part_1:\n#{part_1}#")
    print(f"part_2:\n#{part_2}#")
    print(f"rewriting_prompt:\n{rewriting_prompt}")
    input("press any key to confirm")
    rewriting_prompts.append(rewriting_prompt)
all_prompts = rewriting_prompts + kl_prompts

After running with use_fast=False, I get this output:
[screenshot]

As you can see:

  1. tok.is_fast returns False, which verifies that we are not using the fast tokenizer
  2. the token id 29871 plays the role of the leading space, which is what we need
  3. the ' Syria' enclosed in the red box keeps its leading space

So I think the problem is at the string (input) level, not at the model/tokenizer level.

Besides, there are also some existing discussions about the Llama tokenizer's handling of token id 29871.

@littlefive5
Collaborator

Interesting. If I run your code in your provided environment with use_fast set to False, it looks like this:
[screenshot]

@StarLooo
Author

StarLooo commented Nov 6, 2024

Interesting, if I run your code in your provided environment, image When use_fast is False, it looks like this.

It's quite weird. Your target_ids[:-1] is also [29871, 8713], but after decoding, the leading space disappears.

But if I directly decode [29871, 8713]:

    print(tokenizer.is_fast)
    x = [29871, 8713]
    print(f"#{tokenizer.decode(x)}#")

the result still includes the leading space:
[screenshot]

Could it be an issue with the console font display?

@littlefive5
Collaborator

That's not the reason. I will check the tokenizer config to see what happened; maybe something is wrong with the tokenizer setting.

@littlefive5
Collaborator

It turns out that the cause is legacy, a property of the original LlamaTokenizer.
In my tokenizer, legacy is set to true (I don't know why; maybe our group's tokenizer config is too old and incomplete).
Since LlamaTokenizerFast does not have the legacy feature, setting it to true does not influence its behavior.

[screenshot]

[screenshot]
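
A small diagnostic sketch of the decode difference being discussed (the model path is a placeholder; whether the leading space of " Syria" survives depends on the tokenizer files, the slow tokenizer's legacy flag, and the transformers version):

from transformers import AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder path
for use_fast in (False, True):
    tok = AutoTokenizer.from_pretrained(name, use_fast=use_fast)
    print(use_fast, repr(tok.decode([29871, 8713])))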

@StarLooo
Author

StarLooo commented Nov 6, 2024

It turns out that the reason is from legacy, a property for the origin LLAMAtokenizer. In my tokenizer, the legacy is set to true (I don't know why, maybe our group's tokenizer config is too old and missing.) Since the LLAMATokenizerFast do not have the legacy feature, when I set it as true, it would not influence the behavior.

image

image

Good, it seems that we have resolved all the issues.
Another concern of mine is whether other methods like SERAC and GRACE, and sequential-editing scenarios such as WISE, are also affected. I noticed that run_knowedit_llama2.py imports MENDHyperParams and SERACHparams but does not contain any processing for these two editing methods.

@littlefive5
Collaborator

Well, we didn't conduct experiments on these two methods under KnowEdit. But from my view, this would not affect their performance, as they do not use the peft module and the tokenizer does not involve decoding in these methods.
Also, the padding_side in these methods is set to 'left'.

@StarLooo
Author

StarLooo commented Nov 6, 2024

Well, we didn't conduct experiments on these two methods under knowedit. But from my view, this would not affect the performance as they would not use the peft module and the tokenizer also does not involve decoding in their methods. Also, the padding_side in these methods is setting as 'left'

Well, I plan to read through the code and the papers of the methods used, and then I'll try to integrate MEND and SERAC into the current framework.

@littlefive5
Collaborator

Great, we will update the new results in arxiv this week.

@zxlzr
Contributor

zxlzr commented Nov 7, 2024

Dear StarLooo,

Thank you very much for raising this issue. After a long and thorough debugging process, the problem has finally been identified, and we will work on an update as soon as possible.

EasyEdit will continue to optimize, improve, and add new features. We look forward to more collaboration and communication with you.

EasyEdit Team

@zxlzr changed the title from "[Fixing bug in progress] Difficulties to reproduce the KnowEdit results in the survey paper" to "[Bug Fixed] Difficulties to reproduce the KnowEdit results in the survey paper" on Nov 7, 2024
@littlefive5
Collaborator

littlefive5 commented Nov 8, 2024

I have updated some results using your provided environment. You can check again; I think a small difference is acceptable.

WikiData_recent

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 64.69 56.42 579.57
ROME 97.18 55.25 54.77 579.66
MEMIT 97.05 56.37 52.15 573.89
FT-L 55.75 40.86 43.70 529.24
FT-M 100.00 65.44 64.33 574.32

ZsRE

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 58.03 75.76 563.56
ROME 96.77 52.63 53.67 573.75
MEMIT 95.37 52.67 48.32 563.56
FT-L 53.93 45.64 73.42 493.01
FT-M 99.98 60.31 89.78 552.26

WikiData_counterfact

Method Edit Succ. Portability Locality Fluency
AdaLoRA 100.00 69.89 70.31 580.29
ROME 98.57 55.92 51.97 584.04
MEMIT 98.05 58.56 46.62 575.96
FT-L 45.15 33.60 50.48 528.26
FT-M 100.00 74.36 76.76 575.62

WikiBio

Method Edit Succ. Locality Fluency
AdaLoRA 100.00 81.28 618.45
ROME 96.08 62.74 617.69
MEMIT 94.40 61.51 616.65
FT-L 66.33 79.86 606.95
FT-M 100.00 93.38 612.69

@StarLooo
Author

StarLooo commented Nov 9, 2024

Update some results using your provided environment. You can check again, I think a little difference is acceptable.

Within the allowable range of randomness, these AdaLoRA/ROME/FT-L/FT-M results are consistent with what I previously reproduced on Wikidata_recent, Wikidata_counterfact, and ZSRE. I'll also try other models and datasets.
Thanks for your timely update.

@littlefive5
Collaborator

I will provisionally close this issue; you can reopen it if you run into other problems.
