file not found #2

Open
amdongyang opened this issue Dec 30, 2021 · 4 comments

Comments

@amdongyang

I can't find the file /utility/WikiExtractor.py that is used in initialize.sh. The file seems to be important for the synthetic pre-training.

@davda54 (Collaborator) commented Dec 30, 2021

Hi, you can get the file here, for example: https://github.com/nawnoes/data-preprocess/blob/master/WikiExtractor.py

Note that you actually don't have to download, extract and process the wiki dumps -- we have also released the processed dumps used to train our system here: https://github.com/ufal/multilexnorm2021/releases/tag/v1.0.0
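
For reference, here is a minimal sketch of how one would typically fetch and run the standalone script, not the exact commands from initialize.sh; the raw URL and the dump file name are just examples, and the available flags can differ between versions of WikiExtractor.py:

```python
# Sketch only: download WikiExtractor.py and extract plain text from a dump.
# The dump file name is an example; substitute the dump for your language.
import os
import subprocess
import urllib.request

os.makedirs("utility", exist_ok=True)
urllib.request.urlretrieve(
    "https://raw.githubusercontent.com/nawnoes/data-preprocess/master/WikiExtractor.py",
    "utility/WikiExtractor.py",
)

dump = "enwiki-latest-pages-articles.xml.bz2"  # example dump file name

# Extract article text into the "extracted/" directory ("-o" selects the
# output directory in the WikiExtractor versions I have seen).
subprocess.run(
    ["python", "utility/WikiExtractor.py", dump, "-o", "extracted"],
    check=True,
)
```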

@amdongyang (Author)

Thanks a lot for your help. I have another question.

After synthetic pre-training, I need to load the saved checkpoint and then fine-tune that synthetic-pre-training checkpoint on the hand-annotated training data.

Is this procedure right or not? Right now I fine-tune the ByT5 model directly on the hand-annotated training data, and I can only get an ERR of 70.15 for English.
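
To be concrete, this is roughly the two-stage procedure I have in mind (a generic HuggingFace-style sketch, not the repository's own training scripts; the checkpoint path and the data here are placeholders):

```python
# Stage 1 is assumed to have saved a HuggingFace-style checkpoint directory
# ("checkpoints/synthetic" is a hypothetical path).
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("checkpoints/synthetic")

# Stage 2: continue training on the hand-annotated MultiLexNorm data.
hand_annotated_pairs = [("u r gr8", "you are great")]  # placeholder examples
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

model.train()
for raw, norm in hand_annotated_pairs:
    inputs = tokenizer(raw, return_tensors="pt")
    labels = tokenizer(norm, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```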

@davda54 (Collaborator) commented Jan 3, 2022

That sounds alright. I'm not sure what validation dataset you use, but reducing the error by 70% seems good to me :)

@amdongyang (Author)

As for the validation dataset, I simply use the test file at /data/multilexnorm/test/eval/test/intrinsic_evaluation/en/test.norm.masked, and I am trying to reach the performance reported in the paper (73.8 ERR for English).
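
This is how I compute the score, assuming the shared task's ERR definition (word-level error reduction relative to the leave-as-is baseline) and the usual .norm format with one raw<TAB>gold pair per line and blank lines between sentences; it needs a gold file rather than the masked one, and the official scorer may differ in details such as casing:

```python
# Sketch of ERR under the assumption
#   ERR = (acc_system - acc_LAI) / (1 - acc_LAI),
# where LAI is the leave-as-is baseline that copies the raw word unchanged.

def read_norm(path):
    """Read (raw, gold) word pairs from a .norm file."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # blank line = sentence boundary
            raw, gold = line.split("\t")
            pairs.append((raw, gold))
    return pairs

def err(gold_pairs, predictions):
    """Error reduction rate over the leave-as-is baseline (lowercased match)."""
    total = len(gold_pairs)
    correct_sys = sum(p.lower() == g.lower() for (_, g), p in zip(gold_pairs, predictions))
    correct_lai = sum(r.lower() == g.lower() for r, g in gold_pairs)
    acc_sys = correct_sys / total
    acc_lai = correct_lai / total
    return (acc_sys - acc_lai) / (1.0 - acc_lai)
```

Under this definition, an ERR of 70.15 would mean the model corrects about 70% of the words that the leave-as-is baseline gets wrong, net of newly introduced errors.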
