Incorrect dictionary format #166

Open
abdullahkhilji opened this issue Sep 14, 2020 · 3 comments

Comments

@abdullahkhilji

I compared the dictionary generated with the XLM code against the sample provided here in MASS. Although the formats match, fairseq-preprocess still raises an error:

Traceback (most recent call last):
  File "/home/abdullahkhilji/miniconda3/envs/mass/bin/fairseq-preprocess", line 8, in <module>
    sys.exit(cli_main())
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 267, in cli_main
    main(args)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq_cli/preprocess.py", line 80, in main
    src_dict = task.load_dictionary(args.srcdict)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/tasks/cross_lingual_lm.py", line 82, in load_dictionary
    return MaskedLMDictionary.load(filename)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 176, in load
    return cls.load(fd)
  File "/home/abdullahkhilji/miniconda3/envs/mass/lib/python3.8/site-packages/fairseq/data/dictionary.py", line 192, in load
    raise ValueError("Incorrect dictionary format, expected '<token> <cnt>'")
ValueError: Incorrect dictionary format, expected '<token> <cnt>'

@abdullahkhilji
Author

I created the dictionary the way GloVe does. It works, but it takes a long time. Is it necessary to cap the number of words in the dictionary? Otherwise preprocessing is very slow.

@StillKeepTry
Contributor

As the error message states, the dictionary must keep the format '<token> <cnt>'. For example:

A 10000
B 10000

The value of cnt does not matter, but it must be provided.
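When a large dictionary file triggers this error, it can help to locate the offending lines before rerunning preprocessing. The following is a minimal sketch, not fairseq's own loader: it assumes each line should be a non-empty token, a single space, and an integer count, and the helper name `find_bad_lines` is hypothetical.

```python
from typing import Iterable, List, Tuple


def find_bad_lines(lines: Iterable[str], limit: int = 10) -> List[Tuple[int, str]]:
    """Return up to `limit` (line_number, line) pairs that are not '<token> <cnt>'.

    Assumption: a valid line is a non-empty token, one space, and an
    ASCII integer count. This is an illustrative check, not fairseq code.
    """
    bad: List[Tuple[int, str]] = []
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        # Split on the LAST space so the count is always the final field.
        token, sep, count = line.rpartition(" ")
        is_int = count.isascii() and count.isdigit()
        if not sep or not token or not is_int:
            bad.append((lineno, line))
            if len(bad) >= limit:
                break
    return bad
```

Passing an open file handle for dict.en.txt as `lines` streams the file line by line, so even a file of several hundred megabytes does not need to fit in memory.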

@abdullahkhilji
Copy link
Author

I was following the same format.
The error went away after I reduced the size of dict.en.txt, which was around 800MB. Building the dictionary from the fine-tuning data only, which brought the file below 10MB, worked. For a better solution I will have to set a threshold.
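One way to apply such a threshold is to drop low-count tokens and optionally cap the vocabulary size. The sketch below is only an illustration under assumptions: the function name `prune_dictionary`, the default `min_count=5`, and the `max_tokens` cap are all hypothetical choices, and the thread does not say which threshold is appropriate.

```python
from typing import Iterable, List, Optional, Tuple


def prune_dictionary(
    lines: Iterable[str],
    min_count: int = 5,                # hypothetical default, tune per corpus
    max_tokens: Optional[int] = None,  # optional cap on vocabulary size
) -> List[str]:
    """Keep '<token> <cnt>' lines whose count is at least `min_count`.

    Malformed lines are skipped. Output is sorted by count, most frequent
    first, and truncated to `max_tokens` entries if a cap is given.
    """
    entries: List[Tuple[str, int]] = []
    for raw in lines:
        token, sep, count = raw.rstrip("\n").rpartition(" ")
        if not sep or not token or not (count.isascii() and count.isdigit()):
            continue  # not in '<token> <cnt>' form
        if int(count) >= min_count:
            entries.append((token, int(count)))
    # Stable sort: ties keep their original relative order.
    entries.sort(key=lambda entry: -entry[1])
    if max_tokens is not None:
        entries = entries[:max_tokens]
    return [f"{token} {count}" for token, count in entries]
```

Writing the returned lines to a new file, one per line, gives a smaller dictionary in the same format. Whether a pruned dictionary stays compatible with a given pretrained checkpoint is not addressed in this thread.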
