Output size smaller than original #13

Open
zharenkov opened this issue Jul 2, 2019 · 5 comments

Comments

zharenkov commented Jul 2, 2019

Hi, @mfaruqui
I'm passing GloVe's 840B.300d embedding file to retrofit.py. Its size is about 5.5 GB, but the resulting file is only 3.7 GB (for both WordNet and paraphrase). Is this the correct behaviour? If so, can you please explain why the size decreases so significantly?

Thanks!

alina-le commented Jul 14, 2019

hi @zharenkov, hi @mfaruqui,
I'm having the same issue: when comparing the original embeddings file to the retrofitted one, around 5% of the lines are lost, and I'm wondering why.
(Retrofitting the same word embeddings with different lexicons results in exactly the same reduced number of lines for each lexicon.)
cheers!

@mfaruqui
Owner

Line #44 in the code truncates each float to only 4 digits after the decimal point, which makes the output file smaller. If the total number of words in the input and the output is the same, this is fine.
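
To illustrate how much fixed-precision output alone can shrink a file, here is a minimal sketch. It assumes the writer uses a format like `%.4f`; the helper `format_vector` and the sample line are hypothetical and not taken from retrofit.py. Note that `%.4f` rounds to 4 decimals rather than strictly truncating.

```python
def format_vector(word, values, digits=4):
    """Format one embedding line with a fixed number of decimals.

    Hypothetical helper for illustration; not the code in retrofit.py.
    """
    return word + "".join(" %.*f" % (digits, v) for v in values)


# Made-up input line with 8 decimals per value.
original = "the 0.27204001 -0.06203000 -0.18840000"
word, *fields = original.split()

# Re-serialize the same vector with 4 decimals per value.
short = format_vector(word, [float(f) for f in fields])

print(short)
print(len(original), "->", len(short), "characters")
```

The word count is unchanged here; only the number of characters per value drops, so comparing line counts (for example with `wc -l`) is a better check than comparing file sizes.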

@alina-le

Thanks for the answer, @mfaruqui! I figured out that it was because the original file contains some words in both upper and lower case, while the retrofitted embeddings are all lowercase.
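
A quick way to estimate how many lines disappear for this reason is to group the vocabulary by its lowercased form. This is a sketch under the assumption that the reader lowercases every word and keeps one vector per distinct key; the function name `lowercase_collisions` and the toy vocabulary are made up for illustration.

```python
def lowercase_collisions(words):
    """Group words that become identical once lowercased."""
    groups = {}
    for w in words:
        groups.setdefault(w.lower(), set()).add(w)
    # Keep only keys that more than one original spelling maps to.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Toy vocabulary; in practice this would be the first token of each line.
vocab = ["The", "the", "Apple", "apple", "cat"]
collisions = lowercase_collisions(vocab)

# Each group of n spellings collapses into one entry, so n - 1 lines vanish.
lost = sum(len(forms) - 1 for forms in collisions.values())
print(collisions)
print("lines lost to case folding:", lost)
```

Because the collapse depends only on the input vocabulary, it would produce the same reduced line count for every lexicon, which matches the observation above.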

@japleengulati

I'm losing around 3% of vectors when retrofitting.
I've checked the 56 missing vectors out of 2070 input vectors, and it's not a case of lowercase/uppercase duplicates. Can you please advise on what this could be? Cheers!

@japleengulati

To elaborate on my issue and clarify: the 56 vectors themselves aren't missing, but they are missing dimensions!
I input 2070 vectors of 300 dimensions each. In the output I received the same number of vectors, but 56 of them had missing dimensions, with sizes like 296, 294, etc.
It does not seem to be a case of formatting gone wrong either; I've checked.
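
For anyone trying to narrow this down, a small diagnostic like the following can list the affected rows in either the input or the output file. It is only a sketch: it assumes a whitespace-separated text format with the word as the first token, the name `find_bad_rows` is hypothetical, and it does not identify the cause by itself.

```python
def _is_float(token):
    """Return True if the token parses as a float."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def find_bad_rows(lines, expected_dim):
    """List rows whose vector length is wrong or that hold non-numeric values.

    Returns tuples of (line number, first token, value count, bad tokens).
    """
    bad = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word, values = parts[0], parts[1:]
        unparsable = [v for v in values if not _is_float(v)]
        if len(values) != expected_dim or unparsable:
            bad.append((lineno, word, len(values), unparsable))
    return bad


# Toy data with 3 dimensions; a real check would iterate over an open file.
sample = ["cat 0.1 0.2 0.3", "dog 0.1 0.2", "new york 0.1 0.2 0.3"]
bad = find_bad_rows(sample, expected_dim=3)
for row in bad:
    print(row)
```

Running it on both files and comparing the reported line numbers shows whether the short rows already exist in the input or only appear after retrofitting.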
