Output size smaller than original #13

zharenkov · 2019-07-02T13:21:49Z

Hi, @mfaruqui
I'm passing GloVe's 840B.300d embedding file to retrofit.py. The input is about 5.5 GB, but the output file is only 3.7 GB (for both wordnet and paraphrase). Is this the expected behaviour? If so, could you please explain why the size decreases so significantly?

Thanks!

alina-le · 2019-07-14T12:07:17Z

hi @zharenkov, hi @mfaruqui,
i'm having the same issue: when comparing the original embeddings file to the retrofitted one, around 5% of the lines are lost and i'm wondering why.
(retrofitting the same word embedding with different lexicons results in the exact same decreased number of lines for each lexicon)
cheers!

mfaruqui · 2019-07-14T13:34:49Z

Line #44 in the code writes each float with only 4 digits after the decimal point, which makes the output file smaller. If the total number of words in the input and output is the same, this is fine.
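
For anyone who wants to see how much the shorter float format alone can shrink a line, here is a minimal sketch. The helper `format_line` and the sample values are made up for illustration and are not taken from retrofit.py; the only assumption is that the output uses a `%.4f`-style format, as described above.

```python
def format_line(word, values, fmt):
    """Render one embedding line as 'word v1 v2 ...' using the given float format."""
    return word + " " + " ".join(fmt % v for v in values)


# Hypothetical sample values, only to compare the two formats.
values = [0.27204, -0.06203, -0.1884, 0.023225]

original = format_line("the", values, "%.6f")   # longer representation
truncated = format_line("the", values, "%.4f")  # 4 digits after the decimal point

print(len(original), len(truncated))
```

Every value loses a couple of characters, and over millions of 300-dimensional lines that adds up, even when no words are dropped.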

alina-le · 2019-07-14T15:44:53Z

thanks for the answer @mfaruqui! figured out it was because the original file contains some words in both uppercase and lowercase, while the retrofitted embeddings are all lowercase, so those entries end up merged into one line
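
In case it helps others check the same thing on their own file, here is a rough sketch that counts how many lines would be merged if every word were lowercased. The function name `count_case_collisions` is hypothetical, and it assumes the usual text format where the word is the first space-separated field on each line.

```python
def count_case_collisions(lines):
    """Return (total non-empty lines, number of distinct lowercased words)."""
    total = 0
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        total += 1
        # The word is assumed to be the first space-separated field.
        seen.add(line.split(" ", 1)[0].lower())
    return total, len(seen)


sample = ["The 0.1 0.2", "the 0.3 0.4", "cat 0.5 0.6"]
total, unique = count_case_collisions(sample)
print(total - unique, "line(s) would be merged after lowercasing")
```

An open file handle can be passed instead of the sample list, so a large embeddings file does not need to be loaded into memory at once.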

japleengulati · 2022-06-13T14:51:00Z

I'm losing around 3% of my vectors when retrofitting.
I've checked the 56 missing vectors out of 2070 input vectors, and it's not a case of lowercase-uppercase duplicates. Could you please advise on what this could be? Cheers!

japleengulati · 2022-06-13T18:39:35Z

To clarify and add detail on my issue: the 56 vectors themselves aren't missing, but they're missing dimensions!
I input 2070 vectors of 300 dimensions each. In the output I received the same number of vectors, but 56 of them are missing dimensions, so they have 296, 294, etc. dimensions.
It does not seem to be a case of formatting gone wrong either; I've checked.
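
For reference, this is roughly the kind of check that can be used to find the affected lines. It is only a diagnostic sketch, not a fix: `find_irregular_vectors` is a hypothetical helper, and it assumes whitespace-separated text lines of the form `word v1 v2 ...`.

```python
def find_irregular_vectors(lines, expected_dim):
    """Return (line_number, word, dimension) for every vector whose size differs from expected_dim."""
    irregular = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        dim = len(fields) - 1  # everything after the word counts as a value
        if dim != expected_dim:
            irregular.append((line_no, fields[0], dim))
    return irregular


sample = ["cat 0.1 0.2 0.3", "dog 0.1 0.2", "", "bird 0.4 0.5 0.6"]
print(find_irregular_vectors(sample, expected_dim=3))
```

Running it on both the input and the output file with `expected_dim=300` shows whether the short vectors are already present before retrofitting or only appear afterwards.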
