-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix return dtype for matutils.unitvec
according to input dtype. Fix #1722
#1992
Conversation
As requested, I have edited the fix to ignore dtype size. I use np.issubtype to check input type and handle appropriately before return to ensure non-integer output.
Tests to ensure float output for both float and integer inputs.
gensim/matutils.py
Outdated
Parameters | ||
---------- | ||
vec : {numpy.ndarray, scipy.sparse, list of (int, float)} | ||
Input vector in any format | ||
norm : {'l1', 'l2'}, optional | ||
Normalization that will be used. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please return empty lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
gensim/matutils.py
Outdated
if scipy.sparse.issparse(vec): | ||
vec = vec.tocsr() | ||
if norm == 'l1': | ||
veclen = np.sum(np.abs(vec.data)) | ||
if norm == 'l2': | ||
veclen = np.sqrt(np.sum(vec.data ** 2)) | ||
if veclen > 0.0: | ||
return vec / veclen | ||
if np.issubdtype(vec.dtype, np.int) == True: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to == True
gensim/matutils.py
Outdated
else: | ||
return vec | ||
|
||
if isinstance(vec, np.ndarray): | ||
vec = np.asarray(vec, dtype=float) | ||
vec = np.asarray(vec, dtype=vec.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this have no sense (because later you'll cast it again)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree - this pretty much seems like a no-op, effectively. Does vec = np.asarray(vec, dtype=vec.dtype)
have any effect on vec
at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, I forgot about this - I will remove it on the next commit.
gensim/test_unitvec.py
Outdated
class UnitvecTestCase(unittest.TestCase): | ||
|
||
def manual_unitvec(self, vec): | ||
self.vec = vec |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use 4 spaces for indentation
gensim/test_unitvec.py
Outdated
@@ -0,0 +1,27 @@ | |||
import numpy as np |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_matutils.py is more suitable place for this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created pull request there
gensim/test_unitvec.py
Outdated
self.vec /= np.sqrt(sum_vec_squared) | ||
return self.vec | ||
|
||
def test_unitvec(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's about different vectors (sparse) + different types (more floats + int too)?
@o-P-o also, please resolve merge-conflict |
CC: @jayantj please have a look |
gensim/matutils.py
Outdated
@@ -667,45 +667,54 @@ def ret_log_normalize_vec(vec, axis=1): | |||
|
|||
def unitvec(vec, norm='l2'): | |||
"""Scale a vector to unit length. | |||
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't make unrelated changes, this empty line(s) is correct by docstring convention.
gensim/matutils.py
Outdated
else: | ||
return vec | ||
|
||
if isinstance(vec, np.ndarray): | ||
vec = np.asarray(vec, dtype=float) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this line is needed especially for blas_*
calls, please return it
gensim/test/test_matutils.py
Outdated
@@ -0,0 +1,187 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks strange, can you merge fresh develop
to current branch please?
gensim/test/test_matutils.py
Outdated
msg = "dirichlet_expectation_2d failed for dtype={}".format(dtype) | ||
self.assertTrue(np.allclose(known_good, test_values), msg) | ||
|
||
class UnitvecTestCase(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's a name?
gensim/test/test_matutils.py
Outdated
return self.vec | ||
|
||
def test_inputs(self): | ||
input_dtypes = [np.float32, np.float64, np.int32, np.int64] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add float
and int
(not numpy, python default)
gensim/test/test_matutils.py
Outdated
|
||
def test_inputs(self): | ||
input_dtypes = [np.float32, np.float64, np.int32, np.int64] | ||
input_arrtypes = ['sparse', 'normal'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dense
better name (instead of normal
)
gensim/test/test_matutils.py
Outdated
input_vector = np.random.randint(10, size=5).astype(dtype_) | ||
unit_vector = unitvec_with_bug.unitvec(input_vector) | ||
man_unit_vector = self.manual_unitvec(input_vector) | ||
self.assertTrue(np.allclose(unit_vector, man_unit_vector)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should check the dtype for this case too (you know that this will be default)
@o-P-o please resolve merge-conflict first before you start to change a code according to my review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Too many PEP8 errors, please fix it https://travis-ci.org/RaRe-Technologies/gensim/jobs/358455050#L614
gensim/matutils.py
Outdated
@@ -668,7 +668,7 @@ def ret_log_normalize_vec(vec, axis=1): | |||
|
|||
def unitvec(vec, norm='l2', return_norm=False): | |||
"""Scale a vector to unit length. | |||
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no leading spaces (here and everywhere)
gensim/matutils.py
Outdated
if scipy.sparse.issparse(vec): | ||
vec = vec.tocsr() | ||
if norm == 'l1': | ||
veclen = np.sum(np.abs(vec.data)) | ||
if norm == 'l2': | ||
veclen = np.sqrt(np.sum(vec.data ** 2)) | ||
if veclen > 0.0: | ||
if return_norm: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
revert existing code please (you shouldn't remove it)
Removed leading spaces, which is the source of the PEP8/travis errors. Sorry, only just learnt from you what these actually are :) Also updated the code to include 'if return_norm' statement from the sparse array case. (I can't remember why I actually removed this in the first place.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @o-P-o thanks a lot for submitting the PR! Good work.
I've left a few comments, could you please fix the issues raised so that we can go ahead and merge?
gensim/test/test_matutils.py
Outdated
self.vec /= np.sqrt(sum_vec_squared) | ||
return self.vec | ||
|
||
def test_inputs(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO we should split this test into multiple tests (one per combination of (arr_type, dtype)
maybe) so the logic is simpler and so that it's possible to look at a failing test log and see exactly what kind of input caused the test to fail.
gensim/test/test_matutils.py
Outdated
@@ -141,6 +142,43 @@ def testDirichletExpectation(self): | |||
self.assertTrue(np.allclose(known_good, test_values), msg) | |||
|
|||
|
|||
class UnitvecTestCase(unittest.TestCase): | |||
# test unitvec | |||
def manual_unitvec(self, vec): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely should be simplified - why use self
at all?
IMO should just be -
vec = vec.astype(np.float)
if sparse.issparse(vec):
vec_sum_of_squares = vec.multiply(vec)
unit = 1. / np.sqrt(vec_sum_of_squares.sum())
return vec.multiply(unit)
else:
sum_vec_squared = np.sum(vec ** 2)
vec /= np.sqrt(sum_vec_squared)
return vec
gensim/matutils.py
Outdated
if norm == 'l1': | ||
veclen = np.sum(np.abs(vec)) | ||
if norm == 'l2': | ||
veclen = blas_nrm2(vec) | ||
if veclen > 0.0: | ||
if return_norm: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This entire construct (and a similar construct above) seems to have some unnecessary redundant code. We could simplify to something like -
if veclen > 0.0:
if np.issubdtype(vec.dtype, np.int):
vec = vec.astype(np.float)
if return_norm:
return blas_scal(1.0 / veclen, vec).astype(vec.dtype), veclen
else:
return blas_scal(1.0 / veclen, vec).astype(vec.dtype)
This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood.
Simplified tha manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested.
gensim/matutils.py
Outdated
vec = vec.astype(np.float) / veclen | ||
else: | ||
vec /= veclen | ||
vec = vec.astype(vec.dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Slightly confusing. For consistency, I'd prefer -
if np.issubdtype(vec.dtype, np.int):
vec = vec.astype(np.float) / veclen
else:
vec = vec.astype(vec.dtype) / veclen
Does that make sense?
In fact, if dividing by veclen
cannot change the dtype (is this the case?), even something like -
if np.issubdtype(vec.dtype, np.int):
vec = vec.astype(np.float)
vec /= veclen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, sorry I'm really nitpicking here :)
But small things like this cause a slow decline in overall code quality over time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately dividing by veclen
actually changes the dtype, causing the original problem to manifest itself (i.e. float32 inputs outputting float64). Even trying something like this:
if np.issubdtype(vec.dtype, np.int):
vec = vec.astype(np.float) / veclen
else:
vec = vec.astype(vec.dtype) / veclen.astype(vec.dtype)
causes the same problem. This is why I divide by veclen
and then cast the dtype. Do you have any suggestions to get around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I really like your second suggestion, and it passes all tests:
if np.issubdtype(vec.dtype, np.int):
vec = vec.astype(np.float)
vec /= veclen
That's very neat indeed, I'll commit it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, don't worry about the nitpicking! When I am experienced enough to pick up on such things I'm sure I will be nitpicking like this haha.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix and the explanation! Looks good.
return vec | ||
|
||
|
||
class UnitvecTestCase(unittest.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reorganizing the tests! Looks much better now IMO
Congratulation @o-P-o with first contribution 🥇 |
matutils.unitvec
according to input dtype. Fix #1722
Thanks for assisting my first ever merge guys! |
…iskvorky#1722 (piskvorky#1992) * matutils.unitvec bug As requested, I have edited the fix to ignore dtype size. I use np.issubtype to check input type and handle appropriately before return to ensure non-integer output. * matutils.unitvec fix tests Tests to ensure float output for both float and integer inputs. * unitvec equal input and return types * Update and rename test_unitvec to test_unitvec.py * Update matutils.py * Update matutils.py * Update test_unitvec.py * Update and rename gensim/test_unitvec.py to gensim/test/test_matutils.py * Update matutils.py * Update test_matutils.py * Update test_matutils.py * Update following review Removed leading spaces, which is the source of the PEP8/travis errors. Sorry, only just learnt from you what these actually are :) Also updated the code to include 'if return_norm' statement from the sparse array case. (I can't remember why I actually removed this in the first place.) * Update: attempt to solve Travis errors * Update test_matutils.py * Update matutils.py * Update matutils.py * Update test_matutils.py * Addressing travis errors * Remove unnecessary dtype assignment * return_norm statements for array instance case * Update test_matutils.py * Reduce line repetition * Reduce repeated lines * Update test_matutils.py * Remove some redundant code from unitvec This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood. * UnitvecTestCase update Simplified tha manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested. * Small typo fix * Trailing white-space fix for Travis * Improve code quality and remove no-op
This returns the float32 dtypes for float32 inputs for both normal and sparse arrays. The output unit vectors are also the same as those outputted from doing manual unit vector calculations.
How does this look? Any further suggestions for improvement or why this isn't acceptable?