Fix return dtype for `matutils.unitvec` according to input dtype. Fix #1722 #1992

o-P-o · 2018-03-21T19:42:24Z

This returns float32 output for float32 input, for both normal (dense) and sparse arrays. The resulting unit vectors also match those obtained from a manual unit-vector calculation.

How does this look? Any further suggestions for improvement or why this isn't acceptable?

As requested, I have edited the fix to ignore dtype size. I use `np.issubdtype` to check the input type and handle it appropriately before returning, to ensure non-integer output.

Tests to ensure float output for both float and integer inputs.

menshikh-iv · 2018-03-22T06:50:41Z

gensim/matutils.py

    Parameters
    ----------
    vec : {numpy.ndarray, scipy.sparse, list of (int, float)}
        Input vector in any format
    norm : {'l1', 'l2'}, optional
        Normalization that will be used.
-


Please restore the empty lines.

menshikh-iv · 2018-03-22T06:52:03Z

gensim/matutils.py

    if scipy.sparse.issparse(vec):
        vec = vec.tocsr()
        if norm == 'l1':
            veclen = np.sum(np.abs(vec.data))
        if norm == 'l2':
            veclen = np.sqrt(np.sum(vec.data ** 2))
        if veclen > 0.0:
-            return vec / veclen
+            if np.issubdtype(vec.dtype, np.int) == True:


No need for `== True`.
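A small illustration of the point, assuming a recent NumPy: `np.issubdtype` already returns a boolean, so it can be used directly as a condition. The sketch uses `np.integer` as the abstract integer type because the `np.int` alias used in this PR has since been removed from NumPy; the helper name `needs_float_cast` is made up for the example.

```python
import numpy as np


def needs_float_cast(vec):
    """Return True when `vec` holds integers and must be cast before scaling."""
    # np.issubdtype returns a boolean, so no `== True` comparison is needed.
    return np.issubdtype(vec.dtype, np.integer)


int_vec = np.array([1, 2, 3], dtype=np.int32)
float_vec = np.array([1, 2, 3], dtype=np.float32)

print(needs_float_cast(int_vec))
print(needs_float_cast(float_vec))
```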

menshikh-iv · 2018-03-22T06:52:57Z

gensim/matutils.py

        else:
            return vec

    if isinstance(vec, np.ndarray):
-        vec = np.asarray(vec, dtype=float)
+        vec = np.asarray(vec, dtype=vec.dtype)


Maybe this makes no sense (because you'll cast it again later).

I agree - this pretty much seems like a no-op, effectively. Does vec = np.asarray(vec, dtype=vec.dtype) have any effect on vec at all?

Ah yes, I forgot about this - I will remove it on the next commit.
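The no-op can be checked directly. The following is only a sketch on a made-up float32 vector: when the requested dtype already matches, `np.asarray` hands back the same object, whereas the original `dtype=float` line produces a float64 copy.

```python
import numpy as np

vec = np.array([3.0, 4.0], dtype=np.float32)

# dtype already matches, so np.asarray returns the input array itself.
same = np.asarray(vec, dtype=vec.dtype)

# The original line: `float` means float64, so this makes an upcast copy.
as_float64 = np.asarray(vec, dtype=float)

print(same is vec)
print(as_float64.dtype)
```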

menshikh-iv · 2018-03-22T06:53:21Z

gensim/test_unitvec.py

+class UnitvecTestCase(unittest.TestCase):
+
+	def manual_unitvec(self, vec):
+		self.vec = vec


please use 4 spaces for indentation

menshikh-iv · 2018-03-22T06:54:02Z

gensim/test_unitvec.py

@@ -0,0 +1,27 @@
+import numpy as np


https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/test/test_matutils.py is a more suitable place for this test.

Created pull request there

menshikh-iv · 2018-03-22T06:55:49Z

gensim/test_unitvec.py

+			self.vec /= np.sqrt(sum_vec_squared)
+			return self.vec
+
+	def test_unitvec(self):


What about different vectors (sparse) and different types (more floats, and int too)?

menshikh-iv · 2018-03-22T06:58:29Z

@o-P-o also, please resolve merge-conflict

menshikh-iv · 2018-03-22T06:58:37Z

CC: @jayantj please have a look

menshikh-iv · 2018-03-24T20:40:30Z

gensim/matutils.py

@@ -667,45 +667,54 @@ def ret_log_normalize_vec(vec, axis=1):

 def unitvec(vec, norm='l2'):
    """Scale a vector to unit length.
-
+    


Please don't make unrelated changes; this empty line is correct by docstring convention.

menshikh-iv · 2018-03-24T20:41:52Z

gensim/matutils.py

        else:
            return vec

    if isinstance(vec, np.ndarray):
-        vec = np.asarray(vec, dtype=float)


I think this line is needed, especially for the blas_* calls; please restore it.

menshikh-iv · 2018-03-24T20:42:43Z

gensim/test/test_matutils.py

@@ -0,0 +1,187 @@
+#!/usr/bin/env python


looks strange, can you merge fresh develop to current branch please?

menshikh-iv · 2018-03-24T20:42:59Z

gensim/test/test_matutils.py

+                msg = "dirichlet_expectation_2d failed for dtype={}".format(dtype)
+                self.assertTrue(np.allclose(known_good, test_values), msg)
+
+class UnitvecTestCase(unittest.TestCase):


what's a name?

menshikh-iv · 2018-03-24T20:44:08Z

gensim/test/test_matutils.py

+	    return self.vec
+
+    def test_inputs(self):
+	input_dtypes = [np.float32, np.float64, np.int32, np.int64]


Please add `float` and `int` too (the Python built-ins, not the NumPy types).

menshikh-iv · 2018-03-24T20:44:30Z

gensim/test/test_matutils.py

+
+    def test_inputs(self):
+	input_dtypes = [np.float32, np.float64, np.int32, np.int64]
+	input_arrtypes = ['sparse', 'normal']


`dense` is a better name (instead of `normal`).

menshikh-iv · 2018-03-24T20:45:20Z

gensim/test/test_matutils.py

+		        input_vector = np.random.randint(10, size=5).astype(dtype_)
+			unit_vector = unitvec_with_bug.unitvec(input_vector)
+			man_unit_vector = self.manual_unitvec(input_vector)
+			self.assertTrue(np.allclose(unit_vector, man_unit_vector))


you should check the dtype for this case too (you know that this will be default)

menshikh-iv · 2018-03-24T20:46:21Z

@o-P-o please resolve merge-conflict first before you start to change a code according to my review.

menshikh-iv

Too many PEP8 errors, please fix them: https://travis-ci.org/RaRe-Technologies/gensim/jobs/358455050#L614

menshikh-iv · 2018-03-27T21:21:40Z

gensim/matutils.py

@@ -668,7 +668,7 @@ def ret_log_normalize_vec(vec, axis=1):

 def unitvec(vec, norm='l2', return_norm=False):
    """Scale a vector to unit length.
-
+    


no leading spaces (here and everywhere)

menshikh-iv · 2018-03-27T21:22:22Z

gensim/matutils.py

    if scipy.sparse.issparse(vec):
        vec = vec.tocsr()
        if norm == 'l1':
            veclen = np.sum(np.abs(vec.data))
        if norm == 'l2':
            veclen = np.sqrt(np.sum(vec.data ** 2))
        if veclen > 0.0:
-            if return_norm:


revert existing code please (you shouldn't remove it)

Removed leading spaces, which is the source of the PEP8/travis errors. Sorry, only just learnt from you what these actually are :) Also updated the code to include 'if return_norm' statement from the sparse array case. (I can't remember why I actually removed this in the first place.)

jayantj

Hi @o-P-o thanks a lot for submitting the PR! Good work.
I've left a few comments, could you please fix the issues raised so that we can go ahead and merge?

jayantj · 2018-04-10T10:14:51Z

gensim/test/test_matutils.py

+            self.vec /= np.sqrt(sum_vec_squared)
+            return self.vec
+
+    def test_inputs(self):


IMO we should split this test into multiple tests (one per combination of (arr_type, dtype) maybe) so the logic is simpler and so that it's possible to look at a failing test log and see exactly what kind of input caused the test to fail.
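One possible shape for such a split, shown for the dense case only. This is a sketch rather than the test code that was merged: `unitvec_sketch` is a hypothetical stand-in for `matutils.unitvec`, the shared `check` helper is made up for the example, and casting integers to float64 is an assumption of the sketch.

```python
import unittest

import numpy as np


def unitvec_sketch(vec):
    """Hypothetical stand-in for matutils.unitvec (dense input, L2 norm only)."""
    if np.issubdtype(vec.dtype, np.integer):
        vec = vec.astype(np.float64)  # integers cannot hold a unit vector
    else:
        vec = vec.copy()  # leave the caller's array untouched
    veclen = np.sqrt(np.sum(vec ** 2))
    if veclen > 0.0:
        vec /= veclen  # in-place division keeps vec's dtype
    return vec


class UnitvecTestCase(unittest.TestCase):
    def check(self, dtype, expected_dtype):
        # Shared assertion logic; each public test names one input dtype.
        vec = np.array([3, 0, 4]).astype(dtype)
        unit = unitvec_sketch(vec)
        self.assertTrue(np.allclose(unit, [0.6, 0.0, 0.8]))
        self.assertEqual(unit.dtype, expected_dtype)

    def test_dense_float32(self):
        self.check(np.float32, np.float32)

    def test_dense_float64(self):
        self.check(np.float64, np.float64)

    def test_dense_int32(self):
        self.check(np.int32, np.float64)

    def test_dense_int64(self):
        self.check(np.int64, np.float64)
```

Saved as a module, these can be run with `python -m unittest`; a failing combination then shows up under its own method name in the log.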

jayantj · 2018-04-10T10:17:41Z

gensim/test/test_matutils.py

@@ -141,6 +142,43 @@ def testDirichletExpectation(self):
                self.assertTrue(np.allclose(known_good, test_values), msg)


+class UnitvecTestCase(unittest.TestCase):
+    # test unitvec
+    def manual_unitvec(self, vec):


Definitely should be simplified - why use self at all?
IMO should just be -

    vec = vec.astype(np.float)
    if sparse.issparse(vec):
        vec_sum_of_squares = vec.multiply(vec)
        unit = 1. / np.sqrt(vec_sum_of_squares.sum())
        return vec.multiply(unit)
    else:
        sum_vec_squared = np.sum(vec ** 2)
        vec /= np.sqrt(sum_vec_squared)
        return vec
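For readers who want to try the suggested helper on a current stack, here is a self-contained variant. It is a sketch under two assumptions: `np.float64` replaces the `np.float` alias (since removed from NumPy), and the input is a non-zero vector, so there is no guard against dividing by zero.

```python
import numpy as np
from scipy import sparse


def manual_unitvec(vec):
    """Reference L2 normalisation used only to cross-check unitvec in tests."""
    vec = vec.astype(np.float64)  # astype copies, so the input is not modified
    if sparse.issparse(vec):
        sum_of_squares = vec.multiply(vec).sum()
        return vec.multiply(1.0 / np.sqrt(sum_of_squares))
    vec /= np.sqrt(np.sum(vec ** 2))
    return vec


dense = np.array([3, 4], dtype=np.int32)
sparse_row = sparse.csr_matrix(np.array([[3.0, 0.0, 4.0]], dtype=np.float32))

print(manual_unitvec(dense))
print(manual_unitvec(sparse_row).toarray())
```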

jayantj · 2018-04-10T10:22:58Z

gensim/matutils.py

        if norm == 'l1':
            veclen = np.sum(np.abs(vec))
        if norm == 'l2':
            veclen = blas_nrm2(vec)
        if veclen > 0.0:
-            if return_norm:


This entire construct (and a similar construct above) seems to have some unnecessary redundant code. We could simplify to something like -

    if veclen > 0.0:
        if np.issubdtype(vec.dtype, np.int):
            vec = vec.astype(np.float)
        if return_norm:
            return blas_scal(1.0 / veclen, vec).astype(vec.dtype), veclen
        else:
            return blas_scal(1.0 / veclen, vec).astype(vec.dtype)

This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood.

Simplified the manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested.

jayantj · 2018-04-13T03:43:39Z

gensim/matutils.py

+                vec = vec.astype(np.float) / veclen
+            else:
+                vec /= veclen
+                vec = vec.astype(vec.dtype)


Slightly confusing. For consistency, I'd prefer -

    if np.issubdtype(vec.dtype, np.int):
        vec = vec.astype(np.float) / veclen
    else:
        vec = vec.astype(vec.dtype) / veclen

Does that make sense?
In fact, if dividing by veclen cannot change the dtype (is this the case?), even something like -

    if np.issubdtype(vec.dtype, np.int):
        vec = vec.astype(np.float)
    vec /= veclen

Also, sorry I'm really nitpicking here :)
But small things like this cause a slow decline in overall code quality over time.

Unfortunately dividing by veclen actually changes the dtype, causing the original problem to manifest itself (i.e. float32 inputs outputting float64). Even trying something like this:

    if np.issubdtype(vec.dtype, np.int):
        vec = vec.astype(np.float) / veclen
    else:
        vec = vec.astype(vec.dtype) / veclen.astype(vec.dtype)

causes the same problem. This is why I divide by veclen and then cast the dtype. Do you have any suggestions to get around this?

Actually, I really like your second suggestion, and it passes all tests:

    if np.issubdtype(vec.dtype, np.int):
        vec = vec.astype(np.float)
    vec /= veclen

That's very neat indeed, I'll commit it.

Also, don't worry about the nitpicking! When I am experienced enough to pick up on such things I'm sure I will be nitpicking like this haha.

Thanks for the fix and the explanation! Looks good.
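The behaviour discussed above can be reproduced on a dense array. This is a sketch with made-up values, not the gensim code, and it does not cover the sparse case. The dtype of the out-of-place result depends on the promotion rules of the installed NumPy version, so it is only printed; the in-place form writes back into the existing array and therefore keeps its dtype.

```python
import numpy as np

vec = np.array([3.0, 4.0], dtype=np.float32)
veclen = np.float64(5.0)

# Out-of-place: NumPy chooses the result dtype via its promotion rules.
print((vec / veclen).dtype)

# In-place: the result is stored back into `vec`, so its dtype is unchanged.
vec /= veclen
print(vec.dtype)

# Integer arrays cannot be divided in place, hence the cast to float first.
int_vec = np.array([3, 4], dtype=np.int64)
try:
    int_vec /= veclen
except TypeError as err:
    print("in-place division failed:", type(err).__name__)

int_vec = int_vec.astype(np.float64)
int_vec /= veclen
print(int_vec.dtype)
```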

jayantj · 2018-04-13T03:45:29Z

gensim/test/test_matutils.py

+        return vec
+
+
+class UnitvecTestCase(unittest.TestCase):


Thanks for reorganizing the tests! Looks much better now IMO

menshikh-iv · 2018-04-16T03:49:07Z

Congratulations @o-P-o on your first contribution 🥇

o-P-o · 2018-04-16T11:26:30Z

Thanks for assisting my first ever merge guys!

o-P-o added 5 commits January 31, 2018 20:59

matutils.unitvec bug

82b8d17

As requested, I have edited the fix to ignore dtype size. I use `np.issubdtype` to check the input type and handle it appropriately before returning, to ensure non-integer output.

matutils.unitvec fix tests

834b042

Tests to ensure float output for both float and integer inputs.

unitvec equal input and return types

cf463b2

Update and rename test_unitvec to test_unitvec.py

5656835

Update matutils.py

406ed66

menshikh-iv suggested changes Mar 22, 2018

o-P-o added 2 commits March 22, 2018 21:25

Update matutils.py

e71afcd

Update test_unitvec.py

fe36408

tmylk mentioned this pull request Mar 23, 2018

Update test_matutils.py for matutils.unitvec bug fix #1996

Closed

Update and rename gensim/test_unitvec.py to gensim/test/test_matutils.py

50d011c

menshikh-iv suggested changes Mar 24, 2018

o-P-o added 4 commits March 26, 2018 16:02

Merge branch 'develop' into develop

f1a40ac

Update matutils.py

ead451f

Update test_matutils.py

ae03291

Update test_matutils.py

cab90a8

menshikh-iv suggested changes Mar 27, 2018

o-P-o added 11 commits March 28, 2018 00:18

Update: attempt to solve Travis errors

218fe42

Update test_matutils.py

5fd1004

Update matutils.py

80628c0

Update matutils.py

768226b

Update test_matutils.py

438f763

Addressing travis errors

f73076a

Remove unnecessary dtype assignment

cd50529

return_norm statements for array instance case

11b0dde

Update test_matutils.py

2e86529

Reduce line repetition

30d6284

o-P-o added 2 commits March 29, 2018 18:57

Reduce repeated lines

d9cfb0d

Update test_matutils.py

2e8eca1

jayantj suggested changes Apr 10, 2018

o-P-o added 4 commits April 12, 2018 19:58

Remove some redundant code from unitvec

6beec75

This is what I have done based on Jayanti's suggestion of redundant code. Let me know if I have misunderstood.

UnitvecTestCase update

365a722

Simplified the manual_unitvec method and created a separate test for each `arrtype, dtype` pair, as suggested.

Small typo fix

9327f68

Trailing white-space fix for Travis

a42cf38

jayantj reviewed Apr 13, 2018

Improve code quality and remove no-op

1078bdd

jayantj approved these changes Apr 16, 2018

menshikh-iv approved these changes Apr 16, 2018

menshikh-iv changed the title ~~Bug-fix attempt (matutils.unitvec): input float32 returning different dtype~~ Fix return dtype for matutils.unitvec according to input dtype. Fix #1722 Apr 16, 2018

menshikh-iv merged commit 8daace2 into piskvorky:develop Apr 16, 2018
