[SPARK-4797] Replace breezeSquaredDistance #3643

Closed
wants to merge 11 commits into from


viirya
Member

@viirya viirya commented Dec 9, 2014

This PR replaces the slow breezeSquaredDistance.

@viirya viirya changed the title Replace breezeSquaredDistance [SPARK-4797] Replace breezeSquaredDistance Dec 9, 2014
@jkbradley
Member

Hi, it looks like this may be faster for dense vectors but not for sparse. SparseVector.toArray will create a dense vector, making it much slower if the vector is very sparse. You will probably need separate cases for DenseVector and SparseVector.
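
To illustrate the point about avoiding `toArray`, here is a minimal sketch of a sparse/dense squared distance that reads the sparse entries in place. It is not the code from this PR: the `Sparse` case class and the `SparseDenseSqdist` object are hypothetical stand-ins for MLlib's vector types, and the sparse indices are assumed to be strictly increasing.

```scala
// Hypothetical minimal sparse vector (not MLlib's SparseVector).
// Assumption: `indices` is strictly increasing and each index is < size.
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double])

object SparseDenseSqdist {
  // Squared Euclidean distance between a sparse and a dense vector.
  // The sparse operand is never expanded into a dense array.
  def sqdist(s: Sparse, d: Array[Double]): Double = {
    require(s.size == d.length, "vector sizes must match")
    val nnz = s.indices.length
    var k = 0     // position in the sparse index/value arrays
    var i = 0     // position in the dense array
    var sum = 0.0
    while (i < d.length) {
      val diff =
        if (k < nnz && s.indices(k) == i) {
          val x = s.values(k) - d(i)
          k += 1
          x
        } else {
          -d(i)   // the sparse vector holds an implicit zero here
        }
      sum += diff * diff
      i += 1
    }
    sum
  }
}
```

For example, `SparseDenseSqdist.sqdist(Sparse(4, Array(1, 3), Array(2.0, 5.0)), Array(1.0, 2.0, 0.0, 1.0))` evaluates to `17.0`. The dense side still costs one pass over its array, but no temporary array is allocated for the sparse side.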

@viirya
Member Author

viirya commented Dec 10, 2014

Thanks. I have added separate cases for SparseVector and DenseVector.

var squaredDistance = 0.0
(v1, v2) match {
  case (v1: SparseVector, v2: SparseVector) =>
    v1.indices.intersect(v2.indices).foreach((idx) => {
Member

.intersect does not know that the indices are sorted, so it may be much faster to compute the intersection explicitly. That is done in BLAS.dot; I would recommend following that pattern. That will keep you from having 3 separate iterations over the indices.

In general, I wonder if these methods would be much faster if you iterated over counters & used while loops, rather than built-in iterator methods like intersect/diff/foreach.
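
The sorted-index idea can be sketched as a two-pointer merge. The block below is only an illustration of that pattern, not the source of `BLAS.dot`; the `SortedDot` object and its parameter names are hypothetical, and both index arrays are assumed to be strictly increasing.

```scala
object SortedDot {
  // Dot product of two sparse vectors, each given as (sorted indices, values).
  // One pass over both index arrays; no intermediate collections are built.
  def dot(xi: Array[Int], xv: Array[Double],
          yi: Array[Int], yv: Array[Double]): Double = {
    var kx = 0
    var ky = 0
    var sum = 0.0
    while (kx < xi.length && ky < yi.length) {
      if (xi(kx) < yi(ky)) {
        kx += 1                 // index only present in x: contributes 0
      } else if (xi(kx) > yi(ky)) {
        ky += 1                 // index only present in y: contributes 0
      } else {
        sum += xv(kx) * yv(ky)  // shared index
        kx += 1
        ky += 1
      }
    }
    sum
  }
}
```

With indices `(0, 2, 5)` and `(2, 3, 5)` and values `(1.0, 2.0, 3.0)` and `(4.0, 1.0, 2.0)`, the shared indices are 2 and 5, so the result is `2.0 * 4.0 + 3.0 * 2.0 = 14.0`. For a dot product the loop can stop as soon as either side is exhausted; a distance computation cannot, because unmatched entries still contribute.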

Member Author

Thanks. I will modify this PR to follow the pattern used in BLAS.dot.

It seems that a while loop performs better than foreach. I have added a commit that replaces foreach with a while loop.

@jkbradley
Member

Thanks for the update! I added some comments. One more about public APIs: The new method for vectorSquaredDistance is public, which requires some careful thought about API design. I'd recommend either:
(a) making the method private, or
(b) discussing a good place for distance metrics to go + a good API for them.
If you choose (a), then a later PR could always do (b).

@viirya
Member Author

viirya commented Dec 11, 2014

Thanks for that. I have added a new commit that makes the methods private.

@viirya
Member Author

viirya commented Dec 11, 2014

Hi, intersect, diff and foreach have all been replaced with while loops in the new commit, following the BLAS.dot pattern. Please let me know if you see any problems. Thanks.

var kv1 = 0
var kv2 = 0
var score = 0.0
while (kv1 < nnzv1) {
Member

Is this loop logic correct? What if v1 runs out of indices, but v2 has some left? You'll need a check for this after the loop. But a better way to write it might be to have the outer loop check kv1 < nnzv1 || kv2 < nnzv2 and handle 1 non-zero index per iteration of the loop. (That would also let you have only 1 line updating squaredDistance.)
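
A minimal sketch of the loop shape suggested here, under the assumption that both index arrays are strictly increasing. The `SparseSqdist` object and its parameter names are hypothetical and this is not the code that was merged; it only shows how one loop with an `||` condition handles leftover entries on either side and keeps a single line that updates the running sum.

```scala
object SparseSqdist {
  // Squared Euclidean distance between two sparse vectors, each given as
  // (sorted indices, values). Exactly one non-zero index is consumed per
  // iteration, so entries left over on either side are not skipped.
  def sqdist(i1: Array[Int], v1: Array[Double],
             i2: Array[Int], v2: Array[Double]): Double = {
    val nnz1 = i1.length
    val nnz2 = i2.length
    var k1 = 0
    var k2 = 0
    var sum = 0.0
    while (k1 < nnz1 || k2 < nnz2) {
      var score = 0.0
      if (k2 >= nnz2 || (k1 < nnz1 && i1(k1) < i2(k2))) {
        score = v1(k1)            // index only present in the first vector
        k1 += 1
      } else if (k1 >= nnz1 || i2(k2) < i1(k1)) {
        score = v2(k2)            // index only present in the second vector
        k2 += 1
      } else {
        score = v1(k1) - v2(k2)   // index present in both vectors
        k1 += 1
        k2 += 1
      }
      sum += score * score        // the only line that updates the result
    }
    sum
  }
}
```

For indices `(0, 2)` with values `(1.0, 3.0)` against indices `(2, 4, 5)` with values `(1.0, 2.0, 2.0)`, the first vector runs out after index 2 while the second still has two entries, and the result is `1 + 4 + 4 + 4 = 13.0`.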

@jkbradley
Member

@viirya Thanks for the updates! I made some inline comments, one of them major. Please let me know when to check again.

/**
* Returns the squared distance between DenseVector and SparseVector.
*/
private[util] def vectorSquaredDistance(v1: SparseVector, v2: DenseVector): Double = {
Member Author

Indeed. Modified in later commit.

@viirya
Member Author

viirya commented Dec 16, 2014

@jkbradley Thanks. The code has been modified to address your comments. The test has also been expanded to cover the case from the major comment you mentioned. Please check it again.

} else if (kv1 >= nnzv1 || (kv2 < nnzv2 && v2Indices(kv2) < v1Indices(kv1))) {
  score = v2Values(kv2)
  kv2 += 1
} else if ((kv1 < nnzv1 && kv2 < nnzv2) && v1Indices(kv1) == v2Indices(kv2)) {
Member

No need to check (kv1 < nnzv1 && kv2 < nnzv2) here since it will always be true (because of the previous 2 if statements).

@jkbradley
Member

Just added more comments, including 1 small bug

@mengxr
Contributor

mengxr commented Dec 19, 2014

A high-level comment: We recently merged a couple of PRs that optimize linear algebra performance because breeze is slow at certain operations. But I have always thought it would be better to optimize breeze directly, unless we want to provide a fully functional linear algebra package inside mllib. If this optimization could easily go into breeze, I would suggest that direction.

@viirya
Member Author

viirya commented Dec 19, 2014

Thanks @mengxr. I am not very familiar with breeze, but from a rough read of its distance metric implementations, they all follow the same pattern built on a zip operation. That handles distance metrics in a consistent way, but efficiency does not seem to be the main consideration. This matches the breeze project's goal of being generic, clean, and powerful without sacrificing (much) efficiency. Adopting this optimization in breeze would likely conflict with breeze's principles and design, so as far as I can see it may not go into breeze easily. That is my rough impression.

@SparkQA

SparkQA commented Dec 22, 2014

Test build #552 has finished for PR 3643 at commit 91849d0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

It looks like there are bugs which show up in the pyspark tests. Have you run them locally?

@viirya
Member Author

viirya commented Dec 23, 2014

I had not run the pyspark tests. Fixed in the latest update.

@viirya
Member Author

viirya commented Dec 23, 2014

Please test it again.

@SparkQA

SparkQA commented Dec 23, 2014

Test build #553 has finished for PR 3643 at commit ba34422.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member Author

viirya commented Dec 29, 2014

@jkbradley Do you have any remaining concerns? Is this ready to merge? Thanks.

@mengxr
Contributor

mengxr commented Dec 30, 2014

@viirya We recently added some linear algebra utility functions to linalg.Vectors. That should be a good place for this distance function. We use the MATLAB naming convention there. Could you rebase onto master and move the implementation there? I suggest using sqdist as the method name and making it public.

@viirya
Member Author

viirya commented Dec 31, 2014

@mengxr The implementation is renamed and moved to linalg.Vectors. Would you like to test it again?

@mengxr
Contributor

mengxr commented Dec 31, 2014

add to whitelist

@mengxr
Contributor

mengxr commented Dec 31, 2014

test this please

@jkbradley
Member

I ran some quick tests with random sparsity patterns. Averaged over 1000 iterations, it's definitely faster:

| length | v1 sparsity | v2 sparsity | new time | old time | speedup |
|---|---|---|---|---|---|
| 1000 | 1 | 0.5 | 9.42E-06 | 6.73E-04 | 71.44 |
| 1000 | 1 | 0.1 | 1.69E-06 | 5.50E-05 | 32.43 |
| 1000 | 1 | 0.01 | 1.90E-06 | 3.30E-05 | 17.40 |
| 1000 | 0.5 | 0.1 | 9.89E-06 | 7.17E-05 | 7.25 |
| 1000 | 0.5 | 0.01 | 2.54E-06 | 5.80E-05 | 22.80 |
| 1000 | 0.1 | 0.01 | 1.95E-06 | 5.82E-05 | 29.84 |
| 10000 | 1 | 0.5 | 1.11E-05 | 2.30E-04 | 20.73 |
| 10000 | 1 | 0.1 | 1.03E-05 | 2.54E-04 | 24.54 |
| 10000 | 1 | 0.01 | 8.69E-06 | 3.90E-04 | 44.92 |
| 10000 | 0.5 | 0.1 | 1.47E-05 | 3.90E-04 | 26.63 |
| 10000 | 0.5 | 0.01 | 8.63E-06 | 4.03E-04 | 46.76 |
| 10000 | 0.1 | 0.01 | 1.81E-06 | 5.96E-04 | 329.01 |
| 100000 | 1 | 0.5 | 9.27E-05 | 0.004039351 | 43.60 |
| 100000 | 1 | 0.1 | 9.06E-05 | 0.001540544 | 17.01 |
| 100000 | 1 | 0.01 | 8.71E-05 | 0.002636216 | 30.25 |
| 100000 | 0.5 | 0.1 | 1.15E-04 | 0.003777669 | 32.76 |
| 100000 | 0.5 | 0.01 | 9.61E-05 | 0.004879063 | 50.79 |
| 100000 | 0.1 | 0.01 | 1.89E-05 | 0.003148419 | 166.29 |
| 1000000 | 1 | 0.5 | 0.001017196 | 0.05418411 | 53.27 |
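
The benchmark script itself is not shown in this thread. As a rough sketch of how inputs with a random sparsity pattern could be produced, the hypothetical helper below assumes that the sparsity columns denote the fraction of non-zero entries (so 1 means fully dense); the `RandomSparse` object, its parameters, and that reading of the columns are all assumptions.

```scala
import scala.util.Random

object RandomSparse {
  // Returns (sorted indices, values) for a vector of length `size` in which
  // each position is non-zero with probability `density` (assumed meaning of
  // the "sparsity" columns above).
  def generate(size: Int, density: Double, rng: Random): (Array[Int], Array[Double]) = {
    // Walking positions in order keeps the indices strictly increasing.
    val indices = (0 until size).filter(_ => rng.nextDouble() < density).toArray
    // Shift into [1.0, 2.0) so that every stored value is non-zero.
    val values = indices.map(_ => rng.nextDouble() + 1.0)
    (indices, values)
  }
}
```

Passing a seeded generator, for example `new Random(42)`, makes a run reproducible, which helps when the same inputs are fed to both the old and the new implementation.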

@jkbradley
Member

@viirya Thanks for the updates! LGTM pending Jenkins

@SparkQA

SparkQA commented Dec 31, 2014

Test build #24964 has finished for PR 3643 at commit f28b275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 06a9aa5 Dec 31, 2014
@mengxr
Contributor

mengxr commented Dec 31, 2014

Merged into master. Thanks! (minor TODO: Though sqdist is touched in MLUtilsSuite, it would be nice to add unit tests to VectorsSuite.)

@viirya
Member Author

viirya commented Jan 1, 2015

Thanks. I will add the unit tests in a later PR soon.

asfgit pushed a commit that referenced this pull request Jan 6, 2015
Related to #3643. Follow the previous suggestion to add unit test for `sqdist` in `VectorsSuite`.

Author: Liang-Chi Hsieh <[email protected]>

Closes #3869 from viirya/sqdist_test and squashes the following commits:

fb743da [Liang-Chi Hsieh] Modified for comment and fix bug.
90a08f3 [Liang-Chi Hsieh] Modified for comment.
39a3ca6 [Liang-Chi Hsieh] Take care of special case.
b789f42 [Liang-Chi Hsieh] More proper unit test with random sparsity pattern.
c36be68 [Liang-Chi Hsieh] Add unit test for sqdist.
@viirya viirya deleted the faster_squareddistance branch December 27, 2023 18:31

4 participants