Inf points causes corruption with BallTree #78
Comments
Hmm... in another case I observed the calculated nearest neighbor distance vary between runs even though the index was correct. Couldn't get it to reproduce reliably. However, here's a reproduction of the above:
Seems like something is wrong for
Just for the record, I realized that the
Hello! I'm experiencing similar problems with BallTrees on perfectly finite data; it also seems to happen when some of the points the tree is built from are very close together, almost equal, or equal. Should I try to extract the exact data that triggers the error? (It's buried pretty deep right now.) Thanks
Having a small reproducer would make it easier to get to the bottom of the issue here.
OK, will try ASAP (the tooling around it is a bit complicated)
I was trying to reproduce the issue by generating random data, but it seems fine so far. Can you at least say the number of points and dimensions you had a problem with? I noticed before that the size of the array can influence how likely the problem is to trigger.
Hi -- sorry, I kind of missed my reminder and now I see it's already been a year :D The data was around 30 dimensions, with roughly thousands of points, not more. You might have luck reproducing the issue with identical or almost-identical points: try making a BallTree out of a random cloud where all values are divided by 10^9, 10^10, etc. I'll add this to the TODOs for next week and will let you know.
I may have a MWE for this issue:
Example output:
A number of dimensions as low as 1 works. The number of points is a little more finicky, but with 11 it's pretty reliable at causing the problem. With 10 or 12 you don't see it. I believe with more points you need more "Infs" to cause the issue. Most of the time either the first point or the last gets picked, but you can get points within the valid range as well, as in the example above. With a single "Inf" there's no issue; you need at least 2.
Interesting, I guess my issue is a bit different then, I'm pretty sure I had no
I've looked more into this, but I'm not very knowledgeable about ball trees. I think the issue with
Not sure about @exaexa's report. I couldn't find issues with regular numbers so far, so maybe it's some subtle floating-point arithmetic thing?
Could we just be a bit careful so instead of getting a center at Inf with a NaN radius, you get a ball at
I'm not sure there's a way to make it work naturally... If you think about it, going for a neat solution, moving a single point to infinity should maybe cause the hypersphere to turn into a hyperplane touching the first inline point of the set in the normal direction. With multiple Inf, things start getting complicated, though. And this does not fit with the idea that by making points Inf we are kind of "discarding" them. They may not be returned in the query, for sure, but this seems to contradict the assumptions of a BallTree very strongly. I'm not sure what can be done, though. Should we ignore points with Inf or just leave it up to the user? We might be able to check when the mean is computed, but perhaps the best option would be a first pass looking for points to be discarded.
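To make the "center at Inf with a NaN radius" effect and the "first pass" idea concrete, here is a minimal sketch in plain Python. It is not the library's code: it assumes a ball whose center is the component-wise mean of its points, and the helper names (`centroid`, `euclidean`, `finite_indices`) are made up for illustration.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length coordinate tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def euclidean(a, b):
    """Euclidean distance, written out so the Inf/NaN arithmetic is visible."""
    return math.sqrt(sum((x - y) * (x - y) for x, y in zip(a, b)))

def finite_indices(points):
    """Hypothetical first pass: indices of points with only finite coordinates."""
    return [i for i, p in enumerate(points) if all(math.isfinite(c) for c in p)]

# 1-D points, two of them "removed" by setting them to Inf.
points = [(0.0,), (1.0,), (math.inf,), (2.0,), (math.inf,)]

# Naive ball: the mean is Inf, and Inf - Inf gives NaN for the Inf points.
center = centroid(points)
dists = [euclidean(p, center) for p in points]

# Ball built only from the finite points; `keep` maps back to original indices.
keep = finite_indices(points)
finite_points = [points[i] for i in keep]
finite_center = centroid(finite_points)
finite_radius = max(euclidean(p, finite_center) for p in finite_points)
```

Because every comparison with NaN is false, any radius computed as a maximum over `dists` depends on the order in which the points are visited, which may be related to the unstable results reported here. Filtering first keeps the ball finite while `keep` preserves the original indexing.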
To implement an exclusive nearest neighbor algorithm, I 'removed' points from my points array by setting their values to Inf and rebuilding the BallTree from the edited array. I did this to avoid copying the array each time and to preserve indexing.

However, the results output by knn are incorrect and unstable. Once the tree is created, it will always produce the same wrong output. However, regenerating the tree on the same data and trying knn again can produce different wrong output. The indexes are often some absurdly large integer such as 874275200 with an Inf distance. This is despite the fact that there are definitely points with non-infinite values in the tree that should be chosen as nearest neighbors.

The same dataset produces correct and consistent results when using BruteTree and KDTree.

I'll see if I can provide a reproduction. The issue didn't start to occur until ~half the points were Inf.
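For reference, the goal described above (excluding points without copying the array or shifting indices) can also be expressed without writing Inf into the data. The following is a minimal brute-force sketch in plain Python, not the library's API; `knn_brute` and its `skip` parameter are hypothetical names for this illustration.

```python
import math

def knn_brute(points, query, k, skip=lambda i: False):
    """Return up to k (index, distance) pairs nearest to query.

    Indices for which skip(i) is True are ignored, so the points
    array is never modified and the original indexing is preserved.
    """
    candidates = [
        (i, math.dist(p, query))
        for i, p in enumerate(points)
        if not skip(i)
    ]
    candidates.sort(key=lambda pair: pair[1])
    return candidates[:k]

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 3.0)]
removed = {1}  # indices treated as "removed"

result = knn_brute(points, (0.9, 0.0), 2, skip=lambda i: i in removed)
```

Here index 1 would be the true nearest neighbor, but it is skipped, so the result contains indices 0 and 2 with their ordinary finite distances. A brute-force answer like this can also serve as a reference when checking a tree's output.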