-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow, even with small dataset #3
Comments
I have plans to further optimize the |
Batch processing was the first thing that came to my mind. We just pulled your library down yesterday, so I think it's current. Maybe we (@odinolav) could help further things with you? |
Thanks, I am very open to collaborating with other developers. The latest release is |
For what it's worth, I'm first trying to speed up the batch test I have going, and I've found (unsurprisingly) that preemptively making sure at least two characters in any order match up can help quite a bit. At first I was thinking matching by two consecutive characters could be safe... but then there are names like Jim => James. 2 non-consecutive seems safe though. |
Agreed, I think one or more matching consonant letters would also be a safe candidate filter. I will have to run some tests on my international names dataset to confirm. If there are a few edge cases where it doesn't work, they can even be hard-coded into the algorithm. |
I ended up using a modified version of the Soundex algorithm to filter for disjointed encodings. This candidate filter is a new feature in Note: This alone will not make the cartesian product of two arrays in the millions computable, as the implementation is still O(m x n). I will be labelling this issue as help wanted for anyone who can improve the performance of the library. |
We built a 5000 row proof of concept that searches first and last name only, and it takes about a minute to show a result. Our implementation would need to search a few million rows. Do you have any suggestions on how to improve performance?
The text was updated successfully, but these errors were encountered: