Slow, even with small dataset #3

ericcbohn · 2020-08-27T17:08:01Z

We built a 5,000-row proof of concept that searches first and last names only, and it takes about a minute to return a result. Our implementation would need to search a few million rows. Do you have any suggestions for improving performance?

Christopher-Thornton · 2020-08-27T18:54:55Z

I plan to further optimize the Matcher class and add optional parameters that use heuristics to quickly filter down potential candidates (e.g. requiring a sequence of two or more matching characters). There is some overhead when loading libraries and models, which I hope wasn't included in the time you measured. Another idea I am considering is a batch/multiprocessing option.
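To check whether load overhead is inflating the measurement, one-time setup and per-query work can be timed separately. This is only a minimal sketch: `load_model` is a hypothetical stand-in for whatever expensive setup a matcher performs, not a function from this library.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def load_model():
    """Hypothetical stand-in for one-time setup (imports, model loading)."""
    time.sleep(0.05)  # simulate expensive loading
    # Return a trivial comparison function in place of a real matcher.
    return lambda a, b: a.lower() == b.lower()


# Time the setup once, then time a single query on its own.
model, load_seconds = timed(load_model)
matched, query_seconds = timed(model, "Jim", "jim")

print(f"load: {load_seconds:.3f}s, query: {query_seconds:.6f}s")
```

If the load time dominates, the cost is paid once per process rather than once per comparison, which changes what needs optimizing.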

ericcbohn · 2020-08-27T19:05:18Z

Batch processing was the first thing that came to my mind. We just pulled your library down yesterday, so I think it's current. Maybe we (@odinolav) could help you move things forward?

Christopher-Thornton · 2020-08-27T19:34:14Z

Thanks, I am very open to collaborating with other developers. The latest release is v0.1.6 which I uploaded last night.

odinolav · 2020-08-27T19:39:37Z

For what it's worth, I'm first trying to speed up the batch test I have going, and I've found (unsurprisingly) that checking up front that at least two characters match, in any order, helps quite a bit. At first I thought requiring two consecutive matching characters would be safe... but then there are names like Jim => James. Two non-consecutive characters seems safe, though.
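The check described above could be sketched roughly as follows. The function name `shares_two_chars` is made up for illustration, and counting repeated letters as a multiset is an assumption about what "two characters" means here.

```python
from collections import Counter


def shares_two_chars(a: str, b: str, minimum: int = 2) -> bool:
    """Return True if a and b have at least `minimum` letters in common,
    ignoring case and position (multiset intersection)."""
    letters_a = Counter(c for c in a.lower() if c.isalpha())
    letters_b = Counter(c for c in b.lower() if c.isalpha())
    # Counter & Counter keeps the minimum count of each shared letter.
    return sum((letters_a & letters_b).values()) >= minimum


# Only pairs that pass this cheap test would go on to the expensive comparison.
print(shares_two_chars("Jim", "James"))
print(shares_two_chars("Al", "Bob"))
```

Jim and James share "j" and "m" without sharing any consecutive pair, which is why the order-free version keeps that pair while a consecutive-character test would drop it.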

Christopher-Thornton · 2020-08-27T19:48:40Z

Agreed, I think one or more matching consonant letters would also be a safe candidate filter. I will have to run some tests on my international names dataset to confirm. If there are a few edge cases where it doesn't work, they can even be hard-coded into the algorithm.
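A consonant-based variant of the filter might look like the sketch below. Treating "y" as a consonant and handling only ASCII letters are simplifying assumptions, and `shares_consonant` is an illustrative name rather than anything in the library.

```python
VOWELS = set("aeiou")


def consonants(name: str) -> set:
    """Set of lowercase ASCII consonants in a name ('y' counted as a consonant)."""
    return {c for c in name.lower() if c.isascii() and c.isalpha() and c not in VOWELS}


def shares_consonant(a: str, b: str) -> bool:
    """Return True if the two names have at least one consonant in common."""
    return not consonants(a).isdisjoint(consonants(b))


print(shares_consonant("Bill", "William"))
print(shares_consonant("Al", "Bob"))
```

Names made up entirely of vowels would never pass this test, which is one of the edge cases that would need separate handling.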

Christopher-Thornton · 2020-09-23T16:11:13Z

I ended up using a modified version of the Soundex algorithm to filter out candidate pairs whose encodings are disjoint. This candidate filter is new in v0.1.8 and is enabled by default: Matcher(prefilter=True).
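For readers unfamiliar with Soundex, here is a sketch of the standard American Soundex encoding plus one possible reading of a "disjoint encodings" test. The library's modified version is not shown in this thread, so the `passes_prefilter` rule below (reject a pair only when the consonant digits of the two codes share nothing) is an assumption for illustration, not the actual implementation.

```python
# Standard Soundex consonant groups.
_GROUPS = {"1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r"}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Standard 4-character American Soundex code (ASCII letters only)."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w are skipped and do not separate equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (encoded + "000")[:4]


def passes_prefilter(a: str, b: str) -> bool:
    """Assumed rule: keep the pair unless the consonant digits are disjoint."""
    digits_a = set(soundex(a)[1:]) - {"0"}
    digits_b = set(soundex(b)[1:]) - {"0"}
    if not digits_a or not digits_b:
        return True  # nothing to compare, so stay conservative and keep the pair
    return not digits_a.isdisjoint(digits_b)


print(soundex("Jim"), soundex("James"), passes_prefilter("Jim", "James"))
```

Under this rule Jim (J500) and James (J520) share the digit 5 and are kept, while Al (A400) and Bob (B100) share nothing and are dropped before the expensive comparison.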

Note: This alone will not make the Cartesian product of two arrays with millions of entries computable, as the implementation is still O(m × n). I am labelling this issue as help wanted for anyone who can improve the library's performance.
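One common way to avoid scanning the full Cartesian product is blocking: index one list by some cheap key and compare each name only against entries that share a key. The sketch below is not part of the library; `candidate_pairs` and the consonant-based key function are illustrative assumptions, and any key (a phonetic code, character n-grams) could be substituted.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator, Sequence, Tuple


def candidate_pairs(
    left: Sequence[str],
    right: Sequence[str],
    keys_of: Callable[[str], Iterable[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield (left_name, right_name) pairs that share at least one blocking key."""
    # Build an inverted index over the right-hand list: key -> positions.
    index = defaultdict(list)
    for j, name in enumerate(right):
        for key in set(keys_of(name)):
            index[key].append(j)
    # Look up each left-hand name instead of scanning all of `right`.
    for name in left:
        seen = set()
        for key in set(keys_of(name)):
            for j in index.get(key, ()):
                if j not in seen:  # emit each pair once even if several keys match
                    seen.add(j)
                    yield name, right[j]


def consonant_keys(name: str) -> set:
    """Hypothetical blocking key: the set of consonants in the name."""
    return {c for c in name.lower() if c in "bcdfghjklmnpqrstvwxyz"}


pairs = set(candidate_pairs(["Jim", "Al"], ["James", "Bob", "Alan"], consonant_keys))
print(pairs)
```

The worst case is still O(m × n) when a key is shared by most names, so the benefit depends on how selective the key is. A more selective key is faster but risks dropping true matches.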

Christopher-Thornton added the "help wanted" label · Sep 23, 2020
