Duplicate entries in search results #980

Closed
BPerlakiH opened this issue Sep 15, 2024 · 2 comments · Fixed by #981

@BPerlakiH
Collaborator

BPerlakiH commented Sep 15, 2024

Based on the findings from: #979

Search results can contain duplicates, because the indexed search and the title search can return the same entries.
Additionally, some entries have duplicate URLs, one with and one without a trailing slash, e.g.:
zim://B2A69C3D-7852-400F-9A07-8986875DF683/solar.lowtechmagazine.com/tags/speed/
zim://B2A69C3D-7852-400F-9A07-8986875DF683/solar.lowtechmagazine.com/tags/speed
Both produce the same search result and link to the same page.
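
A minimal sketch of how both kinds of duplicates could be filtered: strip one trailing slash before comparing URLs, and keep only the first result per normalized URL. The `SearchResult` type and the `normalizeURL` / `deduplicate` helpers are hypothetical names for illustration, not the app's actual code:

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical result type; the app's real model is not shown in this issue.
struct SearchResult {
    std::string url;
    std::string title;
};

// Strip a single trailing slash so ".../speed/" and ".../speed" compare equal.
std::string normalizeURL(const std::string& url) {
    if (url.size() > 1 && url.back() == '/') {
        return url.substr(0, url.size() - 1);
    }
    return url;
}

// Keep the first occurrence of each normalized URL, preserving result order.
std::vector<SearchResult> deduplicate(const std::vector<SearchResult>& results) {
    std::unordered_set<std::string> seen;
    std::vector<SearchResult> unique;
    for (const auto& result : results) {
        // insert() reports via .second whether the key was newly added.
        if (seen.insert(normalizeURL(result.url)).second) {
            unique.push_back(result);
        }
    }
    return unique;
}
```

Applied to the merged title and indexed results, this would drop the second copy of an entry whether it came from the other search type or from the trailing-slash variant.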

Example on macOS:

Screenshot 2024-09-15 at 13 52 05
@BPerlakiH BPerlakiH added this to the 3.6.0 milestone Sep 15, 2024
@BPerlakiH BPerlakiH self-assigned this Sep 15, 2024
@BPerlakiH BPerlakiH changed the title from "Remove duplicate entry from search results" to "Duplicate entries in search results" Sep 15, 2024
@BPerlakiH BPerlakiH linked a pull request Sep 15, 2024 that will close this issue
@kelson42
Contributor

Here is a general remark. libzim provides two kinds of search:

  • Title suggestions
  • Fulltext searches

Usually it is either one or the other. Only the Apple reader somehow does a mix. I'm not super fond of this approach, honestly, as it creates a lot of new challenges.

I'm not against fixing this one, obviously, but I just want to share that this approach might disappear in the future.

@BPerlakiH
Collaborator Author

BPerlakiH commented Sep 15, 2024

@kelson42 I did some further investigation on this and found more issues. We have this:

if (archive.hasFulltextIndex()) {
    indexSearchArchives.push_back(archive);
}
titleSearchArchives.push_back(archive);

This indeed means we search both ways, as you wrote.

Currently, with the Wikipedia copy I have, the indexed search throws an exception:
DatabaseCorruptError: dir_end invalid in block 28240
This has the following consequence if I change it as you suggested, to use either the indexed search or the title search:

if (archive.hasFulltextIndex()) {
    indexSearchArchives.push_back(archive);
} else {
    titleSearchArchives.push_back(archive);
}

It won't give any results, since the indexed search fails and the title search never runs.

In addition, I found that we run the search on a set of archives, which is also not ideal:

  • if it throws an exception on one archive from the set, we lose the results from the whole set!

I am updating the PR to use one try / catch per archive. That way we can continue to get results even if one of the archives is "bad".
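
As a rough illustration of the per-archive try / catch idea (not the code in the PR), the sketch below runs a search callback on each archive separately and collects whatever succeeds. `searchEachArchive` and the `searchOne` callback are hypothetical names, and the catch-all handler is an assumption, since the concrete exception type hierarchy isn't shown in this thread:

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Run `searchOne` on each archive on its own, so that a failure in one
// archive only drops that archive's results instead of the whole set.
// `Archive` and `SearchFn` are generic placeholders; `searchOne` is assumed
// to return a std::vector<std::string> of result URLs for one archive.
template <typename Archive, typename SearchFn>
std::vector<std::string> searchEachArchive(const std::vector<Archive>& archives,
                                           SearchFn searchOne) {
    std::vector<std::string> results;
    for (const auto& archive : archives) {
        try {
            std::vector<std::string> partial = searchOne(archive);
            results.insert(results.end(), partial.begin(), partial.end());
        } catch (...) {
            // Assumption: any exception marks this archive as "bad" for this
            // search. Skip it and continue with the remaining archives.
        }
    }
    return results;
}
```

With this shape, an archive that throws something like the DatabaseCorruptError above contributes no results, while the other archives in the set are still searched.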
