Heuristics for testing search

Julian Harty edited this page Dec 18, 2017 · 2 revisions

Testing Search effectively is hard, especially given the depth and breadth of content available for Kiwix. The difficulty is compounded by the challenge of making Search performant, reliable, and relevant across the vast range of Android devices and the range of languages in use. It is therefore hard to specify precisely how we expect Search to behave and perform. These heuristics are intended as a starting point: they should generally hold true, yet you're welcome to adapt, reject, or ignore those that don't seem useful.

Note: this wiki page remains a work in progress. Contributions are welcome.

Expectations

  • Kiwix will be able to search text-based content in ZIM files available to the app. Some storage locations don't seem to be available in practice, e.g. OTG-connected storage. We don't expect Kiwix to be able to search content it cannot access directly.
  • Searches will be based on the ZIM files currently available on the device at runtime. Users may delete files, add files, replace memory cards, etc., both while Kiwix is running and between sessions. When users change what's available while Kiwix is running, we expect Kiwix to adapt without needing to be restarted.
  • Searches will be possible in the language of the content; users will be able to input characters in that language, e.g. in Japanese for Japanese content, regardless of what language the device is configured to use.
  • Users will not be left with a blank results page. If Search doesn't find any results, it will tell the user so.
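The expectation that Kiwix adapts when ZIM files are added or removed at runtime can be explored with a before/after snapshot. The following is a minimal sketch for a desktop test harness, not the Android storage code: the helper names available_zim_files and diff_snapshots are hypothetical, and it assumes ZIM files can be recognised by a .zim extension.

```python
from pathlib import Path


def available_zim_files(roots):
    """Return the set of .zim paths currently readable under the given roots."""
    found = set()
    for root in roots:
        root = Path(root)
        if not root.is_dir():
            # Storage that was removed or cannot be accessed is skipped.
            continue
        for path in root.rglob("*.zim"):
            if path.is_file():
                found.add(path.resolve())
    return found


def diff_snapshots(before, after):
    """Describe what changed between two snapshots of available ZIM files."""
    return {"added": after - before, "removed": before - after}
```

A tester could take one snapshot, change the storage (delete a file, swap a memory card), take a second snapshot, and then check that searches reflect the reported additions and removals without restarting the app.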

Search heuristics

  • Whitespace is allowed, and the first whitespace character between words is significant. Additional whitespace is silently ignored for the purposes of search results, e.g. "white space" (one space) and "white   space" (several spaces) are considered equivalent.
  • The first whitespace character at the end of a word in the search box is significant, e.g. "go" may return different results from "go " (with a trailing space): "go" would match good, whereas "go " would not.
  • Top online search terms for Wikimedia sources will be found (and matched) when searching the equivalent ZIM file in Kiwix-Android. There may be exceptions for highly topical searches e.g. in response to breaking news.
  • As more characters are entered there will be fewer results; as characters are removed from the end of the search term, more results will be returned. The numbers will be broadly symmetric, e.g. for the sequence of queries fun -> fund -> fun, similar results would be returned for both fun queries and fewer for fund, because fun matches function whereas fund does not.
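The whitespace and prefix heuristics above can be made concrete with a toy model. This is a sketch under stated assumptions, not how Kiwix implements search: normalise_query and toy_search are hypothetical names, matching is done against a plain list of titles, and leading whitespace is dropped because the heuristics don't say how it should be treated.

```python
def normalise_query(query):
    """Collapse whitespace runs to one space; keep at most one trailing space."""
    collapsed = " ".join(query.split())
    if collapsed and query[-1:].isspace():
        collapsed += " "
    return collapsed


def toy_search(titles, query):
    """Toy matcher standing in for the real search (hypothetical).

    Without a trailing space the final word is treated as a prefix, so "go"
    matches "Good". With a trailing space every word must match in full.
    """
    q = normalise_query(query).lower()
    words = q.split()
    if not words:
        return []
    if q.endswith(" "):
        complete, prefix = words, None
    else:
        complete, prefix = words[:-1], words[-1]
    hits = []
    for title in titles:
        title_words = title.lower().split()
        if not all(w in title_words for w in complete):
            continue
        if prefix is not None and not any(t.startswith(prefix) for t in title_words):
            continue
        hits.append(title)
    return hits


titles = ["Go", "Good", "Fun", "Fund", "Function", "White space"]
print(toy_search(titles, "go"))    # prefix match on the last word
print(toy_search(titles, "go "))   # trailing space: whole word only
print(len(toy_search(titles, "fun")), len(toy_search(titles, "fund")))
```

The same sequence of queries (fun -> fund -> fun) can then be replayed against the real app to see whether result counts shrink and recover in the same way.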

Possible sources of top online search terms include:

Consistencies (inconsistencies) in search

Kiwix has been available for many years, as have ZIM files. Over the years the ZIM file format has been extended and modified. So has the way Searching has been implemented. Some of the older software and older ZIM files may behave differently. At some point we may be able to provide a matrix of software and ZIM file formats and how the intersections behave in various ways, including searching and search results. For now, let's remember there are likely to be differences and note these as part of investigating ways to improve search and search results.

Comparative testing

Generally, Search should be consistent across all Kiwix apps, servers and utilities. There may be valid reasons for some differences e.g. related to UX expectations, performance, etc.

Kiwix includes various command-line tools, among them kiwix-search and zimsearch. We've decided to use kiwix-search as the reference for testing the core search capabilities.

The version I'm currently using is from: http://download.kiwix.org/nightly/2017-12-17/kiwix_tools_linux64_2017-12-17.tar.gz
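One way to compare results from the reference tool with those from an app is to capture both as ranked lists of titles and summarise the differences. The sketch below assumes the titles have already been collected by hand or by some other means; compare_results and its report keys are hypothetical, and it does not invoke kiwix-search itself.

```python
def compare_results(reference, candidate, top_n=10):
    """Compare two ranked lists of result titles over their first top_n entries."""
    ref_top = list(reference[:top_n])
    cand_top = list(candidate[:top_n])
    ref_set, cand_set = set(ref_top), set(cand_top)
    shared = ref_set & cand_set
    union = ref_set | cand_set
    return {
        "missing_from_candidate": [t for t in ref_top if t not in cand_set],
        "extra_in_candidate": [t for t in cand_top if t not in ref_set],
        # Jaccard overlap: 1.0 means the same titles, 0.0 means none shared.
        "overlap": len(shared) / len(union) if union else 1.0,
        # True when the shared titles appear in the same relative order.
        "same_order": [t for t in ref_top if t in shared]
        == [t for t in cand_top if t in shared],
    }
```

Differences flagged this way are prompts for investigation rather than failures, since there may be valid reasons for an app to rank or filter results differently.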

Unknown behaviours (yet)

The following are unknown, at least from my perspective. Hopefully we will be able to clarify the expected/desired/actual behaviours for these soon.

  • Whether accents are significant in either the term entered or the content matched.
  • Whether commonly paired words such as white space, white-space and whitespace are considered to be equivalent in either the term entered or the content matched.
  • Whether common abbreviations will be supported and matched with the unabbreviated form e.g. WW2 and World War Two.
  • Whether users can enter special characters or otherwise control the behaviours of the search e.g. in terms of case sensitivity, boolean operations, wildcards, etc.
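To probe these unknowns, it can help to generate variants of a term and compare what each one returns. The following is a minimal sketch: strip_accents and query_variants are hypothetical helper names, and the accent stripping is only suitable for scripts where accents are optional decorations. In Japanese, for example, it would also remove voicing marks that change the meaning of a character.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks, e.g. 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose so that scripts such as Hangul are not left decomposed.
    return unicodedata.normalize("NFC", stripped)


def query_variants(term):
    """Return distinct spacing, hyphenation, case, and accent variants of a term."""
    words = term.replace("-", " ").split()
    candidates = [
        term,
        " ".join(words),
        "-".join(words),
        "".join(words),
        term.lower(),
        term.upper(),
        strip_accents(term),
    ]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    return list(dict.fromkeys(candidates))


print(query_variants("white space"))
```

Running each variant through Search and noting which ones return the same results would help fill in the answers to the questions listed above.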