Debugging Luwak
So you’ve added a bunch of queries to your luwak Monitor, and you’ve tried matching some documents against them, and you’re not getting the exact answers you want. How can you find out what’s going wrong?
Rather than abort a match run due to one bad query, luwak saves exceptions that are thrown during matching and reports them as part of the final Matches object. So if you find a query isn’t matching a document, and you expect it to be, have a look at the return value of Matches.getErrors() and see if the query is throwing an Exception.
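The collect-and-continue behaviour described above can be illustrated with a small, self-contained sketch. It does not use luwak's own classes: `ErrorCollectingSketch`, `Result` and `runAll` are hypothetical names, and each "query" is modelled as a plain predicate over the document text.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; these are not luwak classes.
final class ErrorCollectingSketch {

    /** Outcome of one match run: ids that matched, plus the exception thrown by each failing query. */
    static final class Result {
        final List<String> matches = new ArrayList<>();
        final Map<String, Exception> errors = new LinkedHashMap<>();
    }

    /** Runs every query against the document; a failing query is recorded, not rethrown. */
    static Result runAll(Map<String, Predicate<String>> queries, String document) {
        Result result = new Result();
        for (Map.Entry<String, Predicate<String>> entry : queries.entrySet()) {
            try {
                if (entry.getValue().test(document)) {
                    result.matches.add(entry.getKey());
                }
            } catch (RuntimeException e) {
                // Record the failure against the query id and carry on with the next query.
                result.errors.put(entry.getKey(), e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Predicate<String>> queries = new LinkedHashMap<>();
        queries.put("good", doc -> doc.contains("periwinkle"));
        queries.put("bad", doc -> { throw new IllegalStateException("broken query"); });

        Result result = runAll(queries, "a periwinkle document");
        System.out.println("matches: " + result.matches);
        System.out.println("errors: " + result.errors.keySet());
    }
}
```

The point of the pattern is that a query which silently never matches may in fact be failing on every document, which is why the error list is the first place to look.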
Quite often, matching errors are in fact down to problems with tokenization during document analysis. To check that a query is actually matching once it has been selected by the presearcher, you can run your query directly against the searcher for a particular batch:
TopDocs td = batch.getSearcher().search(query, 10);
Luwak speeds up matching by analysing queries as they are added to the Monitor, and then only selecting those queries that it views as likely to match a given document to actually run at match time. This is fertile ground for bugs. To ensure that your query is actually being selected by the presearcher, you can do one of two things:
- check the getPresearcherHits() values on your Matches response
- run your Monitor with a MatchAllPresearcher to ensure that every query is selected for matching
If the presearcher isn’t selecting your query, and it should be, then you have a bug.
The standard presearcher shipped with luwak is the TermFilteredPresearcher, which works by analysing queries as they are added to the Monitor and extracting combinations of terms that a document must have in order to match the query. Internally, a query is mapped to a tree-like structure called a QueryTree by a QueryTreeBuilder, and then terms are extracted from this tree using a TreeWeightor. The MultipassTermFilteredPresearcher does this several times, extracting different combinations of terms each time. Bugs can occur here if queries are not analysed correctly.
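To make the idea concrete, here is a minimal sketch of term-based preselection. It is a simplification, not luwak's implementation: `PresearchSketch`, `addQuery` and `candidates` are hypothetical names, each query is assumed to have already been reduced to a flat set of selected terms, and a query is assumed to be selected when the document contains at least one of those terms.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; the real presearcher works on a QueryTree.
final class PresearchSketch {

    // Inverted index from an extracted term to the ids of the queries indexed under it.
    private final Map<String, Set<String>> queriesByTerm = new HashMap<>();

    /** Indexes a query under the terms selected for it at registration time. */
    public void addQuery(String queryId, String... selectedTerms) {
        for (String term : selectedTerms) {
            queriesByTerm.computeIfAbsent(term, t -> new HashSet<>()).add(queryId);
        }
    }

    /** Returns the ids of queries worth running: those sharing at least one indexed term with the document. */
    public Set<String> candidates(String document) {
        Set<String> selected = new TreeSet<>();
        // Crude stand-in for document analysis: lowercase, then split on non-word characters.
        for (String token : document.toLowerCase().split("\\W+")) {
            Set<String> ids = queriesByTerm.get(token);
            if (ids != null) {
                selected.addAll(ids);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        PresearchSketch presearcher = new PresearchSketch();
        presearcher.addQuery("q1", "periwinkle", "flibbertigibbet");
        presearcher.addQuery("q2", "verbiage");
        System.out.println(presearcher.candidates("A flibbertigibbet wrote this"));
    }
}
```

In this sketch a document containing only "Periwinkles" selects nothing, because the token `periwinkles` is not an indexed term. A mismatch of that kind between how the query's terms were extracted and how the document was tokenized is one way a query can fail to be selected.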
You can get an explanation of how a particular query is being analysed using TermFilteredPresearcher.showQueryTree(Query, PrintStream). This will write a schematic representation of the analysis to a print stream, showing the terms taken from a query, the weights assigned to those terms, and the subset of terms ultimately selected for indexing.
As an example, take the query +field:horsten field:thurston +(+(field:periwinkle field:flibbertigibbet) +field:verbiage). Calling showQueryTree with this query yields the following:
Conjunction[2] 3.8506389 [EXACT field:periwinkle, EXACT field:flibbertigibbet]
    Conjunction[2] 3.8506389 [EXACT field:periwinkle, EXACT field:flibbertigibbet]
        Disjunction[2] 3.8506389 { [EXACT field:periwinkle] [EXACT field:flibbertigibbet] }
            Node [EXACT field:periwinkle] 3.8506389
            Node [EXACT field:flibbertigibbet] 3.966673
        Node [EXACT field:verbiage] 3.7278461
    Node [EXACT field:horsten] 3.6326308
The top level is a conjunction node with two entries. The additional SHOULD clause field:thurston is discarded, because it only matches if other terms are present: any matching document must already contain the required terms, so the optional term adds nothing to selection. Stepping down through the hierarchy, we can see at each level which terms are selected by that node, with their weights. A conjunction selects whichever of its child nodes has the highest weight, while a disjunction selects all of its child nodes and assigns itself the lowest of their weights. If a query is not being broken up correctly, or its terms are somehow being mangled, you should be able to see it here.
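The selection rule described above can be modelled in a few lines. This is a sketch of the rule as stated here, not luwak's QueryTree or TreeWeightor code: `QueryTreeSketch`, `Node`, `term`, `conjunction` and `disjunction` are hypothetical names, and the weights are copied from the example output rather than computed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; these are not luwak classes.
final class QueryTreeSketch {

    /** A node reports its weight and the terms it would contribute to the index. */
    interface Node {
        double weight();
        List<String> terms();
    }

    /** Leaf node: a single term with a fixed weight. */
    static Node term(String term, double weight) {
        return new Node() {
            public double weight() { return weight; }
            public List<String> terms() { return Collections.singletonList(term); }
        };
    }

    /** Conjunction: defers to the child with the highest weight. */
    static Node conjunction(Node... children) {
        Node best = Collections.max(Arrays.asList(children), Comparator.comparingDouble(Node::weight));
        return new Node() {
            public double weight() { return best.weight(); }
            public List<String> terms() { return best.terms(); }
        };
    }

    /** Disjunction: keeps every child's terms and takes the lowest child weight. */
    static Node disjunction(Node... children) {
        double lowest = Double.POSITIVE_INFINITY;
        List<String> all = new ArrayList<>();
        for (Node child : children) {
            lowest = Math.min(lowest, child.weight());
            all.addAll(child.terms());
        }
        final double lowestWeight = lowest;
        return new Node() {
            public double weight() { return lowestWeight; }
            public List<String> terms() { return all; }
        };
    }

    public static void main(String[] args) {
        // The example query, with the discarded SHOULD clause field:thurston left out.
        Node tree = conjunction(
            conjunction(
                disjunction(term("field:periwinkle", 3.8506389), term("field:flibbertigibbet", 3.966673)),
                term("field:verbiage", 3.7278461)),
            term("field:horsten", 3.6326308));
        System.out.println(tree.weight() + " " + tree.terms());
    }
}
```

Under these assumptions the root ends up with weight 3.8506389 and the terms field:periwinkle and field:flibbertigibbet, which agrees with the first line of the showQueryTree output.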