Update lucene to version 8.11.2 #15

JJK96 · 2024-07-08T20:04:11Z

This gives access to some new Lucene features, such as regular expression search. It is a major refactor because the update spans 5 major Lucene versions.

I tested several languages (English, Czech, Chinese, Japanese and Thai), and search works in all of them. I cannot judge whether the stemming is good for every language, so more testing by native speakers is necessary.

tuomas2 · 2024-07-15T17:17:30Z

src/main/java/org/apache/lucene/analysis/AbstractBookTokenFilter.java

@@ -17,7 +17,7 @@
 * © CrossWire Bible Society, 2008 - 2016
 *
 */
-package org.crosswire.jsword.index.lucene.analysis;
+package org.apache.lucene.analysis;


Why are these moved away from our namespace (org.crosswire)?

Because I needed to access some protected methods of Lucene classes in order to implement AbstractBookAnalyzer. Protected members are only accessible from subclasses or from classes in the same package, so I moved these classes into Lucene's package.

Hmm, then the solution is somewhat hacky. Options to consider:

Fork the Lucene analysis lib, remove protected from that particular class, and open an upstream PR. Use the fork for as long as it is needed.

Maybe protected is there for a reason? Use some other way if the lib author suggests one.

Accept the hackiness and just leave it like this.

I found out that there is a public abstract class StopwordAnalyzerBase that could probably be used as a base class. At least it is the base class in the Lucene core lib that the per-language analyzers there use.

The question arises whether all the custom per-language analyzers are still really needed, or whether we could simplify the code by using the analyzers from Lucene directly.

AbstractBookAnalyzer carries book info and passes it to some filter classes, but none of them seems to use that information. I have a feeling that all of this could be simplified greatly.

I agree, I'll look into it.
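
For readers following the thread, here is a minimal sketch of the base-class option discussed above. It uses only the Java standard library: ToyStopwordAnalyzerBase and ToyBookAnalyzer are hypothetical stand-ins, not Lucene classes, and the createTokens hook is invented for illustration (Lucene's real hook has a different signature). The point it tries to show is that a subclass can use the protected members of a library base class while staying in the project's own package, so the classes would not need to move into org.apache.lucene.analysis.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for a library base class such as Lucene's
// StopwordAnalyzerBase. It is NOT the Lucene API, only the same shape:
// a protected field and a protected hook meant for subclasses.
abstract class ToyStopwordAnalyzerBase {
    protected final Set<String> stopwords;

    protected ToyStopwordAnalyzerBase(Set<String> stopwords) {
        this.stopwords = Set.copyOf(stopwords);
    }

    // Subclasses decide how text is split and filtered.
    protected abstract List<String> createTokens(String text);

    // Public entry point used by callers.
    public final List<String> analyze(String text) {
        return createTokens(text);
    }
}

// Hypothetical project-side analyzer. In a real project it would live in
// the project's own package (for example under org.crosswire) and would
// still see the protected members, because it is a subclass.
final class ToyBookAnalyzer extends ToyStopwordAnalyzerBase {
    ToyBookAnalyzer(Set<String> stopwords) {
        super(stopwords);
    }

    @Override
    protected List<String> createTokens(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on runs of characters that are neither letters nor digits.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty() && !stopwords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}

class AnalyzerSketch {
    public static void main(String[] args) {
        ToyBookAnalyzer analyzer = new ToyBookAnalyzer(Set.of("the", "of"));
        // Prints [book, psalms]
        System.out.println(analyzer.analyze("The Book of Psalms"));
    }
}
```

Whether the real analyzers can be restructured this way depends on which protected Lucene members AbstractBookAnalyzer actually needs, which this thread does not spell out.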

tuomas2 · 2024-07-15T17:19:28Z

src/main/java/org/apache/lucene/analysis/ArabicLuceneAnalyzer.java

 import org.apache.lucene.util.Version;

 /**
 * An Analyzer whose {@link TokenStream} is built from a
- * {@link ArabicLetterTokenizer} filtered with {@link LowerCaseFilter},
+ * {@link StandardTokenizer} filtered with {@link LowerCaseFilter},


Arabic needs to be tested.

tuomas2 · 2024-07-19T16:03:51Z

@JJK96 I gave you write access to this repository (and to the AndBible dev team members too, similar to the AndBible repository). I pushed what we have to an update_lucene branch in this repository and created #16, which will replace this PR. You can push your commits directly to the update_lucene branch here.

tuomas2 and others added 18 commits June 6, 2024 11:45

Pull translations

8479cf7

Compiles

8258128

Uncleaned version that supports regex searching

41a8b6d

For regex queries search in full non-canonical text, while for other queries, search as before

fbeaac7

Add switch for regex search type

982ce80

Make Regex search case insensitive

4239e9c

Fix Thai analyzer

4c92c9c

Fix Hebrew analyser

a06ecda

Fix Arabic

c784ccc

Fix Persian

7c43cca

Remove local.properties

d7616bc

Fix analyzer references

02fa61f

Fix tests

54c73b6

Add local.properties to gitignore

a4f26c2

Add smartcn analyzer

c3933c7

Fix Chinese and Japanese

d26a312

Fix French stemmer test

f00f512

I think "a" is not a stop word in this context, because it is a verb here. But my French is not that good.

Fix all tests

f355696

I don't speak all of these languages, so I sometimes just changed the test to reflect the output. At least that should prevent regression.
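
The approach described in the last commit message, recording the current output as the expected value, is usually called a characterization test. A minimal sketch under stated assumptions: the analyze method below is a hypothetical toy (lowercase, then strip a trailing "s"), not one of the JSword or Lucene analyzers, and the recorded list stands for output captured once from the current implementation without a native speaker checking it.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

class CharacterizationSketch {
    // Hypothetical stand-in for an analyzer whose "correct" output the
    // test author cannot judge. It lowercases each whitespace-separated
    // word and strips one trailing "s".
    static List<String> analyze(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .map(t -> t.toLowerCase(Locale.ROOT))
                .map(t -> t.length() > 1 && t.endsWith("s")
                        ? t.substring(0, t.length() - 1)
                        : t)
                .collect(Collectors.toList());
    }

    // True when today's output still equals the recorded output.
    static boolean matchesRecorded(String input, List<String> recorded) {
        return analyze(input).equals(recorded);
    }

    public static void main(String[] args) {
        // Recorded once from the current implementation; it documents
        // behaviour, it does not prove the behaviour is linguistically right.
        List<String> recorded = List.of("book", "chapter");
        if (!matchesRecorded("Books Chapters", recorded)) {
            throw new AssertionError("analyzer output changed");
        }
        System.out.println("output unchanged");
    }
}
```

Such a test only flags changes in behaviour. It cannot tell whether the recorded stems are good ones, which is why review by native speakers is still needed.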

JJK96 mentioned this pull request Jul 8, 2024

Regex search AndBible/and-bible#3287

Open

tuomas2 reviewed Jul 15, 2024

tuomas2 mentioned this pull request Jul 19, 2024

Update lucene to version 8.11.2 #16

Open

tuomas2 closed this Jul 19, 2024
