Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update lucene to version 8.11.2 #15

Closed
wants to merge 18 commits into from
Closed

Conversation

JJK96
Copy link

@JJK96 JJK96 commented Jul 8, 2024

This gave access to some new features in Lucene, such as Regular Expression search. This is a major refactor because I updated Lucene 5 major versions.

I tested several languages, English, Czech, Chinese, Japanese, Thai and search works in these languages. I am not capable to test if the stemming is good for all languages, so some more testing by native speakers is necessary.

@@ -17,7 +17,7 @@
* © CrossWire Bible Society, 2008 - 2016
*
*/
package org.crosswire.jsword.index.lucene.analysis;
package org.apache.lucene.analysis;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these moved away from our namespace (org.crosswire)?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I needed to access some protected methods of Lucene classes in order to implement AbstractBookAnalyzer. Access to protected is only allowed in the same namespace.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm then solution is somewhat hacky. Options to consider:

  1. Fork lucene analysis lib and remove protected from that particular class (and make upstream PR). Use fork while it is needed.
  2. Maybe protected is for a reason? Use some other way if lib author suggest something.
  3. Accept hackyness and just leave it like this.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that there's public abtract class StopWordAnalyzerBase that probably could be used as a base class. At least that is the baseclass within lucene core lib that is used for per-language classes there.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question arises are all custom per-language analyzers still really needed or could we simplify code by using analyzers from lucene core directly.

AbstractBookAnalyzer carries book info and it is passed to some filter classes, but any of those does not seem to use that information. I am having a feeling that all that could be simplified greatly.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I'll look into it

import org.apache.lucene.util.Version;

/**
* An Analyzer whose {@link TokenStream} is built from a
* {@link ArabicLetterTokenizer} filtered with {@link LowerCaseFilter},
* {@link StandardTokenizer} filtered with {@link LowerCaseFilter},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • arabic need to be tested

@tuomas2
Copy link

tuomas2 commented Jul 19, 2024

@JJK96 added you write access to this repository (and AndBible dev team members too, similar to AndBible repository. Pushed what we have to update_lucene branch locally and created #16 which will replace this PR. You can push your commits directly to update_lucene branch here.

@tuomas2 tuomas2 closed this Jul 19, 2024
@tuomas2

This comment was marked as duplicate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Closed
Development

Successfully merging this pull request may close these issues.

2 participants