ami words
NOTE: this is a `picocli` subcommand but is not fully integrated. It was developed from `AMIArgProcessor`/`WordArgProcessor`. It should be "relatively easy" to transfer to a Lucene pipeline under `ami word`. It is often run automatically in parallel with `ami search`, which complicates debugging. There should be no coupling.
WordCollectionFactory.transformWordStream(List<String>) line: 148
WordCollectionFactory.createWordList() line: 136
WordCollectionFactory.extractWords() line: 104
WordArgProcessor.extractWords() line: 168
<called from argProcessor or AMIWordsTool>
`WordArgProcessor.extractWords()` runs:

    getOrCreateWordCollectionFactory();
    wordCollectionFactory.extractWords();
`extractWords` essentially does the following: the `rawWords` are extracted by `currentCTree.extractWords()` and fed into `transformWordStream(rawWords)`:

    List<String> rawWords = currentCTree.extractWords();
    wordList = (rawWords == null) ? null : transformWordStream(rawWords);
    if (wordsTool != null && wordsTool.getVerbosityInt() >= 2) {
        LOG.debug("wordsTool " + wordList.size());
    }
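As quoted, `wordList` is `null` whenever `rawWords` is `null`, so the debug line would throw a `NullPointerException` if `wordsTool` is set and its verbosity is 2 or more. A minimal sketch of a guarded variant follows. The class `ExtractWordsSketch`, the `verbosity` parameter, and the lower-casing stand-in for `transformWordStream` are hypothetical and only illustrate the guard; they are not the project's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtractWordsSketch {

    // Hypothetical stand-in for transformWordStream(rawWords): it only lower-cases.
    public static List<String> transformWordStream(List<String> rawWords) {
        List<String> out = new ArrayList<>();
        for (String word : rawWords) {
            out.add(word.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Mirrors the quoted ternary (null in, null out) but guards the size lookup.
    public static List<String> extractWords(List<String> rawWords, int verbosity) {
        List<String> wordList = (rawWords == null) ? null : transformWordStream(rawWords);
        if (verbosity >= 2) {
            // Assumption: reporting 0 is acceptable when there are no raw words.
            int size = (wordList == null) ? 0 : wordList.size();
            System.out.println("wordsTool " + size);
        }
        return wordList;
    }

    public static void main(String[] args) {
        System.out.println(extractWords(Arrays.asList("Zika", "Virus"), 2));
        System.out.println(extractWords(null, 2));
    }
}
```

Whether returning `null` or an empty list is preferable depends on how callers treat a missing word list; the sketch keeps the `null` behaviour of the quoted code.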
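`transformWordStream(List<String>)` (old version) takes the token stream (as a `List`) and applies the transformations selected through the `AMIArgProcessor`: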

    AMIArgProcessor wordArgProcessor = (AMIArgProcessor) amiArgProcessor;
    if (amiArgProcessor.getChosenWordTypes().contains(AMIArgProcessor.ABBREVIATION)) {
        transformedWords = createAbbreviations(transformedWords);
    }
    if (amiArgProcessor.getChosenWordTypes().contains(AMIArgProcessor.CAPITALIZED)) {
        transformedWords = createCapitalized(transformedWords);
    }
    if (amiArgProcessor.getWordCaseList().contains(AMIArgProcessor.IGNORE)) {
        transformedWords = toLowerCase(transformedWords);
    }
    List<WordSetWrapper> stopWordSetList = wordArgProcessor.getStopwordSetList();
    for (WordSetWrapper stopWordSet : stopWordSetList) {
        transformedWords = applyStopwordFilter(stopWordSet, transformedWords);
    }
    if (amiArgProcessor.getStemming()) {
        transformedWords = LuceneUtils.applyPorterStemming(transformedWords);
    }
    return transformedWords;

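The stopword loop above applies each stopword set in turn, after the optional lower-casing step. The following is a minimal sketch of that idea, assuming a stopword set can be modelled as a plain `Set<String>` with exact, case-sensitive matching. `StopwordFilterSketch` and its methods are hypothetical; the real `WordSetWrapper` and `applyStopwordFilter` may behave differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopwordFilterSketch {

    // Assumption: a word is dropped only on an exact (case-sensitive) match.
    public static List<String> applyStopwordFilter(Set<String> stopwords, List<String> words) {
        List<String> kept = new ArrayList<>();
        for (String word : words) {
            if (!stopwords.contains(word)) {
                kept.add(word);
            }
        }
        return kept;
    }

    // Applies each stopword set in turn, as the loop in the quoted code does.
    public static List<String> filterAll(List<Set<String>> stopwordSets, List<String> words) {
        List<String> result = words;
        for (Set<String> stopwords : stopwordSets) {
            result = applyStopwordFilter(stopwords, result);
        }
        return result;
    }

    public static List<String> toLowerCase(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            out.add(word.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> stopwordSets = new ArrayList<>();
        stopwordSets.add(new HashSet<String>(Arrays.asList("the", "of")));
        stopwordSets.add(new HashSet<String>(Arrays.asList("interest")));
        List<String> words = Arrays.asList("The", "virus", "of", "interest");

        // Without lower-casing, "The" survives the case-sensitive filter.
        System.out.println(filterAll(stopwordSets, words));              // [The, virus]
        // Lower-casing first lets the filter catch it.
        System.out.println(filterAll(stopwordSets, toLowerCase(words))); // [virus]
    }
}
```

Under these assumptions, the order of the steps matters: lower-casing before filtering is what lets a lower-case stopword list catch sentence-initial words.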
#### List<String> transformWordStream(AMIWordsTool wordsTool, List<String> transformedWords)
new version
As above, but controlled by `picocli` options.

    if (wordsTool.isAbbreviation()) {
        transformedWords = createAbbreviations(transformedWords);
    }
    if (wordsTool.isCapital()) {
        transformedWords = createCapitalized(transformedWords);
    }
    if (wordsTool.isIgnoreCase()) {
        transformedWords = toLowerCase(transformedWords);
    }
    List<WordSetWrapper> stopWordSetList = wordsTool.getStopWordsSetList();
    for (WordSetWrapper stopWordSet : stopWordSetList) {
        transformedWords = applyStopwordFilter(stopWordSet, transformedWords);
    }
    if (wordsTool.isStemming()) {
        transformedWords = LuceneUtils.applyPorterStemming(transformedWords);
    }
    return transformedWords;
    }
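The flag-driven part of this method can be sketched in plain Java, without picocli or Lucene. Everything here is an assumption for illustration: `Options` is a hypothetical holder standing in for the picocli-populated `AMIWordsTool`, an "abbreviation" is taken to be a token of two or more upper-case letters, and a "capitalized" word is one whose first character is upper-case. The source does not show how `createAbbreviations` and `createCapitalized` are actually defined, and stemming is omitted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OptionPipelineSketch {

    // Hypothetical option holder; in the real tool these flags come from picocli options.
    public static class Options {
        public boolean abbreviation;
        public boolean capital;
        public boolean ignoreCase;
    }

    // Assumption: keep tokens of 2+ characters that are entirely upper-case letters.
    public static List<String> createAbbreviations(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            if (word.length() >= 2 && word.chars().allMatch(Character::isUpperCase)) {
                out.add(word);
            }
        }
        return out;
    }

    // Assumption: keep tokens whose first character is upper-case.
    public static List<String> createCapitalized(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                out.add(word);
            }
        }
        return out;
    }

    public static List<String> toLowerCase(List<String> words) {
        List<String> out = new ArrayList<>();
        for (String word : words) {
            out.add(word.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Same step order as the quoted method: case-based selection first, then lower-casing.
    public static List<String> transformWordStream(Options options, List<String> words) {
        List<String> result = new ArrayList<>(words);
        if (options.abbreviation) {
            result = createAbbreviations(result);
        }
        if (options.capital) {
            result = createCapitalized(result);
        }
        if (options.ignoreCase) {
            result = toLowerCase(result);
        }
        return result;
    }

    public static void main(String[] args) {
        Options options = new Options();
        options.abbreviation = true;
        options.ignoreCase = true;
        List<String> words = Arrays.asList("DNA", "Zika", "virus", "RNA");
        System.out.println(transformWordStream(options, words)); // [dna, rna]
    }
}
```

Under these assumptions the case-based selections have to run before lower-casing, since lower-casing discards the information they rely on. Keeping the flags in a small holder object also makes the pipeline testable without going through the command line.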