An open-source library for Automatic Term Recognition written in Scala.
To cite ATR4S:
N. Astrakhantsev. ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala. arXiv preprint arXiv:1611.07804, 2016.
- AvgTermFreq
- ResidualIDF
- TotalTF-IDF
- CValue
- Basic
- ComboBasic
- PostRankDC
- Relevance
- Weirdness
- DomainPertinence
- NovelTopicModel
- LinkProbability
- KeyConceptRelatedness
- Voting
- PU-ATR
Scala 2.11
Spark 1.5+ (for Voting and PU-ATR)
(Apache OpenNLP is also supported, but preliminary experiments showed that its quality is no better than that of Emory nlp4j, and it is not thread-safe; if you are going to use OpenNLP, download the models from Apache OpenNLP and place them into src/main/resources.)
(Stanford CoreNLP is also supported by this helper, which has been moved to a separate GPL-licensed module because Stanford CoreNLP itself is licensed under the GPL.)
To use some algorithms, you need to download auxiliary files and place them into the WORKING_DIRECTORY/data directory (the working directory can be specified in gradle.properties; by default, it is experiments), or specify the path in the corresponding configuration/builder class (e.g. Word2VecAdapterConfig of KeyConceptRelatedness).
Namely,
- for LinkProbability download info_measure.txt;
- for Relevance download COHA_term_occurrences.txt;
- for KeyConceptRelatedness download w2vConcepts.model.
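Before running these algorithms, it can help to confirm that the auxiliary files are where the library expects them. The following is a minimal sketch that uses only the Java standard library; the class name AuxiliaryFilesCheck and its method are hypothetical and are not part of ATR4S. It assumes the default layout described above (files placed directly in WORKING_DIRECTORY/data).

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of ATR4S.
public class AuxiliaryFilesCheck {
    // Auxiliary file needed by each algorithm, as listed above.
    static final Map<String, String> REQUIRED = new LinkedHashMap<>();
    static {
        REQUIRED.put("LinkProbability", "info_measure.txt");
        REQUIRED.put("Relevance", "COHA_term_occurrences.txt");
        REQUIRED.put("KeyConceptRelatedness", "w2vConcepts.model");
    }

    /** Returns the algorithms whose auxiliary file is absent from workingDir/data. */
    static List<String> missing(Path workingDir) {
        Path dataDir = workingDir.resolve("data");
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : REQUIRED.entrySet()) {
            if (!Files.exists(dataDir.resolve(entry.getValue()))) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "experiments" is the default working directory mentioned above.
        Path workingDir = Paths.get(args.length > 0 ? args[0] : "experiments");
        for (String algorithm : missing(workingDir)) {
            System.out.println("Missing auxiliary file for " + algorithm
                    + ": " + REQUIRED.get(algorithm));
        }
    }
}
```

If you pass the path through a configuration/builder class instead, this check does not apply as written, since it only looks in the data subdirectory.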
Datasets used in the experiments can be downloaded from Release page.
The PU algorithm may not work on Windows because of bugs in Spark (see the relevant questions on Stack Overflow, which may help: 1, 2, 3).
The library is published to Maven Central and JCenter. Add the following lines depending on your build system.
compile 'ru.ispras:atr4s:1.2.2'
<dependency>
<groupId>ru.ispras</groupId>
<artifactId>atr4s</artifactId>
<version>1.2.2</version>
</dependency>
libraryDependencies += "ru.ispras" % "atr4s" % "1.2.2"
Build the library with Gradle:
./gradlew jar
./gradlew recognize -Pdataset=acl2 -PtopCount=10 -Pconfig=CValue.conf -Poutput=cvalueterms.txt
Here we recognize the top 10 terms from the text files stored in the acl2 directory (which should be a subdirectory of WORKING_DIRECTORY) using the CValue measure (configured in the CValue.conf file) and write the recognized terms with their weights to cvalueterms.txt.
Note that if the encoding of the input text files is not UTF-8, you should specify the correct encoding in the config of NLPPreprocessor (or convert the input files; many tools exist for that).
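If you prefer to convert the input files rather than change the preprocessor config, the conversion can be done with the Java standard library alone. The sketch below is illustrative only: the class name ToUtf8 is hypothetical, it is not part of ATR4S, and it reads each file fully into memory, which is assumed to be acceptable for ordinary text documents.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ATR4S.
public class ToUtf8 {
    /** Decodes source using sourceCharset and writes the text to target as UTF-8. */
    static void convert(Path source, Path target, Charset sourceCharset) throws IOException {
        String text = new String(Files.readAllBytes(source), sourceCharset);
        Files.write(target, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ToUtf8 <input file> <output file> <source charset name>
        convert(Paths.get(args[0]), Paths.get(args[1]), Charset.forName(args[2]));
    }
}
```

For example, running it with the arguments in.txt out.txt windows-1251 would rewrite a Windows-1251 file as UTF-8, provided that the JVM supports that charset.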
See the ATRConfig class, which is a configuration/builder for the facade class AutomaticTermsRecognizer.
See the AutomaticTermsRecognizer object for an example.
Usage in Java does not differ significantly, so see the same classes for examples.
However, since Java does not support parameters with default values, we provide static helper functions named make() for most classes that have parameters with default values or parameters that are Scala collections; see the example below.
Also note that there is a special method that returns weighted terms as a Java Iterable, so you do not need to convert Scala collections to Java ones.
// Imports of the ATR4S classes are omitted here.
class ATRExample {
    public static void main(String[] args) {
        String datasetDir = args[0];
        // Command-line arguments are strings, so the term count must be parsed.
        int topCount = Integer.parseInt(args[1]);
        ATRConfig atrConfig = new ATRConfig(EmoryNLPPreprocessorConfig.make(),
                                            TCCConfig.make(),
                                            new OneFeatureTCWeighterConfig(Weirdness.make()));
        Iterable<WeightedTerm> terms = atrConfig.build().recognizeAsJavaIterable(datasetDir, topCount);
        for (WeightedTerm termAndWeight : terms) {
            System.out.println(termAndWeight);
        }
    }
}
Apache License Version 2.0.