WARNING: WORK IN PROGRESS - THIS IS ONLY A TEMPLATE FOR THE DOCUMENTATION.
RELEASE DOCS ARE ON THE PROJECT WEBSITE
A SemText
object contains a list of Sentence
objects, which in turn can contain a list of Term
objects.
Each Term
has a MeaningStatus
, and possibly a selected meaning and a list of alternative meanings suggested for disambiguation.
Both sentences and terms are spans, and actual text is only stored in root SemText
object. Span` offsets are always absolute and calculated with respects to it. Spans can't overlap.
All of semantic text items (sentences, terms, meaning, semtext itself) can hold metadata.
Utility functions are held in SemTexts
class and to iterate through semtext terms a TermsView
and a TermIterator
are provided.
SemText data model supports a simple interaction cycle with the user, where some nlp service enriches the text with tags and then a human user validates the tags. There are four possible MeaningStatus
associated to a Term
:
SemText is available on Maven Central. To use it, put this in the dependencies section of your pom.xml:
<dependency>
<groupId>eu.trentorise.opendata.semtext</groupId>
<artifactId>semtext</artifactId>
<version>#{version}</version>
</dependency>
In case updates are available, version numbers follows semantic versioning rules.
Objects have no public constructor. To make them use factory methods starting with of
:
SemText semText1 = SemText.of(Locale.ITALIAN, "ciao");
To obtain a modified version of an object, use with
methods:
SemText semText2 = semText1.with("buongiorno");
assert semText1.getText().equals("ciao");
assert semText2.getText().equals("buongiorno");
Let's go through the steps to construct a SemText of one sentence and one term with SELECTED meaning.
We will pick this text:
String text = "Welcome to Garda lake.";
This builds theMeaning
:
Meaning meaning = Meaning.of("http://someknowledgebase.org/entities/garda-lake",
MeaningKind.ENTITY,
0.7);
We indicate the span where Garda lake occurs (a Term
can span many words):
Term term = Term.of(11, 21, MeaningStatus.SELECTED, meaning);
We finally construct the immutable SemText
. Notice language can be set only for the whole SemText
:
SemText semText = SemText.of(
Locale.ENGLISH,
text,
Sentence.of(0, 26, term)); // sentence spans the whole text
Only SemText
actually contains the text. To retrieve it from a span use getText
:
assert "Garda lake".equals(semText.getText(term));
SemTexts
class contains utilities, like converters and checkers:
ImmutableList<SemText> semtexts = SemTexts.dictToSemTexts(Dict.of(Locale.ITALIAN, "Ciao"));
try {
SemTexts.checkScore(1.7, "Invalid score!");
} catch(IllegalArgumentException ex){
}
Other examples usages can be found in the tests
Serialization to/from JSON is done with Jackson.
For ser/deserilization to work, you need first to register SemTextModule
and other required modules in your own Jackson ObjectMapper
:
ObjectMapper om = new ObjectMapper();
om.registerModule(new GuavaModule());
om.registerModule(new TodCommonsModule());
om.registerModule(new SemTextModule());
String json = om.writeValueAsString(SemText.of(Locale.ITALIAN, "ciao"));
SemText reconstructedSemText = om.readValue(json, SemText.class);
Notice we have also registered the necessary Guava (for immutable collections) and Tod Commons modules (for Dict and LocalizedString). To register everything in one command just write:
ObjectMapper om = new ObjectMapper();
SemTextModule.registerModulesInto(om);
ObjectMapper om = new ObjectMapper();
SemTextModule.registerModulesInto(om);
String json = om.writeValueAsString(SemText.of(Locale.ITALIAN, "ciao"));
SemText reconstructedSemText = om.readValue(json, SemText.class);
Any object can be attached as metadata to SemText
, Sentence
, Term
or Meaning
, with the constraint that it must be non-null and should be immutable. Metadata is accessed by providing a namespace. For example, let's say we want to associate a Java Date
to SemText
, under the namespace testns
.
To create our object in Java we can write this:
SemText.of("ciao").withMetadata("testns", new Date(123))
Once serialized as JSON, it will look like this:
{
"locale":"",
"text":"ciao",
"sentences":[],
"metadata":{
"testns":123
}
}
In order for metadata objects to be properly deserialized, we need to associate namespaces to object types by registering them in the SemTextModule
. So, in order to serialize/deserialize the example above, we would do something like the following:
ObjectMapper om = new ObjectMapper();
// register all required modules into the Jackson Object Mapper
SemTextModule.registerModulesInto(om);
// declare that metadata under namespace 'testns' in SemText objects
// should be deserialized into a Date object
SemTextModule.registerMetadata(SemText.class, "testns", Date.class);
String json = om.writeValueAsString(SemText.of("ciao").withMetadata("testns", new Date(123)));
SemText reconstructedSemText = om.readValue(json, SemText.class);
Date reconstructedMetadata = (Date) reconstructedSemText.getMetadata("testns");
assert new Date(123).equals(reconstructedMetadata);
NOTE: namespace register is a static variable, so it's shared among all the object mappers. If you're writing a library with SemText serializer make sure to use a reasonably unique (thus long) namespace. Applications instead may use shorter namespaces.
A more complex example can be found in SemTextModuleTest.metadataSerializationComplex, which shows how to develop and register a custom immutable metadata object.