The aim is to split a complex sentence into a meaning-preserving sequence of shorter sentences.
An input sentence with more than two clauses is first broken into 2 sentences, each having no more than 2 clauses. These are then passed to a Hugging Face T5 pre-trained model, fine-tuned on 300K sentences from the WebSplit v1.0 dataset, which splits them into multiple shorter sentences. Each output sentence is then assigned a similarity score against the input sentence, computed with a TF-IDF vectorizer. Sentences with low similarity scores are removed.
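The similarity filtering step can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses a small standard-library TF-IDF with cosine similarity in place of whatever vectorizer the project uses, and the function names, the choice to fit on the input plus the candidates, and the `threshold` value are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per document, using smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_by_similarity(source, candidates, threshold=0.1):
    """Keep (candidate, score) pairs whose similarity to the source meets the threshold.

    The threshold is a hypothetical value chosen for illustration only.
    """
    vectors = tfidf_vectors([source] + candidates)
    source_vec = vectors[0]
    scored = [(c, cosine(source_vec, v)) for c, v in zip(candidates, vectors[1:])]
    return [(c, score) for c, score in scored if score >= threshold]


source = "Alan Turing was born in London in 1912 and he studied at Cambridge."
candidates = [
    "Alan Turing was born in London in 1912.",
    "He studied at Cambridge.",
    "Bananas are yellow.",
]
for sentence, score in filter_by_similarity(source, candidates):
    print(f"{score:.2f}  {sentence}")
```

In this example the third candidate shares no tokens with the input, so its score is zero and it is dropped. A suitable threshold would have to be tuned on real model output.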
Links to the models and data can be found here, and the link to the jars can be found here.
[1] Preservation of keywords is an important factor, but the fine-tuned Hugging Face T5 model sometimes replaces words with their synonyms. This could be improved by filtering the training data taken from the dataset.
[2] Loss of important keywords. The output sometimes omits important dates, places, etc.
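One way to detect the keyword loss described above is to compare the input with the generated sentences. The sketch below is a hypothetical check, not part of the project: it treats tokens that start with a digit or with a capital letter (other than the first word of the sentence) as keywords, which is only a rough heuristic for dates and place names.

```python
import re


def extract_keywords(sentence):
    """Heuristic keywords: tokens starting with a digit, or capitalized tokens
    other than the first word (which is capitalized only by position)."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return {
        tok for i, tok in enumerate(tokens)
        if tok[0].isdigit() or (i > 0 and tok[0].isupper())
    }


def missing_keywords(source, outputs):
    """Return the source keywords that appear in none of the output sentences."""
    produced = set()
    for sentence in outputs:
        produced.update(re.findall(r"[A-Za-z0-9]+", sentence))
    return extract_keywords(source) - produced


source = "Alan Turing was born in London in 1912 and he studied at Cambridge."
outputs = ["Alan Turing was born in London.", "He studied at Cambridge."]
print(missing_keywords(source, outputs))
```

Here the year is present in the input but absent from both output sentences, so it is reported as missing. Because the first word of the input is skipped, a dropped name in that position would not be caught.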
[1] rui-yan: split-and-rephrase
[2] shreyaUp: Sentence-Simplification