Semantic Quality Benchmark for Word Embeddings, i.e. Natural Language Models in Python. The acronym is SeaQuBe (or `seaqube`).
This Python framework provides several text augmentation implementations and word embedding quality evaluation methods. It is designed to fit into your machine learning pipeline. The `BaseAugmentation` class provides the same API as the Python package nlpaug, so that the two packages can be used together smoothly. `BaseAugmentation` also provides additional methods; detailed examples are given below.
SeaQuBe also provides a toolkit to wrap a trained NLP model in an interactive tool.
- Text Data Augmentation
- Chaining and Reducing of Text Data Augmentations
- Word Embedding Quality Methods
- Interactive NLM Model Wrapper
- Augmentation in three lines
- Example of Basic Text Augmentation
- Example of Text Augmentation Chaining
- Example of Word Embedding Evaluation
- Example of Interactive NLP
Level | Augmenter | Description |
---|---|---|
Character | QwertyAugmentation | Simulate keyboard distance error |
Corpus | UnigramAugmentation | Replace ubiquitous words with other ubiquitous words |
Word | Active2PassiveAugmentation | Change surface of document using a simple active-to-passive transformer |
Word | EDAAugmentation | Augment document using the EDA algorithm |
Word | EmbeddingAugmentation | Replace words with similar words using WordNet |
Word | TranslationAugmentation | Change surface of document using translation and back-translation (with GoogleTranslate) |
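
To make the character-level idea concrete, here is a minimal sketch of a keyboard-distance typo in plain Python. It does not use SeaQuBe and is not how `QwertyAugmentation` is implemented; the partial adjacency map `QWERTY_NEIGHBORS` and the helper `qwerty_typo` are hypothetical names chosen for illustration.

```python
import random

# Hypothetical, partial QWERTY adjacency map (an assumption, not taken from SeaQuBe).
QWERTY_NEIGHBORS = {
    "a": "qwsz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
    "s": "awedxz",
    "t": "rfgy",
}


def qwerty_typo(token, rng):
    """Replace one character of `token` with a neighbouring key, if possible."""
    positions = [i for i, ch in enumerate(token) if ch.lower() in QWERTY_NEIGHBORS]
    if not positions:
        return token  # no character we know how to perturb
    pos = rng.choice(positions)
    replacement = rng.choice(QWERTY_NEIGHBORS[token[pos].lower()])
    return token[:pos] + replacement + token[pos + 1:]


rng = random.Random(0)  # seeded so repeated runs give the same typos
doc = ["This", "is", "a", "tokenized", "corpus"]
augmented = [qwerty_typo(tok, rng) for tok in doc]
print(augmented)
```

Each token keeps its length and differs from the original in exactly one position, which is roughly what a single mistyped key looks like.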
The streaming feature of augmentation is implemented in the `AugmentationStreamer` class. One reducing class exists; more can be implemented by extending the `BaseReduction` class.
Action | Class | Description |
---|---|---|
Streaming | AugmentationStreamer | Run augmentation for each document through all chained augmentations. |
Reducing | UniqueCorpusReduction | Given a list of documents, return only the unique documents. |
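
The two actions in the table can be illustrated without SeaQuBe. The following sketch assumes that every augmenter maps one tokenized document to a list of documents; that assumption, and the names `chain`, `unique_documents`, `lowercase` and `drop_last`, are hypothetical and do not reflect the signatures of `AugmentationStreamer` or `UniqueCorpusReduction`.

```python
def chain(corpus, augmenters):
    """Feed every document through all augmenters, one augmenter after another."""
    docs = list(corpus)
    for augment in augmenters:
        # Each augmenter may return several variants of a single document.
        docs = [new_doc for doc in docs for new_doc in augment(doc)]
    return docs


def unique_documents(corpus):
    """Keep the first occurrence of each document, preserving order."""
    seen = set()
    unique = []
    for doc in corpus:
        key = tuple(doc)  # lists are not hashable, tuples are
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


# Two toy augmenters (illustrative only).
def lowercase(doc):
    return [doc, [tok.lower() for tok in doc]]


def drop_last(doc):
    return [doc, doc[:-1]]


corpus = [["Hello", "World"], ["hello", "world"]]
streamed = chain(corpus, [lowercase, drop_last])
reduced = unique_documents(streamed)
print(len(streamed), len(reduced))
```

Chaining tends to multiply the corpus size and produce duplicates, which is why a reduction step after streaming is useful.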
Method | Description |
---|---|
WordAnalogyBenchmark | Benchmarks how well relations of the type "a is to b as c is to d" can be solved. |
WordSimilarityBenchmark | Compares the similarity of a word pair, as calculated by the model, with a human-estimated similarity score. |
WordOutliersBenchmark | Benchmarks how well an outlier in a group of words can be detected. |
SemanticWordnetBenchmark | Benchmarks the quality of an NLP model's semantic similarity, based on the WordNet graph. |
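
As an illustration of the word-similarity idea, the sketch below compares cosine similarities from a toy embedding with made-up human scores using a Spearman rank correlation. Scoring by rank correlation is a common choice for this kind of benchmark, but whether `WordSimilarityBenchmark` does exactly this is an assumption; the vectors, the scores and all function names here are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """Rank values from 1 to n (assumes no ties, which holds for the toy data)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Toy embedding and made-up human similarity judgements (illustrative only).
vectors = {
    "cat": [1.0, 0.9, 0.0],
    "dog": [0.9, 1.0, 0.1],
    "car": [0.0, 0.1, 1.0],
    "bus": [0.1, 0.0, 0.9],
}
human_scores = [("cat", "dog", 9.0), ("car", "bus", 8.0),
                ("cat", "car", 1.5), ("dog", "bus", 2.0)]

model = [cosine(vectors[a], vectors[b]) for a, b, _ in human_scores]
human = [score for _, _, score in human_scores]
score = spearman(model, human)
print(score)
```

A score close to 1 means the model orders word pairs the same way the human annotators do; a score near 0 means the two orderings are unrelated.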
SeaQuBe can be installed from PyPI using `pip install seaqube`, or by running `python setup.py install` in the main directory.
Some external dependencies are not installed automatically, but `seaqube` or `nltk` may raise errors with instructions on what to do. For example, `seaqube` might ask you to run:
python -c "from seaqube import download;download('vec4ir')"
from seaqube.augmentation.word import Active2PassiveAugmentation, EDAAugmentation, TranslationAugmentation, EmbeddingAugmentation
translate = TranslationAugmentation(max_length=2)
translate.doc_augment(['This', 'is', 'a', 'tokenized', 'corpus'])
TODO