A list of packages developed with a focus on natural language processing.
- aitextgen - A robust Python tool for text-based AI training and generation using GPT-2 [site].
- AllenNLP - An open-source NLP research library, built on PyTorch [site].
- BERTopic - Leveraging BERT and c-TF-IDF to create easily interpretable topics [site].
- BigARTM - Fast topic modeling platform [site].
- ChatterBot - a machine learning, conversational dialog engine for creating chat bots [site].
- clean-text - package for text cleaning.
- cltk - The Classical Language Toolkit [site].
- ColossalAI - Making large AI models cheaper, faster and more accessible [site].
- conTEXT-explorer - An open web-based system for exploring and visualizing concepts (combinations of occurring words and phrases) over time in text documents.
- contextualized-topic-models - package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics.
- DeText - A Deep Neural Text Understanding Framework for Ranking and Classification Tasks.
- dl-translate - A deep learning-based translation library built on Hugging Face Transformers.
- ecco - Explain, analyze, and visualize NLP language models [site].
- flair - A very simple framework for state-of-the-art Natural Language Processing (NLP).
- flashtext - Extract Keywords from sentence or Replace keywords in sentences.
- ftfy - ftfy (fixes text for you) fixes mojibake and other glitches in Unicode text, after the fact [site].
- gluon-nlp - A toolkit that helps you solve NLP problems [site].
- Gensim - Topic Modelling for Humans [site].
- Gramformer - A framework for detecting, highlighting and correcting grammatical errors on natural language text.
- HanLP - The multilingual NLP library for researchers and companies, built on PyTorch and TensorFlow 2.x, for advancing state-of-the-art deep learning techniques in both academia and industry [site].
- haystack - framework to interact with your data using Transformer models and LLMs [site].
- interpret-text - A library that incorporates state-of-the-art explainers for text-based machine learning models and visualizes the result with a built-in dashboard.
- intertext - Detect and visualize text reuse [site].
- jury - Comprehensive NLP Evaluation System.
- ktrain - library that makes deep learning and AI more accessible and easier to apply.
- langchain - Building applications with LLMs through composability.
- llama-cpp-python - Python bindings for llama.cpp.
- lexical_diversity - package for calculating a variety of lexical diversity indices.
- lexrank - LexRank algorithm for text summarization.
- multi_rake - Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python.
- Multilingual Latent Dirichlet Allocation - A Multilingual Latent Dirichlet Allocation (LDA) Pipeline with Stop Words Removal, n-gram features, and Inverse Stemming, in Python.
- multiplex-plot - A Python library to create and annotate beautiful network graph visualizations, text visualizations and more.
- neattext - a simple NLP package for cleaning textual data and text preprocessing.
- news-graph - Key information extraction from text and graph visualization.
- NLTK - Natural Language Toolkit [site].
- NLP-Cube - Natural Language Processing Pipeline - Sentence Splitting, Tokenization, Lemmatization, Part-of-speech Tagging and Dependency Parsing [site].
- nlpaug - Data augmentation for NLP.
- nlpnet - A neural network architecture for NLP tasks, using Cython for fast performance. Currently, it can perform POS tagging, SRL and dependency parsing.
- nlu - 1 line for thousands of state-of-the-art NLP models in hundreds of languages. The fastest and most accurate way to solve text problems [site].
- OpenKiwi - Open-Source Machine Translation Quality Estimation in PyTorch [site].
- ParlAI - A framework for training and evaluating AI models on a variety of openly available dialogue datasets [site].
- Parrot - A practical and feature-rich paraphrasing framework to augment human intents in text form to build robust NLU models for conversational engines.
- pattern - Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization [wiki].
- polyglot - Multilingual text (NLP) processing toolkit [site].
- pyhunspell - Python bindings for the Hunspell spellchecker engine.
- PyNLPl - Python Natural Language Processing Library.
- pysentimiento - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
- PySS3 - A Python package implementing a new interpretable machine learning model for text classification (with visualization tools for Explainable AI) [site].
- pytextrank - Python implementation of TextRank algorithms for phrase extraction [site].
- PyTorch-NLP - Basic Utilities for PyTorch Natural Language Processing [site].
- pywsd - Implementations of Word Sense Disambiguation (WSD) Technologies.
- rasa - Open source machine learning framework to automate text- and voice-based conversations [site].
- rosetta - Tools and wrappers for data science with a concentration on text processing.
- scattertext - Beautiful visualizations of how language differs among document types.
- sense2vec - Contextually-keyed word vectors [site].
- sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [site].
- sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
- small-text - Active Learning for Text Classification in Python.
- spaCy - Industrial-strength Natural Language Processing (NLP) in Python [site].
- spacy-stanza - Use the latest Stanza (StanfordNLP) research models directly in spaCy.
- Spark NLP - State of the Art Natural Language Processing [site].
- Stanza - Official Stanford NLP Python Library for Many Human Languages [site].
- texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow.
- TextAttack - a Python framework for adversarial attacks, data augmentation, and model training in NLP [site].
- textacy - NLP, before and after spaCy [site].
- TextBlob - Simple, Pythonic text processing: sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more [site].
- TextBox - a text generation library with pre-trained language models.
- TextFeatureSelection - library for feature selection for text features.
- texthero - Text preprocessing, representation and visualization from zero to hero [site].
- textkit - Command line tool for manipulating and analyzing text [site].
- textnets - Text analysis with networks [site].
- textplot - maps of texts with kernel density estimation and force-directed networks.
- textstat - Python package to calculate readability statistics of a text object (paragraphs, sentences, articles).
- text2text - Crosslingual NLP/G toolkit.
- tomotopy - Python package of Tomoto, the Topic Modeling Tool [site].
- topic modelling tools - Topic Modelling with Latent Dirichlet Allocation using Gibbs sampling.
- torchtext - Data loaders and abstractions for text and NLP [site].
- trankit - a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing.
- transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX [site].
- txtai - Build AI-powered semantic search applications [site].
- verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.
- wordfreq - Access a database of word frequencies, in various natural languages.
- wordseer - text analysis tool, written in Flask [site].
- wordtree - A Python library for generating word tree diagrams.
- BTM - Biterm Topic Models for Short Text [cran].
- cleanNLP - Package providing annotators and a normalized data model for natural language processing [cran].
- CRAN Task View: Natural Language Processing.
- corporaexplorer - An R package for dynamic exploration of text collections [cran], [site].
- dfrtopics - package for exploring topic models of text.
- koRpus - An R Package for Text Analysis [cran].
- hunspell - High-Performance Stemmer, Tokenizer, and Spell Checker for R [cran], [site].
- languageR - Analyzing Linguistic Data.
- lda - Collapsed Gibbs Sampling Methods for Topic Models.
- ldatuning - LDA models parameters tuning [cran].
- lsa - Latent Semantic Analysis.
- NLP - Basic classes and methods for Natural Language Processing.
- openNLP - An interface to the Apache OpenNLP tools.
- pattern.nlp - R package to perform sentiment analysis and Parts of Speech tagging for Dutch/French/English/German/Spanish/Italian.
- quanteda - package for the Quantitative Analysis of Textual Data [cran], [site].
- RKEA - interface to KEA (Keyphrase Extraction Algorithm).
- r-corpus - Text corpus analysis in R [cran].
- RMallet - An R Wrapper for the Java Mallet Topic Modeling Toolkit [cran].
- sentencepiece - R package for Byte Pair Encoding / Unigram modelling based on Sentencepiece [cran].
- SnowballC - Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library.
- spacyr - R wrapper to spaCy NLP [cran], [site].
- stm - Estimation of the Structural Topic Model [cran], [site].
- stopwords - Multilingual Stopword Lists in R [cran], [site].
- stringi - Fast and portable character string processing in R (with the Unicode ICU) [cran], [site].
- stringr - A fresh approach to string manipulation in R [cran], [site].
- tau - Text Analysis Utilities.
- text2vec - Fast vectorization, topic modeling, distances and GloVe word embeddings in R [cran], [site].
- textnets - R package to perform automated text analysis using network techniques.
- textplot - Functionalities to easily visualise complex relations in texts [cran].
- textplot - Plotting for text data.
- textreuse - Detect text reuse and document similarity [cran], [site].
- tidytext - Text mining using tidy tools [cran], [site].
- tm - A framework for text mining applications within R.
- tokenizers - Fast, Consistent Tokenization of Natural Language Text [cran], [site].
- topicdoc - Topic-Specific Diagnostics for LDA and CTM Topic Models [cran], [site].
- topicmodels - Latent Dirichlet Allocation (LDA) models and Correlated Topics Models (CTM).
- udpipe - package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit [cran], [site].
- wordcloud - Functionality to create pretty word clouds, visualize differences and similarity between documents, and avoid over-plotting in scatter plots with text.
- wordnet - WordNet Interface.
- wordVectors - package for building and exploring word embedding models.
- zipfR - Statistical Models for Word Frequency Distributions [site].
- CorpusLoaders - A variety of loaders for various NLP corpora.
- Embeddings - Functions and data dependencies for loading various word embeddings (Word2Vec, FastText, GloVe).
- Languages - A package for working with human languages.
- Snowball - Snowball stemming algorithms.
- StringAnalysis - Hard-Forked from JuliaText/TextAnalysis.jl.
- TextAnalysis - Julia package for text analysis.
- TextModels - Neural Network based models for Natural Language Processing.
- WordLists - Dictionaries without definitions.
- WordNet - A Julia package for Princeton's WordNet.
- WordTokenizers - High performance tokenizers for natural language processing and other related tasks.
- Word2Vec - Julia interface to word2vec.
- compromise - modest natural-language processing [site].
- natural - general natural language facilities for node.
- nlp.js - An NLP library for building bots, with entity extraction, sentiment analysis, automatic language identification, and more.
- wink-nlp-utils - NLP Functions for amplifying negations, managing elisions, creating ngrams, stems, phonetic codes to tokens and more [site].
- CoreNLP - Stanford CoreNLP: A Java suite of core NLP tools [site].
- Mallet - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text [site].
- OpenNLP - The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text [site].