Current content:
- Multilingual Sentence Embeddings (21/01/2021): Gives an overview of current multilingual sentence embedding techniques and tools, and how they compare across different sequence lengths. (sketch below)
- spaCy 3.0 (01/02/2021): spaCy 3.0 has just been released and in this tip, we'll have a look at some of the new features. We'll be training a German NER model and streamlining the end-to-end pipeline using the brand new spaCy projects! (sketch below)
- Compact transformers (26/02/2021): Bigger isn't always better. In this tip we look at some compact BERT-based models that provide a nice balance between computational resources and model accuracy. (sketch below)
- Keyword Extraction with pke (18/03/2021): The KEYNG (read: king) is dead, long live the KEYNG! In this tip we look at pke, an alternative to Gensim for keyword extraction. (sketch below)
- Explainable transformers using SHAP (22/04/2021): BERT, explain yourself! 📖 Up until recently, language model predictions have lacked transparency. In this tip we look at SHAP, a way to explain your latest transformer-based models. (sketch below)
- Transformer-based Data Augmentation (18/06/2021): Ever struggled with having a limited non-English NLP dataset for a project? Fear not, data augmentation to the rescue ⛑️ In this week's tip, we look at backtranslation 🔀 and contextual word embedding insertions as data augmentation techniques for multilingual NLP. (sketch below)
- Long range transformers (14/07/2021): Beyond and above the 512! 🏅 In this week's tip, we look at novel long range transformer architectures and compare them against the well-known RoBERTa model. (sketch below)
- Neural Keyword Extraction (10/09/2021): Neural Keyword Extraction 🧠 In this week's tip, we look at neural keyword extraction methods and how they compare to classical methods. (sketch below)
- HuggingFace Optimum (12/10/2021): HuggingFace Optimum Quantization ✂️ In this week's tip, we take a look at the new HuggingFace Optimum package to check out some model quantization techniques. (sketch below)
- Text Augmentation using large-scale LMs and prompt engineering (25/11/2021): Typically, the more data we have, the better performance we can achieve 🤙. However, it is sometimes difficult and/or expensive to annotate a large amount of training data 😞. In this tip, we leverage three large-scale LMs (GPT-3, GPT-J and GPT-Neo) to generate very realistic samples from a very small dataset. (sketch below)
- Gender debiasing of datasets using CDA (25/01/2022): A lot of large language models are trained on web text. However, this means that unintended biases can sneak into your model's behaviour 😞. In this tip, we'll look at how to alleviate this bias using Counterfactual Data Augmentation ⚖️. (sketch below)
- GPT2 Quantization using ONNXRuntime (19/04/2022): Large language models are costly to run. In this notebook, we leverage ONNXRuntime to quantize and run our Dutch GPT2 model more efficiently 💰. (sketch below)
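For the multilingual sentence embeddings tip, a minimal sketch using sentence-transformers; the model name is an assumed example, not necessarily one of the tools benchmarked in the tip.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Assumed example checkpoint: a multilingual sentence embedding model.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

sentences = ["How do I reset my password?", "Wie setze ich mein Passwort zurück?"]
embeddings = model.encode(sentences, normalize_embeddings=True)

# Cosine similarity of the English/German pair (dot product of normalized vectors).
print(float(np.dot(embeddings[0], embeddings[1])))
```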
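For the spaCy 3.0 tip, a minimal German NER inference sketch with a pretrained pipeline, assuming de_core_news_sm is installed; the tip itself goes further and trains a custom model through spaCy projects.

```python
import spacy

# Assumes: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

doc = nlp("Angela Merkel besuchte gestern die Universität Gent.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. person and location/organisation entities
```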
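For the compact transformers tip, a sketch that loads DistilBERT as an assumed example of a compact BERT-based model and counts its parameters.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased"  # assumed example of a compact model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Roughly 66M parameters, versus ~110M for bert-base-uncased.
print(sum(p.numel() for p in model.parameters()))
```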
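For the pke tip, the library's standard unsupervised pipeline, shown here with TopicRank as one of the available extractors.

```python
import pke

# Unsupervised keyphrase extraction with pke.
extractor = pke.unsupervised.TopicRank()
extractor.load_document(
    input="Natural language processing makes keyword extraction surprisingly easy.",
    language="en",
)
extractor.candidate_selection()
extractor.candidate_weighting()

for phrase, score in extractor.get_n_best(n=5):
    print(phrase, score)
```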
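For the SHAP tip, a sketch following the pattern shap documents for Hugging Face text pipelines; the sentiment model is whatever the pipeline downloads by default.

```python
import shap
from transformers import pipeline

# Wrap a transformers pipeline and let shap attribute the prediction to tokens.
classifier = pipeline("sentiment-analysis", return_all_scores=True)
explainer = shap.Explainer(classifier)

shap_values = explainer(["This movie was not bad at all, I actually loved it."])
shap.plots.text(shap_values)  # token-level attribution view, best viewed in a notebook
```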
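For the data augmentation tip, a backtranslation sketch with MarianMT checkpoints; English↔German is an assumed language pair chosen to keep the example small.

```python
from transformers import pipeline

# Round-trip translation (en -> de -> en) yields a paraphrased, augmented sample.
to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

text = "The support team answered my question very quickly."
german = to_de(text)[0]["translation_text"]
augmented = to_en(german)[0]["translation_text"]
print(augmented)
```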
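For the long range transformers tip, a sketch using Longformer as one assumed example of such an architecture, encoding a document well beyond the usual 512-token limit.

```python
from transformers import AutoModel, AutoTokenizer

# Longformer handles sequences up to 4096 tokens via sparse attention.
tokenizer = AutoTokenizer.from_pretrained("allenai/longformer-base-4096")
model = AutoModel.from_pretrained("allenai/longformer-base-4096")

long_text = " ".join(["A very long report paragraph goes here."] * 300)
inputs = tokenizer(long_text, truncation=True, max_length=4096, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, hidden_size)
```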
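For the neural keyword extraction tip, a KeyBERT sketch; KeyBERT is an assumed stand-in for the neural methods compared in the tip, not necessarily one of them.

```python
from keybert import KeyBERT

doc = "Transformer models have pushed the state of the art in keyword extraction."
kw_model = KeyBERT()  # uses a sentence-transformers backbone under the hood
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 2), top_n=5)
print(keywords)  # list of (keyphrase, similarity score) tuples
```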
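For the HuggingFace Optimum tip, a dynamic quantization sketch with the ONNX Runtime backend; the Optimum API has changed across releases, so the exact class names and arguments below are an assumption based on a recent version.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"

# Export the model to ONNX, then apply dynamic int8 quantization.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
ort_model.save_pretrained("onnx_model")

quantizer = ORTQuantizer.from_pretrained("onnx_model")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="onnx_model_quantized", quantization_config=qconfig)
```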
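For the prompt-based text augmentation tip, a few-shot generation sketch; the smallest GPT-Neo checkpoint is used here as an assumption to keep the example runnable, and GPT-3 would follow the same prompt pattern through the OpenAI API.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="EleutherAI/gpt-neo-125M")

# Few-shot prompt: show a couple of labelled examples, then ask for a new one.
prompt = (
    "Generate a customer review labelled with its sentiment.\n"
    "Review: The battery died after two days. Sentiment: negative\n"
    "Review: Lovely screen and very fast. Sentiment: positive\n"
    "Review:"
)
samples = generator(prompt, max_new_tokens=30, num_return_sequences=3, do_sample=True)
for sample in samples:
    print(sample["generated_text"].split("Review:")[-1].strip())
```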
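For the CDA tip, a plain-Python sketch of counterfactual data augmentation: gendered terms are swapped with their counterparts to create a mirrored training example. The word list is a tiny illustrative subset, not a complete mapping.

```python
import re

# Tiny illustrative subset of gendered word pairs; real CDA uses a much larger list
# and handles ambiguous cases such as "her" -> "him"/"his" more carefully.
PAIRS = {"he": "she", "she": "he", "his": "her", "her": "his",
         "him": "her", "man": "woman", "woman": "man"}

def counterfactual(text: str) -> str:
    """Swap gendered words to create a counterfactual copy of a sentence."""
    def swap(match):
        word = match.group(0)
        repl = PAIRS[word.lower()]
        return repl.capitalize() if word[0].isupper() else repl
    pattern = r"\b(" + "|".join(PAIRS) + r")\b"
    return re.sub(pattern, swap, text, flags=re.IGNORECASE)

print(counterfactual("He thanked his colleague."))  # -> "She thanked her colleague."
```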
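For the GPT2 quantization tip, the ONNX Runtime dynamic quantization call; the file names are placeholders and the export of the Dutch GPT2 model to ONNX is assumed to have happened beforehand.

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic int8 quantization of an already-exported ONNX model.
quantize_dynamic(
    model_input="gpt2.onnx",            # placeholder path to the exported model
    model_output="gpt2-quantized.onnx",
    weight_type=QuantType.QInt8,
)
```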