GilBERTo: An Italian pretrained language model based on RoBERTa

GilBERTo is an Italian pretrained language model based on Facebook's RoBERTa architecture and the CamemBERT text tokenization approach.

The model was trained with the subword masking technique for 100k steps on ~71GB of Italian text containing 11,250,012,896 words (OSCAR: Open Super-large Crawled ALMAnaCH coRpus). We used a vocabulary of 32k BPE subwords, generated with the SentencePiece tokenizer.
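
For reference, a 32k BPE subword vocabulary of this kind can be built with the SentencePiece Python API. The following is a minimal sketch, assuming a plain-text corpus with one sentence per line; the file names (oscar_it.txt, gilberto_bpe) are placeholders and the exact training options used for GilBERTo may differ:

import sentencepiece as spm

# Train a 32k BPE subword model on a plain-text corpus (one sentence per line).
# 'oscar_it.txt' and the 'gilberto_bpe' prefix are placeholder names.
spm.SentencePieceTrainer.train(
    input='oscar_it.txt',
    model_prefix='gilberto_bpe',  # writes gilberto_bpe.model and gilberto_bpe.vocab
    vocab_size=32000,
    model_type='bpe',
)

# Tokenize a sentence with the trained model
sp = spm.SentencePieceProcessor(model_file='gilberto_bpe.model')
print(sp.encode('io sono italiano', out_type=str))
#>> e.g. ['▁io', '▁sono', '▁italiano']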

GilBERTo was evaluated on several downstream tasks, comparing it to mBERT and other (non-BERT-based) models. More specifically, the comparison covers the following tasks:

  • Part-of-Speech tagging
  • Named Entity Recognition

Download

GilBERTo is available through both the huggingface/transformers and pytorch/fairseq libraries.

Model                             Library                    Download
GilBERTo-uncased-from-camembert   pytorch/fairseq            GilBERTo-uncased-fairseq.v1.zip
GilBERTo-uncased-from-camembert   huggingface/transformers   GilBERTo-uncased-transformers.v1.zip

Results

We are drafting the paper with all the details (coming soon).

To the best of our knowledge, downstream task evaluation for Italian is limited by the scarcity of available datasets. We strongly encourage everyone to contribute to the repository in order to improve the Italian NLP state of the art, and we will be happy to support them.

We currently selected the following tasks, based on what is available in the Italian state of the art:

PoS Tagging

The PoS task was evaluated with the accuracy metric on two different Italian datasets: Italian ParTUT and Italian ISDT. We also compared the results with the UDPipe and UDify models.

Model      Italian ParTUT   Italian ISDT
UDPipe     98.4             98.4
UDify      98.2             98.5
mBERT      98.0             98.5
GilBERTo   98.8             98.6
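
The accuracy reported above is standard token-level accuracy: the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch (the tag sequences are illustrative):

# Token-level PoS accuracy
def pos_accuracy(gold_tags, pred_tags):
    assert len(gold_tags) == len(pred_tags)
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)

print(pos_accuracy(['DET', 'NOUN', 'VERB'], ['DET', 'NOUN', 'ADJ']))
#>> 0.6666666666666666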

Named Entity Recognition

The NER task was evaluated on the Italian WikiNER dataset, already used by the spaCy pretrained model for Italian, which achieves F1: 86.40, Precision: 86.73, Recall: 86.08.

Model      F1     Precision   Recall
mBERT      92.2   92.1        92.3
GilBERTo   92.7   92.7        92.8
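
F1, precision, and recall for NER are typically computed at the entity level over IOB-tagged sequences. A minimal sketch using the seqeval package (an assumption on our part, not necessarily the evaluation script used for the numbers above):

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy IOB-tagged sequences: gold has two entities (PER, LOC), the prediction finds only PER
y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'I-PER', 'O', 'O']]

print(precision_score(y_true, y_pred))  #>> 1.0 (the predicted PER entity is exact)
print(recall_score(y_true, y_pred))     #>> 0.5 (the LOC entity is missed)
print(f1_score(y_true, y_pred))         #>> 0.6666666666666666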

How to use

You can use GilBERTo with the latest version of the huggingface/transformers or pytorch/fairseq Python libraries.

huggingface/transformers

import torch
from transformers import AutoModel, AutoTokenizer

# Load the GilBERTo tokenizer and model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModel.from_pretrained("idb-ita/gilberto-uncased-from-camembert")

# Encode a sentence into token ids (batch of size 1)
input_ids = torch.tensor(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!")).unsqueeze(0)
#>> tensor([[5, 755, 181, 1413, 25, 155, 12513, 14397, 16247, 31976, 6]])

# Inspect the subword tokens produced by the tokenizer
token_list = tokenizer.convert_ids_to_tokens(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!"))
#>> ['<s>', '▁io', '▁sono', '▁italiano', '▁e', '▁mi', '▁chiamo', '▁gil', 'berto', '!', '</s>']
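
From here you can run a forward pass to obtain contextual embeddings. A minimal sketch continuing the snippet above (the hidden size of 768 assumes a base-sized architecture):

# Forward pass without gradient tracking; the first output holds the last hidden states
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_states = outputs[0]  # shape: (1, sequence_length, 768)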

pytorch/fairseq

$ pip install fairseq

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# Load GilBERTo with the pytorch/fairseq library
gilberto_model = FairseqRobertaModel.from_pretrained('path/to/checkpoints_folder',
                                                     bpe='sentencepiece')

# Fill the <mask> token with the top-3 predictions
gilberto_model.fill_mask('Buongiorno mi <mask> Gilberto!', topk=3)

# Outputs
[('Buongiorno mi chiamo Gilberto!', 0.5044017434120178, ' chiamo'),
 ('Buongiorno mi presento Gilberto!', 0.05189879611134529, ' presento'),
 ('Buongiorno mi sento Gilberto!', 0.022937586531043053, ' sento')]
 
# Other examples

# Input: `È più facile per un italiano gesticolare senza <mask> che parlare senza gesticolare.`
# Output: `È più facile per un italiano gesticolare senza parlare che parlare senza gesticolare.`

# Input: `Agli italiani piace pasta, <mask> e mandolino`
# Output: `Agli italiani piace pasta, pizza e mandolino`

# Input: `Chi dice che il denaro non fa la <mask>, oltre a essere antipatico, è pure fesso.`
# Output: `Chi dice che il denaro non fa la felicità, oltre a essere antipatico, è pure fesso.`

# Input: `Era un uomo così antipatico che dopo la sua <mask> i parenti chiesero il bis`
# Output: `Era un uomo così antipatico che dopo la sua morte i parenti chiesero il bis`
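
Beyond fill_mask, the fairseq hub interface also exposes sentence encoding and feature extraction. A minimal sketch continuing from the gilberto_model loaded above:

gilberto_model.eval()  # disable dropout for deterministic outputs

# Encode a sentence into BPE token ids and extract last-layer features
tokens = gilberto_model.encode('Io sono italiano e mi chiamo GilBERTo!')
features = gilberto_model.extract_features(tokens)  # shape: (1, sequence_length, hidden_size)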

Contacts

Giulio Ravasio: Linkedin | Twitter | Github | [email protected]

Leonardo Di Perna: Linkedin | Twitter | Github | [email protected]
