GilBERTo: An Italian pretrained language model based on RoBERTa

GilBERTo is an Italian pretrained language model based on Facebook's RoBERTa architecture and the CamemBERT text tokenization approach.

The model was trained with the subword masking technique for 100k steps on ~71GB of Italian text containing 11,250,012,896 words (OSCAR: Open Super-large Crawled ALMAnaCH coRpus). We used a vocabulary of 32k BPE subwords, generated with the SentencePiece tokenizer.
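
For reference, a 32k BPE subword vocabulary of this kind can be built with the SentencePiece Python API. The following is a minimal sketch, assuming a plain-text corpus with one sentence per line; the file names (oscar_it.txt, gilberto_bpe) are placeholders and the exact training options used for GilBERTo may differ:

import sentencepiece as spm

# Train a 32k BPE subword model on a plain-text corpus (one sentence per line).
# 'oscar_it.txt' and the 'gilberto_bpe' prefix are placeholder names.
spm.SentencePieceTrainer.train(
    input='oscar_it.txt',
    model_prefix='gilberto_bpe',  # writes gilberto_bpe.model and gilberto_bpe.vocab
    vocab_size=32000,
    model_type='bpe',
)

# Tokenize a sentence with the trained model
sp = spm.SentencePieceProcessor(model_file='gilberto_bpe.model')
print(sp.encode('io sono italiano', out_type=str))
#>> e.g. ['▁io', '▁sono', '▁italiano']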

GilBERTo was evaluated on several downstream tasks, comparing it to mBERT and other (non-BERT-based) models. More specifically, the comparison covers the following tasks:

  • Part-of-Speech tagging
  • Named Entity Recognition

Download

GilBERTo is available through both the huggingface/transformers and pytorch/fairseq libraries.

Model                             Library                    Download
GilBERTo-uncased-from-camembert   pytorch/fairseq            GilBERTo-uncased-fairseq.v1.zip
GilBERTo-uncased-from-camembert   huggingface/transformers   GilBERTo-uncased-transformers.v1.zip

Results

We are drafting the paper with all the details (coming soon).

To the best of our knowledge, downstream task evaluation for Italian is limited by the scarcity of available datasets. We strongly encourage everyone to contribute to the repository in order to improve the Italian NLP state of the art, and we will be happy to support them.

We currently selected the following tasks, based on what is available in the Italian state of the art:

PoS Tagging

The PoS task was evaluated with the accuracy metric on two different Italian datasets: Italian ParTUT and Italian ISDT. We also compared the results with the UDPipe and UDify models.

Model      Italian ParTUT   Italian ISDT
UDPipe     98.4             98.4
UDify      98.2             98.5
mBERT      98.0             98.5
GilBERTo   98.8             98.6
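
The accuracy reported above is standard token-level accuracy: the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch (the tag sequences are illustrative):

# Token-level PoS accuracy
def pos_accuracy(gold_tags, pred_tags):
    assert len(gold_tags) == len(pred_tags)
    correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
    return correct / len(gold_tags)

print(pos_accuracy(['DET', 'NOUN', 'VERB'], ['DET', 'NOUN', 'ADJ']))
#>> 0.6666666666666666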

Named Entity Recognition

The NER task was evaluated on the Italian WikiNER dataset, already used by the spaCy pretrained model for Italian, which achieves F1: 86.40, Precision: 86.73, Recall: 86.08.

Model      F1     Precision   Recall
mBERT      92.2   92.1        92.3
GilBERTo   92.7   92.7        92.8
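
F1, precision, and recall for NER are typically computed at the entity level over IOB-tagged sequences. A minimal sketch using the seqeval package (an assumption on our part, not necessarily the evaluation script used for the numbers above):

from seqeval.metrics import precision_score, recall_score, f1_score

# Toy IOB-tagged sequences: gold has two entities (PER, LOC), the prediction finds only PER
y_true = [['B-PER', 'I-PER', 'O', 'B-LOC']]
y_pred = [['B-PER', 'I-PER', 'O', 'O']]

print(precision_score(y_true, y_pred))  #>> 1.0 (the predicted PER entity is exact)
print(recall_score(y_true, y_pred))     #>> 0.5 (the LOC entity is missed)
print(f1_score(y_true, y_pred))         #>> 0.6666666666666666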

How to use

You can use GilBERTo with the latest version of the huggingface/transformers or pytorch/fairseq Python libraries.

huggingface/transformers

import torch
from transformers import AutoModel, AutoTokenizer

# Load the GilBERTo tokenizer and model from the Hugging Face model hub
tokenizer = AutoTokenizer.from_pretrained("idb-ita/gilberto-uncased-from-camembert", do_lower_case=True)
model = AutoModel.from_pretrained("idb-ita/gilberto-uncased-from-camembert")

# Encode a sentence into token ids (batch of size 1)
input_ids = torch.tensor(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!")).unsqueeze(0)
#>> tensor([[5, 755, 181, 1413, 25, 155, 12513, 14397, 16247, 31976, 6]])

# Inspect the subword tokens produced by the tokenizer
token_list = tokenizer.convert_ids_to_tokens(tokenizer.encode("Io sono italiano e mi chiamo GilBERTo!"))
#>> ['<s>', '▁io', '▁sono', '▁italiano', '▁e', '▁mi', '▁chiamo', '▁gil', 'berto', '!', '</s>']
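
From here you can run a forward pass to obtain contextual embeddings. A minimal sketch continuing the snippet above (the hidden size of 768 assumes a base-sized architecture):

# Forward pass without gradient tracking; the first output holds the last hidden states
with torch.no_grad():
    outputs = model(input_ids)
last_hidden_states = outputs[0]  # shape: (1, sequence_length, 768)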

pytorch/fairseq

$ pip install fairseq

from fairseq.models.roberta import RobertaModel as FairseqRobertaModel

# Load GilBERTo with the pytorch/fairseq library
gilberto_model = FairseqRobertaModel.from_pretrained('path/to/checkpoints_folder',
                                                     bpe='sentencepiece')

# Fill the <mask> token with the top-3 predictions
gilberto_model.fill_mask('Buongiorno mi <mask> Gilberto!', topk=3)

# Outputs
[('Buongiorno mi chiamo Gilberto!', 0.5044017434120178, ' chiamo'),
 ('Buongiorno mi presento Gilberto!', 0.05189879611134529, ' presento'),
 ('Buongiorno mi sento Gilberto!', 0.022937586531043053, ' sento')]
 
# Other examples

# Input: `È più facile per un italiano gesticolare senza <mask> che parlare senza gesticolare.`
# Output: `È più facile per un italiano gesticolare senza parlare che parlare senza gesticolare.`

# Input: `Agli italiani piace pasta, <mask> e mandolino`
# Output: `Agli italiani piace pasta, pizza e mandolino`

# Input: `Chi dice che il denaro non fa la <mask>, oltre a essere antipatico, è pure fesso.`
# Output: `Chi dice che il denaro non fa la felicità, oltre a essere antipatico, è pure fesso.`

# Input: `Era un uomo così antipatico che dopo la sua <mask> i parenti chiesero il bis`
# Output: `Era un uomo così antipatico che dopo la sua morte i parenti chiesero il bis`
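
Beyond fill_mask, the fairseq hub interface also exposes sentence encoding and feature extraction. A minimal sketch continuing from the gilberto_model loaded above:

gilberto_model.eval()  # disable dropout for deterministic outputs

# Encode a sentence into BPE token ids and extract last-layer features
tokens = gilberto_model.encode('Io sono italiano e mi chiamo GilBERTo!')
features = gilberto_model.extract_features(tokens)  # shape: (1, sequence_length, hidden_size)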

Contacts

Giulio Ravasio: Linkedin | Twitter | Github | [email protected]

Leonardo Di Perna: Linkedin | Twitter | Github | [email protected]
