mashazya/TAED2

annotations_creators: no-annotation
language: ca
language_creators: crowdsourced
license: cc-by-sa-3.0
multilinguality: monolingual
pretty_name: Viquipèdia
size_categories: unknown
source_datasets: original
tags: wikipedia, catalan, text, articles, educational
task_categories: text-generation
task_ids: language-modeling

Dataset Description

Dataset summary

The Wikipedia dataset is a collection of scraped Wikipedia pages. All of the text is in Catalan, so a model trained on it is expected to handle Catalan input only.

Supported tasks

Text generation

Languages

Catalan

Dataset structure

{
  'ca-2': [
    'ca.wiki.test.tokens',
    'ca.wiki.train.tokens',
    'ca.wiki.valid.tokens'],
  'ca-100': [
    'ca.wiki.test.tokens',
    'ca.wiki.train.tokens',
    'ca.wiki.valid.tokens'],
  'ca-all': [
    'ca.wiki.test.tokens',
    'ca.wiki.train.tokens',
    'ca.wiki.valid.tokens']
}
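
As an illustration of how this layout might be used, here is a minimal Python sketch. It assumes that each key above ('ca-2', 'ca-100', 'ca-all') is a directory under some local root folder and that each .tokens file is UTF-8 plain text with one passage per line; neither assumption is confirmed by this card. The helper names split_path and iter_lines are hypothetical and are not part of the project.

```python
from pathlib import Path

# Configuration and split names taken from the structure listed above.
CONFIGS = ("ca-2", "ca-100", "ca-all")
SPLITS = ("train", "valid", "test")


def split_path(root, config, split):
    """Build the path to one split file (hypothetical helper).

    Assumes the layout <root>/<config>/ca.wiki.<split>.tokens.
    """
    if config not in CONFIGS:
        raise ValueError(f"unknown configuration: {config!r}")
    if split not in SPLITS:
        raise ValueError(f"unknown split: {split!r}")
    return Path(root) / config / f"ca.wiki.{split}.tokens"


def iter_lines(path):
    """Yield the non-empty lines of a token file, stripped of whitespace.

    Assumes the file is UTF-8 encoded plain text.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line
```

For example, iter_lines(split_path("data", "ca-2", "train")) would stream the smallest training split without loading the whole file into memory, where "data" stands in for wherever the files were downloaded.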

Data fields

Plain text

Data splits

Configuration  train      validation  test
ca-2           10.64 MB   1.07 MB     1.06 MB
ca-100         528.96 MB  1.07 MB     1.06 MB
ca-all         1.32 GB    1.07 MB     1.06 MB
