
Open source bilingual Catalan corpus used to train machine learning systems


davidcanovas/parallel-catalan-corpus

 
 


Description

This repository collects open source parallel corpora that align Catalan with several other languages.

We use these corpora to train the Softcatalà neural translation system:

Note: files with the .xz extension must be decompressed with xz before use.
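For readers who prefer to decompress from a script instead of the xz command line tool (for example `xz -d file.xz`), the following is a minimal sketch using Python's standard `lzma` module. The helper name `decompress_xz` and the file name in the usage note are hypothetical and do not refer to actual files in this repository.

```python
import lzma
import shutil
from pathlib import Path


def decompress_xz(src: str) -> Path:
    """Decompress a .xz file next to itself and return the output path.

    Hypothetical helper for illustration; not part of this repository.
    """
    src_path = Path(src)
    if src_path.suffix != ".xz":
        raise ValueError(f"expected a .xz file, got {src_path.name}")

    # Drop only the final suffix: "corpus.txt.xz" becomes "corpus.txt".
    dst_path = src_path.with_suffix("")

    # Copy in chunks so a large corpus is never held in memory at once.
    with lzma.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout)

    return dst_path
```

Calling `decompress_xz("some-corpus.txt.xz")` (a placeholder name) would write `some-corpus.txt` alongside the compressed file and leave the original in place.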

Sources of the corpora used

We strongly recommend the following sources of aligned Catalan parallel corpora:

In addition to these previously available corpora, we have created the following corpora:

Do you want to help?

See here (In Catalan)

Contact

Contact Jordi Mas [email protected]

Metadescription

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value
name Open source aligned text corpus: English, German, Spanish, etc. to/from Catalan.
description Open source aligned text corpus for building NLP applications (e.g. machine translation). Existing corpora have been cleaned up and new corpora have been introduced: Europarl Catalan, Tilde Catalan and open source translation memories.
sameAs https://github.com/Softcatala/parallel-catalan-corpus/
url
creator Softcatalà
