This repository contains linguistic resources for several of the tools included in the Text Tonsorium (https://github.com/kuhumcst/texton).
The resources come from many different sources. Some are straight copies of freely accessible data; others are data created by a training algorithm. The training data cannot be recreated from resources in the latter category.
The resources for the tokeniser (lists of abbreviations) are obtained from Wikipedia.
The list below is ordered according to language and tool.
Cite: R.H. Baayen and R. Piepenbrock and L. Gulikers. (1995). CELEX. ELRA, 3.1, ISLRN 302-530-620-279-0.
Link: CELEX
Link: https://www.clarin.si/repository/xmlui/handle/11356/1041
Licence: Creative Commons - Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
Cite: Erjavec, Tomaž; et al., 2010, MULTEXT-East free lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1041.
Link: https://www.clarin.si/repository/xmlui/handle/11356/1042
Licence: Creative Commons - Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
Cite: Erjavec, Tomaž; et al., 2010, MULTEXT-East non-commercial lexicons 4.0, Slovenian language resource repository CLARIN.SI, http://hdl.handle.net/11356/1042.
Link: https://github.com/UniversalDependencies/UD_Afrikaans-AfriBooms
MULTEXT-East free lexicons 4.0
MULTEXT-East free lexicons 4.0
Medieval: Middelaldertekster Dansk Sprog- og Litteraturselskab, Clara-Kloster/Guldkorpus University of Copenhagen
Late modern, Contemporary: Parole corpus Dansk Sprog- og Litteraturselskab
Medieval: Middelaldertekster Dansk Sprog- og Litteraturselskab, Clara-Kloster/Guldkorpus University of Copenhagen
Late modern: Ordbog over det Danske Sprog (Dansk Sprog- og Litteraturselskab)
Contemporary: CST. (2004). STO: Sprogteknologisk orddatabase over det danske sprog. Center for Sprogteknologi, Department of Nordic Studies and Linguistics, University of Copenhagen. CLARIN-DK-UCPH Repository
CELEX
G. Petasis
Cite: Petasis, G., Karkaletsis, V., Farmakiotou, D., Androutsopoulos, I., and Spyropoulos, C. D. (2001). A Greek Morphological Lexicon and its Exploitation by a Greek Controlled Language Checker. In Proceedings of the 8th Panhellenic Conference on Informatics (PCI’01), pages 80–89, November 8–10.
Cite: Petasis, G., Karkaletsis, V., Farmakiotou, D., Androutsopoulos, I., and Spyropoulos, C. D. (2003). A Greek Morphological Lexicon and Its Exploitation by Natural Language Processing Applications. In Yannis Manolopoulos, et al., editors, *Advances in Informatics - Post-proceedings of the 8th Panhellenic Conference in Informatics, volume 2563 of Lecture Notes in Computer Science*, pages 401–419. Springer Berlin / Heidelberg.
Link: https://www.ellogon.org/petasis/
CELEX
Cite: Eric Brill. 1992. A simple rule-based part of speech tagger. In Proceedings of the third conference on Applied natural language processing (ANLC '92). Association for Computational Linguistics, Stroudsburg, PA, USA, 152-155. doi:10.3115/974499.974526
lachica
Link: https://github.com/bumshmyak/lachica
Cite: Bum Shmyak. (2011). Spanish lemmatization.
MULTEXT-East free lexicons 4.0
Cite: Alexander Tkachenko. (2015). Suffix Lemmatizer for Estonian. https://github.com/estnltk/suffix-lemmatizer.
MULTEXT-East non-commercial lexicons 4.0
Cite: Boris New and Christophe Pallier. (2005). Une Base de Données Lexicales Libre.
Link: www.lexique.org
Limsi
Link: https://perso.limsi.fr/anne/OLDlexique.txt
The SETimes.HR+ Croatian dependency treebank
Link: http://nlp.ffzg.hr/
Link: https://github.com/ffnlp/sethr
MULTEXT-East free lexicons 4.0
the Icelandic Centre for Language Technology IFD
Link: http://malfong.is/?pg=ordtidnibok
Cite: Jörgen Pind and Friðrik Magnússon and Stefán Briem. (1991). IFD. The Icelandic Centre for Language Technology IFD.
Morph-it!
Link: https://docs.sslmit.unibo.it/doku.php?id=resources:morph-it
Cite: Zanchetta, E. and Baroni, M. (2005). Morph-it! A free corpus-based morphological resource for the Italian language. Corpus Linguistics 2005, 1(1).
Cite: Marco Baroni and Eros Zanchetta. (2009). Morph-it! Department of Interpreting and Translation - Forlì Campus Corpora, Linguistics, Technology Research centre CoLiTec, 0.48.
Licence: Dual-licensed free software; you can redistribute it and/or modify it under the terms of the Creative Commons Attribution ShareAlike 2.0 License and the GNU Lesser General Public License.
James Artz and Calliopi Dourou and J. F. Gentile and Kenny Hickman and Alex Lessie and Viet Luong and Meg Luthin and Molly Miller and Robin Ngo and Skylar Neil and Tufts University LAT-181 class. (2008). Latin Dependency Treebank. Perseus Digital Library, 1.5.
Link: Perseus Digital Library
Ján Šipoš. (2015). Latin lemmata.
Link: Latin lemmata.
MULTEXT-East non-commercial lexicons 4.0
CELEX
Cite: Koenraad de Smedt. (1999). Scarrie Lexicon. Meta Nord.
Marcin Miłkowski and Dawid Weiss. (2016). Morfologik.
Link: Morfologik 1.5
LABELLEX
Link: https://label.ist.utl.pt/en/labellex_en.php
Cite: Samuel Eleutério and Elisabete Ranchhod. (2014). LABEL-LEX MW. ELRA, ISLRN 502-837-497-805-9.
MULTEXT-East free lexicons 4.0
Alexander Pankov and Arsen Gadjikurbanov and Sergey Bochenkov. (2011). libturglem.
Link: libturglem-0.2.30.
MULTEXT-East free lexicons 4.0
MULTEXT-East free lexicons 4.0
MULTEXT-East non-commercial lexicons 4.0
Cite: Språkrådet. (2007). Lexin. Institutet för språk och folkminnen and Kungliga Tekniska högskolan.
Licence: CC-BY (attribution)
MULTEXT-East free lexicons 4.0