-
Notifications
You must be signed in to change notification settings - Fork 4
NoraIdentification
Language classification
A quick experiment with Lingua::Identify yielded promising results with little effort. Most documents were classified as either English or Norwegian, which was as expected; text files containing gibberish tended to be assigned random languages such as Afrikaans or Somali. The most potential for error seem to be in the files classified as French and Swedish.
The full results are in http://heim.ifi.uio.no/olasba/nora/noralang.log (or ~olasba/noralang.log on ps). Note that the "confidence score" is relative to the other possible languages; a high score doesn't necessarily mean it's not gibberish, as described in http://search.cpan.org/~ambs/Lingua-Identify-0.23/lib/Lingua/Identify.pm#confidence.
This list has been used to categorize my working copy of the converted text in /logon/scratch/olasba/nora0908/logon/scratch/nora/pdf/
The biggest problem with Lingua::Identify was a large percentage of Norwegian texts getting misclassified as Turkish; this was rectified by eliminating 'tr' as a possible language:
Lingua::Identify::deactivate_language('tr');
The most accurate methods for separating Norwegian from English seemed to be 4-grams (strictly speaking, 4-graphs) and stopwords.
my @guess = langof({method=>{smallwords=>0.8, ngrams4=>1.2}}, $text);
Sourcecode is in http://heim.ifi.uio.no/olasba/nora/identifylang03.pl.txt
- —Ola
#10743 (Krog 2006)
- Misfiled under English (contains a large percentage of quotes and loan words)
Home | Forum | Discussions | Events