Commit

Data from OMW 1.2
fcbond committed Nov 6, 2020
1 parent 9445a83 commit 291d2c2
Showing 1,352 changed files with 3,733,603 additions and 0 deletions.
6 changes: 6 additions & 0 deletions README.md
@@ -1,2 +1,8 @@
# omw-data

This packages up data for the Open Multilingual Wordnet (OMW).


The directory *wns* has the wordnet data from OMW 1.2, with some small fixes:
* added a citation for the Icelandic wordnet
* added human readable citations in ``omw-citations.tab``
37 changes: 37 additions & 0 deletions wns/README
@@ -0,0 +1,37 @@
This directory contains redistributable wordnets in further sub-directories.

The structure is:

lang/wn-data-lang.tab      synset-lemma pairs (see below)
lang/LICENSE               original license file (or equivalent)
lang/README                any notes about the conversion
lang/lang2tab.py           Python script used to extract the data
                           (may rely on WordNet version mappings)
lang/wn-data-lang.tab.log  any notes from the conversion
lang/citation.bib          the canonical citation reference(s)

Note that a single directory may have wordnets for multiple languages.

The wn-data files are formatted as follows:
# name<tab>lang<tab>url<tab>license
offset-pos<tab>type<tab>lemma
offset-pos<tab>type<tab>lemma
...

name is the name of the project
lang is the ISO three-letter code for the language
url is the url of the project
license is a short name for the license
offset is the 8-digit Princeton WordNet 3.0 synset offset
pos is one of [a,s,v,n,r]
lemma is the lemma (word separator normalized to ' ')
type is the language:relationship (e.g. eng:lemma)

Example:
# Thai tha http://th.asianwordnet.org/ wordnet
13567960-n tha:lemma กระบวนการทรานแอมมิแนชัน
00155298-n tha:lemma การปฏิเสธ
14369530-n tha:lemma ภาวะการหายใจเร็วของทารกแรกเกิด
10850469-n tha:lemma เบธัน
11268326-n tha:lemma เรินต์เกน
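
For illustration, a minimal Python sketch (not part of the distribution;
read_wn_data and its column handling are assumptions based on the format
above) that collects the lemma rows of such a file, keyed by offset-pos:

    import codecs
    from collections import defaultdict

    def read_wn_data(path):
        """Read the lemma rows of a wn-data tab file."""
        synsets = defaultdict(list)
        with codecs.open(path, "r", "utf-8") as f:
            f.readline()  # skip the "# name<tab>lang<tab>url<tab>license" header
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) >= 3 and fields[1].endswith(":lemma"):
                    synsets[fields[0]].append(fields[2])
        return synsets

    lemmas = read_wn_data("tha/wn-data-tha.tab")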

5 changes: 5 additions & 0 deletions wns/als/LICENSE
@@ -0,0 +1,5 @@
The Albanian wordnet is dual-licensed under either:

* GPL 3
* CC BY 3.0

11 changes: 11 additions & 0 deletions wns/als/README
@@ -0,0 +1,11 @@
I cleaned up the source XML a little.

------------------------------------------------------------------------
This conversion uses the following mappings:

WN-Map: Mappings between WordNet versions
Copyright (c) 2003 TALP Research Center
Universitat Politècnica de Catalunya.
Barcelona, Spain
http://www.lsi.upc.es/~nlp
--------------------------------------------------------------------------
97 changes: 97 additions & 0 deletions wns/als/als2tab.py
@@ -0,0 +1,97 @@
#!/usr/bin/python
# -*- encoding: utf-8 -*-
#
# Extract synset-word pairs, definitions and examples from the Albanian Wordnet
#
#

import sys
import codecs
import re, collections


wndata="/home/bond/work/wns/albanian/"
wnname = "Albanet"
wnlang = "als"
wnurl = "http://fjalnet.com/"
wnlicense = "CC BY 3.0"

#
# header
#
outfile = "wn-data-%s.tab" % wnlang
o = codecs.open(outfile, "w", "utf-8" )
log = codecs.open("log", "w", "utf-8" )

o.write("# %s\t%s\t%s\t%s \n" % (wnname, wnlang, wnurl, wnlicense))

###
### mappings
###
mapdir = "/home/bond/work/wn/mapps/mapping-20-30/"
maps = ["wn20-30.adj", "wn20-30.adv", "wn20-30.noun", "wn20-30.verb"]
pos = {"wn20-30.adj": "a", "wn20-30.adv": "r",
       "wn20-30.noun": "n", "wn20-30.verb": "v"}
map2030 = collections.defaultdict(lambda: 'unknown')
for m in maps:
    mf = codecs.open(mapdir + m, "r", "utf-8")
    p = pos[m]
    for l in mf:
        lst = l.strip().split()
        fsfrom = lst[0] + "-" + p
        fsto = sorted([(lst[i+1], lst[i]) for i in range(1, len(lst), 2)])[-1][1]
        ##print "%s-%s\t%s-%s" % (fsfrom, p, fsto, p)
        map2030[fsfrom] = "%s-%s" % (fsto, p)
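
# At this point map2030 maps each WordNet 2.0 offset-pos to its
# highest-scoring WordNet 3.0 offset-pos, falling back to 'unknown'.
# Each mapping line is assumed to hold a source offset followed by
# (target offset, score) pairs, e.g. (hypothetical values):
#   00001740 00001740 1.0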


#
# Data is in the file shqip.xml
#
# But the xml parser complains :-) so back to regexp
#


f = codecs.open(wndata + "shqip.xml", "r", "utf-8" )

synset = unicode()
synset20 = unicode()
lemma = unicode()
#of20 = unicode()
defid = 0
exid = 0

for l in f:
    ##print l, "EOS"
    ### synset
    m = re.search(r"<ID>(.*)</ID>", l.strip())
    if m:
        synset20 = m.group(1).strip()[6:]
        synset = map2030[synset20]
        defid = 0
        exid = 0
    ### lemma
    m = re.search(r"<LITERAL>(.+)<SENSE>(.*)</SENSE>", l.strip())
    if m:
        lemma = m.group(1).strip()
        sense = m.group(2).strip()
        if lemma == '' or synset == 'unknown':
            log.write("Problem with synset/lemma: {} ({}) {}\n".format(synset20, synset, sense))
        else:
            o.write("%s\t%s:%s\t%s\n" % (synset, wnlang, 'lemma', lemma))
    ### Definition
    m = re.search(r"<DEF>(.+)</DEF>", l.strip())
    if m:
        df = m.group(1).strip()
        if df == '' or synset == 'unknown':
            log.write("Problem with synset/def: {} ({})\n".format(synset20, synset))
        else:
            o.write("%s\t%s:%s\t%d\t%s\n" % (synset, wnlang, 'def', defid, df))
            defid += 1
    ### Example
    m = re.search(r"<USAGE>(.+)</USAGE>", l.strip())
    if m:
        ex = m.group(1).strip()
        if ex == '' or synset == 'unknown':
            log.write("Problem with synset/exe: {} ({})\n".format(synset20, synset))
        else:
            o.write("%s\t%s:%s\t%d\t%s\n" % (synset, wnlang, 'exe', exid, ex))
            exid += 1
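
# For reference, the rows written above are tab-separated and take three
# shapes (offsets and text here are placeholders, not real data):
#   12345678-n  als:lemma  <lemma>
#   12345678-n  als:def  0  <definition>
#   12345678-n  als:exe  0  <example sentence>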
12 changes: 12 additions & 0 deletions wns/als/citation.bib
@@ -0,0 +1,12 @@
Citation for alsWN

@TechReport{Ruci:2008,
  author      = "Ervin Ruci",
  title       = "On the current state of {Albanet} and related applications",
  institution = "University of Vlora",
  year        = 2008,
  note        = "(\url{http://fjalnet.com/technicalreportalbanet.pdf})"
}

7 changes: 7 additions & 0 deletions wns/als/log
@@ -0,0 +1,7 @@
Problem with synset/lemma: 01505508-a (unknown) 1
Problem with synset/lemma: 01505508-a (unknown) 1
Problem with synset/def: 01505508-a (unknown)
Problem with synset/exe: 01505508-a (unknown)
Problem with synset/lemma: 02002046-a (unknown) 1
Problem with synset/def: 02002046-a (unknown)
Problem with synset/exe: 02002046-a (unknown)