Merge pull request #128 from goodmami/v0.8.0
V0.8.0
goodmami committed Jul 6, 2021
2 parents f587a33 + ec97661 commit 3411b03
Showing 22 changed files with 1,753 additions and 287 deletions.
32 changes: 32 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,27 @@

## [Unreleased]

## [v0.8.0]

**Release date: 2021-07-07**

### Added

* `wn.ic` module ([#40])
* `wn.taxonomy` module ([#125])
* `wn.similarity.res` Resnik similarity ([#122])
* `wn.similarity.jcn` Jiang-Conrath similarity ([#123])
* `wn.similarity.lin` Lin similarity ([#124])
* `wn.util.synset_id_formatter` ([#119])

### Changed

* Taxonomy methods on `wn.Synset` are moved to `wn.taxonomy`, but
shortcut methods remain for compatibility ([#125]).
* Similarity metrics in `wn.similarity` now raise an error when
synsets come from different parts of speech.


## [v0.7.0]

**Release date: 2021-06-09**
Expand Down Expand Up @@ -62,6 +83,10 @@

**Release date: 2021-03-04**

**Notice:** This release introduces backwards-incompatible changes to
the schema that require users upgrading from previous versions to
rebuild their database.

### Added

* For WN-LMF 1.0 support ([#65])
Expand Down Expand Up @@ -347,6 +372,7 @@ the https://github.com/nltk/wordnet/ code which had been effectively
abandoned, but this is an entirely new codebase.


[v0.8.0]: ../../releases/tag/v0.8.0
[v0.7.0]: ../../releases/tag/v0.7.0
[v0.6.2]: ../../releases/tag/v0.6.2
[v0.6.1]: ../../releases/tag/v0.6.1
Expand All @@ -367,6 +393,7 @@ abandoned, but this is an entirely new codebase.
[#17]: https://github.com/goodmami/wn/issues/17
[#19]: https://github.com/goodmami/wn/issues/19
[#23]: https://github.com/goodmami/wn/issues/23
[#40]: https://github.com/goodmami/wn/issues/40
[#46]: https://github.com/goodmami/wn/issues/46
[#47]: https://github.com/goodmami/wn/issues/47
[#58]: https://github.com/goodmami/wn/issues/58
Expand Down Expand Up @@ -406,3 +433,8 @@ abandoned, but this is an entirely new codebase.
[#115]: https://github.com/goodmami/wn/issues/115
[#116]: https://github.com/goodmami/wn/issues/116
[#117]: https://github.com/goodmami/wn/issues/117
[#119]: https://github.com/goodmami/wn/issues/119
[#122]: https://github.com/goodmami/wn/issues/122
[#123]: https://github.com/goodmami/wn/issues/123
[#124]: https://github.com/goodmami/wn/issues/124
[#125]: https://github.com/goodmami/wn/issues/125
46 changes: 38 additions & 8 deletions CONTRIBUTING.md
Expand Up @@ -5,6 +5,7 @@ Thanks for helping to make Wn better!
**Quick Links:**

- [Report a bug or request a feature](https://github.com/goodmami/wn/issues/new)
- [Ask a question](https://github.com/goodmami/wn/discussions)
- [View documentation](https://wn.readthedocs.io/)

**Developer Information:**
Expand All @@ -14,28 +15,43 @@ Thanks for helping to make Wn better!
- Changelog: [keep a changelog](https://keepachangelog.com/en/1.0.0/)
- Documentation framework: [Sphinx](https://www.sphinx-doc.org/)
- Docstring style: [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html#38-comments-and-docstrings) (via [sphinx.ext.napoleon](https://www.sphinx-doc.org/en/master/usage/extensions/napoleon.html))
- Testing framework: [pytest](https://pytest.org/)
- Packaging framework: [flit](https://flit.readthedocs.io/en/latest/)
- Coding style: [PEP-8](https://www.python.org/dev/peps/pep-0008/)
- Testing automation: [nox](https://nox.thea.codes)
- Unit/regression testing: [pytest](https://pytest.org/)
- Packaging framework: [Flit](https://flit.readthedocs.io/en/latest/)
- Coding style: [PEP-8](https://www.python.org/dev/peps/pep-0008/) (via [Flake8](https://flake8.pycqa.org/))
- Type checking: [Mypy](http://mypy-lang.org/)


## Get Help

Confused about wordnets? See the [Global Wordnet Association
Documentation](https://globalwordnet.github.io/gwadoc/)
Confused about wordnets in general? See the [Global Wordnet
Association Documentation](https://globalwordnet.github.io/gwadoc/)

Having trouble with using Wn? [Raise an
Confused about using Wn or wish to share some tips? [Start a
discussion](https://github.com/goodmami/wn/discussions)

Encountering a problem with Wn or wish to propose a new feature? [Raise an
issue](https://github.com/goodmami/wn/issues/new)


## Report a Bug

When reporting a bug, please provide enough information for someone to
reproduce the problem. This might include the version of Python you're
running, the version of Wn you have installed, the wordnet lexicons
you have installed, and possibly the platform (Linux, Windows, macOS)
you're on. Please give a minimal working example that illustrates the
problem.
problem. For example:

> I'm using Wn 0.7.0 with Python 3.8 on Linux and [description of
> problem...]. Here's what I have tried:
>
> ```pycon
> >>> import wn
> >>> # some code
> ... # some result or error
> ```


## Request a Feature
Expand All @@ -47,4 +63,18 @@ would address.
See the "developer information" above for a brief description of
guidelines and conventions used in Wn. If you have a fix, please
submit a pull request to the `main` branch.
submit a pull request to the `main` branch. In general, every pull
request should have an associated issue.

Developers should install Wn locally from source using
[Flit](https://flit.readthedocs.io/en/latest/). Flit may be installed
system-wide or within a virtual environment:

```bash
$ pip install flit
$ flit install -s
```

The `-s` option tells Flit to use symbolic links to install Wn,
similar to pip's -e editable installs. This allows one to edit source
files and use the changes without having to reinstall Wn each time.
7 changes: 0 additions & 7 deletions README.md
Expand Up @@ -19,13 +19,6 @@

---

**Notice for users upgrading to v0.6:** Version v0.6.0 introduced
changes to the database schema that require the user to rebuild their
database. Please [raise an
issue](https://github.com/goodmami/wn/issues/new) if you need help.

---

Wn is a Python library for exploring information in wordnets. Install
it from PyPI:

163 changes: 163 additions & 0 deletions docs/api/wn.ic.rst
@@ -0,0 +1,163 @@

wn.ic
=====

.. automodule:: wn.ic

The mathematical formulae for information content are defined in
`Formal Description`_, and the corresponding Python API functions are
described in `Calculating Information Content`_. These functions
require information content weights obtained either by `computing them
from a corpus <Computing Corpus Weights_>`_, or by `loading
pre-computed weights from a file <Reading Pre-computed Information
Content Files_>`_.

.. note::

The term *information content* can be ambiguous. It often, and most
accurately, refers to the result of the :func:`information_content`
function (:math:`\text{IC}(c)` in the mathematical notation), but
is also sometimes used to refer to the corpus frequencies/weights
(:math:`\text{freq}(c)` in the mathematical notation) returned by
:func:`load` or :func:`compute`, as these weights are the basis of
the value computed by :func:`information_content`. The Wn
documentation tries to consistently refer to the former as the
*information content value*, or just *information content*, and the
latter as *information content weights*, or *weights*.


Formal Description
------------------

The Information Content (IC) of a concept (synset) is a measure of its
specificity computed from the wordnet's taxonomy structure and corpus
frequencies. It is defined by Resnik 1995 ([RES95]_), following
information theory, as the negative log-probability of a concept:

.. math::

   \text{IC}(c) = -\log{p(c)}

A concept's probability is the empirical probability over a corpus:

.. math::

   \text{p}(c) = \frac{\text{freq}(c)}{N}

Here, :math:`N` is the total count of words of the same category as
concept :math:`c` ([RES95]_ only considered nouns) where each word has
some representation in the wordnet, and :math:`\text{freq}` is defined
as the sum of corpus counts of words in :math:`\text{words}(c)`, which
is the set of words subsumed by concept :math:`c`:

.. math::

   \text{freq}(c) = \sum_{w \in \text{words}(c)}{\text{count}(w)}

It is common for :math:`\text{freq}` to not contain actual frequencies
but instead weights distributed evenly among the synsets for a
word. These weights are calculated as the word frequency divided by
the number of synsets for the word:

.. math::

   \text{freq}_{\text{distributed}}(c)
       = \sum_{w \in \text{words}(c)}{\frac{\text{count}(w)}{|\text{synsets}(w)|}}

.. [RES95] Resnik, Philip. "Using information content to evaluate
   semantic similarity." In Proceedings of the 14th International
   Joint Conference on Artificial Intelligence (IJCAI-95), Montreal,
   Canada, pp. 448-453. 1995.


Example
-------

In the Princeton WordNet, the frequency of a concept like **stone
fruit** is not just the number of occurrences of *stone fruit*, but also
includes the counts of the words for its hyponyms (*almond*, *olive*,
etc.) and other taxonomic descendants (*Jordan almond*, *green olive*,
etc.). The word *almond* has two synsets: one for the fruit or nut,
another for the plant. Thus, if the word *almond* is encountered
:math:`n` times in a corpus, then the weight (either the frequency
:math:`n` or distributed weight :math:`\frac{n}{2}`) is added to the
total weights for both synsets and to those of their ancestors, but
not to those of descendant synsets, such as **Jordan almond**. The fruit/nut
synset of almond has two hypernym paths which converge on **fruit**:

1. **almond** ⊃ **stone fruit** ⊃ **fruit**
2. **almond** ⊃ **nut** ⊃ **seed** ⊃ **fruit**

The weight is added to each ancestor (**stone fruit**, **nut**,
**seed**, **fruit**, ...) once. That is, the weight is not added to
the convergent ancestor for **fruit** twice, but only once.
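
The propagation described above can be sketched in plain Python. This
is only an illustration over a hypothetical toy taxonomy, not Wn's
implementation: the names ``HYPERNYMS``, ``SYNSETS``, ``ancestors``,
``weights``, and ``toy_information_content`` are invented for the
example, and the total :math:`N` is passed in explicitly rather than
derived from a corpus.

```python
import math
from collections import defaultdict

# Hypothetical toy taxonomy: each synset maps to its direct hypernyms.
HYPERNYMS = {
    "almond.fruit": ["stone_fruit", "nut"],
    "stone_fruit": ["fruit"],
    "nut": ["seed"],
    "seed": ["fruit"],
    "fruit": [],
    "almond.plant": ["tree"],
    "tree": [],
}

# Hypothetical mapping from a word to the synsets it belongs to.
SYNSETS = {"almond": ["almond.fruit", "almond.plant"]}


def ancestors(synset):
    """Return the set of all taxonomic ancestors of *synset*."""
    seen = set()
    agenda = list(HYPERNYMS[synset])
    while agenda:
        node = agenda.pop()
        if node not in seen:
            seen.add(node)
            agenda.extend(HYPERNYMS[node])
    return seen


def weights(counts, distribute=True):
    """Accumulate word counts onto synsets and their ancestors."""
    freq = defaultdict(float)
    for word, n in counts.items():
        synsets = SYNSETS[word]
        w = n / len(synsets) if distribute else n
        for synset in synsets:
            # Using a set means a convergent ancestor (e.g. "fruit")
            # receives the weight once, not once per hypernym path.
            for node in {synset} | ancestors(synset):
                freq[node] += w
    return dict(freq)


def toy_information_content(synset, freq, total):
    """IC(c) = -log(freq(c) / N), with N given as *total*."""
    return -math.log(freq[synset] / total)
```

With ``weights({"almond": 4})``, each of the two *almond* synsets
receives the distributed weight 2, and **fruit** also ends up with 2
rather than 4 even though two hypernym paths lead to it.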


Calculating Information Content
-------------------------------

.. autofunction:: information_content
.. autofunction:: synset_probability


Computing Corpus Weights
------------------------

If pre-computed weights are not available for a wordnet or for some
domain, they can be computed given a corpus and a wordnet.

The corpus is an iterable of words. For large corpora it may help to
use a generator for this iterable, but the entire vocabulary (i.e.,
unique words and counts) will be held at once in memory. Multi-word
expressions are also possible if they exist in the wordnet. For
instance, the Princeton WordNet has *stone fruit*, with a single space
delimiting the words, as an entry.

The :class:`wn.Wordnet` object must be instantiated with a single
lexicon, although it may have expand-lexicons for relation
traversal. For best results, the wordnet should use a lemmatizer to
help it deal with inflected wordforms from running text.
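
As a rough sketch of the "iterable of words" idea, the generator below
lazily yields lowercased tokens from lines of text. The function name
``iter_words`` and the naive regular-expression tokenization are
assumptions made for illustration; in particular, this sketch emits
single tokens only, so multi-word entries such as *stone fruit* would
need additional handling.

```python
import re
from typing import Iterable, Iterator

# Naive tokenizer (an assumption for this sketch): runs of letters only.
_WORD = re.compile(r"[^\W\d_]+")


def iter_words(lines: Iterable[str]) -> Iterator[str]:
    """Lazily yield lowercased word tokens from an iterable of lines."""
    for line in lines:
        for match in _WORD.finditer(line):
            yield match.group().lower()


# Possible usage with Wn (not run here; the file name and the lexicon
# specifier are placeholders, and a lexicon must already be installed):
#
#     import wn
#     import wn.ic
#     wordnet = wn.Wordnet("pwn:3.0")
#     with open("corpus.txt") as f:
#         freq = wn.ic.compute(iter_words(f), wordnet)
```

Because ``iter_words`` is a generator, the corpus text itself is never
held in memory all at once, although the vocabulary counts still are.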

.. autofunction:: compute


Reading Pre-computed Information Content Files
----------------------------------------------

The :func:`load` function reads pre-computed information content
weights files as used by the `WordNet::Similarity
<http://wn-similarity.sourceforge.net/>`_ Perl module or the `NLTK
<http://www.nltk.org/>`_ Python package. These files are computed for
a specific version of a wordnet using the synset offsets from the
`WNDB <https://wordnet.princeton.edu/documentation/wndb5wn>`_ format,
which Wn does not use. These offsets therefore must be converted into
an identifier that matches those used by the wordnet. By default,
:func:`load` uses the lexicon identifier from its *wordnet* argument
with synset offsets (padded with 0s to make 8 digits) and
parts-of-speech from the weights file to format an identifier, such as
``pwn-00001174-n``. For wordnets that use a different identifier
scheme, the *get_synset_id* parameter of :func:`load` can be given a
callable created with :func:`wn.util.synset_id_formatter`. It can also
be given another callable with the same signature as shown below:

.. code-block:: python

   get_synset_id(*, offset: int, pos: str) -> str

.. warning::

The weights files are only valid for the version of wordnet for
which they were created. Files created for the Princeton WordNet
3.0 do not work for the Princeton WordNet 3.1 because the offsets
used in its identifiers are different, although the *get_synset_id*
parameter of :func:`load` could be given a function that performs a
suitable mapping. Some `Open Multilingual Wordnet
<https://github.com/globalwordnet/OMW>`_ wordnets use the Princeton
WordNet 3.0 offsets in their identifiers and can therefore
technically use the weights, but this usage is discouraged because
the distributional properties of text in another language and the
structure of the other wordnet will not be compatible with that of
the Princeton WordNet. For these cases, it is recommended to
compute new weights using :func:`compute`.

.. autofunction:: load
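
A callable with the keyword-only signature shown above might be
written by hand as in the following sketch. The factory name
``make_synset_id_getter`` and its optional ``offset_map`` argument are
hypothetical, and the offsets used in the example mapping are made up;
a real mapping between wordnet versions would have to come from
elsewhere.

```python
def make_synset_id_getter(prefix, offset_map=None):
    """Build a callable usable as a *get_synset_id* argument.

    *offset_map* is a hypothetical ``{(offset, pos): new_offset}``
    mapping for translating offsets between wordnet versions.
    """
    offset_map = offset_map or {}

    def get_synset_id(*, offset: int, pos: str) -> str:
        # Translate the offset if a mapping entry exists, then format
        # it as a zero-padded 8-digit number between prefix and pos.
        offset = offset_map.get((offset, pos), offset)
        return f"{prefix}-{offset:08}-{pos}"

    return get_synset_id
```

For instance, ``make_synset_id_getter("pwn")(offset=1174, pos="n")``
returns ``pwn-00001174-n``, matching the default identifier format
described above.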
34 changes: 28 additions & 6 deletions docs/api/wn.rst
Expand Up @@ -168,6 +168,7 @@ The Sense Class
.. automethod:: frames
.. automethod:: counts
.. automethod:: metadata
.. automethod:: relations
.. automethod:: get_related
.. automethod:: get_related_synsets
.. automethod:: closure
Expand Down Expand Up @@ -218,17 +219,38 @@ The Synset Class
.. automethod:: hyponyms
.. automethod:: holonyms
.. automethod:: meronyms
.. automethod:: hypernym_paths
.. automethod:: min_depth
.. automethod:: max_depth
.. automethod:: shortest_path
.. automethod:: common_hypernyms
.. automethod:: lowest_common_hypernyms
.. automethod:: relations
.. automethod:: get_related
.. automethod:: closure
.. automethod:: relation_paths
.. automethod:: translate

.. The taxonomy methods below have been moved to wn.taxonomy
.. method:: hypernym_paths(simulate_root=False)

Shortcut for :func:`wn.taxonomy.hypernym_paths`.

.. method:: min_depth(simulate_root=False)

Shortcut for :func:`wn.taxonomy.min_depth`.

.. method:: max_depth(simulate_root=False)

Shortcut for :func:`wn.taxonomy.max_depth`.

.. method:: shortest_path(other, simulate_root=False)

Shortcut for :func:`wn.taxonomy.shortest_path`.

.. method:: common_hypernyms(other, simulate_root=False)

Shortcut for :func:`wn.taxonomy.common_hypernyms`.

.. method:: lowest_common_hypernyms(other, simulate_root=False)

Shortcut for :func:`wn.taxonomy.lowest_common_hypernyms`.


The ILI Class
-------------