Update references in documentation, plus minor docs changes
- Format all references in APA
- Add references section to all relevant documentation
- Includes links and DOIs for references where possible
- Rework some wording
- Add @see annotations to distance and similarity interfaces

Signed-off-by: solonovamax <[email protected]>
solonovamax committed Sep 29, 2023
1 parent 9a7f035 commit da9bb56
Showing 25 changed files with 270 additions and 115 deletions.
155 changes: 104 additions & 51 deletions kt-string-similarity/dokka/includes/kt-string-similarity.md
@@ -4,14 +4,15 @@ Kotlin String Similarity is a Kotlin Multiplatform library for measuring and com

Kotlin String Similarity implements various string similarity and distance measures.
It contains over a dozen algorithms, including, but not limited to,
[Levenshtein][ca.solostudios.stringsimilarity.Levenshtein] distance (and siblings),
[Levenshtein][ca.solostudios.stringsimilarity.edit.Levenshtein] distance (and siblings),
[Jaro-Winkler][ca.solostudios.stringsimilarity.JaroWinkler],
[Longest Common Subsequence][ca.solostudios.stringsimilarity.LongestCommonSubsequence],
[Longest Common Subsequence][ca.solostudios.stringsimilarity.edit.LCS],
[Cosine similarity][ca.solostudios.stringsimilarity.Cosine], and many others.
Check the summary table below for the complete list.

This project contains a port of tdebatty's
[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform.
This project was initially a port of tdebatty's
[java-string-similarity](https://github.com/tdebatty/java-string-similarity) to Kotlin Multiplatform,
but it now expands upon it.

## Including

@@ -20,28 +21,35 @@ You can include ${project.module} in your project by adding the following:
### Maven

```xml
<dependency>
<groupId>${project.group}</groupId>
<artifactId>${project.module}</artifactId>
<version>${project.version}</version>
</dependency>
<dependencies>
<dependency>
<groupId>${project.group}</groupId>
<artifactId>${project.module}</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>
```

### Gradle Groovy DSL

```groovy
implementation '${project.group}:${project.module}:${project.version}'
```gradle
dependencies {
implementation '${project.group}:${project.module}:${project.version}'
}
```

### Gradle Kotlin DSL

```kotlin
implementation("${project.group}:${project.module}:${project.version}")
dependencies {
implementation("${project.group}:${project.module}:${project.version}")
}
```

### Gradle Version Catalog

```toml
[libraries]
${project.module} = { group = "${project.group}", name = "${project.module}", version = "${project.version}" }
```

@@ -51,42 +59,87 @@ The main characteristics of each implemented algorithm are presented below.
The "cost" column gives an estimation of the computational cost to compute the similarity between two strings of length
\\(m\\) and \\(n\\) respectively.

| Name | Similarity support | Normalized | Metric | Type | Cost | Typical usage |
|--------------------------------------|--------------------|------------|--------|---------|-------------------------------------|----------------------------------|
| Levenshtein |||| | \\(O(m \\times n)\\) <sup>1</sup> | |
| Normalized Levenshtein |||| | \\(O(m \\times n)\\) <sup>1</sup> | |
| Weighted Levenshtein |||| | \\(O(m \\times n)\\) <sup>1</sup> | OCR |
| Damerau-Levenshtein<sup>3</sup> |||| | \\(O(m \\times n)\\) <sup>1</sup> | |
| Optimal String Alignment<sup>3</sup> |||| | \\(O(m \\times n)\\) <sup>1</sup> | |
| Jaro-Winkler |||| | \\(O(m \\times n)\\) | typo correction |
| Longest Common Subsequence |||| | \\(O(m \\times n)\\) <sup>1,2</sup> | diff utility, GIT reconciliation |
| Metric Longest Common Subsequence |||| | \\(O(m \\times n)\\) <sup>1,2</sup> | |
| N-Gram |||| | \\(O(m \\times n)\\) | |
| Q-Gram |||| Profile | \\(O(m+n)\\) | |
| Cosine similarity |||| Profile | \\(O(m+n)\\) | |
| Jaccard index |||| Set | \\(O(m+n)\\) | |
| Sorensen-Dice coefficient |||| Set | \\(O(m+n)\\) | |
| Ratcliff-Obershelp |||| | ? | |

1. In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the dynamic programming method, which
has a cost \\(O(m \\times n)\\).
For Levenshtein distance, the algorithm is sometimes called Wagner-Fischer algorithm ("The string-to-string correction problem", 1974).
The original algorithm uses a matrix of size m x n to store the Levenshtein distance between string prefixes.

If the alphabet is finite, it is possible to use the method of four russians (Arlazarov et al. "On economic construction of the
transitive
closure of a directed graph", 1970) to speedup computation.
This was published by Masek in 1980 ("A Faster Algorithm Computing String Edit Distances").
This method splits the matrix in blocks of size \\(t \\times t\\).
Each possible block is precomputed to produce a lookup table.
This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{nm}{t})\\).
Usually, \\(t\\) is chosen as \\(log(m)\\) if \\(m > n\\).
The resulting computation cost is thus \\(O(\\frac{mn}{log(m)})\\).
This method has not been implemented (yet).

2. In "Length of Maximal Common Subsequences", K.S. Larsen proposed an algorithm that computes the length of LCS in time
\\(O(log(m) \\times log(n))\\). But the algorithm has a memory requirement \\(O(m \\times n^2)\\) and was thus not implemented here.

3. There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions (also sometimes called
unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called restricted edit distance).
For Optimal String Alignment, no substring can be edited more than once.
| Name | Distance | Similarity | Normalized | Metric | Memory cost | Execution cost | Typical usage |
|--------------------------------------------|:--------:|:----------:|:----------:|:------:|----------------------|------------------------------------|-----------------|
| Levenshtein ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Damerau-Levenshtein[@ft-c] ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Optimal String Alignment[@ft-c] ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Longest Common Subsequence ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] | diff, git |
| Normalized Levenshtein ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Normalized Damerau-Levenshtein[@ft-c] ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Normalized Optimal String Alignment[@ft-c] ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a] | |
| Normalized Longest Common Subsequence ||||| \\(O(m \\times n)\\) | \\(O(m \\times n)\\)[@ft-a][@ft-b] | |
| Cosine similarity ||||| \\(O(m + n)\\) | \\(O(m + n)\\) | |
| Jaccard index ||||| \\(O(m + n)\\) | \\(O(m + n)\\) | |
| Jaro-Winkler ||||| \\(O(m + n)\\) | \\(O(m \\times n)\\) | typo correction |
| N-Gram ||||| | \\(O(m \\times n)\\) | |
| Q-Gram ||||| | \\(O(m + n)\\) | |
| Ratcliff-Obershelp ||||| \\(O(m + n)\\) | \\(O(n^3)\\) | |
| Sorensen-Dice coefficient ||||| | \\(O(m + n)\\) | |
| Sift 4 ||||| \\(O(m + n)\\) | \\(O(m + n)\\) | |

<h2 class="footnotes-header">Notes</h2>
<div class="footnotes">
<ol>
<li id="footnote-a">

In this library, Levenshtein edit distance, LCS distance and their siblings are computed using the dynamic
programming method, which has a cost \\(O(m \\times n)\\).
For Levenshtein distance, the algorithm is sometimes called Wagner-Fischer algorithm.[@ref-1]
The original algorithm uses a matrix of size \\(m \\times n\\) to store the Levenshtein distance between string
prefixes.

If the alphabet is finite, it is possible to use the "Four-Russians" technique[@ref-2] to speed up computation,
as shown by Masek and Paterson.[@ref-3]
This method splits the matrix into blocks of size \\(t \\times t\\).
Each possible block is precomputed to produce a lookup table.
This lookup table can then be used to compute the string similarity (or distance) in \\(O(\\frac{n \\times m}{t})\\).
Usually, \\(t\\) is chosen as \\(\\text{log}(m)\\) if \\(m > n\\).
The resulting computation cost is thus \\(O(\\frac{m \\times n}{\\text{log}(m)})\\).
This method has not been implemented (yet).
</li>
<li id="footnote-b">

K.S. Larsen proposed an algorithm that computes the length of LCS in time
\\(O(log(m) \\times log(n))\\).[@ref-4] But the algorithm has a memory requirement \\(O(m \\times n^2)\\) and was thus not
implemented here.
</li>
<li id="footnote-c">

There are two variants of Damerau-Levenshtein string distance: Damerau-Levenshtein with adjacent transpositions
(also sometimes called unrestricted Damerau–Levenshtein distance) and Optimal String Alignment (also sometimes called
restricted edit distance). For Optimal String Alignment, no substring can be edited more than once.
</li>
</ol>
</div>
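
The dynamic programming method described in the first note can be sketched roughly as follows. This is an illustrative sketch only: `wagnerFischer` is a hypothetical name rather than this library's API, and it assumes unit costs for insertion, deletion, and substitution. The table has one extra row and column for the empty prefix, so it holds \\((m + 1) \\times (n + 1)\\) entries.

```kotlin
// Illustrative sketch; `wagnerFischer` is a hypothetical helper, not part of the library.
fun wagnerFischer(a: String, b: String): Int {
    val m = a.length
    val n = b.length
    // dist[i][j] holds the edit distance between the first i chars of a and the first j chars of b.
    val dist = Array(m + 1) { IntArray(n + 1) }
    for (i in 0..m) dist[i][0] = i // delete all i characters of the prefix of a
    for (j in 0..n) dist[0][j] = j // insert all j characters of the prefix of b
    for (i in 1..m) {
        for (j in 1..n) {
            // Assumed unit cost: 0 when the characters match, 1 otherwise.
            val substitution = if (a[i - 1] == b[j - 1]) 0 else 1
            dist[i][j] = minOf(
                dist[i - 1][j] + 1,               // deletion
                dist[i][j - 1] + 1,               // insertion
                dist[i - 1][j - 1] + substitution // match or substitution
            )
        }
    }
    return dist[m][n]
}
```

Both nested loops visit every cell once, which is where the \\(O(m \\times n)\\) time and memory costs in the table come from.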

<h2 class="references-header">References</h2>
<div class="references">
<ol>
<li id="reference-1">

Wagner, R. A., & Fischer, M. J. (1974-01). The string-to-string correction problem.
*Journal of the ACM*, *21*(1), 168–173.
<https://doi.org/10.1145/321796.321811><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1145/321796.321811)</sup>
</li>
<li id="reference-2">

Arlazarov, V. L., Dinitz, Y. A., Kronrod, M. A., & Faradzhev, I. (1970).
On economic construction of the transitive closure of a directed graph.
*Soviet Mathematics Doklady*, *194*(3), 487-488.
</li>
<li id="reference-3">

Masek, W. J., & Paterson, M. S. (1980-02). A faster algorithm computing string
edit distances. *Journal of Computer and System Sciences*, *20*(1), 18-31.
<https://doi.org/10.1016/0022-0000(80)90002-1><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1016/0022-0000(80)90002-1)</sup>
</li>
<li id="reference-4">

Larsen, K. S. (1992-10). Length of maximal common subsequences. *DAIMI Report
Series*, *21*(426).
<https://doi.org/10.7146/dpb.v21i426.6740><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.7146/dpb.v21i426.6740)</sup>
</li>
</ol>
</div>

@@ -33,10 +33,11 @@ import ca.solostudios.stringsimilarity.util.minMaxOf
import kotlin.math.sqrt

/**
* Implements Soft Cosine Similarity between strings. The strings are first
* transformed in vectors of occurrences of k-shingles (sequences of k
* characters). In this n-dimensional space, the similarity between the two
* strings is the Cosine of their respective vectors.
* Implements Soft Cosine Similarity between strings.
*
* The strings are first transformed into vectors of occurrences of k-shingles
* (sequences of k characters). In this n-dimensional space, the similarity
* between the two strings is the Cosine of their respective vectors.
*
* The Cosine similarity between strings \(X\) and \(Y\) is
* the Cosine of the angle between the two strings as vectors. It is computed as:
@@ -32,6 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance
import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity

/**
* Implements the Jaccard index, also known as the Jaccard similarity coefficient (Jaccard, 1912).
*
* Each input string is converted into a set of n-grams; the Jaccard index is
* then computed as \(\frac{\lVert V_1 \cap V_2 \rVert}{\lVert V_1 \cup V_2 \rVert}\).
* Like Q-Gram distance, the input strings \(X\) and \(Y\) are first converted into sets of
@@ -41,6 +43,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
* The distance is computed as
* \(1 - similarity(X, Y)\).
*
* #### References
* Jaccard, P. (1912-02). The distribution of the flora in the alpine zone.
* *New Phytologist*, *11*(2), 37–50.
* <https://doi.org/10.1111/j.1469-8137.1912.tb05611.x><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1111/j.1469-8137.1912.tb05611.x)</sup>
*
* @see MetricStringDistance
* @see NormalizedStringDistance
* @see NormalizedStringSimilarity
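
The set-based definition in the comment above can be sketched as follows. This is a minimal illustration, not the library's implementation: `jaccardIndex` and `jaccardDistance` are hypothetical names, and returning 1.0 when both n-gram sets are empty is an assumption made here to avoid dividing by zero.

```kotlin
// Illustrative sketch; these helpers are hypothetical and not part of the library.
fun jaccardIndex(x: String, y: String, n: Int = 3): Double {
    // windowed(n) yields every substring of length n (none if the string is shorter than n).
    val v1 = x.windowed(n).toSet()
    val v2 = y.windowed(n).toSet()
    val unionSize = (v1 union v2).size
    // Assumption: two empty n-gram sets are treated as identical.
    if (unionSize == 0) return 1.0
    return (v1 intersect v2).size.toDouble() / unionSize
}

// The distance is 1 - similarity, as stated in the documentation.
fun jaccardDistance(x: String, y: String, n: Int = 3): Double = 1.0 - jaccardIndex(x, y, n)
```

For example, with `n = 2` the strings `"ABCDE"` and `"ABCDF"` share three bigrams (`AB`, `BC`, `CD`) out of five distinct ones, giving an index of 0.6.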
@@ -35,6 +35,8 @@ import kotlin.math.max
import kotlin.math.min

/**
* Implements the Jaro-Winkler distance (Winkler, 1990) between strings.
*
* The Jaro–Winkler distance is designed and best suited for short
* strings such as person names, and to detect typos; it is (roughly) a
* variation of Damerau-Levenshtein, where the substitution of 2 close
@@ -47,6 +49,11 @@ import kotlin.math.min
* The distance is computed as
* \(1 - similarity(X, Y)\).
*
* #### References
* Winkler, W. E. (1990). String comparator metrics and enhanced decision rules
* in the Fellegi-Sunter model of record linkage. *Proceedings of the Survey
* Research Methods Section*, 354-359. <https://eric.ed.gov/?id=ED325505>
*
* @param threshold The threshold value used for adding the Winkler bonus.
*
* @see NormalizedStringDistance
@@ -35,15 +35,20 @@ import ca.solostudios.stringsimilarity.util.min
import ca.solostudios.stringsimilarity.util.minMaxByLength

/**
* N-Gram Similarity as defined by Kondrak, "N-Gram Similarity and Distance",
* String Processing and Information Retrieval, Lecture Notes in Computer
* Science Volume 3772, 2005, pp 115-126.
* Implements the N-Gram Similarity (Kondrak, 2005) between strings.
*
* The algorithm uses affixing with special character '\0' to increase the
* The algorithm uses affixing with special character `'\0'` to increase the
* weight of first characters. The normalization is achieved by dividing the
* total similarity score by the original length of the longest word.
*
* [N-Gram Similarity and Distance](http://webdocs.cs.ualberta.ca/~kondrak/papers/spire05.pdf)
* The similarity is computed as
* \(1 - distance(X, Y)\).
*
* #### References
* Kondrak, G. (2005-11-02). N-gram similarity and distance. In String processing
* and information retrieval, lecture notes in computer science (Pages 115-126).
* Springer Berlin Heidelberg.
* <https://doi.org/10.1007/11575832_13><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1007/11575832_13)</sup>
*
* @see NormalizedStringDistance
* @see NormalizedStringSimilarity
@@ -32,10 +32,8 @@ import ca.solostudios.stringsimilarity.interfaces.StringDistance
import kotlin.math.abs

/**
* Q-gram distance, as defined by
* Esko Ukkonen. Bo, "Approximate string-matching with q-grams and maximal matches", in Theoretical Computer Science,
* vol. 92, no. 1, pp. 191-211, Elsevier BV, Jan. 1992, pp. 191–211, doi: 10.1016/0304-3975(92)90143-4.
* <sup>[&#91;sci-hub&#93;](https://sci-hub.st/https://doi.org/10.1016/0304-3975(92)90143-4)</sup>
* Implements the Q-gram distance (Ukkonen, 1992) between strings.
*
* The distance between two strings is defined as
* the number of occurrences of different q-grams in each string:
* \(\sum_{i=1}^n \lVert \vec{v1_i} - \vec{v2_i} \rVert\).
@@ -47,9 +45,14 @@ import kotlin.math.abs
* resulting in \(distance(X, Y) = 0\) where \(X \neq Y\).
* However, it does respect the other 3 axioms.
*
* #### References
* Ukkonen, E. (1992-01). Approximate string matching with q-grams and maximal
* matches. *Theoretical Computer Science*, *92*(1), 191–211.
* <https://doi.org/10.1016/0304-3975(92)90143-4><sup>[&#91;sci-hub&#93;](https://sci-hub.st/10.1016/0304-3975(92)90143-4)</sup>
*
* @param q The length of each q-gram.
*
* @throws IllegalArgumentException if \(k \leqslant 0\)
* @throws IllegalArgumentException if \(q \leqslant 0\)
*
* @author Thibault Debatty, solonovamax
*/
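
The q-gram distance documented above can be sketched as follows. This is a minimal illustration under stated assumptions: `qGramProfile` and `qGramDistance` are hypothetical names, not the library's API, and each profile is modelled as a plain map from q-gram to occurrence count.

```kotlin
import kotlin.math.abs

// Illustrative sketch; these helpers are hypothetical and not part of the library.
// Counts how often each substring of length q occurs in s.
fun qGramProfile(s: String, q: Int): Map<String, Int> =
    s.windowed(q).groupingBy { it }.eachCount()

fun qGramDistance(x: String, y: String, q: Int = 2): Int {
    require(q > 0) { "q must be positive" } // throws IllegalArgumentException otherwise
    val p1 = qGramProfile(x, q)
    val p2 = qGramProfile(y, q)
    // Sum the absolute count differences over every q-gram seen in either string.
    return (p1.keys + p2.keys).sumOf { abs((p1[it] ?: 0) - (p2[it] ?: 0)) }
}
```

With `q = 2`, the strings `"aba"` and `"bab"` both have the profile `{ab: 1, ba: 1}`, so their distance is 0 even though the strings differ. This is the violation of the identity axiom that the documentation mentions.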
@@ -31,7 +31,7 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringDistance
import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity

/**
* Implements Ratcliff/Obershelp pattern recognition, also known as Gestalt pattern matching,
* Implements Ratcliff/Obershelp pattern recognition (Ratcliff & Metzener, 1988), also known as Gestalt pattern matching,
* similarity between strings.
*
* The similarity is defined as
@@ -41,6 +41,11 @@ import ca.solostudios.stringsimilarity.interfaces.NormalizedStringSimilarity
* The distance is computed as
* \(1 - similarity(X, Y)\).
*
* #### References
* Ratcliff, J., & Metzener, D. E. (1988-07-01). Pattern matching: The gestalt
* approach. *Dr. Dobb’s Journal*, *13*(7), 46.
* <https://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970?pgno=5>
*
* @author [Ligi](https://github.com/dxpux), solonovamax, Ported to java from .net by denmase
*/
public class RatcliffObershelp : NormalizedStringSimilarity, NormalizedStringDistance {