The plan for this package is to have Go implementation of different string distance/similarity functions, like Levenshtein (normalized, weighted, Damerau), Jaro-Winkler, Jaccard index, Euclidean distance, Hamming distance...
Currently it has implemented:
- Levenshtein
- Jaccard
- Hamming
- LCS
- Q-gram
- n-gram based Cosine distanc
Work in progress...
To get the library just run:
go get github.com/dexyk/stringosim
To use the library just import it in your code:
import "github.com/dexyk/stringosim"
To run the tests, go to the directory where stringosim package is installed and run:
go test
Currently only Levenshtein, Jaccard, Hamming, LCS string, Q-gram and Cosine distances are implemented.
Levenshtein distance can be calculated with default parameters (use DefaultSimilarityOptions) where cost of insert, delete and substitute operation are 1. You can also use it with other parameters by using SimilarityOptions type. Setting CaseInsensitive to true in SimilarityOptions the comparison will be done without considering character cases.
Example:
fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim")))
fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("stingobim"),
stringosim.LevenshteinSimilarityOptions{
InsertCost: 3,
DeleteCost: 5,
SubstituteCost: 2,
}))
fmt.Println(stringosim.Levenshtein([]rune("stringosim"), []rune("STRINGOSIM"),
stringosim.LevenshteinSimilarityOptions{
InsertCost: 3,
DeleteCost: 4,
SubstituteCost: 5,
CaseInsensitive: true,
}))
Jaccard distance can be calculated by setting the size of the n-gram which will be used for comparison. If the size is omitted the default value of 1 will be used.
Example:
fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim")))
fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{2}))
fmt.Println(stringosim.Jaccard([]rune("stringosim"), []rune("stingobim"), []int{3}))
Hamming distance can be calculated with options. Default function will calculate standard hamming distance with case sensitive option. It can be also used without case sensitive option.
If the strings to compare have different lengths, the error will be returned.
Example:
dis, _ := stringosim.Hamming([]rune("testing"), []rune("restink"))
fmt.Println(dis)
dis, _ = stringosim.Hamming([]rune("testing"), []rune("FESTING"), stringosim.HammingSimilarityOptions{
CaseInsensitive: true,
})
fmt.Println(dis)
_, err := stringosim.Hamming([]rune("testing"), []rune("testin"))
fmt.Println(err)
LCS between two strings can be calculated with options. Default function will calculate the LCS with case insensitive option. It can be also used without case sensitive option.
Example:
fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("another l c s example")))
fmt.Println(stringosim.LCS([]rune("testing lcs algorithm"), []rune("ANOTHER L C S EXAMPLE"),
stringosim.LCSSimilarityOptions{
CaseInsensitive: true,
}))
Jaro and Jaro-Winkler can be calculated with options: case insensitive, and specific values for Jaro-Winkler - threshold, p value and l value.
Example:
fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
fmt.Println(stringosim.Jaro([]rune("abaccbabaacbcb"), []rune("ABABAbbCABbaACA"),
stringosim.JaroSimilarityOptions{
CaseInsensitive: true,
}))
fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("bababbcabbaaca")))
fmt.Println(stringosim.JaroWinkler([]rune("abaccbabaacbcb"), []rune("BABAbbCABbaACA"),
stringosim.JaroSimilarityOptions{
CaseInsensitive: true,
Threshold: 0.7,
PValue: 0.1,
LValue: 4,
}))
Q-gram distance can be calculated using default options (DefaultQGramOptions): length of q-grams is 2 and comparison is case sensitive. Using QGramSimilarityOptions as the parameter of the function we can set custom q-gram length and if the comparison is case sensitive or not.
Example:
fmt.Println(stringosim.QGram([]rune("abcde"), []rune("abdcde")))
fmt.Println(stringosim.QGram([]rune("abcde"), []rune("ABDCDE"),
stringosim.QGramSimilarityOptions{
CaseInsensitive: true,
NGramSizes: []int{3},
}))
Cosine distance can be calculated using default options (DefaultCosineOptions): length of n-grams is 2 and comparison is case sensitive. Using CosineSimilarityOptions as the parameter of the function we can set custom n-gram length and if the comparison is case sensitive or not.
Example:
fmt.Println(stringosim.Cosine([]rune("abcde"), []rune("abdcde")))
fmt.Println(stringosim.Cosine(Cosine[]rune("abcde"), []rune("ABDCDE"),
stringosim.CosineSimilarityOptions{
CaseInsensitive: true,
NGramSizes: []int{3},
}))