---
output: md_document
title: Detect Text Reuse and Document Similarity
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
suppressPackageStartupMessages(library(dplyr))
```
# textreuse
[![CRAN_Status_Badge](http://www.r-pkg.org/badges/version/textreuse)](https://cran.r-project.org/package=textreuse)
[![CRAN_Downloads](http://cranlogs.r-pkg.org/badges/grand-total/textreuse)](https://cran.r-project.org/package=textreuse)
[![Build Status](https://travis-ci.org/ropensci/textreuse.svg?branch=master)](https://travis-ci.org/ropensci/textreuse)
[![Build status](https://ci.appveyor.com/api/projects/status/9qwf0473xi8cyuoh/branch/master?svg=true)](https://ci.appveyor.com/project/lmullen/textreuse-6xljc/branch/master)
[![Coverage Status](https://img.shields.io/codecov/c/github/ropensci/textreuse/master.svg)](https://codecov.io/github/ropensci/textreuse?branch=master)
[![rOpenSci badge](https://badges.ropensci.org/20_status.svg)](https://github.com/ropensci/onboarding/issues/20)
## Overview
This [R](https://www.r-project.org/) package provides a set of functions for measuring similarity among documents and detecting passages which have been reused. It implements shingled n-gram, skip n-gram, and other tokenizers; similarity/dissimilarity functions; pairwise comparisons; minhash and locality sensitive hashing algorithms; and a version of the Smith-Waterman local alignment algorithm suitable for natural language. It is useful, for example, for detecting duplicate documents in a corpus prior to text analysis, or for identifying borrowed passages between texts. The classes provided by this package follow the model of other natural language processing packages for R, especially the [NLP](https://cran.r-project.org/package=NLP) and [tm](https://cran.r-project.org/package=tm) packages. (However, this package has no dependency on Java, which should make it easier to install.)
### Citation
If you use this package for scholarly research, I would appreciate a citation.
```{r}
citation("textreuse")
```
## Installation
To install this package from CRAN:
```{r eval=FALSE}
install.packages("textreuse")
```
To install the development version from GitHub, use [devtools](https://github.com/hadley/devtools).
```{r eval=FALSE}
# install.packages("devtools")
devtools::install_github("ropensci/textreuse", build_vignettes = TRUE)
```
## Examples
There are three main approaches that one may take when using this package: pairwise comparisons, minhashing/locality sensitive hashing, and extracting matching passages through text alignment.
See the [introductory vignette](https://cran.r-project.org/package=textreuse/vignettes/textreuse-introduction.html) for a description of the classes provided by this package.
```{r eval = FALSE}
vignette("textreuse-introduction", package = "textreuse")
```
### Pairwise comparisons
In this example we will load a tiny corpus of three documents. These documents are drawn from Kellen Funk's [research](http://kellenfunk.org/field-code/) into the propagation of legal codes of civil procedure in the nineteenth-century United States.
```{r}
library(textreuse)
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, meta = list(title = "Civil procedure"),
tokenizer = tokenize_ngrams, n = 7)
```
We have loaded the three documents into a corpus, which involves tokenizing the text and hashing the tokens. We can inspect the corpus as a whole or the individual documents that make it up.
```{r}
corpus
names(corpus)
corpus[["ca1851-match"]]
```
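To see what tokenizing and hashing produced for a document, you can look at the stored tokens and their hashes. A minimal sketch, assuming the `tokens()`, `hashes()`, and `wordcount()` accessor functions exported by the package:
```{r eval = FALSE}
# Peek at the shingled n-gram tokens and their hashes for one document
doc <- corpus[["ca1851-match"]]
head(tokens(doc))
head(hashes(doc))
wordcount(doc)
```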
Now we can compare each of the documents to one another. The `pairwise_compare()` function applies a comparison function (in this case, `jaccard_similarity()`) to every pair of documents. The result is a matrix of scores. As we would expect, some documents are similar and others are not.
```{r}
comparisons <- pairwise_compare(corpus, jaccard_similarity)
comparisons
```
We can convert that matrix to a data frame of pairs and scores if we prefer.
```{r}
pairwise_candidates(comparisons)
```
See the [pairwise vignette](https://cran.r-project.org/package=textreuse/vignettes/textreuse-pairwise.html) for a fuller description.
```{r eval=FALSE}
vignette("textreuse-pairwise", package = "textreuse")
```
### Minhashing and locality sensitive hashing
Pairwise comparisons can be very time-consuming because their number grows quadratically with the size of the corpus. (A corpus with 10 documents would require 45 comparisons; a corpus with 100 documents would require 4,950 comparisons; a corpus with 1,000 documents would require 499,500 comparisons.) That's why this package implements the minhash and locality sensitive hashing algorithms, which can detect candidate pairs much faster than pairwise comparisons in corpora of any significant size.
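Those counts are just the number of unordered pairs, n * (n - 1) / 2, which you can verify in base R:
```{r eval = FALSE}
# Number of pairwise comparisons for corpora of various sizes
n <- c(10, 100, 1000)
choose(n, 2)
#> [1]     45   4950 499500
```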
For this example we will load a small corpus of ten documents published by the American Tract Society. We will also create a minhash function, which represents an entire document (regardless of length) by a fixed number of integer hashes. When we create the corpus, the documents will each have a minhash signature.
```{r}
dir <- system.file("extdata/ats", package = "textreuse")
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = dir,
tokenizer = tokenize_ngrams, n = 5,
minhash_func = minhash)
```
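Each document now carries a signature of 200 integer hashes, regardless of its length. A quick check, assuming the `minhashes()` accessor exported by the package:
```{r eval = FALSE}
# A document's signature is a fixed-length vector of integer hashes
sigs <- minhashes(ats[[1]])
length(sigs)
head(sigs)
```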
Now we can calculate potential matches, extract the candidates, and apply a comparison function to just those candidates.
```{r}
buckets <- lsh(ats, bands = 50, progress = FALSE)
candidates <- lsh_candidates(buckets)
scores <- lsh_compare(candidates, ats, jaccard_similarity, progress = FALSE)
scores
```
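The choice of `bands = 50` for 200 minhashes determines, approximately, the lowest Jaccard similarity likely to be flagged as a candidate pair. A sketch of how to reason about that trade-off, assuming the `lsh_threshold()` and `lsh_probability()` helper functions:
```{r eval = FALSE}
# Approximate similarity threshold implied by 200 hashes split into 50 bands
lsh_threshold(h = 200, b = 50)
# Probability that a pair with Jaccard similarity 0.5 is flagged as a candidate
lsh_probability(h = 200, b = 50, s = 0.5)
```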
For details, see the [minhash vignette](https://cran.r-project.org/package=textreuse/vignettes/textreuse-minhash.html).
```{r eval=FALSE}
vignette("textreuse-minhash", package = "textreuse")
```
### Text alignment
We can also extract the optimal alignment between two documents with a version of the [Smith-Waterman](https://en.wikipedia.org/wiki/Smith-Waterman_algorithm) algorithm, used for protein sequence alignment, adapted for natural language. The longest matching substring according to the scoring values will be extracted, and variations in the alignment will be marked.
```{r}
a <- "'How do I know', she asked, 'if this is a good match?'"
b <- "'This is a match', he replied."
align_local(a, b)
```
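The scoring values can be adjusted to reward matches or penalize mismatches and gaps differently. A sketch, assuming `match`, `mismatch`, and `gap` arguments to `align_local()`:
```{r eval = FALSE}
# Adjust the scoring values used by the local alignment
# (match, mismatch, and gap are assumed argument names)
align_local(a, b, match = 2L, mismatch = -1L, gap = -1L)
```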
For details, see the [text alignment vignette](https://cran.r-project.org/package=textreuse/vignettes/textreuse-alignment.html).
```{r eval=FALSE}
vignette("textreuse-alignment", package = "textreuse")
```
### Parallel processing
Loading the corpus and creating tokens benefit from using multiple cores, if available. (This works only on non-Windows machines.) To use multiple cores, set `options("mc.cores" = 4L)`, where the number is how many cores you wish to use.
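For example, to use four cores when loading a corpus (the option only needs to be set once per session):
```{r eval = FALSE}
# Use four cores for corpus loading and tokenization (non-Windows only)
options("mc.cores" = 4L)
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 7)
```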
### Contributing and acknowledgments
Please note that this project is released with a [Contributor Code of Conduct](https://github.com/ropensci/textreuse/blob/master/CONDUCT.md). By participating in this project you agree to abide by its terms.
Thanks to [Noam Ross](http://www.noamross.net/) for his thorough [peer review](https://github.com/ropensci/onboarding/issues/20) of this package for [rOpenSci](https://ropensci.org/).
------------------------------------------------------------------------
[![rOpenSCi logo](http://ropensci.org/public_images/github_footer.png)](http://ropensci.org)