reid is a collection of tools designed to facilitate full-text searches of
academic papers maintained in an EndNote library for specific words or
phrases. The reid project currently consists of the following programs:
- reid-enxml parses an XML-formatted library file exported from EndNote. This may be used to display information contained in the XML file, or to convert the XML into a "reid project" file that the remainder of the reid tools can work with.
- reid-convert iterates through all, or a specified subset, of records stored in a "reid project" file and converts their associated PDFs into "minified" text files, which can then be searched with the reid-search program. Upon successfully converting files, the "reid project" file is updated to reflect the location of the "minified" text files.
- reid-search searches the previously "minified" text files for one or more terms specified on the command line. Simple terms and phrases can be specified, in addition to regular expressions. The search can be performed over all files contained in the "reid project" file, or limited by year range, author, or publication. For each matched input term, this program outputs the number of occurrences observed in the corresponding source material. By default, information about matches is printed to the terminal. However, the format of this output can be changed to CSV or JSON, and the data can be written to a file.
The following are required to build and use reid:
- Linux. Tesseract reportedly misbehaves on OS X, which has not been tested.
- Go 1.7 or later. (Earlier versions have not been tested.)
- tesseract: tesseract-eng, libtesseract, libtesseract-dev
- poppler utilities: pdfimages, pdftotext
In general, one uses the reid tools in the order listed above. This section
briefly explains how to use the tools. Information about building the tools from
source is shown in a later section.
For additional usage information, such as available flags and arguments, run
the tools with the --help command line option.
First, export entries from your EndNote library to an XML file. This can be done
via the File > Export... option in EndNote. In the window that appears,
be sure to specify:
- XML (*.xml) as the File Type
- All Fields as the output format
Also note the checkbox that allows you to export only the selected items. Check this only if you are trying to create a "reid project" file from a subset of your library.
While the exported XML library files are human-readable text, they are
dense and tough to review manually. The reid-enxml program's
"show" command can be used to summarize the contents of the XML file. For
large library exports, it may take a few seconds to load the XML file.
For the following examples, assume that an EndNote library has been exported
to a mylib.xml file.
List all records contained in the XML
As this is likely to be a large list, you can pipe the output of reid-enxml
to the less program and then use the Page Up/Down or arrow keys to navigate
the list.
$ reid-enxml -x mylib.xml show all | less
List all publications
Perhaps you want a sorted list of all the publications within a library?
This can be done using show publications and piping the output to the sort
program.
$ reid-enxml -x mylib.xml show publications | sort | less
List all authors
It is also possible to list all authors contained in the library. However, the accuracy of this is only as good as the metadata from PDFs imported into your EndNote library. It is likely that different publications will specify names differently -- including or omitting a middle initial, or possibly including a full first name instead of an initial.
Listing and sorting authors' names may be helpful in determining if this is the case in your library.
$ reid-enxml -x mylib.xml show authors | sort
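One way to spot such variants is to group the listed names by surname and first initial. The Python sketch below is only an illustration and is not part of the reid tools (which are written in Go). It assumes that each name is formatted as "Surname, Given", as in the sample search output later in this document; group_author_variants is a hypothetical helper name.

```python
from collections import defaultdict


def group_author_variants(names):
    """Group names that share a surname and first initial.

    Assumes each name looks like "Surname, Given" (an assumption about
    the listing format, not something the reid tools guarantee).
    Returns only the groups that contain more than one distinct spelling.
    """
    groups = defaultdict(set)
    for name in names:
        surname, _, given = name.partition(",")
        given = given.strip()
        # Key on lower-cased surname plus the first letter of the given name.
        key = (surname.strip().lower(), given[:1].lower())
        groups[key].add(name.strip())
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}
```

Names that land in the same group are only candidates for being the same person. Two different authors who share a surname and first initial are grouped as well, so the result still needs a manual review.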
List all years covered by the library contents
When surveying the use of terms over a period of time, it is often useful
to first confirm that you've exported the correct date ranges before starting
your searches. Use the show years command to quickly verify the years
that the exported library covers.
$ reid-enxml -x mylib.xml show years
Only a subset of information from the EndNote library XML file is required.
Additionally, the reid tools need to know some information, such as
where your converted text files are (to be) stored. For that reason, a "reid
project" file and an associated data directory must be provided.
The following command converts mylib.xml to a myproject.json file and
creates a mydata directory. This directory will be empty for now, but will
later be used to store the text extracted from PDF files.
$ reid-enxml -x mylib.xml create myproject.json mydata
Before being able to search PDF documents with reid, we must first extract
the text from them. When converting to text, reid-convert also "minifies"
this text to make it easier to search. In the context of the reid project,
this "minification" consists of:
- Reformatting text to a single line and re-joining hyphenated text.
- Removing references, URLs, and punctuation.
- Converting any remaining white space to a single space (" ").
- Converting text to lower-case. (As a result, searches are case-insensitive.)
- Removing quotation marks.
Currently, all of the above is performed by default. If this interferes with your ability to search a document, please submit a feature request on the Issue Tracker with respect to tuning the various "minifications" that are performed.
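The steps above can be sketched roughly as follows. This Python snippet is an illustrative approximation, not the reid-convert implementation. The exact rules, their order, and the choice to replace punctuation with a space rather than delete it are all assumptions, and the removal of reference sections is omitted entirely. The name minify is hypothetical.

```python
import re
import string


def minify(text):
    """Approximate the "minification" steps on a block of extracted text."""
    # Re-join words hyphenated across a line break: "boot-\nloader" -> "bootloader".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Drop URLs first, before punctuation removal would break them apart.
    text = re.sub(r"(https?://|www\.)\S+", " ", text)
    # Replace punctuation, including ASCII and typographic quotes, with spaces.
    punct = string.punctuation + "\u2018\u2019\u201c\u201d"
    text = text.translate(str.maketrans(punct, " " * len(punct)))
    # Collapse all whitespace (including newlines) to single spaces; lower-case.
    return " ".join(text.split()).lower()


sample = 'The "Boot-\nloader" loads;\n see http://example.com/x.pdf  NOW.'
print(minify(sample))  # the bootloader loads see now
```

Because the result is a single lower-case line with single spaces, a search only needs to consider one kind of word boundary, which is what keeps the search patterns simple.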
The following command will convert all the PDFs associated with entries
in the previously created project file to "minified" text, storing these
text files in the mydata directory.
The --debug argument is optional; you can specify this to view some
additional information about what the program is currently doing. The process
generally takes a few seconds (or less) per PDF, so depending upon the
size of your EndNote library, this may be a good time to go make yourself
a cup of your favorite hot beverage.
$ reid-convert -p myproject.json --debug
Note that reid-convert also supports converting only a specified set
of PDF files. This is largely for debugging purposes and is not expected to
be terribly useful to "end users." Run reid-convert --help for the
available options.
With all that done, we can finally search our entire library for various terms and phrases. Below is a simple and fictional example, in which we search for all the occurrences of the term "bootloader":
$ reid-search -p myproject.json -t bootloader
By default, the results are "pretty printed" to the console:
Query: bootloader
Occurrences: 7
Year: 2015
Publication: The Journal Of The Internet Of Things That Shouldn't Be
Author(s): Goudaspeed, T. / Lidrey, S. / Grandious, J. / Ritz, J.
Title: Cross-Platform ROP Gadget Polyglots for ARM, MIPS, and PIC32
Query: bootloader
Occurrences: 13
Year: 2021
Publication: Phrack #70
Author(s): Laphroaig, M.
Title: Inserting Backdoors Into Black Box Firmware For Fun and Profit
But what if some works use the term "boot loader" (with a space) instead of "bootloader"? The same search can be performed, but with an additional term:
$ reid-search -p myproject.json -t bootloader -t 'boot loader'
Query: bootloader
Occurrences: 7
Year: 2015
Publication: The Journal Of The Internet Of Things That Shouldn't Be
Author(s): Goudaspeed, T. / Lidrey, S. / Grandious, J. / Ritz, J.
Title: Cross-Platform ROP Gadget Polyglots for ARM, MIPS, and PIC32
Query: boot loader
Occurrences: 42
Year: 1996
Publication: Real-time and Embedded Systems
Author(s): Smith, J.
Title: Modern Boot Loader Design
Query: bootloader
Occurrences: 13
Year: 2021
Publication: Phrack #70
Author(s): Laphroaig, M.
Title: Inserting Backdoors Into Black Box Firmware For Fun and Profit
The same search can also be expressed as a single regular expression,
using the --regexp/-r option instead of the --term/-t option:
$ reid-search -p myproject.json -r 'boot ?loader'
Query: regexp{boot ?loader}
Occurrences: 7
Year: 2015
Publication: The Journal Of The Internet Of Things That Shouldn't Be
Author(s): Goudaspeed, T. / Lidrey, S. / Grandious, J. / Ritz, J.
Title: Cross-Platform ROP Gadget Polyglots for ARM, MIPS, and PIC32
Query: regexp{boot ?loader}
Occurrences: 42
Year: 1996
Publication: Real-time and Embedded Systems
Author(s): Smith, J.
Title: Modern Boot Loader Design
Query: regexp{boot ?loader}
Occurrences: 13
Year: 2021
Publication: Phrack #70
Author(s): Laphroaig, M.
Title: Inserting Backdoors Into Black Box Firmware For Fun and Profit
It is important to note that one "Occurrences" count will be listed for everything matched by the regular expression, not each individual term or phrase. This can be very handy in cases where you want to group occurrences of similar terms or phrases. For example, the following will report the total number of occurrences of either "architecture" or "architectural".
$ reid-search -p myproject.json -r 'architectur(e|al)'
When using regular expressions, be aware that it's up to you to specify
where whitespace or the beginning/end of a document may occur. In the
previous example, results for "microarchitecture" or "architecture-independent"
would be included. If you only want an exact match for "architecture" then
you would need to specify (^| )architecture( |$). In fact, this is
exactly what the -t/--term option is doing with the provided text.
(Remember, the minified text is converted to lower case and all white space
is converted to a single space ' ' character.)
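The difference between the two options can be illustrated with a short Python sketch. It is not the reid-search implementation: Python's re module stands in for whatever regular expression engine the tools use, and count_regexp and count_term are hypothetical helpers. It simply applies the patterns described above to a piece of already-minified text.

```python
import re


def count_regexp(minified_text, pattern):
    """Count non-overlapping matches of a user-supplied pattern."""
    return sum(1 for _ in re.finditer(pattern, minified_text))


def count_term(minified_text, term):
    """Count a literal term delimited by spaces or the document boundaries."""
    pattern = r"(^| )" + re.escape(term) + r"( |$)"
    return count_regexp(minified_text, pattern)


text = ("the microarchitecture differs but the architecture "
        "and architectural goals match")

# The bare pattern also matches inside "microarchitecture".
print(count_regexp(text, r"architectur(e|al)"))  # 3
# The delimited term only matches the stand-alone word.
print(count_term(text, "architecture"))          # 1
```

One consequence of delimiting with spaces is that each match consumes its surrounding spaces. In this sketch, a term repeated back-to-back (such as "very very") shares a space and is therefore counted once. Whether reid-search behaves the same way is not stated here.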
In all likelihood, you won't always want to search the entire library. Instead, you might want to search articles from a specific publication, or over a limited number of years.
The --from/-F and --to/-T arguments can be used to specify the earliest
and latest years to include in searches.
The --publication/-P argument can be used to specify a publication to search.
This argument can be specified multiple times to include multiple publications
in the search.
Searches can also be limited by author, using the --author/-a argument. This too
can be specified multiple times to include multiple authors in the search.
Below is an example of these arguments in action:
$ reid-search --from 1990 --to 2001 \
--publication 'Real-time and Embedded Systems' \
--publication 'Circuit Cellar' \
--publication 'IEEE Xplore' \
--author 'Smith, J.' \
--author 'Turing, A.' \
--author 'Lovelace, A.' \
-r 'boot ?loader'
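How these filters might combine can be sketched in Python. This is an assumption-laden illustration, not the reid-search implementation. It assumes that the year bounds are inclusive, that repeated publication or author values widen the selection, that the different kinds of filters must all be satisfied, and that names are compared as exact strings. The record layout and the name matches_filters are hypothetical.

```python
def matches_filters(record, year_from=None, year_to=None,
                    publications=(), authors=()):
    """Return True if a record passes every filter that was supplied.

    `record` is a hypothetical dict with "year", "publication" and
    "authors" keys; it is not the actual "reid project" file layout.
    """
    if year_from is not None and record["year"] < year_from:
        return False
    if year_to is not None and record["year"] > year_to:
        return False
    # Any one of the listed publications is enough (assumed behavior).
    if publications and record["publication"] not in publications:
        return False
    # Any one of the listed authors is enough (assumed behavior).
    if authors and not any(a in record["authors"] for a in authors):
        return False
    return True


records = [
    {"year": 1996, "publication": "Real-time and Embedded Systems",
     "authors": ["Smith, J."]},
    {"year": 2021, "publication": "Phrack #70",
     "authors": ["Laphroaig, M."]},
]

selected = [r for r in records
            if matches_filters(r, year_from=1990, year_to=2001,
                               publications=("Real-time and Embedded Systems",
                                             "Circuit Cellar"),
                               authors=("Smith, J.", "Turing, A."))]
print(len(selected))  # 1
```

Under these assumptions, only the 1996 record is selected: the 2021 record falls outside the year range.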
The default "pretty print" output may be helpful for getting a quick sense of
results, but is not well-suited for aggregating a large set of results. Instead,
the --format/-f argument may be used to change the output format. This may be used
in conjunction with the --outfile/-o argument to write results to a file.
The available formats are:
- pretty: Simple descriptive format. Not well-suited for automated parsing. (Default)
- csv: Comma-separated values with quoted strings. Can be imported into tools like Excel.
- csv-no-hdr: Same as csv, but without a header row.
- json: JavaScript Object Notation. This is the best option if you want to work with the data programmatically.
For more information, run reid-search --help.
Binary releases of this software are not yet provided; the reid tools
must be built from source. (You can always ask me politely for a build,
of course!)
Running make from the top-level directory will run the go get calls
needed to fetch and build dependencies.
Upon completion, the reid tools will be located in the top-level directory.
Copy or move these into a location within your ${PATH}.
This software is released under version 3.0 of the GNU General Public License. The text of this license may be found in the COPYING file.
This software is developed and maintained on an as-needed basis, in the author's spare time. As such, no support for these tools is officially offered.
Please use the Issue Tracker only to report defects in the software. General support or usage questions will be immediately closed.
With that being said, the author considers himself a fairly decent human being.
If you're really struggling to use these tools, feel free to send an email
(the address can be found in the git commit log), and he'll probably find time
to lend a helping hand. Hint: Amazon gift cards and beer money are always appreciated. ;)
"EndNote" is a registered trademark owned by Clarivate Analytics. The author of
the reid tools is not affiliated with Clarivate Analytics.
To the best of his knowledge, this software has been developed in a manner that
is consistent with the EndNote End User License Agreement; the reid tools only
process user-exported XML files, and do not utilize any Clarivate-owned applications,
libraries, or SDKs. No reverse engineering of the EndNote software was performed to
develop these tools; reid-enxml simply parses the self-explanatory,
human-readable, user-exported library XML files.
However, the author is neither a lawyer nor an actor that plays one on TV. The
user of this software is responsible for ensuring their usage of the reid
software is consistent with the EndNote End User License Agreement.
The reid tools were developed using a very small sample of XML files output
by EndNote X7. They may not adequately support XML files produced by
other versions of the software. Furthermore, the sample XML files used to
develop these tools contained only journal articles. As such, changes to
the reid tools may be required if one's exported library contains other types
of published works.
Finally, the reid tools were developed on a best-effort basis. The author
takes no responsibility for, and makes no guarantees of, the correctness of the
data output by these tools. Users are ultimately responsible for ensuring the
validity and correctness of their data and results.