GnCrossmap

This gem crossmaps a checklist of scientific names to names from a data source in GN Resolver.

Checklist has to be in a CSV format.

Compatibility

This gem is compatible with Ruby 2.1.0 and higher.

Installation

Add this line to your application's Gemfile:

gem 'gn_crossmap'

And then execute:

bundle

Or install it yourself as:

gem install gn_crossmap

Usage

Usage as a Web Application

See the [gn_crossmap_web] project.

Usage from command line

# to see help
crossmap --help

# to compare with default source (Catalogue of Life)
crossmap -i my_list.csv -o my_list_col.csv

# to compare with other source (Index Fungorum in this example)
crossmap -i my_list.csv -o my_list_if.csv -d 5

# to use standard input and/or output
cat my_list.csv | crossmap -i - -o - > output

# to keep only taxonID (if given) from original input
# no original fields will be kept without taxonID
crossmap -i my_list.csv -s

# to show classification from the source
crossmap -i my_list.csv -w

Usage as Ruby Library (API description)

`GnCrossmap.run`

Compares an input list to a data source from GN Resolver and writes result into an output file.

opts = { input: input, output: output, data_source_id: 1,
         skip_original: true, alt_headers: [] }
GnCrossmap.run(opts)

input : (string) either a path to a CSV file with a list of names, or "-", which designates STDIN

output : (string) either a path to the output file, or "-", which designates STDOUT

data_source_id : (integer) id of a data source from GN Resolver

skip_original : (boolean) if true, only taxonID (if given) is preserved from the original data; otherwise all original data is preserved. If there is no taxonID, no original data is preserved.

alt_headers : (array) empty array by default. If alt_headers is not empty, its values are used instead of the headers supplied with the file

resolver_url : (string) URL of the Global Names resolver. Default is http://resolver.globalnames.org

with_classification : (boolean) if true, adds the classification path to the output
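The options above can be collected and checked before calling `GnCrossmap.run`. The sketch below is only an illustration: `build_crossmap_opts` and `DEFAULT_OPTS` are hypothetical names that are not part of the gem, and the defaults marked as assumed are guesses rather than documented values.

```ruby
# Hypothetical helper, not part of gn_crossmap: fills in defaults and
# checks option types before the hash is handed to GnCrossmap.run.
DEFAULT_OPTS = {
  data_source_id: 1,          # assumed default (Catalogue of Life)
  skip_original: false,       # assumed default
  alt_headers: [],            # documented default
  resolver_url: "http://resolver.globalnames.org", # documented default
  with_classification: false  # assumed default
}.freeze

def build_crossmap_opts(user_opts)
  opts = DEFAULT_OPTS.merge(user_opts)
  [:input, :output].each do |key|
    unless opts[key].is_a?(String)
      raise ArgumentError, "#{key} must be a path or \"-\""
    end
  end
  unless opts[:data_source_id].is_a?(Integer)
    raise ArgumentError, "data_source_id must be an integer"
  end
  unless opts[:alt_headers].is_a?(Array)
    raise ArgumentError, "alt_headers must be an array"
  end
  opts
end

opts = build_crossmap_opts({ input: "-", output: "-", data_source_id: 5 })
puts opts[:resolver_url]
# With the gem loaded, the hash would then be passed on:
# GnCrossmap.run(opts)
```

Failing early on a missing `input` or `output` keeps the error close to the caller instead of surfacing somewhere inside the resolution cycle.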

`GnCrossmap.logger=`

Sets a custom logger (by default, log messages go to STDERR).
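As a rough sketch of what a custom logger could look like, the example below configures Ruby's standard `Logger`. It assumes the gem only calls standard `Logger` methods such as `info`; that is not confirmed here. A `StringIO` target keeps the example self-contained, and the log message is invented.

```ruby
require "logger"
require "stringio"

# Any IO-like object works as a Logger target; StringIO keeps the
# example self-contained. In practice this could be a file.
log_target = StringIO.new
custom_logger = Logger.new(log_target)
custom_logger.progname = "crossmap"
custom_logger.formatter = proc do |severity, _time, progname, msg|
  "[#{progname}] #{severity}: #{msg}\n"
end

# With the gem loaded, the logger would be installed like this:
# GnCrossmap.logger = custom_logger

custom_logger.info("Ingesting 100 records") # invented message
puts log_target.string
```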

Usage Example

require "gn_crossmap"

# If you want to change the logger (by default, logging goes to standard error)

GnCrossmap.logger = MyCustomLogger.new

opts = { input: "path/to/input.csv", output: "path/to/output.csv",
         data_source_id: 5, skip_original: true }
GnCrossmap.run(opts)

# if you want to use alternative headers instead of ones supplied in a file

opts = { input: "path/to/input.csv", output: "path/to/output.csv",
         data_source_id: 5, skip_original: true,
         alt_headers: %w(taxonId scientificName rank) }
GnCrossmap.run(opts)

If you want to get intermediate statistics for each resolution cycle use a block:

GnCrossmap.run(opts) do |stats|
  puts stats
  puts "Matches:"
  stats[:matches].each do |key, value|
    puts "#{GnCrossmap::MATCH_TYPES[key]}: #{value}"
  end
end

To terminate the resolution before it is completed, return the string "STOP" from the block:

GnCrossmap.run(opts) do
  # do something and then return STOP string from the block
  "STOP"
end
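One possible way to decide when to return "STOP" is to look at the stats passed to the block. The sketch below uses a hypothetical helper, `stop_after`, which is not part of the gem. It assumes that any return value other than "STOP" lets the resolution continue, and that `resolved_records` is available under a symbol key.

```ruby
# Hypothetical helper: builds a block that asks for termination once
# `limit` names have been processed. It returns nil otherwise, on the
# assumption that only the string "STOP" ends the resolution.
def stop_after(limit)
  proc do |stats|
    stats[:resolved_records].to_i >= limit ? "STOP" : nil
  end
end

handler = stop_after(500)
# With the gem loaded: GnCrossmap.run(opts, &handler)

puts handler.call({ resolved_records: 200 }).inspect
puts handler.call({ resolved_records: 600 }).inspect
```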

Intermediate stats format

Field	Description
status	current phase: (init, ingested
total_records	total number of names in original list
ingestion_start	time when the reading from csv started
ingestion_span	time of intermediate checkpoint of reading csv
ingested_records	number of ingested records at an intermediate checkpoint
resolution_start	time when resolution of names started
resolution_stop	time when resolution of names stopped
resolution_span	time of intermediate checkpoint of resolving names
resolved_records	number of names already processed
speed	weighted speed average
current_speed	speed of resolution for the last batch
matches	distribution of processed data by match type (see below)
errors	up to the first 10 errors found while processing the CSV file
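The fields above can be condensed into a one-line progress report. The following is a minimal sketch: `progress_line` is a hypothetical helper, the sample values are invented, and the stats hash is assumed to use symbol keys as in the usage example.

```ruby
# Hypothetical formatter for the stats hash yielded to the block.
def progress_line(stats)
  total = stats[:total_records].to_i
  done = stats[:resolved_records].to_i
  # Avoid dividing by zero before the total is known.
  percent = total.zero? ? 0.0 : (100.0 * done / total)
  format("%s: %d/%d names (%.1f%%), %d errors",
         stats[:status], done, total, percent,
         Array(stats[:errors]).size)
end

# Invented sample values, for illustration only.
sample = { status: :ingested, total_records: 2000,
           resolved_records: 500, errors: [] }
puts progress_line(sample)
```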

Match types

The dictionary of match types can be accessed through the GnCrossmap::MATCH_TYPES constant.

Match code	Match type
0	No match
1	Exact string match
2	Canonical form exact match
3	Canonical form fuzzy match
4	Partial canonical form match
5	Partial canonical form fuzzy match
6	Genus part match
7	Error in matching
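To show how the codes relate to the `matches` distribution in the stats, here is a small sketch. `MATCH_TYPE_LABELS` is a local copy of the table above so that the code runs without the gem; with the gem loaded, `GnCrossmap::MATCH_TYPES` would presumably serve the same purpose. The helper `match_summary`, the integer keys, and the counts are assumptions for illustration.

```ruby
# Local copy of the match type table, so the sketch runs without the gem.
MATCH_TYPE_LABELS = {
  0 => "No match",
  1 => "Exact string match",
  2 => "Canonical form exact match",
  3 => "Canonical form fuzzy match",
  4 => "Partial canonical form match",
  5 => "Partial canonical form fuzzy match",
  6 => "Genus part match",
  7 => "Error in matching"
}.freeze

# Turns a { code => count } hash into "label: count (share%)" lines,
# ordered by match code.
def match_summary(matches)
  total = matches.values.reduce(0, :+)
  matches.sort.map do |code, count|
    share = total.zero? ? 0 : (100.0 * count / total).round
    "#{MATCH_TYPE_LABELS.fetch(code, 'Unknown')}: #{count} (#{share}%)"
  end
end

# Invented counts, for illustration only.
puts match_summary({ 1 => 60, 0 => 10, 3 => 30 })
```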

Input file format

A delimited text file with field names in the first row.
Columns can be separated by tabs, commas, or semicolons.
At least some column names should match the recognized fields:

taxonID kingdom phylum class order family genus species subspecies variety form scientificNameAuthorship scientificName taxonRank
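Since a checklist may use tabs, commas, or semicolons, a script that prepares or inspects input files has to work out the separator first. The sketch below shows one simple heuristic using Ruby's standard `csv` library. It is not taken from the gem, which may detect separators differently, and `detect_separator` and `read_checklist` are hypothetical names.

```ruby
require "csv"

SEPARATORS = ["\t", ";", ","].freeze

# Picks the candidate separator that occurs most often in the header row.
def detect_separator(header_line)
  SEPARATORS.max_by { |sep| header_line.count(sep) }
end

# Parses checklist text into an array of { header => value } hashes.
def read_checklist(text)
  header_line = text.lines.first.to_s
  table = CSV.parse(text, headers: true,
                          col_sep: detect_separator(header_line))
  table.map(&:to_h)
end

sample = "taxonID;scientificName\n" \
         "1;Macrobiotus echinogenitus subsp. areolatus Murray, 1907\n"
rows = read_checklist(sample)
puts rows.first["scientificName"]
```

Looking only at the header row avoids being misled by commas inside authorship strings such as "Murray, 1907".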

Simplest Example -- only scientificName

scientificName
Animalia
Macrobiotus echinogenitus subsp. areolatus Murray, 1907

taxonID and scientificName Example

taxonID;scientificName
1;Macrobiotus echinogenitus subsp. areolatus Murray, 1907
...

taxonID	scientificName
1	Animalia
2	Macrobiotus echinogenitus subsp. areolatus Murray, 1907

Rank Example

taxonID;scientificName;taxonRank
1;Macrobiotus echinogenitus f. areolatus Murray, 1907;form
...

taxonID	scientificName	taxonRank
1	Animalia	kingdom
2	Macrobiotus echinogenitus subsp. areolatus Murray, 1907	subspecies

Family and Authorship Example

taxonID;family;scientificName;scientificNameAuthorship
1;Macrobiotidae;Macrobiotus echinogenitus subsp. areolatus;Murray, 1907
...

taxonID	family	scientificName	scientificNameAuthorship
1		Animalia
2	Macrobiotidae	Macrobiotus echinogenitus	Murray

Fine-grained Example

TaxonId;kingdom;subkingdom;phylum;subphylum;superclass;class;subclass;cohort;superorder;order;suborder;infraorder;superfamily;family;subfamily;tribe;subtribe;genus;subgenus;section;species;subspecies;variety;form;ScientificNameAuthorship
1;Animalia;;Tardigrada;;;Eutardigrada;;;;Parachela;;;Macrobiotoidea;Macrobiotidae;;;;Macrobiotus;;;harmsworthi;obscurus;;;Dastych, 1985

TaxonId	kingdom	phylum	class	order	superfamily	family	genus	species	subspecies	ScientificNameAuthorship
136021	Animalia	Pogonophora
136022	Animalia	Pogonophora	Frenulata							Webb, 1969
565443	Animalia	Tardigrada	Eutardigrada	Parachela	Macrobiotoidea	Macrobiotidae	Macrobiotus	harmsworthi	obscurus	Dastych, 1985

More examples can be found in the spec/files directory.

Output file format

The output includes the following fields:

Field	Description
classification	classification path of the data source (if available)
editDistance	for fuzzy-matching -- how many characters differ between checklist and data source name
matchSize	number of returned matches for a name
matchType	what kind of match it is
matchedCanonicalForm	canonical form of the matched name
matchedRank	corresponding rank from the data source
matchedScientificName	name matched from the GN Resolver data source
rank	rank from the source (if it was given/inferred)
scientificName	name from the checklist
score	heuristic score from 0 to 1, where 1 is a good match and a score of 0.5 calls for further human investigation
taxonID	original ID attached to a name in the checklist

Types of Matches

The output format returns 7 possible types of matches:

Exact string match - The exact name was matched (but ignoring non-ascii characters)
Exact match by canonical form of a name - The canonical form of the name (a version of a scientific name that contains complete versions of the Latin words, but lacks insertions of subtaxa, annotations, or authority information) was matched
Fuzzy match by canonical form - The canonical form gave a fuzzy match (fuzzy matching detects lexical or spelling variations of a name using Tony Rees' TAXAMATCH algorithm)
Partial exact match by species part of canonical form - The canonical form returned a partial but exact match
Partial fuzzy match by species part of canonical form - The canonical form returned a partial, fuzzy match
Exact match by genus part of a canonical form - The genus part of the canonical form of the species name returned an exact match
[Blank] - No match

Development

After checking out the repo, run bin/setup to install dependencies. Then, run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release to create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Contributing

Fork it ( https://github.com/[my-github-username]/gn_crossmap/fork )
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Copyright

Author -- Dmitry Mozzherin
