[MRG] add sourmash sig grep (#1864)
* upgrade 'manifest' documentation, cli help

* alias fileinfo to summarize

* flakes cleanup

* rescue shadowed tests

* rescue shadowed tests

* rescue shadowed tests

* add 'sig grep' command

* add some basic tests

* fix get manifest stuff

* fail on no manifest

* check manifest req't

* test various combinations of zip, -v, -i

* update with CSV output/manifest

* added -c/--count

* adjust output

* test fail extract

* comment tests better

* add test for count

* update docs

* remove warnings

* cleanup; create CollectionManifest.filter_rows

* create CollectionManifest.filter_on_columns

* minor cleanup

* Update src/sourmash/cli/sig/grep.py

Co-authored-by: Tessa Pierce Ward <[email protected]>

* Add a straight up picklist test

Co-authored-by: Tessa Pierce Ward <[email protected]>
ctb and bluegenes authored Mar 7, 2022
1 parent cccd06c commit ba38d14
Showing 8 changed files with 650 additions and 8 deletions.
48 changes: 45 additions & 3 deletions doc/command-line.md
@@ -962,16 +962,16 @@ Most commands will load signatures automatically from indexed databases
(SBT and LCA formats) as well as from signature files, and you can load
signatures from stdin using `-` on the command line.

### `sourmash signature cat` - concatenate multiple signatures together
### `sourmash signature cat` - combine signatures into one file

Concatenate signature files.

For example,
```
sourmash signature cat file1.sig file2.sig -o all.sig
sourmash signature cat file1.sig file2.sig -o all.zip
```
will combine all signatures in `file1.sig` and `file2.sig` and put them
in the file `all.sig`.
in the file `all.zip`.

### `sourmash signature describe` - display detailed information about signatures

@@ -1029,6 +1029,48 @@ those formats are under semantic versioning.
Note: `sourmash signature summarize` is an alias for `fileinfo`; they are
the same command.

### `sourmash signature grep` - extract matching signatures using pattern matching

Extract matching signatures with substring and regular expression matching
on the name, filename, and md5 fields.

For example,
```
sourmash signature grep -i shewanella tests/test-data/prot/all.zip -o shew.zip
```
will extract the two signatures in `all.zip` with 'Shewanella baltica'
in their name and save them to `shew.zip`.

`grep` will search for substring matches or regular expressions;
e.g. `sourmash sig grep 'os185|os223' ...` will find matches to either
of those expressions.

Command line options include `-i` for case-insensitive matching, and `-v`
for exclusion rather than inclusion.
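
As a rough illustration of the matching rules described above, here is a standalone sketch that uses only Python's `re` module (the module the command's help text says `sig grep` relies on). The `select` helper and the third sample name are hypothetical and are not part of sourmash; the sketch only mirrors how `-i` and `-v` are described here.

```python
import re

# Sample names for illustration only; the last one is made up.
NAMES = [
    "NC_009665.1 Shewanella baltica OS185, complete genome",
    "NC_011663.1 Shewanella baltica OS223, complete genome",
    "Hypothetical unrelated genome",
]


def select(names, pattern, ignore_case=False, invert=False):
    """Keep names where the pattern is found anywhere (or not found, if invert)."""
    regex = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    # regex.search matches anywhere in the string, so plain substrings work too.
    return [name for name in names if bool(regex.search(name)) != invert]


# Case-sensitive: the names contain 'OS185', not 'os185', so nothing matches.
print(select(NAMES, "os185|os223"))
# With ignore_case (like -i), either alternative matches.
print(select(NAMES, "os185|os223", ignore_case=True))
# With invert (like -v), only the non-matching name is kept.
print(select(NAMES, "shewanella", ignore_case=True, invert=True))
```

Because the pattern is applied with `search` rather than a full-string match, a plain word such as `shewanella` behaves as a substring query, while `|` and other regular expression syntax remain available.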

A CSV file of the matching sketch information can be saved using
`--csv <outfile>`; this file is in the sourmash manifest format and can be used as a picklist with `--pickfile <outfile>::manifest`.

If `--silent` is specified, `sourmash sig grep` will not output matching
signatures.

`sourmash sig grep` also supports a counting mode, `-c/--count`, in which
only the number of matching sketches in files will be displayed; for example,

```
% sourmash signature grep -ci 'os185|os223' tests/test-data/prot/*.zip
```
will produce the following output:
```
2 matches: tests/test-data/prot/all.zip
0 matches: tests/test-data/prot/dayhoff.sbt.zip
0 matches: tests/test-data/prot/dayhoff.zip
0 matches: tests/test-data/prot/hp.sbt.zip
0 matches: tests/test-data/prot/hp.zip
0 matches: tests/test-data/prot/protein.sbt.zip
0 matches: tests/test-data/prot/protein.zip
```

### `sourmash signature split` - split signatures into individual files

Split each signature in the input file(s) into individual files, with
1 change: 1 addition & 0 deletions src/sourmash/cli/sig/__init__.py
@@ -13,6 +13,7 @@
from . import flatten
from . import fileinfo
from . import fileinfo as summarize
from . import grep
from . import kmers
from . import intersect
from . import manifest
94 changes: 94 additions & 0 deletions src/sourmash/cli/sig/grep.py
@@ -0,0 +1,94 @@
"""extract one or more signatures by substr/regex match"""

usage="""
sourmash sig grep <pattern> <filename> [... <filenames>]
This will search for the provided pattern in the files or databases,
using the signature metadata, and output matching signatures.
Currently 'grep' searches the 'name', 'filename', and 'md5' fields as
displayed by `sig describe`.
'pattern' can be a string or a regular expression.
'sig grep' uses the built-in Python regexp module, 're', to implement
regexp searching. See https://docs.python.org/3/howto/regex.html and
https://docs.python.org/3/library/re.html for details.
The '-v' (exclude), '-i' (case-insensitive), and `-c` (count) options of 'grep' are
supported.
'-o/--output' can be used to output matching signatures to a specific
location.
By default, 'sig grep' requires a pre-existing manifest for collections;
this prevents potentially slow manifest rebuilding. You
can turn this check off with '--no-require-manifest'.
"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
                                add_picklist_args)


def subparser(subparsers):
    subparser = subparsers.add_parser('grep', usage=usage)
    subparser.add_argument('pattern', help='search pattern (string/regex)')
    subparser.add_argument('signatures', nargs='*')
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-d', '--debug', action='store_true',
        help='output debug information'
    )
    subparser.add_argument(
        '-o', '--output', metavar='FILE',
        help='output matching signatures to this file (default stdout)',
        default='-',
    )
    subparser.add_argument(
        '-f', '--force', action='store_true',
        help='try to load all files as signatures, independent of filename'
    )
    subparser.add_argument(
        '--from-file',
        help='a text file containing a list of files to load signatures from'
    )
    subparser.add_argument(
        '-v', '--invert-match',
        help="select non-matching signatures",
        action="store_true"
    )
    subparser.add_argument(
        '-i', '--ignore-case',
        help="ignore case distinctions (search lower and upper case both)",
        action="store_true"
    )
    subparser.add_argument(
        '--no-require-manifest',
        help='do not require a manifest; generate dynamically if needed',
        action='store_true'
    )
    subparser.add_argument(
        '--csv',
        help='save CSV file containing signature data in manifest format'
    )
    subparser.add_argument(
        '--silent', '--no-signatures-output',
        help="do not output signatures",
        action='store_true',
    )
    subparser.add_argument(
        '-c', '--count',
        help="only output a count of discovered signatures; implies --silent",
        action='store_true'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash.sig.grep
    return sourmash.sig.grep.main(args)
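
The documentation example invokes the command with the combined short flags `-ci`. The sketch below is a minimal, standalone `argparse` parser (not the sourmash subparser itself; the `build_parser` name, the `grep-sketch` program name, and the `a.zip`/`b.zip` inputs are made up) that reuses a few of the option definitions above to show how such an invocation would be parsed, and how `--count` can be made to imply `--silent` afterwards.

```python
import argparse


def build_parser():
    """Hypothetical cut-down parser with a subset of the 'sig grep' options."""
    parser = argparse.ArgumentParser(prog="grep-sketch")
    parser.add_argument("pattern", help="search pattern (string/regex)")
    parser.add_argument("signatures", nargs="*")
    parser.add_argument("-v", "--invert-match", action="store_true")
    parser.add_argument("-i", "--ignore-case", action="store_true")
    # The attribute name comes from the first long option: args.silent.
    parser.add_argument("--silent", "--no-signatures-output",
                        action="store_true")
    parser.add_argument("-c", "--count", action="store_true")
    return parser


# argparse lets single-letter store_true flags be combined, so '-ci' sets both.
args = build_parser().parse_args(["-ci", "os185|os223", "a.zip", "b.zip"])

# Counting mode implies that no signatures are written out.
if args.count:
    args.silent = True

print(args.pattern, args.signatures, args.count, args.ignore_case, args.silent)
```

Handling the "implies" relationship after parsing keeps each option definition simple; the parser itself does not need to know that the two flags interact.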
21 changes: 21 additions & 0 deletions src/sourmash/manifest.py
@@ -179,6 +179,27 @@ def select_to_manifest(self, **kwargs):
        new_rows = self._select(**kwargs)
        return CollectionManifest(new_rows)

    def filter_rows(self, row_filter_fn, *, invert=False):
        "Create a new manifest filtered through row_filter_fn."
        if invert:
            orig_row_filter_fn = row_filter_fn
            row_filter_fn = lambda x: not orig_row_filter_fn(x)

        new_rows = [ row for row in self.rows if row_filter_fn(row) ]

        return CollectionManifest(new_rows)

    def filter_on_columns(self, col_filter_fn, col_names, *, invert=False):
        "Create a new manifest based on column matches."
        def row_filter_fn(row):
            for col in col_names:
                val = row[col]
                if val is not None:
                    if col_filter_fn(val):
                        return True
            return False
        return self.filter_rows(row_filter_fn, invert=invert)

    def locations(self):
        "Return all distinct locations."
        seen = set()
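
To make the behavior of the two new methods easier to try out in isolation, here is a self-contained sketch. `MiniManifest` is a hypothetical stand-in that holds plain dict rows; it is not the real `CollectionManifest`, and the row values are invented. It follows the same logic as the diff: a row matches if any of the named columns has a non-`None` value accepted by the column filter, and `invert` flips the row-level decision.

```python
import re


class MiniManifest:
    """Hypothetical stand-in for CollectionManifest: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def filter_rows(self, row_filter_fn, *, invert=False):
        """Return a new manifest with rows accepted (or rejected) by the filter."""
        if invert:
            keep = lambda row: not row_filter_fn(row)
        else:
            keep = row_filter_fn
        return MiniManifest(row for row in self.rows if keep(row))

    def filter_on_columns(self, col_filter_fn, col_names, *, invert=False):
        """Return a new manifest of rows where any named column matches."""
        def row_filter_fn(row):
            # Skip None values: pattern.search(None) would raise TypeError.
            return any(row[col] is not None and col_filter_fn(row[col])
                       for col in col_names)
        return self.filter_rows(row_filter_fn, invert=invert)


# Invented rows; real manifests carry more columns than these three.
rows = [
    {"name": "Shewanella baltica OS185", "filename": None, "md5": "aaa111"},
    {"name": "Shewanella baltica OS223", "filename": "os223.fa", "md5": "bbb222"},
    {"name": "unrelated", "filename": "other.fa", "md5": "ccc333"},
]
manifest = MiniManifest(rows)
pattern = re.compile("os185|os223", re.IGNORECASE)
columns = ["name", "filename", "md5"]

hits = manifest.filter_on_columns(pattern.search, columns)
misses = manifest.filter_on_columns(pattern.search, columns, invert=True)
print(len(hits), len(misses))
```

Note that inversion is applied to the whole row rather than per column, so with `invert=True` a row is kept only when none of the searched columns match.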
1 change: 1 addition & 0 deletions src/sourmash/sig/__init__.py
@@ -1 +1,2 @@
from .__main__ import main
from . import grep
129 changes: 129 additions & 0 deletions src/sourmash/sig/grep.py
@@ -0,0 +1,129 @@
"""
Command-line entry point for 'python -m sourmash.sig grep'
"""
import sys
import re

from sourmash import logging, sourmash_args
from sourmash.logging import notify, error, debug, print_results
from sourmash.manifest import CollectionManifest
from .__main__ import _extend_signatures_with_from_file


def main(args):
    """
    extract signatures by pattern match.
    """
    # basic argument parsing
    logging.set_quiet(args.quiet, args.debug)
    moltype = sourmash_args.calculate_moltype(args)
    picklist = sourmash_args.load_picklist(args)
    _extend_signatures_with_from_file(args)

    # build the search pattern
    pattern = args.pattern
    if args.ignore_case:
        pattern = re.compile(pattern, re.IGNORECASE)
    else:
        pattern = re.compile(pattern)

    # require manifests?
    require_manifest = True
    if args.no_require_manifest:
        require_manifest = False
        debug("sig grep: manifest will not be required")
    else:
        debug("sig grep: manifest required")

    # are we doing --count? if so, enforce --silent so no sigs are printed.
    if args.count:
        args.silent = True

    # define output type: signatures, or no?
    if args.silent:
        notify("(no signatures will be saved because of --silent/--count).")
        save_sigs = sourmash_args.SaveSignaturesToLocation(None)
    else:
        notify(f"saving matching signatures to '{args.output}'")
        save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
    save_sigs.open()

    # are we outputting a CSV? if so, initialize that, too.
    csv_obj = None
    if args.csv:
        csv_obj = sourmash_args.FileOutputCSV(args.csv)
        csv_fp = csv_obj.open()
        CollectionManifest.write_csv_header(csv_fp)

    # start loading!
    total_rows_examined = 0
    for filename in args.signatures:
        idx = sourmash_args.load_file_as_index(filename,
                                               yield_all_files=args.force)

        idx = idx.select(ksize=args.ksize,
                         moltype=moltype,
                         picklist=picklist)

        # get (and maybe generate) the manifest.
        manifest = idx.manifest
        if manifest is None:
            if require_manifest:
                error(f"ERROR on filename '{filename}'.")
                error("sig grep requires a manifest by default, but no manifest present.")
                error("specify --no-require-manifest to dynamically generate one.")
                sys.exit(-1)
            else:
                manifest = sourmash_args.get_manifest(idx,
                                                      require=False)

        # find all matching rows.
        sub_manifest = manifest.filter_on_columns(pattern.search,
                                                  ["name", "filename", "md5"],
                                                  invert=args.invert_match)
        total_rows_examined += len(manifest)

        # write out to CSV, if desired.
        if args.csv:
            sub_manifest.write_to_csv(csv_fp)

        # just print out number of matches?
        if args.count:
            print_results(f"{len(sub_manifest)} matches: {filename}")
        elif not args.silent:
            # nope - do output signatures. convert manifest to picklist, apply.
            sub_picklist = sub_manifest.to_picklist()

            try:
                idx = idx.select(picklist=sub_picklist)
            except ValueError:
                error("** This input collection doesn't support 'grep' with picklists.")
                error("** EXITING.")
                error("**")
                error("** You can use 'sourmash sig cat' with a picklist,")
                error("** and then pipe the output to 'sourmash sig grep -")
                sys.exit(-1)

            # save!
            for ss in idx.signatures():
                save_sigs.add(ss)
    # done with the big loop over all indexes!

    if args.silent:
        pass
    else:
        notify(f"loaded {total_rows_examined} total that matched ksize & molecule type")

        if save_sigs:
            notify(f"extracted {len(save_sigs)} signatures from {len(args.signatures)} file(s)")
            save_sigs.close()
        else:
            error("no matching signatures found!")
            sys.exit(-1)

    if args.csv:
        notify(f"wrote manifest containing all matches to CSV file '{args.csv}'")
        csv_obj.close()

    if picklist:
        sourmash_args.report_picklist(args, picklist)
10 changes: 5 additions & 5 deletions tests/test_cmd_signature.py
@@ -1448,11 +1448,10 @@ def test_sig_extract_8_picklist_md5_zipfile(runtmp):
    assert "for given picklist, found 1 matches to 1 distinct values" in err


def test_sig_extract_8_picklist_md5_lca(runtmp):
    # extract 47 from an LCA database, using a picklist w/full md5
def test_sig_extract_8_picklist_md5_lca_fail(runtmp):
    # try to extract 47 from an LCA database, using a picklist w/full md5; will
    # fail.
    allzip = utils.get_test_data('lca/47+63.lca.json')
    sig47 = utils.get_test_data('47.fa.sig')
    sig63 = utils.get_test_data('63.fa.sig')

    # select on any of these attributes
    row = dict(exactName='NC_009665.1 Shewanella baltica OS185, complete genome',
@@ -1472,7 +1471,8 @@ def test_sig_extract_8_picklist_md5_lca(runtmp):
    with pytest.raises(SourmashCommandFailed) as exc:
        runtmp.sourmash('sig', 'extract', allzip, '--picklist', picklist_arg)

    # this happens b/c the implementation of 'extract' uses picklists.
    # this happens b/c the implementation of 'extract' uses picklists, and
    # LCA databases don't support multiple picklists.
    print(runtmp.last_result.err)
    assert "This input collection doesn't support 'extract' with picklists." in runtmp.last_result.err
