[MRG] add sourmash sig grep #1864

Merged: 30 commits, Mar 7, 2022
Changes from 27 commits

Commits
30bf6b9
upgrade 'manifest' documentation, cli help
ctb Mar 5, 2022
f891e11
alias fileinfo to summarize
ctb Mar 5, 2022
4fb5f99
flakes cleanup
ctb Mar 5, 2022
7eab2f6
rescue shadowed tests
ctb Mar 5, 2022
7feaad7
rescue shadowed tests
ctb Mar 5, 2022
31d5586
rescue shadowed tests
ctb Mar 5, 2022
c7b63eb
add 'sig grep' command
ctb Mar 5, 2022
44979e5
add some basic tests
ctb Mar 5, 2022
ebe2334
fix get manifest stuff
ctb Mar 5, 2022
5a311c1
fail on no manifest
ctb Mar 5, 2022
9bbc3f6
check manifest req't
ctb Mar 5, 2022
591c352
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2022
5f6ad7f
test various combinations of zip, -v, -i
ctb Mar 5, 2022
5a2cce5
Merge branch 'latest' of https://github.com/sourmash-bio/sourmash int…
ctb Mar 5, 2022
c19f31e
update with CSV output/manifest
ctb Mar 5, 2022
4ff79cf
added -c/--count
ctb Mar 6, 2022
3baa0e2
adjust output
ctb Mar 6, 2022
d2d600e
test fail extract
ctb Mar 6, 2022
8b0a815
comment tests better
ctb Mar 7, 2022
66232fc
add test for count
ctb Mar 7, 2022
0f248e5
update docs
ctb Mar 7, 2022
9a00d53
remove warnings
ctb Mar 7, 2022
00c3afb
cleanup; create CollectionManifest.filter_rows
ctb Mar 7, 2022
1072608
create CollectionManifest.filter_on_columns
ctb Mar 7, 2022
56a8992
minor cleanup
ctb Mar 7, 2022
4d460c1
Merge branch 'latest' into add/sig_grep
ctb Mar 7, 2022
ef4f33f
Merge branch 'latest' into add/sig_grep
ctb Mar 7, 2022
351c65e
Merge branch 'latest' into add/sig_grep
ctb Mar 7, 2022
13fcfbd
Update src/sourmash/cli/sig/grep.py
ctb Mar 7, 2022
76b2b02
Add a straight up picklist test
ctb Mar 7, 2022
48 changes: 45 additions & 3 deletions doc/command-line.md
Expand Up @@ -962,16 +962,16 @@ Most commands will load signatures automatically from indexed databases
(SBT and LCA formats) as well as from signature files, and you can load
signatures from stdin using `-` on the command line.

-### `sourmash signature cat` - concatenate multiple signatures together
+### `sourmash signature cat` - combine signatures into one file

Concatenate signature files.

For example,
```
-sourmash signature cat file1.sig file2.sig -o all.sig
+sourmash signature cat file1.sig file2.sig -o all.zip
```
will combine all signatures in `file1.sig` and `file2.sig` and put them
-in the file `all.sig`.
+in the file `all.zip`.

### `sourmash signature describe` - display detailed information about signatures

Expand Down Expand Up @@ -1029,6 +1029,48 @@ those formats are under semantic versioning.
Note: `sourmash signature summarize` is an alias for `fileinfo`; they are
the same command.

### `sourmash signature grep` - extract matching signatures using pattern matching

Extract matching signatures with substring and regular expression matching
on the name, filename, and md5 fields.

For example,
```
sourmash signature grep -i shewanella tests/test-data/prot/all.zip -o shew.zip
```
will extract the two signatures in `all.zip` with 'Shewanella baltica'
in their name and save them to `shew.zip`.

`grep` will search for substring matches or regular expressions;
e.g. `sourmash sig grep 'os185|os223' ...` will find matches to either
of those expressions.

Command line options include `-i` for case-insensitive matching, and `-v`
for exclusion rather than inclusion.

A CSV file of the matching sketch information can be saved using
`--csv <outfile>`; this file is in the sourmash manifest format and can be
used as a picklist with `--picklist <outfile>::manifest`.

If `--silent` is specified, `sourmash sig grep` will not output matching
signatures.

`sourmash sig grep` also supports a counting mode, `-c/--count`, in which
only the number of matching sketches in files will be displayed; for example,

```
% sourmash signature grep -ci 'os185|os223' tests/test-data/prot/*.zip
```
will produce the following output:
```
2 matches: tests/test-data/prot/all.zip
0 matches: tests/test-data/prot/dayhoff.sbt.zip
0 matches: tests/test-data/prot/dayhoff.zip
0 matches: tests/test-data/prot/hp.sbt.zip
0 matches: tests/test-data/prot/hp.zip
0 matches: tests/test-data/prot/protein.sbt.zip
0 matches: tests/test-data/prot/protein.zip
```

### `sourmash signature split` - split signatures into individual files

Split each signature in the input file(s) into individual files, with
Expand Down
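The matching behavior described in the documentation diff above (substring or regular-expression search, `-i` for case-insensitive matching, `-v` for exclusion) can be illustrated with Python's `re` module, which the `sig grep` usage text names as its regexp engine. This is a minimal sketch, not sourmash code: `grep_names` is a hypothetical helper, and the second and third names are made up (only the first appears in this PR's tests).

```python
import re

# Hypothetical signature names; only the first one appears in this PR's tests.
names = [
    "NC_009665.1 Shewanella baltica OS185, complete genome",
    "example Shewanella baltica OS223 record",
    "example Escherichia coli record",
]


def grep_names(pattern, names, *, ignore_case=False, invert=False):
    """Return the names that match pattern (or do not match, if invert)."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # search() finds the pattern anywhere in the string, so a plain
    # substring and a regular expression are handled the same way.
    return [n for n in names if bool(regex.search(n)) != invert]


matches = grep_names("os185|os223", names, ignore_case=True)
non_matches = grep_names("os185|os223", names, ignore_case=True, invert=True)
case_sensitive = grep_names("os185|os223", names)
```

With these sample names, the case-insensitive search selects the two Shewanella entries, the inverted search selects only the remaining entry, and the case-sensitive search selects nothing because the names spell the strain IDs in upper case.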
1 change: 1 addition & 0 deletions src/sourmash/cli/sig/__init__.py
Expand Up @@ -13,6 +13,7 @@
from . import flatten
from . import fileinfo
from . import fileinfo as summarize
from . import grep
from . import kmers
from . import intersect
from . import manifest
Expand Down
94 changes: 94 additions & 0 deletions src/sourmash/cli/sig/grep.py
@@ -0,0 +1,94 @@
"""extract one or more signatures by substr/regex match"""

usage="""
sourmash sig grep <pattern> <filename> [... <filenames>]

This will search for the provided pattern in the files or databases,
using the signature metadata, and output matching signatures.
Currently 'grep' searches the 'name', 'filename', and 'md5' fields as
displayed by `sig describe`.

'pattern' can be a string or a regular expression.

'sig grep' uses the built-in Python regexp module, 're', to implement
regexp searching. See https://docs.python.org/3/howto/regex.html and
https://docs.python.org/3/library/re.html for details.

The '-v' (exclude) and '-i' (case-insensitive) options of 'grep' are
supported.

'-o/--output' can be used to output matching signatures to a specific
location.

By default, 'sig grep' requires a pre-existing manifest for collections;
this prevents potentially slow manifest rebuilding. You
can turn this check off with '--no-require-manifest'.

"""

from sourmash.cli.utils import (add_moltype_args, add_ksize_arg,
                                add_picklist_args)


def subparser(subparsers):
    subparser = subparsers.add_parser('grep', usage=usage)
    subparser.add_argument('pattern', help='search pattern (string/regex)')
    subparser.add_argument('signatures', nargs='*')
    subparser.add_argument(
        '-q', '--quiet', action='store_true',
        help='suppress non-error output'
    )
    subparser.add_argument(
        '-d', '--debug', action='store_true',
        help='output debug information'
    )
    subparser.add_argument(
        '-o', '--output', metavar='FILE',
        help='output matching signatures to this file (default stdout)',
        default='-',
    )
    subparser.add_argument(
        '-f', '--force', action='store_true',
        help='try to load all files as signatures, independent of filename'
    )
    subparser.add_argument(
        '--from-file',
        help='a text file containing a list of files to load signatures from'
    )
    subparser.add_argument(
        '-v', '--invert-match',
        help="select non-matching signatures",
        action="store_true"
    )
    subparser.add_argument(
        '-i', '--ignore-case',
        help="ignore case distinctions (search lower and upper case both)",
        action="store_true"
    )
    subparser.add_argument(
        '--no-require-manifest',
        help='do not require a manifest; generate dynamically if needed',
        action='store_true'
    )
    subparser.add_argument(
        '--csv',
        help='save CSV file containing signature data in manifest format'
    )
    subparser.add_argument(
        '--silent', '--no-signatures-output',
        help="do not output signatures",
        action='store_true',
    )
    subparser.add_argument(
        '-c', '--count',
        help="only output a count of discovered signatures; implies --silent",
        action='store_true'
    )
    add_ksize_arg(subparser, 31)
    add_moltype_args(subparser)
    add_picklist_args(subparser)


def main(args):
    import sourmash.sig.grep
    return sourmash.sig.grep.main(args)
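As a rough illustration of how these options behave once parsed, the sketch below rebuilds a subset of them with plain `argparse`. It is a stand-alone approximation, not the sourmash parser: `build_parser` and the program name `grep-sketch` are hypothetical, and the ksize, moltype, and picklist arguments are left out. It shows that short flags can be combined (`-ci`, as in the documentation example) and that `--no-signatures-output` is stored under the `silent` attribute.

```python
import argparse


def build_parser():
    # Mirrors a subset of the options in the diff above; not the full parser.
    p = argparse.ArgumentParser(prog="grep-sketch")
    p.add_argument("pattern", help="search pattern (string/regex)")
    p.add_argument("signatures", nargs="*")
    p.add_argument("-o", "--output", metavar="FILE", default="-")
    p.add_argument("-v", "--invert-match", action="store_true")
    p.add_argument("-i", "--ignore-case", action="store_true")
    p.add_argument("-c", "--count", action="store_true")
    # argparse derives the attribute name from the first long option: 'silent'
    p.add_argument("--silent", "--no-signatures-output", action="store_true")
    return p


args = build_parser().parse_args(["-ci", "os185|os223", "a.zip", "b.zip"])

# The PR's sig/grep.py main() forces --silent when --count is given;
# the same rule is applied here by hand.
if args.count:
    args.silent = True
```

After parsing, `args.count` and `args.ignore_case` are both set, `args.invert_match` is not, and `args.output` keeps its default of `-` (stdout).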
21 changes: 21 additions & 0 deletions src/sourmash/manifest.py
Expand Up @@ -179,6 +179,27 @@ def select_to_manifest(self, **kwargs):
        new_rows = self._select(**kwargs)
        return CollectionManifest(new_rows)

    def filter_rows(self, row_filter_fn, *, invert=False):
        "Create a new manifest filtered through row_filter_fn."
        if invert:
            orig_row_filter_fn = row_filter_fn
            row_filter_fn = lambda x: not orig_row_filter_fn(x)

        new_rows = [ row for row in self.rows if row_filter_fn(row) ]

        return CollectionManifest(new_rows)

    def filter_on_columns(self, col_filter_fn, col_names, *, invert=False):
        "Create a new manifest based on column matches."
        def row_filter_fn(row):
            for col in col_names:
                val = row[col]
                if val is not None:
                    if col_filter_fn(val):
                        return True
            return False
        return self.filter_rows(row_filter_fn, invert=invert)

    def locations(self):
        "Return all distinct locations."
        seen = set()
Expand Down
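To make the semantics of `filter_rows` and `filter_on_columns` easier to follow, here is a minimal sketch that applies the same logic to a list of dictionaries. `MiniManifest` is a hypothetical stand-in written for this example, not the real `CollectionManifest`, and the rows are invented. A row matches if any listed column with a non-`None` value passes the filter, and `invert` negates the row-level result rather than each column test.

```python
import re


class MiniManifest:
    """Hypothetical stand-in for CollectionManifest: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def filter_rows(self, row_filter_fn, *, invert=False):
        # With invert, keep exactly the rows the filter would have dropped.
        if invert:
            keep = lambda row: not row_filter_fn(row)
        else:
            keep = row_filter_fn
        return MiniManifest(row for row in self.rows if keep(row))

    def filter_on_columns(self, col_filter_fn, col_names, *, invert=False):
        def row_filter_fn(row):
            # A row matches if any non-None column value passes the filter.
            return any(row[col] is not None and col_filter_fn(row[col])
                       for col in col_names)
        return self.filter_rows(row_filter_fn, invert=invert)


# Invented rows for illustration only.
rows = [
    {"name": "Shewanella baltica OS185", "filename": None, "md5": "aaa111"},
    {"name": None, "filename": "os223.fa", "md5": "bbb222"},
    {"name": "Escherichia coli", "filename": "ecoli.fa", "md5": "ccc333"},
]
pattern = re.compile("os185|os223", re.IGNORECASE)
columns = ["name", "filename", "md5"]

hits = MiniManifest(rows).filter_on_columns(pattern.search, columns)
misses = MiniManifest(rows).filter_on_columns(pattern.search, columns,
                                              invert=True)
```

Passing the bound method `pattern.search` as the column filter is the same approach `sig grep` takes in its main loop; the second row matches through its `filename` even though its `name` is `None`.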
1 change: 1 addition & 0 deletions src/sourmash/sig/__init__.py
@@ -1 +1,2 @@
from .__main__ import main
from . import grep
129 changes: 129 additions & 0 deletions src/sourmash/sig/grep.py
@@ -0,0 +1,129 @@
"""
Command-line entry point for 'python -m sourmash.sig grep'
"""
import sys
import re

from sourmash import logging, sourmash_args
from sourmash.logging import notify, error, debug, print_results
from sourmash.manifest import CollectionManifest
from .__main__ import _extend_signatures_with_from_file


def main(args):
    """
    extract signatures by pattern match.
    """
    # basic argument parsing
    logging.set_quiet(args.quiet, args.debug)
    moltype = sourmash_args.calculate_moltype(args)
    picklist = sourmash_args.load_picklist(args)
    _extend_signatures_with_from_file(args)

    # build the search pattern
    pattern = args.pattern
    if args.ignore_case:
        pattern = re.compile(pattern, re.IGNORECASE)
    else:
        pattern = re.compile(pattern)

    # require manifests?
    require_manifest = True
    if args.no_require_manifest:
        require_manifest = False
        debug("sig grep: manifest will not be required")
    else:
        debug("sig grep: manifest required")

    # are we doing --count? if so, enforce --silent so no sigs are printed.
    if args.count:
        args.silent = True

    # define output type: signatures, or no?
    if args.silent:
        notify("(no signatures will be saved because of --silent/--count).")
        save_sigs = sourmash_args.SaveSignaturesToLocation(None)
    else:
        notify(f"saving matching signatures to '{args.output}'")
        save_sigs = sourmash_args.SaveSignaturesToLocation(args.output)
        save_sigs.open()

    # are we outputting a CSV? if so, initialize that, too.
    csv_obj = None
    if args.csv:
        csv_obj = sourmash_args.FileOutputCSV(args.csv)
        csv_fp = csv_obj.open()
        CollectionManifest.write_csv_header(csv_fp)

    # start loading!
    total_rows_examined = 0
    for filename in args.signatures:
        idx = sourmash_args.load_file_as_index(filename,
                                               yield_all_files=args.force)

        idx = idx.select(ksize=args.ksize,
                         moltype=moltype,
                         picklist=picklist)

        # get (and maybe generate) the manifest.
        manifest = idx.manifest
        if manifest is None:
            if require_manifest:
                error(f"ERROR on filename '{filename}'.")
                error("sig grep requires a manifest by default, but no manifest present.")
                error("specify --no-require-manifest to dynamically generate one.")
                sys.exit(-1)
            else:
                manifest = sourmash_args.get_manifest(idx,
                                                      require=False)

        # find all matching rows.
        sub_manifest = manifest.filter_on_columns(pattern.search,
                                                  ["name", "filename", "md5"],
                                                  invert=args.invert_match)
        total_rows_examined += len(manifest)

        # write out to CSV, if desired.
        if args.csv:
            sub_manifest.write_to_csv(csv_fp)

        # just print out number of matches?
        if args.count:
            print_results(f"{len(sub_manifest)} matches: {filename}")
        elif not args.silent:
            # nope - do output signatures. convert manifest to picklist, apply.
            sub_picklist = sub_manifest.to_picklist()

            try:
                idx = idx.select(picklist=sub_picklist)
            except ValueError:
                error("** This input collection doesn't support 'grep' with picklists.")
                error("** EXITING.")
                error("**")
                error("** You can use 'sourmash sig cat' with a picklist,")
                error("** and then pipe the output to 'sourmash sig grep -")
                sys.exit(-1)

            # save!
            for ss in idx.signatures():
                save_sigs.add(ss)
    # done with the big loop over all indexes!

    if args.silent:
        pass
    else:
        notify(f"loaded {total_rows_examined} total that matched ksize & molecule type")

        if save_sigs:
            notify(f"extracted {len(save_sigs)} signatures from {len(args.signatures)} file(s)")
            save_sigs.close()
        else:
            error("no matching signatures found!")
            sys.exit(-1)

    if args.csv:
        notify(f"wrote manifest containing all matches to CSV file '{args.csv}'")
        csv_obj.close()

    if picklist:
        sourmash_args.report_picklist(args, picklist)
Comment on lines +128 to +129
Contributor

does this need a test for grep + picklist? I see one for when this will fail (LCA)...

Contributor Author

ooh, nice catch! test_sig_grep_7_picklist_md5_lca_fail tests the failure, but there's no test for the successful use of --picklist. Will add.

Contributor Author

added in 76b2b02

Contributor

looks good!

10 changes: 5 additions & 5 deletions tests/test_cmd_signature.py
Expand Up @@ -1448,11 +1448,10 @@ def test_sig_extract_8_picklist_md5_zipfile(runtmp):
     assert "for given picklist, found 1 matches to 1 distinct values" in err


-def test_sig_extract_8_picklist_md5_lca(runtmp):
-    # extract 47 from an LCA database, using a picklist w/full md5
+def test_sig_extract_8_picklist_md5_lca_fail(runtmp):
+    # try to extract 47 from an LCA database, using a picklist w/full md5; will
+    # fail.
     allzip = utils.get_test_data('lca/47+63.lca.json')
-    sig47 = utils.get_test_data('47.fa.sig')
-    sig63 = utils.get_test_data('63.fa.sig')

     # select on any of these attributes
     row = dict(exactName='NC_009665.1 Shewanella baltica OS185, complete genome',
Expand All @@ -1472,7 +1471,8 @@ def test_sig_extract_8_picklist_md5_lca(runtmp):
     with pytest.raises(SourmashCommandFailed) as exc:
         runtmp.sourmash('sig', 'extract', allzip, '--picklist', picklist_arg)

-    # this happens b/c the implementation of 'extract' uses picklists.
+    # this happens b/c the implementation of 'extract' uses picklists, and
+    # LCA databases don't support multiple picklists.
     print(runtmp.last_result.err)
     assert "This input collection doesn't support 'extract' with picklists." in runtmp.last_result.err

Expand Down