[MRG] Documentation updates for 4.0 release #1283

Merged
merged 28 commits into from
Feb 17, 2021
Changes from 14 commits
36ea292
start adjusting docs
ctb Jan 18, 2021
b9c8bba
add migration links
ctb Jan 18, 2021
d33c66a
switch compute over to sketch in most of the markdown docs
ctb Jan 30, 2021
312a8e2
fix --scaled and --track-abundance thruought
ctb Jan 30, 2021
45d8055
formatting and wording fixes
ctb Jan 30, 2021
58c57af
add sourmash sketch docs
ctb Jan 30, 2021
31c5eda
Merge branch 'latest' of github.com:dib-lab/sourmash into docs_4.0
ctb Feb 8, 2021
32ebec7
substantial update for API examples
ctb Feb 8, 2021
9d15cdd
add ToC to api-example
ctb Feb 8, 2021
97f2cd3
fix heading for API section
ctb Feb 8, 2021
92412c2
bold API examples link
ctb Feb 8, 2021
3916741
(untested) update of tutorials to use sourmash sketch
ctb Feb 8, 2021
5e1f92b
update link targets
ctb Feb 8, 2021
40762f3
updates of indexed databases
ctb Feb 8, 2021
b1d988a
typos in versioning (#1314)
taylorreiter Feb 9, 2021
9da9c18
Apply suggestions from code review
ctb Feb 9, 2021
bf9e950
Update doc/api-example.md
ctb Feb 9, 2021
ab8656d
updated with suggestions from @taylorreiter doc review
ctb Feb 9, 2021
1d6bef1
Merge branch 'docs_4.0' of github.com:dib-lab/sourmash into docs_4.0
ctb Feb 9, 2021
dda99fe
added section on sketch naming
ctb Feb 9, 2021
219e606
[WIP] add migration docs and release notes (#1316)
ctb Feb 9, 2021
b690146
resolve missing link
ctb Feb 9, 2021
cf894c4
update tutorials and notebooks for 4.0
ctb Feb 9, 2021
c802035
[MRG] update the migration guide for 4.0 with version pinning instruc…
ctb Feb 15, 2021
a63d8e9
Merge branch 'latest' into docs_4.0
ctb Feb 16, 2021
a44acf4
Merge branch 'latest' into docs_4.0
ctb Feb 17, 2021
01019dd
Apply suggestions from code review
ctb Feb 17, 2021
71ef18d
Apply suggestions from code review
ctb Feb 17, 2021
16 changes: 11 additions & 5 deletions README.md
Expand Up @@ -14,12 +14,18 @@ Quickly search, compare, and analyze genomic and metagenomic data sets.

Usage:

sourmash compute *.fq.gz
sourmash compare *.sig -o distances
sourmash sketch dna *.fq.gz
sourmash compare *.sig -o distances -k 31
sourmash plot distances

sourmash 1.0 is [published on JOSS](https://doi.org/10.21105/joss.00027); please cite that paper if you use sourmash (`doi: 10.21105/joss.00027`).

The latest major release is sourmash v4, which has several
command-line and Python incompatibilities with previous
versions. Please
[visit our migration guide](https://sourmash.readthedocs.io/en/latest/support.html#migrating-from-sourmash-v3-x-to-sourmash-4-x)
to upgrade!

----

The name is a riff off of [Mash](https://github.com/marbl/Mash),
Expand All @@ -42,7 +48,7 @@ We recommend using bioconda to install sourmash:
```
conda install -c conda-forge -c bioconda sourmash
```
This will install the latest stable version of sourmash 3.
This will install the latest stable version of sourmash 4.

You can also use pip to install sourmash:

Expand Down Expand Up @@ -70,7 +76,7 @@ you can install sourmash by running:
```bash
$ conda create -n sourmash_env -c conda-forge -c bioconda sourmash python=3.7
$ source activate sourmash_env
$ sourmash compute -h
$ sourmash --help
```

which will install
Expand Down Expand Up @@ -107,4 +113,4 @@ on getting set up with a development environment.
----

CTB
July 2020
Jan 2021
195 changes: 164 additions & 31 deletions doc/api-example.md
@@ -1,7 +1,12 @@
# `sourmash` API examples
# `sourmash` Python API examples

## A first example: two k-mers
All of sourmash's functionality is available via its [Python API](api.md). Below are both basic and advanced examples that use the API to accomplish common tasks.

```{contents}
:depth: 2
```

## A first example: two k-mers

Define two sequences:

Expand Down Expand Up @@ -42,7 +47,7 @@ and of course the MinHashes match themselves:

```

We can add sequences and query at any time --
We can add sequences to the MinHash objects and query at any time --

```
>>> mh1.add_sequence(seq2)
Expand All @@ -52,8 +57,63 @@ We can add sequences and query at any time --

```

## Consuming files
## Set operations on hashes

All of the hashes are available via the `hashes` property:

```
>>> list(mh1.hashes)
[1274996984489324440, 2529443451610975987, 3115010115530738562, 5059920851104263793, 5740495330885152257, 8652222673649005300, 18398176440806921933]

```

and you can easily do your own set operations with `.hashes` - e.g.
the following calculates the Jaccard similarity (intersection over union) of two
MinHash objects:
```
>>> s1 = set(mh1.hashes)
>>> s2 = set(mh2.hashes)
>>> round(len(s1 & s2) / len(s1 | s2), 3)
0.571

```
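
The same set arithmetic extends to containment, which the later sections rely on. The following is a minimal sketch in plain Python, not sourmash's own implementation: the helper names `jaccard` and `containment` and the hand-picked hash values are assumptions made for illustration only.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Fraction of the hashes in `a` that are also present in `b`."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


# Toy hash values chosen by hand (not real sourmash output).
query = {1, 2, 3, 4}
subject = {3, 4, 5, 6, 7, 8}

print(round(jaccard(query, subject), 3))      # 2 shared out of 8 distinct -> 0.25
print(round(containment(query, subject), 3))  # 2 shared out of 4 in query -> 0.5
print(round(containment(subject, query), 3))  # 2 shared out of 6 in subject -> 0.333
```

Unlike Jaccard similarity, containment is asymmetric, which is why the two `containment` calls give different values.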
However, the MinHash class also supports a number of basic operations - the following operations work directly on the hashes:
```
>>> combined = mh1 + mh2
>>> combined += mh1
>>> combined.remove_many(mh1.hashes)
>>> combined.add_many(mh2.hashes)

```

You can create an empty copy of a MinHash object with `copy_and_clear`:
```
>>> new_mh = mh1.copy_and_clear()

```

and you can also access the various parameters of a MinHash object directly as properties --
```
>>> mh1.ksize
3
>>> mh1.scaled
0
>>> mh1.num
20
>>> mh1.is_dna
True
>>> mh1.is_protein
False
>>> mh1.dayhoff
False
>>> mh1.hp
False
>>> mh1.moltype
'DNA'

```
See the "Advanced" section, below, for a more complete discussion of MinHash objects.

## Creating MinHash sketches programmatically, from genome files

Suppose we want to create MinHash sketches from genomes --

Expand All @@ -73,7 +133,7 @@ into `add_sequence` directly; here we set `force=True` in `add_sequence`
to skip over k-mers containing characters other than ACTG, rather than
raising an exception.

(Note, just for speed reasons, we'll truncate the sequences to 50kb in length.)
(Note, just for speed reasons, we're truncating the sequences to 50kb in length.)

```
>>> import screed
Expand All @@ -86,7 +146,7 @@ raising an exception.

```

And now the minhashes can be compared against each other:
And now the resulting MinHash objects can be compared against each other:

```
>>> import sys
Expand All @@ -103,7 +163,7 @@ data/GCF_000783305.1 0.0 0.0 1.0
```

Note that the comparisons are quite quick; most of the time is spent in
making the minhashes, which can be saved and loaded easily.
building the minhashes.

## Plotting dendrograms and matrices

Expand All @@ -114,7 +174,7 @@ please see the notebook
## Saving and loading signature files

Signature files encapsulate MinHashes in JSON, and provide a way to
add some metadata to MinHashes.
wrap MinHash objects with some metadata (the name and filename). To save signatures, use `save_signatures` with a list of signatures and a Python file pointer:

```
>>> from sourmash import SourmashSignature, save_signatures
Expand All @@ -127,7 +187,7 @@ add some metadata to MinHashes.
```

Here, `genome1.sig` is a JSON file that can now be loaded and
compared -- first, load:
compared -- first, load it using `load_one_signature`:

```
>>> from sourmash import load_one_signature
Expand All @@ -145,9 +205,24 @@ then compare:

```

## Manipulating signatures and their hashes.
There are two primary signature loading functions - `load_one_signature`, used above, which loads exactly one signature or else raises an exception; and the powerful and more generic `load_file_as_signatures`, which takes in a filename or directory containing a collection of signatures and returns the individual signatures -- for example, you can load all of the signatures under the `tempdir` created above like so,

```
>>> loaded_sigs = list(sourmash.load_file_as_signatures(tempdir))

```

Both `load_file_as_signatures` and `load_one_signature` take molecule type and k-mer size selectors, e.g.
```
>>> loaded_sigs = load_one_signature(tempdir + '/genome1.sig', select_moltype='DNA', ksize=31)

```
will load precisely one signature containing a DNA MinHash created at k-mer size of 31.

## Going from signatures back to MinHash objects and their hashes

It is relatively straightforward to work directly with hashes.
Once you load a signature, you can go back to its MinHash object with
`.minhash`; e.g.

First, load two signatures:

Expand All @@ -159,18 +234,18 @@ First, load two signatures:

```

Then, get the hashes, and (e.g.) compute the union:
Then, get the hashes, and (e.g.) calculate the union:

```
>>> hashes1 = set(sig1.minhash.hashes.keys())
>>> hashes2 = set(sig2.minhash.hashes.keys())
>>> hash_union = hashes1.union(hashes2)
>>> print('{} hashes in union of {} and {}'.format(len(hash_union), len(hashes1), len(hashes2)))
>>> print(f'{len(hash_union)} hashes in union of {len(hashes1)} and {len(hashes2)}')
1000 hashes in union of 500 and 500

```

## sourmash MinHash objects and manipulations
## Advanced features of sourmash MinHash objects - `scaled` and `num`

sourmash supports two basic kinds of signatures, MinHash and modulo hash
signatures. MinHash signatures are equivalent to mash signatures;
Expand All @@ -186,9 +261,7 @@ be collected for a given input data set.
Because of this parameter, below we'll call them 'num' signatures.

Modulo hash (or 'scaled') signatures are specific to sourmash and they
enable an expanded range of metagenome analyses, with the downside
that they can become arbitrarily large. The key parameter for modulo
hash signatures is `scaled`, which specifies the average sampling rate
enable containment operations that are useful for metagenome analyses. The tradeoff is that unlike num MinHashes, they can become arbitrarily large. The key parameter for modulo hash signatures is `scaled`, which specifies the average sampling rate
for hashes for a given input data set. A scaled factor of 1000 means that,
on average, 1 in 1000 k-mers will be turned into a hash for later
comparisons; this is a sort of compression factor, in that a 5 Mbp
Expand Down Expand Up @@ -222,7 +295,7 @@ looking at the `num` and `scaled` attributes on a MinHash object:

The MinHash class is otherwise identical between the two types of signatures.
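
To make the difference between the two selection rules concrete, here is a rough, self-contained sketch in plain Python. It is not how sourmash is implemented: `hash_kmer` uses `hashlib.blake2b` as a stand-in hash, canonical (reverse-complement aware) k-mers are omitted, and the cutoff of `2**64 / scaled` is an assumption about how the sampling rate maps onto a 64-bit hash space. All function names are hypothetical.

```python
import hashlib
import random

HASH_SPACE = 2**64  # size of a 64-bit hash space


def kmers(seq, ksize):
    """Yield every overlapping k-mer of length `ksize` in `seq`."""
    for i in range(len(seq) - ksize + 1):
        yield seq[i:i + ksize]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash, not sourmash's)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def num_sketch(seq, ksize, num):
    """Keep the `num` smallest hashes: bounded size whatever the input length."""
    return sorted({hash_kmer(k) for k in kmers(seq, ksize)})[:num]


def scaled_sketch(seq, ksize, scaled):
    """Keep every hash below HASH_SPACE / scaled: size grows with the input."""
    max_hash = HASH_SPACE // scaled
    return sorted({h for h in map(hash_kmer, kmers(seq, ksize)) if h < max_hash})


# A reproducible toy "genome"; real input would come from a sequence file.
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(20000))

for length in (5000, 20000):
    prefix = genome[:length]
    n = len(num_sketch(prefix, ksize=21, num=500))
    s = len(scaled_sketch(prefix, ksize=21, scaled=100))
    print(f"{length} bp: num sketch holds {n} hashes, scaled sketch holds {s}")
```

The num sketch is capped at 500 hashes for both inputs, while the scaled sketch keeps roughly one hash per 100 distinct k-mers and so grows with the sequence.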

Note that you cannot compute Jaccard similarity or containment for
You cannot calculate Jaccard similarity or containment for
MinHash objects with different num or scaled values (or different ksizes):

```
Expand All @@ -234,7 +307,7 @@ TypeError: must have same num: 500 != 1000

```

You can make signatures compatible by downsampling; see the next
However, you can make signatures compatible by downsampling; see the next
sections.
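
As a conceptual illustration of why downsampling works in only one direction, here is a small sketch that operates on plain Python sets of hash values. The helpers `downsample_scaled` and `downsample_num` are hypothetical names, not sourmash functions, and the `2**64 / scaled` cutoff is an assumption about the hash space.

```python
HASH_SPACE = 2**64  # assumed 64-bit hash space


def downsample_scaled(hashes, old_scaled, new_scaled):
    """Keep only the hashes that a sketch built at `new_scaled` would retain."""
    if new_scaled < old_scaled:
        # Hashes above the old cutoff were never stored, so they cannot be recovered.
        raise ValueError("cannot go to a smaller scaled value")
    max_hash = HASH_SPACE // new_scaled
    return {h for h in hashes if h < max_hash}


def downsample_num(hashes, old_num, new_num):
    """Keep the `new_num` smallest hashes of a sketch built with `old_num`."""
    if new_num > old_num:
        raise ValueError("cannot go to a larger num value")
    return set(sorted(hashes)[:new_num])


# Hand-picked toy hashes that all fall below the scaled=1000 cutoff.
sketch_1000 = {HASH_SPACE // 5000, HASH_SPACE // 3000,
               HASH_SPACE // 1500, HASH_SPACE // 1200}

sketch_2000 = downsample_scaled(sketch_1000, old_scaled=1000, new_scaled=2000)
print(len(sketch_1000), "->", len(sketch_2000))  # 4 -> 2

print(sorted(downsample_num({50, 10, 40, 20, 30}, old_num=5, new_num=3)))  # [10, 20, 30]
```

Two sketches downsampled to the same `scaled` (or the same `num`) were selected by the same rule, so their hash sets can be compared directly.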

### A brief introduction to MinHash object methods and attributes
Expand Down Expand Up @@ -380,19 +453,34 @@ you.*
(You can also take a look at the logic in `sourmash signature
downsample` if you are interested.)

## Working with fast search trees (Sequence Bloom Trees, or SBTs)
## Working with indexed collections of signatures

If you want to search large collections of signatures, sourmash provides
two different indexing strategies, together with a generic `Index` class
that supports a common API for searching the collections.

The first indexing strategy is a Sequence Bloom Tree, which is
designed to support fast and efficient containment operations on large
collections of signatures. SBTs are an _on disk_ search structure, so
they are a low-memory way to search collections.

To use SBTs from the command line, we first
need to create some `scaled` signatures:

```
sourmash sketch dna -p scaled=10000 data/GCF*.fna.gz --outdir data/
```

Suppose we have a number of signatures calculated with `--scaled`, like so:
and then build a Sequence Bloom Tree (SBT) index with `sourmash
index`, like so:

```
sourmash compute --scaled 10000 data/GCF*.fna.gz
sourmash index foo.sbt.zip data/GCF*.sig -k 31
```

and now we want to create a Sequence Bloom Tree (SBT) so that we can
search them efficiently. You can do this with `sourmash index`, but
you can also access the Python API directly.
Here, sourmash is storing the entire SBT in a single portable Zip file.

### Creating a search tree
### Creating an on-disk SBT in Python

Let's start by using 'glob' to grab some example signatures from the
test data in the sourmash repository:
Expand All @@ -403,11 +491,11 @@ test data in the sourmash repository:

```

Now, create a tree:
Now, create an SBT:

```
>>> import sourmash
>>> tree = sourmash.create_sbt_index()
>>> import sourmash.sbtmh
>>> tree = sourmash.sbtmh.create_sbt_index()

```

Expand All @@ -423,7 +511,7 @@ Load each signature, and add it to the tree:
```
(note, you'll need to make sure that all of the signatures are compatible
with each other! The `sourmash index` command does all of the necessary
checks.)
checks, but the Python API doesn't.)

Now, save the tree:

Expand Down Expand Up @@ -454,7 +542,7 @@ Now, load a DNA sequence:
```
>>> filename = 'data/GCF_000005845.2_ASM584v2_genomic.fna.gz'
>>> query_seq = next(iter(screed.open(filename))).sequence
>>> print('got {} DNA characters to query'.format(len(query_seq)))
>>> print(f'got {len(query_seq)} DNA characters to query')
got 4641652 DNA characters to query

```
Expand Down Expand Up @@ -482,3 +570,48 @@ NC_000913.3 Escherichia coli str. K-12 substr. MG1655, complete genome
```

et voila!

### In-memory databases: the LCA or "reverse index" database.

The LCA database lets you work with large collections of signatures in
memory.

The LCA database was initially designed to support individual hash
queries for taxonomic operations - hence its name, which stands for
"Lowest Common Ancestor." However, it supports all of the standard
`Index` operations, just like the SBT.
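
The "reverse index" idea can be sketched in a few lines of plain Python: instead of storing a list of hashes per signature, store the set of signatures per hash. The class below is a toy illustration with hypothetical names (`ToyReverseIndex`, `lookup`, `count_shared`); it makes no claim about how `LCA_Database` is implemented internally.

```python
from collections import Counter, defaultdict


class ToyReverseIndex:
    """Map each hash value to the names of the signatures that contain it."""

    def __init__(self):
        self.hash_to_names = defaultdict(set)

    def insert(self, name, hashes):
        """Register every hash under `name`; return how many were inserted."""
        hashes = set(hashes)
        for h in hashes:
            self.hash_to_names[h].add(name)
        return len(hashes)

    def lookup(self, hashval):
        """Answer an individual hash query: which signatures contain it?"""
        return set(self.hash_to_names.get(hashval, ()))

    def count_shared(self, query_hashes):
        """Count, per signature, how many query hashes it shares."""
        counts = Counter()
        for h in set(query_hashes):
            for name in self.hash_to_names.get(h, ()):
                counts[name] += 1
        return counts


# Toy hash values chosen by hand for illustration.
db = ToyReverseIndex()
db.insert("genomeA", [10, 20, 30])
db.insert("genomeB", [30, 40])

print(sorted(db.lookup(30)))                       # ['genomeA', 'genomeB']
print(sorted(db.count_shared([20, 30, 99]).items()))  # [('genomeA', 2), ('genomeB', 1)]
```

Because each hash lookup is a single dictionary access, individual hash queries stay cheap however many signatures are loaded, at the cost of holding the whole mapping in memory.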

First, let's create an LCA database programmatically.

```
>>> from sourmash.lca import LCA_Database
>>> db = LCA_Database(ksize=31, scaled=10000, moltype='DNA')

```

Now, let's load in all of the signatures from the test directory:

```
>>> for sig in sourmash.load_file_as_signatures('tests/test-data/doctest-data', ksize=31):
... hashes_inserted = db.insert(sig)
... print(f"Inserted {hashes_inserted} hashes into db.")
Inserted 493 hashes into db.
Inserted 525 hashes into db.
Inserted 490 hashes into db.

```

and now you have an `Index` class that supports all the generic index operations (below). You can save an LCA Database to disk with `db.save(filename)`, and load it with `sourmash.load_file_as_index`, below.

### The `Index` class API.

The `Index` class supports a generic API for SBTs, LCAs, and other collections of signatures.

To load an SBT or an LCA database from a file, use `sourmash.load_file_as_index`:
```
>>> sbt_db = sourmash.load_file_as_index('tests/test-data/prot/protein.sbt.zip')
>>> lca_db = sourmash.load_file_as_index('tests/test-data/prot/protein.lca.json.gz')

```

`Index` objects provide `search`, `insert`, `load`, `save`, and `__len__`. The signatures can be accessed directly via the `.signatures()` method, which returns an iterable. Last but not least, `Index.select(ksize=..., moltype=...)` will return a view on the Index object that contains only signatures with the desired k-mer size/molecule type.
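
To show the shape of such a generic interface without depending on sourmash itself, here is a toy in-memory stand-in. `ToySignature` and `ToyIndex` are hypothetical classes written for this illustration; the method names mirror the ones listed above, but the argument lists, return values, and the use of Jaccard similarity in `search` are assumptions, and `select` returns a filtered copy rather than a view.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySignature:
    name: str
    ksize: int
    moltype: str
    hashes: frozenset


class ToyIndex:
    """A minimal list-backed collection with an Index-like interface."""

    def __init__(self, signatures=()):
        self._signatures = list(signatures)

    def __len__(self):
        return len(self._signatures)

    def insert(self, sig):
        self._signatures.append(sig)

    def signatures(self):
        return iter(self._signatures)

    def select(self, ksize=None, moltype=None):
        """Return a new ToyIndex holding only the matching signatures."""
        keep = [s for s in self._signatures
                if (ksize is None or s.ksize == ksize)
                and (moltype is None or s.moltype == moltype)]
        return ToyIndex(keep)

    def search(self, query, threshold=0.0):
        """Return (similarity, signature) pairs at or above `threshold`, best first."""
        results = []
        for sig in self._signatures:
            union = query.hashes | sig.hashes
            score = len(query.hashes & sig.hashes) / len(union) if union else 0.0
            if score >= threshold:
                results.append((score, sig))
        return sorted(results, key=lambda pair: pair[0], reverse=True)


index = ToyIndex()
index.insert(ToySignature("a", 31, "DNA", frozenset({1, 2, 3, 4})))
index.insert(ToySignature("b", 31, "DNA", frozenset({3, 4, 5, 6})))
index.insert(ToySignature("c", 21, "DNA", frozenset({1, 2, 3, 4})))

query = ToySignature("q", 31, "DNA", frozenset({1, 2, 3, 4}))
k31 = index.select(ksize=31)
for score, sig in k31.search(query, threshold=0.2):
    print(f"{score:.3f} {sig.name}")  # prints "1.000 a" then "0.333 b"
```

Selecting by k-mer size before searching keeps incompatible signatures (here, `c` at k=21) out of the comparison.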