[MRG] more docs update for 4.0 - minor cleanup and additions (#1331)
* start adjusting docs

* add migration links

* switch compute over to sketch in most of the markdown docs

* fix --scaled and --track-abundance throughout

* formatting and wording fixes

* add sourmash sketch docs

* substantial update for API examples

* add ToC to api-example

* fix heading for API section

* bold API examples link

* (untested) update of tutorials to use sourmash sketch

* update link targets

* updates of indexed databases

* typos in versioning (#1314)

* Apply suggestions from code review

Co-authored-by: Taylor Reiter <[email protected]>

* Update doc/api-example.md

Co-authored-by: Taylor Reiter <[email protected]>

* updated with suggestions from @taylorreiter doc review

* added section on sketch naming

* [WIP] add migration docs and release notes (#1316)

* add migration docs and release notes

* Update doc/support.md

Co-authored-by: Taylor Reiter <[email protected]>

* Update doc/support.md

Co-authored-by: Taylor Reiter <[email protected]>

* Update doc/release-notes/sourmash-4.0.md

Co-authored-by: Taylor Reiter <[email protected]>

* Update doc/release-notes/sourmash-4.0.md

Co-authored-by: Taylor Reiter <[email protected]>

* Update doc/release-notes/sourmash-4.0.md

Co-authored-by: Taylor Reiter <[email protected]>

* update with last set of changes

* add missing line break

Co-authored-by: Taylor Reiter <[email protected]>

* resolve missing link

* update tutorials and notebooks for 4.0

* fix link; add GTDB mention; logo positioning

* minor fixes and updates

* no need for pysam and htslib any more

* fix words

* Update README.md

Co-authored-by: Luiz Irber <[email protected]>

Co-authored-by: Taylor Reiter <[email protected]>
Co-authored-by: Luiz Irber <[email protected]>
3 people authored Feb 18, 2021
1 parent 74a7f25 commit ce950ca
Showing 4 changed files with 31 additions and 22 deletions.
31 changes: 16 additions & 15 deletions doc/command-line.md
@@ -24,8 +24,8 @@ taken.

Grab three bacterial genomes from NCBI:
```
-curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
-curl -L -O ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz
+curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Escherichia_coli/reference/GCF_000005845.2_ASM584v2/GCF_000005845.2_ASM584v2_genomic.fna.gz
+curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/Salmonella_enterica/reference/GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz
curl -L -O https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/783/305/GCA_000783305.1_ASM78330v1/GCA_000783305.1_ASM78330v1_genomic.fna.gz
```
Compute signatures for each:
@@ -36,20 +36,20 @@ This will produce three `.sig` files containing MinHash signatures at k=31.

Next, compare all the signatures to each other:
```
-sourmash compare *.sig -o cmp
+sourmash compare *.sig -o cmp.dist
```

Optionally, parallelize compare to 8 threads with `-p 8`:

```
-sourmash compare -p 8 *.sig -o cmp
+sourmash compare -p 8 *.sig -o cmp.dist
```

Finally, plot a dendrogram:
```
-sourmash plot cmp --labels
+sourmash plot cmp.dist --labels
```
-This will output two files, `cmp.dendro.png` and `cmp.matrix.png`,
+This will output two files, `cmp.dist.dendro.png` and `cmp.dist.matrix.png`,
containing a clustering & dendrogram of the sequences, as well as a
similarity matrix and heatmap.

@@ -69,8 +69,8 @@ walkthrough of these commands.
* `compare` compares signatures and builds a distance matrix.
* `plot` plots distance matrices created by `compare`.
* `search` finds matches to a query signature in a collection of signatures.
-* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures
-* `index` build a fast index for many (thousands) of signatures
+* `gather` finds the best reference genomes for a metagenome, using the provided collection of signatures.
+* `index` builds a fast index for many (thousands) of signatures.

There are also a number of commands that work with taxonomic
information; these are grouped under the `sourmash lca`
@@ -115,7 +115,7 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei

The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**.

-`sourmash sketch` takes FASTA or FASTQ sequences as input, and they can be
+`sourmash sketch` takes FASTA or FASTQ sequences as input; input data can be
uncompressed, compressed with gzip, or compressed with bzip2. The output
will be one or more JSON signature files that can be used with the other
sourmash commands.
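
As an illustration of what accepting uncompressed, gzip, or bzip2 input involves, here is a minimal Python sketch that picks a reader from a file's leading magic bytes. The helper name `open_sequence_file` is hypothetical and this is not a claim about how sourmash itself opens its input; it only assumes the standard gzip (`1f 8b`) and bzip2 (`BZh`) signatures.

```python
import bz2
import gzip


def open_sequence_file(path):
    """Open a FASTA/FASTQ file as text, whatever its compression.

    Hypothetical helper for illustration only; not sourmash's own code.
    """
    # Peek at the first bytes to identify the compression format.
    with open(path, "rb") as handle:
        magic = handle.read(3)

    if magic[:2] == b"\x1f\x8b":   # gzip signature
        return gzip.open(path, "rt")
    if magic == b"BZh":            # bzip2 signature
        return bz2.open(path, "rt")
    return open(path, "rt")        # assume plain, uncompressed text
```

Checking the content rather than the file extension means a misnamed file (for example, a gzipped file without a `.gz` suffix) is still read correctly.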
@@ -186,7 +186,8 @@ Options:
--containment -- calculate containment instead of similarity.
C(i, j) = size(i intersection j) / size(i).
--from-file -- append the list of files in this text file to the input
-signatures
+signatures.
+--ignore-abundance -- ignore abundances in signatures.
```

**Note:** compare by default produces a symmetric similarity matrix that can be used as an input to clustering. With `--containment`, however, this matrix is no longer symmetric and cannot formally be used for clustering.
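
To make the note about symmetry concrete, the following sketch computes both measures on plain Python sets standing in for sketch hashes. The function names and the toy hash values are assumptions made for illustration; real sourmash signatures are not plain sets, and this is not the library's API.

```python
def jaccard(a, b):
    """Jaccard similarity: size(a intersection b) / size(a union b)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment C(a, b) = size(a intersection b) / size(a)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 0.0


# Toy "hash sets": every hash in `small` also occurs in `large`.
small = {1, 2, 3, 4}
large = {1, 2, 3, 4, 5, 6, 7, 8}

print(jaccard(small, large), jaccard(large, small))          # same value both ways
print(containment(small, large), containment(large, small))  # differs by direction
```

Because C(i, j) divides by the size of i alone, swapping the arguments changes the denominator, which is why the containment matrix is not symmetric.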
@@ -325,7 +326,7 @@ input signatures. You can create an "unpacked" version by specifying
subdirectory of files under `.sbt.database`.

Note that you can use `--from-file` to pass `index` a text file
-containing a list of files to index; you can also provide individual
+containing a list of file names to index; you can also provide individual
signature files, directories full of signatures, or other sourmash
databases.
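
One way to prepare such a text file is sketched below. It assumes the list is plain text with one file name per line (the passage above does not spell out the layout), and the helper name `write_signature_list` is hypothetical.

```python
from pathlib import Path


def write_signature_list(directory, list_path):
    """Write the `.sig` files found in `directory` to `list_path`.

    Assumption: one file name per line. Hypothetical helper, not part
    of sourmash.
    """
    paths = sorted(str(p) for p in Path(directory).glob("*.sig"))
    Path(list_path).write_text("".join(p + "\n" for p in paths))
    return paths
```

The resulting file could then be handed to `sourmash index` through `--from-file`, alongside any signatures named directly on the command line.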

@@ -393,7 +394,7 @@ species level assignments would not be reported.
(This is the approach that Kraken and other lowest common ancestor
implementations use, we believe.)

-Note: you can specify a list of files to load signatures from in a
+Note: you can specify a list of file names to load signatures from in a
text file passed to `sourmash lca classify` with the
`--query-from-file` flag; these files will be appended to the `--query`
input.
@@ -491,7 +492,7 @@ genome is present only once; when weighted by abundance, the Bacterial genome
is only 41.8% of the metagenome content, while the Archaeal genome is
58.1% of the metagenome content.

-Note: you can specify a list of files to load signatures from in a
+Note: you can specify a list of file names to load signatures from in a
text file passed to `sourmash lca summarize` with the
`--query-from-file` flag; these files will be appended to the `--query`
input.
@@ -514,7 +515,7 @@ see
[the NCBI lineage repository](https://github.com/dib-lab/2018-ncbi-lineages).

You can use `--from-file` to pass `lca index` a text file containing a
-list of files to index.
+list of file names to index.

### `sourmash lca rankinfo` - examine an LCA database

@@ -885,7 +886,7 @@ some other command.
### Loading all signatures under a directory

All of the `sourmash` commands support loading signatures from
-directories provided on the command line.
+beneath directories; provide the paths on the command line.

### Combining search databases on the command line

14 changes: 9 additions & 5 deletions doc/index.md
@@ -12,7 +12,8 @@ available databases](databases.md).

sourmash also includes k-mer based taxonomic exploration and
classification routines for genome and metagenome analysis. These
-routines can use the NCBI taxonomy but do not depend on it in any way.
+routines can use the NCBI and GTDB taxonomies but do not depend on them
+specifically.

We have [several tutorials](tutorials.md) available! Start with
[Making signatures, comparing, and searching](tutorial-basic.md).
@@ -27,7 +28,7 @@ background information on how and why MinHash works.

**Want to migrate to sourmash v4?** sourmash v4 is now available, and
has a number of incompatibilities with v2 and v3. Please see
-[our migration guide](support.md#migrating-from-sourmash-v3-x-to-sourmash-4-x)!
+[our migration guide](support.md#migrating-from-sourmash-v3-x-to-sourmash-v4-x)!

----

@@ -102,7 +103,7 @@ sourmash has relatively small disk and memory requirements compared to
many other software programs used for genome search and taxonomic
classification.

-`sourmash search` and `sourmash gather` can be used to search all
+`sourmash search` and `sourmash gather` can be used to search 100k
genbank microbial genomes ([using our prepared databases](databases.md))
with about 20 GB of disk and in under 1 GB of RAM.
Typically a search for a single genome takes about 30 seconds on a laptop.
@@ -138,7 +139,7 @@ versions!
**sourmash cannot find matches across large evolutionary distances.**

sourmash seems to work well to search and compare data sets for
-matches at the species and genus level, but does not have much
+nucleotide matches at the species and genus level, but does not have much
sensitivity beyond that. (It seems to be particularly good at
strain-level analysis.) You should use protein-based analyses
to do searches across larger evolutionary distances.
@@ -155,9 +156,12 @@ to signature size when using 'scaled' signatures.
The sourmash logo was designed by Stéfanie Fares Sabbag,
with feedback from Clara Barcelos,
Taylor Reiter and Luiz Irber.

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img
alt="Creative Commons License" style="border-width:0"
-src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />The logo
+src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />
+
+The logo
is licensed under a <a rel="license"
href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons
Attribution-ShareAlike 4.0 International License</a>.
3 changes: 1 addition & 2 deletions doc/release.md
@@ -8,8 +8,7 @@ Michael Crusoe.
The basic build environment needed below can be created as follows:

```
-conda create -y -n sourmash-rc python=3.7 pip cxx-compiler make \
-htslib pysam twine
+conda create -y -n sourmash-rc python=3.7 pip cxx-compiler make twine
```

Then activate it with `conda activate sourmash-rc`.
5 changes: 5 additions & 0 deletions doc/sourmash-sketch.md
@@ -16,6 +16,11 @@ The `sketch protein` command reads in **protein sequences** and outputs **protei

The `sketch translate` command reads in **DNA sequences**, translates them in all six frames, and outputs **protein sketches**.
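
To show what "all six frames" refers to, here is a small Python sketch that splits a DNA sequence into codons for the three forward offsets and the three offsets of its reverse complement. The function names are hypothetical, the codon-to-amino-acid step is left out, and nothing here is taken from sourmash's implementation.

```python
# Complement table for unambiguous bases plus N (illustrative only).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def six_frames(seq):
    """Return six codon lists: offsets 0-2 forward, then 0-2 reverse."""
    seq = seq.upper()
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            sub = strand[offset:]
            usable = len(sub) - len(sub) % 3  # drop any trailing partial codon
            frames.append([sub[i:i + 3] for i in range(0, usable, 3)])
    return frames
```

Each codon list would then be translated to amino acids before sketching; which frame is biologically correct is not known in advance, which is why all six are considered.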

+All `sourmash sketch` commands take FASTA or FASTQ sequences as input;
+input data can be uncompressed, compressed with gzip, or compressed
+with bzip2. The output will be one or more JSON signature files that
+can be used with the other sourmash commands.

## Quickstart

### DNA sketches for genomes and reads