MRG: update JOSS paper per pyopensci review #2964

Merged: 3 commits, Feb 3, 2024

39 changes: 33 additions & 6 deletions paper.bib
@@ -34,7 +34,10 @@ @article{Pierce:2019
 title = {Large-scale sequence comparisons with sourmash},
 journal = {F1000Research}
 }
+
 @article{gather,
+doi = {10.1101/2022.01.11.475838},
+url = {https://doi.org/10.1101/2022.01.11.475838},
 title={Lightweight compositional analysis of metagenomes with FracMinHash and minimum metagenome covers},
 author={Irber, Luiz Carlos and Brooks, Phillip T and Reiter, Taylor E and Pierce-Ward, N Tessa and Hera, Mahmudur Rahman and Koslicki, David and Brown, C Titus},
 journal={bioRxiv},
@@ -43,6 +46,8 @@ @article{gather
 }

 @article{branchwater,
+doi = {10.1101/2022.11.02.514947},
+url={https://doi.org/10.1101/2022.11.02.514947},
 title={Sourmash Branchwater Enables Lightweight Petabyte-Scale Sequence Search},
 author={Irber, Luiz Carlos and Pierce-Ward, N Tessa and Brown, C Titus},
 journal={bioRxiv},
@@ -51,6 +56,8 @@ @article{branchwater
 }

 @article{koslicki2019improving,
+doi={10.1016/j.amc.2019.02.018},
+url={https://doi.org/10.1016/j.amc.2019.02.018},
 title={Improving minhash via the containment index with applications to metagenomic analysis},
 author={Koslicki, David and Zabeti, Hooman},
 journal={Applied Mathematics and Computation},
@@ -60,10 +67,30 @@ @article{koslicki2019improving
 publisher={Elsevier}
 }

-@article{hera2022debiasing,
-title={Debiasing FracMinHash and deriving confidence intervals for mutation rates across a wide range of evolutionary distances},
-author={Hera, Mahmudur Rahman and Pierce-Ward, N Tessa and Koslicki, David},
-journal={bioRxiv},
-year={2022},
-publisher={Cold Spring Harbor Laboratory}
+@article{hera2023deriving,
+doi={10.1101/gr.277651.123},
+url={https://doi.org/10.1101/gr.277651.123},
+title={Deriving confidence intervals for mutation rates across a wide range of evolutionary distances using FracMinHash},
+author={Rahman Hera, Mahmudur and Pierce-Ward, N Tessa and Koslicki, David},
+journal={Genome Research},
+pages={gr--277651},
+year={2023},
+publisher={Cold Spring Harbor Lab}
 }
+
+@article{liu2023fast,
+doi={10.1101/2023.11.06.565843},
+url={https://doi.org/10.1101/2023.11.06.565843},
+title={Fast, lightweight, and accurate metagenomic functional profiling using FracMinHash sketches},
+author={Liu, S and Wei, W and Ma, C and Koslicki, D and others},
+year={2023}
+}
+
+@article{portik2022evaluation,
+doi={10.1186/s12859-022-05103-0},
+url={https://doi.org/10.1186/s12859-022-05103-0},
+title={Evaluation of taxonomic profiling methods for long-read shotgun metagenomic sequencing datasets},
+author={Portik, Daniel M and Brown, C Titus and Pierce-Ward, N Tessa},
+journal={Bioinformatics},
+year={2022}
+}
64 changes: 38 additions & 26 deletions paper.md
@@ -1,5 +1,6 @@
 ---
-title: 'sourmash: a tool to quickly search, compare, and analyze genomic and metagenomic data sets'
+title: 'sourmash: a tool to quickly search, compare, and analyze genomic
+and metagenomic data sets'
 tags:
 - FracMinHash
 - MinHash
@@ -114,51 +115,62 @@ affiliations:
 - name: No affiliation
 index: 9

-date: 27 Mar 2023
+date: 31 Jan 2024
 bibliography: paper.bib
 ---

 # Summary

-sourmash is a command line tool and Python library for sketching
-collections of DNA, RNA, and amino acid k-mers for biological sequence
-search, comparison, and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and accurate sequence comparisons between datasets of different sizes [@gather], including petabase-scale database search [@branchwater]. From release 4.x, sourmash is built on top of Rust and provides an experimental Rust interface.
+sourmash is a command line tool and Python library for sketching collections
+of DNA, RNA, and amino acid k-mers for biological sequence search, comparison,
+and analysis [@Pierce:2019]. sourmash's FracMinHash sketching supports fast and
+accurate sequence comparisons between datasets of different sizes [@gather],
+including taxonomic profiling [@portik2022evaluation], functional profiling
+[@liu2023fast], and petabase-scale sequence search [@branchwater]. From
+release 4.x, sourmash is built on top of Rust and provides an experimental
+Rust interface.

-FracMinHash sketching is a lossy compression approach that represents
-data sets using a "fractional" sketch containing $1/S$ of the original
-k-mers. Like other sequence sketching techniques (e.g. MinHash, [@Ondov:2015]), FracMinHash provides a lightweight way to store representations of large DNA or RNA sequence collections for comparison and search. Sketches can be used to identify samples, find similar samples, identify data sets with shared sequences, and build phylogenetic trees. FracMinHash sketching supports estimation of overlap, bidirectional containment, and Jaccard similarity between data sets and is accurate even for data sets of very different sizes.
+FracMinHash sketching is a lossy compression approach that represents data
+sets using a "fractional" sketch containing $1/S$ of the original k-mers. Like
+other sequence sketching techniques (e.g. MinHash, [@Ondov:2015]), FracMinHash
+provides a lightweight way to store representations of large DNA or RNA
+sequence collections for comparison and search. Sketches can be used to
+identify samples, find similar samples, identify data sets with shared
+sequences, and build phylogenetic trees. FracMinHash sketching supports
+estimation of overlap, bidirectional containment, and Jaccard similarity
+between data sets and is accurate even for data sets of very different sizes.

 Since sourmash v1 was released in 2016 [@Brown:2016], sourmash has expanded
 to support new database types and many more command line functions.
 In particular, sourmash now has robust support for both Jaccard similarity
-and containment calculations, which enables analysis and comparison of data sets
-of different sizes, including large metagenomic samples. As of v4.4,
+and Containment calculations, which enables analysis and comparison of data
+sets of different sizes, including large metagenomic samples. As of v4.4,
 sourmash can convert these to estimated Average Nucleotide Identity (ANI)
-values, which can provide improved biological context to sketch comparisons [@hera2022debiasing].
+values, which can provide improved biological context to sketch comparisons
+[@hera2022debiasing].

 # Statement of Need

-Large collections of genomes, transcriptomes, and raw sequencing data
-sets are readily available in biology, and the field needs lightweight
-computational methods for searching and summarizing the content of
-both public and private collections. sourmash provides a flexible set
-of programmatic functionality for this purpose, together with a robust
-and well-tested command-line interface. It has been used in well over 200
-publications (based on citations of @Brown:2016 and @Pierce:2019) and it continues
-to expand in functionality.
+Large collections of genomes, transcriptomes, and raw sequencing data sets are
+readily available in biology, and the field needs lightweight computational
+methods for searching and summarizing the content of both public and private
+collections. sourmash provides a flexible set of programmatic functionality
+for this purpose, together with a robust and well-tested command-line
+interface. It has been used in over 350 publications (based on citations of
+@Brown:2016 and @Pierce:2019) and it continues to expand in functionality.

 # Acknowledgements

 This work is funded in part by the Gordon and Betty Moore Foundation’s
 Data-Driven Discovery Initiative [GBMF4551 to CTB].

 Notice: This manuscript has been authored by BNBI under Contract
-No. HSHQDC-15-C-00064 with the DHS. The US Government retains
-and the publisher, by accepting the article for publication, acknowledges
-that the USG retains a non-exclusive, paid-up, irrevocable, world-wide
-license to publish or reproduce the published form of this manuscript,
-or allow others to do so, for USG purposes. Views and conclusions
-contained herein are those of the authors and should not be interpreted
-to represent policies, expressed or implied, of the DHS.
+No. HSHQDC-15-C-00064 with the DHS. The US Government retains and the
+publisher, by accepting the article for publication, acknowledges that the USG
+retains a non-exclusive, paid-up, irrevocable, world-wide license to publish
+or reproduce the published form of this manuscript, or allow others to do
+so, for USG purposes. Views and conclusions contained herein are those of
+the authors and should not be interpreted to represent policies, expressed
+or implied, of the DHS.

 # References
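
The revised Summary describes FracMinHash as keeping roughly $1/S$ of the original k-mers and supporting containment, Jaccard, and ANI estimates. The following is a minimal Python sketch of that idea, not sourmash's implementation: the BLAKE2b hash, the function names, the `scaled` parameter name, and the toy random sequences are all assumptions made for illustration, and sourmash's own hash function and k-mer canonicalization are not reproduced. The ANI conversion shown is only the simple point estimate `containment ** (1/k)` under an independent-mutation model; the confidence intervals discussed in the cited work are not computed.

```python
import hashlib
import random

MAX_HASH = 2**64  # size of the 64-bit hash space used in this illustration


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scaled=10):
    """Keep only hashes below MAX_HASH / scaled, i.e. about 1/scaled of the k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def jaccard(a, b):
    """Estimated Jaccard similarity: shared hashes over all hashes."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Estimated fraction of a's k-mers that also occur in b."""
    return len(a & b) / len(a) if a else 0.0


def ani_from_containment(c, k):
    """Point estimate of ANI from containment under a simple mutation model."""
    return c ** (1.0 / k)


# Toy data: a random "genome" and a fragment taken from its start.
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(20_000))
fragment = genome[:5_000]

genome_sketch = frac_minhash(genome)
fragment_sketch = frac_minhash(fragment)

print("sketch sizes:", len(genome_sketch), len(fragment_sketch))
print("containment of fragment in genome:", containment(fragment_sketch, genome_sketch))
print("Jaccard similarity:", jaccard(fragment_sketch, genome_sketch))
print("ANI point estimate:", ani_from_containment(containment(fragment_sketch, genome_sketch), 21))
```

Because every k-mer is kept or dropped by the same hash threshold regardless of which data set it comes from, the fragment's sketch is a subset of the genome's sketch. This is why containment stays meaningful when the two data sets differ greatly in size, while Jaccard similarity is pulled down by the size difference.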