Allow non-owned Sequence types #19

lynn · 2022-11-09T22:51:28Z

Currently DnaSequence owns a Vec<Nucleotide>. But we'd also like to be able to have a DnaSequence backed by a borrowed &[Nucleotide]. Same for ProteinSequence.

Thought from @vgel: use generics, so DnaSequence<Vec<Nucleotide>> vs DnaSequence<&[Nucleotide]>, and then maybe nice aliases for these. Maybe even make DnaSequence<[Nucleotide; N]> work. This is what SmallVec does.

Alternatively, follow the Rust tradition of two typenames for borrowed vs. owned, like str vs String and Path vs PathBuf.

The text was updated successfully, but these errors were encountered:

swooster · 2022-11-15T19:00:19Z

I can think of 3 main ways to handle this...

Rust tradition of two types
Generic wrappers
Extension traits on Nucleotide slices

While my default temptation was to follow Rust tradition for Chesterton's fence reasons, the more I think about it the more a generic wrapper on anything that derefs to a nucleotide slice sounds appealing for composability reasons; I'd love to be able to use other kinds of backing storage.

If all Nucleotide sequences are valid DNA, then perhaps extension traits could make sense... They'd offer the same backing-storage-flexibility of a generic wrapper, but we wouldn't have to wrap/unwrap things in a Dna type and we wouldn't need to resort to any deref hackery to get at the underlying storage. On the other hand, it means we'd need to import the trait to access the methods (whereas with methods on concrete types, we can call them on returned values without needing to import anything).

Also, wrapper types and extension traits aren't mutually exclusive. We could have extension traits do the heavy lifting, and have methods of the same name on wrapper types that mostly defer to the trait implementation (and possibly rewrap the output if we care about e.g. dna sequence windows being dna sequences).

vgel · 2022-11-15T19:24:08Z

I'm not sure what the two-types solution would look like. Are you suggesting something like an unsized struct dna([Nucleotide])? Is there a library that does this besides std?

I think extension traits could be nice. It would be cool if people could bring their own storage, like SmallVec, since all we really care about is length + index + maybe mutability. I'd lean in that direction (without generic wrappers) to start with.

swooster · 2022-11-18T00:13:43Z

I've pushed a protoype branch and I'd love some feedback about whether this seems like a potentially acceptable direction to head in.

First some quick example code:

use quickdna::Nucleotides; // extension trait

// This approach should be backwards compatible
let dna: DnaSequence<Nucleotide> = DnaSequence::from_str("ATCGAATTCCGG").unwrap();

// IMHO, translation tables really feel like functions mapping codons to amino acids
// I'd be sorely tempted to also make these accessible as ordinary functions
let ncbi1 = TranslationTable::Ncbi1.to_fn();

// Everything is lazy, so (almost) no resources have been used yet. Note that this
// is an iter of Option<AminoAcid>, in case the translation table is missing something
let protein = dna.codons().map(ncbi1);

// Because everything is just an iterator, we can choose whatever storage we like
// and rely on built-in Rust code for hoisting the Option out during collection.
let protein: Option<SmallVec<[AminoAcid; 10]>> = protein.collect();
let protein = protein.expect("Unable to translate to amino acids");

// Again this supports Vecs, SmallVecs, arrays or anything else that coerces to a slice
use Nucleotide::*;
let dna = [A, T, C, G];

// Like before, this is lazy; it has consumed almost no resources.
let frames = dna.reverse_complement().frames();

// Unfortunately, because frames is a SmallVec of iterators, we have to clone() to extract from it.
let v: Vec<_> = frames[1].clone().collect();
assert_eq!(v, [G, A, T]);

The approach I took definitely isn't perfect; sadly I wasn't able to figure out a way to make the extension trait work for all nucleotide iterators while retaining the convenience of only needing to import a single trait. But perhaps something vaguely like this might be something to work towards?

What do other people think? (any ideas for improvements or other approaches?)

swooster · 2022-11-21T18:27:44Z

It hit me that in order to save other people effort, I should probably describe the false start I went on before reaching the above:

I wanted to make an extension trait for anything that implements something like IntoIterator<Item=impl NucleotideLike>. However, there's the issue that something remapping nucleotides (such as reverse_complement) would probably be an iterator yielding items of a NuclueotideLike type by value, whereas a slice (or Vec or array) would have an iterator that yields references to a NuclueotideLike type. So I'd want to make the extension trait work for either iterators of values or iterators of references, which brings Borrow<T> to mind... unfortunately, Borrow<T> is parametrized on the type it's being borrowed as, so there's potentially more than one Borrow impl for any given type, so you can't use Borrow (or ToOwned) to constrain the output nucleotide type based on the input nucleotide type. I ended up giving up and going with the design of the previous comment; it's not general purpose in the way that I was hoping but it seems like it'd work fine in practice.

swooster · 2022-11-21T21:15:19Z

I've also pushed another version of this prototype to swooster/dna-prototype-general. In exchange for not supporting custom types that implement NucleotideLike (which I assume is a small price to pay) it works with anything that can be converted to a nucleotide iterator. The API is the same as above.

swooster · 2022-11-21T22:06:42Z

Pinging @vgel and @lynn (apologies if this is too soon)

lynn · 2022-11-23T18:19:57Z

I like this API!

In #22, @vgel commented that one of my changes towards Rust-y trait-y .collect() usage actually made the generated code a bit slower, and I reverted it. I expect that to come up here too, but in scenarios where it matters I expect all the eliminated copying to of course balance it out.

Anyway, that's why I think it's a good idea to benchmark quickdna at this point and get a better idea of the impact of these changes. (Also, I should note, maybe we don't need to worry about the code becoming e.g. 10% slower if it's still 90x faster than Biopython and maybe not a bottleneck for our purposes.)

swooster · 2022-11-27T21:15:18Z

@lynn: Completely agreed regarding the tradeoff of slightly suboptimal performance in exchange for making it convenient to avoid allocations/copies... Case in point, I've just submitted the above API as PR #24, which includes a benchmark for generating windows the way synthclient does vs reducing allocations/copies via iterator/reference-based windowing; at least on my laptop, the latter is about 18x faster for a 100kbp sequence.

lynn mentioned this issue Nov 9, 2022

Zero-copy windowing #20

Open

swooster assigned vgel Nov 21, 2022

swooster mentioned this issue Nov 27, 2022

Add iterator-based API for interacting with DNA #24

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow non-owned Sequence types #19

Allow non-owned Sequence types #19

lynn commented Nov 9, 2022 •

edited

Loading

swooster commented Nov 15, 2022

vgel commented Nov 15, 2022

swooster commented Nov 18, 2022 •

edited

Loading

swooster commented Nov 21, 2022

swooster commented Nov 21, 2022 •

edited

Loading

swooster commented Nov 21, 2022

lynn commented Nov 23, 2022

swooster commented Nov 27, 2022

Allow non-owned Sequence types #19

Allow non-owned Sequence types #19

Comments

lynn commented Nov 9, 2022 • edited Loading

swooster commented Nov 15, 2022

vgel commented Nov 15, 2022

swooster commented Nov 18, 2022 • edited Loading

swooster commented Nov 21, 2022

swooster commented Nov 21, 2022 • edited Loading

swooster commented Nov 21, 2022

lynn commented Nov 23, 2022

swooster commented Nov 27, 2022

lynn commented Nov 9, 2022 •

edited

Loading

swooster commented Nov 18, 2022 •

edited

Loading

swooster commented Nov 21, 2022 •

edited

Loading