Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-owned Sequence types #19

Open
lynn opened this issue Nov 9, 2022 · 8 comments
Open

Allow non-owned Sequence types #19

lynn opened this issue Nov 9, 2022 · 8 comments
Assignees

Comments

@lynn
Copy link
Contributor

lynn commented Nov 9, 2022

Currently DnaSequence owns a Vec<Nucleotide>. But we'd also like to be able to have a DnaSequence backed by a borrowed &[Nucleotide]. Same for ProteinSequence.

Thought from @vgel: use generics, so DnaSequence<Vec<Nucleotide>> vs DnaSequence<&[Nucleotide]>, and then maybe nice aliases for these. Maybe even make DnaSequence<[Nucleotide; N]> work. This is what SmallVec does.

Alternatively, follow the Rust tradition of two typenames for borrowed vs. owned, like str vs String and Path vs PathBuf.

@swooster
Copy link
Contributor

I can think of 3 main ways to handle this...

  • Rust tradition of two types
  • Generic wrappers
  • Extension traits on Nucleotide slices

While my default temptation was to follow Rust tradition for Chesterton's fence reasons, the more I think about it the more a generic wrapper on anything that derefs to a nucleotide slice sounds appealing for composability reasons; I'd love to be able to use other kinds of backing storage.

If all Nucleotide sequences are valid DNA, then perhaps extension traits could make sense... They'd offer the same backing-storage-flexibility of a generic wrapper, but we wouldn't have to wrap/unwrap things in a Dna type and we wouldn't need to resort to any deref hackery to get at the underlying storage. On the other hand, it means we'd need to import the trait to access the methods (whereas with methods on concrete types, we can call them on returned values without needing to import anything).

Also, wrapper types and extension traits aren't mutually exclusive. We could have extension traits do the heavy lifting, and have methods of the same name on wrapper types that mostly defer to the trait implementation (and possibly rewrap the output if we care about e.g. dna sequence windows being dna sequences).

@vgel
Copy link
Contributor

vgel commented Nov 15, 2022

I'm not sure what the two-types solution would look like. Are you suggesting something like an unsized struct dna([Nucleotide])? Is there a library that does this besides std?

I think extension traits could be nice. It would be cool if people could bring their own storage, like SmallVec, since all we really care about is length + index + maybe mutability. I'd lean in that direction (without generic wrappers) to start with.

@swooster
Copy link
Contributor

swooster commented Nov 18, 2022

I've pushed a protoype branch and I'd love some feedback about whether this seems like a potentially acceptable direction to head in.

First some quick example code:

use quickdna::Nucleotides; // extension trait

// This approach should be backwards compatible
let dna: DnaSequence<Nucleotide> = DnaSequence::from_str("ATCGAATTCCGG").unwrap();

// IMHO, translation tables really feel like functions mapping codons to amino acids
// I'd be sorely tempted to also make these accessible as ordinary functions
let ncbi1 = TranslationTable::Ncbi1.to_fn();

// Everything is lazy, so (almost) no resources have been used yet. Note that this
// is an iter of Option<AminoAcid>, in case the translation table is missing something
let protein = dna.codons().map(ncbi1);

// Because everything is just an iterator, we can choose whatever storage we like
// and rely on built-in Rust code for hoisting the Option out during collection.
let protein: Option<SmallVec<[AminoAcid; 10]>> = protein.collect();
let protein = protein.expect("Unable to translate to amino acids");

// Again this supports Vecs, SmallVecs, arrays or anything else that coerces to a slice
use Nucleotide::*;
let dna = [A, T, C, G];

// Like before, this is lazy; it has consumed almost no resources.
let frames = dna.reverse_complement().frames();

// Unfortunately, because frames is a SmallVec of iterators, we have to clone() to extract from it.
let v: Vec<_> = frames[1].clone().collect();
assert_eq!(v, [G, A, T]);

The approach I took definitely isn't perfect; sadly I wasn't able to figure out a way to make the extension trait work for all nucleotide iterators while retaining the convenience of only needing to import a single trait. But perhaps something vaguely like this might be something to work towards?

What do other people think? (any ideas for improvements or other approaches?)

@swooster
Copy link
Contributor

It hit me that in order to save other people effort, I should probably describe the false start I went on before reaching the above:

I wanted to make an extension trait for anything that implements something like IntoIterator<Item=impl NucleotideLike>. However, there's the issue that something remapping nucleotides (such as reverse_complement) would probably be an iterator yielding items of a NuclueotideLike type by value, whereas a slice (or Vec or array) would have an iterator that yields references to a NuclueotideLike type. So I'd want to make the extension trait work for either iterators of values or iterators of references, which brings Borrow<T> to mind... unfortunately, Borrow<T> is parametrized on the type it's being borrowed as, so there's potentially more than one Borrow impl for any given type, so you can't use Borrow (or ToOwned) to constrain the output nucleotide type based on the input nucleotide type. I ended up giving up and going with the design of the previous comment; it's not general purpose in the way that I was hoping but it seems like it'd work fine in practice.

@swooster
Copy link
Contributor

swooster commented Nov 21, 2022

I've also pushed another version of this prototype to swooster/dna-prototype-general. In exchange for not supporting custom types that implement NucleotideLike (which I assume is a small price to pay) it works with anything that can be converted to a nucleotide iterator. The API is the same as above.

@swooster
Copy link
Contributor

Pinging @vgel and @lynn (apologies if this is too soon)

@lynn
Copy link
Contributor Author

lynn commented Nov 23, 2022

I like this API!

In #22, @vgel commented that one of my changes towards Rust-y trait-y .collect() usage actually made the generated code a bit slower, and I reverted it. I expect that to come up here too, but in scenarios where it matters I expect all the eliminated copying to of course balance it out.

Anyway, that's why I think it's a good idea to benchmark quickdna at this point and get a better idea of the impact of these changes. (Also, I should note, maybe we don't need to worry about the code becoming e.g. 10% slower if it's still 90x faster than Biopython and maybe not a bottleneck for our purposes.)

@swooster
Copy link
Contributor

@lynn: Completely agreed regarding the tradeoff of slightly suboptimal performance in exchange for making it convenient to avoid allocations/copies... Case in point, I've just submitted the above API as PR #24, which includes a benchmark for generating windows the way synthclient does vs reducing allocations/copies via iterator/reference-based windowing; at least on my laptop, the latter is about 18x faster for a 100kbp sequence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants