The Alphabet API, encoding and attempting to parse / convert #1
So the core of the problem is: Given some input I don't think we should make constructors return Here is what I think we should do. I haven't given that much thought to it, so I'm open to suggestions:
There are some issues with these three
@SabrinaJaye, I think part of your ideas around I don't think these Having said that, a naive

    function LongSequence(it)
        symbols = unique(it)
        idx = findfirst(alphabet -> issubset(symbols, alphabet), alphabets) # alphabets will need some clever ordering.
        A = alphabets[idx]
        return LongSequence{A}(it)
    end

However, this approach doesn't handle a stateful iterator unless it also collects everything or can reset the iterator. A single-pass, shovel-forward-like approach that can handle a stateful iterator would need a way to promote the symbols already encoded or packed. So for a single-pass operation, we would keep a record of the unique symbols observed and check that the observed symbols are still a subset of the chosen alphabet. If they are not, change the chosen alphabet and promote the sink, then continue. The hope is that the sink gets promoted to the correct alphabet early.

I think it's worth noting that we're talking about convenience, so the best guess is good enough, and a performance hit is inevitable. However, I don't think convenience should be allowed to affect the main kernel. So, I'd advocate for an additional layer/iterator like the one suggested here and alphabets with invalid symbol representation.
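The single-pass idea described above could be sketched roughly as follows. This is only an illustration in plain Julia, not BioSequences code: `TOY_ALPHABETS`, `toy_encode` and `guess_and_pack` are hypothetical names, alphabets are modelled as plain vectors of `Char` ordered from narrowest to widest, and a symbol's code is simply its index in the chosen alphabet.

```julia
# Hypothetical toy alphabets, narrowest first. A symbol's code is its index.
const TOY_ALPHABETS = [
    collect("ACGT"),                  # unambiguous DNA
    collect("ACGTMRSVWYHKDBN"),       # DNA with ambiguity codes
    collect("ACDEFGHIKLMNPQRSTVWY"),  # the 20 standard amino acids
]

# Index of `symbol` in `alphabet`, or `nothing` if the alphabet lacks it.
toy_encode(alphabet, symbol) = findfirst(==(symbol), alphabet)

# Consume `it` exactly once, so it also works for stateful iterators.
function guess_and_pack(it)
    idx = 1                # currently chosen alphabet
    seen = Set{Char}()     # unique symbols observed so far
    sink = UInt8[]         # codes under TOY_ALPHABETS[idx]
    for symbol in it
        push!(seen, symbol)
        code = toy_encode(TOY_ALPHABETS[idx], symbol)
        if code === nothing
            # The guess was too narrow: choose the first alphabet that
            # covers everything observed so far.
            new_idx = findfirst(a -> issubset(seen, a), TOY_ALPHABETS)
            new_idx === nothing && error("no toy alphabet covers $(sort!(collect(seen)))")
            current, wider = TOY_ALPHABETS[idx], TOY_ALPHABETS[new_idx]
            # Promote the sink: decode each packed code and re-encode it.
            map!(c -> UInt8(toy_encode(wider, current[c])), sink, sink)
            idx = new_idx
            code = toy_encode(wider, symbol)
        end
        push!(sink, UInt8(code))
    end
    return idx, sink
end

guess_and_pack(Iterators.Stateful("ACGTN"))  # promoted once, ends on alphabet 2
```

Note that the guess depends on the ordering: a short protein such as "MKV" would be packed under the ambiguous DNA toy alphabet, which is the "best guess is good enough" trade-off mentioned above.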
@BioJulia/maintainers
I wanted to start this discussion because, with a 0.1 of Kmers submitted to General and some feedback on it coming in, I'm finding that some of the issues there and some of the ones in BioSequences are colliding in my mind, as they're entangled. So I'd like to talk it through here and come up with a principled and consistent answer across the two packages.
The issues are:
#231
#224
#219
From BioSequences and these from Kmers:
BioJulia/Kmers.jl#14
They all touch on how to deal with strings, parsing, or attempting to parse and encode elements.
For example, in Kmers, some of the constructors may need to deduce the alphabet either at pre-compile time or at runtime. Another case came up when I was looking at iterating over, say, RNA kmers in a DNA LongSequence, but skipping invalid characters rather than throwing.
Plus there is the issue that eltype for Alphabets does not quite behave as it should.
I'm thinking these issues probably require us to re-address how Alphabets are used and how we should go about constructing sequences - especially the encoding phases.
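To make the kmer case mentioned above concrete, here is a minimal sketch of "skip instead of throw" in plain Julia. It does not use the Kmers or BioSequences APIs: `valid_kmers` is a hypothetical helper, the sequence is a plain string, and the alphabet is just any collection of accepted characters.

```julia
# Hypothetical helper: collect every length-k window of `seq` that consists
# only of symbols in `alphabet`, silently skipping windows that touch an
# invalid symbol instead of throwing.
function valid_kmers(seq::AbstractString, k::Integer, alphabet)
    chars = collect(seq)
    kmers = String[]
    valid_run = 0   # length of the current run of valid symbols
    for (i, c) in enumerate(chars)
        valid_run = c in alphabet ? valid_run + 1 : 0
        # A full window ends at position i once the run is at least k long.
        valid_run >= k && push!(kmers, String(chars[i-k+1:i]))
    end
    return kmers
end

valid_kmers("ACGNTACG", 3, Set("ACGT"))  # windows containing 'N' are skipped
```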
Let's take the general constructor:
Now there are two points where this can throw: conversion to the eltype, and encoding. This strictness is obviously desirable in many places. However, there are cases where we don't want it, e.g. string parsing, where you might want to `tryparse` into a LongSequence with various alphabets until you succeed, without resorting to horrid nested try-catch. I'm open to suggestions about how we might improve this.

One thing I've been considering recently is the idea of a `tryencode` that takes responsibility for both eltype checking/conversion AND encoding, producing nothing on failure and the encoded data on success. Because convert, afaik, is a do-or-throw situation, we might need to work around that.
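As a rough sketch of what that could look like, assuming nothing about the real BioSequences internals: below, `ToyAlphabet`, `toy_symbol`, `tryencode_all` and `tryparse_first` are hypothetical names, and the `tryencode` signature is only one possible shape for the idea. Conversion and encoding both report failure by returning `nothing`, so trying several alphabets in turn needs no try-catch.

```julia
# Hypothetical toy alphabet: a name plus the symbols it accepts.
struct ToyAlphabet
    name::Symbol
    symbols::Vector{Char}
end

# Non-throwing stand-in for `convert`: return a Char, or `nothing`.
toy_symbol(x::Char) = x
toy_symbol(x::UInt8) = x < 0x80 ? Char(x) : nothing
toy_symbol(x) = nothing

# Element check/conversion AND encoding in one step; `nothing` on failure.
function tryencode(a::ToyAlphabet, x)
    s = toy_symbol(x)
    s === nothing && return nothing
    i = findfirst(==(s), a.symbols)
    return i === nothing ? nothing : UInt8(i)
end

# Encode a whole iterable, giving up at the first symbol that fails.
function tryencode_all(a::ToyAlphabet, it)
    codes = UInt8[]
    for x in it
        code = tryencode(a, x)
        code === nothing && return nothing
        push!(codes, code)
    end
    return codes
end

# Try each alphabet in order until one accepts the whole input.
function tryparse_first(alphabets, it)
    for a in alphabets
        codes = tryencode_all(a, it)
        codes === nothing || return (a.name, codes)
    end
    return nothing
end
```

With toy alphabets for DNA ("ACGT") and RNA ("ACGU") tried in that order, the input "ACGU" fails on the first and succeeds on the second, while "ACGX" yields `nothing` without any exception being thrown.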