The Alphabet API, encoding and attempting to parse / convert #1

TransGirlCodes (Maintainer) · 2022-06-09T12:57:45Z

@BioJulia/maintainers

I wanted to start this discussion because, with a 0.1 of Kmers submitted to General and some feedback on it coming in, I'm finding that some of the issues there and some of the ones in BioSequences are colliding in my mind, as they're entangled. So I'd like to talk it through here and come up with a principled answer that is consistent across the two packages.

The issues from BioSequences are:

#231
#224
#219

And from Kmers:

BioJulia/Kmers.jl#14

They all touch on how to deal with strings and parsing, or on attempting to parse and encode elements.
Examples from Kmers include constructors where the alphabet may need to be deduced, either at compile time or at runtime, and iterating over, say, RNA kmers in a DNA LongSequence while skipping invalid characters rather than throwing.

Plus there is the issue that eltype for Alphabets does not quite behave as it should.

I think these issues probably require us to revisit how Alphabets are used and how we go about constructing sequences, especially the encoding phase.

Let's take the general constructor:

function LongSequence{A}(it) where {A <: Alphabet}
    len = length(it)
    data = Vector{UInt64}(undef, seq_data_len(A, len))
    bits = zero(UInt64)
    bitind = bitindex(BitsPerSymbol(A()), encoded_data_eltype(LongSequence{A}), 1)
    @inbounds for (i, x) in enumerate(it)
        xT = convert(eltype(A), x)   # can throw: conversion to the eltype
        enc = encode(A(), xT)        # can throw: encoding
        bits |= enc << offset(bitind)
        # Flush the accumulated word once the next symbol starts a new one.
        if iszero(offset(nextposition(bitind)))
            data[index(bitind)] = bits
            bits = zero(UInt64)
        end
        bitind = nextposition(bitind)
    end
    # Write out the final, partially filled word, if any.
    iszero(offset(bitind)) || (data[index(bitind)] = bits)
    LongSequence{A}(data, len % UInt)
end

There are two points where this can throw: the conversion to the eltype, and the encoding. That strictness is desirable in many places, but there are cases where we don't want it, e.g. string parsing, where you might want to tryparse into a LongSequence with various alphabets until one succeeds, without resorting to horrid nested try-catch. I'm open to suggestions about how we might improve this.

One thing I've been considering recently is a tryencode that takes responsibility for both the eltype checking/conversion AND the encoding, returning nothing on failure and the encoded data on success. Because convert, afaik, either succeeds or throws, we might need to work around that.
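To make the tryencode idea concrete, here is a minimal sketch on a toy alphabet. Everything in it is hypothetical: ToyDNA is a stand-in rather than a BioSequences type, and tryencode and encode are defined locally for illustration only. It assumes that a failed type check and a failed encoding can both be reported as nothing.

```julia
# Toy stand-in, NOT a BioSequences type: a two-bit DNA alphabet over Char.
struct ToyDNA end

# Hypothetical `tryencode`: performs the type check and the encoding in one
# step, returning `nothing` instead of throwing.
function tryencode(::ToyDNA, x)
    x isa AbstractChar || return nothing   # stands in for a failed `convert`
    c = uppercase(Char(x))
    c == 'A' && return 0x00
    c == 'C' && return 0x01
    c == 'G' && return 0x02
    c == 'T' && return 0x03
    return nothing                          # not a symbol of this alphabet
end

# The throwing `encode` can then be a thin wrapper around the non-throwing one.
function encode(a::ToyDNA, x)
    enc = tryencode(a, x)
    enc === nothing && throw(ArgumentError("cannot encode $(repr(x)) in $(typeof(a))"))
    return enc
end
```

With this layering the strict path keeps its current behaviour, while callers that want to probe several alphabets can call tryencode and branch on nothing.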

jakobnissen (Maintainer) · 2022-06-09T13:43:51Z

So the core of the problem is: Given some input x::T where we can't deduce from T whether or not x conforms to alphabet A, check if the alphabet applies.
This is distinct from situations where we can know this at compile time, e.g. we know that a T{A} <: BioSequence{RNAAlphabet{2}} can be encoded as a T{RNAAlphabet{4}}. If we can know it at compile time, we can just solve it with dispatch.

I don't think we should make constructors return Union{T, Nothing}. That seems un-idiomatic. T(x) should always return T.
One difficulty I foresee is a tension between on one hand avoiding errors by checking the input using isalphabet, and on the other doing everything in one pass, because the input may be stateful. This is a generally unsolved problem in Julia.
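The tension with stateful input can be seen in a few lines. This sketch uses Iterators.Stateful from Base; isdna is a hypothetical validity check for a toy alphabet, not the library's function.

```julia
# Hypothetical validity check for a toy DNA alphabet over Char.
isdna(c::Char) = c in ('A', 'C', 'G', 'T')

chars = Iterators.Stateful("ACGT")

# Pass 1: validate. This consumes the stateful iterator.
ok = all(isdna, chars)

# Pass 2: there is nothing left to construct a sequence from.
leftover = collect(chars)

# Collecting first makes the input safely re-iterable, at the cost of a copy.
buffered = collect(Iterators.Stateful("ACGT"))
ok_buffered = all(isdna, buffered)
```

After the check succeeds, leftover is empty, which is why a check-then-construct design silently produces an empty sequence for this kind of input unless it buffers first.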

Here is what I think we should do. I haven't given that much thought to it, so I'm open to suggestions:

T(s::AbstractString), if implemented, should dispatch to parse(T, s)
parse(T, s) should just call a new function tryparse_internal(T, s) and error if the latter returns an error result.
tryparse(T, s) should call tryparse_internal. We have this internal function because we want to propagate information about the error which is not possible to store in nothing (this is yet another unsolved problem with Julia, there is a long discussion on the Julia repo on this)
If we need to implement tryconvert or tryencode to implement tryparse_internal, so be it.
Add iscompatible(x::AbstractString, ::Alphabet) which checks whether x can be parsed to a sequence of that alphabet
Add iscompatible(x::BioSymbol, ::Alphabet) which checks whether x can be converted to an element of that alphabet. I'm not sure how to implement this.
Add iscompatible(x, ::Alphabet) which checks whether all elements of x are convertible to the alphabet.

There are some issues with these three iscompatible methods having slightly different semantics. I have to think more about it.
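The parse / tryparse / tryparse_internal layering proposed above could look roughly like the following. This is a sketch under assumptions: ToySeq, ParseFailure and CODES are hypothetical names invented for the example, and the real tryparse_internal might carry different error information.

```julia
# Toy sequence type, NOT a BioSequences type.
struct ToySeq
    data::Vector{UInt8}
end

# Hypothetical error value carrying the detail that `nothing` cannot hold.
struct ParseFailure
    position::Int
    char::Char
end

const CODES = Dict('A' => 0x00, 'C' => 0x01, 'G' => 0x02, 'T' => 0x03)

# Single non-throwing kernel: returns a ToySeq or a ParseFailure.
function tryparse_internal(::Type{ToySeq}, s::AbstractString)
    data = UInt8[]
    for (i, c) in enumerate(s)
        code = get(CODES, c, nothing)
        code === nothing && return ParseFailure(i, c)
        push!(data, code)
    end
    return ToySeq(data)
end

# `tryparse` discards the detail and returns `nothing` on failure.
function Base.tryparse(::Type{ToySeq}, s::AbstractString)
    r = tryparse_internal(ToySeq, s)
    return r isa ParseFailure ? nothing : r
end

# `parse` turns the failure into an informative exception.
function Base.parse(::Type{ToySeq}, s::AbstractString)
    r = tryparse_internal(ToySeq, s)
    r isa ParseFailure &&
        throw(ArgumentError("invalid symbol $(repr(r.char)) at position $(r.position)"))
    return r
end

# The string constructor always returns a ToySeq or throws.
ToySeq(s::AbstractString) = parse(ToySeq, s)
```

The constructor never returns a Union with Nothing, and the error detail lives in one place, so parse and tryparse cannot drift apart.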

Add Base.iterate(::Alphabet, state...) to fix #231 - should be straightforward.
If you get an object x of unknown type and want to predict the alphabet, you need to collect it so you can guarantee it's not stateful, then call iscompatible repeatedly until you find the correct alphabet.
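The last two points, iterating an alphabet and collecting before probing, can be sketched together. ToyAlphabet, CANDIDATES and guess_alphabet are hypothetical, and the candidate order (most specific first) is an assumption of the example rather than something settled in this thread.

```julia
# Toy alphabets as symbol tuples, NOT the BioSequences alphabets.
struct ToyAlphabet
    name::Symbol
    symbols::Tuple{Vararg{Char}}
end

# Iterating an alphabet yields its symbols, in the spirit of the proposed
# Base.iterate(::Alphabet, state...).
Base.iterate(a::ToyAlphabet, state...) = iterate(a.symbols, state...)
Base.length(a::ToyAlphabet) = length(a.symbols)
Base.eltype(::Type{ToyAlphabet}) = Char

# Hypothetical `iscompatible`: every element of `x` is a symbol of `a`.
iscompatible(x, a::ToyAlphabet) = all(in(a.symbols), x)

# Candidates ordered from most to least specific (an assumption).
const CANDIDATES = (
    ToyAlphabet(:dna, ('A', 'C', 'G', 'T')),
    ToyAlphabet(:rna, ('A', 'C', 'G', 'U')),
    ToyAlphabet(:dna_ambiguous, ('A', 'C', 'G', 'T', 'N')),
)

function guess_alphabet(x)
    buffered = collect(x)   # guarantees the checks below can re-iterate
    for a in CANDIDATES
        iscompatible(buffered, a) && return a
    end
    return nothing
end
```

Because the input is collected once up front, the repeated iscompatible calls work the same for strings, vectors and stateful iterators.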


CiaranOMara · 2022-06-13T03:04:27Z


@SabrinaJaye, I think part of your ideas around tryparse and tryencode arise from the specific case of handling strings or chars where the user cannot supply information about which symbol set (A) to use?

I don't think tryparse and tryencode need to be in the constructor. I also agree with @jakobnissen that the constructor should not return Union{T, Nothing}. Instead, tryparse and tryencode could live in an iterator that gets passed to the constructor. When that iterator encounters an error, it could emit an acceptable invalid symbol, provided the sink has enough bits to support an encoding that represents such a symbol.
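A rough sketch of such an iterator layer is below. VALID, INVALID, lenient and strict_collect are hypothetical names, and using 'N' as the invalid symbol is an assumption for illustration; the strict consumer merely stands in for an unchanged constructor kernel.

```julia
# Toy four-symbol alphabet plus one extra symbol reserved for "invalid".
const VALID = ('A', 'C', 'G', 'T')
const INVALID = 'N'   # hypothetical placeholder symbol

# Lazy wrapper: valid symbols pass through, anything else becomes INVALID.
lenient(it) = (c in VALID ? c : INVALID for c in it)

# A strict consumer stands in for the unchanged constructor kernel.
function strict_collect(it)
    out = Char[]
    for c in it
        (c in VALID || c == INVALID) || throw(ArgumentError("bad symbol $(repr(c))"))
        push!(out, c)
    end
    return out
end

cleaned = strict_collect(lenient("ACXGT?"))
```

The kernel stays strict and single-pass; leniency is opted into by wrapping the input.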

Having said that, a naive collect-like approach for constructing without try and catch might look like the following.

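function LongSequence(it)
    symbols = unique(it)
    # `alphabets` is assumed to be an ordered collection of iterable alphabet
    # instances; it will need some clever ordering.
    idx = findfirst(alphabet -> issubset(symbols, alphabet), alphabets)
    idx === nothing && throw(ArgumentError("no known alphabet contains all symbols"))
    A = typeof(alphabets[idx])
    return LongSequence{A}(it)
end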

However, this approach doesn't handle a stateful iterator unless it also collects everything or can reset the iterator.

A single-pass-shovel-forward-like approach that can handle a stateful iterator would need a way to promote the symbols already encoded or packed. So for a single pass operation, we would keep a record of the unique symbols observed and check that the observed symbols are still a subset of the chosen alphabet. If they are not, change the chosen alphabet and promote the sink, then continue. The hope is that the sink gets promoted to the correct alphabet early.
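The single-pass idea with promotion can be sketched as follows. LADDER and shovel are hypothetical names, the alphabets are plain Sets of Char ordered from narrowest to widest, and the sink is an unpacked Vector{Char}, so the promotion step here only switches the alphabet; a packed sink would also have to re-encode what it already holds.

```julia
# Toy alphabets ordered from narrowest to widest; NOT the BioSequences types.
const LADDER = [
    Set("ACGT"),                 # DNA
    Set("ACGTMRWSYKVHDBN"),      # DNA with ambiguity codes
    Set("ACDEFGHIKLMNPQRSTVWY"), # amino acids
]

# Single pass over `it`. Returns the index of the chosen alphabet and the
# collected symbols, or `nothing` if no alphabet in the ladder fits.
function shovel(it)
    level = 1
    seen = Set{Char}()
    sink = Char[]                # stands in for the packed, encoded buffer
    for c in it
        push!(seen, c)
        if !(c in LADDER[level])
            # Promote: find the next alphabet containing everything seen so far.
            next = findnext(a -> issubset(seen, a), LADDER, level + 1)
            next === nothing && return nothing
            level = next
            # A packed sink would re-encode its contents here.
        end
        push!(sink, c)
    end
    return (level = level, sink = sink)
end
```

Each symbol is read exactly once, so this also works for stateful input; the cost is the bookkeeping of seen symbols plus any re-encoding on promotion.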

I think it's worth noting that we're talking about convenience, so the best guess is good enough, and a performance hit is inevitable. However, I don't think convenience should be allowed to affect the main kernel. So, I'd advocate for an additional layer/iterator like the one suggested here and alphabets with invalid symbol representation.
