Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define public API #1

Open
natir opened this issue Jul 30, 2020 · 18 comments
Open

Define public API #1

natir opened this issue Jul 30, 2020 · 18 comments

Comments

@natir
Copy link
Owner

natir commented Jul 30, 2020

We need to define a public API.

@Daniel-Liu-c0deb0t I give to you power to write on this repo so you can in theory edit any comment, I propose to use the next comment like a resume of what we want in this API.

@natir
Copy link
Owner Author

natir commented Jul 30, 2020

Accept ? :

encode(input: &[u8]) -> Vec<u64> // each cell of output table contains 32 nucleotide
decode(input: &[u8]) -> Vec<u64> // each cell of input table contains 32 nucleotide

valid_DNA(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, a, c, t or g  return false
valid_DNA_with_N(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, N, a, c, t, g, n  return false

In discussion:

comp(&mut Vec<u64>): // Make complement in place or build a new vector ?

@natir
Copy link
Owner Author

natir commented Jul 30, 2020

Can we let the user choose the type used to store compacted nucleotide? We can use u16, u32, u64 or u128 to store more nucleotide in one variable. If it's possible encode signature can become:

encode<T>(input: &[u8]) -> Vec<T> 

I think we should provide in addition to these functions an iterator to facilitate the use of nuc2bit in some scenarios. The iterator would apply the function encode/decode to the data and only provide one nucleotide at a time. Something like:

let it = nuc2bit::Encode("ACTG"):

assert_eq(Some(0b00), it.next());
assert_eq(Some(0b01), it.next());
assert_eq(Some(0b10), it.next());
assert_eq(Some(0b11), it.next());
assert_eq(None, it.next());

@natir natir changed the title Define API Define public API Jul 30, 2020
@Daniel-Liu-c0deb0t
Copy link
Collaborator

Rust doesn't play well with generic numeric types, so we have to pick a single type, like u64. The alternative is macros, but I don't see much benefit from having it generic.

I think it would be easy to have two features:
iterator-based (similar to your nuc2bit implementation and your previous comment):

encode(input: &[u8]) -> impl Iterator<u8>;

and bulk-processing (similar to my implementations):

encode(input: &[u8]) -> Vec<u64>;
decode(input: &[u64]) -> Vec<u8>;

These all encode/decode ATCG (and U) to binary strings, and they would detect if SIMD features are available. If they are, then they will make use of those features. Essentially, it would choose the fastest function for every case. We don't have to implement other slower functions that we tested. I think the documentation mentions how to do CPU feature detection. Eventually, we will probably need a scalar, SSE, and AVX implementation for each function.

We can also have a check function to verify it a string is valid nucleotides, and another function to get the complement of a string of nucleotides. We can do this later. Also, let's not worry about implementing the triplet encoding for undetermined nucleotides that I talked about in my repo.

@natir
Copy link
Owner Author

natir commented Aug 3, 2020

Rust doesn't play well with generic numeric types, so we have to pick a single type, like u64. The alternative is macros, but I don't see much benefit from having it generic.

Ok I agree. (I edit first comment)

I think it would be easy to have two features:

Why did you want to have two features, for me iterator it's just sugar on top of encode/decode` function. Compute all elements at building and give elements one by one during iteration. This use more memory, but not too much, and it's probably the most efficient way.

These all encode/decode ATCG (and U) to binary strings, and they would detect if SIMD features are available …

This plan is perfect for me.

We can also have a check function to verify it a string is valid nucleotides, and another function to get the complement of a string of nucleotides.

Maybe add a revcomp function too.

Also, let's not worry about implementing the triplet encoding for undetermined nucleotides that I talked about in my repo.

Hey I see, but you didn't want add this work in nuc2bit ?

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Using an iterator means that the nucleotides cannot be packed into u64 chunks. For each byte character in, it would lazily output one byte character, without compressing four nucleotides into a single byte.

The undetermined nucleotide encoding technique is a bit annoying to implement in a cross-platform way, so we will worry about it in the future.

@natir
Copy link
Owner Author

natir commented Aug 4, 2020

Using an iterator means that the nucleotides cannot be packed into u64 chunks. For each byte character in, it would lazily output one byte character, without compressing four nucleotides into a single byte.

I disagree with you we can pack into u64 chunks and unpack it on demand.
I write just a python generator as an exemple to show what I have in my mine:

def nuc2bit_iterator(dna):
    u64_vec = encode(dna)
    for u64_val in u64_vec:
        for _ in range(0, 32):
            yield u64_val & 0b11000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 >> 62
            u64_val << 2

(we can probably do better stuff)

Ok we lose the advantage of compression but we keep the speedup of encoding all sequences speedily and gain the flexibility for the user. My actual main usage of nuc2bit is iterate over 2 bit encodes nucleotide one by one, but I can understand if you think it's out of nuc2bit scope.

The undetermined nucleotide encoding technique is a bit annoying to implement in a cross-platform way, so we will worry about it in the future.

Ok (I remove it from the actual public api)

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Ok, I think that encoding all the sequences using my SIMD methods and then unpacking it is slow. Most of the time spent is packing it into u64 chunks. If we don't have to pack it, then we save a lot of time. For the iterator implementation, it is better to just use a lookup table to convert byte -> byte, like what you already have.

@natir
Copy link
Owner Author

natir commented Aug 11, 2020

Ok perfect.

We have, an interface to work at the chunk level and one to work at nucleotide level, it seems good for me.

Plus function to validate DNA and function to perform complement it seems a good API for me.

encode(input: &[u8]) -> Vec<u64> // each cell of output table contains 32 nucleotide
decode(input: &[u8]) -> Vec<u64> // each cell of input table contains 32 nucleotide

EncodeIterator // An iterator to encode &[u8] DNA string in 2 bit
DecodeIterator // An iterator to decode Vec<u64> of 2 bit in DNA

valid_DNA(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, a, c, t or g  return false
valid_DNA_with_N(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, N, a, c, t, g, n  return false

comp(&mut Vec<u64>): // Make complement in place
comp_one(&mut u64); // Make complement in place

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Looks good, I'll start working on it when I have time.

@Daniel-Liu-c0deb0t
Copy link
Collaborator

I've started to work on this, and I've realized that to have efficient complement, we have to encode A=00, T=11, C=01, G=10 or something like that, so complement can just be binary negation.

@natir
Copy link
Owner Author

natir commented Aug 21, 2020

Are you sure the xor with 10 n time is really slower than the binary negation?

And in fact I use the A-> 00, C -> 01, T -> 10 and G -> 11 have some cool property if we work with sub-sequence with odd length see

Maybe we can allow the user to choose is encoding? It's probably very hard to implement this functionality

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Yeah, I realized right after I made that comment that XORing with 0b10 is possible. I think we could just keep the original encoding without any changes.

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Do you want to implement a function for checking whether a sequence of bytes represents a valid nucleotide sequence? It shouldn't be too hard to implement, by using _mm256_shuffle_epi8 to lookup each nucleotide.

@Daniel-Liu-c0deb0t
Copy link
Collaborator

I added a popcount implementation, which should make Hamming distance calculations very fast. This isn't directly part of our API, but I think it would be very useful.

@natir
Copy link
Owner Author

natir commented Aug 26, 2020

Yes ok no problem, I didn't see you made lot of work !

@natir
Copy link
Owner Author

natir commented Aug 26, 2020

I add a first version of Encode and Decode iterator.

I add a benchmark to check the effect of length on computation time and we have a with different GC%. For this, I need to create bench-internals feature to be able to benchmark private function.

I create some issues and a milestone to follow our progression.

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Alright, that is great! I like how internal benchmarks are done; I haven't thought of doing it that way.

@Daniel-Liu-c0deb0t
Copy link
Collaborator

Daniel-Liu-c0deb0t commented Aug 27, 2020

I think we should consider changing the API to take &mut [u8] or &mut [u64] instead of allocating memory for every operation. The only problem is aligning the slices to 16- or 32-byte boundaries, so stores can be done efficiently.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants