Define public API #1

natir · 2020-07-30T13:57:35Z

We need to define a public API.

@Daniel-Liu-c0deb0t I give to you power to write on this repo so you can in theory edit any comment, I propose to use the next comment like a resume of what we want in this API.

natir · 2020-07-30T14:05:16Z

Accept ? :

encode(input: &[u8]) -> Vec<u64> // each cell of output table contains 32 nucleotide
decode(input: &[u8]) -> Vec<u64> // each cell of input table contains 32 nucleotide

valid_DNA(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, a, c, t or g  return false
valid_DNA_with_N(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, N, a, c, t, g, n  return false

In discussion:

comp(&mut Vec<u64>): // Make complement in place or build a new vector ?

natir · 2020-07-30T14:15:26Z

Can we let the user choose the type used to store compacted nucleotide? We can use u16, u32, u64 or u128 to store more nucleotide in one variable. If it's possible encode signature can become:

encode<T>(input: &[u8]) -> Vec<T>

I think we should provide in addition to these functions an iterator to facilitate the use of nuc2bit in some scenarios. The iterator would apply the function encode/decode to the data and only provide one nucleotide at a time. Something like:

let it = nuc2bit::Encode("ACTG"):

assert_eq(Some(0b00), it.next());
assert_eq(Some(0b01), it.next());
assert_eq(Some(0b10), it.next());
assert_eq(Some(0b11), it.next());
assert_eq(None, it.next());

Daniel-Liu-c0deb0t · 2020-07-31T22:01:28Z

Rust doesn't play well with generic numeric types, so we have to pick a single type, like u64. The alternative is macros, but I don't see much benefit from having it generic.

I think it would be easy to have two features:
iterator-based (similar to your nuc2bit implementation and your previous comment):

encode(input: &[u8]) -> impl Iterator<u8>;

and bulk-processing (similar to my implementations):

encode(input: &[u8]) -> Vec<u64>;
decode(input: &[u64]) -> Vec<u8>;

These all encode/decode ATCG (and U) to binary strings, and they would detect if SIMD features are available. If they are, then they will make use of those features. Essentially, it would choose the fastest function for every case. We don't have to implement other slower functions that we tested. I think the documentation mentions how to do CPU feature detection. Eventually, we will probably need a scalar, SSE, and AVX implementation for each function.

We can also have a check function to verify it a string is valid nucleotides, and another function to get the complement of a string of nucleotides. We can do this later. Also, let's not worry about implementing the triplet encoding for undetermined nucleotides that I talked about in my repo.

natir · 2020-08-03T09:44:20Z

Rust doesn't play well with generic numeric types, so we have to pick a single type, like u64. The alternative is macros, but I don't see much benefit from having it generic.

Ok I agree. (I edit first comment)

I think it would be easy to have two features:

Why did you want to have two features, for me iterator it's just sugar on top of encode/decode` function. Compute all elements at building and give elements one by one during iteration. This use more memory, but not too much, and it's probably the most efficient way.

These all encode/decode ATCG (and U) to binary strings, and they would detect if SIMD features are available …

This plan is perfect for me.

We can also have a check function to verify it a string is valid nucleotides, and another function to get the complement of a string of nucleotides.

Maybe add a revcomp function too.

Also, let's not worry about implementing the triplet encoding for undetermined nucleotides that I talked about in my repo.

Hey I see, but you didn't want add this work in nuc2bit ?

Daniel-Liu-c0deb0t · 2020-08-03T22:26:46Z

Using an iterator means that the nucleotides cannot be packed into u64 chunks. For each byte character in, it would lazily output one byte character, without compressing four nucleotides into a single byte.

The undetermined nucleotide encoding technique is a bit annoying to implement in a cross-platform way, so we will worry about it in the future.

natir · 2020-08-04T08:40:06Z

Using an iterator means that the nucleotides cannot be packed into u64 chunks. For each byte character in, it would lazily output one byte character, without compressing four nucleotides into a single byte.

I disagree with you we can pack into u64 chunks and unpack it on demand.
I write just a python generator as an exemple to show what I have in my mine:

def nuc2bit_iterator(dna):
    u64_vec = encode(dna)
    for u64_val in u64_vec:
        for _ in range(0, 32):
            yield u64_val & 0b11000000_00000000_00000000_00000000_00000000_00000000_00000000_00000000 >> 62
            u64_val << 2

(we can probably do better stuff)

Ok we lose the advantage of compression but we keep the speedup of encoding all sequences speedily and gain the flexibility for the user. My actual main usage of nuc2bit is iterate over 2 bit encodes nucleotide one by one, but I can understand if you think it's out of nuc2bit scope.

The undetermined nucleotide encoding technique is a bit annoying to implement in a cross-platform way, so we will worry about it in the future.

Ok (I remove it from the actual public api)

Daniel-Liu-c0deb0t · 2020-08-06T22:13:13Z

Ok, I think that encoding all the sequences using my SIMD methods and then unpacking it is slow. Most of the time spent is packing it into u64 chunks. If we don't have to pack it, then we save a lot of time. For the iterator implementation, it is better to just use a lookup table to convert byte -> byte, like what you already have.

natir · 2020-08-11T14:20:09Z

Ok perfect.

We have, an interface to work at the chunk level and one to work at nucleotide level, it seems good for me.

Plus function to validate DNA and function to perform complement it seems a good API for me.

encode(input: &[u8]) -> Vec<u64> // each cell of output table contains 32 nucleotide
decode(input: &[u8]) -> Vec<u64> // each cell of input table contains 32 nucleotide

EncodeIterator // An iterator to encode &[u8] DNA string in 2 bit
DecodeIterator // An iterator to decode Vec<u64> of 2 bit in DNA

valid_DNA(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, a, c, t or g  return false
valid_DNA_with_N(input: &[u8]) -> bool // If anything in input isn't a A, C, T, G, N, a, c, t, g, n  return false

comp(&mut Vec<u64>): // Make complement in place
comp_one(&mut u64); // Make complement in place

Daniel-Liu-c0deb0t · 2020-08-14T04:40:14Z

Looks good, I'll start working on it when I have time.

Daniel-Liu-c0deb0t · 2020-08-21T06:17:00Z

I've started to work on this, and I've realized that to have efficient complement, we have to encode A=00, T=11, C=01, G=10 or something like that, so complement can just be binary negation.

natir · 2020-08-21T20:10:36Z

Are you sure the xor with 10 n time is really slower than the binary negation?

And in fact I use the A-> 00, C -> 01, T -> 10 and G -> 11 have some cool property if we work with sub-sequence with odd length see

Maybe we can allow the user to choose is encoding? It's probably very hard to implement this functionality

Daniel-Liu-c0deb0t · 2020-08-21T23:28:42Z

Yeah, I realized right after I made that comment that XORing with 0b10 is possible. I think we could just keep the original encoding without any changes.

Daniel-Liu-c0deb0t · 2020-08-22T22:00:02Z

Do you want to implement a function for checking whether a sequence of bytes represents a valid nucleotide sequence? It shouldn't be too hard to implement, by using _mm256_shuffle_epi8 to lookup each nucleotide.

Daniel-Liu-c0deb0t · 2020-08-26T05:11:04Z

I added a popcount implementation, which should make Hamming distance calculations very fast. This isn't directly part of our API, but I think it would be very useful.

natir · 2020-08-26T08:08:56Z

Yes ok no problem, I didn't see you made lot of work !

natir · 2020-08-26T18:29:28Z

I add a first version of Encode and Decode iterator.

I add a benchmark to check the effect of length on computation time and we have a with different GC%. For this, I need to create bench-internals feature to be able to benchmark private function.

I create some issues and a milestone to follow our progression.

Daniel-Liu-c0deb0t · 2020-08-26T23:42:48Z

Alright, that is great! I like how internal benchmarks are done; I haven't thought of doing it that way.

Daniel-Liu-c0deb0t · 2020-08-27T22:23:35Z

I think we should consider changing the API to take &mut [u8] or &mut [u64] instead of allocating memory for every operation. The only problem is aligning the slices to 16- or 32-byte boundaries, so stores can be done efficiently.

natir changed the title ~~Define API~~ Define public API Jul 30, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Define public API #1

Define public API #1

natir commented Jul 30, 2020 •

edited

Loading

natir commented Jul 30, 2020 •

edited

Loading

natir commented Jul 30, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Jul 31, 2020

natir commented Aug 3, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Aug 3, 2020

natir commented Aug 4, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Aug 6, 2020

natir commented Aug 11, 2020

Daniel-Liu-c0deb0t commented Aug 14, 2020

Daniel-Liu-c0deb0t commented Aug 21, 2020

natir commented Aug 21, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Aug 21, 2020

Daniel-Liu-c0deb0t commented Aug 22, 2020

Daniel-Liu-c0deb0t commented Aug 26, 2020

natir commented Aug 26, 2020

natir commented Aug 26, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Aug 26, 2020

Daniel-Liu-c0deb0t commented Aug 27, 2020 •

edited

Loading

Define public API #1

Define public API #1

Comments

natir commented Jul 30, 2020 • edited Loading

natir commented Jul 30, 2020 • edited Loading

natir commented Jul 30, 2020 • edited Loading

Daniel-Liu-c0deb0t commented Jul 31, 2020

natir commented Aug 3, 2020 • edited Loading

Daniel-Liu-c0deb0t commented Aug 3, 2020

natir commented Aug 4, 2020 • edited Loading

Daniel-Liu-c0deb0t commented Aug 6, 2020

natir commented Aug 11, 2020

Daniel-Liu-c0deb0t commented Aug 14, 2020

Daniel-Liu-c0deb0t commented Aug 21, 2020

natir commented Aug 21, 2020 • edited Loading

Daniel-Liu-c0deb0t commented Aug 21, 2020

Daniel-Liu-c0deb0t commented Aug 22, 2020

Daniel-Liu-c0deb0t commented Aug 26, 2020

natir commented Aug 26, 2020

natir commented Aug 26, 2020 • edited Loading

Daniel-Liu-c0deb0t commented Aug 26, 2020

Daniel-Liu-c0deb0t commented Aug 27, 2020 • edited Loading

natir commented Jul 30, 2020 •

edited

Loading

natir commented Jul 30, 2020 •

edited

Loading

natir commented Jul 30, 2020 •

edited

Loading

natir commented Aug 3, 2020 •

edited

Loading

natir commented Aug 4, 2020 •

edited

Loading

natir commented Aug 21, 2020 •

edited

Loading

natir commented Aug 26, 2020 •

edited

Loading

Daniel-Liu-c0deb0t commented Aug 27, 2020 •

edited

Loading