Extending ncbi_byid #101

boopsboops · 2017-10-01T18:21:49Z

I just stumbled on the function ncbi_byid after it cropped in the r-sig-phylo list.

I'm working on assembling reference libraries for eDNA metabarcoding, and being able to curate, update and review the quality of your reference libraries is really important. Therefore having this data in a table, rather than fasta format is essential.

Your function seems super quick, and gives back a lovely table. However, it's lacking a lot of the fields I would require for better filtering and quality control of reference libraries, such as lat_lon, country, specimen_voucher, publication status etc etc etc.

All this kind of info is usually in the GenBank metadata, and I've already implemented such a function to "tabulize" it (gb2df), but I think you will agree, it's an absolute abomination. I used EBI rather than NCBI as it's much faster to download large numbers of sequences to a local tempfile, and then did a lot of multithreaded XML scraping (which is very inefficient).
https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df_example.R https://github.com/boopsboops/SeaDNA/blob/master/scripts/gb2df.R

So, if you guys think that these would be important or appropriate additions to ncbi_byid, I'll be more than happy to help (although I confess to not really getting my head around how the function works yet).

Cheers,

Rupert

The text was updated successfully, but these errors were encountered:

sckott · 2017-10-02T18:12:19Z

thanks for the issue @boopsboops

There's a few things going on here, so separating them out:

Add more fields to the output of ncbi_byid
EBI vs. NCBI: do you get the same data? or are they two complementary data sources (my ignorance of this space is coming through)
parallel XML processing: how much of a diff. does this make? curious cause it adds additional complexity so would only use if was really necessary

also:

would you be okay with xml2 instead of XML
see also related fxns https://github.com/ropensci/traits/blob/master/R/ncbi_byname.R and https://github.com/ropensci/traits/blob/master/R/ncbi_searcher.R (may or may not be of interest here)

@dwinter is there anything else to mention here?

boopsboops · 2017-10-03T20:53:04Z

Hi @sckott

EBI vs. NCBI: Yes, in theory the European Nucleotide Archive (ENA) at the EBI is a mirror of NCBI GenBank. Although sometimes I find data on NCBI that hasn't found its way onto ENA yet. I used it because it serves up data quicker than NCBI, possibly because the servers are closer to me and/or that it gets a lot less traffic.
Parallel XML processing: it made a pretty big difference to my inefficient code, but the difference to yours is minimal. Took about 40 sec for 500 sequences with ncbi_byid, and about 30 sec with gb2df. Clearly that extra bloat isn't helping much. In fact, by multithreading ncbi_byid I managed to download 6,500 sequences in about three minutes, which is more than acceptable.

So, I had a much better look at your code, and it seems fairly trivial to insert the things I need, and have successfully done so. What threw me was that you were using "GBSeq" XML style, which I have never seen before, and did not know was available.

Anyway, if you think it will be valuable to offer these extra fields in your function, I can post the code here or do a pull request. If not, I'm happy to just fork and use it locally for my own work.

Cheers!

sckott · 2017-10-04T21:03:46Z

Thanks for the follow up. A PR with your changes sounds good

sckott · 2017-10-06T19:35:52Z

@boopsboops does your merged PR solve all issues you had?

boopsboops · 2017-10-11T21:09:44Z

Sorry for the delay. Yes, it did. Thanks for the help :)

boopsboops mentioned this issue Oct 6, 2017

Added 10 new fields to ncbi_byid #102

Merged

boopsboops closed this as completed Oct 11, 2017

sckott added this to the v0.4 milestone Oct 11, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extending ncbi_byid #101

Extending ncbi_byid #101

boopsboops commented Oct 1, 2017 •

edited by sckott

Loading

sckott commented Oct 2, 2017

boopsboops commented Oct 3, 2017

sckott commented Oct 4, 2017

sckott commented Oct 6, 2017

boopsboops commented Oct 11, 2017

Extending ncbi_byid #101

Extending ncbi_byid #101

Comments

boopsboops commented Oct 1, 2017 • edited by sckott Loading

sckott commented Oct 2, 2017

boopsboops commented Oct 3, 2017

sckott commented Oct 4, 2017

sckott commented Oct 6, 2017

boopsboops commented Oct 11, 2017

boopsboops commented Oct 1, 2017 •

edited by sckott

Loading