paleobioDB data download functions seem to drop empty variables (unexpectedly?) #18

Closed
dwbapst opened this issue Feb 24, 2015 · 2 comments

@dwbapst

dwbapst commented Feb 24, 2015

While comparing paleobioDB data downloads to direct calls to the PBDB API, I discovered that some paleobioDB functions can return quite different output from the corresponding raw API calls.

Here's an example with occurrence data for a graptolite genus:

#with the raw API
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all")

#with paleobioDB package
library(paleobioDB)
## Loading required package: raster
## Loading required package: sp
## Loading required package: maps
PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")

# same number of columns in both?
ncol(rawAPI)==ncol(PBDBpack)
## [1] FALSE
# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"       "superceded"    "subgenus_name" "subgenus_reso"

My hypothesis is that empty variables aren't being returned: either they are not output at all in the JSON responses from the PBDB API, or they are dropped when paleobioDB transforms the JSON responses into a data table. I looked over the code in the package's repo for any evidence of variable dropping but couldn't find any.

# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all(is.na(rawAPI[,x])))
##       reid_no    superceded subgenus_name subgenus_reso 
##          TRUE          TRUE          TRUE          TRUE

However, I can find no mention of this phenomenon in paleobioDB's manual or in the documentation for the Paleobiology Database's API. It's a little odd because it means one could query, for example, for abundance data (show=c("abund")) and not get back the abundance columns because they are empty. This could create issues for code that expects variables to always be present in a given data download.

Here is an extremely contrived example:

rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Reteograptus&show=abund&limit=all")
PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")

# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
## [1] "reid_no"     "superceded"  "abund_value" "abund_unit"
#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(is.na(x)))
## abund_value  abund_unit 
##        TRUE        TRUE

This may not be a solvable issue, as I realize it may stem from the behavior of the API. However, at the very least, this variable-dropping behavior should be discussed in the package's documentation to alert users.
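
Until the behavior is documented or changed, downstream code can guard against it by padding the result with the columns it expects. The following is a minimal sketch, not part of paleobioDB: `ensure_columns` is a hypothetical helper, and the toy data frame only stands in for a `pbdb_occurrences()` result that lacks the abundance columns.

```r
# Hypothetical helper (not part of paleobioDB): add any expected columns
# that are missing from a data frame, filled with NA.
ensure_columns <- function(df, expected) {
  missing <- setdiff(expected, colnames(df))
  for (col in missing) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df
}

# Toy stand-in for a download in which the abundance columns were dropped
occs <- data.frame(occurrence_no = c(1, 2), taxon_name = c("A", "B"))

padded <- ensure_columns(
  occs,
  c("occurrence_no", "taxon_name", "abund_value", "abund_unit")
)
colnames(padded)
```

Missing columns are appended at the end, so code that depends on column order should select by name rather than by position.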

@javigzz
Contributor

javigzz commented Feb 24, 2019

Hello @dwbapst,

Thanks for your detailed report... filed exactly four years ago.

In fact, the txt endpoint of the API returns the empty columns:
http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all
while the json endpoint doesn't:
http://paleobiodb.org/data1.1/occs/list.json?limit=all&base_name=Dicellograptus&show=ident&vocab=pbdb
So it is out of the control of this client. As you suggest, I am adding a short warning to the documentation of the pbdb_occurrences function as part of the PR for version 0.6.
Regards
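
To illustrate why a field that is empty everywhere cannot survive a JSON round trip, here is a small offline sketch in base R. It assumes that the JSON response simply omits fields with no value from each record, which is what the comparison above suggests; the mock records and the table-building code are illustrative only and are not how paleobioDB actually parses responses.

```r
# Two mock records as they might come back from a JSON endpoint: fields with
# no value are simply absent (an assumption, see the paragraph above).
records <- list(
  list(occurrence_no = 1, genus_name = "Dicellograptus"),
  list(occurrence_no = 2, genus_name = "Dicellograptus", species_name = "sp.")
)

# Build a table from the union of field names seen in any record,
# filling NA where a record lacks a field.
fields <- unique(unlist(lapply(records, names)))
rows <- lapply(records, function(rec) {
  vals <- lapply(fields, function(f) if (is.null(rec[[f]])) NA else rec[[f]])
  names(vals) <- fields
  as.data.frame(vals, stringsAsFactors = FALSE)
})
tab <- do.call(rbind, rows)

colnames(tab)
# A field absent from every record never becomes a column
"subgenus_name" %in% colnames(tab)
```

A field present in at least one record can be filled with NA for the others, but a field missing from every record leaves no trace for the client to reconstruct.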

@javigzz
Contributor

javigzz commented Feb 24, 2019

Reported to the API devs paleobiodb/data_service#45
