While comparing paleobioDB data downloads to direct calls to the PBDB API, I discovered that some paleobioDB functions could have quite different output from their corresponding calls with the raw API.
Here's an example with occurrence data for a graptolite genus:
# with the raw API
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all")
#with paleobioDB package
library(paleobioDB)
PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")
# same number of columns in both?
ncol(rawAPI)==ncol(PBDBpack)
## [1] FALSE
# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
My hypothesis is that empty variables aren't being returned: either the PBDB API omits them from its JSON responses, or they get dropped when paleobioDB transforms the JSON responses into a data table. I looked over the code in the package's repo for any evidence of variable dropping but couldn't find any.
# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all(is.na(rawAPI[,x])))
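To make the hypothesis concrete, here is a minimal sketch in base R of how it could play out. It does not call the PBDB API; the two records, their values, and the identifier columns are made up for illustration, and the conversion step is only a stand-in for whatever paleobioDB actually does internally.

```r
# Two hypothetical parsed JSON records. Assumption: fields with no value are
# simply absent from the JSON, so neither record carries the abundance fields.
records <- list(
  list(occurrence_no = 1L, taxon_name = "Dicellograptus"),
  list(occurrence_no = 2L, taxon_name = "Dicellograptus sp.")
)

# Build a table from the union of the field names that are actually present.
fields <- unique(unlist(lapply(records, names)))
from_json <- do.call(rbind, lapply(records, function(r) {
  row <- lapply(fields, function(f) if (is.null(r[[f]])) NA else r[[f]])
  names(row) <- fields
  as.data.frame(row, stringsAsFactors = FALSE)
}))

# The same records as delimited text: the header always lists every column,
# even when every value in that column is empty.
txt <- paste(
  "occurrence_no,taxon_name,abund_value,abund_unit",
  "1,Dicellograptus,,",
  "2,Dicellograptus sp.,,",
  sep = "\n"
)
from_txt <- read.csv(text = txt, stringsAsFactors = FALSE)

# Columns present in the text-derived table but not in the JSON-derived one
setdiff(colnames(from_txt), colnames(from_json))
```

Under these assumptions, the columns that go missing are exactly the ones that are entirely empty, which would match the pattern seen with the real data.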
However, I can find no mention of this phenomenon in paleobioDB's manual or in the documentation for the Paleobiology Database's API. It's a little odd, because it means one could query for, say, abundance data (show=c("abund")) and not get back the abundance columns because they are empty. This could break code that expects variables to always be present in a given data download.
Here is an extremely contrived example:
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Reteograptus&show=abund&limit=all")
PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")
# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])
#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(is.na(x)))
## abund_value abund_unit
## TRUE TRUE
This may not be a solvable issue, as I realize it may stem from the behavior of the API. However, at the very least, this variable-dropping behavior should be discussed in the package's documentation to alert users.
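In the meantime, calling code can guard against absent columns itself. The following is a minimal sketch, not a proposed change to the package: `ensure_columns` is a hypothetical helper that is not part of paleobioDB, and the small data frame only stands in for a `pbdb_occurrences()` result that came back without the abundance columns.

```r
# Hypothetical helper (not part of paleobioDB): add any expected columns that
# are absent from `df`, filled with NA, so later code can rely on them.
ensure_columns <- function(df, expected) {
  missing_cols <- setdiff(expected, colnames(df))
  for (col in missing_cols) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df
}

# Stand-in for a query result that lacks the abundance columns
occs <- data.frame(occurrence_no = 1:3)
occs <- ensure_columns(occs, c("abund_value", "abund_unit"))
colnames(occs)
```

Columns that are already present are left untouched, so the helper is safe to call on every download regardless of which fields happened to be empty.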
Thanks for your detailed report... filed exactly four years ago.
In fact, the txt endpoint of the API returns the empty columns: http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all
while the json endpoint doesn't: http://paleobiodb.org/data1.1/occs/list.json?limit=all&base_name=Dicellograptus&show=ident&vocab=pbdb
So it is out of the control of this client. As you suggest, I am adding a short warning to the documentation of the pbdb_occurrences function as part of the PR for version 0.6.
Regards