paleobioDB data download functions seem to drop empty variables (unexpectedly?) #18

dwbapst · 2015-02-24T07:29:14Z

While comparing paleobioDB data downloads to direct calls to the PBDB API, I discovered that some paleobioDB functions could have quite different output from their corresponding calls with the raw API.

Here's an example with occurrence data for a graptolite genus:

#with the raw API
rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all")

#with paleobioDB package
library(paleobioDB)

## Loading required package: raster
## Loading required package: sp
## Loading required package: maps

PBDBpack<-pbdb_occurrences(limit="all", base_name="Dicellograptus", show=c("ident"), vocab="pbdb")

# same number of columns in both?
ncol(rawAPI)==ncol(PBDBpack)

## [1] FALSE

# Nope! Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])

## [1] "reid_no"       "superceded"    "subgenus_name" "subgenus_reso"

My hypothesis is that empty variables aren't being returned: either the PBDB API omits them from its JSON responses, or they are dropped when paleobioDB converts the JSON into a data table. I looked over the code in the package's repo for any evidence of variable dropping but couldn't find any.

# Are the missing columns just full of NAs?
sapply(missingCol,function(x) all(is.na(rawAPI[,x])))

##       reid_no    superceded subgenus_name subgenus_reso 
##          TRUE          TRUE          TRUE          TRUE

I can't find any mention of this behavior in paleobioDB's manual or in the documentation for the Paleobiology Database's API, however. It's a little odd, because it means one could query for, say, abundance data (show=c("abund")) and not get back the abundance columns because they are empty. This could create issues for code that expects variables to always be present in a given data download.

Here is an extremely contrived example:

rawAPI<-read.csv("http://paleobiodb.org/data1.1/occs/list.txt?base_name=Reteograptus&show=abund&limit=all")
PBDBpack<-pbdb_occurrences(limit="all", base_name="Reteograptus", show=c("abund"), vocab="pbdb")

# Which columns aren't in the paleobioDB output
(missingCol<-colnames(rawAPI)[sapply(colnames(rawAPI),function(x) !any(x==colnames(PBDBpack)))])

## [1] "reid_no"     "superceded"  "abund_value" "abund_unit"

#let's look at the abundance values from the rawAPI; are they all NAs?
apply(rawAPI[,c("abund_value","abund_unit")],2,function(x) all(is.na(x)))

## abund_value  abund_unit 
##        TRUE        TRUE

This may not be a solvable issue, as I realize it may stem from the behavior of the API. At the very least, though, this variable-dropping behavior should be described in the package's documentation to alert users.
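In the meantime, code that expects a fixed set of columns could pad the package output itself. Below is a minimal sketch of that workaround, using only base R. The helper name pad_missing_cols and the toy data frame are hypothetical and not part of paleobioDB; the column names are only illustrative.

```r
# Hypothetical helper (not part of paleobioDB): add any expected columns
# that are absent from a data frame, filled with NA.
pad_missing_cols <- function(df, expected) {
  missing <- setdiff(expected, colnames(df))
  for (col in missing) {
    df[[col]] <- rep(NA, nrow(df))
  }
  df
}

# Toy stand-in for a package download that lacks the abundance columns
toy <- data.frame(
  occurrence_no = c(1L, 2L),
  genus_name = c("Reteograptus", "Reteograptus"),
  stringsAsFactors = FALSE
)

# Existing columns are kept in place; missing ones are appended as all-NA
padded <- pad_missing_cols(toy, c("occurrence_no", "abund_value", "abund_unit"))
colnames(padded)
```

With the real objects from the examples above, the call would presumably look like pad_missing_cols(PBDBpack, colnames(rawAPI)), so that the package output has at least the columns of the raw download.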


javigzz · 2019-02-24T03:19:08Z

Hello @dwbapst,

Thanks for your detailed report... filed exactly four years ago.

In fact, the txt endpoint of the API returns the empty columns:
http://paleobiodb.org/data1.1/occs/list.txt?base_name=Dicellograptus&show=ident&limit=all
while the json endpoint doesn't:
http://paleobiodb.org/data1.1/occs/list.json?limit=all&base_name=Dicellograptus&show=ident&vocab=pbdb
So it is out of this client's control. As you suggest, I am adding a short warning to the documentation of the pbdb_occurrences function as part of the PR for version 0.6.
Regards
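
To illustrate why a client cannot recover such columns, here is a small base R sketch. It is not paleobioDB's actual parsing code; the toy records and field names are assumptions that stand in for a JSON response in which empty fields are simply absent.

```r
# Two toy records, as they might arrive from a JSON response in which
# empty fields are absent (field names are illustrative only).
records <- list(
  list(occurrence_no = 1L, genus_name = "Dicellograptus"),
  list(occurrence_no = 2L, genus_name = "Dicellograptus")
)

# Build a data frame from the union of the fields present in the records,
# filling NA where an individual record lacks a field.
fields <- unique(unlist(lapply(records, names)))
rows <- lapply(records, function(r) {
  vals <- lapply(fields, function(f) if (is.null(r[[f]])) NA else r[[f]])
  names(vals) <- fields
  as.data.frame(vals, stringsAsFactors = FALSE)
})
json_like <- do.call(rbind, rows)

# A field that is empty in every record (e.g. "reid_no") never appears in
# the input, so no column can be created for it.
"reid_no" %in% colnames(json_like)
```

A table built this way can only contain fields that occur in at least one record, which is consistent with the all-NA columns being absent from the package output.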

javigzz · 2019-02-24T03:41:53Z

Reported to the API devs: paleobiodb/data_service#45

javigzz closed this as completed Feb 24, 2019

javigzz mentioned this issue Feb 24, 2019

Empty columns are not returned in json view of pbdb_occurrences paleobiodb/data_service#45

Open
