Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arguments to assessments function cause error #31

Closed
TraciPopejoy opened this issue Mar 7, 2023 · 5 comments
Closed

Arguments to assessments function cause error #31

TraciPopejoy opened this issue Mar 7, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@TraciPopejoy
Copy link

TraciPopejoy commented Mar 7, 2023

Hello,

I am hoping to use this package to query the ATTAINS data for specific areas across the eastern United States. I think most of the data I'm interested in is in the use_assessment tibble produced by assessments(), but I'm having difficulty using the assessments() function with assessment_unit_id designated. I would also like to know EPA IR categories since ~2010, which I think means I'll need to run assessments() multiple times, as shown in your example tutorial (though at the state level).

I can only get the assessments() function to complete a query with an assessment_unit_id listed when both state_code and organization_id are also listed. Is this the expected function? I also run into an error when I try to specify any reporting cycle other than 2020 (which I think is the most recent one based on the default arguments). Adding any agency_code arguments doesn't affect the error.

assessments(assessment_unit_id='KY-1749', state_code='KY', organization_id='21KY') # works assessments(assessment_unit_id='KY-1749', state_code='KY', organization_id='21KY', reporting_cycle='2020') # works assessments(assessment_unit_id='KY-1749', state_code='KY', organization_id='21KY', reporting_cycle='2016') # does not work

The error I recieve is : "Error in dplyr::select():
! Can't subset columns that don't exist.
x Column agencyCode doesn't exist.
Run rlang::last_error() to see where the error occurred. "
I downloaded the CRAN version of the package this morning and am running R version 4.1.1 on Windows.

I appreciate your help with this issue and for writing this package! It is going to immensely help and speed up the process -- we have about 70 sites across 13 states. Thanks!
Traci

@mps9506
Copy link
Owner

mps9506 commented Mar 7, 2023

Hi,
Apologies for the error you encountered, I know that message is completely unhelpful. Sometime the ATTAINS API returns inconsistent results based on states or if there is no data. I'm struggling with how to handle that and provide actionable info for users. That said I dug into the results a little bit and I hope this helps:

We can use the tidy = FALSE argument to return the plain JSON data that is sent by the ATTAINS webservice whenever there is an error in parsing the data. The [tidyjson](https://github.com/colearendt/tidyjson) package provides functions to explore this data. When I used assessments(assessment_unit_id='KY-1749', state_code='KY', organization_id='21KY', reporting_cycle='2016', tidy = FALSE) and parsed it with tidyjson, I got zero results. Is it possible that the KY-1749 assessment unit was not assessed for the 303(d) in 2014?

Here is another way I checked. The following will query all the assessment results for 2020 in Kentucky as a JSON, and I can filter by assessment unit:

library(tidyjson)
library(dplyr)

df_2020 <- assessments(state_code='KY', organization_id='21KY', reporting_cycle='2020', tidy = FALSE)


df_2020 |> 
  enter_object("items") |> 
  gather_array() |> 
  spread_all() |> 
  select(-c("array.index", "document.id")) |> 
  enter_object("assessments") |> 
  gather_array() |> 
  spread_all(recursive = TRUE) |> 
  filter(assessmentUnitIdentifier == 'KY-1749')

That returned a tibble with one row with the assessment results for the KY-1749 assessment unit:

# A tbl_json: 1 x 15 tibble with a "JSON" attribute
  ..JSON       organ…¹ organ…² organ…³ repor…⁴ repor…⁵ array…⁶ asses…⁷ agenc…⁸ troph…⁹ ratio…˟ epaIR…˟ overa…˟ cycle…˟ yearL…˟
  <chr>        <chr>   <chr>   <chr>   <chr>   <chr>     <int> <chr>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>   <chr>  
1 "{\"assessm… 21KY    Kentuc… State   2020    EPA Fi…     806 KY-1749 S       <NA>    NA      5       Not Su… 2010    ""     
# … with abbreviated variable names ¹​organizationIdentifier, ²​organizationName, ³​organizationTypeText, ⁴​reportingCycleText,
#   ⁵​reportStatusCode, ⁶​array.index, ⁷​assessmentUnitIdentifier, ⁸​agencyCode, ⁹​trophicStatusCode, ˟​rationaleText,
#   ˟​epaIRCategory, ˟​overallStatus, ˟​cycleLastAssessedText, ˟​yearLastMonitoredText

Repeat this for 2014:

df_2014 <- assessments(state_code='KY', organization_id='21KY', reporting_cycle='2014', tidy = FALSE)

df_2014 |> 
  enter_object("items") |> 
  gather_array() |> 
  spread_all() |> 
  select(-c("array.index", "document.id")) |> 
  enter_object("assessments") |> 
  gather_array() |> 
  spread_all(recursive = TRUE) |> 
  filter(assessmentUnitIdentifier == 'KY-1749')

This returned a zero row tibble, indicating no results for that assessment unit. I am obviously not familiar with the particulars of that site, but it seems it was either not assessed by the state or not included in the data uploaded by the state.

alternative appraoch

I don't know your workflow, but it might work better to just download the assessment data for all the assessment units once and filter it locally instead of using the web api to filter it.

df_2020_tidy <- assessments(state_code='KY', organization_id='21KY', reporting_cycle='2020', tidy = TRUE)
df_2020_tidy$use_assessment |> filter(assessment_unit_identifier == "KY-1749")
# A tibble: 5 × 29
  organizati…¹ organ…² organ…³ repor…⁴ repor…⁵ asses…⁶ troph…⁷ ratio…⁸ epa_i…⁹ overa…˟ cycle…˟ year_…˟ sourc…˟ sourc…˟ assoc…˟
  <chr>        <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>  
1 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      NA      NA      NA     
2 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      NA      NA      NA     
3 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      NA      NA      NA     
4 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      NA      NA      NA     
5 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      NA      NA      NA     
# … with 14 more variables: use_name <chr>, use_attainment_code <chr>, threatened_indicator <chr>, trend_code <lgl>,
#   agency_code <chr>, assessment_metadata <lgl>, use_attainment_code_name <chr>,
#   assessment_metadata_assessment_basis_code <lgl>, assessment_metadata_assessment_types <lgl>,
#   assessment_metadata_monitoring_activity <lgl>, assessment_metadata_assessment_activity_assessment_date <chr>,
#   assessment_metadata_assessment_activity_assessor_name <lgl>, assessment_type_code <chr>,
#   assessment_confidence_code <chr>, and abbreviated variable names ¹​organization_identifier, ²​organization_name,
#   ³​organization_type_text, ⁴​reporting_cycle_text, ⁵​report_status_code, ⁶​assessment_unit_identifier, ⁷​trophic_status_code, …
# ℹ Use `colnames()` to see all variable names

and

df_2020_tidy$parameter_assessment |> filter(assessmentUnitIdentifier == "KY-1749")
## I just noticied I missed cleaning the column names here, sorry!!!
# A tibble: 7 × 28
  organizati…¹ organ…² organ…³ repor…⁴ repor…⁵ asses…⁶ troph…⁷ ratio…⁸ epaIR…⁹ overa…˟ cycle…˟ yearL…˟ param…˟ param…˟ pollu…˟
  <chr>        <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   FECAL … Y      
2 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   CHLORI… Y      
3 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   HABITA… N      
4 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   NUTRIE… Y      
5 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   ORGANI… Y      
6 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   SEDIME… Y      
7 21KY         Kentuc… State   2020    EPA Fi… KY-1749 NA      NA      5       Not Su… 2010    ""      Cause   TOTAL … Y      
# … with 13 more variables: impairedWatersInformation.category4BInformation <lgl>,
#   impairedWatersInformation.listingInformation.agencyCode <chr>,
#   impairedWatersInformation.listingInformation.cycleFirstListedText <chr>,
#   impairedWatersInformation.listingInformation.cycleScheduledForTMDLText <chr>,
#   impairedWatersInformation.listingInformation.CWA303dPriorityRankingText <chr>,
#   impairedWatersInformation.listingInformation.consentDecreeCycleText <chr>,
#   impairedWatersInformation.listingInformation.alternateListingIdentifier <lgl>, associatedUseName <chr>, …
# ℹ Use `colnames()` to see all variable names

BUT when I went back to 2014 and 2012 with this approach, I got an error....

## Of course this returns an error
df_2014_tidy <- assessments(state_code='KY', organization_id='21KY', reporting_cycle='2014', tidy = TRUE)

So we can write a little function to tidy up the raw json instead:

## make sure we have the packages we need:
library(tidyjson)
library(dplyr)
library(tidyr)
library(purrr)

tidy_assessment_data <- function(content) {
  ## return documents
  content %>%
    enter_object("items") %>%
    gather_array() %>%
    spread_all() %>%
    select(-c("array.index", "document.id")) %>%
    enter_object("documents") %>%
    gather_array() %>%
    spread_all(recursive = TRUE) %>%
    select(-"array.index") %>%
    as_tibble() -> content_docs
  
  ## return use assessment data
  content %>%
    enter_object("items") %>%
    gather_array() %>%
    spread_all() %>%
    select(-c("array.index", "document.id")) %>%
    enter_object("assessments") %>%
    gather_array() %>%
    spread_all(recursive = TRUE) %>%
    select(-c("array.index", "agencyCode")) %>%
    mutate(
      probableSources = map(.data$..JSON, ~{
        .x[["probableSources"]] %>% {
          tibble(
            sourceName = map_chr(., "sourceName"),
            sourceConfirmedIndicator = map_chr(., "sourceConfirmedIndicator"),
            associatedCauseName = map(., ~{
              .x[["associatedCauseNames"]] %>% {
                tibble(
                  causeName = map_chr(., "causeName")
                )}
            })) %>%
            unnest("associatedCauseName", keep_empty = TRUE)
        }})
    ) %>%
    tibble::as_tibble() %>%
    janitor::clean_names()-> content_use_assessments
  
  
  ## return parameter assessment data
  content %>%
    enter_object("items") %>%
    gather_array() %>%
    spread_all() %>%
    select(-c("array.index", "document.id")) %>%
    enter_object("assessments") %>%
    gather_array() %>%
    spread_all(recursive = TRUE) %>%
    select(-c("array.index", "agencyCode")) %>%
    enter_object("parameters") %>%
    gather_array() %>%
    spread_all(recursive = TRUE) %>%
    select(-c("array.index")) %>%
    enter_object("associatedUses") %>%
    gather_array() %>%
    select(-"array.index") %>%
    spread_all(recursive = TRUE) %>%
    mutate(seasons = map(.data$..JSON, ~{
      .x[["seasons"]] %>% {
        tibble(
          seasonStartText = map_chr(., "seasonStartText"),
          seasonEndText = map_chr(., "seasonEndText")
        )
      }
    })) %>%
    unnest("seasons", keep_empty = TRUE) -> content_parameter_assessments
  
  return(list(documents = content_docs,
              use_assessment = content_use_assessments,
              parameter_assessment = content_parameter_assessments))
}

## now make the query again but return the raw json
df_2014_raw <- assessments(state_code='KY', organization_id='21KY', reporting_cycle='2014', tidy = FALSE)

## and tidy it
df_2014_tidy <- tidy_assessment_data(df_2014_raw)
df_2014_tidy
$documents
# A tibble: 0 × 5
# … with 5 variables: organizationIdentifier <chr>, organizationName <chr>, organizationTypeText <chr>,
#   reportingCycleText <chr>, reportStatusCode <chr>
# ℹ Use `colnames()` to see all variable names

$use_assessment
# A tibble: 2,572 × 13
   organization_identifier organiza…¹ organ…² repor…³ repor…⁴ asses…⁵ troph…⁶ ratio…⁷ epa_i…⁸ overa…⁹ cycle…˟ year_…˟ probab…˟
   <chr>                   <chr>      <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>   <chr>   <list>  
 1 21KY                    Kentucky   State   2014    Histor… KY0065… Unknown NA      2       Fully … 2008    ""      <tibble>
 2 21KY                    Kentucky   State   2014    Histor… KY0066… Unknown NA      2       Fully … 2008    ""      <tibble>
 3 21KY                    Kentucky   State   2014    Histor… KY0068… Unknown NA      2       Fully … 2009    ""      <tibble>
 4 21KY                    Kentucky   State   2014    Histor… KY1269… NA      NA      5       Not Su… 2012    ""      <tibble>
 5 21KY                    Kentucky   State   2014    Histor… KY1269… NA      NA      5       Not Su… 2012    ""      <tibble>
 6 21KY                    Kentucky   State   2014    Histor… KY1269… NA      NA      2       Fully … 2003    ""      <tibble>
 7 21KY                    Kentucky   State   2014    Histor… KY1270… NA      NA      5       Not Su… 2012    ""      <tibble>
 8 21KY                    Kentucky   State   2014    Histor… KY1278… NA      NA      2       Fully … 2001    ""      <tibble>
 9 21KY                    Kentucky   State   2014    Histor… KY1291… NA      NA      2       Fully … 2003    ""      <tibble>
10 21KY                    Kentucky   State   2014    Histor… KY1305… NA      NA      2       Fully … 2007    ""      <tibble>
# … with 2,562 more rows, and abbreviated variable names ¹​organization_name, ²​organization_type_text, ³​reporting_cycle_text,
#   ⁴​report_status_code, ⁵​assessment_unit_identifier, ⁶​trophic_status_code, ⁷​rationale_text, ⁸​epa_ir_category,
#   ⁹​overall_status, ˟​cycle_last_assessed_text, ˟​year_last_monitored_text, ˟​probable_sources
# ℹ Use `print(n = ...)` to see more rows

$parameter_assessment
# A tibble: 4,066 × 28
   organizat…¹ organ…² organ…³ repor…⁴ repor…⁵ asses…⁶ troph…⁷ ratio…⁸ epaIR…⁹ overa…˟ cycle…˟ yearL…˟ param…˟ param…˟ pollu…˟
   <chr>       <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 21KY        Kentuc… State   2014    Histor… KY0066… Unknown NA      2       Fully … 2008    ""      Observ… MERCUR… NA     
 2 21KY        Kentuc… State   2014    Histor… KY0068… Unknown NA      2       Fully … 2009    ""      Observ… MERCUR… NA     
 3 21KY        Kentuc… State   2014    Histor… KY1269… NA      NA      5       Not Su… 2012    ""      Cause   CAUSE … Y      
 4 21KY        Kentuc… State   2014    Histor… KY1269… NA      NA      5       Not Su… 2012    ""      Cause   ESCHER… Y      
 5 21KY        Kentuc… State   2014    Histor… KY1270… NA      NA      5       Not Su… 2012    ""      Cause   CAUSE … Y      
 6 21KY        Kentuc… State   2014    Histor… KY1548… NA      NA      5       Not Su… 2003    ""      Cause   POLYCH… Y      
 7 21KY        Kentuc… State   2014    Histor… KY2355… NA      NA      5       Not Su… 1998    ""      Cause   NUTRIE… Y      
 8 21KY        Kentuc… State   2014    Histor… KY2355… NA      NA      5       Not Su… 1998    ""      Cause   FECAL … Y      
 9 21KY        Kentuc… State   2014    Histor… KY2543… NA      NA      5       Not Su… 2009    ""      Cause   SEDIME… Y      
10 21KY        Kentuc… State   2014    Histor… KY2564… Unknown NA      5       Not Su… 2011    ""      Cause   DISSOL… Y      
# … with 4,056 more rows, 13 more variables: impairedWatersInformation.category4BInformation <lgl>,
#   impairedWatersInformation.listingInformation.agencyCode <chr>,
#   impairedWatersInformation.listingInformation.cycleFirstListedText <chr>,
#   impairedWatersInformation.listingInformation.cycleScheduledForTMDLText <chr>,
#   impairedWatersInformation.listingInformation.CWA303dPriorityRankingText <lgl>,
#   impairedWatersInformation.listingInformation.consentDecreeCycleText <chr>,
#   impairedWatersInformation.listingInformation.alternateListingIdentifier <chr>, associatedUseName <chr>, …
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

This is pretty close to what the other functions return. I don't have any idea why it isn't returning a 303(d) document for 2014 though.

@mps9506 mps9506 added the bug Something isn't working label Mar 7, 2023
@TraciPopejoy
Copy link
Author

Thank you so much for your quick and thorough response!! I think you are right --- downloading all assessments and then filtering in R might be the best approach since it reduces the number of queries to the API. Thank you for providing a function to clean up the raw JSON results. I am adding your name to the acknowledgments and look forward to citing this package in our manuscript!

@mps9506
Copy link
Owner

mps9506 commented Mar 8, 2023

Thanks! Look forward to seeing the paper! Feel free to reach out if there are any other issues, this package is definitely a work in progress.

@mps9506
Copy link
Owner

mps9506 commented Mar 11, 2023

@TraciPopejoy you may want to try the development version in the dev-flatten branch:

remotes::install_github("mps9506/rATTAINS", ref = "dev-flatten")

I is going to be a while before I can upload to CRAN because I'm waiting on the tibblify package this release depends on to mature. Hopefully this returns more consistent data structures across all queries to EPA ATTAINS.

assessments(assessment_unit_id='KY-1749', state_code='KY', organization_id='21KY', reporting_cycle='2016')
$documents
# A tibble: 8 × 13
  organizationIdentifier organizatio…¹ organ…² repor…³ repor…⁴ agenc…⁵ docum…⁶ docum…⁷ docum…⁸ docum…⁹ docum…˟ docum…˟ docum…˟
  <chr>                  <chr>         <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 21KY                   Kentucky      State   2016    EPA Fi… S       Other … applic… KY 201… KY 201… NA      NA      https:…
2 21KY                   Kentucky      State   2016    EPA Fi… S       Integr… applic… 2016 I… 2016 I… NA      Respon… https:…
3 21KY                   Kentucky      State   2016    EPA Fi… S       Assess… applic… Kentuc… Kentuc… NA      NA      https:…
4 21KY                   Kentucky      State   2016    EPA Fi… E       EPA De… applic… Kentuc… Kentuc… NA      NA      https:…
5 21KY                   Kentucky      State   2016    EPA Fi… S       Other … applic… Progra… Progra… NA      NA      https:…
6 21KY                   Kentucky      State   2016    EPA Fi… S       Other … applic… 2016IR… 2016 3… NA      Includ… https:…
7 21KY                   Kentucky      State   2016    EPA Fi… S       Other … applic… 2016IR… 2016 3… NA      NA      https:…
8 21KY                   Kentucky      State   2016    EPA Fi… S       Other … applic… kdow_3… 2016 3… NA      NA      https:…
# … with abbreviated variable names ¹​organizationName, ²​organizationTypeText, ³​reportingCycleText, ⁴​reportStatusCode,
#   ⁵​agencyCode, ⁶​documentTypeCode, ⁷​documentFileType, ⁸​documentFileName, ⁹​documentName, ˟​documentDescription,
#   ˟​documentComments, ˟​documentURL

$use_assessment
# A tibble: 1 × 17
  organizati…¹ organ…² organ…³ repor…⁴ repor…⁵ asses…⁶ agenc…⁷ troph…⁸ useAt…⁹ param…˟ proba…˟ docum…˟ ratio…˟ epaIR…˟ overa…˟
  <chr>        <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <list<> <list<> <list<> <list<> <chr>   <chr>   <chr>  
1 21KY         Kentuc… State   2016    EPA Fi… NA      NA      NA                                      NA      NA      NA     
# … with 2 more variables: cycleLastAssessedText <chr>, yearLastMonitoredText <chr>, and abbreviated variable names
#   ¹​organizationIdentifier, ²​organizationName, ³​organizationTypeText, ⁴​reportingCycleText, ⁵​reportStatusCode,
#   ⁶​assessmentUnitIdentifier, ⁷​agencyCode, ⁸​trophicStatusCode, ⁹​useAttainments, ˟​parameters, ˟​probableSources, ˟​documents,
#   ˟​rationaleText, ˟​epaIRCategory, ˟​overallStatus
# ℹ Use `colnames()` to see all variable names

$delisted_waters
# A tibble: 1 × 10
  organizationIdentifier organizationName organizationTypeText reportingCycl…¹ repor…² asses…³ cause…⁴ agenc…⁵ delis…⁶ delis…⁷
  <chr>                  <chr>            <chr>                <chr>           <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 21KY                   Kentucky         State                2016            EPA Fi… NA      NA      NA      NA      NA     
# … with abbreviated variable names ¹​reportingCycleText, ²​reportStatusCode, ³​assessmentUnitIdentifier, ⁴​causeName,
#   ⁵​agencyCode, ⁶​delistingReasonCode, ⁷​delistingCommentText

@mps9506 mps9506 closed this as completed Mar 11, 2023
@TraciPopejoy
Copy link
Author

Thank you so much for this update!! It works perfectly and has made my life a lot easier. I really appreciate all the hard work you've put into this r package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants