Sporadic (file size?)-related 404 errors from ERDDAP datasets when running IOOS 1.2 test #804

Open
mwengren opened this issue May 7, 2020 · 6 comments
Labels
IOOS:1.2 Issues relating to the IOOS Metadata Profile v1.2

Comments

@mwengren
Member

mwengren commented May 7, 2020

Since the recent change to download the .ncCF file format from ERDDAP URLs, I occasionally get HTTP 404 responses, particularly for large datasets. Here's a PacIOOS example:

$ compliance-checker --version
IOOS compliance checker version 4.3.3rc2+46.g3dfe1f8
$ compliance-checker -t ioos -f html https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04 > WQB-04-ioos-rc2.46.test.html
Running Compliance Checker on the datasets from: ['https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04']
syntax error, unexpected $end, expecting ';'
context: Error { code=404; message="Not Found: Currently unknown datasetID=WQB-04.ncCF";}^
Traceback (most recent call last):
  File "/home/mwengren/miniconda3/envs/cc-test-dev/bin/compliance-checker", line 11, in <module>
    load_entry_point('compliance-checker', 'console_scripts', 'compliance-checker')()
  File "/home/mwengren/workspace/code/git/ioos/compliance-checker/cchecker.py", line 276, in main
    return_value, errors = ComplianceChecker.run_checker(
  File "/home/mwengren/workspace/code/git/ioos/compliance-checker/compliance_checker/runner.py", line 76, in run_checker
    ds = cs.load_dataset(loc)
  File "/home/mwengren/workspace/code/git/ioos/compliance-checker/compliance_checker/suite.py", line 759, in load_dataset
    return self.load_remote_dataset(ds_str)
  File "/home/mwengren/workspace/code/git/ioos/compliance-checker/compliance_checker/suite.py", line 788, in load_remote_dataset
    return Dataset(ds_str)
  File "netCDF4/_netCDF4.pyx", line 2321, in netCDF4._netCDF4.Dataset.__init__
  File "netCDF4/_netCDF4.pyx", line 1885, in netCDF4._netCDF4._ensure_nc_success
OSError: [Errno -90] NetCDF: file not found: b'https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF?time%2Clatitude%2Clongitude%2Cdepth%2Cstation_name%2Ctemperature%2Csalinity%2Cturbidity%2Cchlorophyll%2Coxygen%2Coxygen_saturation%2Cph%2Ctemperature_raw%2Ctemperature_dm_qd%2Ctemperature_qc_agg%2Ctemperature_qc_gap%2Ctemperature_qc_syn%2Ctemperature_qc_loc%2Ctemperature_qc_rng%2Ctemperature_qc_clm%2Ctemperature_qc_spk%2Ctemperature_qc_rtc%2Ctemperature_qc_flt%2Ctemperature_qc_mvr%2Ctemperature_qc_atn%2Ctemperature_qc_nbr%2Ctemperature_qc_crv%2Ctemperature_qc_din%2Csalinity_raw%2Csalinity_dm_qd%2Csalinity_qc_agg%2Csalinity_qc_gap%2Csalinity_qc_syn%2Csalinity_qc_loc%2Csalinity_qc_rng%2Csalinity_qc_clm%2Csalinity_qc_spk%2Csalinity_qc_rtc%2Csalinity_qc_flt%2Csalinity_qc_mvr%2Csalinity_qc_atn%2Csalinity_qc_nbr%2Csalinity_qc_crv%2Csalinity_qc_din%2Cturbidity_raw%2Cturbidity_dm_qd%2Cturbidity_qc_agg%2Cturbidity_qc_gap%2Cturbidity_qc_syn%2Cturbidity_qc_loc%2Cturbidity_qc_rng%2Cturbidity_qc_clm%2Cturbidity_qc_spk%2Cturbidity_qc_rtc%2Cturbidity_qc_flt%2Cturbidity_qc_mvr%2Cturbidity_qc_atn%2Cturbidity_qc_nbr%2Cchlorophyll_raw%2Cchlorophyll_dm_qd%2Cchlorophyll_qc_agg%2Cchlorophyll_qc_gap%2Cchlorophyll_qc_syn%2Cchlorophyll_qc_loc%2Cchlorophyll_qc_rng%2Cchlorophyll_qc_clm%2Cchlorophyll_qc_spk%2Cchlorophyll_qc_rtc%2Cchlorophyll_qc_flt%2Cchlorophyll_qc_mvr%2Cchlorophyll_qc_atn%2Cchlorophyll_qc_nbr%2Coxygen_raw%2Coxygen_dm_qd%2Coxygen_qc_agg%2Coxygen_qc_gap%2Coxygen_qc_syn%2Coxygen_qc_loc%2Coxygen_qc_rng%2Coxygen_qc_clm%2Coxygen_qc_spk%2Coxygen_qc_rtc%2Coxygen_qc_flt%2Coxygen_qc_mvr%2Coxygen_qc_atn%2Coxygen_qc_nbr%2Coxygen_saturation_raw%2Coxygen_saturation_dm_qd%2Coxygen_saturation_qc_agg%2Coxygen_saturation_qc_gap%2Coxygen_saturation_qc_syn%2Coxygen_saturation_qc_loc%2Coxygen_saturation_qc_rng%2Coxygen_saturation_qc_clm%2Coxygen_saturation_qc_spk%2Coxygen_saturation_qc_rtc%2Coxygen_saturation_qc_flt%2Coxygen_saturation_qc_mvr%2Coxygen_saturation_qc_atn%2Coxygen_saturation_qc_nbr%2Cph_raw%2Cph_dm_qd%2Cph_qc_agg%2Cph_qc_gap%2Cph_qc_syn%2Cph_qc_loc%2Cph_qc_rng%2Cph_qc_clm%2Cph_qc_spk%2Cph_qc_rtc%2Cph_qc_flt%2Cph_qc_mvr%2Cph_qc_atn%2Cph_qc_nbr%2Cplatform1%2Cinstrument1%2Ccrs'

This happens for other large ERDDAP datasets as well.

Can we still run all of our IOOS 1.2 checks if, instead of requesting the full dataset, we filter by a recent slice of the time dimension?

ERDDAP includes the max() server-side function for this purpose. To request the most recent 3 days of data:

https://ferret.pmel.noaa.gov/generic/erddap/tabledap/sailbuoy_4803921.htmlTable?wmo_platform_code%2Ctime%2Clatitude%2Clongitude%2CTemperature&time%3E=max(time)-3days

Request the most recent time slice:

http://erddap.sensors.ioos.us/erddap/tabledap/ssbn7-sun2wave-sun2w-sunset-n.htmlTable?time&time=max(time)

or, the same for this particular PacIOOS dataset:

https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04.ncCF?time%2Clatitude%2Clongitude%2Cdepth%2Cstation_name%2Ctemperature%2Csalinity%2Cturbidity%2Cchlorophyll%2Coxygen%2Coxygen_saturation&time=max(time)

This is something we'll probably need to fix before the next RC is put out. I'm not sure whether it's best to grab the most recent slice or to use something like max(time)-7days, or which of these will be least brittle. Thoughts?
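To make the max(time)-7days idea concrete, here is a minimal sketch of how a time-limited request URL could be assembled from a root dataset URL. The function name, its parameters, and the 7-day default are hypothetical and not part of compliance-checker; the encoding simply mirrors the example URLs above.

```python
from urllib.parse import quote


def build_subset_url(dataset_url, variables, days=7, file_type=".ncCF"):
    """Build a tabledap URL restricted to the most recent `days` days.

    Hypothetical helper for illustration only; it assumes the dataset
    has a variable named `time`.
    """
    base = dataset_url.rstrip("/")
    # Percent-encode the comma-separated variable list (commas become %2C).
    var_list = quote(",".join(variables), safe="")
    # Encode only the '>' of the constraint, as in the example URLs.
    constraint = quote(f"time>=max(time)-{days}days", safe="=()-")
    return f"{base}{file_type}?{var_list}&{constraint}"


if __name__ == "__main__":
    print(
        build_subset_url(
            "https://pae-paha.pacioos.hawaii.edu/erddap/tabledap/WQB-04",
            ["time", "latitude", "longitude", "temperature"],
        )
    )
```

Whether 7 days is a sensible window would still need to be checked against real datasets.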

@daltonkell @benjwadams

@benjwadams
Contributor

benjwadams commented May 7, 2020

You should be able to use the .ncCF URL at the end of your comment, since it returns a Content-Type: application/x-netcdf header, which allows the file to be read in directly. The problem with using max(time) by default on an ERDDAP endpoint with no format specified is that there is no guarantee the dataset has a time variable, so we might have to introspect the .das response or take some similar approach. If you know the variable is there, it's fine; if it's not, the request will fail on ERDDAP's side with a 500 Internal Server Error, which is arguably worse than occasionally timing out or returning a 404.
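As a rough illustration of the .das introspection idea, the sketch below scans DAS text for a block named time before any max(time) constraint is attempted. The helper names and the sample DAS text are made up for illustration, and the line-based parsing assumes each block opens on its own line as `name {`; it is not how compliance-checker does it.

```python
import re

# Matches a line that opens a DAS block, e.g. "  time {".
_BLOCK_OPEN = re.compile(r"^\s*(\w+)\s*\{\s*$")

# Invented sample, shaped like a tabledap .das response (assumption).
SAMPLE_DAS = """Attributes {
 s {
  time {
    String standard_name "time";
    String units "seconds since 1970-01-01T00:00:00Z";
  }
  temperature {
    String units "degree_Celsius";
  }
 }
  NC_GLOBAL {
    String cdm_data_type "TimeSeries";
  }
}
"""


def das_block_names(das_text):
    """Return the names of all blocks opened in the DAS text."""
    names = []
    for line in das_text.splitlines():
        match = _BLOCK_OPEN.match(line)
        # Skip the outermost "Attributes" wrapper.
        if match and match.group(1) != "Attributes":
            names.append(match.group(1))
    return names


def has_time_variable(das_text):
    """True if some block in the DAS text is named exactly `time`."""
    return "time" in das_block_names(das_text)


if __name__ == "__main__":
    print(has_time_variable(SAMPLE_DAS))
```

In practice the text would come from the dataset's .das URL, and a more careful check might also look at attributes such as standard_name.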

@mwengren
Member Author

mwengren commented May 8, 2020

@benjwadams For ease-of-use, I suspect users will want to provide the root ERDDAP dataset URLs without extensions/query parameters in most cases (I know I will). If it works with user-provided extensions/query params, that's great and a good workaround, but I'd still like to find a solution for the simplest use case that works.

What would be the best choice for filtering with a root dataset URL?

Good point about not depending on the presence of a time dimension. Also, I don't think my suggestion of max(time) will work: with only one time slice returned, it might affect how the 'Platform' dimension is represented in the output. It would have to be something like max(time)-7days or a similar time-range query.

Options:

  1. Try/except with a max(time)-7days parameter or similar, with a fallback to requesting the full dataset
  2. Introspect the .das response as you suggested, and then do the above?
  3. ...?
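Option 1 could look roughly like the sketch below: try the time-limited request first and fall back to the full dataset if it fails. The function and parameter names are hypothetical; `open_dataset` stands in for whatever actually opens the URL (for example `netCDF4.Dataset`, which raised the OSError in the traceback above) and is passed in so the sketch runs without a network connection.

```python
def load_with_fallback(dataset_url, open_dataset, subset_query, file_type=".ncCF"):
    """Try a subset request first; fall back to the full dataset.

    Hypothetical helper: `open_dataset` is any callable that takes a URL
    and raises OSError on failure.
    """
    base = dataset_url.rstrip("/") + file_type
    try:
        return open_dataset(f"{base}?{subset_query}")
    except OSError:
        # Subset request failed (e.g. no time variable); request everything.
        return open_dataset(base)


if __name__ == "__main__":
    def fake_open(url):
        # Stand-in opener that rejects any URL carrying a query string.
        if "?" in url:
            raise OSError("simulated server error")
        return f"opened {url}"

    print(
        load_with_fallback(
            "https://example.org/erddap/tabledap/demo",
            fake_open,
            "time&time%3E=max(time)-7days",
        )
    )
```

One drawback of this ordering is that a dataset without a time variable costs an extra failed request before the full download starts, which is part of why checking .das first may be preferable.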

Some of the datasets I tested with were over 100 MB in size when downloaded in full, and I'm sure there are much larger ones. We need a sane default subset approach to make this option viable.

@benjwadams
Contributor

Personally, I would probably go for option 2, possibly combined with some of the axis-finding logic already present in compliance-checker.

@mwengren
Member Author

mwengren commented May 8, 2020

Ok, works for me. Please go ahead with that approach at your earliest convenience.

@benjwadams
Contributor

Related: #807

@mwengren mwengren added the IOOS:1.2 Issues relating to the IOOS Metadata Profile v1.2 label Jun 30, 2020
@daltonkell
Contributor

@mwengren @benjwadams

After several profiling runs for the WQB-04 dataset, I'm still perplexed as to why sporadic 404s occur. It may be due to a deeper interaction between the Python URL library and the ERDDAP server, so more digging is needed.
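One way to dig further might be to request the same URL repeatedly and tally the HTTP status codes, to see how often the 404 actually appears outside of netCDF4. The sketch below is only an assumption about how such a probe could look; the function names are hypothetical and it uses the standard library's urllib rather than whatever the checker uses internally.

```python
import urllib.error
import urllib.request
from collections import Counter


def fetch_status(url, timeout=60):
    """Return the HTTP status code for a single GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still useful.
        return err.code


def tally_statuses(url, attempts=5, fetch=fetch_status):
    """Request `url` several times and count each status code seen."""
    return Counter(fetch(url) for _ in range(attempts))


if __name__ == "__main__":
    # Stand-in fetcher so the example runs offline.
    canned = iter([200, 404, 200])
    print(tally_statuses("https://example.org/demo.ncCF", attempts=3,
                         fetch=lambda _url: next(canned)))
```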

@daltonkell daltonkell added this to the Release 4.3.4 Milestone milestone Jul 20, 2020