Integrating Water Quality Portal (WQP) as new catalog #8
Comments
@jkreft-usgs I'm moving the conversation over to this issue that's more general about wqp access, rather than pywqp proper. Basically in this issue I'm capturing notes and exchanges regarding wqp access & ingest. And here is your comment from NWQMC/pywqp#6 (comment) (thanks for your offer to help!):
I assume that means that the web service gives access to all elements of the wqx xml, but in simpler forms (csv/json)? Sorry, I haven't read the wqp web service doc page yet ... Pasting my own last comment from that other issue:
cc @aufdenkampe so he's aware of this exchange and notes |
Yes, as I mentioned in NWQMC/pywqp#6 (comment), we want something that is very performant, so I like the idea of avoiding XML. Do you just use JSON or CSV responses, or is there a more direct integration with Pandas data frames? |
I'll drop the same comment over here instead: The fastest way to get data into a Pandas dataframe would be to stream the TSV into pandas, which is the approach we take to populate a Redis cache quickly with a reasonable memory footprint. Something to note is that it is good to pay attention to the total counts that come back in the HTTP headers, since that is a decent basic checksum to ensure that all the rows you were expecting to come across the wire did in fact come across the wire. |
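A minimal sketch of the row-count check described above, assuming the response body is tab-separated with one column-header row and that the expected total arrives in a response header. The header name `Total-Site-Count`, the helper names, and the sample rows are assumptions for illustration; check the headers your own request actually returns.

```python
import csv
import io


def count_streamed_rows(lines):
    """Count data rows in a TSV stream (an iterable of text lines) without keeping them."""
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the column-header row
    return sum(1 for row in reader if row)  # ignore blank lines


def counts_match(headers, n_rows, header_name="Total-Site-Count"):
    """Compare the rows received with the total reported in the HTTP headers.

    Returns None when the header is absent, so the caller can decide what to do.
    """
    expected = headers.get(header_name)
    if expected is None:
        return None
    return int(expected) == n_rows


# Stand-in for a streamed response body and its headers (made-up values).
body = io.StringIO("MonitoringLocationIdentifier\tLatitudeMeasure\nA-1\t40.1\nA-2\t40.2\n")
n = count_streamed_rows(body)
print(n, counts_match({"Total-Site-Count": "2"}, n))
```

With a real response, the same file-like stream could instead be handed to `pandas.read_csv(stream, sep="\t", chunksize=...)`, so that only one chunk is held in memory at a time while the rows are still counted.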
The other thing to note: please do not spin up a multi-threaded download. It is not at all hard to DOS the WQP if you are trying to get a ton of data all at once. |
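One way to honor that request is to issue requests strictly one at a time, with a short pause between them. This is only a sketch: `fetch_sequentially`, the pause length, and the example HUC codes are hypothetical, and `fetch` stands in for whatever function performs a single WQP request.

```python
import time


def fetch_sequentially(requests_to_make, fetch, pause_seconds=1.0):
    """Run the given requests one at a time, pausing between them.

    `fetch` is any callable that performs one request; no threads or
    concurrent connections are used.
    """
    results = []
    for i, item in enumerate(requests_to_make):
        if i > 0:
            time.sleep(pause_seconds)  # be gentle with the service
        results.append(fetch(item))
    return results


# Stand-in fetcher so the sketch runs without network access.
fetched = fetch_sequentially(
    ["02040205", "02040206"],  # hypothetical HUC codes
    lambda huc: f"sites for {huc}",
    pause_seconds=0.0,
)
print(fetched)
```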
We'll keep that in mind. Thanks also for the additional info about |
You might look into some tooling that (bizarrely enough) an old roommate of mine who is now working at Anaconda is doing around streaming tabular data into Pandas: http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1 |
I've put together a simple Jupyter notebook demonstrating GET requests for the WQP. For convenience, comparison and coherence, I reused the 1° x 1° AOI box scheme from #10. @aufdenkampe, you may be tickled to see that there's Stroud data in this PA/NJ/DE AOI, in the WQP! |
I forgot to ping @lsetiawan |
One more thing, mainly geared to @lsetiawan for now, but ultimately for discussion later with the rest of you (@aufdenkampe and @jkreft-usgs): The WQP APIs are like the CUAHSI HIS catalog APIs (and unlike the CINERGI and HydroShare APIs) in that, as far as I can tell, they don't have the capability to search on free text. They have a bunch of vocabularies ("domains") that can be searched on, but using them in the search would run us into the same kind of problems as with the CUAHSI API. So, the initial implementation that @lsetiawan and I are working on will ignore the text entered into the search box on the Wikiwatershed App. The parameters issued to the |
You are correct that WQP does not have a free text search. A couple of recommendations/ideas/requests:
|
Thanks @jkreft-usgs. We'll make sure to use Regarding the geospatial query requests, I can tell you that all requests by the application will be geospatial queries! However, it'll be quite a while (2 months?) before we're ready to go live, so before that all queries will be simply during tests and development. We'll keep you informed every step of the way, as we make significant progress and have more questions. Finally: @lsetiawan thanks for sharing your progress via this issue!! That's really awesome. I'm on vacation for the next week, so I won't be commenting for a while. |
ok, so there is no chance for the app to be hydrologically aware and use something like HUC instead? |
I think that this is possible since the app is able to use HUC boundaries to do its modeling. So if somehow the frontend can spit out those HUC ID to the backend, we should be able to query WQP using the
Thanks @jkreft-usgs! Wow, it really filtered a lot of sites! 😄 |
I think it's possible that we could identify all the HUC12s that overlap with a WikiWatershed web app user's Area of Interest, fetch those, then do spatial cropping/filtering on our side, similar to what we've done for our WDC spatial searches. However, we do not presently have that capability within the WikiWatershed API: https://app.wikiwatershed.org/api/docs/, although we've been talking about adding such functionality for quite some time. |
@aufdenkampe Hmm... that's interesting to do. I think it would be really cool.
I see how Azavea is getting their HUC information. I now have a simple way to get the HUC ID! Now I think I will try to do the spatial cropping/filtering you're talking about, and see if I can implement this. Stay tuned! 😄 |
So this is the query I have now using |
One question I have for @jkreft-usgs is the root search url, should it be https://www.waterqualitydata.us/data/Station/search or https://www.waterqualitydata.us/Station/search |
@lsetiawan Both work, but /data is more future-focused: we are getting ourselves out of URL mapping difficulties as we keep adding more endpoints. |
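For reference, a hedged sketch of building a HUC-based station query against the `/data` root. The `zip` and `sorted` parameters are copied from the example URL elsewhere in this thread; the `huc` parameter name, the semicolon-separated value format, and the HUC codes are assumptions to verify against the WQP web services documentation.

```python
from urllib.parse import urlencode

# The /data root is the one recommended above.
BASE_URL = "https://www.waterqualitydata.us/data"


def station_search_url(hucs, mime_type="tsv"):
    """Build a Station search URL for one or more HUC codes (no request is sent)."""
    params = {
        "huc": ";".join(hucs),  # assumed: multiple values separated by semicolons
        "mimeType": mime_type,
        "zip": "yes",
        "sorted": "no",
    }
    return f"{BASE_URL}/Station/search?{urlencode(params)}"


url = station_search_url(["02040205", "02040206"])  # hypothetical HUC codes
print(url)
```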
Thanks @lsetiawan and @jkreft-usgs for the work on HUC-based searching. Cool! @lsetiawan, keep in mind that we'll still need Personally, I'd much rather focus on refining the WQP search results, and only then go back to HUC search customization, building on what you've already done. I'll get back to the former when I'm back, but most likely not until the week after next. |
It does both kinds of search. If you use a HUC, it searches only that HUC. If you use free-draw or another option, it searches all the HUCs that intersect that AOI, but at the end the results get filtered so you only get locations within the AOI. I think what @jkreft-usgs said previously is that passing HUC IDs is better than doing an actual bbox geospatial search right now. |
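The final filtering step can be sketched as follows, simplified to a rectangular AOI; this is not the app's actual implementation. The column names `LongitudeMeasure` and `LatitudeMeasure` are assumed to match the station output, and the site records are made up.

```python
def crop_to_bbox(sites, west, south, east, north):
    """Keep only the sites whose coordinates fall inside the bounding box."""
    kept = []
    for site in sites:
        lon = float(site["LongitudeMeasure"])
        lat = float(site["LatitudeMeasure"])
        if west <= lon <= east and south <= lat <= north:
            kept.append(site)
    return kept


# Made-up sites returned by a HUC-based query that overshoots the AOI.
sites = [
    {"MonitoringLocationIdentifier": "X-1", "LongitudeMeasure": "-75.5", "LatitudeMeasure": "39.8"},
    {"MonitoringLocationIdentifier": "X-2", "LongitudeMeasure": "-77.0", "LatitudeMeasure": "39.8"},
]
inside = crop_to_bbox(sites, west=-76.0, south=39.0, east=-75.0, north=40.0)
print([s["MonitoringLocationIdentifier"] for s in inside])
```

A free-draw AOI would need a point-in-polygon test rather than a bounding-box comparison, but the fetch-then-crop structure stays the same.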
@jkreft-usgs I've run some tests for the
Thanks! |
Good questions. You can see the different data elements in the documentation: https://www.waterqualitydata.us/portal_userguide/ The default result output does indeed include many activity elements because, until recently, there were only two endpoints, result and station, and for result to make any sense it needed sampling activity data. However, there is a result output that is just result information, which you access with dataProfile=narrowResult: https://www.waterqualitydata.us/data/Result/search?statecode=US%3A55&countycode=US%3A55%3A025&siteType=Stream&mimeType=csv&zip=yes&sorted=no&dataProfile=narrowResult A service that we are working on for this year will be a summary service, which will hopefully help with this exact use case. Right now, if you want to get an overview of data at a site, you really have to do quite a lot of crunching first. |
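Until such a summary service exists, the "crunching" could look roughly like the sketch below, which counts results and tracks the date range per site and characteristic. The column names (`MonitoringLocationIdentifier`, `CharacteristicName`, `ActivityStartDate`) are assumed to be present in the result output being used, and the rows are made up.

```python
from collections import defaultdict


def summarize_results(rows):
    """Summarize result rows by (site, characteristic): count, first and last date."""
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for row in rows:
        key = (row["MonitoringLocationIdentifier"], row["CharacteristicName"])
        date = row["ActivityStartDate"]  # ISO dates (YYYY-MM-DD) sort correctly as strings
        entry = summary[key]
        entry["count"] += 1
        entry["first"] = date if entry["first"] is None else min(entry["first"], date)
        entry["last"] = date if entry["last"] is None else max(entry["last"], date)
    return dict(summary)


rows = [
    {"MonitoringLocationIdentifier": "X-1", "CharacteristicName": "pH", "ActivityStartDate": "2015-06-01"},
    {"MonitoringLocationIdentifier": "X-1", "CharacteristicName": "pH", "ActivityStartDate": "2012-03-15"},
    {"MonitoringLocationIdentifier": "X-1", "CharacteristicName": "Nitrate", "ActivityStartDate": "2014-01-10"},
]
overview = summarize_results(rows)
print(overview[("X-1", "pH")])
```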
Thanks @jkreft-usgs. I tried So, it looks like using this option does more damage than not using it 😞
This would be fantastic, and is exactly what we're looking for! |
Follow up thoughts ... Maybe Still, it looks like some of the "activity" information that's dropped would be very helpful for a summary/discovery service. |
Some follow-up results. I ran a The request returned 1,570,698 records. Getting the results took 8 minutes. Converting to a pandas data frame (which includes unzipping) took another minute, maybe less. For reference, the The next thing to try, to speed up the response, is @jkreft-usgs 's recommendations:
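The unzip-and-parse step mentioned above can be sketched with the standard library alone, assuming the zipped response contains a single CSV file. The function name and sample data are illustrative, and this is not necessarily how the notebook does it.

```python
import csv
import io
import zipfile


def rows_from_zipped_csv(zip_bytes):
    """Unzip an in-memory archive and parse its first member as CSV rows."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        name = archive.namelist()[0]  # assumes exactly one file in the archive
        with archive.open(name) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny zip in memory to stand in for a zipped response body.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("result.csv", "CharacteristicName,ResultMeasureValue\npH,7.2\n")

parsed = rows_from_zipped_csv(buffer.getvalue())
print(parsed)
```

For a response with more than a million records, reading the unzipped member in chunks (for example with the `chunksize` argument of `pandas.read_csv`) would avoid holding every row in a Python list.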
Still, while that will possibly benefit users by not having to wait a long time before some results are shown, I would imagine it won't dramatically cut down the total response time. It's also likely that HUC-based (as opposed to bbox-based) searches will be much faster, based on what Jim has told us. But given that HUC searching is only one of several spatial search options in the Wikiwatershed App, this is not a great solution. Still, we should do some benchmarks. @jkreft-usgs, have you had a chance to look into what Regardless, if we want the richness of metadata available in the |
A couple of notes to self (and Don) for reference and use later on. Linking to WQP granular resources and information
Recent, nice publication about WQP
Read, E. K., Carr, L., De Cicco, L., Dugan, H. A., Hanson, P. C., Hart, J. A., Kreft, J., Read, J., Winslow, L. A. (2017). Water quality data for national-scale aquatic research: The Water Quality Portal. Water Resources Research, 53(2), 1735–1745. https://doi.org/10.1002/2016WR019993 |
@emiliom, thanks for sharing those "notes to self". They're helpful for me to start exploring WQP and its metadata. |
@aufdenkampe: glad you found that useful. @lsetiawan is already working on implementing that new information into the detailed results view. An update on The request returned 357,176 records. Getting the results took a bit over 5 minutes (plus the time taken to convert to a pandas data frame, which includes unzipping). Compared to the previous 1° x 1° request, this AOI that's a quarter of the size returned a fifth of the records but took half the time, not 1/4 or 1/5 of the time! Darn. The |
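The comparison can be made concrete with a little arithmetic. The record counts and the 8-minute figure come from the two comments in this thread; "a bit over 5 minutes" is rounded down to 5 minutes here, which is an assumption.

```python
# Figures reported in this thread; the second duration is rounded to 5 minutes.
runs = {
    "1 x 1 degree AOI": {"records": 1_570_698, "seconds": 8 * 60},
    "quarter-size AOI": {"records": 357_176, "seconds": 5 * 60},
}

# Throughput of each request in records per second.
rates = {name: run["records"] / run["seconds"] for name, run in runs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} records/s")

# How the smaller request compares with the larger one.
record_ratio = runs["quarter-size AOI"]["records"] / runs["1 x 1 degree AOI"]["records"]
time_ratio = runs["quarter-size AOI"]["seconds"] / runs["1 x 1 degree AOI"]["seconds"]
print(f"records: {record_ratio:.2f} of the larger request, time: {time_ratio:.2f}")
```

With these numbers the smaller request delivers well under half the throughput of the larger one, which is consistent with a sizeable fixed cost per request rather than a cost that scales with the number of records.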
@emiliom It looks like my response was lost in too many tabs! The narrowResult data profile is working as expected, it is just different from what you might be expecting. You can see the different content of the data profiles here: https://www.waterqualitydata.us/portal_userguide/ Essentially the narrowResult data profile is named that because it is almost exclusively content from the "Result" part of the WQX data model, whereas the "default" was a mix of Result and activity. Now that we serve Activity information separately, it makes more sense to just serve that information separately, at least for some use cases... Also, it looks like you might find the domain value services useful, which you can see here: https://www.waterqualitydata.us/webservices_documentation/#WQPWebServicesGuide-Domain |
Thanks for the follow-up @jkreft-usgs (I know first hand about the situation of too many tabs and github issue responses not being finalized!) Regarding narrowResult, I went through the documentation and my own API tests, and unless I'm missing something, the set of attributes returned are not simply a subset of the attributes returned by the unqualified |
yes indeed. It might be easier to understand this in a call, but here goes. WQX has a number of top-level domains in its data model
We are working toward having endpoints for all of these domains (along with additional subdomains) in WQP. However, WQP used to try to do everything with only 2 endpoints, Result and Station, and everything needed to be crammed into those two endpoints. Station had some stuff from Organization and Monitoring location, and Result had a mix of key elements from Activity and Result: enough to describe most physical and chemical samples, but not biological samples. We first added the "biological" data profile to WQP, which added a few dozen additional columns to the result endpoint; it is basically the kitchen sink. Then we added the "Activity" endpoint, which meant that we didn't need to serve the activity data over and over again, and could drop a whole pile of columns from result. Hence "narrowResult", which has almost exclusively columns from the result domain. However, to make real sense of that narrowResult data, you do need to also get activity data. Clear as mud? The next step is to deprecate the existing data profiles in favor of ones that will best support the user community, and also add more effective summary services. |
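Combining the two outputs on the client side might look like the sketch below. It assumes that both the narrowResult rows and the Activity rows carry a shared `ActivityIdentifier` column; that column name, the other column names, and the sample values are assumptions to check against the user guide.

```python
def attach_activity(results, activities, key="ActivityIdentifier"):
    """Attach the matching activity columns to each result row.

    Result columns win if a column name appears in both records.
    """
    by_id = {activity[key]: activity for activity in activities}
    merged = []
    for result in results:
        activity = by_id.get(result[key], {})  # empty if no matching activity was fetched
        merged.append({**activity, **result})
    return merged


# Made-up rows standing in for the two downloads.
activities = [{"ActivityIdentifier": "act-1", "ActivityStartDate": "2015-06-01"}]
results = [
    {"ActivityIdentifier": "act-1", "CharacteristicName": "pH", "ResultMeasureValue": "7.2"},
    {"ActivityIdentifier": "act-2", "CharacteristicName": "pH", "ResultMeasureValue": "6.9"},
]
combined = attach_activity(results, activities)
print(combined[0])
```

Because each activity is fetched once and looked up by identifier, the activity columns are not repeated in the download for every result row, which is the saving the Activity endpoint was meant to provide.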
I have updated the details page of the USGS WQP Catalog. Users are able to go to the URL's mentioned in #8 (comment) and also download the sample results in a click of a button. |
@lsetiawan thanks for the screenshot and the enhancements you've implemented! @jkreft-usgs thanks for the explanations and background. I think we're on the same page now. The historical sequence may be clear as mud, but the current situation is clear. Still, it does mean that "narrowResult" doesn't offer much of a performance/payload advantage relative to the current default, unqualified I'll chew on this early next week, to get a better sense of near-term options for the Wikiwatershed App. |
pywqp at Is this package used actively? NWQMC/pywqp#6, where I pinged Jim Kreft
wqp collector
wqx parser
dataRetrieval