Subsetting tool #57
Comments
Thanks for the project idea contribution @SorooshMani-NOAA! This sounds similar in my naive viewpoint to an existing NOAA HPCC-funded project being led by OR&R/Chris Barker and IOOS' RPS partners that is currently underway. @ChrisBarker-NOAA @mpiannucci @jonmjoyce How much overlap do you see between this proposed package and the work you're doing? If there is, can we consolidate efforts if one of you would be willing to mentor a student for this work (in addition to @SorooshMani-NOAA and @AtiehAlipour-NOAA) during this year's GSoC? The project could be scoped more narrowly, if so, to a particular piece of functionality you could use help with. Another piece that would help to accept this project for GSoC is an existing code base to cite and to build off of. I think the RPS folks have this code already if my impression of the similarity is correct, but I don't know where to point to. |
@mwengren thanks so much for the feedback. I think this project is linked to #42. We are working on developing a package for subsetting STOFS model output. At this point, the code is not ready to be shared. I am in the process of testing different packages, and we were thinking that with GSoC, we can have some help for code improvement. Do you think defining the project more narrowly for STOFS model output would help? Or do you think with such overlaps with other projects, it is possible to increase that project size so that we can cover subsetting STOFS model outputs as well? Thanks! |
@mwengren Thanks for tagging. The package we are developing is scoped to allow subsetting in space for UGRID and SGRID datasets to start. STOFS, I believe would fall within this scope as an unstructured model as long as the metadata is cf compliant for the mesh topology. Beyond that, we will also be deploying a cloud service to subset the data in the cloud directly from NODD. There is certainly a lot of overlap between the two ideas |
@mpiannucci Is there any public component of the code you're working on we could link to here? That would give @AtiehAlipour-NOAA a reference to look at to understand if that could be leveraged for their proposed work or not. I don't want to say no to this idea, but I've already set a precedent that projects need to have existing public code to start from in order to be included. So I think we'll need at least some initial code to be used as a reference for this idea to go forward. The easiest way to meet that would be for this project to build off of the HPC project code, extending to the STOFS use case, if appropriate. Or, we need to go back and accept some proposals I've declined previously to be fair, which we could do. Open to suggestions from our GSoC community on that. Thanks! |
I think that's the opposite of the right thing to do -- there have been a LOT of false starts -- one off codes in this space, and here we are now talking about at least two independent efforts that are duplicating each other and previous codes.
First -- yes. But I don't think that's the way to frame it -- rather, the goal is a framework that could be used with any gridded model results that conform to existing standards: CF, UGRID, SGRID. So getting it to work with STOFS should be trivial, once the framework is in place.
And if it's not compliant (it's probably not), the API should have a way to massage it to be usable. That's pretty key, actually. As far as I've seen, NONE of the operational models provide fully standards-compliant output. That being said, nothing is known to work until it's been tested -- so "Getting the existing code to work with STOFS" is a fine goal -- it could be trivial, or maybe the code will need to be extended or refactored a bit, which could make a good GSoC project. |
@AtiehAlipour-NOAA wrote:
I don't think there's such a thing as "not ready to be shared" -- if this is going to be a community project, which I hope it will be, getting feedback early is better than later. And there's the effort @mpiannucci referred to - it's silly to have these as independent efforts. |
This part is the key btw:
We are injecting the cf compliant ugrid metadata into NOS models with kerchunk and pushing them to NODD to facilitate this workflow |
Not the place to discuss this, but:
Nice! and certainly applicable to STOFS. Though it's a bit less useful for using the same code outside the NODD (or similar Cloud systems) -- I'd love to have a library I can point at a pile of netcdf files on my machine without having to massage them first. (though perhaps having to provide some declarative data about how to interpret them) -CHB |
Sorry didn't mean to cross wires, it is applicable to STOFS so thought I would mention. |
@SorooshMani-NOAA No, the code isn't part of an existing GSoC project; it's being developed as part of a separate effort at NOS (that @ChrisBarker-NOAA and @mpiannucci are involved with). We'll post that reference here as soon as we have it. I think this would be a useful complement to that, and on the assumption that the existing code will be up soon and that this project can be based on it, I think we should go ahead and accept this so that potential students have time to review and apply before April 4. They will need code to base their applications on, so if for some reason we can't make that available, we may have to pull this project again. In any case, I see a lot of ideas for modifying the initial project in the comments above. Please update the initial issue comment with whatever changes you decide to make to the scope, so it's clear to applicants in one location what the expectation is. If you want to change the project size to something other than 175 hours, that is ok as well. Also, @ChrisBarker-NOAA mentioned he might be interested in being a co-mentor as well, I think. Please add any additional mentor(s) you want to include in the first comment. |
@AtiehAlipour-NOAA and I will update the description and add @ChrisBarker-NOAA as co-mentor. Please add the link as soon as you have it, thanks! |
@ChrisBarker-NOAA as far as I understand currently we're still in exploration phase for this subsetting effort, i.e. we're trying out different ways of subsetting, or different tools we know about, to see how fast they are, etc. I'm not sure if there's an actual packaged code from our efforts yet, and that's what @AtiehAlipour-NOAA meant, I believe. There's probably a single script file that we can share, but I don't believe it's in a shape or form to have a repo of its own. @AtiehAlipour-NOAA please correct me if I'm wrong. With that being said, we'd be happy to share anything we've tried so far with you or the contributors as a starting point. It's great that you are already developing a package, and even if nothing comes out of this project, we at least know about your effort and can learn from you and contribute to your code base. |
@SorooshMani-NOAA, thank you for the response. That is absolutely correct. @ChrisBarker-NOAA, thank you for the comments and feedback; we are not developing a new subsetting tool; instead, we want to use publicly available packages to subset STOFS model output. So far, we have been testing the performance of different available packages as the subsetting tool, each in its own script. Since there is no conclusion yet and the work involves testing different packages, it is not in a state to have its own repository, as @SorooshMani-NOAA mentioned. However, we would be happy to share those scripts with you and the future contributors. Thanks again for your feedback and help. We are very excited to learn that you are already developing a package, and we look forward to learning more from you in the future. |
Hello mentors,
Edit : Thank you, |
@AtiehAlipour-NOAA wrote:
Got it, thanks. However, even if there's no new code, the information about what's been tried, and how it's worked, could be really helpful to all -- and particularly to this project -- it would be great to put that somewhere it can be shared. |
Hello Omkar, Thank you for expressing interest in our project. We're glad you're excited to contribute. Below are the answers to your inquiries: Base Codes for Starting Point: Certainly, here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data. The codes use the XUgrid and Thalassa packages, which are designed to work with 2D unstructured grids. We are currently working on enhancing the code with various formats and tools like Zarr and Dask. We are also exploring options such as transposing datasets and using Kerchunk. Project Focus: At the moment, our project focuses on STOFS-2D-Global. However, we have plans to expand our framework to include other types in the future. Libraries and Documentation: We do not currently use the csdllib or autoval packages in our project. These packages are typically used for post-processing and model evaluation within the STOFS framework. For our development, we rely on open-source libraries like xarray and netcdf4. Regarding your application, it's great to hear about your background in data modeling, visualization, Python, and its libraries. We encourage you to highlight your skills and experiences in your application. Please let us know if you have any further questions or need additional information. |
@ChrisBarker-NOAA, You're absolutely right. Here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data. These examples use the XUgrid and Thalassa packages. Please feel free to reach out if you have any questions. Thank you once again for your valuable contribution to this project. |
@AtiehAlipour-NOAA
I'll be going through the xarray-subset-grid notebooks and try it out myself, maybe use it with STOFS. I'll update my progress soon. Thank you! |
@omkar-334, great work! 1- That's a great observation. We are glad you tested it on Colab and noticed the significant difference in performance. 2- Good catch. You can disregard that part; there's no need to drop it for the Thalassa package. 3- You're free to define any subset box; the example provided is just a starting point. In the future, we aim to use any polygon/shapefile for data subsetting. 4- Let's discuss this further as a group, based on the specific features we want to implement.
Wonderful! If you have any questions, feel free to ask. Thank you once again for your dedication and hard work. |
@AtiehAlipour-NOAA Thank you for the feedback!
When I mention only ds without any parameter -> I've noticed that 'dataset.cf' returns Grid Mapping and Bounds both as N/A. Thank you very much for your patience and for answering my questions. |
A few good sources on zarr benchmarks that I found interesting - |
@omkar-334 it's really encouraging to see that you are already spending time on figuring out how to improve the sample codes provided. However I'd suggest that you don't share all of your findings on this ticket and instead email it to Atieh or me ([email protected], [email protected]). Another thing I'd like to point out is that the To answer a couple of your questions:
|
@SorooshMani-NOAA ,
Thank you! |
@omkar-334 as I mentioned earlier, this package is still in development. Its main developers have informed me that in its current state they don't expect it to just work out of the box, but by the start of GSoC it will be in a better state. Anything that you notice now and is not fixed by the start date will be a part of your contribution for the GSoC project; and I and the other mentors will be more than happy to help the contributors with fixing it at that point! You can include the bugs you find as a part of your proposal. |
Hi @SorooshMani-NOAA 1.b In the subsetting service mentioned in the project Description, would the zarr dataset references be generated each time or would we store them in case two users need to subset the same dataset? You've mentioned earlier that the idea is to move towards saving new data in Zarr format, so I think we could do with using virtual Zarr files of old data.
Thank you! |
Please note that while #42 is mentioned in this ticket and is related to this project, it’s a whole separate GSoC project. One idea behind developing this tool is to make the model results more accessible, for example by making them easier to download for those who have low internet bandwidth. Sometimes model data is divided over multiple netCDF files. Suppose someone needs a multi-day timeseries of modeled water elevation; usually this means the data is spread across multiple netCDF files produced by different simulation cycles. The result of the subsetting would then combine the data from all those datasets into a single one (ideally Zarr). I hope that answers your question.
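To make the multi-cycle case concrete, here is a minimal sketch in plain Python. It is only an illustration, not code from this project: the `merge_cycles` helper, the sample values, and the rule that a later cycle overwrites an earlier one where their valid times overlap are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def merge_cycles(cycles):
    """Merge per-cycle series into one timeseries (hypothetical helper).

    cycles: list of (cycle_start, [(valid_time, value), ...]).
    Assumption: where cycles overlap, the value from the later cycle wins.
    """
    merged = {}
    # Process cycles from oldest to newest so newer values overwrite older ones.
    for _, series in sorted(cycles, key=lambda c: c[0]):
        for valid_time, value in series:
            merged[valid_time] = value
    return sorted(merged.items())


t0 = datetime(2024, 1, 1, 0)
hour = timedelta(hours=1)

# Two made-up cycles of water elevation at one node; they overlap by two hours.
cycle_a = (t0, [(t0 + i * hour, v) for i, v in enumerate([1.0, 1.1, 1.2, 1.3])])
cycle_b = (t0 + 2 * hour, [(t0 + (2 + i) * hour, v) for i, v in enumerate([2.2, 2.3, 2.4, 2.5])])

timeseries_values = [value for _, value in merge_cycles([cycle_b, cycle_a])]
print(timeseries_values)
```

In a real workflow the same idea would apply to datasets opened from the individual netCDF files, with the combined result written out once.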
The Zarr metadata for a given file needs to be stored, otherwise the whole file needs to be downloaded every time to be able to then chunk it, which defeats the purpose. In an ideal world, we’d be able to just use kerchunk and look into old netCDF results like a Zarr file optimally. But in reality there are a few things that need to be considered. Most important of all, the underlying chunking (binary blobs) of the data in the file needs to align with what is optimal when retrieving either a long timeseries for a single location or a single time step of a large area. These two cases are two sides of the same coin, but both are very valid use cases for different users. With that in mind, one might need to rechunk the original data before using kerchunk to generate virtual Zarr files.
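The trade-off between those two access patterns can be seen with a small back-of-the-envelope calculation. The sketch below is plain Python and does not use Zarr or kerchunk; the array size (240 time steps by 1,000,000 nodes), the three chunk layouts, and the `chunks_touched` helper are all hypothetical, chosen only to show how many chunks each kind of request would have to read.

```python
def chunks_touched(chunks, selection):
    """Count the chunks overlapped by a rectangular selection.

    chunks: chunk length along each dimension, e.g. (time, node).
    selection: one (start, stop) index range per dimension, stop exclusive.
    """
    count = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size          # index of the first chunk touched
        last = (stop - 1) // size      # index of the last chunk touched
        count *= last - first + 1
    return count


# Hypothetical array: 240 hourly time steps x 1,000,000 mesh nodes.
timeseries = ((0, 240), (123_456, 123_457))   # every time step, one node
snapshot = ((100, 101), (0, 1_000_000))       # one time step, every node

layouts = {
    "one time step per chunk": (1, 1_000_000),
    "all time steps, 1000 nodes per chunk": (240, 1_000),
    "24 time steps, 10000 nodes per chunk": (24, 10_000),
}

for name, chunks in layouts.items():
    print(name, chunks_touched(chunks, timeseries), chunks_touched(chunks, snapshot))
```

With these made-up numbers, the layout that serves a snapshot from a single chunk needs 240 chunk reads for a timeseries, and the layout that serves a timeseries from a single chunk needs 1000 reads for a snapshot, which is why rechunking toward a compromise layout may be worth considering.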
There are no hard requirements. This could be a very basic portal with tools to define a region (as simple as lat/lon values or more sophisticated polygon drawing on a map) and then the submitted job runs the subsetting and finally the location of that subsetted dataset is emailed to the user to download. It also could be you ignore this requirement and provide dataset to the user as a service, through packages like
If you’re not already, I’d suggest that you familiarize yourself with how an unstructured mesh is defined (don’t worry about generating one!). One common way to represent an unstructured mesh is with two tables: a coordinates table and an elements table. The coordinates table gives the lat/lon location of each mesh node; the elements table lists which nodes are connected to form each element.
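As a toy illustration of the two tables, here is a sketch in plain Python with a six-node, four-triangle mesh and a bounding-box subset. This is not how xarray-subset-grid or any other package does it; the `subset_bbox` helper and the rule of keeping only elements whose nodes all fall inside the box are assumptions made for the example.

```python
# Coordinates table: position in the list is the node index, value is (lon, lat).
nodes = [
    (0.0, 0.0),  # node 0
    (1.0, 0.0),  # node 1
    (0.0, 1.0),  # node 2
    (1.0, 1.0),  # node 3
    (2.0, 0.0),  # node 4
    (2.0, 1.0),  # node 5
]

# Elements table: each triangle is the three node indices it connects.
elements = [
    (0, 1, 2),
    (1, 3, 2),
    (1, 4, 3),
    (4, 5, 3),
]


def subset_bbox(nodes, elements, lon_min, lon_max, lat_min, lat_max):
    """Keep elements fully inside the box and renumber their nodes (hypothetical helper)."""
    inside = [
        lon_min <= lon <= lon_max and lat_min <= lat <= lat_max
        for lon, lat in nodes
    ]
    # Assumption: an element is kept only if all of its nodes are inside.
    kept_elements = [e for e in elements if all(inside[n] for n in e)]
    kept_nodes = sorted({n for e in kept_elements for n in e})
    # The elements table must be rewritten to point at the new node numbering.
    renumber = {old: new for new, old in enumerate(kept_nodes)}
    new_nodes = [nodes[n] for n in kept_nodes]
    new_elements = [tuple(renumber[n] for n in e) for e in kept_elements]
    return new_nodes, new_elements, kept_nodes


sub_nodes, sub_elements, original_ids = subset_bbox(nodes, elements, 0.5, 2.5, -0.5, 1.5)
print(sub_nodes)
print(sub_elements)
print(original_ids)
```

The list of original node indices is what you would use to pull the matching values (for example water elevation) out of the full model output.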
Thank you for documenting those errors. We will address them once the project begins.
I don’t see your proposal in the list; I think I only see fully submitted proposals. Please share your proposal in Google Docs for our review so that we can share our feedback before your final submission. Thank you for all your hard work on this project. Please let us know if there is anything else we can help with. |
@omkar-334 just FYI the |
You're right, it could be the grid logic that's causing the issues. However, when I use the subset_polygon method, Thank you! |
Yes, I do realise that #42 is a separate project. I was just going through the technical details and libraries, since they are similar to that of this project.
Yes, It does answer my question. Thank you.
Ok, got it. I'll read up more about this.
Noted, I will share my proposal in Google Docs for your review. |
Project Description
This project aims to develop use case-specific enhancements to the xarray-subset-grid library to improve functionality and support for STOFS data subsetting (https://registry.opendata.aws/noaa-gestofs/), including:
This project provides an opportunity to enhance accessibility to ocean water level forecast data, which is crucial across different sectors. Here is a link to a few examples where we've experimented with subsetting the STOFS-2D-Global data.
Expected Outcomes
Improved open-source code for efficiently subsetting STOFS model output with a demonstrated use-case example (STOFS data).
Skills required
Python; Libraries: Xarray, Dask, Zarr. Cloud Storage
Mentor(s)
Atieh Alipour (@AtiehAlipour-NOAA ), Chris Barker (@ChrisBarker-NOAA), Soroosh Mani (@SorooshMani-NOAA)
Expected Project Size
350
What is the difficulty of the project?
Intermediate