simpler file navigation utilities #8

Open
bkatiemills opened this issue Mar 1, 2016 · 16 comments
Comments

@bkatiemills
Member

Regarding #6, it's not as simple as it could be for a user to walk around the file. For example, is_last_profile_in_file works in a guessable way on a WodProfile class, but advance_file_position_to_next_profile appears to do nothing, thanks (I think) to some pass-by-value-ism in how Python is thinking of the fid variable.

It'd be nice to have a set of functions with the semantic meaning:

  • skip to the next profile
  • rewind to the start of this profile
  • skip to the previous profile
  • skip to the first profile
  • skip to the last profile

Much of this functionality already exists, but as detailed above, isn't totally obvious in usage.

@s-good
Contributor

s-good commented Mar 1, 2016

Hi @BillMills, the WOD format puts limitations on what we can do with regard to this kind of thing. We can probably tighten things up a bit, but I think the main thing will be to improve the documentation.

  • Skipping to the next profile: advance_file_position_to_next_profile - I think this is working as expected but probably not well explained. Actually I don't think there is a need for anyone to use it so maybe it should have an underscore at the beginning of its name to indicate it's an internal function. Using the WodProfile init method (with load_profile_data = False if the profile data are not wanted) is the best way to advance to the next profile as we have to read at least part of the profile data in order to work out where the next profile starts. The init method calls the advance_file_position_to_next_profile function at its end.
  • Rewind to the start of this profile - return_file_position_to_start_of_profile does this (needs documenting though). I think this is mostly useful if the profile has previously been read with load_profile_data = False and we want to go back and reread with load_profile_data = True.
  • Skip to the previous profile - this is only possible if we save the profile headers (or at minimum the positions of each profile as it is read) as otherwise we don't know the length of the previous profile in the file so don't know how far to go back.
  • Skip to the first profile - equivalent to doing fid.seek(0) so could be implemented easily.
  • Skip to the last profile - this is only possible by reading through the data file until is_last_profile_in_file becomes True and then rewinding the file position to the start of that profile but could be implemented.
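As a rough illustration of the "save the positions of each profile as it is read" idea above, here is a minimal sketch. It does not use wodpy: `read_toy_record` and `RecordNavigator` are hypothetical names, and the toy record format (a 3-digit length followed by a payload) is only a stand-in for a real profile reader. End-of-file handling is deliberately left out.

```python
import io


def read_toy_record(fid):
    """Toy stand-in for a profile reader: a 3-digit length, then the payload."""
    size = int(fid.read(3))              # length of the whole record, header included
    return fid.read(size - 3).decode("ascii")


class RecordNavigator:
    """Remember where each record starts so the file can be walked both ways."""

    def __init__(self, fid, read_record):
        self.fid = fid
        self.read_record = read_record   # callable(fid) that reads exactly one record
        self.starts = []                 # start offset of every record seen so far
        self.index = -1                  # index of the record most recently returned
        self._frontier = fid.tell()      # offset just past the furthest record read

    def _read_at(self, index):
        # Revisit a record whose start offset is already known.
        self.fid.seek(self.starts[index])
        self.index = index
        return self.read_record(self.fid)

    def next(self):
        if self.index + 1 < len(self.starts):
            return self._read_at(self.index + 1)
        # New territory: read from the frontier and remember where it began.
        self.fid.seek(self._frontier)
        record = self.read_record(self.fid)
        self.starts.append(self._frontier)
        self.index += 1
        self._frontier = self.fid.tell()
        return record

    def previous(self):
        if self.index <= 0:
            raise IndexError("no earlier record has been read")
        return self._read_at(self.index - 1)

    def first(self):
        if not self.starts:
            return self.next()
        return self._read_at(0)


nav = RecordNavigator(io.BytesIO(b"008hello006abc007wxyz"), read_toy_record)
nav.next()       # reads the first record
nav.next()       # reads the second record
nav.previous()   # goes back to the first record using the saved offset
```

The file is opened in binary mode here so that the saved offsets are plain byte positions. "Skip to the last profile" would still need a full pass through the file, as described above.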

@rowanwins
Contributor

Gday guys,

Just a quick one to say that it would be great to see some improved doco around this. Perhaps a few example scripts could be added to the repo that show common tasks (eg iterating through profiles in a file to populate a netcdf)?

I'll try to dig in to the code a bit over the next few days and if I can contribute anything I'll send through some pull requests (although I am a bit of a python noob!).

Cheers
Rowan

@bkatiemills
Member Author

@s-good ah yes, I see what you're saying; there's no unambiguous flag denoting the start of a profile (there is for WOD01, 05, 09 and 13, since they all start with A, B or C after a newline, but WOD98 blows it).

I think the reason for advance_file_position_to_next_profile not seeming to actually advance the file (as @rowanwins found in #6) was that it relies on the primary_header and file_position getting updated first, as is done only in the constructor. So in a case like

profile = wod.WodProfile(fid)
print('uid', profile.uid())

# no profile is skipped here: this seeks to where the constructor already left the file
profile.advance_file_position_to_next_profile(fid)

profile = wod.WodProfile(fid)
print('uid', profile.uid())

Where one might have imagined that middle call to skip a profile, no profile is skipped since profile still has the same values of primary_header and file_position as were found when advance_file_position_to_next_profile was called at the end of the first constructor - meaning that middle call 'advances' to the exact same place. Not wrong - but easy to guess wrong!
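To make the behaviour described above concrete, here is a toy model of it. This is not wodpy's code: `ToyProfile` is hypothetical, and it assumes only what is said above, namely that the method seeks to a position computed from values stored by the constructor.

```python
import io


class ToyProfile:
    """Toy record: a 3-digit total length followed by a payload."""

    def __init__(self, fid):
        self.file_position = fid.tell()        # where this record starts
        self.size = int(fid.read(3))           # stands in for the primary header
        self.payload = fid.read(self.size - 3)
        # As in the description above, the constructor finishes by advancing.
        self.advance_file_position_to_next_profile(fid)

    def advance_file_position_to_next_profile(self, fid):
        # Depends only on values stored at construction time, so calling it
        # again always lands on the same offset.
        fid.seek(self.file_position + self.size)


fid = io.BytesIO(b"008hello006abc")
first = ToyProfile(fid)
position_after_constructor = fid.tell()
first.advance_file_position_to_next_profile(fid)   # the "middle call"
position_after_middle_call = fid.tell()
second = ToyProfile(fid)
```

In this toy, both recorded positions are equal and `second` is the record immediately after `first`, which mirrors the "advances to the exact same place" behaviour.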

I haven't dug into the other functions, but I suspect there are similar stories for the ones with surprising behavior. But anyway, I think you're right - putting underscores in front of all of these and discouraging end users from poking at them is the easiest and most robust solution.

@rowanwins that would be awesome! @s-good is correct that a number of the functions I described exist, but as I discuss above, they rely on being used in a very particular context. The canonical usage example is:

from wodpy import wod

f = open('myWODfile.dat')
profile = wod.WodProfile(f)
# ... do stuff with first profile ...

while not profile.is_last_profile_in_file(f):
    profile = wod.WodProfile(f)
    # ... do stuff with nth profile ...

I'd be delighted to get some PRs with example usage - you've got the right idea, to think of the simplest relevant minimal working examples, so that wodpy usage doesn't get too muddled up with whatever we're doing alongside of it. Thanks, let me know!
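One possible shape for such an example is a generator that wraps the canonical loop, so that calling code only needs a `for` statement. This is a sketch rather than part of wodpy: `iter_profiles` is a hypothetical helper, and `FakeProfile` is a one-line-per-profile stand-in used only so the block runs without a data file. The helper assumes the class it is given behaves like `wod.WodProfile` as used above (constructor takes the file object, `is_last_profile_in_file` takes it too) and that the file holds at least one profile.

```python
import io


def iter_profiles(fid, profile_class):
    """Yield every profile in fid, including the first and the last."""
    while True:
        profile = profile_class(fid)
        yield profile
        if profile.is_last_profile_in_file(fid):
            break


class FakeProfile:
    """Stand-in for a real profile class: each line of the file is one 'profile'."""

    def __init__(self, fid):
        self.line = fid.readline().rstrip("\n")

    def is_last_profile_in_file(self, fid):
        # Peek one character ahead, then restore the file position.
        position = fid.tell()
        at_end = fid.read(1) == ""
        fid.seek(position)
        return at_end


lines = [p.line for p in iter_profiles(io.StringIO("a\nb\nc\n"), FakeProfile)]
```

With the real library the call would presumably look like `for profile in iter_profiles(f, wod.WodProfile): ...`.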

@rowanwins
Contributor

Hmmm well I'm still not getting very far on the looping front I'm afraid, I can get through the first two profiles but then the third record freaks out

from wodpy import wod

fid = open("example2.dat")

profile = wod.WodProfile(fid)

while profile.is_last_profile_in_file(fid) == False:
    print profile.latitude()
    profile = wod.WodProfile(fid)

The error seems to be something along these lines...

  File "C:\Python27\ArcGIS10.3\lib\site-packages\wodpy\wod.py", line 167, in _read_primary_header
    self._interpret_data(fid, prhFormat, primary_header)
  File "C:\Python27\ArcGIS10.3\lib\site-packages\wodpy\wod.py", line 98, in _interpret_data
    value = item[2](chars)
ValueError: invalid literal for int() with base 10: ''

:(

@bkatiemills
Member Author

Have a look at the end of your datafile and check:

  • is the last line 80 characters long (filling in with spaces if there isn't that much data)
  • does the file end with one blank line?

The wod spec requires ascii wod files to consist exclusively of 80-character lines terminated in a newline character - which is easy to miss! I can reproduce your error iff I remove my trailing whitespace.
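A quick way to run that check over a whole file is sketched below. `find_bad_lines` is a hypothetical helper, not something wodpy provides, and it assumes the rule exactly as stated above: every line is 80 characters long and ends with a newline. An extra empty line at the end would be reported as a 0-character line, so adjust if your reading of the spec differs.

```python
import io


def find_bad_lines(fid, width=80):
    """Return (line_number, problem) pairs for lines that break the fixed-width layout."""
    problems = []
    for number, line in enumerate(fid, start=1):
        if not line.endswith("\n"):
            problems.append((number, "no terminating newline"))
        length = len(line.rstrip("\r\n"))
        if length != width:
            problems.append((number, "%d characters instead of %d" % (length, width)))
    return problems


# A well-formed toy file, and one whose last line lost its padding and newline.
good = io.StringIO(("A" * 80 + "\n") * 3)
truncated = io.StringIO("A" * 80 + "\n" + "B" * 12)

good_report = find_bad_lines(good)
truncated_report = find_bad_lines(truncated)
```

For a file on disk, something like `find_bad_lines(open("example2.dat"))` would list the offending line numbers.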

@rowanwins
Contributor

Ok so had a look at that and all seemed ok. Rather it appears I'm somehow being foiled by the WODselect download tool. I tried again with one of the pre-canned geographic datasets and it worked fine.

With my completely uneducated guess I think what is happening is when I run my WODselect the resulting dataset is too large and so it's being split over multiple files (eg file1.dat, file2.dat, file3.dat). I suspect I somehow need to append these files back together before passing them into wodpy.

All a good learning experience, it's as much about wodpy as it is the WOD download tools!

@s-good
Contributor

s-good commented Mar 2, 2016

Hi @rowanwins, the WODselect tool does split the data into multiple files but I think that each file should be readable without appending them together. @BoyerWOD, do you have any advice?

@BillMills, sorry to be pedantic but I think that advance_file_position_to_next_profile can only work as it currently does, as it is a method of an instance of WodProfile. Its function is to move the file pointer to the end of where the data record corresponding to that instance of WodProfile occurs in the data file, and it wouldn't make sense for it to do anything else. Maybe we need to start a utils module that does file manipulations that are not tied to a particular profile.

@bkatiemills
Member Author

@s-good I think you're right - long term, a module along those lines could be good; might even be worth thinking about smart ways to unpack into a database there, like our recent discussion with Gui, potentially after AutoQC 1.0.

@Thomas-Moore-Creative

Thomas-Moore-Creative commented Jun 14, 2018

Hi IQuOD/wodpy people. Firstly thanks for all your efforts on these packages. Great to see.

Hopefully this is an appropriate question: WODselect has many options. Is there any guidance on what settings work best with wodpy? I intend to give it a go and see if I can build some tools with wodpy as the base. Thanks.

[edit]
I also note the very excellent recent release of the International Quality Controlled Ocean Database (IQuOD) version 0.1. Are there already plans / efforts to make wodpy compatible with these NetCDF files?

@BoyerWOD

Right now wodpy works only with the WOD native ASCII option. I do not know if a netCDF option will be added, but this would be a nice feature. Other options:

  • XBT/MBT correction: wodpy works with any correction. For IQuOD we are recommending the Cheng et al. (2014) correction.
  • observed or standard levels: wodpy will work with either. Standard levels are more specialized, in that they are an intermediate step in a specific application - calculating mean climatological fields. For this reason standard level data are available with the Levitus et al. (2009) XBT/MBT correction. For IQuOD, observed levels are recommended.
  • Flag type: wodpy will work with either. IQuOD choice also includes uncertainties and intelligent metadata.
  • Data in files by instrument or all together: wodpy works with both, no recommendation other than user preference.

I think that covers all relevant options. Let me know if I missed any.

Thanks,
Tim

@Thomas-Moore-Creative

Hi Tim,

Thanks for the quick and very useful response - much appreciated.

I guess starting out building some tools in python the question is: should one start from a base of (A) WODselect files with IQuOD flags and wodpy or (B) IQuOD netcdf files w/something else (like xarray). [This is largely a rhetorical question - will work through these options].

@BoyerWOD

Yes, xarray would work. There is still the need to translate the netcdf files into xarray and make sure the depths relate to the other measured variables even though the other variables may have different dimensions. The netcdf files produced by WODselect are IQuOD (and WOD) netcdf files. The full set of IQuOD and WOD contiguous ragged array files as they currently stand can be found through their landing pages on a THREDDS server:
IQuOD https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0170893
WOD https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0171598
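For readers unfamiliar with the contiguous ragged array layout mentioned above, the sketch below shows the general idea in plain Python: all observations sit in one flat array, and a per-cast count says how many belong to each cast. The variable names and values here are invented for illustration and are not taken from the IQuOD or WOD files; check the actual files for the real variable names.

```python
def split_by_row_size(flat_values, row_sizes):
    """Cut one flat observation array into per-cast lists using per-cast counts."""
    if sum(row_sizes) != len(flat_values):
        raise ValueError("row sizes do not add up to the number of observations")
    casts = []
    start = 0
    for size in row_sizes:
        casts.append(flat_values[start:start + size])
        start += size
    return casts


# Each variable carries its own counts, so depth and temperature are split
# separately; this is why their dimensions can differ.
depth = [0.0, 10.0, 20.0, 0.0, 5.0]
depth_row_size = [3, 2]
temperature = [18.1, 17.9, 15.2]
temperature_row_size = [3, 0]          # the second cast has no temperature

depth_by_cast = split_by_row_size(depth, depth_row_size)
temperature_by_cast = split_by_row_size(temperature, temperature_row_size)
```

Relating depths to another variable then becomes a per-cast comparison of the two lists rather than an index-by-index match on the flat arrays.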

Tim

@bkatiemills
Member Author

Hi @Thomas-Moore-Creative @BoyerWOD - re: netcdf compatibility, sure, I’m game to implement this for WOD data if there’s demand for it. To get things started, it’d be helpful if you provided a netcdf profile or two, with a description of the ‘right answer’ - what it should decode to, so we can build unit tests to validate this properly. Make that available and netcdf support can be next up.

@Thomas-Moore-Creative

Thanks for the reply Bill. It's great having someone with your IT experience working with oceanographers - hopefully I / we don't frustrate you with our poor programming practices! =)

I don't yet have a 'right answer' for my current task. I'm waiting for that myself - and it's very specific to our current uses. And I'm not looking at NC "profile" files but the yearly files recently released by IQuOD.

What I'm currently working on is sucking in 30 years of observations at one time (about 400M obs and 1M casts), creating some xarray datasets and / or pandas dataframes that make sense given the different dimensions, and writing some basic tools that allow me to slice & dice by typical things like time, space, and flags.

Tim and I are just working through some of the questions I have about what I'm seeing in the data that doesn't make immediate sense to me.

BUT - this is not meant to be discouraging of the need to follow your suggestions above.

@Thomas-Moore-Creative

At the risk of getting off-topic here is the toy analogous problem I worked through (VERY SIMPLE problem and approach) to get some code that can help me "merge" the "casts" data with the "obs" data in the current crop of v0.1 IQuOD datasets. I'm not sure how useful it is for others but I'm pushing code up to a public repo in the chance it's useful or might spark discussion > https://github.com/Thomas-Moore-Creative/IQuOD_scratch/blob/master/Toy_problem_merge_by_row_size.ipynb

@bkatiemills
Member Author

@Thomas-Moore-Creative thanks for sharing, and sorry for the slow reply - what you've got there is almost an outer join between your two input tables, but it assumes some things about the ordering of rows in your dataframes, specifically that if the first child has n pets, those are the first n rows of the pets table, etc. This makes me nervous, since if anyone ever re-sorts those tables, information gets lost. Would it be possible to introduce a foreign key into one table or another? Then:

import pandas as pd


child_df = pd.DataFrame({
    'Child ID':[1,2,3],
    'Child Birth Year':[2010,2014,2015],
    '# of pets':[2,1,3]})


pet_df = pd.DataFrame({
    'Child ID':[1,1,2,3,3,3],
    'Pet ID':[1,2,3,4,5,6],
    'Pet Age':[10,6,8,1,15,4]})


print(child_df.set_index('Child ID').join(pet_df.set_index('Child ID')))
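If the pets table does not yet carry a 'Child ID' column, one way to introduce the foreign key is to build it once from the per-child counts, while the row ordering can still be trusted. The sketch below uses plain Python lists; `foreign_keys_from_counts` is a hypothetical helper name, and the input values are the ones from the toy tables above.

```python
def foreign_keys_from_counts(parent_ids, counts):
    """Repeat each parent ID once per child row, in the current row order."""
    if len(parent_ids) != len(counts):
        raise ValueError("need exactly one count per parent ID")
    return [pid for pid, count in zip(parent_ids, counts) for _ in range(count)]


child_ids = [1, 2, 3]
pets_per_child = [2, 1, 3]

# One entry per row of the pets table; store it as that table's 'Child ID' column.
pet_child_ids = foreign_keys_from_counts(child_ids, pets_per_child)
```

Once the column is stored, the tables can be re-sorted freely and the join above still lines up the right rows.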


5 participants