Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

patch related to Issue258 #319

Closed
wants to merge 55 commits into from
Closed

patch related to Issue258 #319

wants to merge 55 commits into from

Conversation

wkliao
Copy link
Contributor

@wkliao wkliao commented Oct 6, 2016

This patch addresses the special case when calling _uchar APIs to access NC_BYTE attributes or variables in CDF-1 and CDF-2 files.

The special case description can be found in http://www.unidata.ucar.edu/software/netcdf/docs_rc/data_type.html#type_conversion
Quoted here: "In netcdf-3, we treat NC_BYTE as signed for the purposes of conversion to short, int, long, float, or double. (Of course, no conversion takes place when the internal type is signed char.) In the _uchar functions, we treat NC_BYTE as if it were unsigned. Thus, no NC_ERANGE error can occur converting between NC_BYTE and unsigned char."

In other words, if called from signed APIs, NC_BYTE variables are treated as signed. If called from unsigned APIs (i.e. _uchar APIs, the only unsigned APIs in netCDF-3) they are unsigned. NetCDF-3 specifically makes an exception to skip NC_ERANGE check when calling _uchar APIs to access NC_BYTE variables.

However, in netCDF-4 and CDF-5, because of the introduction of the new data type NC_UBYTE, an unsigned 8-bit integer, which makes NC_BYTE an signed 8-bit integer and thus renders the above exception less sense. In this patch, regular NC_ERANGE check is performed in all APIs that access NC_BYTE variables for CDF-5 files.

On the other hand, it still honors the above exception for CDF-1 and 2 files.

Note PnetCDF 1.7.0 does not handle this exception completely. The fix will appear in the next release 1.7.1. If netCDF decides to incorporated this patch in its next release, please suggest users to use PnetCDF 1.7.1.

@WardF WardF added this to the 4.4.1.1 milestone Oct 6, 2016
@WardF WardF self-assigned this Oct 6, 2016
@WardF
Copy link
Member

WardF commented Oct 6, 2016

Seeing some failures on 32-bit ARM.

@wkliao
Copy link
Contributor Author

wkliao commented Oct 6, 2016

I don't understand the issue of arm.
There is a long discussion in issue 159.
Could you briefly explain why we need to care about char being unsigned by default?

Wei-keng

On Oct 6, 2016, at 1:38 PM, Ward Fisher wrote:

Seeing some failures on 32-bit ARM.

http://my.cdash.org/viewBuildError.php?buildid=1062141

You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@WardF
Copy link
Member

WardF commented Oct 6, 2016

I don't think that this has anything to do with issue #159, or char being unsigned by default. In the link I provided to the test failures here, I'm seeing the following:

Error   
 netcdf-c/libsrc/ncx.c:19279:1: error: conflicting types for ‘ncx_getn_text’
  ncx_getn_text(const void **xpp, size_t nelems, char *tp)

I have not had any time to investigate this and the error will likely be easy enough to resolve. I was simply making a note here of the failure on a platform which isn't part of the travis-ci checks, so that I didn't merge before resolving the issue.

@wkliao
Copy link
Contributor Author

wkliao commented Oct 6, 2016

In master branch, the definition of ncx_getn_text in ncx.m4 is
wrapped around by #ifdef arm, I think it has something to
do with #159.

Wei-keng

On Oct 6, 2016, at 2:33 PM, Ward Fisher wrote:

I don't think that this has anything to do with issue #159, or char being unsigned by default. In the link I provided to the test failures here, I'm seeing the following:

Error
netcdf-c/libsrc/ncx.c:19279:1: error: conflicting types for �ncx_getn_text�
ncx_getn_text(const void **xpp, size_t nelems, char *tp)

I have not had any time to investigate this and the error will likely be easy enough to resolve. I was simply making a note here of the failure on a platform which isn't part of the travis-ci checks, so that I didn't merge before resolving the issue.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@WardF
Copy link
Member

WardF commented Oct 6, 2016

Ok, that does indeed tie it back to #159, thanks; I hadn't started poking into it yet. I'll have to refer back to my notes. The unsigned by default ended up being a red herring for #159 as I recall, however; the main issue was undefined behavior manifesting in one way on intel chips and another on ARM. I don't recall specifics, but will when I have a chance to dig into it later this week :).

@wkliao
Copy link
Contributor Author

wkliao commented Oct 7, 2016

Hi, Ward,

I noticed your comments on Dec 23, 2015.
#159 (comment)

You indicates the cause of problem is "on ARM, the cast results in a value of 0"
instead of -1 or 255.

while(nelems-- != 0)
{
    if(*tp > X_UCHAR_MAX || *tp < 0)
        status = NC_ERANGE;
    *xp++ = (uchar)*tp++;  //<- THIS IS THE PROBLEM.
}

When *tp < 0, the if condition above has already detected NC_ERANGE.
However, but the assignment below it is still got executed.

So, the question is "why did it not stop there to break the loop to
skip the assignment?"

If it goes ahead with the assignment, then the outcome of this assignment
should not matter, i.e. nc_test should not check the contents of elements
that cause NC_ERANGE.

I may not fully understand the ARM issue, but I feel the issue is more in
nc_test than in the ncx.m4. Wish I had an access to an ARM machine
to test for other possible errors.

Wei-keng

On Oct 6, 2016, at 2:41 PM, Ward Fisher wrote:

Ok, that does indeed tie it back to #159, thanks; I hadn't started poking into it yet. I'll have to refer back to my notes. The unsigned by default ended up being a red herring for #159 as I recall, however; the main issue was undefined behavior manifesting in one way on intel chips and another on ARM. I don't recall specifics, but will when I have a chance to dig into it later this week :).


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub, or mute the thread.

@edhartnett
Copy link
Contributor

As I recall the ERANGE error is different, in that if it occurs, the data will still be copied, but the error will be returned.

These were the rules built into netCDF-3.

@wkliao
Copy link
Contributor Author

wkliao commented Feb 9, 2017

Here is my analysis.
In put_vars() of utils.c, variable c1's content is set by a call to hash(). The value returned by hash() is -128, X_CHAR_MIN, which is then type-casted from double to char to text[j], resulting text[j] being 128.
The value 128 of NC_CHAR type is printed by ncdump into "\200" on screen.
Now the question is why ghpull-319-arm branch writes "\0" instead of "\200" on ARM.
You might want to add some printf statement to check.

@wkliao
Copy link
Contributor Author

wkliao commented Feb 9, 2017

One question. Is the following setting from nc_test/tests.h intended?

#ifdef` __CHAR_UNSIGNED__
#define X_CHAR_MIN	SCHAR_MIN
#define X_CHAR_MAX	SCHAR_MAX
#else
#define X_CHAR_MIN	CHAR_MIN
#define X_CHAR_MAX	CHAR_MAX
#endif //__unsigned_char__

I assume on ARM, SCHAR_MIN is -128 which makes X_CHAR_MIN negative.

@WardF
Copy link
Member

WardF commented Feb 9, 2017

NC_ECHAR error code will be thrown and should be thrown way before any attempt of type casting.

As I recall from a conversation with Russ, NC_ECHAR is returned, but the conversion is still attempted, at least in the netcdf library. Regarding nc_test, it was written long before my involvement, and I have not had the resources to go through and do any serious revisions to it yet. Perhaps that will change soon!

Now the question is why ghpull-319-arm branch writes "\0" instead of "\200" on ARM.

That is my conclusion as well. I haven't had a chance to come back to this issue yet, but I will try to do some more debugging later today or tomorrow. When this issue was last observed, it was the result of casting from a double to an unsigned char; undefined behavior which, on ARM, results in a '0'.

@wkliao
Copy link
Contributor Author

wkliao commented Feb 9, 2017

If you check netcdf-3.6.1/src/libsrc/putget.c, NC_ECHAR is thrown right after a call to NC_lookupvar() for both put and get APIs, and the API returns immediately. (This also leads to the issue of ERANGE_FILL I raised earlier in this discussion thread.)

As for type casting from a double to an unsigned char on ARM, The same casting is done for both master and ghpull-319-arm branches (I checked that part of codes and they are the same). I am puzzled why casting 128 results in '\0' on ghpull-319-arm but "\200" on master. Only a printf can confirm the case.

…on, may need byte-swap when retrieved to a memory buffer (internal representation)
@wkliao
Copy link
Contributor Author

wkliao commented Feb 10, 2017

I just found a serious bug that needs to be fixed. Sorry to break the promise not to touch branch issue258.

@WardF
Copy link
Member

WardF commented Feb 14, 2017

Merged patch into local branches, continuing to debug.

WardF added a commit that referenced this pull request Feb 14, 2017
@WardF
Copy link
Member

WardF commented Feb 16, 2017

Step one (things compile and run across platforms) nearly complete; dealing with some visual studio related issues right now. Then the next step will be to take a step back and evaluate what is going on, adjust tests as need be, and look at the broader scope of the pull request.

@wkliao
Copy link
Contributor Author

wkliao commented Mar 9, 2017

Hi, Ward
I wonder if I can have an account to access your raspberry pi and help with debugging.

@WardF
Copy link
Member

WardF commented Mar 10, 2017

Thanks for your help debugging this. Hope to have the issues resolved today, then will review functionality and hopefully be able to resolve the pull request and move on to the full release!

@wkliao
Copy link
Contributor Author

wkliao commented Mar 10, 2017

FYI.
I just created a new branch based on your ghpull-319-arm branch and incorporated the fix.
https://github.com/wkliao/netcdf-c/tree/ghpull-319-arm

Tested it on a Redhat box and your raspberry pi without errors.
Two tests were carried out on both machines:

  1. built without netcdf-4, without PnetCDF
  2. build without netcdf-4, but with PnetCDF

@WardF
Copy link
Member

WardF commented Mar 11, 2017

Seeing a build failure under Visual Studio/Windows with the ghpull-319-arm branch from @wkliao but at a glance it looks straight forward to fix. Will address and hopefully put this to bed by early next week. Confirmed tests work on ARM and Linux and OSX.

@wkliao
Copy link
Contributor Author

wkliao commented Mar 11, 2017

Can you point me to the build outputs on the Visual Studio/Windows?

@wkliao
Copy link
Contributor Author

wkliao commented Mar 11, 2017

Just found a bug in ncx.m4 and the bug is due to the way config.h is included. See 7aa8fc3.
I also removed all special treatment for ARM architecture, i.e. arm, in 08aef9a.
All run passed on raspberry pi and redhat.

@WardF
Copy link
Member

WardF commented Mar 11, 2017

Can you open a new pull request based on your new branch and I will evaluate as soon as I can, as well as get the visual studio output posted somewhere.

@wkliao
Copy link
Contributor Author

wkliao commented Mar 11, 2017

Done. The new pull request is #375

@WardF
Copy link
Member

WardF commented Apr 5, 2017

Dennis and I have discussed this (in terms of pull request #375 and we are going to accept it with the following caveats), based on the summary in this comment:

  • The default for erange-fill will be switched to disabled from enabled, to retain default behavior.
  • Options for relaxed coordinate will be renamed to something along the lines of enable zero length coords or similar; based on the comments in Fix variable bounds check for parallel output #243 we want to narrow the implications of the name.

@wkliao a quick question; while we were going back through the issues you reference, it appears that the relaxed coordinate bounds option is specific to pnetcdf related use cases. Is there any reason you can think of not to make this option conditional on pnetcdf support? I.e., this option is not available if you aren't linking against pnetcdf?

Thanks again for your patience and help with this; once this is done we can start looking at additional pull requests; we really appreciate the series of smaller pull requests you've been issuing, it makes things much faster on our end with our limited resources :).

@wkliao
Copy link
Contributor Author

wkliao commented Apr 5, 2017

The default for erange-fill will be switched to disabled from enabled, to retain default behavior.

FYI, if a user is building netCDF using PnetCDF with erange-fill enabled, then netCDF will automatically enable it (I believe I added that in configure.ac and CMakeList.txt).

it appears that the relaxed coordinate bounds option is specific to pnetcdf related use cases. Is there any reason you can think of not to make this option conditional on pnetcdf support? I.e., this option is not available if you aren't linking against pnetcdf?

I believe RELAX_COORD_BOUND feature is also available for netCDF when built without PnetCDF. Try "git grep RELAX_COORD_BOUND". You will see it is used in libsrc and libsrc4.

@WardF
Copy link
Member

WardF commented May 8, 2017

Merged upstream, see #375 comments for more info.

@WardF WardF closed this May 8, 2017
@wkliao wkliao deleted the issue258 branch September 15, 2018 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants