Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test failure with 4.4.0-rc5 on arm #159

Closed
opoplawski opened this issue Nov 20, 2015 · 24 comments
Closed

Test failure with 4.4.0-rc5 on arm #159

opoplawski opened this issue Nov 20, 2015 · 24 comments

Comments

@opoplawski
Copy link
Contributor

I'm seeing a test failure on the Fedora arm builder:

FAIL: nc_test

re-running now to try to get more information...

@opoplawski
Copy link
Contributor Author

Full logs: https://kojipkgs.fedoraproject.org//work/tasks/6563/11926563/build.log

*** testing nc_get_var_schar ... 
    FAILURE at line 1189 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 1189 of ../../nc_test/test_get.c: Range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_var_schar! ###
*** testing nc_get_var1_schar ... 
    FAILURE at line 231 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 231 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 231 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 231 of ../../nc_test/test_get.c: Range error: status = 0
 305 good comparisons. 
    ### 4 FAILURES TESTING nc_get_var1_schar! ###
*** testing nc_get_vara_schar ... 
    FAILURE at line 2496 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 2496 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 2496 of ../../nc_test/test_get.c: Range error: status = 0
 305 good comparisons. 
    ### 3 FAILURES TESTING nc_get_vara_schar! ###
*** testing nc_get_vars_schar ... 
    FAILURE at line 4504 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 4504 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 4504 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 4504 of ../../nc_test/test_get.c: Range error: status = 0
 305 good comparisons. 
    ### 4 FAILURES TESTING nc_get_vars_schar! ###
*** testing nc_get_varm_schar ... 
    FAILURE at line 6619 of ../../nc_test/test_get.c: Range error: status = 0
    FAILURE at line 6619 of ../../nc_test/test_get.c: Range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_varm_schar! ###
*** testing nc_put_var_schar ... 
    FAILURE at line 3610 of ../../nc_test/test_put.c: range error: status = 0
    FAILURE at line 3610 of ../../nc_test/test_put.c: range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_put_var_schar! ###
*** testing nc_put_var1_schar ... 
    FAILURE at line 2394 of ../../nc_test/test_put.c: Range error: status = 0
        for type NC_UBYTE value -1.00000000000000000e+00 -1
    FAILURE at line 2394 of ../../nc_test/test_put.c: Range error: status = 0
        for type NC_UBYTE value -1.00000000000000000e+00 -1
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_put_var1_schar! ###
*** testing nc_put_vara_schar ... 
    FAILURE at line 5404 of ../../nc_test/test_put.c: range error: status = 0
    FAILURE at line 5404 of ../../nc_test/test_put.c: range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_put_vara_schar! ###
*** testing nc_put_vars_schar ... 
    FAILURE at line 7452 of ../../nc_test/test_put.c: range error: status = 0
    FAILURE at line 7452 of ../../nc_test/test_put.c: range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_put_vars_schar! ###
*** testing nc_put_varm_schar ... 
    FAILURE at line 9543 of ../../nc_test/test_put.c: range error: status = 0
    FAILURE at line 9543 of ../../nc_test/test_put.c: range error: status = 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_put_varm_schar! ###
*** testing nc_put_att_schar ... 
    FAILURE at line 11358 of ../../nc_test/test_put.c: range error: status = 0
    FAILURE at line 11358 of ../../nc_test/test_put.c: range error: status = 0
 39 good comparisons. 
    ### 2 FAILURES TESTING nc_put_att_schar! ###

@WardF
Copy link
Member

WardF commented Nov 21, 2015

Thanks for the bug report. I'm curious how I'll debug this; I've got a raspberry pi in the office I can try on if physical hardware is required. Or perhaps simply cross-compiling for 32 bit arm. I'll see what I can sort out. It will be about a week before I can address it in detail as I'll be out of the office next week traveling; also, I need to split my time preparing for AGU. Thanks again for the bug report!

@opoplawski
Copy link
Contributor Author

I have access to a machine if you have any pointers for debugging this.

@opoplawski
Copy link
Contributor Author

So with test_put.c:3604:

       err = nc_put_var_schar(ncid, i, value);
        if (canConvert) {
            if (allInExtRange) {
                IF (err)
                    error("%s", nc_strerror(err));
            } else {
                IF (err != NC_ERANGE && var_dimid[i][0] != RECDIM)
                    error("range error: status = %d", err);
            }

The problem appears to be that allInExtRange is 0 but err = 0 rather than NC_ERANGE? Adding some more debug output:

Switching to 64-bit data format.
*** testing nc_put_var_schar ... 
        FAILURE at line 3610 of ../../nc_test/test_put.c: range error: status = 0, i = 50, var_type[i] = 7, var_rank[i] = 1, var_nels[i] = 3, nels = 3
        FAILURE at line 3610 of ../../nc_test/test_put.c: range error: status = 0, i = 61, var_type[i] = 7, var_rank[i] = 1, var_nels[i] = 4, nels = 4

@opoplawski
Copy link
Contributor Author

And a little more for those failures:

inRange3(0, 7, NCT_SCHAR) = 1
inRange3(127, 7, NCT_SCHAR) = 1
inRange3(-1, 7, NCT_SCHAR) = 0

        FAILURE at line 3611 of ../../nc_test/test_put.c: range error: status = 0, i = 50, var_type[i] = 7, var_rank[i] = 1, var_nels[i] = 3, nels = 3inRange3(0, 8, NCT_SCHAR) = 1

So inRange3() is failing?

@opoplawski
Copy link
Contributor Author

Okay, I think I found the underlying cause here - http://blog.cdleary.com/2012/11/arm-chars-are-unsigned-by-default/
I've run into this before as well, and I suspect I'll run into it again....

@opoplawski
Copy link
Contributor Author

So I think this is because the test incorrectly assumes you can convert schar to char, which doesn't work if char == unsigned char.

@WardF
Copy link
Member

WardF commented Dec 2, 2015

Sorry for the silence this week, I've been getting caught up on the week off and also preparing for AGU.

That's a good catch, thanks for your help with debugging and sorting out the issue! I've got ARM hardware at my disposal (in the form of a Raspberry PI), I just need to dust it off and get it up and running so that I can implement a fix for this.

Thanks again for the reference and your investigation; it wouldn't have been the first thing I considered when trying to diagnose this :).

@WardF
Copy link
Member

WardF commented Dec 3, 2015

UPDATE: Withdrawn, I've duplicated the issue.

Ok, I got the raspberry pi up and running, got dependencies installed so that I could run the tests and fix this issue and... I'm unable to duplicate the problem, naturally.

I'm working on a raspberry pi 2 model B, where the CPU is a 32-bit 900MHz quad-core ARM Cortex-A7 CPU. I'm working in the latest raspbian OS, where uname is Linux rasberrypi 4.1.13-v7+ SMP PREEMPT Fri Nov 13 20:19:03 GMT 2015 armv7l GNU/Linux. Running gcc --version shows gcc (Raspbian 4.9.2-10) 4.9.2.

I have an older raspberry pi (using an ARM A6, I believe?) that I will try for the sake of thoroughness, but it should probably wait a bit until I at least get a draft of my AGU talk finished.

@opoplawski can you provide some information regarding the environment where you're seeing this issue? I don't doubt your diagnosis, I just don't want to try to fix it without being able to check my work.

@WardF
Copy link
Member

WardF commented Dec 3, 2015

I withdraw my previous comment, but I didn't want to silently delete it. I have duplicated the issue.

@opoplawski
Copy link
Contributor Author

great. Just for the record my machine is an ARMv7 Calxeda Highbank processor (armv7l) running Fedora 23 with gcc 5.1.1 20150618 (Red Hat 5.1.1-4)

@WardF
Copy link
Member

WardF commented Dec 4, 2015

Ok, I have a fix in place for cmake-based builds; we check during configuration time whether char is signed or unsigned. I just need to wire a similar check into the autoconf build and I'll merge the changes into master, hopefully early next week.

@WardF
Copy link
Member

WardF commented Dec 7, 2015

Ok, the fix I made was not comprehensive so, as always, work continues.

@WardF
Copy link
Member

WardF commented Dec 22, 2015

This issue is proving more troublesome than expected; there is an element of it which is related to chars being unsigned by default, however other issues arise when this is fixed. You can see this for yourself by setting CPPFLAGS=-fsigned-char.

In a nutshell, the cdf5 test file created by nc_test is not being written correctly. I've confirmed (via gdb) that the values in memory appear to be correct up to the point they are written to file; when the file is inspected, it contains an incorrect value for variable y3 (amongst others).

So, debugging this issue continues but may be put on the back burner for a little while to catch up on other issues that are accruing.

@WardF
Copy link
Member

WardF commented Dec 22, 2015

There were a number of issues in nc_test which I were able to debug and resolve on OSX linux via the use of -funsigned-char to force chars to be unsigned by default.

We are still seeing failures (e.g. expected -128, got 0, etc) when running the tests on ARM. These failures are happening even when I force signed chars via -fsigned-char. This supports my conclusion that the ARM issue is not just the difference between default char signedness, although that was an issue.

@WardF
Copy link
Member

WardF commented Dec 23, 2015

The root of the issue has been identified, finally, and a fix has been tested. The issue was in ncx.c:ncx_putn_uchar_double():

ncx_putn_uchar_double(void **xpp, size_t nelems, const double *tp)
{
    int status = ENOERR;
    uchar *xp = (uchar *) *xpp;

    while(nelems-- != 0)
    {
        if(*tp > X_UCHAR_MAX || *tp < 0)
            status = NC_ERANGE;
        *xp++ = (uchar)*tp++; //<- THIS IS THE PROBLEM.
    }

    *xpp = (void *)xp;
    return status;
}

The specific issue is that casting a negative double to an unsigned char is undefined behavior. Indications are that on intel-based hardware, the bit pattern will be preserved, and so thus will the expected value (-1 or 255 depending on how you look at it, in this case). However, on ARM, the cast results in a value of 0. This is not what the test expects and thus we see the failures observed after we have eliminated the the range errors stemming from the original char is unsigned by default on ARM issue.

Note: The error was only present in the test itself; in normal use an error ERANGE (-60) would be thrown when this conversion is attempted. nc_test explicitly ignores ERANGE errors.

The fix for this is to explicitly cast the source data type to (signed), e.g.

    *xpp++ = (uchar)(signed)*tp++;

This results in the proper value being preserved.

TODO

To fix this issue, I need to clean up my working branch, remove stray files, clean up the tests I've added, etc. Once done and I'm convinced this works and has not introduced additional issues, I will merge it into master and close this issue out.

WardF added a commit that referenced this issue Dec 23, 2015
… overly broad; I will refine it once I've had a chance to read up on m4.
@DennisHeimbigner
Copy link
Collaborator

Nice catch!

@WardF
Copy link
Member

WardF commented Dec 23, 2015

Thank you. In testing the fix I've uncovered a couple further issues that weren't cropping up before, but once again they seem tied to the signed/unsigned char thing. A workaround would be to recommend/enforce compiling with -fsigned-char, but these all look familiar with other issues I've squashed so I'm taking a stab at just getting them fixed now.

@WardF
Copy link
Member

WardF commented Dec 28, 2015

After several more trips through the *"Fix bug/run tests/identify new bug" cycle, it appears that everything is now working properly across various platforms. This may be premature as I still have to test Windows, but that aside, I'm happy with the fix. The issues turned out to be a combination of:

  1. Assuming char is signed.
  2. implicit casting from a negative double to an unsigned char. (from the standard such behavior is undefined but seems to work on intel architectures. On ARM, the recipient is assigned a value of 0).

The fix is a mix of #ifdef's checking for

  1. __UNSIGNED_CHAR__
  2. __arm__

The latter issue turned out to much more difficult to track down, as many of the assignments where this happened weren't obvious; they were usually of the form unsigned char = (double*)(pointer + large offset).

I'm going through and cleaning up a lot of the detritus from the debugging process, and then I will document the fix/issue fully, as well as providing information re: expected behavior, etc. Once this is finished I will finally address the stack of other issues that have opened up, starting with pull request @dmh recently opened and that should be pretty straightforward.

I will close out this issue when the fix is merged into master; due to the nature of the fix, I may issue it as a pull request instead of a merge, so that @dmh and whomever else may review it. I haven't looked at it broadly, but I worry that the applied fix may be overly broad.

@opoplawski : I only have access to the Raspberry Pi mentioned above; if this fix does not address the issue on your platform, please let me know.

@WardF WardF mentioned this issue Dec 29, 2015
@WardF
Copy link
Member

WardF commented Dec 30, 2015

Closing issue; resolved in merge #180 unless I find out otherwise.

@opoplawski
Copy link
Contributor Author

I'm afraid we're back to having problems with 4.4.0 on arm. First off, it looks like libsrc/ncx.c in the 4.4.0 tarball was not generated with the latest ncx.m4 source, as it's missing the #ifdef arm conditionals around the ncx_*n_text() functions.

After fixing that I get:

*** testing nc_get_var_schar ... 
    FAILURE at line 1200 of ../../nc_test/test_get.c: value read not that expected
    FAILURE at line 1200 of ../../nc_test/test_get.c: value read not that expected
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_var_schar! ###
*** testing nc_get_var1_schar ... 
    FAILURE at line 223 of ../../nc_test/test_get.c: expected: -1, got: 0
    FAILURE at line 223 of ../../nc_test/test_get.c: expected: -1, got: 0
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_var1_schar! ###
*** testing nc_get_vara_schar ... 
    FAILURE at line 2510 of ../../nc_test/test_get.c: value read not that expected
    FAILURE at line 2510 of ../../nc_test/test_get.c: value read not that expected
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_vara_schar! ###
*** testing nc_get_vars_schar ... 
    FAILURE at line 4520 of ../../nc_test/test_get.c: value read not that expected
    FAILURE at line 4520 of ../../nc_test/test_get.c: value read not that expected
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_vars_schar! ###
*** testing nc_get_varm_schar ... 
    FAILURE at line 6637 of ../../nc_test/test_get.c: value read not that expected
    FAILURE at line 6637 of ../../nc_test/test_get.c: value read not that expected
 305 good comparisons. 
    ### 2 FAILURES TESTING nc_get_varm_schar! ###
*** testing nc_get_att_schar ... 
    FAILURE at line 8538 of ../../nc_test/test_get.c: value read not that expected
    FAILURE at line 8538 of ../../nc_test/test_get.c: value read not that expected
 39 good comparisons. 
    ### 2 FAILURES TESTING nc_get_att_schar! ###
*** Total number of failures: 12
*** nc_test FAILURE!!!
FAIL nc_test (exit status: 2)

@WardF
Copy link
Member

WardF commented Jan 22, 2016

No need to capture my heavy sigh in the issue notes, so I'll simply go fix this. It will be in the upcoming 4.4.1, which will be a maintenance release in the short-term future for this and other small things that have cropped up.

@WardF WardF reopened this Jan 22, 2016
@WardF
Copy link
Member

WardF commented Jan 22, 2016

In the meantime you should be able to regenerate the ncx.c manually and have it work.

@WardF WardF modified the milestones: 4.4.1, 4.4.0 Jan 22, 2016
@WardF
Copy link
Member

WardF commented Jan 22, 2016

Actually, I'm going to close this back out, as the fix is in the release process not the code itself. I've updated my (internal) release documentation to emphasize ensuring that the latest generated files are present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants