-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LittleEndianInputStream EOF test is incorrect #1614
Comments
This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values. Currently EOFException seems to be triggered correctly when the end of file is reached prematurely, but it's also triggered incorrectly on some non-ASCII characters which truncate to negative byte values. It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly. How to fix it depends on what file encoding we're intending to support which means digging into what this thing actually is used for. |
@lbergelson File encoding should be irrelevant, this is a raw byte stream. Regardless of other issues casting in.read() to a byte before that test is incorrect, I think. A positive int can become a negative byte. |
InputStream.read() is supposed to return an integer between 0 and 255, or -1 if EOF, there is no encoding involved. So 1/2 of the values it can legally return will be negative when cast to a byte. |
It's a weird convention, but its been like that since the beginning. I suppose the contract is for an "int" specifically so they can have a return value (-1) which indicates EOF. If the contract was for a byte there would be no available values to indicate EOF. |
@jrobinso I agree that it's currently wrong and the fix you proposed is an improvement. What I meant to say is that we have to decide what the file encoding should be when we do the conversion to String because we will get incorrect/inconsistent answers if there's a mismatch and the file contains non-ascii values even once your fix goes in. |
@lbergelson I still don't see what encoding has to do with this class, which is a stream reader for raw bytes. You can decode these bytes anyway you please, but that is downstream. In many cases the bytes will not be decoded to strings at all, but again what is done with the bytes is no concern here. |
@jrobinso It's relevant because the method in question returns a string as it's result, so it has to make a decision about what the bytes mean. |
@lbergelson OK, yes, sorry, I was focused on the test in question, the part below. It is most definitely a bug to cast the result from in.read() to a byte before testing for b<0. An EOF exception should not be thrown unless -1 is returned as an int. If there is a bug decoding the string later that would be a different problem. This method probably doesn't belong on LittleEndianInputStream, but since it's there maybe it should take the encoding as an argument, defaulting to utf-8. In practice this method is used to read genomic file formats, many which specify ascii text. If some other character encoding is used we might get a garbled string, or some other error, but we shouldn't see an EOFException.
|
Here's the relevant text for GFF format, which I think is pretty typical among genomic file formats. The term "plain text" might be ambiguous, but I think is understand to mean ascii " GFF3 files are 9-column, tab-delimited plain text files. Literal tabs, newline, carriage return, percent sign and control characters must be encoded using RFC 3986 Percent-Encoding " The SAM specification is more specific "Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1". WRT "specified elsewhere", a number of fields allow UTF-8. |
This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values. EOF is triggered correctly, but it's also triggered on various potential characters which become negative values when they are truncated to byte. It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly.
This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values. EOF is triggered correctly, but it's also triggered on various potential characters which become negative values when they are truncated to byte. It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly.
As reported by @jrobinso: The test for EOF in
LittleEndianInputStream.readString
is wrong because in.read() is cast to a byte. The contract for read() returns an "int", and the signal for EOF is the return value is -1. However this is no longer a valid test after casting to a byte. The correct thing to do here, I think, is type "b" as an int, the cast to byte only when written to "bis".The text was updated successfully, but these errors were encountered: