TL;DR: I've found some minor issues in dsv_generic whose fixes (introduced in the temporary script dsv_generic.new) need to be discussed here to make sure the right choices were made. Sorry for the lengthy text, but it is unfortunately required to accurately describe this kind of issue.
I stumbled on some minor issues in dsv_generic which I propose to fix in 2542a69. This commit introduces a new version under the name dsv_generic.new that can be tested on selected procs but should be renamed once we agree on the changes and fixes discussed below.
Note: trying to fix the first issue without breaking anything else led me to create some automated tests for dsv_generic (introduced by a065cc9), which are a really useful tool to avoid regressions and decide and document the desired behaviour on a series of corner cases. (See the README.md file for details on how to use this tool.)
The first issue concerns fields in the source CSV with the value "-INF", which is currently transformed into "-" on output and breaks the proc execution, as it is not a numerical value. The proposed update changes the code to replace with NaN not only empty fields, but also fields that consist of characters used in numerical values yet contain no digit. This is done with this line (where `$i` is a field): `gsub(/^[\seE.-]*$/, "NaN", $i)`
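For illustration, here is a minimal standalone sketch of that two-step cleanup (non-numerical characters stripped first, then digit-free fields replaced with NaN). It is not the actual dsv_generic code: it assumes `;` as the field separator and uses the POSIX class `[[:space:]]` in place of the gawk-specific `\s` from the patch:

```shell
# Hypothetical per-field cleanup:
# 1) strip characters that cannot appear in a number,
# 2) replace fields left without any digit (e.g. "-INF" -> "-") with NaN.
printf '2019;-INF;;1.5e3\n' | awk -F';' -v OFS=';' '{
  for (i = 1; i <= NF; i++) {
    gsub(/[^0-9eE.+;-]/, "", $i)        # keep only numerical characters
    if ($i ~ /^[[:space:]eE.+-]*$/)     # no digit left: not a number
      $i = "NaN"
  }
  print
}'
# prints 2019;NaN;NaN;1.5e3
```

Note that the empty field is caught by the same test as the stripped "-INF", which is exactly the unification the patch aims for.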
Another problem I found is that the code is meant to remove commas from the input fields (the comma being a non-numerical character) but does not, due to a bug on line 75: `gsub(sprintf("[^0-9eE.+-%s]", OFS), "")`
which should really be written: `gsub(sprintf("[^0-9eE.+%s-]", OFS), "")`
The first form unwittingly introduces a character range `[+-;]` that excludes from removal any character whose ASCII code lies between `+` and `;` (the value of OFS in the script), namely `+,-./0123456789:`. Thankfully, only the comma seems to be a concern for our use.
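The effect of that accidental range can be checked in isolation; a small sketch, assuming OFS has already been expanded to `;`:

```shell
# With the hyphen between + and the separator, "+-;" is a character range
# covering the comma (ASCII 0x2C), so the comma survives the removal:
echo '1,2' | awk '{ gsub(/[^0-9eE.+-;]/, ""); print }'   # prints 1,2
# With the hyphen moved to the end of the class it is a literal,
# and the comma is removed as intended:
echo '1,2' | awk '{ gsub(/[^0-9eE.+;-]/, ""); print }'   # prints 12
```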
Before this correction, a field of `1,2` (when using a field separator different from the comma) would be kept as is: `1,2`, which results in a non-numerical value on output that can crash the proc.
Now, a value of `1,2` is transformed into `12`, which is a syntactically correct though meaningless numerical value. At least this behaviour is now consistent with every non-numerical character other than `,`, and with the general approach of readfmtdata_dsv, which assumes that the input data are well formed.
I may be nitpicking here, but while testing dsv_generic, I realised that when no number of fields is provided as an argument, the script discards lines having 3 or fewer fields on output, given line 68: `!test_fields && NF <= 3 { next }`
However, one can imagine correct lines with only 3 columns: the shortest timestamp should probably be something like YYYY-DDD, so a line like `2019;296;3` on output could potentially form valid data. I therefore think this default limit should be lowered with this simple correction: `!test_fields && NF < 3 { next }`
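A quick way to see the difference between the two tests on a 3-field line (a sketch, assuming the `;` separator and ignoring the `test_fields` guard):

```shell
# Current test: a valid-looking 3-field line is discarded.
printf '2019;296;3\n12;34\n' | awk -F';' 'NF <= 3 { next } { print }'
# prints nothing
# Proposed test: the 3-field line passes; the 2-field line is still dropped.
printf '2019;296;3\n12;34\n' | awk -F';' 'NF < 3 { next } { print }'
# prints 2019;296;3
```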
This is however an arbitrary choice: we could imagine a proc that accepts timestamps like YYYYDDD, so a line like `2019296;3` could be valid too, although dsv_generic will reject it by default. This exclusion test is therefore probably not very meaningful.
Finally, following my personal habit, dsv_generic.new reorganises things a bit to make the script easier to read and maintain, and introduces more comments to help remember the purpose of each instruction, so that we do not break things simply because we forgot why they were there.
The newly introduced tests for dsv_generic take into account the choices made for the fixes as discussed above. Should these choices be modified, the tests will need to be updated to reflect them. This follows the workflow that should be used for updating any program covered by unit tests, which is more or less:
1. add or fix a test;
2. watch the test fail (this validates the correctness of the test);
3. fix the program;
4. watch the test pass (this validates the correction of the program).
Working with such tests adds a bit more work, but it is well worth it, as it efficiently protects against errors and regressions.