Minor issues in the dsv_generic script #36

Open
XavierBeguin opened this issue Oct 23, 2019 · 0 comments

TL;DR: I've found some minor issues in dsv_generic whose fixes (introduced in the temporary script dsv_generic.new) need to be discussed here to make sure the right choices were made. Sorry for the lengthy text, but it is unfortunately required to describe this kind of issue accurately.

I stumbled on some minor issues in dsv_generic which I propose to fix in 2542a69. This commit introduces a new version under the name dsv_generic.new that can be tested on selected procs but should be renamed once we agree on the changes and fixes discussed below.

Note: trying to fix the first issue without breaking anything else led me to create some automated tests for dsv_generic (introduced by a065cc9), which are a really useful tool for avoiding regressions and for deciding and documenting the desired behaviour on a series of corner cases. (See the README.md file for details on how to use this tool.)

  1. The first issue is about fields in the source CSV with the value "-INF", which is currently transformed into "-" on the output and breaks the proc execution, as "-" is not a numerical value. The proposed update changes the code to replace with NaN not only empty fields, but also fields that consist only of characters used in numerical values but contain no digit. This is done using this line (where $i is a field): gsub(/^[\seE.-]*$/, "NaN", $i)
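For illustration, here is a minimal, self-contained sketch of the strip-then-NaN idea (not the actual dsv_generic code; the character classes are simplified and the ; field separator is an assumption):

```shell
#!/bin/sh
# Sketch: non-numeric characters are stripped from each field first,
# so a field of "-INF" becomes "-"; any field left without a single
# digit is then replaced by NaN so the proc gets a parsable value.
printf '1.5;-INF;2e3\n' | awk -F';' -v OFS=';' '{
    for (i = 1; i <= NF; i++) {
        gsub(/[^0-9eE.+-]/, "", $i)     # drop non-numeric characters
        gsub(/^[ eE.+-]*$/, "NaN", $i)  # no digit left: use NaN instead
    }
    print
}'
# prints: 1.5;NaN;2e3
```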

  2. Another problem I found is that the code is meant to remove commas from the input fields (as it is a non-numerical character) but does not, due to a bug on line 75:
    gsub(sprintf("[^0-9eE.+-%s]", OFS), "")
    which should really be written:
    gsub(sprintf("[^0-9eE.+%s-]", OFS), "")
    The first line above unwittingly introduces a character range [+-;] that excludes from the removal any character whose ASCII code lies between + and ; (the value of OFS in the script), namely +,-./0123456789:. Thankfully, only the comma seems to be a concern for our use.
    Before the above correction, a field of 1,2 (while using a field separator different from the comma) would be kept as is: 1,2, which results in a non-numerical value on the output that can crash the proc.
    Now, a value of 1,2 is transformed into 12, which is a syntactically correct though meaningless numerical value. At least this behaviour is now consistent with every other non-numerical character and with the general approach of readfmtdata_dsv, which assumes that the input data are well formed.
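The accidental range can be demonstrated in isolation. This is a sketch using the same sprintf pattern with OFS set to ; as in the script, not the script itself:

```shell
#!/bin/sh
# Buggy pattern: with OFS=";", sprintf yields [^0-9eE.+-;], where
# "+-;" forms a character range covering the comma, so commas escape removal.
echo '1,2' | awk -v OFS=';' '{ gsub(sprintf("[^0-9eE.+-%s]", OFS), ""); print }'
# prints: 1,2   (comma wrongly kept)

# Fixed pattern: "-" moved to the end of the class, where it is literal.
echo '1,2' | awk -v OFS=';' '{ gsub(sprintf("[^0-9eE.+%s-]", OFS), ""); print }'
# prints: 12    (comma removed, as intended)
```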

  3. I could be nitpicking here, but while testing dsv_generic, I realised that when no number of fields is provided as an argument, the script will discard lines having only 3 fields, given line 68:
    !test_fields && NF <= 3 { next }
    However, one could think of correct lines with only 3 columns: the shortest timestamp should probably be something like YYYY-DDD, so a line like 2019;296;3 could potentially form valid data. I therefore think this default limit should be lowered with this simple correction:
    !test_fields && NF < 3 { next }
    This is however an arbitrary choice, as we could imagine a proc that accepts timestamps like YYYYDDD, so a line like 2019296;3 could be valid too, although dsv_generic will reject it by default. This exclusion test is therefore probably not really meaningful.
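To make the difference concrete, here is a small sketch of the two filters side by side (assuming ; as the field separator):

```shell
#!/bin/sh
# With "NF <= 3 { next }", a 3-field line such as "2019;296;3" is
# discarded along with shorter lines.
printf '2019;296;3\n2019;3\n' | awk -F';' 'NF <= 3 { next } { print }'
# prints nothing

# With "NF < 3 { next }", the 3-field line survives; only shorter
# lines are dropped.
printf '2019;296;3\n2019;3\n' | awk -F';' 'NF < 3 { next } { print }'
# prints: 2019;296;3
```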

Finally, following my personal habit, dsv_generic.new reorganises things a bit to make the script even easier to read and maintain, and introduces more comments to record the purpose of each instruction, so we don't break things simply because we forgot they were there for a specific reason.

The newly introduced tests for dsv_generic take into account the choices made for the fixes discussed above. Should these choices be modified, the tests will need to be updated to reflect them. This follows the workflow that should be used for updating any program covered by unit tests, which is more or less:

  1. add/fix a test;
  2. watch the test fail (this validates the correctness of the test);
  3. fix the program;
  4. watch the test pass (this validates the correction of the program).
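As a hypothetical illustration (this is not the actual test harness described in README.md), the cycle can be sketched with a tiny shell assertion around an awk one-liner:

```shell
#!/bin/sh
# assert_output: compare expected vs actual output, report PASS/FAIL.
assert_output() {
    if [ "$2" = "$1" ]; then
        echo "PASS"
    else
        echo "FAIL: expected '$1', got '$2'"
    fi
}

# Step 2 (red): a buggy filter keeps the comma, so the test fails.
assert_output "12" "$(echo '1,2' | awk '{ gsub(/[^0-9,]/, ""); print }')"
# prints: FAIL: expected '12', got '1,2'

# Step 4 (green): after fixing the filter, the same test passes.
assert_output "12" "$(echo '1,2' | awk '{ gsub(/[^0-9]/, ""); print }')"
# prints: PASS
```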

Working with such tests adds a tad more work, but it's really worth it, as it effectively protects against errors and regressions.
