Can't read file with quotes in comments #788
Hmmmm, yeah, I can see why this is a bit confusing. The problem is that the commented row essentially "doesn't count" towards the row counts:
julia> CSV.File(IOBuffer("""
# 1'2"
name
junk
1
"""), comment="#", header=1, datarow=3)
1-element CSV.File{false}:
CSV.Row: (name = 1,)
Maybe we can make it clearer in the documentation that, when specifying row numbers, commented rows don't count and should be ignored.
Then I'm even more confused... Note that the only difference between the two examples in the first post is the quotation mark; all comments and row indices are the same. So sometimes the rows are counted including comments, and sometimes not.
Hmmm... you're right; there's still something fishy going on here.
Also, the original example where I noticed this behaviour had many thousands of rows, and none of them were actually read. So I still think the issue is that (sometimes?) quotes in comments are treated as "real" quotes. E.g., see a longer example:
Improves #788. In the original issue, a quote character on a commented row throws off the parser's position, because the parser keeps looking for a closing quote character. By checking for and skipping commented rows, no matter which characters they contain, we ensure parsing integrity. One ramification of this, however, is that commented rows now "no longer count" when considering row numbers, i.e. when specifying the `header=2` or `datarow=4` keyword arguments, because the commented rows are literally ignored when parsing. This seems fine to me, but it probably warrants some documentation so that it's clear.
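To make the mechanism described above concrete, here is a minimal sketch in plain Julia. It is not CSV.jl's actual code: `logical_rows` is a hypothetical helper written for this illustration, and it assumes a simplified model in which a row is complete once it contains an even number of quote characters. With no comment handling, the lone quote on the commented line stays "open" and swallows every following line; checking for the comment prefix before looking at quotes avoids that.

```julia
# Hypothetical helper for illustration only; this is NOT how CSV.jl is implemented.
# Splits `text` into logical rows, keeping a row open while it holds an unmatched quote.
function logical_rows(text; quotechar='"', comment=nothing)
    rows = String[]
    pending = nothing  # text of a row whose quote has not been closed yet
    # Empty lines are dropped here for brevity.
    for line in split(text, '\n'; keepempty=false)
        # Skip commented lines before inspecting any quote characters.
        if pending === nothing && comment !== nothing && startswith(line, comment)
            continue
        end
        current = pending === nothing ? String(line) : pending * "\n" * line
        if isodd(count(==(quotechar), current))
            pending = current      # quote still open: keep accumulating lines
        else
            push!(rows, current)   # balanced quotes: the row is complete
            pending = nothing
        end
    end
    # An unterminated quote ends up swallowing the rest of the input.
    pending === nothing || push!(rows, pending)
    return rows
end

text = """
# 1'2"
name
junk
1
"""

naive = logical_rows(text)                  # comment line treated like data
skipping = logical_rows(text; comment="#")  # comment line skipped first
println(length(naive), " row(s) without comment handling")
println(length(skipping), " row(s) when comments are skipped first")
```

In this simplified model the first call collapses the whole input into a single row, while the second yields the three rows `name`, `junk` and `1`.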
Ok, so with the change/fix in #789, I get consistent results in that commented rows are completely ignored and don't count towards the "row number". I'm trying to think through whether that's fine or really confusing for users.
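The two possible meanings of "row number" being weighed here can be compared with a small plain-Julia sketch. `nth_row` is a hypothetical helper, not part of CSV.jl, and one detail is purely an assumption of this sketch: when skipped lines are counted and the requested number lands on a skipped line, the next real line is returned.

```julia
# Hypothetical helper for illustration only; not part of CSV.jl.
# Returns the line selected as "row n" under one of two counting conventions.
function nth_row(lines, n; comment="#", count_skipped::Bool)
    counted = 0
    for line in lines
        skipped = isempty(line) || startswith(line, comment)
        # Commented/empty lines advance the counter only if count_skipped is true.
        if !skipped || count_skipped
            counted += 1
        end
        # Assumption of this sketch: a skipped line is never returned itself;
        # the first real line at or after position n is used instead.
        if !skipped && counted >= n
            return line
        end
    end
    return nothing
end

lines = ["# 1'2\"", "name", "junk", "1"]

# Commented rows ignored: row 1 is "name", row 3 is "1".
println(nth_row(lines, 1; count_skipped=false), " / ", nth_row(lines, 3; count_skipped=false))

# Commented rows counted: row 1 still resolves to "name", but row 3 is "junk".
println(nth_row(lines, 1; count_skipped=true), " / ", nth_row(lines, 3; count_skipped=true))
```

Under the first convention, asking for rows 1 and 3 picks `name` and `1`, which matches the `header=1, datarow=3` result shown earlier in the thread; under the second, row 3 would be `junk`.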
I would say consistency is important, both within the library and to other common implementations. For now empty lines are not counted in |
Ok, I've updated the PR and actually went the other direction: commented rows and empty rows do count when considering |
Without quotes everything works as expected:
Adding a quote to the first line results in no rows read:
I would guess the issue is somewhere around here: https://github.com/JuliaData/CSV.jl/blob/master/src/detection.jl#L202-L217, but I don't know whether it can be fixed easily.