Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[real data test] Small dataset warnings etc #523

Closed
tlienart opened this issue Oct 30, 2019 · 8 comments
Closed

[real data test] Small dataset warnings etc #523

tlienart opened this issue Oct 30, 2019 · 8 comments

Comments

@tlienart
Copy link
Contributor

tlienart commented Oct 30, 2019

Hello, I encountered a few issues when trying to read a simple UCI dataset and thought I'd report what happened / what I ended up doing (this is with CSV 5.14)

using HTTP, CSV
req = HTTP.get("https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data")
CSV.read(req.body)

this by itself works in that it returns a DataFrame with the right dimensions but gives out

┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
└ @ CSV ~/.julia/packages/CSV/skXyc/src/detection.jl:369

and also

thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 1. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 2. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 3. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 4. Ignoring any extra columns on this row
thread = 2 warning: parsed expected 28 columns, but didn't reach end of line on data row: 1. Ignoring any extra columns on this row

(one per line)

To get this to fully work and not display a series of warnings I had to write

CSV.read(req.body, delim=' ', missingstring="?", 
                threaded=false, silencewarnings=true)

Did I do something wrong? is there something weird that prevents the data from being read properly while still getting the right final answer? also I don't understand the warnings about not getting to the end of line and still managing to get the file parsed fine

@tlienart tlienart changed the title Small dataset warnings etc [real data test] Small dataset warnings etc Oct 30, 2019
@quinnj
Copy link
Member

quinnj commented Oct 30, 2019

Screen Shot 2019-10-30 at 5 32 17 PM

So this dataset is a bit odd; here's a screenshot of the file in my editor and you can see that the first row doesn't have a trailing space, while the subsequent rows do (the little dot is the space at the end). There are some ~15 rows in the dataset that don't have the trailing space, so it's a bit unclear if there should be 29 columns, or just 28 and the extra space is part of the 28th integer column. This is leading to all the parsing warnings.

I'm still looking into why we're seeing the really scary "something went wrong" warning and I'll report back; just wanted to post what I noticed so far and see if @tlienart knew a little more what the dataset shape/size should be.

@quinnj
Copy link
Member

quinnj commented Oct 30, 2019

Oh, I can answer the next part; because CSV.jl can't quite figure out the correct # of columns to expect, it shows the really scare "something went wrong" error. I think we can avoid the scary warning in this part, because we already decided how many columns we're going with in the first detection phase, so in chunking up the file, we can apply the same logic for ignoring extra/too few columns for a row.

That will avoid printing the scary warnings for this file, though as I noted before, unless we know a little more about the true shape/size of the data, we'll still get some warnings.

@tlienart
Copy link
Contributor Author

Sounds like this file is a potentially useful test case then 😄

It all sounds good, with respect to the trailing whitespace, I understand that's odd and all but shouldn't we wish for CSV to not be thrown off by this? I don't know the internals of CSV so maybe it's really difficult, but isn't it a case of how you detect end of line where the regex would be \s*\n or something like it?

@quinnj
Copy link
Member

quinnj commented Oct 31, 2019

Well, as far as I can tell, this is an actual ambiguous case, not just something CSV.jl can't handle. So it's not just CSV.jl being thrown off. Particularly because the delimiter is ' '; that would be like having:

a,b,c
1,2,3,
4,5,6
7,8,9
10,11,12,

note that the 1st and 4th rows of data have a trailing comma, which indicates a 4th column with a missing value.

We do have the ignorerepeated=true option, which allows for cases like ignoring a ' ' followed by a \n, but this file still has the problem of being inconsistent between rows, so we detect 28 columns on the first row and then warn because there's a 29th column on other rows.

@tlienart
Copy link
Contributor Author

tlienart commented Oct 31, 2019

Yeah that makes sense thanks for explaining; out of curiosity how would you generally handle corrupt files like this?

@quinnj
Copy link
Member

quinnj commented Nov 5, 2019

Ok, after staring at this file and related code between CSV.jl and Parsers.jl for a couple days (somewhat in a fog, I admit), I think I've found an actual bug and solution. Interestingly enough, I even mention the correct behavior in my last comment above, by saying ignorerepeated=true should ignore a delimiter followed by a newline. Here's a PR that provides a fix in Parsers.jl, where if you pass ignorerepeated=true, a newline directly following a matched delimiter will be correctly ignored. This is correct because, in addition to the provided delimiter char or string, Parsers.jl automatically treats newlines as valid delimiters, and so should equally ignore them if they "repeat" the delimiter (i.e. directly follow).

What all that means is that for this file, you can use CSV.read(file, missingstring="?", delim=' ', ignorerepeated=true, header=false) and it correctly reads the file with no warnings.

@tlienart
Copy link
Contributor Author

tlienart commented Nov 5, 2019

fantastic, thanks

@quinnj
Copy link
Member

quinnj commented Nov 5, 2019

Closing this as the new Parsers release has been made and there's nothing else I think we should do in CSV.jl here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants