[real data test] Small dataset warnings etc #523

tlienart · 2019-10-30T17:10:40Z

Hello, I ran into a few issues when trying to read a simple UCI dataset and thought I'd report what happened and what I ended up doing (this is with CSV 5.14):

using HTTP, CSV
req = HTTP.get("https://archive.ics.uci.edu/ml/machine-learning-databases/horse-colic/horse-colic.data")
CSV.read(req.body)

This works by itself, in that it returns a DataFrame with the right dimensions, but it emits this warning:

┌ Warning: 2; something went wrong trying to determine row positions for multithreading; it'd be very helpful if you could open an issue at https://github.com/JuliaData/CSV.jl/issues so package authors can investigate
└ @ CSV ~/.julia/packages/CSV/skXyc/src/detection.jl:369

and also

thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 1. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 2. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 3. Ignoring any extra columns on this row
thread = 1 warning: parsed expected 28 columns, but didn't reach end of line on data row: 4. Ignoring any extra columns on this row
thread = 2 warning: parsed expected 28 columns, but didn't reach end of line on data row: 1. Ignoring any extra columns on this row

(one per line)

To get this to fully work and not display a series of warnings I had to write

CSV.read(req.body, delim=' ', missingstring="?", 
                threaded=false, silencewarnings=true)

Did I do something wrong? Is there something odd in the file that prevents it from being read cleanly, even though the final result is right? I also don't understand how the parser can warn about not reaching the end of a line and still parse the file fine.


quinnj · 2019-10-30T23:37:06Z

So this dataset is a bit odd; here's a screenshot of the file in my editor and you can see that the first row doesn't have a trailing space, while the subsequent rows do (the little dot is the space at the end). There are some ~15 rows in the dataset that don't have the trailing space, so it's a bit unclear if there should be 29 columns, or just 28 and the extra space is part of the 28th integer column. This is leading to all the parsing warnings.

I'm still looking into why we're seeing the really scary "something went wrong" warning and I'll report back; just wanted to post what I noticed so far and see if @tlienart knew a little more what the dataset shape/size should be.

quinnj · 2019-10-30T23:44:14Z

Oh, I can answer the next part: because CSV.jl can't quite figure out the correct # of columns to expect, it shows the really scary "something went wrong" warning. I think we can avoid the scary warning in this case, because we already decided how many columns we're going with in the first detection phase, so when chunking up the file we can apply the same logic for ignoring extra/too few columns for a row.

That will avoid printing the scary warnings for this file, though as I noted before, unless we know a little more about the true shape/size of the data, we'll still get some warnings.
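To make "ignoring extra/too few columns for a row" concrete, here is a minimal sketch in plain Julia. It is not CSV.jl's implementation; `fit_row` is a hypothetical helper that assumes the column count was already fixed by an earlier detection step, then drops extra fields and pads short rows with `missing`.

```julia
# Hypothetical helper, not part of CSV.jl: force one row to exactly `ncols`
# fields. Extra fields are dropped; missing trailing fields become `missing`.
function fit_row(fields::AbstractVector{<:AbstractString}, ncols::Int)
    row = Vector{Union{Missing,String}}(missing, ncols)
    for i in 1:min(ncols, length(fields))
        row[i] = fields[i]
    end
    return row
end

# Assume detection settled on 3 columns.
too_long  = fit_row(split("1,2,3,", ','), 3)  # trailing empty 4th field is dropped
too_short = fit_row(split("4,5", ','), 3)     # 3rd field is filled with `missing`

println(too_long)
println(too_short)
```

A real parser would also emit a warning in both cases, as the messages quoted at the top of this issue do.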

tlienart · 2019-10-31T10:49:52Z

Sounds like this file is a potentially useful test case then 😄

It all sounds good. With respect to the trailing whitespace, I understand that it's odd, but shouldn't we want CSV not to be thrown off by it? I don't know the internals of CSV, so maybe it's really difficult, but isn't it a matter of how you detect the end of a line, where the regex would be \s*\n or something like it?

quinnj · 2019-10-31T15:20:27Z

Well, as far as I can tell, this is an actual ambiguous case, not just something CSV.jl can't handle. So it's not just CSV.jl being thrown off. Particularly because the delimiter is ' '; that would be like having:

a,b,c
1,2,3,
4,5,6
7,8,9
10,11,12,

note that the 1st and 4th rows of data have a trailing comma, which indicates a 4th column with a missing value.
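The inconsistency can be seen with a naive field count in plain Julia. This is only a sketch of the ambiguity, not how CSV.jl parses: it splits each line on the delimiter and keeps empty fields.

```julia
# The example file from the comment above, as a string.
data = """
a,b,c
1,2,3,
4,5,6
7,8,9
10,11,12,
"""

# Naive field count per line: split on ',' and keep empty trailing fields.
field_counts = [length(split(line, ',')) for line in eachline(IOBuffer(data))]

println(field_counts)  # rows ending in a comma report one more field than the header
```

With counts that disagree from row to row, a parser has to pick either 3 or 4 columns and warn about the rows that don't match.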

We do have the ignorerepeated=true option, which allows for cases like ignoring a ' ' followed by a \n, but this file still has the problem of being inconsistent between rows, so we detect 28 columns on the first row and then warn because there's a 29th column on other rows.

tlienart · 2019-10-31T16:03:26Z

Yeah, that makes sense, thanks for explaining. Out of curiosity, how would you generally handle corrupt files like this?

quinnj · 2019-11-05T06:09:00Z

Ok, after staring at this file and related code between CSV.jl and Parsers.jl for a couple days (somewhat in a fog, I admit), I think I've found an actual bug and solution. Interestingly enough, I even mention the correct behavior in my last comment above, by saying ignorerepeated=true should ignore a delimiter followed by a newline. Here's a PR that provides a fix in Parsers.jl, where if you pass ignorerepeated=true, a newline directly following a matched delimiter will be correctly ignored. This is correct because, in addition to the provided delimiter char or string, Parsers.jl automatically treats newlines as valid delimiters, and so should equally ignore them if they "repeat" the delimiter (i.e. directly follow).

What all that means is that for this file, you can use CSV.read(file, missingstring="?", delim=' ', ignorerepeated=true, header=false) and it correctly reads the file with no warnings.
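As a rough illustration of why ignoring a delimiter that sits directly before a newline makes the rows consistent, here is a sketch in plain Julia. `count_fields` is a hypothetical helper, not part of CSV.jl or Parsers.jl, and the two rows are made up in the spirit of the horse-colic file. Dropping empty fields with `keepempty=false` only approximates `ignorerepeated=true`; it would also drop a leading empty field, for instance.

```julia
# Hypothetical helper: count the fields on one line. With `ignorerepeated=true`,
# empty fields caused by consecutive delimiters, or by a delimiter right before
# the end of the line, are not counted.
count_fields(line; delim=' ', ignorerepeated=false) =
    length(split(line, delim; keepempty=!ignorerepeated))

# Made-up rows: the second one ends with a trailing space.
rows = ["2 1 530101 ?", "1 1 534817 ? "]

strict  = [count_fields(r) for r in rows]
lenient = [count_fields(r; ignorerepeated=true) for r in rows]

println(strict)   # the trailing space adds an empty extra field on the second row
println(lenient)  # both rows report the same number of fields
```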

tlienart · 2019-11-05T08:15:25Z

fantastic, thanks

quinnj · 2019-11-05T22:59:08Z

Closing this as the new Parsers release has been made and there's nothing else I think we should do in CSV.jl here.

tlienart changed the title ~~Small dataset warnings etc~~ [real data test] Small dataset warnings etc Oct 30, 2019

quinnj closed this as completed Nov 5, 2019

tlienart mentioned this issue Nov 6, 2019

horse data csv fix JuliaAI/DataScienceTutorials.jl#16

Closed
