[real data test] Small dataset warnings etc #523
So this dataset is a bit odd; here's a screenshot of the file in my editor and you can see that the first row doesn't have a trailing space, while the subsequent rows do (the little dot is the space at the end). There are some ~15 rows in the dataset that don't have the trailing space, so it's a bit unclear if there should be 29 columns, or just 28 and the extra space is part of the 28th integer column. This is leading to all the parsing warnings.
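A quick way to check how many rows carry the trailing space is to tally line endings directly. This is a minimal sketch in plain Julia (no CSV.jl calls); `trailing_space_tally` and the sample text are hypothetical and only stand in for the real file.

```julia
# Hypothetical helper: count non-empty lines that end with a space
# versus those that do not. Assumes '\n' line endings.
function trailing_space_tally(text::AbstractString)
    lines = filter(!isempty, split(text, '\n'))
    with_space = count(l -> endswith(l, ' '), lines)
    return (with_space = with_space, without_space = length(lines) - with_space)
end

# Made-up sample: the first row has no trailing space, the others do.
sample = "1 2 3\n4 5 6 \n7 8 9 \n"
tally = trailing_space_tally(sample)
println(tally)
```

If the file uses `\r\n` line endings, the carriage return would need to be stripped before the check.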
I'm still looking into why we're seeing the really scary "something went wrong" warning and I'll report back; just wanted to post what I've noticed so far and see if @tlienart knew a little more about what the dataset shape/size should be.
Oh, I can answer the next part: because CSV.jl can't quite figure out the correct number of columns to expect, it shows the really scary "something went wrong" error. I think we can avoid the scary warning here, because we already decided how many columns we're going with in the first detection phase, so when chunking up the file we can apply the same logic for ignoring extra or too few columns in a row. That will avoid printing the scary warnings for this file, though as I noted before, unless we know a little more about the true shape/size of the data, we'll still get some warnings.
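The idea of applying one fixed column count to every row can be sketched as follows. This is plain Julia for illustration only: `normalize_row` is a hypothetical helper, the sample rows are made up, and none of this reflects CSV.jl's actual internals.

```julia
# Hypothetical helper: force a row to exactly `ncols` fields.
# Extra fields are dropped; absent fields are filled with `missing`.
function normalize_row(fields::Vector{<:AbstractString}, ncols::Integer)
    row = Vector{Union{Missing,String}}(missing, ncols)
    for i in 1:min(ncols, length(fields))
        row[i] = String(fields[i])
    end
    return row
end

ncols = 3  # assumed: the column count chosen during the initial detection phase
rows = [split(line, ',') for line in ["1,2,3,", "4,5,6", "7,8"]]
normalized = [normalize_row(r, ncols) for r in rows]
```

Here the first row loses its empty fourth field and the third row gains a `missing` in the last column, so every row ends up with the same width.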
Sounds like this file is a potentially useful test case then 😄 It all sounds good. With respect to the trailing whitespace, I understand that it's odd, but shouldn't we want CSV not to be thrown off by this? I don't know the internals of CSV, so maybe it's really difficult, but isn't it a case of how you detect end of line where the regex would be
Well, as far as I can tell, this is an actual ambiguous case, not just something CSV.jl can't handle. So it's not just CSV.jl being thrown off. Particularly because the delimiter is

a,b,c
1,2,3,
4,5,6
7,8,9
10,11,12,

Note that the 1st and 4th rows of data have a trailing comma, which indicates a 4th column with a missing value. We do have the
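To make the ambiguity in the comma-delimited example above concrete, here is a minimal sketch in plain Julia that counts the fields on each row. It uses only `Base.split`, not CSV.jl, and `field_counts` is a hypothetical helper name.

```julia
# Hypothetical helper: number of fields on each non-empty line,
# splitting on a single-character delimiter.
function field_counts(text::AbstractString, delim::AbstractChar)
    # With an explicit delimiter, split keeps empty fields by default,
    # so a trailing delimiter produces one extra, empty field.
    return [length(split(line, delim)) for line in split(text, '\n') if !isempty(line)]
end

data = "a,b,c\n1,2,3,\n4,5,6\n7,8,9\n10,11,12,\n"

counts = field_counts(data, ',')
println(counts)  # the header has 3 fields; data rows 1 and 4 have 4
```

Rows with and without the trailing delimiter disagree on the column count, and nothing in the file itself says which count is the intended one.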
Yeah, that makes sense, thanks for explaining. Out of curiosity, how would you generally handle corrupt files like this?
Ok, after staring at this file and related code between CSV.jl and Parsers.jl for a couple days (somewhat in a fog, I admit), I think I've found an actual bug and solution. Interestingly enough, I even mention the correct behavior in my last comment above, by saying

What all that means is that for this file, you can use
fantastic, thanks
Closing this as the new Parsers release has been made and there's nothing else I think we should do in CSV.jl here.
Hello, I encountered a few issues when trying to read a simple UCI dataset and thought I'd report what happened / what I ended up doing (this is with CSV 5.14)
this by itself works in that it returns a DataFrame with the right dimensions but gives out
and also
(one per line)
To get this to fully work and not display a series of warnings I had to write
Did I do something wrong? Is there something weird that prevents the data from being read properly while still getting the right final answer? Also, I don't understand the warnings about not reaching the end of the line when the file still seems to get parsed fine.