Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible string type detection bug #945

Closed
bkamins opened this issue Dec 4, 2021 · 1 comment · Fixed by #949
Closed

Possible string type detection bug #945

bkamins opened this issue Dec 4, 2021 · 1 comment · Fixed by #949

Comments

@bkamins
Copy link
Member

bkamins commented Dec 4, 2021

@quinnj it seems that in some cases CSV.read creates String63 columns under default settings (and AFAICT it should not - String should be used in such a case as we are using InlineStrings.jl up to String31 by default).

Here is a notebook https://github.com/JuliaAcademy/DataFrames/blob/main/3.%20Working%20with%20text%20files.ipynb reproducing the issue. See cell [12] for the result and cell [10] for the command I use to read in the data. To get the data use the code in cell [5].

@bkamins
Copy link
Member Author

bkamins commented Dec 12, 2021

@quinnj I have the same problem in https://github.com/bkamins/PyDataGlobal2020.

See file https://github.com/bkamins/PyDataGlobal2020/blob/main/police.ipynb, cell [4]. We create String63, which should not happen.

quinnj added a commit that referenced this issue Dec 14, 2021
Fixes #945. The core issue here is an inconsistency between the initial
type detection code and the later-on type promotion code while parsing.
The type detection code had a limit setup so that inline string column
types weren't allowed larger than `String31`. The promotion code,
however, allowed inline string types to continue to promote up to
`String255`. That means if a column type was at least initiall detected
as some smaller-than-`String31` type, then later on a value larger than
31 bytes was parsed, it would continue to promote up to larger inline
string types. The change proposed in this PR is to limit the promotion
code to be more consistent with the type detection code: if a parsed
value "overflows" the `String31` type, we'll just promote to a regular
`String` instead of promoting to larger inline string types.
quinnj added a commit that referenced this issue Dec 14, 2021
Fixes #945. The core issue here is an inconsistency between the initial
type detection code and the later-on type promotion code while parsing.
The type detection code had a limit setup so that inline string column
types weren't allowed larger than `String31`. The promotion code,
however, allowed inline string types to continue to promote up to
`String255`. That means if a column type was at least initiall detected
as some smaller-than-`String31` type, then later on a value larger than
31 bytes was parsed, it would continue to promote up to larger inline
string types. The change proposed in this PR is to limit the promotion
code to be more consistent with the type detection code: if a parsed
value "overflows" the `String31` type, we'll just promote to a regular
`String` instead of promoting to larger inline string types.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant