Categorical variables for dta files #19

Nosferican · 2019-05-27T16:45:02Z

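using ReadStat
mkpath("data")  # make sure the target directory exists before downloading
file = download("http://www.stata-press.com/data/r15/fullauto.dta",
                "data/ologit.dta")
# ReadStat returns the raw columns together with the value-label metadata
data = read_dta(file)
using StatFiles, DataFrames
# StatFiles returns only the numeric values
output = load(file) |> DataFrame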

If you inspect data, you will see that categorical variables carry a mapping from values to labels, given by val_label_keys and val_label_dict. The default behavior specified here ignores that mapping and yields the values instead of the labels (e.g., rep77 gives [1, 2, 3, 4, 5] instead of ["Poor", "Fair", "Average", "Good", "Excellent"]). This may also be the case for other file formats, but it is confirmed for Stata's dta.
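To make the expected behavior concrete, here is a minimal sketch of the value-to-label decoding using only Base Julia. The vectors and the `decode` helper are hypothetical stand-ins, not part of ReadStat or StatFiles; in practice the codes and the label dictionary would come from the reader's output.

```julia
# Hypothetical stand-ins for one column and its value labels
# (assumed shapes; not read from the actual file).
codes = Union{Int,Missing}[3, 1, 5, missing, 2, 4]
value_labels = Dict(1 => "Poor", 2 => "Fair", 3 => "Average",
                    4 => "Good", 5 => "Excellent")

# Hypothetical helper: replace each code by its label.
# Missing values stay missing; unlabeled codes fall back to their string form.
decode(codes, labels) =
    [ismissing(c) ? missing : get(labels, c, string(c)) for c in codes]

rep77 = decode(codes, value_labels)
```

Falling back to the string form for unlabeled codes is a design choice for this sketch; a reader could just as well raise an error or keep the number.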

nalimilan · 2022-02-04T07:59:46Z

Do you think it would be appropriate to return a CategoricalArray for such cases? Something that I've been wondering recently is whether CategoricalArrays should be able to preserve the original value code in addition to the label. In R the fact that you can represent those either as factors or as labelled numeric vectors creates a divide which isn't optimal IMO.
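As a rough sketch of what such a return value could look like, the snippet below builds a CategoricalArray by hand with `categorical`, `levels!` and `ordered!` from CategoricalArrays.jl. The input data is made up, and treating the numeric order of the codes as the order of the levels is an assumption of this example, not something the file format guarantees.

```julia
using CategoricalArrays

# Hypothetical stand-ins for one column and its value labels.
codes = [3, 1, 5, 2, 4]
value_labels = Dict(1 => "Poor", 2 => "Fair", 3 => "Average",
                    4 => "Good", 5 => "Excellent")

# Order the levels by their original numeric code (assumption of this sketch).
sorted_codes = sort(collect(keys(value_labels)))
level_names = [value_labels[c] for c in sorted_codes]

# Build the array from the labels, then fix the level order and mark it ordered.
rep77 = categorical([value_labels[c] for c in codes])
levels!(rep77, level_names)
ordered!(rep77, true)
```

Note that the result stores only the labels and their position among the levels. The original value codes are not kept, which is the limitation raised above.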

nalimilan · 2022-03-06T15:09:22Z

FWIW ReadStatTables.jl supports value labels via a special LabeledArray type. There's currently no way to convert these to a CategoricalArray while preserving the ordering of levels, though.

davidanthoff added the bug label May 27, 2019
