Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fwrite substitutes class integer rownames by 1,2,3... #4957

Closed
dgarrimar opened this issue Apr 19, 2021 · 2 comments · Fixed by #5098
Closed

fwrite substitutes class integer rownames by 1,2,3... #4957

dgarrimar opened this issue Apr 19, 2021 · 2 comments · Fixed by #5098
Labels
Milestone

Comments

@dgarrimar
Copy link

# Minimal reproducible example

The following R code:

library(data.table)
df <- data.frame(c1 = c(1, 2, 3), c2 = c("a",  "b",  "c"))
rownames(df) <- as.integer(c(10, 20, 30)) # class(attributes(df)$row.names) 
fwrite(df, "out_fwrite", quote = F, row.names = T, sep = "\t")
write.table(df, "out_wt", quote = F, row.names = T, sep = "\t") # added for comparison

generates two output files, whose content is shown below:

out_fwrite

	c1	c2
1	1	a
2	2	b
3	3	c

out_wt

c1	c2
10	1	a
20	2	b
30	3	c

I noticed fwrite is converting the actual rownames (i.e. 10, 20, 30) to 1, 2, 3 without warning. This does not happen when rownames(df) <- as.numeric(c(10, 20, 30)) or rownames(df) <- as.character(c(10, 20, 30)). This is also different from the behaviour of write.table.

# Output of sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.2 (Nitrogen)

Matrix products: default
BLAS:   /nfs/users2/rg/dgarrido/R/R-4.0.3/lib64/R/lib/libRblas.so
LAPACK: /nfs/users2/rg/dgarrido/R/R-4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.6 optparse_1.6.6   

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3    getopt_1.20.3 
@dgarrimar
Copy link
Author

It may be related to this:

args.rowNames = isString(rn) ? DATAPTR_RO(rn) : NULL;

@ColeMiller1
Copy link
Contributor

ColeMiller1 commented Aug 7, 2021

Thanks for the report. Interestingly, R stores the numeric, non-integer row names as characters:

DF = data.frame(x = 1)

## integer storage
rownames(DF) = 1L
class(attributes(DF)$row.names)
#> [1] "integer"

## numeric actually stored as character!!
rownames(DF) = 1
class(attributes(DF)$row.names)
#> [1] "character"

Some options to change behavior:

  1. write.table simply passes as.character(row.names(x)) to the C code. It would be simple to change fwrite.R to do the same and make small changes in fwriteR.c
  2. The code @dgarrimar pointed out in fwriteR.c could be change to convert the integer row names to a pointer of character row names.
  3. fwrite.c could be modified to allow for integer row names to directly be written. The would probably be the best for performance but also the most code change because the argument structure would likely need changed.

The first option seems good enough - I will work on a PR. It will be OK if the maintainers would prefer a different option as this should be pretty straightforward.

Changes to implement option 3:

Need to switch on type of rowNames so another arg may be needed. Meaning, if it is an integer row name, we need the maximum row name value to convert to character length.

data.table/src/fwrite.c

Lines 641 to 644 in 831013a

if (args.doRowNames) {
maxLineLen += args.rowNames ? getMaxStringLen(args.rowNames, args.nrow)*2 : 1+(int)log10(args.nrow); // the width of the row number
maxLineLen += 2*(doQuote!=0/*NA('auto') or true*/) + sepLen;
}

Other change - need to writeInt64 if we are doing row name for integer type.

data.table/src/fwrite.c

Lines 875 to 883 in 831013a

if (args.doRowNames) {
if (args.rowNames==NULL) {
if (doQuote!=0/*NA'auto' or true*/) *ch++='"';
int64_t rn = i+1;
writeInt64(&rn, 0, &ch);
if (doQuote!=0) *ch++='"';
} else {
writeString(args.rowNames, i, &ch);
}

@mattdowle mattdowle added this to the 1.14.1 milestone Aug 10, 2021
@jangorecki jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants