reading large file using bigmemory #110

Open
bioinfonext opened this issue Jan 18, 2022 · 13 comments
Comments

@bioinfonext

bioinfonext commented Jan 18, 2022

Hi,

I am trying to read a large file using bigmemory, but I get errors because the file's first two columns are non-numeric. I have deleted the second column, but I would like to use the first column as row names.

Is there an option in bigmemory to use the first column as row names, and how can I avoid the warning messages below?

> library("bigmemory")
> library("biganalytics")
> data.matrix - read.big.matrix("methylation.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("methylation.txt", header = T,  :
  non-numeric argument to binary operator
In addition: Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("methylation.txt", header = T, sep = "\t") :

Many thanks,

@privefl
Contributor

privefl commented Jan 18, 2022

I don't think you can have rownames for a big.matrix.
You should probably just store these somewhere else.

@bioinfonext
Author

Hi @privefl,

I need to correlate this matrix with phenotypic data, which is why I want to use SampleID as row names. Can I assign row names from a separate list after reading the matrix?

Someone suggested some solutions here, but I am not sure what they mean.

https://stackoverflow.com/questions/12576735/bigmemory-and-rownames-dimnames-of-matrix

Many thanks

@privefl
Contributor

privefl commented Jan 18, 2022

Just use match() to get the row indices that correspond to the external SampleID.

(or the opposite, i.e. reorder the phenotypic data instead)
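
A minimal sketch of the match() suggestion, using a small ordinary matrix as a stand-in for the big.matrix. The names `sample_ids`, `beta`, and `pheno` and all values are hypothetical and not taken from the thread. It assumes the sample IDs were saved in a separate vector, in the same row order as the matrix.

```r
# Sample IDs in the row order of the matrix, kept in a separate vector
# because the IDs are not stored in the matrix itself (hypothetical data).
sample_ids <- c("S1", "S2", "S3", "S4")

# Stand-in for the numeric methylation matrix (rows = samples).
beta <- matrix(c(0.97, 0.96, 0.96, 0.94,
                 0.72, 0.77, 0.85, 0.88),
               nrow = 4,
               dimnames = list(NULL, c("cg02115394", "cg12480843")))

# Phenotype table, in a different row order and missing one sample.
pheno <- data.frame(SampleID = c("S3", "S1", "S4"),
                    age = c(51, 34, 47))

# For each phenotype row, find the matching row index of the matrix.
# match() returns NA for IDs that are not found, so check for NAs first.
idx <- match(pheno$SampleID, sample_ids)
stopifnot(!anyNA(idx))

# Matrix rows reordered to line up with the phenotype table.
beta_aligned <- beta[idx, , drop = FALSE]

# Correlate one probe with the phenotype.
r <- cor(beta_aligned[, "cg02115394"], pheno$age)
```

The same row indices could then be used to subset the real matrix by row, so the IDs never need to live inside the matrix.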

@bioinfonext
Author

Hi,
Thanks @privefl

We just have 2000 rows so we need these for further analysis.
Many thanks,

@bioinfonext
Author

bioinfonext commented Jan 18, 2022

I have removed the first two non-numeric columns, but it still shows the same error:

> data.matrix - read.big.matrix("phylo.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("phylo.txt", header = T, sep = "\t") :
  non-numeric argument to binary operator
In addition: Warning messages:
1: In na.omit(as.integer(firstLineVals)) : NAs introduced by coercion
2: In na.omit(as.double(firstLineVals)) : NAs introduced by coercion
3: In read.big.matrix("phylo.txt", header = T, sep = "\t") :
  Because type was not specified, we chose double based on the first line of data.

The file looks like this now, after removing the first two character columns; it has around 80000 columns and 2000 rows:
cg02115394 cg12480843
0.974035    0.718462
0.967383    0.765799
0.961012    0.84822
0.960447    0.722946
0.963181    0.939808
0.940292    0.878546

@privefl
Contributor

privefl commented Jan 18, 2022

It would be a good idea to read only the first few rows (e.g. 5) with data.table::fread() to get an idea of the number and types of columns.

@bioinfonext
Author

bioinfonext commented Jan 18, 2022

I have removed all non-numeric columns and I am able to read 5 rows using fread, but bigmemory still does not work here.

mydt10 <- fread("phylo.num.txt", nrows = 5)
> dim(mydt10)
[1]      5 844488
> str(mydt10)
Classes ‘data.table’ and 'data.frame':  5 obs. of  844488 variables:
$ cg14361672      : num  0.974 0.967 0.961 0.96 0.963
$ cg12950382      : num  0.718 0.766 0.848 0.723 0.94
$ cg02115394      : num  0.0337 0.0258 0.025 0.0317 0.0357
$ cg12480843      : num  0.0182 0.0189 0.0137 0.0167 0.0151
$ cg26724186      : num  0.98 0.977 0.982 0.982 0.978
$ cg00617867      : num  0.96 0.979 0.98 0.977 0.977
$ cg13773083      : num  0.313 0.246 0.253 0.234 0.372
$ cg17236668      : num  0.974 0.975 0.975 0.979 0.978
$ cg19607165      : num  0.0866 0.0966 0.0804 0.1162 0.0792
$ cg08770523      : num  0.0243 0.0213 0.0203 0.0194 0.0197

@privefl
Contributor

privefl commented Jan 18, 2022

table(sapply(mydt10, typeof))?

@bioinfonext
Author

bioinfonext commented Jan 18, 2022

> table(sapply(mydt10, typeof))

double
844488

@privefl
Contributor

privefl commented Jan 18, 2022

Hum..
Maybe worth trying bigstatsr::big_read() (https://privefl.github.io/bigstatsr/articles/read-FBM-from-file.html).

@bioinfonext
Author

bioinfonext commented Jan 18, 2022

I am still getting errors, even with bigreadr:

> data2 <- big_fread2("phylo.num.txt", nb_parts = NULL, .transform = identity,.combine = cbind_df, skip = 0, select = NULL, progress = FALSE, part_size = 500 * 1024^2)
 *** caught segfault ***
address 0x7f5e51c63df7, cause 'memory not mapped'

Traceback:
 1: data.table::fread(input, ..., data.table = data.table, nThread = nThread)
 2: fread2(file, skip = skip, select = cols, ..., showProgress = FALSE)
 3: .transform(fread2(file, skip = skip, select = cols, ..., showProgress = FALSE))
 4: FUN(X[[i]], ...)
 5: lapply(split_cols, function(cols) {    part <- .transform(fread2(file, skip = skip, select = cols,         ..., showProgress = FALSE))    already_read <<- already_read + length(cols)    if (progress)         utils::setTxtProgressBar(pb, already_read)    part})
 6: big_fread2("phylo.num.txt", nb_parts = NULL, .transform = identity,     .combine = cbind_df, skip = 0, select = NULL, progress = FALSE,     part_size = 500 * 1024^2)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection:

Many thanks,

@bioinfonext
Author

Is it possible to run a loop to read this file using fread in R?

Many thanks,
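
One way such a loop might look, as a rough sketch only: read the header to count the columns, then call fread() on one block of columns at a time through its `select` argument, keeping only a small summary per block so the full table is never held in memory. The temporary file, `block_size`, and the column-mean summary are illustrative assumptions. Since the segfault above was raised inside fread() itself, the thread does not establish whether this would avoid it on the real file.

```r
library(data.table)

# Small tab-separated stand-in for the real file (hypothetical data).
path <- tempfile(fileext = ".txt")
demo <- as.data.table(matrix(round(runif(5 * 10), 3), nrow = 5))
setnames(demo, paste0("cg", sprintf("%08d", 1:10)))
fwrite(demo, path, sep = "\t")

# Read only the header to learn the number of columns.
n_col <- ncol(fread(path, nrows = 0))

# Split the column indices into blocks of at most `block_size` columns.
block_size <- 4
blocks <- split(seq_len(n_col), ceiling(seq_len(n_col) / block_size))

# Read one block of columns at a time and keep only a per-column summary.
col_means <- unlist(lapply(blocks, function(cols) {
  part <- fread(path, select = cols, sep = "\t")
  colMeans(as.matrix(part))
}), use.names = FALSE)
```

Instead of computing a summary, each block could be written into the matching columns of a pre-allocated on-disk matrix, which is roughly the column-wise strategy visible in the big_fread2() traceback above.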

@glm729

glm729 commented Oct 31, 2022

@bioinfonext:

> data.matrix - read.big.matrix("phylo.txt",header=T,sep='\t')
Error in data.matrix - read.big.matrix("phylo.txt", header = T, sep = "\t") :
  non-numeric argument to binary operator

Is this meant to be:

data.matrix <- read.big.matrix("phylo.txt", header = TRUE, sep = "\t")
#           ^^

It looks like you had a typo, given your original error -- the assignment operator was missing the <.
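
A small base-R illustration of why the typo produces this particular message: `data.matrix` already exists as a function in base R, so `data.matrix - x` is parsed as a subtraction with a function on the left-hand side. The value `1` below is an arbitrary stand-in for the result of read.big.matrix().

```r
# `data.matrix` is an existing base R function, so `data.matrix - 1`
# asks R to subtract a number from a function, which raises the error.
msg <- tryCatch(
  data.matrix - 1,
  error = function(e) conditionMessage(e)
)

# With `<-`, the right-hand side is simply bound to a name.
# A different name (hypothetical: `dat`) avoids masking base::data.matrix.
dat <- 1
```

Because both operands are evaluated before the subtraction is attempted, the file was still parsed in the original call, which would explain why the read.big.matrix() warnings appeared alongside the error.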
