read_csv crashes RStudio Session - larger csv #1141

Closed
benjaminhlina opened this issue Oct 28, 2020 · 13 comments
Labels
reprex needs a minimal reproducible example

Comments

@benjaminhlina

benjaminhlina commented Oct 28, 2020

The RStudio session aborts when I use read_csv() on a batch of larger (200-500 MB), similar CSV files, all exported from a data logger, which open fine with read.csv(). I'm currently using R v4.0.3 and RStudio v1.3.1093. Before I updated packages and R versions last week, this code ran completely fine, so I'm not sure what the issue is. If I run this outside of RStudio, in plain R, the files load properly, and smaller CSV files (1 MB) that don't require or display a progress bar load just fine in RStudio. I have also uninstalled and reinstalled both R and RStudio.

I currently use the here package in tandem with RStudio projects. I have tried removing the here() section of the code, and it still crashes in RStudio but not in R. I've provided my session_info() and would make a reprex, but I'm unsure how to do that in this situation, since the files are large and I'm unable to provide access to them. I could make a reprex using a large data file from an online source, but I don't know if that would result in the same error.

 setting  value                       
 version  R version 4.0.3 (2020-10-10)
 os       Windows 10 x64              
 system   x86_64, mingw32             
 ui       RStudio                     
 language (EN)                        
 collate  English_United States.1252  
 ctype    English_United States.1252  
 tz       America/New_York            
 date     2020-10-28                  

- Packages --------------------------------------------------------------------------------------------------------------------
 package     * version date       lib source        
 assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.0.0)
 audio         0.1-7   2020-03-09 [1] CRAN (R 4.0.0)
 backports     1.1.10  2020-09-15 [1] CRAN (R 4.0.3)
 beepr         1.3     2018-06-04 [1] CRAN (R 4.0.0)
 callr         3.5.1   2020-10-13 [1] CRAN (R 4.0.3)
 class         7.3-17  2020-04-26 [2] CRAN (R 4.0.3)
 classInt      0.4-3   2020-04-07 [1] CRAN (R 4.0.0)
 cli           2.1.0   2020-10-12 [1] CRAN (R 4.0.3)
 colorspace    1.4-1   2019-03-18 [1] CRAN (R 4.0.0)
 crayon        1.3.4   2017-09-16 [1] CRAN (R 4.0.3)
 data.table    1.13.2  2020-10-19 [1] CRAN (R 4.0.2)
 DBI           1.1.0   2019-12-15 [1] CRAN (R 4.0.0)
 desc          1.2.0   2018-05-01 [1] CRAN (R 4.0.0)
 devtools      2.3.2   2020-09-18 [1] CRAN (R 4.0.3)
 digest        0.6.27  2020-10-24 [1] CRAN (R 4.0.3)
 dplyr       * 1.0.2   2020-08-18 [1] CRAN (R 4.0.2)
 e1071         1.7-4   2020-10-14 [1] CRAN (R 4.0.3)
 ellipsis      0.3.1   2020-05-15 [1] CRAN (R 4.0.0)
 fansi         0.4.1   2020-01-08 [1] CRAN (R 4.0.0)
 fs            1.5.0   2020-07-31 [1] CRAN (R 4.0.3)
 generics      0.0.2   2018-11-29 [1] CRAN (R 4.0.0)
 geosphere   * 1.5-10  2019-05-26 [1] CRAN (R 4.0.0)
 ggplot2     * 3.3.2   2020-06-19 [1] CRAN (R 4.0.2)
 glatos      * 0.4.2   2020-06-11 [1] url           
 glue          1.4.2   2020-08-27 [1] CRAN (R 4.0.2)
 gridExtra     2.3     2017-09-09 [1] CRAN (R 4.0.0)
 gtable        0.3.0   2019-03-25 [1] CRAN (R 4.0.0)
 here        * 0.1     2017-05-28 [1] CRAN (R 4.0.0)
 hms           0.5.3   2020-01-08 [1] CRAN (R 4.0.3)
 janitor     * 2.0.1   2020-04-12 [1] CRAN (R 4.0.0)
 KernSmooth    2.23-17 2020-04-26 [2] CRAN (R 4.0.3)
 knitr         1.30    2020-09-22 [1] CRAN (R 4.0.3)
 lattice       0.20-41 2020-04-02 [2] CRAN (R 4.0.3)
 lemon       * 0.4.5   2020-06-08 [1] CRAN (R 4.0.0)
 lifecycle     0.2.0   2020-03-06 [1] CRAN (R 4.0.3)
 lubridate   * 1.7.9   2020-06-08 [1] CRAN (R 4.0.2)
 magrittr      1.5     2014-11-22 [1] CRAN (R 4.0.0)
 memoise       1.1.0   2017-04-21 [1] CRAN (R 4.0.3)
 munsell       0.5.0   2018-06-12 [1] CRAN (R 4.0.0)
 pillar        1.4.6   2020-07-10 [1] CRAN (R 4.0.2)
 pkgbuild      1.1.0   2020-07-13 [1] CRAN (R 4.0.2)
 pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.0.0)
 pkgload       1.1.0   2020-05-29 [1] CRAN (R 4.0.0)
 plyr          1.8.6   2020-03-03 [1] CRAN (R 4.0.0)
 prettyunits   1.1.1   2020-01-24 [1] CRAN (R 4.0.0)
 processx      3.4.4   2020-09-03 [1] CRAN (R 4.0.2)
 ps            1.4.0   2020-10-07 [1] CRAN (R 4.0.3)
 purrr         0.3.4   2020-04-17 [1] CRAN (R 4.0.0)
 R6            2.4.1   2019-11-12 [1] CRAN (R 4.0.3)
 Rcpp          1.0.5   2020-07-06 [1] CRAN (R 4.0.2)
 readr       * 1.4.0   2020-10-05 [1] CRAN (R 4.0.3)
 remotes       2.2.0   2020-07-21 [1] CRAN (R 4.0.2)
 rlang         0.4.8   2020-10-08 [1] CRAN (R 4.0.3)
 rprojroot     1.3-2   2018-01-03 [1] CRAN (R 4.0.0)
 rstudioapi    0.11    2020-02-07 [1] CRAN (R 4.0.0)
 scales        1.1.1   2020-05-11 [1] CRAN (R 4.0.0)
 sessioninfo   1.1.1   2018-11-05 [1] CRAN (R 4.0.3)
 sf          * 0.9-6   2020-09-13 [1] CRAN (R 4.0.3)
 snakecase     0.11.0  2019-05-25 [1] CRAN (R 4.0.0)
 sp            1.4-4   2020-10-07 [1] CRAN (R 4.0.3)
 stringi       1.5.3   2020-09-09 [1] CRAN (R 4.0.2)
 stringr     * 1.4.0   2019-02-10 [1] CRAN (R 4.0.0)
 tibble        3.0.4   2020-10-12 [1] CRAN (R 4.0.3)
 tidyr         1.1.2   2020-08-27 [1] CRAN (R 4.0.2)
 tidyselect    1.1.0   2020-05-11 [1] CRAN (R 4.0.0)
 units         0.6-7   2020-06-13 [1] CRAN (R 4.0.2)
 usethis       1.6.3   2020-09-17 [1] CRAN (R 4.0.3)
 vctrs         0.3.4   2020-08-29 [1] CRAN (R 4.0.2)
 withr         2.3.0   2020-09-22 [1] CRAN (R 4.0.3)
 xfun          0.18    2020-09-29 [1] CRAN (R 4.0.3)

[1] C:/Users/benja/Documents/R/win-library/4.0
[2] C:/Program Files/R/R-4.0.3/library
@jimhester jimhester added the reprex needs a minimal reproducible example label Oct 28, 2020
@jimhester
Collaborator

Try reinstalling readr, there was an interaction with cpp11 and how RStudio saves and restores environments that should now be resolved. But if the version of readr you have installed was compiled against the old version of cpp11 this could cause this behavior.

The current CRAN binaries should be ok, so reinstalling will hopefully fix your issue.

@benjaminhlina
Author

benjaminhlina commented Oct 28, 2020

Thanks, but that did not fix the issue. I also removed both cpp11 and readr and their respective dependencies and reinstalled them, and it still crashes. I can provide a reprex as well, but I don't know if it will simulate the same error, considering the data will be from an online source. I'll try to find a free online CSV that is about the same size.

I also deleted and remade the .Rproj file and the hidden folder that gets created with RStudio projects, and it still hangs.

@benjaminhlina
Author

benjaminhlina commented Oct 28, 2020

Here is a reprex using an online source. The file size is ~300 MB; however, this doesn't cause the session to crash, so I have no idea why my files are crashing it. I also re-exported the CSV from the data logger and it still crashes. What is so strange to me is that this was working last week. I'll try to find an online source of around the same size that causes a crash, as this example isn't really all that helpful. Are there any larger CSVs that you use to test the functions that would be available to use? Thanks for your help in resolving this.

# load packages 

library(readr)

# bring in dataframe

df <- read_csv(file = "http://www2.census.gov/programs-surveys/bds/tables/time-series/bds2018_msa_sector_fage.csv")
#> 
#> -- Column specification --------------------------------------------------------
#> cols(
#>   .default = col_character(),
#>   year = col_double(),
#>   msa = col_double()
#> )
#> i Use `spec()` for the full column specifications.

Created on 2020-10-28 by the reprex package (v0.3.0)

@jimhester
Collaborator

It's not clear to me whether the example you posted in #1141 (comment) reproduces the issue or not.

If you download the file locally first and read it from the file read_csv("bds2018_msa_sector_fage.csv") does that reproduce the issue?
If you use read_csv(file("bds2018_msa_sector_fage.csv")) does that reproduce the issue?
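A minimal sketch of the two checks suggested above, assuming the file has already been saved locally. The helper name `read_both_ways` is hypothetical and only wraps the two calls so they can be compared in a fresh session.

```r
library(readr)

# Hypothetical helper: read the same local file by path and through a
# connection, since readr may handle the two inputs differently.
read_both_ways <- function(path) {
  list(
    from_path = read_csv(path),
    from_connection = read_csv(file(path))
  )
}

# Usage, assuming the file was saved next to the script first, e.g. with
# download.file(url, "bds2018_msa_sector_fage.csv", mode = "wb"):
# results <- read_both_ways("bds2018_msa_sector_fage.csv")
```

If only one of the two calls crashes the session, that narrows down which input path is involved.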

@bbolker

bbolker commented Oct 30, 2020

Similar problem: reinstalling (from source) helped, but I'm still seeing something a little weird going on. The input file is 60M (sensitive data, so it will take me a little while to come up with a reproducible example). read_csv() works OK from an interactive session, but in a batch session it fails with the error below. (I also reinstalled the CRAN versions of tibble and vctrs from source ...) Any ideas where to start looking?

Update: if I increase guess_max so there are no parsing failures (i.e. read_csv reads far enough into the file to correctly identify all column types), the problem goes away. This solves my problem for now; hopefully it helps you diagnose what's going on. (Maybe I could construct a reproducible example by creating a big file with lots of NAs at the top to force parsing failures ...)

Error: Assigned data `all_colnames[problems$col]` must be compatible with existing data.
Existing data has 1 row.
Assigned data has 2408743 rows.
Row updates require a list value. Do you need `list()` or `as.list()`?
Backtrace:
  1. ├─global::csvRead()
  2. │ └─readr::read_csv(matchFile(pat, fl, exts), ...)
  3. │   └─readr:::read_delimited(...)
  4. │     └─readr:::name_problems(out, names(spec$cols), name)
  5. │       ├─base::`$<-`(...)
  6. │       └─tibble:::`$<-.tbl_df`(...)
  7. │         └─tibble:::tbl_subassign(...)
  8. │           └─tibble:::vectbl_recycle_rhs(...)
  9. │             ├─base::withCallingHandlers(...)
 10. │             └─vctrs::vec_recycle(value[[j]], nrow)
 11. ├─vctrs:::stop_recycle_incompatible_size(...)
 12. │ └─vctrs:::stop_vctrs(...)
 13. │   └─rlang::abort(message, class = c(class, "vctrs_error"), ...)
 14. │     └─rlang:::signal_abort(cnd)
 15. │       └─base::signalCondition(cnd)
 16. └─(function (cnd) ...
Execution halted
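A rough sketch of the guess_max workaround described above, using a small toy file with NAs at the top of one column. The file, its column names, and its size are made up for illustration. Whether a small guess_max actually produces parsing failures on such a file depends on the readr version, so this only shows the two ways to avoid relying on the default guess.

```r
library(readr)

# Hypothetical toy file: the first 2000 rows of `value` are NA, so a type
# guess based only on the top of the file sees nothing but missing values.
path <- tempfile(fileext = ".csv")
toy <- data.frame(
  id = seq_len(3000),
  value = c(rep(NA_real_, 2000), seq(0.5, by = 0.5, length.out = 1000))
)
write.csv(toy, path, row.names = FALSE)

# Option 1: let readr inspect every row before choosing column types.
wide_guess <- read_csv(path, guess_max = nrow(toy))

# Option 2: skip guessing entirely by declaring the column types up front.
explicit <- read_csv(
  path,
  col_types = cols(id = col_integer(), value = col_double())
)
```

Declaring col_types is usually the cheaper option for very large files, since nothing has to be scanned just to guess types.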

@benjaminhlina
Author

benjaminhlina commented Nov 3, 2020

Sorry for not getting back to you. I was in the field without internet or cell service for the past few days. The reprex I supplied isn't helpful: whether I download the file first or read it from the web, it doesn't cause the session to crash. I'm working on a more specific reproducible example by making a large toy data file with a str() similar to that of the file that's causing the crash (I can't share the file itself).

Separately, while making this file, I've noticed that in a clean session, after loading packages, if I read in the toy data first (550 MB; I'll put this toy file on GitHub), then a smaller data file (200 MB), and then the one that's causing the session to crash (571 MB), all of them load properly and the session doesn't crash. If I create a new clean session, load the same packages, and read the problem file (571 MB) first, it crashes. So something about the order in which the files are loaded is causing it to hang.

The order in which the files are imported matters, because the files are downloads from different times of the year that correspond to different metadata, which later has to be added to the downloaded files and lined up properly.

@benjaminhlina
Author

benjaminhlina commented Nov 16, 2020

I've been trying to come up with a better reprex for this issue, but the problem is that the error is super inconsistent.

I've run it several times with readr isolated, and it either doesn't crash or crashes only once in a while. After a crash, I reload the project and it either runs or just crashes again and again. I have no idea what it's doing or what the issue is; this is super frustrating because it's so inconsistent. A quick search for RStudio session crashes suggests it's a RAM issue, but I have 24 GB of RAM, so there's plenty available for the session to use. I have run the same files on several different Windows computers, some with older versions of both R (3.6.3) and RStudio (v1.2.5042), and it crashes there too. Based on rstudio/rstudio#8319, it sounds like readr and/or RStudio is leaking memory.

Here is a reprex using the data that's causing it. I've made the CSVs downloadable from Dropbox: file 1, file 2, and file 3.

# load packages

library(readr)

# load csv that is 570 MB; this often crashes once it's finished bringing in the file

rd_8 <- read_csv("kenauk_download_08_July_3_2020.csv")

# this csv is only 16.6 MB

rd_9 <- read_csv("kenauk_download_09_Sept_28_2020.csv")

# this file is 46.4 MB, but in this order it can randomly cause the session to crash

rd_10 <- read_csv("kenauk_download_10_Oct_18_2020.csv")

I usually use here, a pipe, and clean_names() from janitor, but I've removed those above so that readr is isolated. I have no idea whether this will cause the issue to occur on your end.

# load packages ----

library(dplyr)
library(here)
library(janitor)
library(readr)

# bring in file 
rd_8 <- read_csv(here("Fish and tagging data", 
                      "Receiver Downloads", 
                      "Downloads exported by VUE",
                      "kenauk_download_08_July_3_2020.csv")) %>% 
  clean_names()

This will crash 100% of the time.

@jimhester
Collaborator

jimhester commented Nov 16, 2020

Could you try reinstalling readr and the cpp11 package? I have been trying to reproduce this crash and have been unable to do so with these files. It is possible it was an interaction with RStudio session restore that was fixed by the latest cpp11 release.

@benjaminhlina
Author

I thought that the crash wouldn't happen on your end, as it is so inconsistent on mine. I have made new projects and loaded the same data, and sometimes it crashes and sometimes it doesn't. I have removed and reinstalled both cpp11 and readr, and it still crashes.

Last week I removed R, RStudio, and Rtools completely, deleted my entire package library, and deleted the temporary session files that the RStudio website says to remove in order to reset RStudio (link here). On one of the laptops I tried, I upgraded R and RStudio and freshly installed both cpp11 and readr, as neither package had been installed there before, and it still crashed. I'm at the point of considering rewriting everything in base R, as I need to keep working and this is holding me up. The one thing I really like about readr is that it recognizes the POSIXct timestamp and imports it properly, as well as the sensor unit column. Base R doesn't. I'm confused as to why this is happening, since until 3 weeks ago, before I updated RStudio and readr, it was not happening. Thank you @jimhester for your help on this!
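For anyone who does fall back to base R in the meantime, here is a minimal sketch of parsing the timestamp column by hand. The toy file, the column names (`date_time_utc`, `sensor_value`), and the timestamp format are assumptions for illustration; the real export may use different ones.

```r
# Hypothetical toy export; real column names and timestamp format may differ.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "date_time_utc,sensor_value",
  "2020-07-03 12:00:00,14.2",
  "2020-07-03 12:05:30,14.4"
), path)

# read.csv() leaves the timestamp as a character column.
detections <- read.csv(path, stringsAsFactors = FALSE)

# Convert it explicitly, stating the format and time zone.
detections$date_time_utc <- as.POSIXct(
  detections$date_time_utc,
  format = "%Y-%m-%d %H:%M:%S",
  tz = "UTC"
)
```

Stating `tz` explicitly avoids the timestamps being interpreted in the local time zone of whichever machine runs the script.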

@jimhester
Collaborator

You can install the prior version of readr with remotes::install_version("readr", "1.3.1"). Note that this requires a development environment with Rtools installed on Windows.

@benjaminhlina
Author

benjaminhlina commented Nov 17, 2020

Thank you for the suggestion. I have installed the previous version and it has yet to crash. Again, I'm not sure what's causing the crash, but it seems like some type of memory issue between readr and RStudio.

Would installing the development versions of both cpp11 and readr potentially address this, as suggested in closed issue #1145? I don't fully know the development side of readr, but these seem semi-related, as it's clearly a memory issue.

@jimhester
Collaborator

Yes, installing the development version of cpp11 and readr would definitely be something to try if you were interested.

@benjaminhlina
Author

benjaminhlina commented Nov 17, 2020

It appears that installing the development versions of both cpp11 and readr has stopped the crash from occurring. If this changes, I'll reopen this. I'll be deleting the links to the files I shared. Thanks again for your help on this.
