
tar_make_clustermq() hangs and tar_make() works when building large targets #182

Closed
6 tasks done
mattwarkentin opened this issue Oct 5, 2020 · 5 comments

Comments

@mattwarkentin
Contributor

mattwarkentin commented Oct 5, 2020

Prework

  • Read and agree to the code of conduct and contributing guidelines.
  • If there is already a relevant issue, whether open or closed, comment on the existing thread instead of posting a new issue.
  • Post a minimal reproducible example so the maintainer can troubleshoot the problems you identify. A reproducible example is:
    • Runnable: post enough R code and data so any onlooker can create the error on their own computer.
    • Minimal: reduce runtime wherever possible and remove complicated details that are irrelevant to the issue at hand.
    • Readable: format your code according to the tidyverse style guide.

Description

This has been a recurring issue for me. A few select targets hang indefinitely when built with tar_make_clustermq(), but build without issue under tar_make(). tar_make_clustermq() will hang for tens of minutes or hours, while tar_make() takes only a few minutes to build the same target. I may have filed an issue about this before, but this time I am determined to produce a reprex. Possibly related to #169.

Building this particular target simply involves reading in a large tab-separated file with vroom::vroom(). The file is about 500,000 rows by 2,500 columns, and is 9.7 GB on disk. I have simulated some data to stand in its place.

Some notable features that may or may not be relevant to this issue:

  • The data are stored on a mounted network drive, and not directly on my Mac's hard drive

  • Some targets options that I set:

options(clustermq.scheduler = "multicore")

tar_option_set(
  format = "qs",
  memory = "transient",
  storage = "remote",
  retrieval = "remote",
  packages = pkgs
)
  • tar_make_clustermq() arguments:
targets::tar_make_clustermq(
  workers = 1L,
  garbage_collection = TRUE
)

Reproducible example

Brief summary: Scenarios 1 and 2 run fine with tar_make() and take around 5 minutes each. Scenarios 3 and 4 hang (or perhaps just take a very long time), and I cancelled them after 15+ minutes of waiting. Because scenarios 3, 4, and 5 never completed, I posted the literal reprex::reprex() code for them rather than the output. Scenario 5 moves the data to a local directory, which made no difference; tar_make_clustermq() still hangs.

Simulate Data

Here is my attempt at a reprex. I simulated some data that is comparable in size and structure:

# This makes a very large file, run with caution
rows <- 5e5; cols <- 1000
data <- data.frame(matrix(runif(rows * cols), nrow = rows))
vroom::vroom_write(data, "/Volumes/hung_lab/warkentin/targets-data.tsv")

Benchmark interactive loading

First, benchmark simple interactive loading of the data set into R using two common packages:

system.time(readr::read_tsv("/Volumes/hung_lab/warkentin/targets-data.tsv"))
Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
|==================================================| 100% 9189 MB
   user  system elapsed 
132.147  18.383 258.431 
system.time(vroom::vroom("/Volumes/hung_lab/warkentin/targets-data.tsv"))
Rows: 500,000                                                                                       
Columns: 1,000
Delimiter: "\t"
dbl [1000]: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X...

Use `spec()` to retrieve the guessed column specification
Pass a specification to the `col_types` argument to quiet this message
   user  system elapsed 
 13.668  10.695 110.535 

1. Reprex using {readr} and tar_make()

library(targets)

tar_script({
  options(clustermq.scheduler = "multicore")
  tar_option_set(
    memory = "transient",
    storage = "remote",
    retrieval = "remote",
    packages = "tidyverse"
  )
  
  tar_pipeline(
    tar_target(
      file,
      "/Volumes/hung_lab/warkentin/targets-data.tsv",
      format = "file"
    ),
    tar_target(
      data,
      readr::read_tsv(file),
      format = "fst_tbl"
    )
  )
})
system.time(tar_make())
#> ● run target file
#> ● run target data
#> Parsed with column specification:
#> cols(
#>   .default = col_double()
#> )
#> See spec(...) for full column specifications.
#>    user  system elapsed 
#> 118.728  25.097 365.559

2. Reprex using {vroom} and tar_make()

library(targets)

tar_script({
  options(clustermq.scheduler = "multicore")
  tar_option_set(
    memory = "transient",
    storage = "remote",
    retrieval = "remote",
    packages = "tidyverse"
  )
  
  tar_pipeline(
    tar_target(
      file,
      "/Volumes/hung_lab/warkentin/targets-data.tsv",
      format = "file"
    ),
    tar_target(
      data,
      vroom::vroom(file),
      format = "fst_tbl"
    )
  )
})
system.time(tar_make())
#> ● run target file
#> ● run target data
#> Rows: 500,000
#> Columns: 1,000
#> Delimiter: "\t"
#> dbl [1000]: X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13, X14, X15, X16, X17, X18, X...
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#>    user  system elapsed 
#> 184.827  24.742 254.064

3. Reprex using {readr} and tar_make_clustermq()

reprex::reprex({
  library(targets)
  
  tar_script({
    options(clustermq.scheduler = "multicore")
    tar_option_set(
      memory = "transient",
      storage = "remote",
      retrieval = "remote",
      packages = "tidyverse"
    )
    
    tar_pipeline(
      tar_target(
        file,
        "/Volumes/hung_lab/warkentin/targets-data.tsv",
        format = "file"
      ),
      tar_target(
        data,
        readr::read_tsv(file),
        format = "fst_tbl"
      )
    )
  })
  system.time(tar_make_clustermq())
})

4. Reprex using {vroom} and tar_make_clustermq()

reprex::reprex({
  library(targets)
  
  tar_script({
    options(clustermq.scheduler = "multicore")
    tar_option_set(
      memory = "transient",
      storage = "remote",
      retrieval = "remote",
      packages = "tidyverse"
    )
    
    tar_pipeline(
      tar_target(
        file,
        "/Volumes/hung_lab/warkentin/targets-data.tsv",
        format = "file"
      ),
      tar_target(
        data,
        vroom::vroom(file),
        format = "fst_tbl"
      )
    )
  })
  system.time(tar_make_clustermq())
})

5. Reprex using {readr}, tar_make_clustermq(), and data stored locally

vroom::vroom_write(data, "~/Desktop/targets-data.tsv")
reprex::reprex({
  library(targets)
  
  tar_script({
    options(clustermq.scheduler = "multicore")
    tar_option_set(
      memory = "transient",
      storage = "remote",
      retrieval = "remote",
      packages = "tidyverse"
    )
    
    tar_pipeline(
      tar_target(
        file,
        "~/Desktop/targets-data.tsv",
        format = "file"
      ),
      tar_target(
        data,
        readr::read_tsv(file),
        format = "fst_tbl"
      )
    )
  })
  system.time(tar_make_clustermq())
})
@wlandau
Member

wlandau commented Oct 5, 2020

Such a helpful reprex, thank you. I am almost positive this is because targets is incorrectly trying to send data over the network even when storage is "remote". Commit forthcoming.
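[Editor's note: for context on the diagnosis above, a minimal sketch of the settings involved. In the version of targets from this era, storage = "remote" was intended to make the worker write the target's value to the data store itself, rather than shipping the serialized object back to the master process over the clustermq socket; later releases renamed these values (e.g. "remote" became "worker"). The comments are the editor's gloss, not the maintainer's.]

```r
library(targets)

tar_option_set(
  storage = "remote",   # worker saves the value to _targets/objects/ itself,
                        # so the ~10 GB data frame never crosses the socket
  retrieval = "remote", # worker loads upstream dependencies directly
  memory = "transient"  # release targets from memory after they are used
)
```

The bug reported here was that the master was serializing and sending the data anyway, despite these settings.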

wlandau-lilly added a commit that referenced this issue Oct 5, 2020
@wlandau
Member

wlandau commented Oct 5, 2020

1c3890a might have fixed it. Testing now.

@wlandau
Member

wlandau commented Oct 5, 2020

Fixed.

rows <- 5e5
cols <- 1e3
data <- data.frame(matrix(runif(rows * cols), nrow = rows))
vroom::vroom_write(data, "data.tsv")

library(targets)
tar_script({
  options(clustermq.scheduler = "multicore", crayon.enabled = FALSE)
  tar_option_set(
    memory = "transient",
    storage = "remote",
    retrieval = "remote"
  )
  tar_pipeline(
    tar_target(
      file,
      "data.tsv",
      format = "file"
    ),
    tar_target(
      data,
      readr::read_tsv(file, col_types = readr::cols()),
      format = "fst_tbl"
    )
  )
})

tar_destroy()
system.time(tar_make())
#> ● run target file
#> ● run target data
#>    user  system elapsed 
#> 152.158  41.953 245.836

tar_destroy()
system.time(tar_make_clustermq())
#> ● run target file
#> ● run target data
#> Master: [231.6s 0.0% CPU]; Worker: [avg 74.4% CPU, max 554682585.0 Mb]
#>    user  system elapsed 
#> 150.492  23.516 233.179

Created on 2020-10-05 by the reprex package (v0.3.0)

@wlandau wlandau closed this as completed Oct 5, 2020
@mattwarkentin
Contributor Author

Amazing! I'm glad I finally decided to take the time to figure out where the roadblock was and produce a reprex. This was by far the greatest source of confusion and friction for me, so I can't wait to install the dev version and test it out.

@mattwarkentin
Contributor Author

For posterity, I think #157 was related to this fix.
