Handling large data sets in a pipeline #157
Comments
While I'm thinking out loud... does it possibly make sense to avoid hashing a file when [...]
Would you share a simplified version of the pipeline code and a smaller dataset that reproduces the slowness? We need an empirical approach to identify bottlenecks like these. A profiler would be ideal, e.g. [...]
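(The specific profiler recommended above has been elided; as a stand-in, here is a minimal sketch with base R's `Rprof()`, where `slow_target` is a placeholder target name.)

```r
# Profile a single slow target build with the base R sampling profiler.
Rprof("tar_make_profile.out")
targets::tar_make(names = "slow_target")
Rprof(NULL)
summaryRprof("tar_make_profile.out")  # see which calls dominate the runtime
```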
It could also be a silent crash from running out of memory on your machine.
Would a silent crash not error out the pipeline?
When you have a target with [...]
At least on clusters, memory issues can cause workers to just hang indefinitely. Not sure about exclusively local pipelines. Unless you do something fancy with cues, [...] To avoid hashing that large tsv file, you could pursue a manual workaround with `tar_change()`:

```r
tar_change(
  previously_slow_target,
  command_that_uses_big_file("big_file_path.tsv"),
  change = file.mtime("big_file_path.tsv"), # Rerun when the modification time changes.
  format = "rds" # anything except "file"; doesn't even need to be user-supplied
)
```
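(A sketch of the cue-based route mentioned above, assuming a version of `targets` where `tar_cue()` has a `file` argument; the target name is a placeholder.)

```r
library(targets)
# Inside _targets.R: a file target whose cue skips the file-change check,
# so targets does not re-hash the big tsv on every tar_make().
tar_target(
  big_file,
  "big_file_path.tsv",
  format = "file",
  cue = tar_cue(file = FALSE)
)
```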
Just realized the command for that [...]
I think I may have found the root cause of the original issue, and it wasn't [...]
Hi @wlandau, I keep getting this gigantic error. I tried to reproduce it with a simpler example, but it doesn't reproduce, so I am kinda at a loss. This error always occurs at the same spot in the pipeline. For what it's worth, when I load the targets needed to build this specific target, it works fine interactively, so I'm not sure why it fails when the pipeline is built. I can't make any sense of the traceback; hopefully you can:

[...]
The traceback helps. Could be an instance of #147, which was due to qsbase/qs#42 and fixed in traversc/stringfish@59eff54. The latest commit of [...]
If it still happens and remains elusive, let's keep an eye on it and try to reproduce it with just [...]
Okay, so... the issue has been fixed. My series of updates:

```r
remotes::install_github("traversc/stringfish")
remotes::install_github("traversc/qs")
remotes::install_github("wlandau/tarchetypes")
remotes::install_github("wlandau/targets")
```

Then I rebuilt the pipeline and everything worked!
As always, thanks again for saving me from myself.
Possibly naive question: now that I have everything working, and it's very important that this keeps working while I finish up this project, I want to use `renv`. Since I define the packages used for my targets with `tar_option_set()`, `renv`'s dependency detection doesn't see them. I thought if I attached all these packages before initializing `renv`, it would pick them up.
For context: `?renv::dependencies`
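(As a quick illustration of that context, and assuming the pipeline script is named `_targets.R`, you can inspect what `renv`'s static dependency scan actually detects; results vary by `renv` version.)

```r
# List the package dependencies renv discovers in the pipeline script.
deps <- renv::dependencies("_targets.R")
unique(deps$Package)
```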
It's a good question, I'm running into that problem too. Not sure if I'll build something directly into `targets`, but here are a few workarounds.

Manual script that doesn't run

Keep using `tar_option_set()`, and maintain a separate script of `library()` calls by hand (one you never actually run) so `renv` can detect the packages.

Automatically written script that doesn't run

This one is like the previous workaround, except you automatically generate the script:

```r
# _targets.R
library(targets)
tar_option_set(packages = c("tidyverse", "qs"))
lines <- paste0("library(", tar_option_get("packages"), ")")
writeLines(lines, "packages.R")
tar_pipeline(...)
```

Just call library(pkg) instead of tar_option_set() for packages

If you don't call `tar_option_set(packages = ...)`, attach the packages directly in `_targets.R`:

```r
# _targets.R
library(targets)
library(tidyverse)
library(qs)
tar_pipeline(...)
```

The advantage here is that `renv` can detect the `library()` calls directly. I think all of these are doable and there's no right answer. Just depends on your preference.
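(A hypothetical follow-up to any of the three workarounds: once a script with `library()` calls exists, the usual `renv` workflow records those packages.)

```r
# Record the packages attached via library() in the renv lockfile.
renv::init()      # first time setting up renv in the project
renv::snapshot()  # later on, to refresh renv.lock after adding packages
```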
Thanks for the response. Hmm, those are all reasonable solutions. I think the second one is the most automated and avoids loading the packages too often. Would it be at all reasonable to perhaps write a function like this?

```r
tar_env <- function(pkgs) {
  stopifnot(all(is.character(pkgs)))
  con <- "_targets/meta/_library.R"
  txt <- c("# Generated by targets::tar_env: do not edit by hand",
           paste0("library(", pkgs, ")"))
  writeLines(txt, con)
  return(pkgs)
}

pkgs <- c("tidyverse", "qs")
tar_option_set(
  packages = tar_env(pkgs)
)
```

Which would create the following file...
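(Reconstructing from the function above with `pkgs = c("tidyverse", "qs")`, the generated `_targets/meta/_library.R` would presumably contain:)

```r
# Generated by targets::tar_env: do not edit by hand
library(tidyverse)
library(qs)
```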
I tested this out and it works as expected. I kinda like this solution because it's a very lightweight addition but it makes the automation both simple and optional. Users who don't use `renv` aren't affected at all. What are your thoughts? I would be more than happy to spin this up into a PR. I am keen to contribute something more to this package than just nagging questions.
I think that's an excellent idea. Sounds like you are eager to submit a PR, so I will plan to review and approve it rather than implementing this feature myself. A few preferences: instead of

```r
lines <- c("# Generated by targets::tar_renv(). Do not edit by hand.",
           paste0("library(", packages, ")"))
```

let's write

```r
lines <- c(
  "# Generated by targets::tar_renv(). Do not edit by hand.",
  paste0("library(", packages, ")")
)
```
Also, for the [...]
Everything you said sounds good - thanks for the advice. Also, thank you for letting me take a shot at this. I know you could implement it yourself much faster, but I appreciate the chance to get some more experience contributing to open-source projects.
Hi @wlandau, I was just about to submit the PR, but it occurs to me that if the expected API is as follows (i.e. call `tar_renv()` from within `_targets.R`):

```r
library(targets)
tar_script({
  tar_option_set(packages = "glue")
  tar_renv()
  tar_pipeline(
    tar_target(foo, head(mtcars), packages = "fs")
  )
})
tar_make()
#> ● run target foo
readLines("_packages.R")
#> [1] "# Generated by targets::tar_env: do not edit by hand"
#> [2] "library(glue)"
```

then we actually miss the target-specific package deps (note that `fs` never makes it into `_packages.R`). Thoughts?
Since [...]
Actually, maybe that won't work because [...]
This gets us the calling file contents...

```r
library(targets)
tar_script({
  tar_option_set(packages = "glue")
  x <- readLines(parent.frame(2)$ofile)
  tar_pipeline(
    tar_target(foo, x, packages = "fs")
  )
})
tar_make()
#> ● run target foo
tar_read(foo)
#> [1] "library(targets)"
#> [2] "tar_option_set(packages = \"glue\")"
#> [3] "x <- readLines(parent.frame(2)$ofile)"
#> [4] "tar_pipeline(tar_target(foo, x, packages = \"fs\"))"
```

Created on 2020-09-22 by the reprex package (v0.3.0)
> Seems like manual control over the [...]

I think this is reasonable because it supports the majority of use cases. The other option I see is to crawl through the pipeline object.

> I would prefer not to use [...]

I would prefer to avoid over-engineering a feature like this. I think it should be simple and cover just common use cases. Special circumstances would require special workarounds, and that's fine. Features are well calibrated when simple stuff is still simple and hard stuff is not necessarily fully automated.
I strongly agree. So I think the function can be left as-is, and users can manually manage the target-specific packages. I can add a line to the documentation about that limitation.
Yes, sounds great.
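(A purely hypothetical sketch of what that manual management could look like; `tar_renv_manual()`, its arguments, and the output path are illustrative names only, not the API that was merged into `targets`.)

```r
# Hypothetical helper: write a library() script covering the globally declared
# packages plus any target-specific packages the user lists by hand.
tar_renv_manual <- function(extras = character(0), path = "_packages.R") {
  pkgs <- union(targets::tar_option_get("packages"), extras)
  lines <- c(
    "# Generated for renv: do not edit by hand.",
    paste0("library(", pkgs, ")")
  )
  writeLines(lines, path)
  invisible(path)
}

# Example: include the target-specific dependency "fs" by hand.
# tar_renv_manual(extras = "fs")
```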
Description
Hi @wlandau,
I just had a few questions about dealing with large data sets and how best to handle them. I have a dataset that is about 500K observations by 2,000 variables, and building the corresponding target in my pipeline has seemed unexpectedly slow.
The pipeline has been stuck on this target for about 12 hours or so (when it only takes minutes to read the data into R and save it to disk as a serialized object). On disk, the file is about a 10 GB tsv. The object already seems to be stored in `_targets/objects/` as a `fst_tbl` (~6 GB), but the pipeline is still building that target. My guess is that, since the object is already saved to disk, the target is spending this huge amount of time hashing the file. Is that likely the issue?

Any suggestions for how I should be handling these large data sets? It seems like hashing is the expensive step in building a target of this size.
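(A small diagnostic sketch, assuming a recent version of `targets`: these helpers show whether the pipeline is still progressing and how long each target's command ran, though they do not isolate hashing time specifically.)

```r
library(targets)
tar_progress()                           # current status of each target
tar_meta(fields = c("time", "seconds"))  # last-updated time and command runtime per target
```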