[help] `error = "continue"` does not apply if a job is out of memory; the whole pipeline crashes
#1214
-
In case it helps, I also get OOM errors using crew.cluster (or batchtools) on SLURM when storage is full. I'll see if I can brew up a reprex a little later today; the same reprex could also have a memory-bomb option. It happens when workers write to /tmp, which is a tmpfs on our cluster, so the OOM errors make sense. That was hard to troubleshoot. The failure should be isolated to the worker, but I have a feeling it takes down the whole pipeline, as you've seen - I'll check that with the reprex. It also happens when workers write to /home, which is a network share on our system and therefore shared with the main controller. No surprise that one takes down the whole pipeline, though in our cluster that version is a bit easier to troubleshoot, unless cleanup processes remove the evidence. I think NFS would 'clean up' any files in /home that were left unclosed when the worker went down, which would free up space on the network share and make it look like it never happened.
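One way to sidestep the tmpfs-backed /tmp issue described above is to point worker temporary files at node-local scratch. This is only a sketch: it assumes crew.cluster's `crew_controller_slurm()` and its `script_lines` argument (check the version you have installed), and the scratch path `/scratch/$USER/$SLURM_JOB_ID` is hypothetical - substitute whatever local disk your cluster provides.

```r
# Config sketch: redirect worker temp files away from tmpfs /tmp.
library(crew.cluster)

controller <- crew_controller_slurm(
  workers = 4,
  script_lines = c(
    # Hypothetical local-scratch path; adjust for your cluster.
    "export TMPDIR=/scratch/$USER/$SLURM_JOB_ID",
    "mkdir -p \"$TMPDIR\""
  )
)
```

With `TMPDIR` set in the job script, `tempdir()` and `tempfile()` in the worker R session write to local disk instead of tmpfs, so large temp files no longer count against the job's memory.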
-
Help
Description
There is a job for which it is hard to predict memory usage. That job is likely causing the `callr subprocess failed` error. I don't know for sure, because the offending job ID is not returned. However, my issue is that the whole pipeline crashes, even though I specified `error = "continue"` in `tar_option_set()`. I would prefer the whole pipeline (3-5 days of execution, with 15000+ targets) to continue, so as to save days in case I am far away from the computer. Error below.
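For reference, the setting in question looks like this. A caveat, hedged as my own reading rather than documented behavior: `error = "continue"` can only trap errors raised inside R, so a worker process killed outright by the scheduler (e.g. an OOM kill) may be outside its reach, which would be consistent with the crash described above.

```r
# Config sketch: record errored targets and keep building the rest.
library(targets)

tar_option_set(
  error = "continue"  # do not stop the pipeline when a target errors
)
```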