Buffer overflow error for large number of jobs #96

Open

mschubert opened this issue Aug 1, 2015 · 10 comments

@mschubert
When submitting a large number of jobs, BatchJobs still fails for me (this is somewhat similar to #58, but the number of jobs is almost 50 times higher).

I submit between 275,000 and 500,000 jobs in 1, 2, 10, and 25 chunks.

BatchJobs_1.7 BBmisc_1.9

Submitting jobs in one chunk always works, so does sending 2 chunks. 10 chunks sometimes works and sometimes doesn't, and 25 chunks never works.

With staged.queries = TRUE (otherwise the same behaviour as in #58), and independent of db.options = list(pragmas = c("busy_timeout=5000", "journal_mode=WAL")) and fs.timeout:

  • the submitJobs() call itself runs fine until return(invisible(ids))
  • the message "Might take some time, do not interrupt this!" appears
  • after that, all jobs are killed/crash/disappear
  • if I then call waitForJobs(), R segfaults
Might take some time, do not interrupt this!
Syncing registry ...
Waiting [S:550000 D:0 E:0 R:0] |+                                 |   0% (00:00:00)
Status for 550000 jobs at 2015-08-01 18:00:21
Submitted: 550000 (100.00%)
Started:   550000 (100.00%)
Running:      0 (  0.00%)
Done:         0 (  0.00%)
Errors:       0 (  0.00%)
Expired:   550000 (100.00%)
Time: min=NAs avg=NAs max=NAs
       n submitted started done error running expired t_min t_avg t_max
1 550000    550000  550000    0     0       0  550000    NA    NA    NA
*** buffer overflow detected ***: /usr/lib/R/bin/exec/R terminated
======= Backtrace: =========
/lib64/libc.so.6(__fortify_fail+0x37)[0x3ec4302527]
/lib64/libc.so.6[0x3ec4300410]
/lib64/libc.so.6[0x3ec42ff2c7]
/usr/lib/R/lib/libR.so(+0xfc01b)[0x2b722845901b]
/usr/lib/R/lib/libR.so(+0xfff4e)[0x2b722845cf4e]
/usr/lib/R/lib/libR.so(+0xfffbf)[0x2b722845cfbf]
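
For completeness, the submission pattern looks like this (a minimal sketch with a toy job function and a hypothetical registry name; 25 chunks matches the failing case):

library(BatchJobs)  # also attaches BBmisc, which provides chunk()

reg <- makeRegistry(id = "bigrun", file.dir = "bigrun-files")
batchMap(reg, function(i) i^2, i = seq_len(550000))

# submit in 25 chunks, the configuration that never works here
ids <- chunk(getJobIds(reg), n.chunks = 25, shuffle = TRUE)
submitJobs(reg, ids = ids)
waitForJobs(reg)   # this is where R segfaults
showStatus(reg)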
@mllg
Member

mllg commented Aug 3, 2015

I've just discovered a bug: setting fs.timeout disabled staging (D'OH!). I've already pushed a fix for this (cefc9c2).

Furthermore, I tried to get rid of all database problems once and for all by avoiding all database transactions on the nodes (although read-only should in theory be safe). This is now also in the devel branch and is automatically enabled if staged.queries is set. Could you please give it a try on your system?
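
If you want to replicate the setup: staged queries are switched on in the BatchJobs configuration, e.g. in a .BatchJobs.R file (a minimal sketch; the template file name and fs.timeout value are placeholders):

# .BatchJobs.R
cluster.functions <- makeClusterFunctionsLSF("lsf.tmpl")
staged.queries <- TRUE   # with cefc9c2 this now survives setting fs.timeout
fs.timeout <- 5
db.options <- list(pragmas = c("busy_timeout=5000", "journal_mode=WAL"))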

@mschubert
Author

I'm having issues with submitJobs() in cefc9c2.

Backend: LSF
Submitting 292 chunks / 1457500 jobs.

In writeFiles(), Map(...) causes the submission of jobs to take over 5 hours total.

I realize that the number of jobs is high, but the number of chunks is low enough that submission should finish within about a minute (without an explicit job.delay; note that with staged.queries = FALSE submission takes about 30 minutes, but then some jobs fail because of a locked db).

I'm thinking that this is mainly due to file system load, which already starts during submission and continues while jobs are running. I designed my algorithm so that the bigger objects are all passed via more.args and each individual function call (= job) only needs an index into those objects (a couple of bytes). Still, the file system becomes a lot less responsive after the first couple of chunks are started.

If I understand my debugging correctly, BatchJobs starts a new R instance for every job (not per chunk, for both staged.queries = TRUE and FALSE)? If that's the case, then BatchJobs cannot be used for more than approximately 1M function calls with > 100 chunks (which would be a pity).
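
For context, the job layout described above looks roughly like this (a sketch with hypothetical names):

big_matrix <- matrix(rnorm(1e6), nrow = 1000)      # shared data, passed once via more.args
job_fun <- function(idx, data) mean(data[, idx])   # each job only receives a small index

batchMap(reg, job_fun, idx = seq_len(ncol(big_matrix)),
         more.args = list(data = big_matrix))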

@mllg
Member

mllg commented Aug 5, 2015

Well, that's basically the tradeoff: you either rely on the database to retrieve the information and risk running into a locked database, or you store everything on the file system to avoid querying the database on the nodes, which can be a big overhead.

Are you sure that https://github.com/tudo-r/BatchJobs/blob/master/R/writeFiles.R#L34 is the bottleneck? I tried it myself with 1e6 jobs and found that it takes less than 5 minutes on a local (but slow) HDD. If you're sure that the Map() takes most of the time, I'll try to optimize this for high-latency file systems, i.e. write the job information in chunks.
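
The chunked variant would look roughly like this (a sketch of the idea, not the actual writeFiles() code; file names are made up):

# current: one saveRDS() per job, i.e. ~1.5M fs round trips in your case
# Map(function(id, info) saveRDS(info, file = jobFile(id)), ids, infos)

# chunked: one write per chunk, i.e. 292 round trips
for (i in seq_along(chunks))
  saveRDS(chunks[[i]], file = file.path(reg$file.dir, sprintf("jobs-chunk-%i.rds", i)))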

@mschubert
Author

The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need it, and if not, you could avoid most of the db or file system load altogether: for instance, you could use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).

I am sure that the Map() is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at this specifically. I can do a full profiling, but this will take at least a day because of the nature of the issue.

@mllg
Member

mllg commented Aug 5, 2015

The other question would be: is it really required for the master to have information about every job on each slave at any time? I certainly don't need it, and if not, you could avoid most of the db or file system load altogether: for instance, you could use one R session per chunk and only report chunk statistics. This would make the whole thing a lot more scalable (but I realize this would be a major undertaking).

We kind of do this already by using a buffer which is flushed every 5-10 minutes (cf. doJob.R, msg.buf).
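
Schematically (a sketch of the buffering idea, not the actual doJob.R code):

msg.buf <- character(0L)
last.flush <- Sys.time()

bufferMessage <- function(msg, buf.file, flush.secs = 300) {
  msg.buf <<- c(msg.buf, msg)                  # collect messages in memory
  if (difftime(Sys.time(), last.flush, units = "secs") > flush.secs) {
    cat(msg.buf, file = buf.file, sep = "\n", append = TRUE)  # one fs write for many jobs
    msg.buf <<- character(0L)
    last.flush <<- Sys.time()
  }
}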

I am sure that the Map() is the problem for the first 20 chunks submitted (I stepped through it in the debugger). After the first couple of chunks it also gets remarkably slower, but I haven't looked at this specifically. I can do a full profiling, but this will take at least a day because of the nature of the issue.

Please try if 0f914a2 mitigates the runtime issues.

@mschubert
Author

0f914a2: submission is down to 9 minutes (30x speedup), "Syncing registry" afterwards takes 10 minutes.

File system load overall is still too high (at about 300 jobs running), I had to manually stop and resume jobs to keep the volume responsive.

Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When it completed after 4 days, R crashed with:

*** caught bus error ***
address 0x2abbf7d08e50, cause 'non-existent physical address'
Bus error (core dumped)

@mllg
Member

mllg commented Oct 1, 2015

0f914a2: submission is down to 9 minutes (30x speedup), "Syncing registry" afterwards takes 10 minutes.

Well, that sounds acceptable to me.

File system load overall is still too high (at about 300 jobs running), I had to manually stop and resume jobs to keep the volume responsive.

I'll set the update frequency for chunked jobs more conservatively. But also check your log files -- if you produce a lot of output, this could be the problem. You could probably try to redirect logs to /dev/null.
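
If the log volume comes from the jobs' own output, you can also discard it from within R (a sketch with hypothetical names, independent of where the scheduler writes its log files):

quiet <- function(fun) {
  function(...) {
    sink("/dev/null")   # discard everything the job prints to stdout
    on.exit(sink())
    fun(...)
  }
}
batchMap(reg, quiet(my_job_fun), i = seq_len(n.jobs))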

Reducing the results overnight only got through 4%; the time remaining is shown as 99 hours. When it completed after 4 days, R crashed with:

*** caught bus error ***
address 0x2abbf7d08e50, cause 'non-existent physical address'
Bus error (core dumped)

I think I did not touch anything here ... have you solved this? I would assume this is a (temporary) file system problem. We just iterate over the results and load them, nothing special.

mllg added a commit that referenced this issue Oct 1, 2015
@mschubert
Author

The crash was caused by a bug in dplyr (which I used to assemble my results afterwards; this has nothing to do with BatchJobs).

The long time it takes to reduce the results remains, however.

@mllg
Member

mllg commented Oct 1, 2015

Well, maybe reading 500,000 files just takes some time. But if you give me some more information, I can try to optimize this step a bit.

  • What does your result look like, i.e. how big is a single result (as reported by object.size())?
  • What function do you call exactly? reduceResults()?
  • What is the runtime of a reduction step? Do you think it is linear/quadratic/exponential wrt the number of jobs?
  • Is it possible to use reduceResultsList() instead? This one pre-allocates and thus does not need to copy aggr in every iteration, which is a likely bottleneck (see the sketch after this list).
  • Is reduceResultsParallel() a viable alternative?
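
For illustration, the two reduction styles side by side (a sketch assuming one numeric result per job):

# folds an accumulator; growing aggr copies it in every iteration
res <- reduceResults(reg, fun = function(aggr, job, res) c(aggr, res),
                     init = numeric(0))

# pre-allocates a list, no repeated copying
res <- unlist(reduceResultsList(reg, fun = function(job, res) res))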

@mschubert
Author

Thank you for your continued efforts!

  • My result is a single numeric representing the mean cross-validation error of a trained glmnet model
  • I already use reduceResultsList()
  • reduceResultsParallel() would put additional strain on the file system and I'd like to avoid that
  • I haven't tested how the timing behaves with different numbers of jobs, and unfortunately I don't have the time for the next two weeks (I'll update here and ping you once I get around to it)

In general, I think that the approach of having one result file with one numeric per function call is not feasible with >1M calls on a high-latency fs.

I also played around with rzmq, which bypasses the file system altogether, and saw 100x speedups (see the sketch below).
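
The pattern I tested looks roughly like this (a minimal sketch with a hypothetical host and port, not an integration with BatchJobs): each worker pushes its single numeric straight to the master over a socket, so no result file is written at all.

library(rzmq)

# master: collect results over a PULL socket
ctx <- init.context()
collector <- init.socket(ctx, "ZMQ_PULL")
bind.socket(collector, "tcp://*:5557")
res <- receive.socket(collector)    # blocks until a worker sends

# worker: push one numeric result
ctx <- init.context()
out <- init.socket(ctx, "ZMQ_PUSH")
connect.socket(out, "tcp://master-node:5557")
send.socket(out, data = 0.042)      # e.g. one cross-validation error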
