
GHA for Pytorch failing due to memory exhaustion #540

Closed
bkmartinjr opened this issue Jun 14, 2023 · 0 comments · Fixed by #542
Labels: bug (Something isn't working)


bkmartinjr commented Jun 14, 2023

We need to reduce the memory use of the PyTorch loader unit tests. This failure occurs in roughly 50% of runs (estimated).

This is a recurrence of #533
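One way to slim the tests down is to cap the number of DataLoader workers the tests request based on a memory budget. A minimal sketch of such a cap; the helper name `pick_num_workers` and the per-worker memory estimates are hypothetical, not from this repo (and `num_workers=0` remains the cheapest option, since PyTorch then loads in-process with no worker subprocesses at all):

```python
def pick_num_workers(requested: int,
                     mem_budget_mb: int = 512,
                     per_worker_mb: int = 256) -> int:
    """Cap a DataLoader's worker count so spawned workers fit a memory budget.

    per_worker_mb is a rough, assumed estimate of each worker's footprint;
    it is not measured from the actual test processes.
    """
    affordable = mem_budget_mb // per_worker_mb
    # Never return a negative count; 0 means in-process loading.
    return max(0, min(requested, affordable))
```

The result could then be passed straight to `experiment_dataloader(dp, num_workers=...)` in the test, so memory-constrained CI runners automatically fall back to fewer (or zero) workers.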

=================================== FAILURES ===================================
_ test__multiprocessing__returns_full_result[6-3-X_layer_names0-pytorch_x_value_gen] _

soma_experiment = <Experiment '/tmp/pytest-of-runner/pytest-1/test__multiprocessing__returns0/exp' (open for 'r') (2 items)
    'ms': 'f...p/ms' (unopened)
    'obs': 'file:///tmp/pytest-of-runner/pytest-1/test__multiprocessing__returns0/exp/obs' (unopened)>

    @pytest.mark.experimental
    # noinspection PyTestParametrized
    @pytest.mark.parametrize("n_obs,n_vars,X_layer_names,X_value_gen", [(6, 3, ("raw",), pytorch_x_value_gen)])
    def test__multiprocessing__returns_full_result(soma_experiment: Experiment) -> None:
        dp = ExperimentDataPipe(
            soma_experiment,
            measurement_name="RNA",
            X_name="raw",
            obs_column_names=["label"],
        )
        dl = experiment_dataloader(dp, num_workers=2)
    
>       full_result = list(iter(dl))

api/python/cellxgene_census/tests/experimental/ml/test_pytorch.py:276: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:441: in __iter__
    return self._get_iterator()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:388: in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1015: in __init__
    self._worker_result_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/context.py:103: in Queue
    return Queue(maxsize, ctx=self.get_context())
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/queues.py:43: in __init__
    self._rlock = ctx.Lock()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/context.py:68: in Lock
    return Lock(ctx=self.get_context())
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/synchronize.py:162: in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/synchronize.py:80: in __init__
    register(self._semlock.name, "semaphore")
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:147: in register
    self._send('REGISTER', name, rtype)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:154: in _send
    self.ensure_running()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:121: in ensure_running
    pid = util.spawnv_passfds(exe, args, fds_to_pass)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = '/opt/hostedtoolcache/Python/3.9.17/x64/bin/python'
args = ['/opt/hostedtoolcache/Python/3.9.17/x64/bin/python', '-c', 'from multiprocessing.resource_tracker import main;main(16)']
passfds = (8, 16)

    def spawnv_passfds(path, args, passfds):
        import _posixsubprocess
        passfds = tuple(sorted(map(int, passfds)))
        errpipe_read, errpipe_write = os.pipe()
        try:
>           return _posixsubprocess.fork_exec(
                args, [os.fsencode(path)], True, passfds, None, None,
                -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
                False, False, None, None, None, -1, None)
E               OSError: [Errno 12] Cannot allocate memory
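The traceback shows the allocation failing inside `fork_exec`, i.e. while creating the multiprocessing machinery itself, before any data is loaded. On Linux, the default `fork` start method duplicates the parent's address space, so a memory-heavy test process can hit `ENOMEM` at fork time even when the workers need little memory. A sketch of the alternative (this is a general CPython pattern, not a fix taken from this repo):

```python
import multiprocessing as mp

# "fork" (the Linux default) copies the parent's address space; under CI
# memory pressure the underlying fork_exec can raise
# OSError: [Errno 12] Cannot allocate memory.
# A "spawn" context starts a fresh interpreter instead: slower worker
# startup, but a much smaller peak-memory footprint at creation time.
ctx = mp.get_context("spawn")
queue = ctx.Queue()  # same API as the default fork-based context
```

PyTorch's `DataLoader` accepts a `multiprocessing_context` argument (e.g. `multiprocessing_context="spawn"`), which is one place such a change could be applied, at the cost of slower worker startup in the tests.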