
GHA for Pytorch failing due to memory exhaustion #540

Closed
bkmartinjr opened this issue Jun 14, 2023 · 0 comments · Fixed by #542
Labels: bug (Something isn't working)


bkmartinjr commented Jun 14, 2023

We need to reduce the memory use of the PyTorch loader unit tests. This failure occurs in roughly 50% of runs (estimated).

This is a recurrence of #533
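One way to slim the tests down is to cap the number of DataLoader workers the tests request based on a memory budget. A minimal sketch of such a cap; the helper name `pick_num_workers` and the per-worker memory estimates are hypothetical, not from this repo (and `num_workers=0` remains the cheapest option, since PyTorch then loads in-process with no worker subprocesses at all):

```python
def pick_num_workers(requested: int,
                     mem_budget_mb: int = 512,
                     per_worker_mb: int = 256) -> int:
    """Cap a DataLoader's worker count so spawned workers fit a memory budget.

    per_worker_mb is a rough, assumed estimate of each worker's footprint;
    it is not measured from the actual test processes.
    """
    affordable = mem_budget_mb // per_worker_mb
    # Never return a negative count; 0 means in-process loading.
    return max(0, min(requested, affordable))
```

The result could then be passed straight to `experiment_dataloader(dp, num_workers=...)` in the test, so memory-constrained CI runners automatically fall back to fewer (or zero) workers.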

=================================== FAILURES ===================================
_ test__multiprocessing__returns_full_result[6-3-X_layer_names0-pytorch_x_value_gen] _

soma_experiment = <Experiment '/tmp/pytest-of-runner/pytest-1/test__multiprocessing__returns0/exp' (open for 'r') (2 items)
    'ms': 'f...p/ms' (unopened)
    'obs': 'file:///tmp/pytest-of-runner/pytest-1/test__multiprocessing__returns0/exp/obs' (unopened)>

    @pytest.mark.experimental
    # noinspection PyTestParametrized
    @pytest.mark.parametrize("n_obs,n_vars,X_layer_names,X_value_gen", [(6, 3, ("raw",), pytorch_x_value_gen)])
    def test__multiprocessing__returns_full_result(soma_experiment: Experiment) -> None:
        dp = ExperimentDataPipe(
            soma_experiment,
            measurement_name="RNA",
            X_name="raw",
            obs_column_names=["label"],
        )
        dl = experiment_dataloader(dp, num_workers=2)
    
>       full_result = list(iter(dl))

api/python/cellxgene_census/tests/experimental/ml/test_pytorch.py:276: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:441: in __iter__
    return self._get_iterator()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:388: in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/site-packages/torch/utils/data/dataloader.py:1015: in __init__
    self._worker_result_queue = multiprocessing_context.Queue()  # type: ignore[var-annotated]
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/context.py:103: in Queue
    return Queue(maxsize, ctx=self.get_context())
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/queues.py:43: in __init__
    self._rlock = ctx.Lock()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/context.py:68: in Lock
    return Lock(ctx=self.get_context())
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/synchronize.py:162: in __init__
    SemLock.__init__(self, SEMAPHORE, 1, 1, ctx=ctx)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/synchronize.py:80: in __init__
    register(self._semlock.name, "semaphore")
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:147: in register
    self._send('REGISTER', name, rtype)
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:154: in _send
    self.ensure_running()
/opt/hostedtoolcache/Python/3.9.17/x64/lib/python3.9/multiprocessing/resource_tracker.py:121: in ensure_running
    pid = util.spawnv_passfds(exe, args, fds_to_pass)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

path = '/opt/hostedtoolcache/Python/3.9.17/x64/bin/python'
args = ['/opt/hostedtoolcache/Python/3.9.17/x64/bin/python', '-c', 'from multiprocessing.resource_tracker import main;main(16)']
passfds = (8, 16)

    def spawnv_passfds(path, args, passfds):
        import _posixsubprocess
        passfds = tuple(sorted(map(int, passfds)))
        errpipe_read, errpipe_write = os.pipe()
        try:
>           return _posixsubprocess.fork_exec(
                args, [os.fsencode(path)], True, passfds, None, None,
                -1, -1, -1, -1, -1, -1, errpipe_read, errpipe_write,
                False, False, None, None, None, -1, None)
E               OSError: [Errno 12] Cannot allocate memory
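The traceback shows the allocation failing inside `fork_exec`, i.e. while creating the multiprocessing machinery itself, before any data is loaded. On Linux, the default `fork` start method duplicates the parent's address space, so a memory-heavy test process can hit `ENOMEM` at fork time even when the workers need little memory. A sketch of the alternative (this is a general CPython pattern, not a fix taken from this repo):

```python
import multiprocessing as mp

# "fork" (the Linux default) copies the parent's address space; under CI
# memory pressure the underlying fork_exec can raise
# OSError: [Errno 12] Cannot allocate memory.
# A "spawn" context starts a fresh interpreter instead: slower worker
# startup, but a much smaller peak-memory footprint at creation time.
ctx = mp.get_context("spawn")
queue = ctx.Queue()  # same API as the default fork-based context
```

PyTorch's `DataLoader` accepts a `multiprocessing_context` argument (e.g. `multiprocessing_context="spawn"`), which is one place such a change could be applied, at the cost of slower worker startup in the tests.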