
Reducing GPU memory usage #1689

Closed
dbobrovskiy opened this issue Nov 27, 2023 · 3 comments · Fixed by #1707
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@dbobrovskiy

I'm opening this issue following the discussion on the forum: https://forum.pyro.ai/t/reducing-mcmc-memory-usage/5639/6.

The problem is that the out-of-place array copying that happens in mcmc.run after the actual sampling can result in an out-of-memory exception even though the sampling itself succeeded. First of all, it would be nice if this could be avoided by transferring the arrays to CPU before any out-of-place operations.
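For context, moving a pytree of device arrays to host can be done with jax.device_get; a minimal sketch, applied here to the dict returned by get_samples() on an MCMC instance mcmc:

import jax

# Copy every array in the samples pytree from GPU to host memory as
# NumPy arrays, so that the device buffers can be released.
samples_on_host = jax.device_get(mcmc.get_samples())

The difficulty is that the out-of-place copies inside mcmc.run happen on device before user code has a chance to do this.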

More generally, GPU memory usage can be controlled by sampling sequentially using post_warmup_state and transferring each batch of samples to CPU before running the next one. However, this doesn't seem to work as expected: subsequent batches require more memory than the first one (see the output for the code below).

mcmc_samples = [None] * (n_samples // 1000)
# set up MCMC
self.mcmc = MCMC(kernel, num_warmup=n_warmup, num_samples=1000, num_chains=n_chains)
for i in range(n_samples // 1000):
    print(f"Batch {i+1}")
    # run MCMC for 1000 samples
    self.mcmc.run(jax.random.PRNGKey(0), self.spliced, self.unspliced)
    # store samples transferred to CPU
    mcmc_samples[i] = jax.device_put(self.mcmc.get_samples(), jax.devices("cpu")[0])
    # reset the mcmc before running the next batch
    self.mcmc.post_warmup_state = self.mcmc.last_state

The code above results in:

Running MCMC in batches of 1000 samples, 2 batches in total.
First batch will include 1000 warmup samples.
Batch 1
sample: 100%|██████████| 2000/2000 [11:18<00:00,  2.95it/s, 1023 steps of size 5.13e-06. acc. prob=0.85]
Batch 2
sample: 100%|██████████| 1000/1000 [05:48<00:00,  2.87it/s, 1023 steps of size 5.13e-06. acc. prob=0.85]
2023-11-24 14:43:23.854505: W external/tsl/tsl/framework/bfc_allocator.cc:485] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.56GiB (rounded to 2750440192)requested by op 

To summarise:

  1. Could the out-of-place operations at the end of sampling optionally be performed on CPU, after transferring the arrays there?
  2. How should one sample sequentially so that memory usage does not grow in the process?
@fehiepsi fehiepsi added the enhancement New feature or request label Nov 28, 2023
@fehiepsi
Member

Could you try replacing those two lines by

        self._states = jax.device_get(states)
        self._states_flat = jax.device_get(states_flat)

If it works, then we can introduce a method named transfer_states_to_host() to perform those device_get operations.
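For illustration, a minimal sketch of what such a helper could look like as a method on the MCMC class, assuming the _states and _states_flat attributes referenced above (the actual implementation may differ):

def transfer_states_to_host(self):
    """Move the collected MCMC states from device (GPU) to host (CPU) memory."""
    # jax.device_get copies every array leaf of a pytree to host as a
    # NumPy array, allowing the corresponding device buffers to be freed.
    self._states = jax.device_get(self._states)
    self._states_flat = jax.device_get(self._states_flat)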

@dbobrovskiy
Author

Yep, it works!
Neither requires extra GPU memory after sampling nor leads to memory increase throughout sequential samples.
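For reference, the batched loop from the original post with that change applied, assuming a transfer_states_to_host() helper like the one proposed above (the method name is hypothetical until the PR lands):

mcmc_samples = [None] * (n_samples // 1000)
mcmc = MCMC(kernel, num_warmup=n_warmup, num_samples=1000, num_chains=n_chains)
for i in range(n_samples // 1000):
    mcmc.run(jax.random.PRNGKey(0), spliced, unspliced)
    # pull the collected states to host so the next batch starts from a
    # clean GPU memory footprint
    mcmc.transfer_states_to_host()
    # get_samples() now returns host-resident (NumPy) arrays
    mcmc_samples[i] = mcmc.get_samples()
    # resume sampling from the last state instead of re-running warmup
    mcmc.post_warmup_state = mcmc.last_state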

@fehiepsi
Member

Thanks! Do you want to make a PR to add a helper doing such work? :)
