Large ops with multiprocess fail with internal error #1231

Open
awni opened this issue Jun 25, 2024 · 3 comments

awni (Member) commented Jun 25, 2024

Run the following with:

mpirun -n 4 -- python <script>
import mlx.core as mx
import socket

hostname = socket.gethostname()
world = mx.distributed.init()
rank = world.rank()

print(f"Distributed available: {mx.distributed.is_available()}")
print(f"Hostname: {hostname}: {rank}")

DIM = 200000

num_processes = mx.distributed.init().size()

print(f'Hostname: {hostname}  num_processes: {num_processes}')

# Two large matrices; data @ w.T below is a single (1600, 200000) @ (200000, 512) matmul.
data = mx.zeros((1600, DIM))
w = mx.zeros((512, DIM))
mx.eval(w, data)

# Repeatedly evaluate the large matmul.
for it in range(100):
    mx.eval(data @ w.T)
    print(it)

Fails on an M2 Ultra with:

libc++abi: terminating due to uncaught exception of type std::runtime_error: [METAL] Command buffer execution failed: Internal Error (0000000e:Internal Error)
angeloskath (Member) commented:

Hmm, this has nothing to do with distributed processing, unfortunately. The following also fails with the same error:

for i in {0..3}; do python <path to script> & done; wait

awni (Member, Author) commented Jun 27, 2024

I guess since this is somewhat rare, and there isn't much, if anything, we can do about it in MLX, I will close this for now.

awni (Member, Author) commented Jul 10, 2024

Some possible workarounds:

The internal error signifies that the GPU kernel timed out. One way to fix that is to decrease the size of the operations (for a single large matmul, see the sketch after the list below).

For example, if you are using LoRA, try:

  • decreasing the batch size
  • decreasing the sequence length
  • decreasing the model size
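
If the oversized operation is a single large matmul, as in the reproduction above, another option in the same spirit is to split it into smaller pieces so that no individual kernel runs long enough to time out. A minimal sketch (the chunk size is an arbitrary, untuned choice):

import mlx.core as mx

DIM = 200000
CHUNK = 25000  # arbitrary; small enough that each kernel finishes quickly

data = mx.zeros((1600, DIM))
w = mx.zeros((512, DIM))
mx.eval(data, w)

# Accumulate data @ w.T over chunks of the inner dimension, so each matmul
# kernel is much smaller than the single (1600, 200000) @ (200000, 512) one.
out = mx.zeros((1600, 512))
for start in range(0, DIM, CHUNK):
    out = out + data[:, start:start + CHUNK] @ w[:, start:start + CHUNK].T
    mx.eval(out)

This trades one large kernel for several smaller ones, so it may be somewhat slower, but each launch is much less likely to hit the timeout.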

Another possible fix is to decrease the number of operations per command buffer. This may work in some cases, but it may also slow things down. You can do that like so:

MLX_MAX_OPS_PER_BUFFER=1 <training command here>
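
For the multiprocess reproduction above, that would look something like:

MLX_MAX_OPS_PER_BUFFER=1 mpirun -n 4 -- python <script>

(Depending on your MPI setup, you may need to forward the variable to the launched processes explicitly, e.g. with mpirun's -x option.)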
