Segmentation fault running OPT1.3b f32 with data tiling enabled #14398
I don't see anything that stands out as problematic about the above IR (for dispatch 24/25/26) -- I notice that dispatch 26 sets the encoding for the 128x8192x2048 matmul at dispatch 27, and this is right around where the segfault occurs in end-to-end/'end-to-slice' execution:
I can't seem to reproduce the issue with a single dispatch (probably user error?) so I'm trying the break flag with specific dispatch names to make sure we aren't looking at the wrong dispatch, since the current suspect (dispatch 25) seems somewhat benign.
Good news time - this is fixed by #14349. This is actually a compiler bug. The
Since the assertion failure is about exactly the kind of thing that #14349 is refactoring, I gave it a try, and it does succeed -- the

Note - for running locally on my machine I had to drop the
cc @hanhanW
@bjacob Did the model compile and run e2e without segfault on your patch? I am trying to reproduce with #14349, but the segfault is still showing up, though it seems to happen after dispatch 27 instead. Perhaps I'm not running something correctly, but from your explanation I'm surprised not to see a compile-time error with assertions enabled. This was with and without

Compile command (break at smallest failing dispatch index):
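For reference, a hedged sketch of what such a compile command could look like. Only the data-tiling and break-dispatch flags are taken from this thread; the model path, target backend, and CPU-feature settings are assumptions, and exact flag spellings vary across IREE versions:

```shell
# Hypothetical reconstruction, not the original command from this thread.
iree-compile opt-1.3b-f32.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-flow-enable-data-tiling \
  --iree-flow-break-dispatch=@forward:25 \
  -o opt-1.3b-f32.vmfb
```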
Yes, for me the model did run e2e without segfault. Sorry that success isn't reproducing on your end. I'll dig some more (maybe I'll run with ASan to catch cases where I was just getting lucky) and I'll report back here.
Re

Note that your compile command also has
That seems to have worked. Thanks so much! I will post perf results in the SHARK tracker shortly. From what I see, this should bring us closer to PyTorch numbers. Does this workaround mean we are foregoing some optimizations for this CPU arch?
Yes, this is just a debugging step. No need to look into perf results now; we don't want to forego AVX-512 if the target is AVX-512-capable. So now we know that there are two separate issues there:
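As a sketch of the kind of debugging step being discussed (an assumption about the elided flags, not the actual command exchanged above): forcing a pre-AVX-512 feature set rules out the AVX-512 tile shapes while debugging, at the cost of the optimizations mentioned here.

```shell
# Assumed shape of the temporary workaround: restrict codegen to AVX2
# so the AVX-512 code paths cannot be selected.
iree-compile opt-1.3b-f32.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=+avx,+avx2,+fma \
  --iree-flow-enable-data-tiling \
  -o opt-1.3b-f32.avx2.vmfb
```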
OK. I will wait on the perf results. I happened upon an issue with sequence length 8, which, with the latest flags and #14349, gives a compile-time error. I'm sure the

To reproduce (with build on changes from #14349):
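(A hedged sketch of that reproduction; the sequence-length-8 IR path is a placeholder and the flags are assumed to match the compile command sketched earlier in the thread:)

```shell
# Hypothetical: same compile flags as above, on the sequence-length-8 IR;
# the error here is at compile time, so no benchmark run is needed.
iree-compile opt-1.3b-f32-seqlen8.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-flow-enable-data-tiling \
  -o opt-1.3b-f32-seqlen8.vmfb
```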
I'm here to help with debugging if you need an extra pair of eyes or hands. I'll be testing cases to see if I can help narrow down either of these issues unless you need my efforts pointed elsewhere.

Edit: the
I can reproduce the
Ok, this makes sense. We probably have a dynamically shaped allocation that isn't hoisted out of inner loops.
I'm making a minimized testcase, should have it in a moment.
Filed #14406 with minimized testcase. It does look related to fusions, as it only triggers when sufficiently many of these
Confirmed that the updated #14349 avoids the issue from #14398 (comment) (independently of @hanhanW's fix to the underlying problem). Still debugging some apparent compile-time regression before I merge, but this should be unblocked.
…ze choice to MaterializeEncoding. (#14349)

This fixes #11632, by introducing a materializable `upper_bound_tile_size` instead of hardcoding a fixed padding amount at Flow, and fixes it in sufficient generality to also solve the problem for narrow matmuls - let's explain that in more detail, as this is an important part of what this PR is doing.

For each combination of element types and each target, the MaterializeEncoding pass selects appropriate matmul tile shapes. Input tensors get padded to the next multiple of the tile size. The padding increases the inherent arithmetic cost of the problem at hand. When, along some dimension, the original tensor size is smaller than the tile size, that can result in particularly large overhead. The extreme case, which is also a very common case, is matrix-times-vector multiplication. The "vector" right-hand side is really a matrix with one dimension size equal to 1, so if the general matmul tile shape along that dimension is 8 or 16, as is usually the case, that can be an 8x or 16x increase in the inherent arithmetic cost of the matmul op.

The solution to that is to adjust MaterializeEncoding tile shapes to narrow dimensions. We had some logic in place to deal with that, but #11632 was leaving it moot: the flow-level padding of everything to the next multiple of 16 meant that our logic there never really had a chance of kicking in. With #11632 being fixed, this PR was the opportunity to also fix that along the way, and to ensure that the solution to #11632 worked also in that respect. As matrix-times-vector products were the common case that suffered the most from #11632, it would have been too bad to "solve" #11632 without addressing that. By the way, matrix-times-vector is only the extreme case, but other narrow cases matter too. When, e.g. on AVX-512, the general matmul tile size is 16, even width-8 matmuls (MxKx8) were suffering from 2x widening. So the solution in this PR is making sure to address all narrow cases, defined as whenever a tensor dimension size is less than the general tile size.

The difficulty was that when MaterializeEncoding runs on a dispatch function, it runs on an already-padded tensor; even as this PR introduces `upper_bound_tile_size`, that only makes it possible to select the right padding amount, but there's still a `tensor.pad` op and it's still getting in the way of knowing the actual, original tensor shape for the purpose of adjusting tile shapes for narrow cases. Moreover, as `MaterializeEncoding` is a type-converter pass, it can't just walk from a Value up to its defining op to find the pre-padding tensor. There are no values there, only types. So the information about the pre-padding tensor shape has to be part of the tensor type that `MaterializeEncoding` sees, that is, the padded tensor type. The solution to that problem in this PR is to add an `original_type` field to `EncodingAttr`.

Fixes #11632. Fixes a compiler issue encountered in #14398, but not the originally reported runtime crash by itself. This now also includes the removal of a now-useless VMVX pass, which was originally split out into #14383.
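In equation form, the padding overhead described in this PR message - the factor by which the inherent arithmetic cost grows when a dimension of size N is padded up to the general tile size n_0 - is:

```math
\mathrm{overhead}(N, n_0) = \frac{\lceil N / n_0 \rceil \, n_0}{N},
\qquad \mathrm{overhead}(1, 16) = 16,
\qquad \mathrm{overhead}(8, 16) = 2
```

which reproduces both the 16x matrix-times-vector blowup and the 2x widening of MxKx8 matmuls on AVX-512 cited above.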
@monorimet This is fixed for me now (I believe by #14349 but there have been a flurry of related commits over the past few days).
(EDIT - moving that performance discussion back to nod-ai/SHARK#1589 (comment)) @monorimet - Current benchmark results give a flavor of the performance to expect. Note - testing on an Intel Skylake Xeon CPU with AVX-512. Compiling with
So, data-tiling alone is a ~8x speedup. Ukernels alone are not yet good. But I'll get to that now, and it will be at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is
Wow, fantastic improvement with data tiling!
What happened?
I am continuing from nod-ai/SHARK#1589 in an investigation of the segmentation faults that occur when trying to run OPT-1.3b at f32 precision with `--iree-flow-enable-data-tiling`. I am having some trouble making a dispatch-level reproducer but will share the smallest reproduction as well as relevant IR.

I have compiled with `--iree-flow-break-dispatch=@forward:24` and the data tiling flag, and the resulting .vmfb successfully runs through iree-benchmark-module. With `--iree-flow-break-dispatch=@forward:25`, the .vmfb segfaults in iree-benchmark-module. Full reproduction steps are given below for this case, and here is a download link to the full IR dump after `iree-flow-outline-dispatch-regions`.
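A hedged sketch of the benchmark invocation (the module path, entry function, and input are illustrative assumptions, and flag spellings follow recent iree-benchmark-module builds):

```shell
# Hypothetical invocation: runs the module sliced by --iree-flow-break-dispatch.
# With the break at @forward:24 this completes; at @forward:25 it segfaults.
iree-benchmark-module \
  --module=opt-1.3b-f32.vmfb \
  --device=local-task \
  --function=forward \
  --input=1x8xi64=0
```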
.From the above IR dump I've isolated the func.func from dispatch 25:
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
6e49915
Additional context
No response