Segmentation fault running OPT1.3b f32 with data tiling enabled #14398
I don't see anything that stands out as problematic about the above IR (for dispatch 24/25/26) -- I notice that dispatch 26 sets the encoding for the 128x8192x2048 matmul at dispatch 27, and this is right around where the segfault occurs in end-to-end/'end-to-slice' execution:
I can't seem to reproduce the issue with a single dispatch (probably user error?) so I'm trying the break flag with specific dispatch names to make sure we aren't looking at the wrong dispatch, since the current suspect (dispatch 25) seems somewhat benign.
Good news time - this is fixed by #14349. This is actually a compiler bug. The
Since the assertion failure is about exactly the kind of thing that #14349 is refactoring, I gave it a try, and it does succeed -- the

Note - for running locally on my machine I had to drop the
cc @hanhanW
@bjacob Did the model compile and run e2e without segfault on your patch? I am trying to reproduce with #14349, but the segfault is still showing up, though it seems to happen after dispatch 27 instead. Perhaps I'm not running something correctly, but from your explanation I'm surprised not to see a compile-time error with assertions enabled. This was with and without

Compile command (break at smallest failing dispatch index):
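For reference, a hedged sketch of what such a compile command could look like. Only the data-tiling and break-dispatch flags are taken from this thread; the model path, target backend, and CPU-feature settings are assumptions, and exact flag spellings vary across IREE versions:

```shell
# Hypothetical reconstruction, not the original command from this thread.
iree-compile opt-1.3b-f32.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=host \
  --iree-flow-enable-data-tiling \
  --iree-flow-break-dispatch=@forward:25 \
  -o opt-1.3b-f32.vmfb
```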
Yes, for me the model did run e2e without segfault. Sorry that success isn't reproducing on your end. I'll dig some more (maybe I'll run with ASan to catch cases where I was just getting lucky) and I'll report back here.
Re

Note that your compile command also has
That seems to have worked. Thanks so much! I will post perf results in the SHARK tracker shortly. From what I see, this should bring us closer to PyTorch numbers. Does this workaround mean we are foregoing some optimizations for this CPU arch?
Yes, this is just a debugging step. No need to look into perf results now; we don't want to forego AVX-512 if the target is AVX-512-capable. So now we know that there are two separate issues there:
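As a sketch of the kind of debugging step being discussed (an assumption about the elided flags, not the actual command exchanged above): forcing a pre-AVX-512 feature set rules out the AVX-512 tile shapes while debugging, at the cost of the optimizations mentioned here.

```shell
# Assumed shape of the temporary workaround: restrict codegen to AVX2
# so the AVX-512 code paths cannot be selected.
iree-compile opt-1.3b-f32.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu-features=+avx,+avx2,+fma \
  --iree-flow-enable-data-tiling \
  -o opt-1.3b-f32.avx2.vmfb
```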
OK. I will wait on the perf results. I happened upon an issue with sequence length 8, which, with the latest flags and #14349, gives a compile-time error. I'm sure the

To reproduce (with build on changes from #14349):
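(A hedged sketch of that reproduction; the sequence-length-8 IR path is a placeholder and the flags are assumed to match the compile command sketched earlier in the thread:)

```shell
# Hypothetical: same compile flags as above, on the sequence-length-8 IR;
# the error here is at compile time, so no benchmark run is needed.
iree-compile opt-1.3b-f32-seqlen8.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-flow-enable-data-tiling \
  -o opt-1.3b-f32-seqlen8.vmfb
```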
I'm here to help with debugging if you need an extra pair of eyes or hands. I'll be testing cases to see if I can help narrow down either of these issues unless you need my efforts pointed elsewhere.

Edit: the
I can reproduce the
Ok, this makes sense. We probably have a dynamically shaped allocation that isn't hoisted out of inner loops.
I'm making a minimized testcase, should have it in a moment.
Filed #14406 with minimized testcase. It does look related to fusions, as it only triggers when sufficiently many of these
Confirmed that the updated #14349 avoids the issue from #14398 (comment) (independently of @hanhanW's fix to the underlying problem). Still debugging some apparent compile-time regression before I merge, but this should be unblocked.
…ze choice to MaterializeEncoding. (#14349)

This fixes #11632, by introducing a materializable `upper_bound_tile_size` instead of hardcoding a fixed padding amount at Flow, and fixes it in sufficient generality to also solve the problem for narrow matmuls - let's explain that in more detail, as this is an important part of what this PR is doing.

For each combination of element types and each target, the MaterializeEncoding pass selects appropriate matmul tile shapes. Input tensors get padded to the next multiple of the tile size. The padding increases the inherent arithmetic cost of the problem at hand. When, along some dimension, the original tensor size is smaller than the tile size, that can result in particularly large overhead. The extreme case, which is also a very common case, is matrix-times-vector multiplication. The "vector" right-hand side is really a matrix with one dimension size equal to 1, so if the general matmul tile shape along that dimension is 8 or 16, as is usually the case, that can be an 8x or 16x increase in the inherent arithmetic cost of the matmul op.

The solution to that is to adjust MaterializeEncoding tile shapes to narrow dimensions. We had some logic in place to deal with that, but #11632 was leaving it moot: the flow-level padding of everything to the next multiple of 16 meant that our logic there never really had a chance of kicking in. With #11632 being fixed, this PR was the opportunity to also fix that along the way, and to ensure that the solution to #11632 worked also in that respect. As matrix-times-vector products were the common case that suffered the most from #11632, it would have been too bad to "solve" #11632 without addressing that. By the way, matrix-times-vector is only the extreme case, but other narrow cases matter too. When, e.g. on AVX-512, the general matmul tile size is 16, even width-8 matmuls (MxKx8) were suffering from 2x widening. So the solution in this PR is making sure to address all narrow cases, defined as whenever a tensor dimension size is less than the general tile size.

The difficulty was that when MaterializeEncoding runs on a dispatch function, it runs on an already-padded tensor; even as this PR introduces `upper_bound_tile_size`, that only makes it possible to select the right padding amount, but there's still a `tensor.pad` op and it's still getting in the way of knowing the actual, original tensor shape for the purpose of adjusting tile shapes for narrow cases. Moreover, as `MaterializeEncoding` is a type-converter pass, it can't just walk from a Value up to its defining op to find the pre-padding tensor. There are no values there, only types. So the information about the pre-padding tensor shape has to be part of the tensor type that `MaterializeEncoding` sees, that is, the padded tensor type. The solution to that problem in this PR is to add an `original_type` field to `EncodingAttr`.

Fixes #11632. Fixes a compiler issue encountered in #14398, but not the originally reported runtime crash by itself. This now also includes the removal of a now-useless VMVX pass, which was originally split out into #14383.
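In equation form, the padding overhead described in this PR message - the factor by which the inherent arithmetic cost grows when a dimension of size N is padded up to the general tile size n_0 - is:

```math
\mathrm{overhead}(N, n_0) = \frac{\lceil N / n_0 \rceil \, n_0}{N},
\qquad \mathrm{overhead}(1, 16) = 16,
\qquad \mathrm{overhead}(8, 16) = 2
```

which reproduces both the 16x matrix-times-vector blowup and the 2x widening of MxKx8 matmuls on AVX-512 cited above.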
@monorimet This is fixed for me now (I believe by #14349 but there have been a flurry of related commits over the past few days).
(EDIT - moving that performance discussion back to nod-ai/SHARK#1589 (comment)) @monorimet - Current benchmark results give a flavor of the performance to expect. Note - testing on an Intel Skylake Xeon CPU with AVX-512. Compiling with
So, data-tiling alone is a ~8x speedup. Ukernels alone are not yet good. But I'll get to that now, and it will be at least as fast as non-ukernels and in some cases faster. What's almost certainly happening here is that this particular model is
Wow, fantastic improvement with data tiling!
What happened?
I am continuing from nod-ai/SHARK#1589 in an investigation of the segmentation faults that occur when trying to run OPT-1.3b at f32 precision with `--iree-flow-enable-data-tiling`. I am having some trouble making a dispatch-level reproducer but will share the smallest reproduction as well as relevant IR.

I have compiled with `--iree-flow-break-dispatch=@forward:24` and the data tiling flag, and the resulting .vmfb successfully runs through iree-benchmark-module. With `--iree-flow-break-dispatch=@forward:25`, the .vmfb segfaults in iree-benchmark-module. Full reproduction steps are given below for this case, and here is a download link to the full IR dump after `iree-flow-outline-dispatch-regions`.
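A hedged sketch of the benchmark invocation (the module path, entry function, and input are illustrative assumptions, and flag spellings follow recent iree-benchmark-module builds):

```shell
# Hypothetical invocation: runs the module sliced by --iree-flow-break-dispatch.
# With the break at @forward:24 this completes; at @forward:25 it segfaults.
iree-benchmark-module \
  --module=opt-1.3b-f32.vmfb \
  --device=local-task \
  --function=forward \
  --input=1x8xi64=0
```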
.From the above IR dump I've isolated the func.func from dispatch 25:
Steps to reproduce your issue
What component(s) does this issue relate to?
No response
Version information
6e49915
Additional context
No response