Undeterministically task==NULL at runtime when finetuning GPT-2 #12529

wangkuiyi · 2023-03-06T23:14:36Z

What happened?

After we fixed #12369, I can make GPT-2 generate text well, so I'm moving on to fine-tuning GPT-2.

In iree-org/iree-jax#58, I added a loss function to the file iree-jax/models/gpt2/model.py. In JAX-Python, the fine-tuning works well.

Then, in iree-org/iree-jax#59, I added the fine-tuning feature as an MLIR function. The compilation went well, and I got the file /tmp/gpt2.vmfb.

I can run the module using iree-run-module:

15:09 $ iree-run-module --module=/tmp/gpt2.vmfb --device=local-task --function=finetune --input="1x64xi32=13" --input="1x64xi32=13" --input="1xi32=10"
EXEC @finetune

Because the finetune function only updates the parameters and does not return anything, the above run prints only EXEC @finetune.

To check if the finetuning really works on macOS, I wrote a C++ program to run this vmfb file. Sometimes it works well, but sometimes it crashes with Bus error: 10.

(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…6]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Joy.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
Got id = 1881
Yi Wang has two dogs. One is Joy, the other is Relaxie.
(base) ✔ ~/w/iree-ios/iree-jax/models/gpt2/finetune [export_finetune|●4✚ 3…5]
14:51 $ ./build.sh && ./finetune /tmp/gpt2.vmfb ~/w/iree-ios/IREESampleApp/IREESampleApp
clang: warning: treating 'c' input as 'c++' when in C++ mode, this behavior is deprecated [-Wdeprecated]
Got id = 679
Yi Wang has two dogs. He's a good dog, but he's not a good dog.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One of the other dogs is in the other two.
Got id = 1881
Yi Wang has two dogs. One is the other is the other is the other is the other is the other is
Got id = 1881
Yi Wang has two dogs. One is Joy.
Bus error: 10

By putting the C++ program into an iOS app written in Objective-C, I can run the app on my iPhone 13 or the iOS Simulator. On these two platforms, the program crashes with EXC_BAD_ACCESS almost every time. I am attaching a stack trace from Xcode.

Steps to reproduce your issue

Build a recent version of IREE that includes the fix for #12369 (EXC_BAD_ACCESS signal received executing GPT2 llvm-cpu on iOS Simulator).
Use the branch of IREE-JAX in iree-org/iree-jax#59 (Add MLIR function finetune to GPT-2 export.py) to generate gpt2.vmfb.
Build the sample C++ program that executes gpt2.vmfb on macOS/M1.
Build the sample iOS app that executes gpt2.vmfb on the iOS Simulator or an iPhone.

What component(s) does this issue relate to?

Runtime

Version information

IREE da22c84

Additional context

macOS
M1 Max


bjacob · 2023-03-07T01:22:05Z

To triage a nondeterministic issue like this, it would be very helpful to be able to run the reproduction steps with sanitizers: AddressSanitizer and, separately, ThreadSanitizer. This page says:

> You can’t use Thread Sanitizer to diagnose iOS, tvOS, and watchOS apps running on a device. Use Thread Sanitizer only on your 64-bit macOS app, or to diagnose your 64-bit iOS, tvOS, or watchOS app running in Simulator.

Since you write above that this reproduces in Simulator, let's then focus on that.

In particular, task==NULL sounds like the kind of thing that could be associated with issues that ThreadSanitizer would diagnose.

Even a negative outcome (the sanitizer doesn't see anything) would be useful information in itself, as that would help rule out classes of issues.
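To make the kind of bug in question concrete, here is a minimal sketch of a pointer handed from one thread to another. It is not IREE's task system: `Task`, `PublishAndConsume`, and `slot` are hypothetical names chosen for illustration. The version shown is race-free; the comments describe the variant that ThreadSanitizer would flag.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical task type for illustration; unrelated to IREE's task structs.
struct Task {
  int id = 0;
};

// One thread publishes a task pointer; the calling thread waits for it.
// The release store / acquire load pair on `slot` orders the write to
// `task.id` before the read of `seen->id`. If `slot` were a plain `Task*`,
// the store and the load would form a data race: the behavior is undefined,
// TSan would report it, and in practice the reader may see a stale or null
// pointer.
int PublishAndConsume() {
  Task task;
  std::atomic<Task*> slot{nullptr};

  std::thread producer([&task, &slot] {
    task.id = 42;
    slot.store(&task, std::memory_order_release);
  });

  // Spin until the producer has published the pointer.
  Task* seen = nullptr;
  while ((seen = slot.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();
  }
  assert(seen == &task);
  const int id = seen->id;

  producer.join();
  return id;
}
```

Building a program like this with -fsanitize=thread after replacing the atomic with a plain pointer is a quick way to see what a TSan race report looks like before hunting for one in a larger code base.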

We have sanitizers docs here,
https://github.com/openxla/iree/blob/main/docs/developers/developing_iree/sanitizers.md

But I wrote that a while ago and it's not optimal. Here are the important steps:

For both sanitizers, select the RelWithDebInfo build type.

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo .

For ThreadSanitizer, first re-compile your .vmfb module by adding these flags to your iree-compile command line:

iree-compile ...  --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false

Then re-build the IREE runtime (iree-run-module or anything else you're using to load the compiled module) with the IREE_ENABLE_TSAN CMake option:

cmake -DIREE_ENABLE_TSAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the Issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=thread. That is all that IREE_ENABLE_TSAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_TSAN.

Then re-run your iree-run-module command line reproducing this issue, using both the TSan-enabled iree-run-module and the TSan-enabled compiled .vmfb module.

For AddressSanitizer, it's easier as you don't need to re-compile the .vmfb. Just re-compile the IREE runtime with the CMake option IREE_ENABLE_ASAN=ON.

cmake -DIREE_ENABLE_ASAN=ON .
cmake --build .

If the reproducing program is your own (finetune, if I read the Issue description correctly) then rebuild that with this C/C++ compiler flag: -fsanitize=address. That is all that IREE_ENABLE_ASAN does to IREE binaries. But you still need to rebuild the IREE runtime (that it links to) with IREE_ENABLE_ASAN.

wangkuiyi · 2023-03-07T03:47:32Z

Thanks @bjacob! I rebuilt the IREE compiler and runtime for macOS/M1 with the following additional CMake flags:

-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON 
-DIREE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON 
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON 
-DIREE_ENABLE_MSAN=ON

The build went fine, except that I had to patch libyaml slightly (yaml/libyaml#267).

Then, I compiled gpt2.mlir with the following command:

 iree-compile /tmp/gpt2.mlir \
   --iree-input-type=mhlo \
   --iree-hal-target-backends=llvm-cpu  \
   -o /tmp/gpt2-san.vmfb \
   --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false 2>&1 | tee /tmp/log

It gave me errors like the following. (The more complete error message is at https://gist.github.com/wangkuiyi/b4ef1a867e6f129fe3287a0ef0e1d600. The complete one is too big to upload to GitHub.)

Undefined symbols for architecture arm64:
 "___tsan_func_entry", referenced from:
     _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     ...
 "___tsan_func_exit", referenced from:
     _encode_dispatch_0_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_1_generic_8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_2_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_3_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_4_generic_8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_5_generic_768x8 in gpt2_module_linked_llvm_cpu-dff9a6.o
     _encode_dispatch_6_matmul_2304x8x768 in gpt2_module_linked_llvm_cpu-dff9a6.o
     ...
ld: symbol(s) not found for architecture arm64
Linking failed; escaped command line returned exit code 256:

It works if I remove --iree-llvm-sanitize=thread --iree-llvm-link-embedded=false.

bjacob · 2023-03-07T04:00:57Z

I don't know the fix for these linking errors, but, FYI:

-DCMAKE_BUILD_TYPE=RelWithDebInfo
-DIREE_ENABLE_ASAN=ON
-DIREE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_ENABLE_TSAN=ON
-DIREE_BYTECODE_MODULE_FORCE_LLVM_SYSTEM_LINKER=ON
-DIREE_ENABLE_MSAN=ON

The IREE_ENABLE_*SAN options should be regarded as mutually exclusive. In effect, they are probably overriding each other, passing -fsanitize={address,thread,memory} where the one passed last overrides others. So here, drop -DIREE_ENABLE_ASAN=ON and -DIREE_ENABLE_MSAN=ON.
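One way to confirm which sanitizer a given binary actually ended up with is to ask the compiler from inside the program. The sketch below relies on Clang's `__has_feature` checks and the `__SANITIZE_ADDRESS__` / `__SANITIZE_THREAD__` macros; the helper names `ActiveSanitizer` and `IsSanitized` are made up for this example and are not part of IREE.

```cpp
#include <cstring>

// Clang exposes sanitizer state through __has_feature; fall back to 0 on
// compilers that do not provide it.
#if defined(__has_feature)
#define SAN_HAS_FEATURE(x) __has_feature(x)
#else
#define SAN_HAS_FEATURE(x) 0
#endif

// Returns the name of the sanitizer this translation unit was compiled with,
// or "none". Hypothetical helper for illustration only.
const char* ActiveSanitizer() {
#if SAN_HAS_FEATURE(thread_sanitizer) || defined(__SANITIZE_THREAD__)
  return "thread";
#elif SAN_HAS_FEATURE(address_sanitizer) || defined(__SANITIZE_ADDRESS__)
  return "address";
#elif SAN_HAS_FEATURE(memory_sanitizer)
  return "memory";
#else
  return "none";
#endif
}

// True when any of the sanitizers checked above is active.
bool IsSanitized() { return std::strcmp(ActiveSanitizer(), "none") != 0; }
```

Printing `ActiveSanitizer()` at startup of the reproducing program (finetune here) would show whether the flags that reached the compiler match the intended configuration.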

bjacob · 2023-03-07T04:10:44Z

Interesting! The linker command line from your gist is

/usr/bin/ld -o /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.so -static -dylib -flat_namespace -L /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/lib -lSystem /var/folders/hd/6q8jftdn7b1fygsrzdkp5ww40000gn/T/gpt2_module_linked_llvm_cpu-dff9a6.o

and it is itself generated by this code: https://github.com/openxla/iree/blob/1148f720be7e267f248e034b3cfb488633884980/compiler/src/iree/compiler/Dialect/HAL/Target/LLVM/internal/UnixLinkerTool.cpp#L82-L92

This is as if, on Apple platforms, the TSan runtime library needed to be linked in explicitly (?). We need someone with Apple experience here... maybe @powderluv?

bjacob · 2023-03-07T04:20:49Z

Maybe try adding "-fsanitize=thread" to the linker flags (code linked in previous comment). It's suggested in various places, including google/sanitizers#701.

That is, at UnixLinkerTool.cpp:90 (above linked code), add unconditionally

flags.push_back("-fsanitize=thread");

If that works, we'll figure out how to do that conditionally.
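A conditional version could look roughly like the sketch below. It is standalone and does not use IREE's real types: `SanitizerKind` and `AppendSanitizerLinkFlags` are hypothetical stand-ins for whatever option the compiler actually carries. Note also that -fsanitize=... is understood by the clang driver rather than by ld itself, so this presumably only helps when the link step goes through clang.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the compiler's sanitizer setting; not an IREE type.
enum class SanitizerKind { kNone, kAddress, kThread };

// Appends the driver flag matching the requested sanitizer, if any, to an
// existing list of link flags. Adds nothing for SanitizerKind::kNone.
void AppendSanitizerLinkFlags(SanitizerKind kind,
                              std::vector<std::string>* flags) {
  assert(flags != nullptr);
  switch (kind) {
    case SanitizerKind::kThread:
      flags->push_back("-fsanitize=thread");
      break;
    case SanitizerKind::kAddress:
      flags->push_back("-fsanitize=address");
      break;
    case SanitizerKind::kNone:
      break;
  }
}
```

Keeping the mapping in one small function means an unsanitized build produces exactly the same link line as before.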

wangkuiyi · 2023-03-07T05:10:27Z

clang -v -fsanitize=thread helped me. The following command

clang -fsanitize=thread /tmp/a.c -o /tmp/a

is equivalent to the following two:

clang /tmp/a.c -c -o /tmp/a.o

and

ld /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin/libclang_rt.tsan_osx_dynamic.dylib \
  -rpath @executable_path \
  -rpath /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/14.0.0/lib/darwin \
  /tmp/a.o -o /tmp/a \
  -lSystem -syslibroot /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk

stellaraccident · 2023-03-07T16:11:25Z

I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

(It would obviously be good if this all worked better on Apple platforms so just offering an option that might lead through the maze faster -- it is still useful to figure out how to fully enable sanitizers)

stellaraccident · 2023-03-07T16:15:28Z

Other things that can be done to bisect the area that is having the problem:

compile with vmvx (slow but unlikely to crash on generated code)
use the dylib-sync vs dylib-task runtime option (uses single threaded mode)

I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

bjacob · 2023-03-07T16:32:11Z

> I suspect that this issue is not coming from generated code. In that case, you may be able to get away with just building the iree runtime with sanitizers and not fiddling with the iree-llvm-sanitize=thread compiler flags.

Agree that this issue does not look like it comes from the generated code.... but TSan specifically (as opposed to other sanitizers) does not allow taking advantage of that in that way, because a TSan-enabled IREE runtime can only call TSan-enabled module code (TSan is an ABI break). Well, it will run, but it will crash.

> compile with vmvx (slow but unlikely to crash on generated code)

Ah good idea, that does enable running a TSan-enabled IREE-runtime without having to get TSan to work in module code. My above objection is specific to llvm-cpu target backend.

> I suggest the last one because that error makes me think there is something going on with the task scheduler in threaded mode. Narrowing down which piece is crashing can help scope debugging activity.

+1

allieculp · 2023-04-13T18:06:57Z

@bjacob @wangkuiyi Looks like this went a bit stale, any further update?

bjacob · 2023-04-13T20:27:32Z

Deferring to @wangkuiyi .

wangkuiyi · 2023-04-18T04:24:31Z

@allieculp and @bjacob - I got GPT-2 fine-tuning working a month ago, but via @antiagainst's Metal GPU backend. This issue occurs with the CPU backend, but not with the Metal GPU one.

wangkuiyi added the bug 🐞 and awaiting-triage labels Mar 6, 2023

wangkuiyi mentioned this issue Mar 7, 2023: Enable thread sanitize vmfb on macOS #12533 (Open)

julianwa removed the awaiting-triage label Apr 5, 2023

allieculp assigned bjacob and wangkuiyi Apr 13, 2023
