-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enable more tests for CPU in CI #51
Conversation
* [BACKEND][CPU] Implement the empty cpu backend * Run clang-format * Fix yadf error Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags:
A quick addition on how to use it.
Summary: This is stll a kind of the boilerplate and basic lowering for the first milestone (compiling vector addition). This PR firstly lowers `tt.func` and `tt.return`. Test Plan: It can safely compile an empty kernel. ``` @triton.jit def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr): return ``` > TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kerne.py ``` //===-------------------------------------------===// Legalizing operation : 'tt.func'(0x73be2a0) { * Fold { } -> FAILURE : unable to fold * Pattern : 'tt.func -> ()' { Trying to match "(anonymous namespace)::FuncOpConversion" ** Insert : 'llvm.func'(0x6c04c70) ** Insert Block into : 'llvm.func'(0x6c04c70) ** Insert Block into : 'llvm.func'(0x6c04c70) ** Erase : 'tt.func'(0x73be2a0) "(anonymous namespace)::FuncOpConversion" result 1 //===-------------------------------------------===// Legalizing operation : 'llvm.func'(0x6c04c70) { } -> SUCCESS : operation marked legal by the target //===-------------------------------------------===// ... //===-------------------------------------------===// Legalizing operation : 'tt.return'(0x73efeb0) { "tt.return"() : () -> () * Fold { } -> FAILURE : unable to fold * Pattern : 'tt.return -> ()' { Trying to match "(anonymous namespace)::ReturnOpConversion" ** Insert : 'llvm.return'(0x73c0f00) ** Replace : 'tt.return'(0x73efeb0) "(anonymous namespace)::ReturnOpConversion" result 1 //===-------------------------------------------===// Legalizing operation : 'llvm.return'(0x73c0f00) { "llvm.return"() : () -> () } -> SUCCESS : operation marked legal by the target //===-------------------------------------------===// } -> SUCCESS : pattern applied successfully ```
…riton-lang#1) Summary: As title, `tl.program_id` needs to be supported first. As of now, we think pid will be provided as additional function arguments to the kernel. So, getting program_id is mapped to reading one of the last three arguments. I also quickly implemented `tl.device_print` or `print`, only for scalar types for a quick "Hello, World!" testing. Test Plan: Tested with a simple example: ``` @triton.jit def add_kernel(...): pid = tl.program_id(axis=0) # We use a 1D launch grid so axis is 0. foo = pid + 42 tl.device_print("Hello, World!", foo, pid) ``` The resulting .llir is valid: ``` @printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00" declare !dbg !3 i32 @printf(ptr, ...) define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 { %8 = add i32 %4, 42, !dbg !8 %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4) ret void, !dbg !9 } ``` Tried to compile with a fake main function: ``` > % cat main.c extern void add_kernel(float*, float*, float*, int, int, int, int); int main() { add_kernel(0, 0, 0, 4, 5, 6, 7); } > % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c > % ./a.out pid (5, 6, 7) Hello, World!: 47, 5 ```
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Shane Nay <[email protected]>
…n-lang#4) Signed-off-by: Ilya Enkovich <[email protected]>
…ion flows (triton-lang#6) * Support basic lowering through vector dialect in CPU backend. Signed-off-by: Ilya Enkovich <[email protected]> * Use axis info in memory op lowering. Signed-off-by: Ilya Enkovich <[email protected]> * Mark test_ptx_cast as enabled for CPU. Signed-off-by: Ilya Enkovich <[email protected]> * Support umulhi operation. Signed-off-by: Ilya Enkovich <[email protected]> * Support tl.clamp, tl.minimum, tl.maximum. Signed-off-by: Ilya Enkovich <[email protected]> * Add enable_fp_fusion opt for CPU (only affects ASM dump now). Signed-off-by: Ilya Enkovich <[email protected]> * Fix kernel args passing for propagated constants. Signed-off-by: Ilya Enkovich <[email protected]> * Add permutations support. Signed-off-by: Ilya Enkovich <[email protected]> * Support 2-D transfer_read/transfer_write lowering. Signed-off-by: Ilya Enkovich <[email protected]> * Introduce shape info analysis and use it for loads/stores by block pointers. Delay scalar pointers lowering. Signed-off-by: Ilya Enkovich <[email protected]> * Support 'other' arg for loads. Signed-off-by: Ilya Enkovich <[email protected]> * Support tl.join. Signed-off-by: Ilya Enkovich <[email protected]> * Minor renaming. Signed-off-by: Ilya Enkovich <[email protected]> --------- Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8) * [BACKEND][CPU] Make it buildable and runnable in a different environment * Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]> Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11) * [CPU] Support flexible active driver + update vector-add tutorial * Update vector-add to run CPU always + optional GPU * Update do_bench for CPU
…ng#16) Signed-off-by: Gregory Shimansky <[email protected]>
…ng#17) * Fixed yaml syntax Signed-off-by: Gregory Shimansky <[email protected]> * Removed cpu label from run-on Signed-off-by: Gregory Shimansky <[email protected]> * Added missing zlib-dev Signed-off-by: Gregory Shimansky <[email protected]> * Added missing apt-get update Signed-off-by: Gregory Shimansky <[email protected]> * Remove pip cache because on self-hosted runner it slows things down Signed-off-by: Gregory Shimansky <[email protected]> * Corrected path to tests Signed-off-by: Gregory Shimansky <[email protected]> * Added installation of torch==2.1.2 Signed-off-by: Gregory Shimansky <[email protected]> --------- Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher * Address the comments * Fix induction variable type * Always use preallocated output buffer for CPU with torch.add
Signed-off-by: Ilya Enkovich <[email protected]>
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR * Don't touch the main compiler.py
Signed-off-by: Gregory Shimansky <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
…-lang#23) * add un-masked tiled matrix-multiplication for triton-cpu * clean and add comment * move test under tutorials
* Fix RelWithDebInfo build. Signed-off-by: Ilya Enkovich <[email protected]> * Skip fp8 cast tests on CPU. Signed-off-by: Ilya Enkovich <[email protected]> * Fix segfault. Signed-off-by: Ilya Enkovich <[email protected]> * [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059) * Add -s option to pytest run. Signed-off-by: Ilya Enkovich <[email protected]> * Add a workaround for LLVM bug causing test failure on Skylake CPU. Signed-off-by: Ilya Enkovich <[email protected]> * Add a workaround for LLVM fpext bug causing test failure on Skylake CPU. Signed-off-by: Ilya Enkovich <[email protected]> * Fix formatting. Signed-off-by: Ilya Enkovich <[email protected]> --------- Signed-off-by: Ilya Enkovich <[email protected]> Co-authored-by: Pablo Zimmermann <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
Signed-off-by: Ilya Enkovich <[email protected]>
.github/workflows/build-test.yml
Outdated
python -m pytest -s -n 32 --device cpu python/test/unit/language/test_compile_errors.py | ||
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_bindings.py | ||
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_cache.py | ||
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_launch.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I should combine all these commands and run their tests under a single command
Let me know if you would like me to break out the commits into individual PRs |
third_party/cpu/backend/driver.py
Outdated
@@ -125,11 +125,10 @@ def format_of(ty): | |||
|
|||
args_format = ''.join([format_of(_extracted_type(ty)) for ty in signature.values()]) | |||
format = "iiiOKOOOO" + args_format | |||
arg_ptrs_list = ', '.join(f"&arg{i}" for i, ty in signature.items()) if len(signature) > 0 else '' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The if
check here was redundant
4edd05f
to
5a769b4
Compare
@pytest.mark.interpreter | ||
@pytest.mark.parametrize("debug", [False, True]) | ||
def test_interleave(device, debug): | ||
|
||
if device == "cpu" and debug: | ||
pytest.skip("Test aborts for device=cpu and debug=True") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens due to missing support of device_assert
.
Of course, we can skip this for now, but it should be fixed with #35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, @minjang actually suggested that I look at tackling that (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I'm porting my previous implementation of device_print to the new repo, but having some crashes :) It should be working soon. device_assert is also very similar.
…dd to CI The previous test implementation would break when run under pytest-xdist as it reused & removed the same temporary directory for the triton cache.
Most of the tests are skipped for now...
The only test that needed fixing was `test_unknown_annotations`, where we were generating invalidate code for the launcher. In particular, when `kernel_fn_args` was empty, we would get the following error: ``` /var/folders/_z/88s630fd3d9fx72mbmx90qvw0000gn/T/tmpy481mz0l/main.cpp:37:29: error: expected ';' before '(' token 37 | using kernel_ptr_t = void(*)(, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t); | ^ | ; ```
It was giving "not compiled with CUDA" errors because the test was querying for the CUDA compute capability, but it was not actually using this information. It looks like the first iteration of the test needed to know the compute capability, but that changed somewhere along the way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch, it looks great!
I have a couple of minor comments:
- Please try to avoid CPU-unrelated changes, it can complicate our rebases on the upstream repo and the following merge of the backend. You can submit such changes directly to the upstream Triton repo. E. g. changes in conftest.py don't seem to be backend-specific.
- Could you please create issues in the Feature Support section of our project for tests that are skipped for now and are supposed to be enabled in the future for CPU? And also for test suites (if any) that are not covered yet by your patch? It would help a lot in tracking missing features for CPU.
@@ -294,7 +294,9 @@ def kernel(): | |||
triton.compile(triton.compiler.ASTSource(fn=kernel, signature={}, constants={})) | |||
|
|||
|
|||
def test_global_access_in_fn_default_arg(): | |||
def test_global_access_in_fn_default_arg(device): | |||
if device == "cpu": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know which device is used for interpreter tests, but all other device checks in tests check for the interpreter mode first. Let's follow the same is_cpu
scheme everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm it looks like that's no longer the case in upstream? https://github.com/triton-lang/triton/blob/0dd9029abf61c949471eb512a0b1e0da55339859/python/test/unit/language/test_core.py#L32-L34
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh wait, I see the check was just folded into get_current_target()
. tbh I think it's cleaner to have interpreter-only tests marked explicitly as such, rather than conflating the interpreter check with the platform check... like, what if we have some tests that pass for both the interpreter and the CPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know which device is used for the interpreter, guess it might be CPU. So this check might disable the test for the interpreter.
@@ -282,6 +282,10 @@ def upcast_test(src_dtype, dst_dtype, exponent_bits, mantissa_bits, exponent_bia | |||
]) | |||
def test_typeconvert_upcast(src_dtype, dst_dtype, device): | |||
|
|||
if device == "cpu": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These tests are covered in #40
tmpdir = ".tmp" | ||
|
||
def get_device_key(device): | ||
if device == "cuda": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we get it from the active driver as in jit.py
?
Yeah, I did kind of realize that :) I actually ended up splitting out some of the changes into triton-lang#4334 and triton-lang#4341 -- will rebase upon them once they're landed.
Yeah I'm actually not sure why we have two different schemes. Let me ask @ptillet in triton-lang#4334 since I have the same changes there. |
That's perfect. Thanks! |
if device == "cpu": | ||
if "float8" in src_dtype or "float8" in dst_dtype: | ||
pytest.skip("float8 tests are not supported on CPU") | ||
pytest.skip("FIXME: Test aborts for device=cpu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is #58
|
||
def test_jit_debug() -> None: | ||
if device == "cpu": | ||
pytest.skip("FIXME: Test aborts for device=cpu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is device_assert
def test_metadata() -> None: | ||
def test_metadata(device) -> None: | ||
if device == "cpu": | ||
pytest.skip("FIXME: Test aborts for device=cpu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is #57
def test_global_access_in_fn_default_arg(): | ||
def test_global_access_in_fn_default_arg(device): | ||
if device == "cpu": | ||
pytest.skip("Test aborts for device=cpu") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is also #57
1a50ae8
to
fb7ec84
Compare
Instead of having all tests marked with
pytest.mark.cpu
, I've taken to enabling all tests in a file by default. I think test_core.py can be the exception for now since it has an order magnitude more tests than the other files.