Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable more tests for CPU in CI #51

Closed
wants to merge 56 commits into from
Closed

Conversation

int3
Copy link
Collaborator

@int3 int3 commented Jul 15, 2024

Instead of having all tests marked with pytest.mark.cpu, I've taken to enabling all tests in a file by default. I think test_core.py can be the exception for now since it has an order magnitude more tests than the other files.

bertmaher and others added 30 commits June 24, 2024 00:31
* [BACKEND][CPU] Implement the empty cpu backend

* Run clang-format

* Fix yadf error

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
A quick addition on how to use it.
Summary: This is stll a kind of the boilerplate and basic lowering for the first milestone (compiling vector addition). This PR firstly lowers `tt.func` and `tt.return`.


Test Plan: It can safely compile an empty kernel.

```
@triton.jit
def add_kernel(x_ptr,  y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    return
```

> TRITON_ENABLE_LLVM_DEBUG=1 TRITON_CPU_BACKEND=1 python3 empty_kerne.py

```
//===-------------------------------------------===//
Legalizing operation : 'tt.func'(0x73be2a0) {
  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.func -> ()' {
Trying to match "(anonymous namespace)::FuncOpConversion"
    ** Insert  : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Insert Block into : 'llvm.func'(0x6c04c70)
    ** Erase   : 'tt.func'(0x73be2a0)
"(anonymous namespace)::FuncOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.func'(0x6c04c70) {
    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//

...

//===-------------------------------------------===//
Legalizing operation : 'tt.return'(0x73efeb0) {
  "tt.return"() : () -> ()

  * Fold {
  } -> FAILURE : unable to fold

  * Pattern : 'tt.return -> ()' {
Trying to match "(anonymous namespace)::ReturnOpConversion"
    ** Insert  : 'llvm.return'(0x73c0f00)
    ** Replace : 'tt.return'(0x73efeb0)
"(anonymous namespace)::ReturnOpConversion" result 1

    //===-------------------------------------------===//
    Legalizing operation : 'llvm.return'(0x73c0f00) {
      "llvm.return"() : () -> ()

    } -> SUCCESS : operation marked legal by the target
    //===-------------------------------------------===//
  } -> SUCCESS : pattern applied successfully
```
…riton-lang#1)

Summary: As title, `tl.program_id` needs to be supported first. As of now, we think pid will be provided as additional function arguments to the kernel. So, getting program_id is mapped to reading one of the last three arguments.

I also quickly implemented `tl.device_print` or `print`, only for scalar types for a quick "Hello, World!" testing.

Test Plan: Tested with a simple example:

```
@triton.jit
def add_kernel(...):
    pid = tl.program_id(axis=0)  # We use a 1D launch grid so axis is 0.
    foo = pid + 42
    tl.device_print("Hello, World!", foo, pid)
```

The resulting .llir is valid:
```
@printfFormat_1 = internal constant [31 x i8] c"pid (%u, %u, %u) test: %u, %u\0A\00"

declare !dbg !3 i32 @printf(ptr, ...)

define void @add_kernel(ptr addrspace(1) %0, ptr addrspace(1) %1, ptr addrspace(1) %2, i32 %3, i32 %4, i32 %5, i32 %6) !dbg !7 {
  %8 = add i32 %4, 42, !dbg !8
  %9 = call i32 (ptr, ...) @printf(ptr @printfFormat_0, i32 %4, i32 %5, i32 %6, i32 %8, i32 %4)
  ret void, !dbg !9
}
```

Tried to compile with a fake main function:
```
> % cat main.c
extern void add_kernel(float*, float*, float*, int, int, int, int);

int main() {
    add_kernel(0, 0, 0, 4, 5, 6, 7);
}

> % llc -filetype=obj add_kernel.llir && clang -o a.out add_kernel.llir.o main.c
> % ./a.out
pid (5, 6, 7) Hello, World!: 47, 5
```
…ion flows (triton-lang#6)

* Support basic lowering through vector dialect in CPU backend.

Signed-off-by: Ilya Enkovich <[email protected]>

* Use axis info in memory op lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Mark test_ptx_cast as enabled for CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support umulhi operation.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.clamp, tl.minimum, tl.maximum.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add enable_fp_fusion opt for CPU (only affects ASM dump now).

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix kernel args passing for propagated constants.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add permutations support.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 2-D transfer_read/transfer_write lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Introduce shape info analysis and use it for loads/stores by block pointers.

Delay scalar pointers lowering.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support 'other' arg for loads.

Signed-off-by: Ilya Enkovich <[email protected]>

* Support tl.join.

Signed-off-by: Ilya Enkovich <[email protected]>

* Minor renaming.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
…ent (triton-lang#8)

* [BACKEND][CPU] Make it buildable and runnable in a different environment

* Revert seemingly inconsistent python code formatting
Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Minjang Kim <[email protected]>
…iton-lang#11)

* [CPU] Support flexible active driver + update vector-add tutorial

* Update vector-add to run CPU always + optional GPU

* Update do_bench for CPU
…ng#17)

* Fixed yaml syntax

Signed-off-by: Gregory Shimansky <[email protected]>

* Removed cpu label from run-on

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing zlib-dev

Signed-off-by: Gregory Shimansky <[email protected]>

* Added missing apt-get update

Signed-off-by: Gregory Shimansky <[email protected]>

* Remove pip cache because on self-hosted runner it slows things down

Signed-off-by: Gregory Shimansky <[email protected]>

* Corrected path to tests

Signed-off-by: Gregory Shimansky <[email protected]>

* Added installation of torch==2.1.2

Signed-off-by: Gregory Shimansky <[email protected]>

---------

Signed-off-by: Gregory Shimansky <[email protected]>
* [CPU] Add OpenMP launcher

* Address the comments

* Fix induction variable type

* Always use preallocated output buffer for CPU with torch.add
* [CPU] Dump human-readable asm code in TRITON_CACHE_DIR

* Don't touch the main compiler.py
…-lang#23)

* add un-masked tiled matrix-multiplication for triton-cpu

* clean and add comment

* move test under tutorials
* Fix RelWithDebInfo build.

Signed-off-by: Ilya Enkovich <[email protected]>

* Skip fp8 cast tests on CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix segfault.

Signed-off-by: Ilya Enkovich <[email protected]>

* [BACKEND] Update LLVM version to llvm/llvm-project@765206e (triton-lang#4059)

* Add -s option to pytest run.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Add a workaround for LLVM fpext bug causing test failure on Skylake CPU.

Signed-off-by: Ilya Enkovich <[email protected]>

* Fix formatting.

Signed-off-by: Ilya Enkovich <[email protected]>

---------

Signed-off-by: Ilya Enkovich <[email protected]>
Co-authored-by: Pablo Zimmermann <[email protected]>
@int3 int3 requested a review from digantdesai July 15, 2024 21:15
Comment on lines 82 to 85
python -m pytest -s -n 32 --device cpu python/test/unit/language/test_compile_errors.py
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_bindings.py
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_cache.py
python -m pytest -s -n 32 --device cpu python/test/unit/runtime/test_launch.py
Copy link
Collaborator Author

@int3 int3 Jul 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe I should combine all these commands and run their tests under a single command

@int3
Copy link
Collaborator Author

int3 commented Jul 15, 2024

Let me know if you would like me to break out the commits into individual PRs

@@ -125,11 +125,10 @@ def format_of(ty):

args_format = ''.join([format_of(_extracted_type(ty)) for ty in signature.values()])
format = "iiiOKOOOO" + args_format
arg_ptrs_list = ', '.join(f"&arg{i}" for i, ty in signature.items()) if len(signature) > 0 else ''
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The if check here was redundant

@int3 int3 force-pushed the more-tests branch 2 times, most recently from 4edd05f to 5a769b4 Compare July 16, 2024 02:02
@pytest.mark.interpreter
@pytest.mark.parametrize("debug", [False, True])
def test_interleave(device, debug):

if device == "cpu" and debug:
pytest.skip("Test aborts for device=cpu and debug=True")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This happens due to missing support of device_assert.
Of course, we can skip this for now, but it should be fixed with #35

Copy link
Collaborator Author

@int3 int3 Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, @minjang actually suggested that I look at tackling that (:

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, I'm porting my previous implementation of device_print to the new repo, but having some crashes :) It should be working soon. device_assert is also very similar.

int3 added 10 commits July 16, 2024 18:40
…dd to CI

The previous test implementation would break when run under pytest-xdist as it
reused & removed the same temporary directory for the triton cache.
Most of the tests are skipped for now...
The only test that needed fixing was `test_unknown_annotations`, where we were
generating invalidate code for the launcher. In particular, when
`kernel_fn_args` was empty, we would get the following error:

```
/var/folders/_z/88s630fd3d9fx72mbmx90qvw0000gn/T/tmpy481mz0l/main.cpp:37:29: error: expected ';' before '(' token
   37 | using kernel_ptr_t = void(*)(, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t, uint32_t);
      |                             ^
      |                             ;
```
It was giving "not compiled with CUDA" errors because the test was querying for
the CUDA compute capability, but it was not actually using this information. It
looks like the first iteration of the test needed to know the compute
capability, but that changed somewhere along the way.
Copy link
Collaborator

@ienkovich ienkovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch, it looks great!

I have a couple of minor comments:

  • Please try to avoid CPU-unrelated changes, it can complicate our rebases on the upstream repo and the following merge of the backend. You can submit such changes directly to the upstream Triton repo. E. g. changes in conftest.py don't seem to be backend-specific.
  • Could you please create issues in the Feature Support section of our project for tests that are skipped for now and are supposed to be enabled in the future for CPU? And also for test suites (if any) that are not covered yet by your patch? It would help a lot in tracking missing features for CPU.

@@ -294,7 +294,9 @@ def kernel():
triton.compile(triton.compiler.ASTSource(fn=kernel, signature={}, constants={}))


def test_global_access_in_fn_default_arg():
def test_global_access_in_fn_default_arg(device):
if device == "cpu":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know which device is used for interpreter tests, but all other device checks in tests check for the interpreter mode first. Let's follow the same is_cpu scheme everywhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator Author

@int3 int3 Jul 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh wait, I see the check was just folded into get_current_target(). tbh I think it's cleaner to have interpreter-only tests marked explicitly as such, rather than conflating the interpreter check with the platform check... like, what if we have some tests that pass for both the interpreter and the CPU?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know which device is used for the interpreter, guess it might be CPU. So this check might disable the test for the interpreter.

@@ -282,6 +282,10 @@ def upcast_test(src_dtype, dst_dtype, exponent_bits, mantissa_bits, exponent_bia
])
def test_typeconvert_upcast(src_dtype, dst_dtype, device):

if device == "cpu":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are covered in #40

tmpdir = ".tmp"

def get_device_key(device):
if device == "cuda":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get it from the active driver as in jit.py?

@int3
Copy link
Collaborator Author

int3 commented Jul 17, 2024

Please try to avoid CPU-unrelated changes

Yeah, I did kind of realize that :) I actually ended up splitting out some of the changes into triton-lang#4334 and triton-lang#4341 -- will rebase upon them once they're landed.

Let's follow the same is_cpu scheme everywhere.

Yeah I'm actually not sure why we have two different schemes. Let me ask @ptillet in triton-lang#4334 since I have the same changes there.

@int3 int3 marked this pull request as draft July 17, 2024 17:59
@ienkovich
Copy link
Collaborator

Yeah, I did kind of realize that :) I actually ended up splitting out some of the changes into triton-lang#4334 and triton-lang#4341 -- will rebase upon them once they're landed.

That's perfect. Thanks!

if device == "cpu":
if "float8" in src_dtype or "float8" in dst_dtype:
pytest.skip("float8 tests are not supported on CPU")
pytest.skip("FIXME: Test aborts for device=cpu")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is #58


def test_jit_debug() -> None:
if device == "cpu":
pytest.skip("FIXME: Test aborts for device=cpu")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is device_assert

def test_metadata() -> None:
def test_metadata(device) -> None:
if device == "cpu":
pytest.skip("FIXME: Test aborts for device=cpu")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is #57

def test_global_access_in_fn_default_arg():
def test_global_access_in_fn_default_arg(device):
if device == "cpu":
pytest.skip("Test aborts for device=cpu")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also #57

@int3
Copy link
Collaborator Author

int3 commented Aug 31, 2024

Obsoleted by #106 and #131.

@int3 int3 closed this Aug 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants