[CORE] Consolidate 6+ kernel boolean toggle args to single Backend arg #68

Merged
merged 8 commits into from
Jun 27, 2024

@ZX-ModelCloud ZX-ModelCloud commented Jun 26, 2024

Resolves #59

The following args will be merged into a single arg, backend: Backend = Backend.AUTO:

use_triton: bool,
disable_exllama: bool = False,
disable_exllamav2: bool = False,
use_marlin: bool = False,
use_bitblas: bool = True,
Reason: These passive binary toggles form a matrix of conditions that is very confusing for users to set correctly, and even project developers have run into multiple bugs because of them. We can't keep adding another binary toggle every time we add a backend/kernel/runtime; the API is becoming unmaintainable and unusable for both end users and project devs.

Prelim design:

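from enum import Enum, auto

class Backend(Enum):
    AUTO = auto()  # choose the fastest one based on quant model compatibility
    CUDA_OLD = auto()
    CUDA = auto()
    TRITON_V2 = auto()
    EXLLAMA = auto()
    EXLLAMA_V2 = auto()
    MARLIN = auto()
    BITBLAS = auto()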
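To make the motivation concrete, here is a minimal sketch of how the old boolean matrix could collapse into one enum value. Only the flag names and the enum members come from the description above. The helper `legacy_flags_to_backend`, its all-False defaults, and its precedence order are hypothetical, chosen purely for illustration; they are not taken from this PR.

```python
from enum import Enum, auto


class Backend(Enum):
    AUTO = auto()
    CUDA_OLD = auto()
    CUDA = auto()
    TRITON_V2 = auto()
    EXLLAMA = auto()
    EXLLAMA_V2 = auto()
    MARLIN = auto()
    BITBLAS = auto()


def legacy_flags_to_backend(
    use_triton: bool = False,
    disable_exllama: bool = False,
    disable_exllamav2: bool = False,
    use_marlin: bool = False,
    use_bitblas: bool = False,
) -> Backend:
    """Hypothetical helper: map the old toggles onto a single Backend.

    The precedence below is an assumption for illustration only.
    """
    # Explicit opt-in flags win over the passive "disable_*" flags.
    if use_bitblas:
        return Backend.BITBLAS
    if use_marlin:
        return Backend.MARLIN
    if use_triton:
        return Backend.TRITON_V2
    # Passive flags: a kernel is used unless it was disabled.
    if not disable_exllamav2:
        return Backend.EXLLAMA_V2
    if not disable_exllama:
        return Backend.EXLLAMA
    return Backend.CUDA
```

Even in this simplified form, five booleans give 32 combinations that resolve to only six outcomes, which is the kind of ambiguity a single `backend` argument removes.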

@ZX-ModelCloud ZX-ModelCloud marked this pull request as ready for review June 27, 2024 05:11
@Qubitium Qubitium merged commit 5b724ac into main Jun 27, 2024
2 of 3 checks passed
@Qubitium Qubitium deleted the zx_consolidate_backend branch June 27, 2024 06:16
@Qubitium Qubitium changed the title Consolidate Backend [CORE] Consolidate 6+ kernel boolean toggle args to single Backend arg Jun 27, 2024
DeJoker pushed a commit to DeJoker/GPTQModel that referenced this pull request Jul 19, 2024
* Consolidate Backend

* change Backend.TRITON_V2 to Backend.TRITON

* Determine the Backend to use when packing the model according to quantize_config.format.

* Auto-select the fastest Backend based on quant model compatibility

* Fix issue: automatic Backend selection returned an incorrect qlinear.

* cleanup

* cleanup
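
The "auto-select the fastest Backend" step in the commit list above could look roughly like the following sketch. The enum uses the member names after the TRITON_V2 to TRITON rename. The function `select_backend`, the `ASSUMED_PRIORITY` ordering, and the `is_compatible` callback are all hypothetical; the actual ordering and compatibility checks used by the project are not shown in this PR description.

```python
from enum import Enum, auto
from typing import Callable, Sequence


class Backend(Enum):
    AUTO = auto()
    CUDA_OLD = auto()
    CUDA = auto()
    TRITON = auto()
    EXLLAMA = auto()
    EXLLAMA_V2 = auto()
    MARLIN = auto()
    BITBLAS = auto()


# Assumed "fastest first" ordering, for illustration only.
ASSUMED_PRIORITY: Sequence[Backend] = (
    Backend.MARLIN,
    Backend.EXLLAMA_V2,
    Backend.EXLLAMA,
    Backend.TRITON,
    Backend.CUDA,
    Backend.CUDA_OLD,
)


def select_backend(
    requested: Backend,
    is_compatible: Callable[[Backend], bool],
    priority: Sequence[Backend] = ASSUMED_PRIORITY,
) -> Backend:
    """Hypothetical resolver: honor an explicit choice, else pick by priority."""
    if requested is not Backend.AUTO:
        # An explicit request is validated but never silently replaced.
        if not is_compatible(requested):
            raise ValueError(f"{requested.name} is not compatible with this model")
        return requested
    # AUTO: the first compatible candidate in priority order wins.
    for candidate in priority:
        if is_compatible(candidate):
            return candidate
    raise ValueError("no compatible backend found")
```

Raising on an incompatible explicit request, rather than falling back, is a design choice in this sketch: it keeps `Backend.AUTO` as the only path where the library picks on the caller's behalf.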
Successfully merging this pull request may close these issues.

[FEATURE] Consolidate 6+ related use/disable: bool args in from_quantized into single backend: Backend
2 participants