Add QuIP# support #4803

Merged 11 commits into dev from quip-sharp on Dec 6, 2023

Conversation

@oobabooga (Owner) commented Dec 4, 2023

QuIP# is a novel quantization method. Its 2-bit performance is better than anything previously available.

Repository: https://github.com/Cornell-RelaxML/quip-sharp

Blog post: https://cornell-relaxml.github.io/quip-sharp/

Installation

The installation is currently manual, but later I will add it to the one-click installer.

  1. Clone quip-sharp into your repositories folder and install it:

git clone 'https://github.com/Cornell-RelaxML/quip-sharp' repositories/quip-sharp
cd repositories/quip-sharp/quiptools
python setup.py install
cd ../../..

You need to have a C++ compiler (like g++) and nvcc available in your environment for the commands above.

  2. Install the following additional requirements:

pip install fast-hadamard-transform glog==0.3.1 primefac==2.0.12

  3. Download a model. Example:

python download-model.py relaxml/Llama-2-70b-E8P-2Bit

  4. Download my tokenizer (I'm using it as a placeholder for now, as the model above doesn't include a tokenizer):

python download-model.py oobabooga/llama-tokenizer

  5. Start the web UI:

python server.py --model relaxml_Llama-2-70b-E8P-2Bit --loader 'QuIP#'

Perplexity

On a small test that I have been running since the beginning of this year to compare different quantizations:

| Model | Perplexity |
|---|---|
| llama-2-70b.ggmlv3.q4_K_M.bin | 4.552218437194824 |
| llama-65b.ggmlv3.q4_K_M.bin | 4.906391620635986 |
| relaxml/Llama-2-70b-E8P-2Bit | 5.173901081085205 |
| llama-30b.ggmlv3.q4_K_M.bin | 5.215567588806152 |
| turboderp/LLama2-70B-2.5bpw-h6-exl2 | 5.4921875 |

It's the same test as in the first table in this blog post, so the numbers are directly comparable.

This is the first time I have seen a quantized 70b model that fits in an RTX 3090 perform better than a q4_K_M 30b model, which is especially important since Meta never released a Llama-2 30b base model.
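
For reference, perplexity numbers like these are typically computed along the following lines with HF transformers; this is only a sketch, since the dataset, context length, and chunking of my test are not specified here:

```python
# Sketch of a chunked perplexity evaluation (dataset/context length assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('models/some-model', device_map='auto')
tokenizer = AutoTokenizer.from_pretrained('models/some-model')

text = open('eval.txt').read()  # hypothetical evaluation corpus
ids = tokenizer(text, return_tensors='pt').input_ids.to(model.device)

nlls, n_tokens, ctx = [], 0, 2048
for begin in range(0, ids.size(1), ctx):
    chunk = ids[:, begin:begin + ctx]
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        # labels=chunk makes the model return the mean next-token NLL
        # over this chunk.
        loss = model(chunk, labels=chunk).loss
    n = chunk.size(1) - 1  # number of predicted tokens in the chunk
    nlls.append(loss * n)
    n_tokens += n

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f'Perplexity: {ppl.item():.6f}')
```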

Performance

I can get up to a 3042-token context with 24GB of VRAM. It generates at around 8 tokens/second when the context is small and 6 tokens/second when it is large.

Output generated in 33.51 seconds (5.94 tokens/s, 199 tokens, context 2842, seed 977283488)

@oobabooga oobabooga mentioned this pull request Dec 4, 2023
@oobabooga (Owner Author) commented Dec 4, 2023

The installation procedure is almost identical to the one for GPTQ-for-LLaMa; maybe @jllllll can come to the rescue and create wheels for this one as well.

@LoopControl commented Dec 4, 2023

Is there any minimum architecture support required for this (for example, AWQ quant requires Ampere or better on Nvidia cards)?

(Trying to figure out if it will work on cards like the P40, which uses the compute capability 6.1 architecture: -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61.)

@oobabooga (Owner Author)

I don't know. Most of these custom CUDA kernels require Ampere cards, but my old fork of GPTQ-for-LLaMa has a custom kernel and works on Pascal cards. I guess it depends on the operations performed.

Maybe @tsengalb99 can tell us what the requirements are.

@tsengalb99 commented Dec 4, 2023

Hi - a few things:

  • Our codebase is constantly being updated, so you will want to make sure you have the latest version. It looks like you have a pretty recent version, since we only transitioned to the fast_hadamard_kernel (vs. a local copy) a few commits ago.
  • I'm aware of a "higher than expected" memory consumption issue that I'm looking into right now. If this is something easily fixable (i.e. not an artifact of HF), you should be able to fit a longer context length into a 24GB GPU after I fix it.
  • HF generate is incompatible with torch's wrapper around CUDA graphs (see the sketch after this list). I have not looked into how to get CUDA graphs working with HF generate, but if you plan on deploying QuIP# models, you should probably do this. We have a lot of kernel launches in our quantized linear implementation, and in non-HF-generate settings, CUDA graphs give about a 2x speedup.
  • @chaosagent has been working on the CUDA kernels and can comment further on whether they need Ampere or newer. FWIW, I was not able to compile on a 1080ti just now but could on a 2080ti, so it looks like Volta or newer may be required.
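
For reference, here is a minimal sketch of the torch CUDA-graph wrapper mentioned above. It only works with fixed input shapes, which is exactly the assumption HF generate breaks; all names here are illustrative:

```python
# Minimal torch.cuda.CUDAGraph capture/replay sketch (fixed input shapes).
import torch

layer = torch.nn.Linear(4096, 4096).half().cuda()
static_in = torch.zeros(1, 4096, dtype=torch.half, device='cuda')

# Warm up on a side stream so capture starts from a clean allocator state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        layer(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = layer(static_in)  # every kernel launch is recorded once

# Replay: refill the captured input buffer, then launch the whole graph
# with a single call instead of one launch per kernel.
static_in.copy_(torch.randn(1, 4096, dtype=torch.half, device='cuda'))
g.replay()
print(static_out.float().norm())
```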

@tsengalb99

> Download my tokenizer (I'm using it as a placeholder for now, as the model above doesn't include a tokenizer):

BTW our models should work with HF's AutoTokenizer. We have multiple places in our code where we just call AutoTokenizer and everything works fine.

@oobabooga (Owner Author)

Thanks for the reply @tsengalb99. Updates and eventual breaking changes are expected, and I'll make sure to update the code in this PR accordingly over time.

About CUDA graphs and the HF .generate(): I am not familiar enough with the HF pipelines to get this integration working myself. The best I can do for now is ping @younesbelkada, as maybe this would be easy for him.

@oobabooga (Owner Author)

> Download my tokenizer (I'm using it as a placeholder for now, as the model above doesn't include a tokenizer):

> BTW our models should work with HF's AutoTokenizer. We have multiple places in our code where we just call AutoTokenizer and everything works fine.

That doesn't work with a local copy of relaxml/Llama-2-70b-E8P-2Bit:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('relaxml_Llama-2-70b-E8P-2Bit')

OSError: Can't load tokenizer for 'relaxml_Llama-2-70b-E8P-2Bit'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'relaxml_Llama-2-70b-E8P-2Bit' is the correct path to a directory containing all relevant files for a LlamaTokenizerFast tokenizer.

The problem is that the tokenizer files are not present in the repository. This can be easily fixed by uploading the tokenizer files here (or any other copy of the default Llama tokenizer) to that repository.
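
In the meantime, the placeholder workaround from the PR description amounts to loading the tokenizer from a separate local directory, along these lines:

```python
# Load weights from the quantized repo, but the tokenizer from a separate
# local copy of the base Llama tokenizer (the path follows the models/
# layout used by text-generation-webui).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('models/oobabooga_llama-tokenizer')
```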

@tsengalb99

You need to extract the base model string (e.g. meta-llama/Llama-2-7b-hf), which model_from_hf_path does. See https://github.com/Cornell-RelaxML/quip-sharp/blob/90fd0d473e255f282e1631d2d2e796593c187239/eval_zeroshot.py#L26 for an example.
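
Roughly, the pattern in the linked file looks like the sketch below; the import path and exact signature are assumptions based on the quip-sharp layout:

```python
# model_from_hf_path returns the quantized model plus the base-model string,
# which is what feeds AutoTokenizer (import path is an assumption).
from transformers import AutoTokenizer
from lib.utils.unsafe_import import model_from_hf_path

model, model_str = model_from_hf_path('relaxml/Llama-2-70b-E8P-2Bit')
tokenizer = AutoTokenizer.from_pretrained(model_str)  # e.g. meta-llama/Llama-2-70b-hf
```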

@oobabooga (Owner Author)

I had seen this, but this repository is based on loading from local copies of HF repositories (stored under text-generation-webui/models) rather than using the native HF download tools that fetch models/tokenizers to a cache folder from their names.

This is very secondary and I wouldn't worry about it.

@jerry-chee

Hi, another QuIP# author here. Depending on what you're interested in, we also have quantized 4-bit models in our Hugging Face repo (e.g. relaxml/Llama-2-70b-chat-HI-4Bit-Packed) that have much smaller degradation from the fp16 model. We expect to have fast inference with these 4-bit models approximately by the end of the week; our current forward-pass code is a slower, naive implementation of the codebook for this specific 4-bit quantization.

(Screenshot attached to the original comment.)

@oobabooga (Owner Author)

That's great to hear @jerry-chee, thanks for the information.

@oobabooga (Owner Author)

I spent a while trying to create GitHub Actions wheels for quip-sharp here and failed, so I gave up and instead just added an error message instructing the user to install manually.

I also removed the usage of a default Llama tokenizer as this causes issues such as Cornell-RelaxML/quip-sharp#6. It would be good if the repositories were updated to include the corresponding tokenizer files -- every GPTQ, AWQ, and EXL2 repository on HF contains these.

Hopefully the interest in quip-sharp will increase and someone will soon be able to find a solution to the CUDA graphs issue for better performance. I am personally already happy with the 8 tokens/second I am getting for 70b models.

@oobabooga oobabooga merged commit 98361af into dev Dec 6, 2023
@tsengalb99

> I spent a while trying to create GitHub Actions wheels for quip-sharp here and failed, so I gave up and instead just added an error message instructing the user to install manually.

Interesting; we can take a look at that later as a very low-priority thing.

> I also removed the usage of a default Llama tokenizer as this causes issues such as Cornell-RelaxML/quip-sharp#6. It would be good if the repositories were updated to include the corresponding tokenizer files -- every GPTQ, AWQ, and EXL2 repository on HF contains these.

We will try to do that some time in the next few weeks.

> Hopefully the interest in quip-sharp will increase and someone will soon be able to find a solution to the CUDA graphs issue for better performance. I am personally already happy with the 8 tokens/second I am getting for 70b models.

I filed a ticket with Hugging Face (huggingface/transformers#27837) and it's on their todo list. We have faster kernels in the pipeline, so the speed will increase from those alone.

@oobabooga oobabooga deleted the quip-sharp branch December 6, 2023 05:30
@Ph0rk0z (Contributor) commented Dec 6, 2023

@tsengalb99 To make Pascal work fast (like your 1060), it requires upcasting to FP32 math. Pascal also has no tensor cores or fast atomicAdd, but there are functions that can be used in its place and they are reasonable. Compute capability 6.1 also has dp4a instructions that can be used to speed things up.
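
In torch terms, the upcasting idea is simply the following sketch (not the actual quiptools kernels):

```python
# Sketch of the FP32-upcast idea: do the math in float32 on cards where
# half-precision arithmetic is slow (e.g. Pascal), then downcast the result.
import torch

def matmul_upcast(a_fp16: torch.Tensor, b_fp16: torch.Tensor) -> torch.Tensor:
    return (a_fp16.float() @ b_fp16.float()).half()
```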

Why would anyone bother? The P40 is plentiful and is the only other 24GB card besides the 3090 with that much RAM. On top of that, it's $200. Otherwise, people are stuck with janky 7b and 13b models, which are useful as simple tools and that's about it.

If the goal is to run larger models, I think Pascal support is a good thing to have.

@iChristGit commented Dec 8, 2023

running bdist_egg
running egg_info
writing quiptools_cuda.egg-info\PKG-INFO
writing dependency_links to quiptools_cuda.egg-info\dependency_links.txt
writing top-level names to quiptools_cuda.egg-info\top_level.txt
C:\Python\lib\site-packages\torch\utils\cpp_extension.py:502: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
  warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quiptools_cuda.egg-info\SOURCES.txt'
writing manifest file 'quiptools_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
Traceback (most recent call last):
  File "D:\TextGen\repositories\quip-sharp\quiptools\setup.py", line 4, in <module>
    setup(name='quiptools_cuda',
  File "C:\Python\lib\site-packages\setuptools\__init__.py", line 87, in setup
    return distutils.core.setup(**attrs)
  File "C:\Python\lib\site-packages\setuptools\_distutils\core.py", line 177, in setup
    return run_commands(dist)
  File "C:\Python\lib\site-packages\setuptools\_distutils\core.py", line 193, in run_commands
    dist.run_commands()
  File "C:\Python\lib\site-packages\setuptools\_distutils\dist.py", line 968, in run_commands
    self.run_command(cmd)
  File "C:\Python\lib\site-packages\setuptools\dist.py", line 1217, in run_command
    super().run_command(command)
  File "C:\Python\lib\site-packages\setuptools\_distutils\dist.py", line 987, in run_command
    cmd_obj.run()
  File "C:\Python\lib\site-packages\setuptools\command\install.py", line 74, in run
    self.do_egg_install()
  File "C:\Python\lib\site-packages\setuptools\command\install.py", line 123, in do_egg_install
    self.run_command('bdist_egg')
  File "C:\Python\lib\site-packages\setuptools\_distutils\cmd.py", line 317, in run_command
    self.distribution.run_command(command)
  File "C:\Python\lib\site-packages\setuptools\dist.py", line 1217, in run_command
    super().run_command(command)
  File "C:\Python\lib\site-packages\setuptools\_distutils\dist.py", line 987, in run_command
    cmd_obj.run()
  File "C:\Python\lib\site-packages\setuptools\command\bdist_egg.py", line 165, in run
    cmd = self.call_command('install_lib', warn_dir=0)
  File "C:\Python\lib\site-packages\setuptools\command\bdist_egg.py", line 151, in call_command
    self.run_command(cmdname)
  File "C:\Python\lib\site-packages\setuptools\_distutils\cmd.py", line 317, in run_command
    self.distribution.run_command(command)
  File "C:\Python\lib\site-packages\setuptools\dist.py", line 1217, in run_command
    super().run_command(command)
  File "C:\Python\lib\site-packages\setuptools\_distutils\dist.py", line 987, in run_command
    cmd_obj.run()
  File "C:\Python\lib\site-packages\setuptools\command\install_lib.py", line 11, in run
    self.build()
  File "C:\Python\lib\site-packages\setuptools\_distutils\command\install_lib.py", line 112, in build
    self.run_command('build_ext')
  File "C:\Python\lib\site-packages\setuptools\_distutils\cmd.py", line 317, in run_command
    self.distribution.run_command(command)
  File "C:\Python\lib\site-packages\setuptools\dist.py", line 1217, in run_command
    super().run_command(command)
  File "C:\Python\lib\site-packages\setuptools\_distutils\dist.py", line 987, in run_command
    cmd_obj.run()
  File "C:\Python\lib\site-packages\setuptools\command\build_ext.py", line 79, in run
    _build_ext.run(self)
  File "C:\Python\lib\site-packages\setuptools\_distutils\command\build_ext.py", line 339, in run
    self.build_extensions()
  File "C:\Python\lib\site-packages\torch\utils\cpp_extension.py", line 525, in build_extensions
    _check_cuda_version(compiler_name, compiler_version)
  File "C:\Python\lib\site-packages\torch\utils\cpp_extension.py", line 407, in _check_cuda_version
    torch_cuda_version = packaging.version.parse(torch.version.cuda)
  File "C:\Python\lib\site-packages\pkg_resources\_vendor\packaging\version.py", line 49, in parse
    return Version(version)
  File "C:\Python\lib\site-packages\pkg_resources\_vendor\packaging\version.py", line 264, in __init__
    match = self._regex.search(version)
TypeError: expected string or bytes-like object

I'm getting this error after running python setup.py install. I'm on Win11 with a 3090 Ti, and I have NVCC and Visual Studio 2022 installed. @oobabooga, any idea?

@NoMansPC commented Dec 9, 2023

So I ran cmd_windows, copied and pasted the first command to install Quip manually, and it gave me an error.

@iChristGit

> So I ran cmd_windows, copied and pasted the first command to install Quip manually, and it gave me an error.

Can you paste the error?

@cmhamiche commented Dec 9, 2023

On WSL with Ubuntu LTS, quiptools-cuda compiled with CUDA 11.8, not 12.1:

conda install -c "nvidia/label/cuda-11.8.0" cuda

Edit: 9.2 GB of VRAM used for Llama-1-30b-E8P-2Bit at ~4.70 tokens/s on a 3060 12GB; it's bonkers.

@iChristGit

> On WSL with Ubuntu LTS, quiptools-cuda compiled with CUDA 11.8, not 12.1. conda install -c "nvidia/label/cuda-11.8.0" cuda

> Edit: 9.2 GB of VRAM used for Llama-1-30b-E8P-2Bit at ~4.70 tokens/s on a 3060 12GB; it's bonkers.

I tried it but still get this error:

UserWarning: There are no /home/osher/text-generation-webui/installer_files/env/bin/x86_64-conda-linux-gnu-c++ version bounds defined for CUDA version 12.1
  warnings.warn(f'There are no {compiler_name} version bounds defined for CUDA version {cuda_str_version}')
building 'quiptools_cuda' extension
Emitting ninja build file /home/osher/text-generation-webui/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
/home/osher/text-generation-webui/installer_files/env/bin/x86_64-conda-linux-gnu-c++ -shared -Wl,-rpath,/home/osher/text-generation-webui/installer_files/env/lib -Wl,-rpath-link,/home/osher/text-generation-webui/installer_files/env/lib -L/home/osher/text-generation-webui/installer_files/env/lib -Wl,-rpath,/home/osher/text-generation-webui/installer_files/env/lib -Wl,-rpath-link,/home/osher/text-generation-webui/installer_files/env/lib -L/home/osher/text-generation-webui/installer_files/env/lib -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,--allow-shlib-undefined -Wl,-rpath,/home/osher/text-generation-webui/installer_files/env/lib -Wl,-rpath-link,/home/osher/text-generation-webui/installer_files/env/lib -L/home/osher/text-generation-webui/installer_files/env/lib -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/osher/text-generation-webui/installer_files/env/include -DNDEBUG -D_FORTIFY_SOURCE=2 -O2 -isystem /home/osher/text-generation-webui/installer_files/env/include /home/osher/text-generation-webui/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools.o /home/osher/text-generation-webui/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools_e8p_gemv.o /home/osher/text-generation-webui/repositories/quip-sharp/quiptools/build/temp.linux-x86_64-cpython-311/quiptools_wrapper.o -L/home/osher/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/lib -L/home/osher/text-generation-webui/installer_files/env/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-311/quiptools_cuda.cpython-311-x86_64-linux-gnu.so
/home/osher/text-generation-webui/installer_files/env/bin/../lib/gcc/x86_64-conda-linux-gnu/11.2.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status

@cmhamiche commented Dec 9, 2023

You still have CUDA 12.1 installed.

start cmd_wsl.bat
sudo apt remove cuda-12-1
conda install -c "nvidia/label/cuda-11.8.0" cuda

At compilation, you might see this warning instead:

/home/linux/text-gen-install/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/cpp_extension.py:424: UserWarning: There are no g++ version bounds defined for CUDA version 11.8

If nothing works, search for text-gen-install in your WSL home folder, back up your files, delete the text-gen-install folder, and start fresh with CUDA 11.8 at install time.

@BadisG (Contributor) commented Dec 9, 2023

Doesn't work on Windows 10 for me; here are my specs:

PyTorch: 2.1.1+cu118
CUDA: 11.8
C++ compiler: Visual Studio Enterprise 2022 (MSVC 14.3x)

Here's my error:

(textgen) D:\text-generation-webui\repositories\quip-sharp\quiptools>python setup.py install
running install
D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\cmd.py:66: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  self.initialize_options()
running bdist_egg
running egg_info
writing quiptools_cuda.egg-info\PKG-INFO
writing dependency_links to quiptools_cuda.egg-info\dependency_links.txt
writing top-level names to quiptools_cuda.egg-info\top_level.txt
reading manifest file 'quiptools_cuda.egg-info\SOURCES.txt'
writing manifest file 'quiptools_cuda.egg-info\SOURCES.txt'
installing library code to build\bdist.win-amd64\egg
running install_lib
running build_ext
D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py:383: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
  warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'quiptools_cuda' extension
Emitting ninja build file D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] cl /showIncludes /nologo /O2 /W3 /GL /DNDEBUG /MD /MD /wd4819 /wd4251 /wd4244 /wd4267 /wd4275 /wd4018 /wd4190 /wd4624 /wd4067 /wd4068 /EHsc -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\torch\csrc\api\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\TH -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -ID:\anaconda3\envs\textgen\include -ID:\anaconda3\envs\textgen\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -c D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_wrapper.cpp /FoD:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools_wrapper.obj -g -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 /std:c++17
cl : Command line warning D9002 : ignoring unknown option '-g'
cl : Command line warning D9002 : ignoring unknown option '-lineinfo'
[2/3] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools_e8p_gemv.obj.d -std=c++17 --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -Xcompiler /wd4068 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\torch\csrc\api\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\TH -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -ID:\anaconda3\envs\textgen\include -ID:\anaconda3\envs\textgen\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -c D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu -o D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools_e8p_gemv.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O2 -g -Xcompiler -rdynamic -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
FAILED: D:/text-generation-webui/repositories/quip-sharp/quiptools/build/temp.win-amd64-cpython-311/Release/quiptools_e8p_gemv.obj
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools_e8p_gemv.obj.d -std=c++17 --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -Xcompiler /wd4068 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\torch\csrc\api\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\TH -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -ID:\anaconda3\envs\textgen\include -ID:\anaconda3\envs\textgen\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -c D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu -o D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools_e8p_gemv.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O2 -g -Xcompiler -rdynamic -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
cl : Command line warning D9002 : ignoring unknown option '-rdynamic'
quiptools_e8p_gemv.cu
cl : Command line warning D9002 : ignoring unknown option '-rdynamic'
quiptools_e8p_gemv.cu
D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(131): warning #177-D: variable "local_n_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=double]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(132): warning #177-D: variable "local_k_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=double]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(131): warning #177-D: variable "local_n_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=float]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(132): warning #177-D: variable "local_k_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=float]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(218): error: more than one operator "*" matches these operands:
            built-in operator "arithmetic * arithmetic"
            function "c10::operator*(const c10::Half &, const c10::Half &)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/Half-inl.h(93): here
            function "c10::operator*(c10::Half, float)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/Half-inl.h(141): here
            function "c10::operator*(c10::Half, double)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/Half-inl.h(184): here
            function "c10::operator*(c10::Half, int)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/Half-inl.h(214): here
            function "c10::operator*(c10::Half, int64_t)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/Half-inl.h(242): here
            operand types are: c10::Half * int8_t
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::Half]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(131): warning #177-D: variable "local_n_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::Half]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(132): warning #177-D: variable "local_k_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::Half]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(218): error: more than one operator "*" matches these operands:
            built-in operator "arithmetic * arithmetic"
            function "c10::operator*(const c10::BFloat16 &, const c10::BFloat16 &)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/BFloat16-inl.h(96): here
            function "c10::operator*(c10::BFloat16, float)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/BFloat16-inl.h(152): here
            function "c10::operator*(c10::BFloat16, double)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/BFloat16-inl.h(193): here
            function "c10::operator*(c10::BFloat16, int)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/BFloat16-inl.h(221): here
            function "c10::operator*(c10::BFloat16, int64_t)"
D:\anaconda3\envs\textgen\Lib\site-packages\torch\include\c10/util/BFloat16-inl.h(249): here
            operand types are: c10::BFloat16 * int8_t
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::BFloat16]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(131): warning #177-D: variable "local_n_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::BFloat16]"
(293): here

D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools_e8p_gemv.cu(132): warning #177-D: variable "local_k_i" was declared but never referenced
          detected during instantiation of "void decode_matmul_e8p_kernel(scalar_t *, const scalar_t *, const int16_t *, const int64_t *, int64_t, int64_t, int64_t) [with scalar_t=c10::BFloat16]"
(293): here

2 errors detected in the compilation of "D:/text-generation-webui/repositories/quip-sharp/quiptools/quiptools_e8p_gemv.cu".
quiptools_e8p_gemv.cu
[3/3] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc --generate-dependencies-with-compile --dependency-output D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools.obj.d -std=c++17 --use-local-env -Xcompiler /MD -Xcompiler /wd4819 -Xcompiler /wd4251 -Xcompiler /wd4244 -Xcompiler /wd4267 -Xcompiler /wd4275 -Xcompiler /wd4018 -Xcompiler /wd4190 -Xcompiler /wd4624 -Xcompiler /wd4067 -Xcompiler /wd4068 -Xcompiler /EHsc -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\torch\csrc\api\include -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\TH -ID:\anaconda3\envs\textgen\Lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -ID:\anaconda3\envs\textgen\include -ID:\anaconda3\envs\textgen\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Tools\MSVC\14.38.33130\ATLMFC\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Professional\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\um" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\shared" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\winrt" "-IC:\Program Files (x86)\Windows Kits\10\\include\10.0.22621.0\\cppwinrt" -c D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools.cu -o D:\text-generation-webui\repositories\quip-sharp\quiptools\build\temp.win-amd64-cpython-311\Release\quiptools.obj -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -O2 -g -Xcompiler -rdynamic -lineinfo -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=quiptools_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -gencode=arch=compute_86,code=compute_86 -gencode=arch=compute_86,code=sm_86
cl : Command line warning D9002 : ignoring unknown option '-rdynamic'
quiptools.cu
cl : Command line warning D9002 : ignoring unknown option '-rdynamic'
quiptools.cu
D:\text-generation-webui\repositories\quip-sharp\quiptools\quiptools.cu(34): warning #177-D: function "gpuAssert" was declared but never referenced

quiptools.cu
cl : Command line warning D9002 : ignoring unknown option '-rdynamic'
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 2100, in _run_ninja_build
    subprocess.run(
  File "D:\anaconda3\envs\textgen\Lib\subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\text-generation-webui\repositories\quip-sharp\quiptools\setup.py", line 4, in <module>
    setup(name='quiptools_cuda',
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\__init__.py", line 107, in setup
    return distutils.core.setup(**attrs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\core.py", line 185, in setup
    return run_commands(dist)
           ^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\core.py", line 201, in run_commands
    dist.run_commands()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\dist.py", line 969, in run_commands
    self.run_command(cmd)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\dist.py", line 1234, in run_command
    super().run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\install.py", line 80, in run
    self.do_egg_install()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\install.py", line 129, in do_egg_install
    self.run_command('bdist_egg')
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\dist.py", line 1234, in run_command
    super().run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\bdist_egg.py", line 164, in run
    cmd = self.call_command('install_lib', warn_dir=0)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\bdist_egg.py", line 150, in call_command
    self.run_command(cmdname)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\dist.py", line 1234, in run_command
    super().run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\install_lib.py", line 11, in run
    self.build()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\command\install_lib.py", line 111, in build
    self.run_command('build_ext')
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\cmd.py", line 318, in run_command
    self.distribution.run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\dist.py", line 1234, in run_command
    super().run_command(command)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\dist.py", line 988, in run_command
    cmd_obj.run()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\build_ext.py", line 84, in run
    _build_ext.run(self)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 345, in run
    self.build_extensions()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 873, in build_extensions
    build_ext.build_extensions(self)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 467, in build_extensions
    self._build_extensions_serial()
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 493, in _build_extensions_serial
    self.build_extension(ext)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\command\build_ext.py", line 246, in build_extension
    _build_ext.build_extension(self, ext)
  File "D:\anaconda3\envs\textgen\Lib\site-packages\setuptools\_distutils\command\build_ext.py", line 548, in build_extension
    objects = self.compiler.compile(
              ^^^^^^^^^^^^^^^^^^^^^^
  File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 845, in win_wrap_ninja_compile
    _write_ninja_file_and_compile_objects(
  File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 1774, in _write_ninja_file_and_compile_objects
    _run_ninja_build(
  File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 2116, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

@iChristGit

I think we just need to download a pre-compiled wheel and use it instead of building it, @BadisG.

@BadisG (Contributor) commented Dec 9, 2023

Do we have such a wheel yet, @iChristGit?

@iChristGit

> Do we have such a wheel yet, @iChristGit?

Not yet, sadly. I also want to run natively on Windows 11; it's the same errors. I suppose someone who manages to build it could upload a wheel, just like with the old GPTQ.

@TheLounger (Contributor) commented Dec 10, 2023

File "D:\anaconda3\envs\textgen\Lib\site-packages\torch\utils\cpp_extension.py", line 2116, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error compiling objects for extension

This can be fixed by disabling Ninja in setup.py, line 10:

cmdclass={'build_ext': cpp_extension.BuildExtension.with_options(use_ninja=False)})
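
For context, this is roughly where that lands in quiptools' setup.py; the source file names are taken from the build logs above, and the rest is a sketch rather than the exact file:

```python
from setuptools import setup
from torch.utils import cpp_extension

setup(
    name='quiptools_cuda',
    ext_modules=[cpp_extension.CUDAExtension(
        'quiptools_cuda',
        ['quiptools.cu', 'quiptools_e8p_gemv.cu', 'quiptools_wrapper.cpp'],
    )],
    # use_ninja=False falls back to the slower distutils backend, which
    # sidesteps the "ninja -v returned non-zero exit status 1" failure above.
    cmdclass={'build_ext': cpp_extension.BuildExtension.with_options(use_ninja=False)},
)
```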

But then you'll also probably get this:

quiptools_e8p_gemv.cu(218): error: more than one operator "*" matches these operands:
            built-in operator "arithmetic * arithmetic"
            function "c10::operator*(const c10::Half &, const c10::Half &)" .....

The Internet suggests we need the -D__CUDA_NO_HALF_OPERATORS__ nvcc flag, but it doesn't seem to make a difference (it also appears to be in use already).
set TORCH_NVCC_FLAGS="-D__CUDA_NO_HALF_OPERATORS__" doesn't work either.

Windows 10
PyTorch: 2.1.1+cu121
CUDA: 12.1.1
C++: VS2022 | MSVC 14.36.32532

Note that oobabooga already attempted to make wheels, so for Windows we might just need to wait for that to succeed, or for the QuIP# devs to give some pointers or fix their setup script.

@NoMansPC

Man, I keep hoping that Quip will work out of the box with each new iteration of the webui, but so far, still no luck. It's still asking to install Quip manually.

@iChristGit

> Man, I keep hoping that Quip will work out of the box with each new iteration of the webui, but so far, still no luck. It's still asking to install Quip manually.

It takes time. When GPTQ was first released, I was pulling my hair out with each error trying to compile it; now it's a one-click install.
Imagine running Mixtral-8x7B at 2-bit with great perplexity :O

@NoMansPC

> Man, I keep hoping that Quip will work out of the box with each new iteration of the webui, but so far, still no luck. It's still asking to install Quip manually.

> It takes time. When GPTQ was first released, I was pulling my hair out with each error trying to compile it; now it's a one-click install. Imagine running Mixtral-8x7B at 2-bit with great perplexity :O

hahaha yeah I know. Same with Exllama 2. It wouldn't work at all when it was released.

@tsengalb99

Sorry we (the QuIP# team) can't be of much help here, since we don't have any access to Windows machines with NVIDIA GPUs. We're hoping to package quiptools into a wheel in the future when it becomes more mature, but as of now, since QuIP# is a WIP, the install process is a bit more involved (but hopefully not too involved).

@Nicoolodion2

Okay... who is gonna quantize Mixtral-8x7B? And what VRAM/RAM requirements would that have?

@iChristGit

> Okay... who is gonna quantize Mixtral-8x7B? And what VRAM/RAM requirements would that have?

It's already been quantized by TheBloke:
https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF
GGUF/GPTQ/AWQ are ready (haven't checked yet)

@Nicoolodion2

> Okay... who is gonna quantize Mixtral-8x7B? And what VRAM/RAM requirements would that have?

> It's already been quantized by TheBloke: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF GGUF/GPTQ/AWQ are ready (haven't checked yet)

I thought TheBloke still doesn't provide QuIP# quantizations?

@iChristGit

> Okay... who is gonna quantize Mixtral-8x7B? And what VRAM/RAM requirements would that have?

> It's already been quantized by TheBloke: https://huggingface.co/TheBloke/Mixtral-8x7B-v0.1-GGUF GGUF/GPTQ/AWQ are ready (haven't checked yet)

> I thought TheBloke still doesn't provide QuIP# quantizations?

You are right, I was thinking you wanted any kind of quant, not QuIP#.
I don't know who the hero will be that starts quantizing 2-bit QuIP# models.

@BadisG (Contributor) commented Dec 16, 2023

@iChristGit As long as it still doesn't work on Windows I don't see the incentive for it...

@Nicoolodion2

> @iChristGit As long as it still doesn't work on Windows I don't see the incentive for it...

Oh wait what? It doesn't work on Windows yet? That would explain a lot for me, because I haven't been able to get it running yet...

@iChristGit

> @iChristGit As long as it still doesn't work on Windows I don't see the incentive for it...

> Oh wait what? It doesn't work on Windows yet? That would explain a lot for me, because I haven't been able to get it running yet...

Yep, it's hard to figure out how to compile it on Windows, but on Linux it's easy, from what people say.
I don't know why someone can't just upload the wheel for us; maybe it's not as straightforward.

@iChristGit

> @iChristGit As long as it still doesn't work on Windows I don't see the incentive for it...

Yep... you can run WSL on Windows in the meantime, maybe.

@iChristGit

As of the latest commits, QuIP# is marked as only available on Linux. Does this mean it's not possible to make it work on Windows at all? @oobabooga

@oobabooga (Owner Author)

I have tried to compile it for Windows using GitHub Actions and it fails with some vague errors. I think that there is something in the QuIP# code itself that prevents it from compiling on Windows.

@Nicoolodion2

Okay, I am on Windows WSL (Ubuntu) now, but I get this error when I run python setup.py install. A part of the error is:

error: can't create or remove files in install directory

The following error occurred while trying to add or remove files in the
installation directory:

[Errno 13] Permission denied: '/usr/local/lib/python3.10/dist-packages/test-easy-install-4350.write-test'

@CamiloMM

Is there any technical reason for it not working on Windows, or is it just "this is too new, and nobody really tried"? If someone bumped into a roadblock, it might be good to document it (some dependency not compiling?).

For the few that managed to run it, is it really as good as the perplexity claims make it seem?

@iChristGit

> I have tried to compile it for Windows using GitHub Actions and it fails with some vague errors. I think that there is something in the QuIP# code itself that prevents it from compiling on Windows.

That's a comment from ooba a couple of weeks back; it's still the same issue, it won't compile on Windows.

@CamiloMM commented Feb 1, 2024

OK, tried it for a bit now; the thing that hangs is the fast-hadamard-transform package, which fails with:

UserWarning: fast_hadamard_transform was requested, but nvcc was not found.  Are you sure your environment has nvcc available?

However, nvcc --version reports:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:09:35_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

That was as far as I got because I have no idea what to do next and Python befuddles me.

@Tralen commented Feb 4, 2024

@CamiloMM, I solved that problem by installing the current cuda-toolkit from the NVIDIA website (I'm on Linux Mint).

@Nicoolodion2, it is a permission issue; I had to add --user to the command to get setup.py to run correctly. It would be good to know if we should be setting up a virtualenv in the instructions.
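
That is, from the quiptools directory:

```
cd repositories/quip-sharp/quiptools
python setup.py install --user
```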

In any case, even though quip-sharp is right there, oobabooga still doesn't find it for me. I can't get past the error:

QuIP# has not been found. It must be installed manually for now.
