flash attention support #17

Closed
fmmoret opened this issue Oct 24, 2023 · 3 comments · Fixed by #20

fmmoret commented Oct 24, 2023

Relevant issues & PRs:

huggingface/transformers#26557
huggingface/transformers#26350
huggingface/transformers#26585

michaelfeil (Owner) commented

FlashAttention and related techniques would be something that is useful on GPU (only/mostly). It could give a 2-4x speedup for BERT.
As this is a free-time project, I only see the option of implementing this in the upstream inference engine, e.g. (sentence-transformers + HF transformers + torch), (fastembed + onnx), or TensorRT (e.g. by adding the newly released TensorRT-LLM engine).

My best guess would be to add TensorRT-LLM, which supports BERT with fp8 and FlashAttention, afaik. I would wait some weeks to let TensorRT-LLM get more mature.

Contributions are of course also welcome.

michaelfeil self-assigned this Oct 28, 2023
michaelfeil (Owner) commented

I'll add FlashAttention via BetterTransformer:
https://huggingface.co/docs/optimum/bettertransformer/tutorials/convert
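
A rough sketch of that conversion, following the optimum tutorial linked above (the model name is just an example):

```python
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("bert-base-uncased")  # example model
model = BetterTransformer.transform(model)  # swaps attention for the fused SDPA path
```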

michaelfeil (Owner) commented

infinity now uses torch.nn.functional.scaled_dot_product_attention via BetterTransformer.
Attention is now anywhere from 1.5-3x faster, making infinity around 20% faster on batch inference. This does not use FlashAttention directly, as we use an attention_mask. See: the PyTorch-native scaled_dot_product_attention operator can only dispatch to Flash Attention if no attention_mask is provided.
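
A minimal sketch of that dispatch behaviour (not infinity's code; assumes a CUDA device and torch>=2.0):

```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
mask = torch.ones(1, 1, 128, 128, device="cuda", dtype=torch.bool)  # e.g. a padding mask

# Without an attention_mask, the flash kernel is eligible.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v)

# With an attention_mask, the flash kernel cannot be used, so other backends must stay enabled.
with torch.backends.cuda.sdp_kernel(enable_flash=False, enable_math=True, enable_mem_efficient=True):
    out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```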

Beyond that, you can set the torch backend to .half() precision (fp16), which boosts performance by another 30-40%, but loses numerical precision (roughly 10**-6 -> 10**-3):

```python
if self._target_device.type == "cuda" and os.environ.get(
    "INFINITY_TORCH_ENABLE_HALF", False
):
    logger.info(
        "Switching to half() precision (fp16). "
        "Enabled by setting the env var `INFINITY_TORCH_ENABLE_HALF`."
    )
    self.half()
```
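
Note that the check above only tests the env var for truthiness, so any non-empty value enables it, for example:

```python
import os

# Environment variable values are strings, so any non-empty value is truthy.
os.environ["INFINITY_TORCH_ENABLE_HALF"] = "1"
```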

Perhaps the integration will get better with upcoming versions of torch>=2.0.0 or new releases of optimum.
