huggingface · zucchini-nlp · Aug 6, 2024 · Jul 19, 2024 · Jul 20, 2024 · Jul 23, 2024
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -99,6 +99,8 @@
     sections:
     - local: generation_strategies
       title: Customize the generation strategy
+    - local: kv_cache
+      title: Best Practices for Generation with Cache
     title: Generation
   - isExpanded: false
     sections:

diff --git a/docs/source/en/generation_strategies.md b/docs/source/en/generation_strategies.md
@@ -174,117 +174,6 @@ An increasing sequence: one, two, three, four, five, six, seven, eight, nine, te
 ```
 
 
-## KV Cache Quantization
-
-The `generate()` method supports caching keys and values to enhance efficiency and avoid re-computations. However the key and value
-cache can occupy a large portion of memory, becoming a bottleneck for long-context generation, especially for Large Language Models.
-Quantizing the cache when using `generate()` can significantly reduce memory requirements at the cost of speed.
-
-KV Cache quantization in `transformers` is largely inspired by the paper [KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache]
-(https://arxiv.org/abs/2402.02750) and currently supports `quanto` and `HQQ` as backends. For more information on the inner workings see the paper.
-
-To enable quantization of the key-value cache, one needs to indicate `cache_implementation="quantized"` in the `generation_config`.
-Quantization related arguments should be passed to the `generation_config` either as a `dict` or an instance of a [`QuantizedCacheConfig`] class.
-One has to indicate which quantization backend to use in the [`QuantizedCacheConfig`], the default is `quanto`.
-
-<Tip warning={true}>
-
-Cache quantization can be detrimental if the context length is short and there is enough GPU VRAM available to run without cache quantization.
-
-</Tip>
-
-
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
-
->>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
->>> model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
->>> inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20, cache_implementation="quantized", cache_config={"nbits": 4, "backend": "quanto"})
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. It's a great way to express myself and rel
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=20)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-I like rock music because it's loud and energetic. I like to listen to it when I'm feeling
-```
-
-## KV Cache Offloading
-
-Similarly to KV cache quantization, this strategy aims to reduce GPU VRAM usage.
-It does so by moving the KV cache for most layers to the CPU.
-As the model's `forward()` method iterates over the layers, this strategy maintains the current layer cache on the GPU.
-At the same time it asynchronously prefetches the next layer cache as well as sending the previous layer cache back to the CPU.
-Unlike KV cache quantization, this strategy always produces the same result as the default KV cache implementation.
-Thus, it can serve as a drop-in replacement or a fallback for it.
-
-Depending on your model and the characteristics of your generation task (size of context, number of generated tokens, number of beams, etc.)
-you may notice a small degradation in generation throughput compared to the default KV cache implementation.
-
-To enable KV cache offloading, pass `cache_implementation="offloaded"` in the `generation_config`.
-
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
-
->>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
->>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
->>> inputs = tokenizer("Fun fact: The shortest", return_tensors="pt").to(model.device)
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23, cache_implementation="offloaded")
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
-
->>> out = model.generate(**inputs, do_sample=False, max_new_tokens=23)
->>> print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
-Fun fact: The shortest war in history was between Britain and Zanzibar on August 27, 1896.
-```
-
-<Tip warning={true}>
-
-Cache offloading requires a GPU and can be slower than the default KV cache. Use it if you are getting CUDA out of memory errors.
-
-</Tip>
-
-The example below shows how KV cache offloading can be used as a fallback strategy.
-```python
->>> import torch
->>> from transformers import AutoTokenizer, AutoModelForCausalLM
->>> def resilient_generate(model, *args, **kwargs):
-...     oom = False
-...     try:
-...         return model.generate(*args, **kwargs)
-...     except torch.cuda.OutOfMemoryError as e:
-...         print(e)
-...         print("retrying with cache_implementation='offloaded'")
-...         oom = True
-...     if oom:
-...         torch.cuda.empty_cache()
-...         kwargs["cache_implementation"] = "offloaded"
-...         return model.generate(*args, **kwargs)
-...
-...
->>> ckpt = "microsoft/Phi-3-mini-4k-instruct"
->>> tokenizer = AutoTokenizer.from_pretrained(ckpt)
->>> model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16).to("cuda:0")
->>> prompt = ["okay "*1000 + "Fun fact: The most"]
->>> inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
->>> beams = { "num_beams": 40, "num_beam_groups": 40, "num_return_sequences": 40, "diversity_penalty": 1.0, "max_new_tokens": 23, "early_stopping": True, }
->>> out = resilient_generate(model, **inputs, **beams)
->>> responses = tokenizer.batch_decode(out[:,-28:], skip_special_tokens=True)
-```
-
-On a GPU with 50 GB of RAM, running this code will print
-```
-CUDA out of memory. Tried to allocate 4.83 GiB. GPU
-retrying with cache_implementation='offloaded'
-```
-before successfully generating 40 beams.
-
-
 ## Watermarking
 
 The `generate()` supports watermarking the generated text by randomly marking a portion of tokens as "green".

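Note on the hunk above: the removed "KV Cache Offloading" text describes keeping the current layer's cache on the GPU, prefetching the next layer, and sending the previous layer back to the CPU. The toy sketch below illustrates only that bookkeeping. It is not the `OffloadedCache` implementation: the class `ToyOffloadedCache`, its `gpu` and `cpu` dictionaries, and the `layer` helper are hypothetical, and plain synchronous dictionary moves stand in for the asynchronous device copies the text mentions.

```python
class ToyOffloadedCache:
    """Toy model of layer-wise KV cache offloading (hypothetical, for illustration only)."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        # All layers start in "host" storage; nothing is resident "on device" yet.
        self.cpu = {i: [] for i in range(num_layers)}
        self.gpu = {}

    def prefetch_layer(self, layer_idx):
        # Bring a layer's cache "on device" if it is not already resident.
        layer_idx %= self.num_layers
        if layer_idx not in self.gpu:
            self.gpu[layer_idx] = self.cpu.pop(layer_idx)

    def evict_previous_layer(self, layer_idx):
        # Send the previous layer's cache back to "host" storage.
        prev = (layer_idx - 1) % self.num_layers
        if prev != layer_idx and prev in self.gpu:
            self.cpu[prev] = self.gpu.pop(prev)

    def update(self, kv, layer_idx):
        # Keep the current layer resident, evict the previous one,
        # and prefetch the next one before appending the new entry.
        self.prefetch_layer(layer_idx)
        self.evict_previous_layer(layer_idx)
        self.prefetch_layer(layer_idx + 1)
        self.gpu[layer_idx].append(kv)
        return list(self.gpu[layer_idx])

    def layer(self, layer_idx):
        # Read a layer's cache regardless of where it currently lives.
        return self.gpu[layer_idx] if layer_idx in self.gpu else self.cpu[layer_idx]


# Simulate three decoding steps through a four-layer model.
cache = ToyOffloadedCache(num_layers=4)
max_resident = 0
for token in range(3):
    for layer_idx in range(4):
        cache.update((f"k{token}", f"v{token}"), layer_idx)
        max_resident = max(max_resident, len(cache.gpu))
```

In this toy, at most two layers are resident at once (the current layer and the prefetched next one), while every layer still accumulates one entry per decoding step. This mirrors the claim in the removed text that offloading changes where the cache lives without changing its contents.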
diff --git a/docs/source/en/internal/generation_utils.md b/docs/source/en/internal/generation_utils.md
@@ -386,18 +386,36 @@ A [`Constraint`] can be used to force the generation to include specific tokens
     - get_seq_length
     - reorder_cache
 
+[[autodoc]] OffloadedCache
+    - update
+    - prefetch_layer
+    - evict_previous_layer
+
 [[autodoc]] StaticCache
     - update
     - get_seq_length
     - reset
 
+[[autodoc]] HybridCache
+    - update
+    - reset
+
+[[autodoc]] SlidingWindowCache
+    - update
+    - reset
+
 [[autodoc]] EncoderDecoderCache
     - get_seq_length
     - to_legacy_cache
     - from_legacy_cache
     - reset
     - reorder_cache
 
+[[autodoc]] MambaCache
+    - update_conv_state
+    - update_ssm_state
+    - reset
+
 ## Watermark Utils
 
 [[autodoc]] WatermarkDetector