Introduce outlines.models.transformers_vision
#1052
Conversation
Force-pushed from e033200 to 6adb73b (Compare)
```python
assert re.fullmatch(pattern, res) is not None, res


@pytest.mark.parametrize("pattern", REGEX_PATTERNS)
@pytest.mark.skip(
```
There are a handful of open JSON validation issues. This is a good integration test case for addressing JSON generation failures in general, because it applies random models to structured JSON generation.
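For context, a hypothetical sketch of the kind of integration test described here (the fixture list and schema are illustrative assumptions, not the repository's actual code):

```python
import pytest
from pydantic import BaseModel

import outlines

# Hypothetical stand-in for the suite's list of model fixtures.
ALL_MODEL_FIXTURES = ["model_transformers", "model_transformers_vision"]


class Character(BaseModel):
    # Illustrative schema; any structured type exercises the JSON path.
    name: str
    age: int


@pytest.mark.parametrize("model_fixture", ALL_MODEL_FIXTURES)
def test_json_generation(request, model_fixture):
    model = request.getfixturevalue(model_fixture)
    generator = outlines.generate.json(model, Character)
    result = generator("Describe a character:", max_tokens=100)
    # outlines.generate.json returns an instance of the schema on success.
    assert isinstance(result, Character)
```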
Create convenience function to load a `PIL.Image` from URL
Suggested change: open the code fence with ```` ```python ```` instead of a bare ```` ``` ````.
```python
from pydantic import BaseModel
from typing import List, Optional


def img_from_url(url)
```
Function missing
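Presumably the completed helper matches the version used in the bug report later in this thread:

```python
from io import BytesIO
from urllib.request import urlopen

from PIL import Image


def img_from_url(url):
    # Fetch the raw bytes and decode them into an RGB PIL.Image.
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")
```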
Very excited about this update! After pip installing this branch, I'm using the tiny `llava-hf/llava-interleave-qwen-0.5b-hf` model (since I'm running on my GPU-less laptop), which is also a `LlavaNextForConditionalGeneration` model. I was able to recreate the below error with `bczhou/tiny-llava-v1-hf` as well. Here's the full code snippet:

```python
import outlines
from outlines.models.transformers_vision import transformers_vision

model = transformers_vision(
    'llava-hf/llava-interleave-qwen-0.5b-hf'
)

from PIL import Image
from io import BytesIO
from urllib.request import urlopen

def img_from_url(url):
    img_byte_stream = BytesIO(urlopen(url).read())
    return Image.open(img_byte_stream).convert("RGB")

description_generator = outlines.generate.text(model)
description_generator(
    "<image> detailed description:",
    [img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")]
)
```

And the full error I get, with `transformers==4.43.3` and `torch==2.2.2`:

```
TypeError                                 Traceback (most recent call last)
Cell In[10], line 16
13 return Image.open(img_byte_stream).convert("RGB")
15 description_generator = outlines.generate.text(model)
---> 16 description_generator(
17 "<image> detailed description:",
18 [img_from_url("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg")]
19 )
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/generate/api.py:555, in VisionSequenceGeneratorAdapter.__call__(self, prompts, media, max_tokens, stop_at, seed, **model_specific_params)
549 prompts, media = self._validate_prompt_media_types(prompts, media)
551 generation_params = self.prepare_generation_parameters(
552 max_tokens, stop_at, seed
553 )
--> 555 completions = self.model.generate(
556 prompts,
557 media,
558 generation_params,
559 self.logits_processor,
560 self.sampling_params,
561 **model_specific_params,
562 )
564 return self._format(completions)
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/models/transformers_vision.py:56, in TransformersVision.generate(self, prompts, media, generation_parameters, logits_processor, sampling_parameters)
46 inputs = self.processor(prompts, media, padding=True, return_tensors="pt").to(
47 self.model.device
48 )
50 generation_kwargs = self._get_generation_kwargs(
51 prompts,
52 generation_parameters,
53 logits_processor,
54 sampling_parameters,
55 )
---> 56 generated_ids = self._generate_output_seq(prompts, inputs, **generation_kwargs)
58 # if single str input and single sample per input, convert to a 1D output
59 if isinstance(prompts, str):
60 # Should always be true until NotImplementedError above is fixed
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/outlines/models/transformers.py:350, in Transformers._generate_output_seq(self, prompts, inputs, generation_config, **generation_kwargs)
346 def _generate_output_seq(
347 self, prompts, inputs, generation_config, **generation_kwargs
348 ):
349 input_ids = inputs["input_ids"]
--> 350 output_ids = self.model.generate(
351 **inputs, generation_config=generation_config, **generation_kwargs
352 )
354 # encoder-decoder returns output_ids only, decoder-only returns full seq ids
355 if self.model.config.is_encoder_decoder:
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/generation/utils.py:1989, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1981 input_ids, model_kwargs = self._expand_inputs_for_generation(
1982 input_ids=input_ids,
1983 expand_size=generation_config.num_return_sequences,
1984 is_encoder_decoder=self.config.is_encoder_decoder,
1985 **model_kwargs,
1986 )
1988 # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 1989 result = self._sample(
1990 input_ids,
1991 logits_processor=prepared_logits_processor,
1992 logits_warper=prepared_logits_warper,
1993 stopping_criteria=prepared_stopping_criteria,
1994 generation_config=generation_config,
1995 synced_gpus=synced_gpus,
1996 streamer=streamer,
1997 **model_kwargs,
1998 )
2000 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2001 # 11. prepare logits warper
2002 prepared_logits_warper = (
2003 self._get_logits_warper(generation_config, device=input_ids.device)
2004 if generation_config.do_sample
2005 else None
2006 )
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/generation/utils.py:2932, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
2929 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
2931 # forward pass to get next token
-> 2932 outputs = self(**model_inputs, return_dict=True)
2934 if synced_gpus and this_peer_finished:
2935 continue # don't waste resources running the code we don't need
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
1509 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1510 else:
-> 1511 return self._call_impl(*args, **kwargs)
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
1515 # If we don't have any hooks, we want to skip the rest of the logic in
1516 # this function, and just call forward.
1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1518 or _global_backward_pre_hooks or _global_backward_hooks
1519 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520 return forward_call(*args, **kwargs)
1522 try:
1523 result = None
File ~/miniconda3/envs/blendsql/lib/python3.9/site-packages/transformers/models/llava_next/modeling_llava_next.py:766, in LlavaNextForConditionalGeneration.forward(self, input_ids, pixel_values, image_sizes, attention_mask, position_ids, past_key_values, inputs_embeds, vision_feature_layer, vision_feature_select_strategy, labels, use_cache, output_attentions, output_hidden_states, return_dict)
763 # 2. Merge text and images
764 if pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) > 0:
765 # ! infer image_num_patches from image_sizes
--> 766 image_num_patches = [
767 image_size_to_num_patches(
768 image_size=imsize,
769 grid_pinpoints=self.config.image_grid_pinpoints,
770 patch_size=self.config.vision_config.image_size,
771 )
772 for imsize in image_sizes
773 ]
774 # figure out if pixel_values is concatenated or stacked
775 if pixel_values.dim() == 5:
776 # stacking when input is (batch_size, num_patches, num_channels, height, width)
TypeError: 'NoneType' object is not iterable
```

Good (?) news is that it's at the transformers level; I haven't had time to debug in much detail though. As a sanity check, I verified that the

Not opening a stand-alone issue since this feature isn't part of an official outlines release yet, but happy to create one if you'd prefer!
@parkervg I was able to reproduce your error. The issue was the model class it's trying to use by default. Could you please try setting the model and processor classes?
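A sketch of what that might look like, assuming the `model_class` / `processor_class` keyword arguments from this PR (the choice of `LlavaForConditionalGeneration` for this checkpoint is an assumption and would need verifying):

```python
from transformers import AutoProcessor, LlavaForConditionalGeneration

from outlines.models.transformers_vision import transformers_vision

# Assumption: the llava-interleave checkpoint should load with
# LlavaForConditionalGeneration rather than the LlavaNext class that the
# auto-detection picked in the traceback above.
model = transformers_vision(
    'llava-hf/llava-interleave-qwen-0.5b-hf',
    model_class=LlavaForConditionalGeneration,
    processor_class=AutoProcessor,
)
```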
We probably want to default to
Thanks for the guidance! I opened an issue and a corresponding PR here: #1077
Rendered Docs: https://github.com/lapp0/outlines/blob/multimodal-models/docs/reference/models/transformers_vision.md
Changes
- `models.transformers_vision`, which subclasses `models.transformers` and overrides its behavior so it applies `AutoProcessor` instead of `AutoTokenizer` to handle the text AND `PIL.Image` media (see the sketch below)
- `VisionSequenceGeneratorAdapter`, handling and validating the `media` argument
- `outlines.generate` updated to dispatch `TransformersVision` models to `VisionSequenceGeneratorAdapter`
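A minimal sketch of that subclassing idea (illustration only, not the PR's actual code; the class and method names are hypothetical, though the processor call mirrors the one visible in the traceback above):

```python
from outlines.models.transformers import Transformers


class TransformersVisionSketch(Transformers):
    """Illustrative subclass: swap the tokenizer-only path for a processor."""

    def __init__(self, model, tokenizer, processor):
        super().__init__(model, tokenizer)
        self.processor = processor

    def encode(self, prompts, media):
        # The processor encodes the text prompts and the PIL.Image media
        # together in a single call.
        return self.processor(prompts, media, padding=True, return_tensors="pt").to(
            self.model.device
        )
```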
Tests
- `tests/generate/test_api.py`: test `prompt`/`media` validation (a hypothetical sketch follows this list)
- `tests/generate/test_generate.py`: `model_transformers_vision` fixture. Tests pass locally, but are disabled because a model small enough for CI isn't available
- `outlines.generate` generators, to ensure dispatch for this new sequence generator is handled correctly
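A hypothetical sketch of the `prompt`/`media` validation test (the fixture name comes from the PR; the exact exception type is an assumption):

```python
import pytest

import outlines


def test_prompt_media_validation(model_transformers_vision):
    generator = outlines.generate.text(model_transformers_vision)
    # A str prompt should be paired with a list of images; passing a bare
    # string as media should be rejected by the adapter's validation.
    with pytest.raises(TypeError):
        generator("<image> detailed description:", "not-an-image-list")
```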