Strangely, LanguageBind_Image preprocessor_config.json is missing while running demo #57

Closed
OPilgrim opened this issue Dec 25, 2023 · 8 comments

Traceback (most recent call last):
File "/data/miniconda3/envs/bind/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/data/miniconda3/envs/bind/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/serve/tmp.py", line 57, in
main()
File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/serve/tmp.py", line 20, in main
tokenizer, model, processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit, load_4bit, device=device)
File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/model/builder.py", line 154, in load_pretrained_model
image_tower.load_model()
File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 23, in load_model
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
File "/data/miniconda3/envs/bind/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 165, in from_pretrained
image_processor_dict, kwargs = cls.get_image_processor_dict(pretrained_model_name_or_path, **kwargs)
File "/data/miniconda3/envs/bind/lib/python3.10/site-packages/transformers/image_processing_utils.py", line 269, in get_image_processor_dict
resolved_image_processor_file = cached_file(
File "/data/miniconda3/envs/bind/lib/python3.10/site-packages/transformers/utils/hub.py", line 388, in cached_file
raise EnvironmentError(
OSError: LanguageBind/LanguageBind_Image does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co/LanguageBind/LanguageBind_Image/None' for available files.
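
For reference, the failure can be reproduced with a minimal call mirroring what clip_encoder.py does; this is only a sketch, using the same repo id that appears in the traceback:

```python
# Minimal sketch of the call that fails in clip_encoder.py: at the time of this report,
# the LanguageBind/LanguageBind_Image repo on the Hub had no preprocessor_config.json,
# so from_pretrained raised the OSError shown above.
from transformers import CLIPImageProcessor

image_processor = CLIPImageProcessor.from_pretrained("LanguageBind/LanguageBind_Image")
print(image_processor)
```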

@OPilgrim OPilgrim changed the title Strangely enough, LanguageBind_Image preprocessor_config.json is missing while running demo Strangely, LanguageBind_Image preprocessor_config.json is missing while running demo Dec 25, 2023

awzhgw commented Jan 6, 2024

Yes, I hit the same failure.

awzhgw commented Jan 6, 2024

@LinB203, can you resolve it?

LinB203 commented Jan 6, 2024

Sorry for the late reply. I have uploaded preprocessor_config.json. Feel free to let me know if this works.
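
A quick way to confirm the file is now on the Hub before re-running the demo (a sketch using huggingface_hub; the repo id is the one from the traceback):

```python
# Sketch: list the files in the Hub repo and check that preprocessor_config.json is present.
from huggingface_hub import list_repo_files

files = list_repo_files("LanguageBind/LanguageBind_Image")
print("preprocessor_config.json" in files)  # expected True after the upload
```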

OPilgrim commented Jan 8, 2024

Thank you for the update, but now there is a new problem. It seems that the hidden dimension of the mm_video_tower used by Video-LLaVA-7B does not match the hidden dimension Video-LLaVA-7B expects. Are you sure that the LanguageBind_Video_merge on Hugging Face is the correct version?

...
- This IS NOT expected if you are initializing LlavaLlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

['Video', 'Image']
You are using a model of type LanguageBindImage to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
Traceback (most recent call last):
  File "/data/miniconda3/envs/bind/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/miniconda3/envs/bind/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/serve/cli.py", line 144, in <module>
    main(args)
  File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/serve/cli.py", line 32, in main
    tokenizer, model, processor, context_len = load_pretrained_model(args.model_path, args.model_base, model_name,
  File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/model/builder.py", line 154, in load_pretrained_model
    image_tower.load_model()
  File "/data/Projects/FactCheck/MultiModel/LVLMs/Video-LLaVA/llava/model/multimodal_encoder/clip_encoder.py", line 24, in load_model
    self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name)
  File "/data/miniconda3/envs/bind/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/data/miniconda3/envs/bind/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3278, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
        size mismatch for vision_model.embeddings.class_embedding: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.embeddings.position_ids: copying a param with shape torch.Size([1, 257]) from checkpoint, the shape in current model is torch.Size([1, 50]).
        size mismatch for vision_model.embeddings.patch_embedding.weight: copying a param with shape torch.Size([1024, 3, 14, 14]) from checkpoint, the shape in current model is torch.Size([768, 3, 32, 32]).
        size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([257, 1024]) from checkpoint, the shape in current model is torch.Size([50, 768]).
        size mismatch for vision_model.pre_layrnorm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.pre_layrnorm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.k_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.v_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.q_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.weight: copying a param with shape torch.Size([1024, 1024]) from checkpoint, the shape in current model is torch.Size([768, 768]).
        size mismatch for vision_model.encoder.layers.0.self_attn.out_proj.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
        size mismatch for vision_model.encoder.layers.0.layer_norm1.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([768]).
...
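
For context on the shapes above: the "current model" sizes match the defaults of a plain CLIPVisionConfig (a ViT-B/32-style vision tower), while the checkpoint tensors are ViT-L/14-sized (hidden 1024, patch 14, 257 position ids), which suggests the LanguageBindImage config values were not picked up. A small sketch that reproduces the default-side numbers:

```python
# Sketch: the 768 / 32x32 / 50-position shapes in the error correspond to the defaults of
# CLIPVisionConfig, i.e. what CLIPVisionModel uses when the custom config is not understood.
from transformers import CLIPVisionConfig

default_cfg = CLIPVisionConfig()
print(default_cfg.hidden_size)                                      # 768
print(default_cfg.patch_size)                                       # 32
print((default_cfg.image_size // default_cfg.patch_size) ** 2 + 1)  # 50 position ids
```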

LinB203 commented Jan 8, 2024

Could you share your code?
The merge version is not recommended with our API.

OPilgrim commented Jan 8, 2024

I just ran the test demo:
python -m llava.serve.cli --model-path "LanguageBind/Video-LLaVA-7B" --image-file "./assets/main.jpg" --load-4bit
Then I changed "mm_image_tower" and "mm_video_tower" in LanguageBind/Video-LLaVA-7B/config.json to local paths.
[screenshot]
I also ran the "Inference for image" code directly, and the same problem occurred.
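
For reference, a minimal sketch of that config edit (the local directories below are placeholders, not actual paths):

```python
# Sketch: point the tower entries in a local copy of Video-LLaVA-7B/config.json at local
# LanguageBind checkpoints. Paths are placeholders; adjust to your own layout.
import json

cfg_path = "Video-LLaVA-7B/config.json"  # local copy of the model repo
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["mm_image_tower"] = "/path/to/LanguageBind_Image"
cfg["mm_video_tower"] = "/path/to/LanguageBind_Video_merge"

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```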

LinB203 commented Jan 8, 2024

If you want to load from local paths, make sure your local path is correct. Here is a sample loaded with my local path; it works fine.

[screenshots]

Then remove some of the restrictions.

[screenshot]
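
Before loading, it may also help to sanity-check that the local tower directories contain the config files the loader looks for (a sketch; the directory names are placeholders):

```python
# Sketch: check that the local LanguageBind directories contain the config files that
# from_pretrained looked for in the tracebacks above. Paths are placeholders.
import os

towers = ["/path/to/LanguageBind_Image", "/path/to/LanguageBind_Video_merge"]
for tower in towers:
    for fname in ("config.json", "preprocessor_config.json"):
        path = os.path.join(tower, fname)
        print(path, "exists" if os.path.isfile(path) else "MISSING")
```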

OPilgrim commented Jan 8, 2024

Thank you very much. It works.
