Add multimodal LLM support #16782
Replies: 4 comments 2 replies
-
Hi, there are several add-ons that can already do this. A key issue is that these services may require Internet access (and some do) unless compact local models become usable at some point, provided the hardware can handle them; that is getting easier thanks to neural processing units and other AI accelerators. Thanks.
-
See this add-on for more details:
-
Personally, I dream of an assistant that understands sighted instructions like "look at the top/bottom/left/right" on web sites and in applications, something that drives me crazy every time, and that perhaps allows interacting with images (like clicking a point on a map). But these situations often involve personal context, so any cloud service is unfortunately a privacy hole. And as far as I know, models that run on a basic PC are not capable enough at the moment.
-
Yes, but an assistant does not actually have to be an AI model. It could be a combination of NVDA and Voice Access, or a text-based alternative to it, in Windows 11. No cloud needed.
-
Hi,
Considering the recent developments in LLMs, especially multimodal ones, how about integrating a function that provides a summary or overview of the visible image, or a specific section of it, at a single keypress?
Users could enter their own API key for a model of their choice (e.g., GPT-4, Gemini Pro, or Claude 3.5) and use the function from then on.
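For illustration, here is a minimal sketch of what such a keypress-triggered function could look like as a standalone script, assuming the OpenAI Python SDK and Pillow. The model name, prompt, and environment variable are assumptions made for the sketch, not anything NVDA or this proposal specifies:

```python
# Minimal sketch: capture the screen, send it to a multimodal model,
# return a short description. Assumes `pip install openai pillow`.
import base64
import io
import os

from openai import OpenAI
from PIL import ImageGrab  # screen capture on Windows/macOS


def describe_screen(region=None) -> str:
    """Capture the full screen (or a (left, top, right, bottom) region)
    and ask a multimodal model to describe it."""
    image = ImageGrab.grab(bbox=region)  # None = entire screen

    # Encode the capture as a base64 PNG data URL, the inline-image
    # format the chat completions API accepts.
    buffer = io.BytesIO()
    image.save(buffer, format="PNG")
    data_url = "data:image/png;base64," + base64.b64encode(
        buffer.getvalue()).decode("ascii")

    # The user supplies their own key, e.g. via an environment variable.
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any multimodal model would do
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Briefly describe this screenshot for a blind user."},
                {"type": "image_url", "image_url": {"url": data_url}},
            ],
        }],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    # In NVDA this would be bound to a gesture; here we just print.
    print(describe_screen())
```

Inside an actual NVDA add-on, the capture region would presumably come from the current navigator object and the result would be spoken rather than printed, but the round trip to the model would look much the same.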