[Feature request] how to batch many .png files? #14

Open
ccchan234 opened this issue Feb 2, 2023 · 8 comments

@ccchan234

Is your feature request related to a problem? Please describe.

I have tons of files, and right now text extraction (TE) has to be done one file at a time.

Describe the solution you'd like

Select several files, right-click, and choose "extract to separate files"; the text of each file is then extracted to its own file. (Some people may also want to extract everything into one single file, but in that case please add each filename to the combined document, thanks.)

Describe alternatives you've considered

The same feature in the form of a command.

Additional context

@ccchan234
Author

I have to say TE is very accurate for me, with screenshots taken of Pastest MCQ questions.

thanks

@danielo515

I also find it a bit confusing how to use this plugin.
I was expecting some command to scan all the images and generate cache from them, or, as this issue states, a whole folder. Is this even possible?

@scambier
Owner

scambier commented Jan 4, 2024

Text Extractor was first and foremost built as a sort of "plugin's plugin". The idea was to provide a few basic helper functions for developers to build or expand their own plugins on top of it. To my knowledge, though, it's not used by anything other than Omnisearch.
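For developers who want the batch behaviour requested in this issue, here is a minimal sketch of what building on such helper functions could look like. It assumes the plugin's API object offers `extractText` and `canFileBeExtracted` methods; the `ExtractorLike` interface, `extractAll`, and `combineWithFilenames` are hypothetical names introduced for illustration, and the API is injected as a parameter so the sketch runs without Obsidian. Check the plugin's README for the actual API before relying on any of this.

```typescript
// ASSUMPTION: shape of the API object exposed by an extraction plugin.
// The method names are illustrative and must be checked against the real plugin.
interface ExtractorLike<F extends { path: string }> {
  canFileBeExtracted(filePath: string): boolean;
  extractText(file: F): Promise<string>;
}

// One extracted file: where it came from and what text was found.
interface Extraction {
  path: string;
  text: string;
}

// Extract files one at a time, skipping anything the extractor does not support.
async function extractAll<F extends { path: string }>(
  api: ExtractorLike<F>,
  files: F[],
): Promise<Extraction[]> {
  const results: Extraction[] = [];
  for (const file of files) {
    if (!api.canFileBeExtracted(file.path)) continue; // unsupported file type
    results.push({ path: file.path, text: await api.extractText(file) });
  }
  return results;
}

// Merge all extractions into one document, labelling each part with its filename.
function combineWithFilenames(extractions: Extraction[]): string {
  return extractions
    .map((e) => `## ${e.path}\n\n${e.text.trim()}`)
    .join("\n\n");
}
```

Writing each `Extraction` to its own note gives the "separate files" variant; `combineWithFilenames` gives the single-document variant, with every filename kept as a heading.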

> I was expecting some command to scan all the images and generate cache from them

What is your use case?

@danielo515

danielo515 commented Jan 4, 2024 via email

@scambier
Owner

scambier commented Jan 4, 2024

OK, so you just need to enable image and PDF indexing in the Omnisearch settings on a desktop PC. Omnisearch will ask Text Extractor to get the text for all those files, and that will generate the cache 👍

@danielo515

danielo515 commented Jan 5, 2024 via email

@paulpall

paulpall commented Jun 8, 2024

> OK, so you just need to enable image and PDF indexing in the Omnisearch settings on a desktop PC. Omnisearch will ask Text Extractor to get the text for all those files, and that will generate the cache 👍

I'm not sure if I have missed anything, but I can't seem to get this to work with images either. PDF content seems to have been indexed, but with images I have to manually right-click and extract the text to the clipboard for each image before it shows up in search.

I had a look at the logs and there were a lot of `Text Extractor - OCR Worker timeout image_name eval @ plugin:text-extractor:5068` messages... I'm on an ARM macOS laptop, perhaps there's some conflict stemming from that?

Perhaps a workaround could be a button in the settings to ignore timeouts and have it index all the images automatically? Even if it does take hours, as long as there's a way to keep an eye on the progress, I wouldn't mind.

@scambier
Owner

@paulpall

> Perhaps a workaround could be a button in the settings to ignore timeouts and have it index all the images automatically? Even if it does take hours

That's already what happens when Omnisearch uses Text Extractor, as long as this is enabled:
[image: screenshot of the setting]

But if you have many images that cause a timeout (maybe they're particularly large or too complex for the OCR library), the worker is effectively blocked for 120 seconds on a single image, then blocked again on the next image, and so on.

Eventually it will go through all of them, though, as each image is only processed once, even when it times out.
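To put the timeout in perspective, here is a rough worst-case estimate of how long the worker stays blocked. It only assumes the 120-second per-image timeout mentioned above and that each timing-out image is attempted exactly once; the function names and the image count in the usage note are hypothetical.

```typescript
// Per-image timeout quoted in this thread, in seconds.
const OCR_TIMEOUT_SECONDS = 120;

// Worst-case time the worker spends blocked when `timedOutImages` images
// each run into the timeout once (no retries).
function worstCaseBlockedSeconds(
  timedOutImages: number,
  timeoutSeconds: number = OCR_TIMEOUT_SECONDS,
): number {
  return timedOutImages * timeoutSeconds;
}

// Render a duration in seconds as whole hours and minutes.
function formatHoursMinutes(totalSeconds: number): string {
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  return `${hours} h ${minutes} min`;
}
```

For example, `formatHoursMinutes(worstCaseBlockedSeconds(500))` gives `16 h 40 min`, so a vault with a few hundred problematic images can keep the worker busy for most of a day before indexing completes.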
