How to read multipage TIF file #50

Closed
funk03 opened this issue Nov 21, 2013 · 11 comments

@funk03

funk03 commented Nov 21, 2013

I have a TIF file with multiple pages. The function Pix.LoadFromFile(filename) appears to load only the first page.
Is there a way to load all the pages?
I would like to be able to read the entire document at once.

Thanks

@charlesw
Owner

Yes, reading multipage TIFFs is supported by Leptonica, the imaging library used by Tesseract; however, I haven't yet implemented support for this in the C# wrapper. The relevant function is `pixaReadMultipageTiff`, which returns a `PixA` structure. To implement this you'd need to do the following:

  • Add support for PixA (PixArray) and the relevant Load and ideally save functions.
  • Update your app to iterate through each Pix in the PixA instance and OCR it separately (tesseract's engine can only OCR one page at a time).

@benwalker14

This is something I could use as well. Is there any timetable on the implementation of reading multi-page tiffs?

@charlesw
Owner

Sorry, not right now. I might be able to find some time to look into this in a couple of weeks, but it's not a priority for me at the moment.

@charlesw charlesw added this to the 1.1 - Tesseract 3.03 support milestone Feb 20, 2014
@charlesw
Owner

FYI, thanks to amferguson we will now support multi-page tiffs in the upcoming 1.1 release (tesseract 3.03).

@charlesw charlesw closed this as completed Aug 2, 2014
@yoshidahiro

Is there any documentation or are there code samples for processing a multipage TIFF? I checked the existing code samples and it isn't mentioned there. I could really use some guidance.

@Sicos1977

I once wrote an engine to detect the orientation of a scanned image with the help of Tesseract. At the time I wrote it there was no multipage TIFF support, so I wrote some handy TIFF utilities. Maybe they are helpful to you. You can find them here:

https://github.com/Sicos1977/PageOrientationEngine/blob/master/PageOrientationEngine/Helpers/TiffUtils.cs

@charlesw
Owner

You'll want to use PixArray.LoadMultiPageTiffFromFile and then iterate and
process each pix (page) separately.

If you want, feel free to create a wiki article on this (I think everyone
has access to modify the wiki).

@RacerEvan55

RacerEvan55 commented Jul 26, 2016

Here is a simple implementation to OCR a multipage TIFF. I'd add it to the wiki, but it looks like I can't create a page.
Hopefully this helps someone searching here in the issues section.
By the way, is there any way to load a multipage TIFF from a byte array (like there is for loading a single page from a byte array)? I'd rather not have to write the TIFF out to the filesystem every time and then clean it up afterwards (it starts as a PDF).

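```csharp
using System.Text;
using Tesseract;

public static class TiffOcr
{
    // Wrapped in a method so the snippet compiles; filePath is the multipage TIFF to read.
    public static string OcrMultiPageTiff(string filePath)
    {
        StringBuilder OCRedText = new StringBuilder();

        using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "eng", EngineMode.Default))
        using (PixArray pages = PixArray.LoadMultiPageTiffFromFile(filePath))
        {
            // The engine can only OCR one page at a time, so process each Pix separately.
            foreach (Pix p in pages)
            {
                using (Tesseract.Page page = engine.Process(p))
                {
                    OCRedText.Append(page.GetText());
                }
            }
        }
        return OCRedText.ToString();
    }
}
```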

@charlesw
Owner

Thanks for the sample, and no, there currently isn't a way to load a PixArray
from an in-memory byte array.
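
Until that exists, the write-to-disk-and-clean-up step mentioned in the question can at least be kept in one place. The following is only a minimal sketch, not part of the wrapper: `TempFileHelper` and `WithTempFile` are hypothetical names made up for illustration, and the helper uses nothing beyond `System.IO`. It writes the bytes to a uniquely named temporary file, hands the path to a callback, and deletes the file afterwards even if the callback throws.

```csharp
using System;
using System.IO;

// Hypothetical helper (not part of the Tesseract wrapper): bridges an in-memory
// byte array to an API that only accepts a file path.
public static class TempFileHelper
{
    public static T WithTempFile<T>(byte[] data, string extension, Func<string, T> action)
    {
        // Build a unique path in the system temp directory, e.g. ...\<guid>.tif
        string path = Path.Combine(Path.GetTempPath(), Guid.NewGuid().ToString("N") + extension);
        try
        {
            File.WriteAllBytes(path, data);
            // Run the caller's work while the file exists.
            return action(path);
        }
        finally
        {
            // File.Delete does not throw if the file is already gone.
            File.Delete(path);
        }
    }
}
```

A call could then look like `string text = TempFileHelper.WithTempFile(tiffBytes, ".tif", path => OcrTiffFile(path));`, where `OcrTiffFile` stands for a hypothetical method of your own that calls `PixArray.LoadMultiPageTiffFromFile(path)` and runs the per-page loop from the sample in this thread. Doing all of the OCR inside the callback means the file is still present for as long as the `PixArray` is in use.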

@HK516

HK516 commented Jul 5, 2017

Charles, from the posts above it is clear that for a multipage TIFF file we can loop through the PixArray and call engine.Process() on each page.

Is it possible, in any of the newer versions of Tesseract, to process a whole multipage TIFF file in a single call? I am using Tesseract 3.0.2.0, and I can only see that engine.Process() has a few overloads that accept a Bitmap, a Pix, etc.

@charlesw
Owner

charlesw commented Jul 6, 2017

@HK516 No, you'll still need to process each page piecemeal, as detailed above. There haven't been any changes to the wrapper or to Tesseract underneath to allow processing a whole file in one go.
