unexpected space #352

Open
ymouhat opened this issue Jun 21, 2024 · 7 comments

Comments


ymouhat commented Jun 21, 2024

Hello,
We have identified the following behavior when parsing this PDF: BALGRP.pdf

  • The elements surrounded in green contain at most one space, even when I open the PDF for editing.

image

  • But when the PDF is parsed, additional spaces are detected:
    "Bank Account" has 2 spaces instead of one;
    "COMPANY" is split into "COM PANY", although there is no space between COM and PANY.

image

  • This causes failures when we compare the expected string with the string actually parsed, using the Cucumber library.
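
As a stopgap on the test side, the comparison could ignore whitespace entirely until the parser output is fixed. This is only a sketch: `stripWhitespace` and `sameIgnoringWhitespace` are hypothetical helper names, not part of pdf2json or Cucumber, and the approach will also hide genuine spacing regressions.

```javascript
// Hypothetical helpers, not part of pdf2json or Cucumber.
// Remove every whitespace character so that stray spaces
// introduced by the parser do not affect the comparison.
function stripWhitespace(text) {
  return text.replace(/\s+/g, "");
}

// True when both strings contain the same non-whitespace characters in order.
function sameIgnoringWhitespace(expected, actual) {
  return stripWhitespace(expected) === stripWhitespace(actual);
}

console.log(sameIgnoringWhitespace("COMPANY TOTAL", "COM PANY TOTAL")); // true
console.log(sameIgnoringWhitespace("Bank Account", "Bank  Account"));   // true
console.log(sameIgnoringWhitespace("Bank Account", "Bank Amount"));     // false
```

A step definition could call `sameIgnoringWhitespace(expected, parsed)` in place of a strict equality check.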

Could you please look into it?

Many thanks in advance

Owner

modesty commented Jun 22, 2024

The -m option turns on PROCESS_MERGE_BROKEN_TEXT_BLOCKS. Have you tried it?

Author

ymouhat commented Jun 25, 2024

Hi @modesty
Thank you, the developer will look into it.
We will let you know how it goes.


JordiSAGE commented Jun 25, 2024

Hello @modesty
I've tried adding the parameter programmatically both before and after creating the PDF object, but unfortunately it didn't work.

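    // Attempt 1: set the flag before constructing the parser
    process.env.PROCESS_MERGE_BROKEN_TEXT_BLOCKS = 'true';
    const pdfParser = new PDFParser(this, true);
    // Attempt 2: set the flag again after constructing the parser
    process.env.PROCESS_MERGE_BROKEN_TEXT_BLOCKS = 'true';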

'General balance (Provisional) 6/17/2024 Company : ATP2 ATP2 - ATP Samples Currency : USD Legislation : USA USA Balance to 12/31/2023 Txs on 1/1/2024 to 12/31/2024 Balance to 12/31/2024 Account no Account heading Debit Credit Debit Credit Debit Credit 10100 Bank Account 1,070.00 1,070.00 12100 Accounts Receivable 2,740.00 1,070.00 1,670.00 12400 Shipped Not Invoiced Clearing 600.00 600.00 17000 FA - Construction in Progress 1,000.00 1,000.00 20100 Accounts Payable 3,000.00 3,000.00 25100 Sales Tax Payable 140.00 140.00 4 1100 Sales Revenue 2,600.00 2,600.00 41900 Sales Revenue - Clearing 600.00 600.00 70900 Miscellaneous Expense 2,000.00 2,000.00 Balance total 4,810.00 4,810.00 Totals management 2,600.00 2,600.00 Off-balance-sheet total COM PANY TOTAL ATP2 ATP2 - ATP Samples 7,410.00 7,410.00 Page 1 of 1'

FYI @ymouhat


JordiSAGE commented Jun 26, 2024

Hello again @modesty
I've also tried using the getMergedTextBlocksIfNeeded method, but it no longer seems to be available on the PDFParser object.
image

FYI @ymouhat


JordiSAGE commented Jun 27, 2024

Hi @modesty
I've integrated the pdf2json source code into the project. It merges some blocks correctly but not others.
For instance, it removes the space before ATP2 in 'Company :ATP2' and the space in 'Account noAccount heading', which is supposed to have a large gap, but it handles 'COMPANY' correctly.
Maybe the space-distance threshold calculation in pdf2json's areAdjacentBlocks method is not working properly:

    t2.x - t1.x - t1.w < PDFFont.getSpaceThreshHold(t1);

General balance (Provisional) 6/17/2024 Company :ATP2 ATP2 - ATP Samples Currency :USD Legislation :USA USABalance to 12/31/2023 Txs on 1/1/2024 to 12/31/2024Balance to 12/31/2024 Account noAccount heading DebitCredit DebitCredit DebitCredit 10100Bank Account 1,070.00 1,070.00 12100Accounts Receivable 2,740.00 1,070.00 1,670.00 12400Shipped Not Invoiced Clearing 600.00 600.00 17000FA - Construction in Progress 1,000.00 1,000.00 20100Accounts Payable 3,000.00 3,000.00 25100Sales Tax Payable 140.00 140.00 41100 Sales Revenue 2,600.00 2,600.00 41900Sales Revenue - Clearing 600.00 600.00 70900Miscellaneous Expense 2,000.00 2,000.00 Balance total 4,810.00 4,810.00 Totals management 2,600.00 2,600.00 Off-balance-sheet total COMPANY TOTAL ATP2ATP2 - ATP Samples 7,410.00 7,410.00 Page 1 of 1
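
To illustrate the suspected cause, here is a minimal sketch of a gap-threshold merge. It is not pdf2json's implementation: the block shape `{x, w, text}`, the function name `joinRow`, the coordinates, and both threshold values are assumptions made up for the example. Only the gap expression `t2.x - t1.x - t1.w` is taken from the condition quoted above.

```javascript
// Illustrative only: block shape, coordinates and thresholds are assumed,
// not taken from pdf2json's data model.
function joinRow(blocks, threshold) {
  let text = "";
  let prev = null;
  for (const block of blocks) {
    if (prev !== null) {
      // Horizontal gap between the end of the previous block and the
      // start of this one (same expression as in the quoted condition).
      const gap = block.x - prev.x - prev.w;
      // Below the threshold the blocks are treated as one word;
      // otherwise a single space separates them.
      if (gap >= threshold) text += " ";
    }
    text += block.text;
    prev = block;
  }
  return text;
}

// A word broken into two fragments (tiny gap) and a label/value pair (real gap).
const brokenWord = [{ x: 0, w: 3.0, text: "COM" }, { x: 3.05, w: 4.0, text: "PANY" }];
const labelValue = [{ x: 0, w: 5.0, text: "Company :" }, { x: 5.4, w: 2.0, text: "ATP2" }];

console.log(joinRow(brokenWord, 0.5)); // COMPANY
console.log(joinRow(labelValue, 0.5)); // Company :ATP2   (threshold too large, space lost)
console.log(joinRow(brokenWord, 0.1)); // COMPANY
console.log(joinRow(labelValue, 0.1)); // Company : ATP2  (smaller threshold keeps the space)
```

With a threshold that is too generous, real word gaps are swallowed along with the broken fragments, which matches the pattern in the output above.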

image
image

FYI: @ymouhat


JordiSAGE commented Jul 19, 2024

Hello @modesty
Any news about this issue? Is there any pending change to be integrated that will fix this?
Thank you in advance.
FYI: @ymouhat

Owner

modesty commented Jul 28, 2024

I've been swamped with work and haven't had a chance to look into it, @JordiSAGE.
I'll leave the problem to the open community for now, and will try to review a PR when one is ready.
