Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix for Stream._group_rows #19

Merged
merged 3 commits into from
Oct 6, 2024

Conversation

ollynowell
Copy link

Original PR in camelot-dev here

PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.

In _generate_columns_and_rows this is done correctly the first time:
Line 328: t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))

But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly - motivating example below:

Motivating Example
My table looks like this:
image

Initial column detection identified just three columns (because those columns being longer pushed the mode to three)
image

inner_text was then populated by this column:
image

and subsequently extended with the outer_text from here:
image

Because inner_text isn't sorted after being extended, _group_rows first finds rows of length 1 from the one inner_text columns, and then finds longer rows from the outer_text column.
As a result the inner_text column isn't added.

The sort in this PR fixes the problem and seems a reasonable place to apply it to me, but I am not familiar with this codebase - I've only debugged this one case.

@bosd
Copy link
Collaborator

bosd commented Aug 8, 2024

@ollynowell Thanks for your pr, can you rebase with master?

Is it possible to share the example file?

@bosd bosd added the bug Something isn't working label Aug 10, 2024
@bosd bosd force-pushed the bugfix-for-stream-grouprows branch from 414a7ff to 2effecd Compare August 13, 2024 06:06
@ollynowell
Copy link
Author

@ollynowell Thanks for your pr, can you rebase with master?

Is it possible to share the example file?

No unfortunately it's not possible, it is a client's file that I can't share.

@bosd bosd requested a review from foarsitter August 15, 2024 07:54
@bosd bosd mentioned this pull request Aug 28, 2024
25 tasks
@bosd
Copy link
Collaborator

bosd commented Aug 28, 2024

No unfortunately it's not possible, it is a client's file that I can't share.

Would it be possible to anonymize the data in the file?
Or recreate a file with a similar structure?

@ollynowell
Copy link
Author

No unfortunately it's not possible, it is a client's file that I can't share.

Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?

I can look into it if there is a good reason to - what do you need it for?

@bosd
Copy link
Collaborator

bosd commented Aug 28, 2024

I can look into it if there is a good reason to - what do you need it for?

It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.

The result will be that we will gradually increase the accuracy of this tool.

@ollynowell
Copy link
Author

I can look into it if there is a good reason to - what do you need it for?

It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case.

The result will be that we will gradually increase the accuracy of this tool.

Sounds reasonable, I'll look into it

@ollynowell
Copy link
Author

Bug Minimal Example.pdf

image

@bosd Here is a minimal example!

Without the fix

Without the fix in this PR, this is the behaviour:

tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3               4
0  Outer 1  Outer 2  Long 1  Long 2  Inner \nLong 3
1       A1       B1      C1      D1         E1 \nF1
2                        C2      D2              F2
3                        C3      D3              F3
4                        C4      D4              F4

The column Inner is combined with the column Long 3.

Explanation

The reason for this is the following:

  1. In stream._generate_columns_and_rows the first call to _group_rows finds 5 rows.
  2. The elements variable is populated with the row lengths: [5, 5, 3, 3, 3]
  3. Since 3 is the modal row length, ncols = max(set(elements), key=elements.count) gives 3 columns
  4. These 3 columns are Long 1, Long 2, Long 3
  5. inner_text is populated with the two rows from the column Inner
  6. outer_text is populated with the two rows from the columns 'Outer 1andOuter 2`
  7. inner_text + outer_text is then passed into _add_columns and subsequently _group_rows again, without being sorted
  8. Since it isn't sorted, the first two rows are each just one element long, consisting of just the Inner column
  9. Then two more rows are found each of length two, consisting of the two Outer columns
  10. _add_columns then selects just just the columns from the rows of length two, and therefore only adds the Outer columns and not the Inner column

With the fix

With the fix the file is read correctly:

tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3      4       5
0  Outer 1  Outer 2  Long 1  Long 2  Inner  Long 3
1       A1       B1      C1      D1     E1      F1
2                        C2      D2             F2
3                        C3      D3             F3
4                        C4      D4             F4

The difference is that at step 8-9 above, having sorted inner_text + outer_text, two rows are found of length 3, consisting of the columns Inner, Outer 1, Outer 2

Further Work

This does seem like it is a genuine bugfix (so please let's merge it soon!!) - I can't see any reason why the code would be set up to either add inner columns or outer columns but not a combination.

However it is still possible to construct examples that don't work, although it is somewhat unlikely since it requires the headers to not be aligned in addition to the column values.

e.g.
Further Work Example.pdf

image

@bosd
Copy link
Collaborator

bosd commented Sep 21, 2024

@ollynowell Many Thanks for your detailed write-up and examples.
The bugfix seems valid!

Ideally, the files would be used in unit tests, to keep track on the performance of this lib to prevent futer regressions.
But Benchmarking this lib is a huge topic, to be implemented.
Would you be able to include a test in this pr? 😊

FWIW, I'm investigating / working on the network/hybrid parser #90 .
I used the file Further Working Example, got a reasonable result with network parser.
Still not perfect..
image

@ollynowell
Copy link
Author

ollynowell commented Sep 23, 2024

@bosd thanks for your prompt reply - perhaps there is still hope for this library after all!

Would you be able to include a test in this pr? 😊

Done :)

FWIW, I'm investigating / working on the network/hybrid parser #90 .
I used the file Further Working Example, got a reasonable result with network parser.
Still not perfect..

So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams?

To be honest I don't understand why the stream logic limits the columns that are added in this way, rather than just adding all of them. But I haven't tested and any further changes in that direction would need a lot of checking against files of varying complexity, and would have a lot more risk of breaking things... as you say benchmarking this library is a huge topic!

@bosd
Copy link
Collaborator

bosd commented Sep 23, 2024

So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams?
Yes, that is the correct interpretation.

Those diagrams are made on a 4year old fork.
Which did'nt inculde some of the fixes on the main dev branch.
In the meantime, the histroy accross all those forks have diverged a lot.
I'm currently working on porting the changes over. Which involves a lot of manual editing.
But the new parsers look very promising.

Will have to check later on, when my porting is done if E5 is correctly included.

@bosd
Copy link
Collaborator

bosd commented Sep 25, 2024

The last commits introduced some merge conflicts. Will look into it at a later time.

@bosd bosd force-pushed the bugfix-for-stream-grouprows branch from 9e37d85 to ac35b67 Compare October 6, 2024 10:35
@bosd bosd force-pushed the bugfix-for-stream-grouprows branch from ac35b67 to ead70ac Compare October 6, 2024 10:40
@bosd bosd merged commit 713bfc0 into py-pdf:main Oct 6, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants