Bugfix for Stream._group_rows #19

ollynowell · 2024-06-13T15:38:18Z

Original PR in camelot-dev here

PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.

In _generate_columns_and_rows this is done correctly the first time:
Line 328: t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))

But inner_text is not sorted at any point after being extended with outer_text, which means that the _group_rows algorithm does not always work correctly - motivating example below:

Motivating Example
My table looks like this:

Initial column detection identified just three columns, because those columns are longer and so pushed the modal row length to three.

inner_text was then populated by this column:

and subsequently extended with the outer_text from here:

Because inner_text isn't sorted after being extended, _group_rows first finds rows of length 1 from the one inner_text column, and then finds longer rows from the outer_text column.
As a result the inner_text column isn't added.

The sort in this PR fixes the problem and seems a reasonable place to apply it to me, but I am not familiar with this codebase - I've only debugged this one case.
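To illustrate why extending an already sorted list breaks the ordering, here is a minimal sketch. The `Text` tuple and its coordinates are hypothetical stand-ins, not PDFMiner's real classes; only the sort key is taken from line 328 quoted above.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object: only the fields used here.
Text = namedtuple("Text", ["label", "x0", "y0"])


def reading_order(t):
    # Same key as on line 328: larger y0 (higher on the page) first,
    # then left to right within the same height.
    return (-t.y0, t.x0)


# Made-up coordinates: one inner column at x=400, one outer column at x=0.
inner_text = sorted([Text("E1", 400, 90), Text("Inner", 400, 100)], key=reading_order)
outer_text = sorted([Text("A1", 0, 90), Text("Outer 1", 0, 100)], key=reading_order)

# Each half is sorted on its own, but the concatenation is not.
combined = inner_text + outer_text
labels_unsorted = [t.label for t in combined]
labels_sorted = [t.label for t in sorted(combined, key=reading_order)]

print(labels_unsorted)  # ['Inner', 'E1', 'Outer 1', 'A1']
print(labels_sorted)    # ['Outer 1', 'Inner', 'A1', 'E1']
```

In the unsorted list the two items at y0=100 are separated by an item at y0=90, so any grouping that walks the list in order cannot put them on the same row.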

bosd · 2024-08-08T20:30:03Z

@ollynowell Thanks for your pr, can you rebase with master?

Is it possible to share the example file?

ollynowell · 2024-08-15T06:31:27Z

> @ollynowell Thanks for your pr, can you rebase with master?
>
> Is it possible to share the example file?

No unfortunately it's not possible, it is a client's file that I can't share.

bosd · 2024-08-28T13:17:26Z

> No unfortunately it's not possible, it is a client's file that I can't share.

Would it be possible to anonymize the data in the file?
Or recreate a file with a similar structure?

ollynowell · 2024-08-28T13:32:01Z

> > No unfortunately it's not possible, it is a client's file that I can't share.
>
> Would it be possible to anonymize the data in the file? Or recreate a file with a similar structure?

I can look into it if there is a good reason to - what do you need it for?

bosd · 2024-08-28T14:43:56Z

> I can look into it if there is a good reason to - what do you need it for?

It is better not to merge improvements blindly. If there is a file to test against, we can make sure that future changes to the code do not break your fix / use case.

The result will be that we will gradually increase the accuracy of this tool.

ollynowell · 2024-08-28T15:59:16Z

> > I can look into it if there is a good reason to - what do you need it for?
>
> It is better not to merge improvements blindly. If there is a file to test against, we can make sure that future changes to the code do not break your fix / use case.
>
> The result will be that we will gradually increase the accuracy of this tool.

Sounds reasonable, I'll look into it

ollynowell · 2024-09-20T15:09:17Z

Bug Minimal Example.pdf

@bosd Here is a minimal example!

Without the fix

Without the fix in this PR, this is the behaviour:

import camelot
tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3               4
0  Outer 1  Outer 2  Long 1  Long 2  Inner \nLong 3
1       A1       B1      C1      D1         E1 \nF1
2                        C2      D2              F2
3                        C3      D3              F3
4                        C4      D4              F4

The column Inner is combined with the column Long 3.

Explanation

The reason for this is the following:

1. In stream._generate_columns_and_rows the first call to _group_rows finds 5 rows.
2. The elements variable is populated with the row lengths: [5, 5, 3, 3, 3]
3. Since 3 is the modal row length, ncols = max(set(elements), key=elements.count) gives 3 columns
4. These 3 columns are Long 1, Long 2, Long 3
5. inner_text is populated with the two rows from the column Inner
6. outer_text is populated with the two rows from the columns Outer 1 and Outer 2
7. inner_text + outer_text is then passed into _add_columns and subsequently _group_rows again, without being sorted
8. Since it isn't sorted, the first two rows are each just one element long, consisting of just the Inner column
9. Then two more rows are found each of length two, consisting of the two Outer columns
10. _add_columns then selects just the columns from the rows of length two, and therefore only adds the Outer columns and not the Inner column
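The walkthrough above can be reproduced with a small simulation. This is only a sketch: `group_rows_sketch` is a simplified, hypothetical stand-in for the grouping step (not camelot's actual `_group_rows`), the coordinates are made up, and keeping only the longest rows is an assumption that mirrors the selection described in the last step.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object.
Text = namedtuple("Text", ["label", "x0", "y0"])


def group_rows_sketch(items, row_tol=2):
    """Walk the items in the given order and start a new row whenever
    y0 moves more than row_tol away from the current row's y0."""
    rows, row_y = [], None
    for item in items:
        if row_y is None or abs(item.y0 - row_y) > row_tol:
            rows.append([])
            row_y = item.y0
        rows[-1].append(item)
    # Within a row, order the items left to right and keep only the labels.
    return [[t.label for t in sorted(row, key=lambda t: t.x0)] for row in rows]


# The modal row length decides the initial column count.
elements = [5, 5, 3, 3, 3]
ncols = max(set(elements), key=elements.count)
print(ncols)  # 3

# Made-up coordinates for the leftover text: Inner at x=400, Outer at x=0 and x=100.
inner_text = [Text("Inner", 400, 100), Text("E1", 400, 90)]
outer_text = [
    Text("Outer 1", 0, 100), Text("Outer 2", 100, 100),
    Text("A1", 0, 90), Text("B1", 100, 90),
]

# Grouping the unsorted concatenation splits each visual row in two.
rows = group_rows_sketch(inner_text + outer_text)
print(rows)  # [['Inner'], ['E1'], ['Outer 1', 'Outer 2'], ['A1', 'B1']]

# Assumption: only the longest rows contribute new columns.
longest = max(len(r) for r in rows)
kept = [r for r in rows if len(r) == longest]
print(kept)  # [['Outer 1', 'Outer 2'], ['A1', 'B1']]
```

The Inner items never appear in `kept`, which matches the missing column in the output shown earlier.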

With the fix

With the fix the file is read correctly:

import camelot
tables = camelot.read_pdf("Bug Minimal Example.pdf", pages="all", flavor="stream")

>>> print(tables[0].df)
         0        1       2       3      4       5
0  Outer 1  Outer 2  Long 1  Long 2  Inner  Long 3
1       A1       B1      C1      D1     E1      F1
2                        C2      D2             F2
3                        C3      D3             F3
4                        C4      D4             F4

The difference is at steps 8-9 above: once inner_text + outer_text is sorted, two rows of length 3 are found, consisting of the columns Inner, Outer 1 and Outer 2.
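A minimal sketch of the sorted case follows. It uses made-up coordinates and a simplified, hypothetical grouping helper rather than camelot's actual `_group_rows`, and it assumes the fix reuses the same `(-y0, x0)` key as the existing sort; the exact key used in the PR is not shown in this thread.

```python
from collections import namedtuple

# Hypothetical stand-in for a PDFMiner text object.
Text = namedtuple("Text", ["label", "x0", "y0"])


def group_rows_sketch(items, row_tol=2):
    """Simplified grouping: a new row starts whenever y0 moves more than
    row_tol away from the y0 of the current row."""
    rows, row_y = [], None
    for item in items:
        if row_y is None or abs(item.y0 - row_y) > row_tol:
            rows.append([])
            row_y = item.y0
        rows[-1].append(item)
    return [[t.label for t in sorted(row, key=lambda t: t.x0)] for row in rows]


inner_text = [Text("Inner", 400, 100), Text("E1", 400, 90)]
outer_text = [
    Text("Outer 1", 0, 100), Text("Outer 2", 100, 100),
    Text("A1", 0, 90), Text("B1", 100, 90),
]

# Assumed fix: sort top to bottom, then left to right, before grouping.
combined = sorted(inner_text + outer_text, key=lambda t: (-t.y0, t.x0))
rows_fixed = group_rows_sketch(combined)
print(rows_fixed)  # [['Outer 1', 'Outer 2', 'Inner'], ['A1', 'B1', 'E1']]
```

Both rows now have length 3, so a selection based on row length keeps the Inner column together with the two Outer columns.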

Further Work

This does seem like it is a genuine bugfix (so please let's merge it soon!!) - I can't see any reason why the code would be set up to either add inner columns or outer columns but not a combination.

However it is still possible to construct examples that don't work, although it is somewhat unlikely since it requires the headers to not be aligned in addition to the column values.

e.g.
Further Work Example.pdf

bosd · 2024-09-21T08:50:19Z

@ollynowell Many Thanks for your detailed write-up and examples.
The bugfix seems valid!

Ideally, the files would be used in unit tests, to keep track of the performance of this lib and to prevent future regressions.
But benchmarking this lib is a huge topic, still to be implemented.
Would you be able to include a test in this pr? 😊

FWIW, I'm investigating / working on the network/hybrid parser #90.
I used the file Further Work Example and got a reasonable result with the network parser.
Still not perfect..

ollynowell · 2024-09-23T12:40:27Z

@bosd thanks for your prompt reply - perhaps there is still hope for this library after all!

> Would you be able to include a test in this pr? 😊

Done :)

> FWIW, I'm investigating / working on the network/hybrid parser #90.
> I used the file Further Work Example and got a reasonable result with the network parser.
> Still not perfect..

So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams?

To be honest I don't understand why the stream logic limits the columns that are added in this way, rather than just adding all of them. But I haven't tested and any further changes in that direction would need a lot of checking against files of varying complexity, and would have a lot more risk of breaking things... as you say benchmarking this library is a huge topic!

bosd · 2024-09-23T14:38:49Z

> So it's missing the E5 value at the bottom? Is that a correct interpretation of those diagrams?

Yes, that is the correct interpretation.

Those diagrams were made on a 4-year-old fork,
which didn't include some of the fixes on the main dev branch.
In the meantime, the history across all those forks has diverged a lot.
I'm currently working on porting the changes over, which involves a lot of manual editing.
But the new parsers look very promising.

I will have to check later on, when my porting is done, whether E5 is correctly included.

bosd · 2024-09-25T10:43:58Z

The last commits introduced some merge conflicts. Will look into it at a later time.

bosd added the bug label Aug 10, 2024

bosd force-pushed the bugfix-for-stream-grouprows branch from 414a7ff to 2effecd Compare August 13, 2024 06:06

bosd requested a review from foarsitter August 15, 2024 07:54

bosd mentioned this pull request Aug 28, 2024 in Roadmap #91

bosd force-pushed the bugfix-for-stream-grouprows branch from 9e37d85 to ac35b67 Compare October 6, 2024 10:35

ollynowell added 3 commits October 6, 2024 12:40

e5a01dd Sort the PDFMiner text objects along the x axis before applying the grouping algorithm, to avoid missing columns

1678047 Order stream data variables in tests/data.py in the same order as they appear in test_stream.py

ead70ac Add test case for the bugfix

bosd force-pushed the bugfix-for-stream-grouprows branch from ac35b67 to ead70ac Compare October 6, 2024 10:40

bosd merged commit 713bfc0 into py-pdf:main Oct 6, 2024
12 checks passed
