-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix for Stream._group_rows #19
Conversation
@ollynowell Thanks for your pr, can you rebase with master? Is it possible to share the example file? |
414a7ff
to
2effecd
Compare
No unfortunately it's not possible, it is a client's file that I can't share. |
Would it be possible to anonymize the data in the file? |
I can look into it if there is a good reason to - what do you need it for? |
It is better to not merge improvements blindly. If there is a file to test again. We can make sure that in the future there are no changes to the code which might break your fix / use case. The result will be that we will gradually increase the accuracy of this tool. |
Sounds reasonable, I'll look into it |
@bosd Here is a minimal example! Without the fixWithout the fix in this PR, this is the behaviour:
The column ExplanationThe reason for this is the following:
With the fixWith the fix the file is read correctly:
The difference is that at step 8-9 above, having sorted Further WorkThis does seem like it is a genuine bugfix (so please let's merge it soon!!) - I can't see any reason why the code would be set up to either add inner columns or outer columns but not a combination. However it is still possible to construct examples that don't work, although it is somewhat unlikely since it requires the headers to not be aligned in addition to the column values. |
@ollynowell Many Thanks for your detailed write-up and examples. Ideally, the files would be used in unit tests, to keep track on the performance of this lib to prevent futer regressions. FWIW, I'm investigating / working on the network/hybrid parser #90 . |
@bosd thanks for your prompt reply - perhaps there is still hope for this library after all!
Done :)
So it's missing the To be honest I don't understand why the stream logic limits the columns that are added in this way, rather than just adding all of them. But I haven't tested and any further changes in that direction would need a lot of checking against files of varying complexity, and would have a lot more risk of breaking things... as you say benchmarking this library is a huge topic! |
Those diagrams are made on a 4year old fork. Will have to check later on, when my porting is done if E5 is correctly included. |
The last commits introduced some merge conflicts. Will look into it at a later time. |
9e37d85
to
ac35b67
Compare
…rouping algorithm, to avoid missing columns
…y appear in test_stream.py
ac35b67
to
ead70ac
Compare
Original PR in camelot-dev here
PDFMiner text objects should be sorted before the row grouping algorithm, otherwise items that belong on the same row will not be correctly grouped together.
In
_generate_columns_and_rows
this is done correctly the first time:Line 328:
t_bbox["horizontal"].sort(key=lambda x: (-x.y0, x.x0))
But
inner_text
is not sorted at any point after being extended withouter_text
, which means that the_group_rows
algorithm does not always work correctly - motivating example below:Motivating Example
My table looks like this:
Initial column detection identified just three columns (because those columns being longer pushed the mode to three)
inner_text
was then populated by this column:and subsequently extended with the
outer_text
from here:Because
inner_text
isn't sorted after being extended,_group_rows
first finds rows of length 1 from the oneinner_text
columns, and then finds longer rows from theouter_text
column.As a result the
inner_text
column isn't added.The sort in this PR fixes the problem and seems a reasonable place to apply it to me, but I am not familiar with this codebase - I've only debugged this one case.