Not extracting when PDF is large #1162

Open
victorcasignia opened this issue Sep 5, 2024 · 3 comments

Comments

@victorcasignia

Using Docker to run the service.

I ran it on a 21-page PDF. It only extracts up to page 9 and then jumps to the bibliography. How do I resolve this?

[Two screenshots attached]

@lfoppiano
Collaborator

Hi @victorcasignia,
Which document are you processing? Is it a scientific article?

Could you provide the document so I can run some tests?

@victorcasignia
Author

@lfoppiano Yes. I used this document https://arxiv.org/pdf/2307.01952

@lfoppiano
Collaborator

Hi @victorcasignia, I went through the document and the body seems to be processed correctly. You can double-check that the whole body is in the output. Even the section headings are numbered correctly.

Now, the issues are all in the Appendix, which is larger than the body of the article. The first part is handled correctly until "blue" (where the document ends). After that, the model decided that the remaining content is body text, so all the content after page 21 is actually appended after the "Future work" section.

On our end, we could flag this issue so that we can use the document as training data for the fulltext model, but that will be for version 0.8.2.
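
For anyone who wants to double-check the body as suggested above, here is a minimal sketch. It assumes the service returns TEI XML in the standard TEI namespace, with section titles stored as `<head>` elements inside `<div>` elements under `<body>`. The function name `list_body_headings` and the `SAMPLE` document are hypothetical and only serve as illustration. Listing the headings in order should make it easy to spot appendix content that ended up after "Future work".

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; assumed to be the one used by the output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def list_body_headings(tei_xml: str) -> list[str]:
    """Return the section headings found under <body>, in document order."""
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:text/tei:body", TEI_NS)
    if body is None:
        return []
    headings = []
    for head in body.iterfind(".//tei:div/tei:head", TEI_NS):
        number = head.get("n", "")  # section number, if the output provides one
        title = "".join(head.itertext()).strip()
        headings.append(f"{number} {title}".strip())
    return headings


# Hypothetical, hand-written sample; not actual output for the document above.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<div><head n="1">Introduction</head><p>...</p></div>
<div><head n="2">Future work</head><p>...</p></div>
<div><head>Misplaced appendix section</head><p>...</p></div>
</body><back><div><head>Appendix</head></div></back></text></TEI>"""

if __name__ == "__main__":
    for heading in list_body_headings(SAMPLE):
        print(heading)
```

Any unnumbered heading that shows up after the last numbered section of the article is a candidate for appendix text that was classified as body.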
