Improve PDF file detection, fix description #93

Merged
merged 3 commits into from
Aug 7, 2024

peterekepeter
Contributor

@peterekepeter peterekepeter commented Jul 17, 2024

Hi!

There is at least one system out in the wild that produces PDF files that start with a CRLF.

I added it as an extra entry.

From my testing, though, a file can have any junk in front of it as long as the %PDF- string appears at some point, so a proper fix would be to search for that sequence of bytes rather than match at a fixed offset.
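A minimal sketch of that kind of search, to make the idea concrete. This is not how the library matches today; the function name `find_pdf_marker` and the `search_limit` parameter are hypothetical, and the 1024-byte default is an assumption based on how lenient some PDF readers are about leading bytes.

```python
from typing import Optional

# The PDF header marker; a version such as "1.7" normally follows it.
PDF_MARKER = b"%PDF-"


def find_pdf_marker(data: bytes, search_limit: int = 1024) -> Optional[int]:
    """Return the offset of the first %PDF- marker, or None if not found.

    Only the first `search_limit` bytes are searched (hypothetical default),
    and the whole marker must fit inside that window.
    """
    offset = data.find(PDF_MARKER, 0, search_limit)
    return offset if offset != -1 else None


if __name__ == "__main__":
    print(find_pdf_marker(b"%PDF-1.7\n"))      # marker at the very start
    print(find_pdf_marker(b"\r\n%PDF-1.4\n"))  # marker after a leading CRLF
    print(find_pdf_marker(b"PK\x03\x04"))      # no marker at all
```

Capping the search keeps the check cheap on large non-PDF files, at the cost of missing a marker that sits beyond the window.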

Anyways, stay safe out there!

@peterekepeter peterekepeter changed the title from "Detect PDF files that start with CRLF" to "Improve PDF file detection, fix description" Jul 17, 2024
@NebularNerd
Contributor

NebularNerd commented Jul 17, 2024

Part of the v2.0 plan is to add better/faster/more awesome ways to perform matching; my experimental PR #65 would help with these fringe issues.

I had never looked at a PDF header before. I notice it has a version in there as well, which is something to file away for the future for providing more details on matches (per #69).
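As a rough illustration of what reading that version could look like, here is a small sketch. It assumes the common header form `%PDF-M.m` (single-digit major and minor), and the helper name `pdf_version` is hypothetical, not part of the library.

```python
import re
from typing import Optional

# Matches the header marker followed by a single-digit major.minor version.
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version string from the first PDF header found, or None."""
    match = PDF_HEADER.search(data)
    if match is None:
        return None
    major, minor = match.groups()
    return f"{major.decode('ascii')}.{minor.decode('ascii')}"


if __name__ == "__main__":
    print(pdf_version(b"\r\n%PDF-1.7\n"))  # header preceded by a CRLF
    print(pdf_version(b"GIF89a"))          # not a PDF header
```

Because it uses `search` rather than `match`, the sketch also tolerates leading bytes before the header, such as the CRLF case from this PR.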

@peterekepeter: Please add Closes #94 to the top of your post so your issue automatically closes when the PR is merged.

@peterekepeter
Contributor Author

peterekepeter commented Jul 17, 2024

My PR does not close #94; it just covers more cases without rearchitecting anything.

I opened an issue separately because the PDF magic sequence can be at any offset inside the file... which is not something the library was planned to do at all.

@NebularNerd
Contributor

NebularNerd commented Jul 18, 2024

My bad, I skimmed the issue when I should have read it more carefully before suggesting the close.

PDFs are something I wanted to look at more later on, as I had a project where I needed to OCR them in bulk. Being able to decipher what flavor they are before carrying out work on them would help cut down unnecessary work.

Looking at Wikipedia: PDF and PDF FileTypes, there is a lot of detail we could look to extract in the future.

@cdgriffith
Owner

Thank you for the addition and fix, @peterekepeter!

@cdgriffith cdgriffith changed the base branch from master to develop August 7, 2024 20:50
@cdgriffith cdgriffith merged commit 85890e5 into cdgriffith:develop Aug 7, 2024
9 checks passed
@cdgriffith cdgriffith mentioned this pull request Aug 8, 2024
cdgriffith added a commit that referenced this pull request Aug 8, 2024
- Adding new verbose output to command line with `-v` or `--verbose`
- Adding #92 include py.typed in sdist (thanks to Nicholas Bollweg - bollwyvl)
- Adding #93 Improve PDF file detection, fix json description (thanks to Péter - peterekepeter)
- Fixing #96 #86 stream does not work properly on opened small files (thanks to Felipe Lema and Andy - NebularNerd)
- Removing expected invalid WinZip signature

---------

Co-authored-by: Nicholas Bollweg <[email protected]>
Co-authored-by: Péter <[email protected]>
Co-authored-by: Andy <[email protected]>
3 participants