Improve PDF file detection, fix description #93
Part of the v2.0 plan is to find better, faster, more awesome ways to perform matching; my experimental PR #65 would help with these fringe issues. I had never looked at a PDF header before. I notice it has a version in there as well, which is something to file away for the future for providing more details on matches (per #69). @peterekepeter: Please add
My PR does not close #94; it just covers more cases without rearchitecting anything. I opened a separate issue because the PDF magic sequence can be at any offset inside the file, which is not something the library was planned to do at all.
My bad, I only skimmed the issue when I should have read it properly before suggesting the close. PDFs are something I wanted to look at more later on, as I had a project where I needed to OCR them in bulk; being able to decipher what flavor they are before carrying out work on them would help cut down unnecessary work. Looking at Wikipedia: PDF and PDF FileTypes, there is a lot of detail we could look to extract in the future.
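As a rough illustration of the "version in the header" idea discussed above, here is a minimal sketch that pulls the version number out of the first bytes of a file. The function name `pdf_version` is hypothetical and is not part of this library; it only reads the `%PDF-x.y` header, and a PDF's document catalog may declare a later version that this does not see.

```python
import re
from typing import Optional, Tuple

# Matches the "%PDF-<major>.<minor>" marker anywhere in the supplied bytes.
_PDF_VERSION = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(head: bytes) -> Optional[Tuple[int, int]]:
    """Hypothetical helper: return (major, minor) from a PDF header, or None."""
    match = _PDF_VERSION.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

Passing the first kilobyte or so of a file is enough for this check; anything without a `%PDF-x.y` marker in those bytes yields `None`.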
Thank you for the addition and fix, @peterekepeter!
- Adding new verbose output to command line with `-v` or `--verbose`
- Adding #92 include py.typed in sdist (thanks to Nicholas Bollweg - bollwyvl)
- Adding #93 Improve PDF file detection, fix json description (thanks to Péter - peterekepeter)
- Fixing #96 #86 stream does not work properly on opened small files (thanks to Felipe Lema and Andy - NebularNerd)
- Removing expected invalid WinZip signature

---------

Co-authored-by: Nicholas Bollweg <[email protected]>
Co-authored-by: Péter <[email protected]>
Co-authored-by: Andy <[email protected]>
Hi!
There is at least one system out in the wild that produces PDF files that start with a CRLF.
I added it as an extra entry.
Though from my testing, you can have any junk in front of the file as long as at some point you encounter the `%PDF-` string, so a proper fix would be to search for that sequence of bytes/characters rather than match it at a fixed offset.

Anyways, stay safe out there!
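The "look for the sequence" approach described above could be sketched roughly like this. It is not how the library currently matches signatures: `find_pdf_header` and the `max_scan` limit of 1024 bytes are assumptions made for illustration, the latter chosen only to keep the scan bounded.

```python
PDF_MAGIC = b"%PDF-"


def find_pdf_header(data: bytes, max_scan: int = 1024) -> int:
    """Hypothetical helper: offset of the first %PDF- marker, or -1.

    The marker must lie entirely within the first `max_scan` bytes, so
    leading junk (such as a CRLF) is tolerated but the scan stays bounded.
    """
    return data.find(PDF_MAGIC, 0, max_scan)


def looks_like_pdf(data: bytes) -> bool:
    """True when a %PDF- marker appears within the scanned prefix."""
    return find_pdf_header(data) != -1
```

A file starting with `\r\n%PDF-1.4` is reported at offset 2, while a plain `%PDF-1.7` file is reported at offset 0, so the existing offset-0 case and the CRLF case are both covered by the same search.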