Improve PDF file detection, fix description #93

peterekepeter · 2024-07-17T12:13:10Z

Hi!

There is at least one system out in the wild that produces PDF files that start with a CRLF.

I added it as an extra entry.

From my testing, though, you can have any junk at the front of the file as long as the `%PDF-` string appears at some point, so a proper fix would be to search for that sequence of bytes/characters.
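A rough sketch of what that "proper fix" could look like, for illustration only. The function names and the 1024-byte `search_limit` are assumptions made for this example, not part of the library or of this PR, which only adds a CRLF-prefixed entry.

```python
PDF_MARKER = b"%PDF-"


def find_pdf_marker(data: bytes, search_limit: int = 1024) -> int:
    """Return the offset of the first %PDF- marker, or -1 if not found.

    Only the first `search_limit` bytes are scanned; the limit is an
    assumed value chosen for this sketch.
    """
    return data[:search_limit].find(PDF_MARKER)


def looks_like_pdf(data: bytes, search_limit: int = 1024) -> bool:
    """True if the marker appears anywhere in the scanned prefix."""
    return find_pdf_marker(data, search_limit) != -1
```

With this approach a file starting with CRLF is just the special case where the marker sits at offset 2, e.g. `find_pdf_marker(b"\r\n%PDF-1.7\n")`.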

Anyways, stay safe out there!

NebularNerd · 2024-07-17T16:52:56Z

Part of the v2.0 plan is better/faster/more awesome ways to perform matching; my experimental PR #65 would help with these fringe issues.

I never looked at a PDF header before. I notice it has a version in there as well, which is something to file away for the future for providing more details on matches (per #69).
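As a hedged illustration of the version idea: the header has the form `%PDF-` followed by a `major.minor` version, so it could be pulled out with a small regular expression. The `pdf_version` helper and the 1024-byte scan window below are hypothetical and are not an existing library API.

```python
import re
from typing import Optional

# Matches headers such as %PDF-1.4 or %PDF-2.0 (single-digit major.minor).
PDF_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes) -> Optional[str]:
    """Return the version string from the PDF header, or None if absent.

    Scans only the first 1024 bytes; that window is an assumption
    made for this sketch.
    """
    match = PDF_HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return f"{match.group(1).decode()}.{match.group(2).decode()}"
```

Because it searches rather than anchoring at offset 0, the same helper would also handle a header preceded by a CRLF.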

@peterekepeter: Please add Closes #94 to the top of your post so your issue automatically closes when the PR is merged.

peterekepeter · 2024-07-17T21:11:54Z

My PR does not close #94; it just covers more cases without rearchitecting anything.

I opened an issue separately because the PDF magic sequence can be at any offset inside the file, which is not something the library was designed to handle at all.

NebularNerd · 2024-07-18T09:26:20Z

My bad, I only skimmed the issue when I should have read it more carefully before suggesting the close.

PDFs are something I wanted to look at more later on, as I had a project where I needed to OCR them in bulk. Being able to decipher what flavor they are before working on them would help cut down on unnecessary work.

Looking at Wikipedia: PDF and PDF FileTypes, there is a lot of detail we could look to extract in the future.

cdgriffith · 2024-08-07T20:49:49Z

Thank you for the addition and fix @peterekepeter !

- Adding new verbose output to command line with `-v` or `--verbose`
- Adding #92 include py.typed in sdist (thanks to Nicholas Bollweg - bollwyvl)
- Adding #93 Improve PDF file detection, fix json description (thanks to Péter - peterekepeter)
- Fixing #96 #86 stream does not work properly on opened small files (thanks to Felipe Lema and Andy - NebularNerd)
- Removing expected invalid WinZip signature

---------

Co-authored-by: Nicholas Bollweg <[email protected]>
Co-authored-by: Péter <[email protected]>
Co-authored-by: Andy <[email protected]>

peterekepeter added 3 commits July 17, 2024 14:03

Detect PDF files that start with CRLF

44cd38f

Add pdf to detect by extension

f672dc1

Fix description for JSON files

5b3b408

peterekepeter changed the title ~~Detect PDF files that start with CRLF~~ Improve PDF file detection, fix description Jul 17, 2024

cdgriffith changed the base branch from master to develop August 7, 2024 20:50

cdgriffith merged commit 85890e5 into cdgriffith:develop Aug 7, 2024
9 checks passed

cdgriffith mentioned this pull request Aug 8, 2024

Version 1.27 #98

Merged
