Several PDF analyses have already been published; I gathered many of them here, along with additional tips and tools.
https://www.adobe.com/devnet/pdf/pdf_reference.html
https://blog.didierstevens.com/2008/04/09/quickpost-about-the-physical-and-logical-structure-of-pdf-files/
https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF
http://blog.didierstevens.com/programs/pdf-tools/
https://github.com/sans-dfir/sift-files/tree/master/pdf-tools
$ file file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf
$ python pdfid.py -aefv file.pdf
Search for /OpenAction, /AA, /Launch, /GoTo, /GoToR, /SubmitForm, /RichMedia (for Flash), /JS, /JavaScript and /URI, and look for signs of encoding, encryption, shellcode or obfuscation.
Automatically with ParanoiDF
$ python paranoiDF.py -fl file.pdf
Or with pdf-parser
$ python pdf-parser.py -v file.pdf
With a hexadecimal editor
$ bless file.pdf
Use pdf-parser to extract a JavaScript object, for example
$ pdf-parser --object 32 --raw file.pdf > extractedObject.js
pdfextract from Origami
$ pdfextract file.pdf
Be careful not to leak any sensitive, professional or personal data, or to expose your research.
https://www.hybrid-analysis.com/
$ file file.pdf
$ pdfinfo file.pdf
$ pdfinfo -box -meta -js -rawdates file.pdf
$ pyew file.pdf
$ peepdf -fl file.pdf
$ peepdf --interactive file.pdf
PDF Stream Dumper
https://github.com/dzzie/pdfstreamdumper
Get metadata
$ exiftool -a -u -g2 file.pdf
Get metadata recursively from the current directory
$ exiftool -r -ext pdf .
Change an element
$ exiftool -Title="New title" file.pdf
Remove metadata
$ exiftool -all= file.pdf && exiftool -all:all= file.pdf && qpdf --linearize file.pdf filewithoutmeta.pdf
$ mat file.pdf # latest version of mat doesn't support pdf format anymore...
Remove metadata recursively from the current directory (quick and dirty, but it works; variables are quoted so that file names containing spaces are handled)
$ find . -name "*.pdf" -print0 | while IFS= read -r -d '' file; do echo "$file" && mv "$file" "$file.pdf" && exiftool -all= "$file.pdf" && exiftool -all:all= "$file.pdf" && qpdf --linearize "$file.pdf" "$file" && rm "$file.pdf" "$file.pdf_original"; done
Search for older "hidden" versions
$ pdfresurrect file.pdf -i
$ exiftool -pdf-update:all= file.pdf
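Older versions can survive in a PDF because an incremental update appends new objects and a new `%%EOF` marker instead of rewriting the file. The following is a minimal Python sketch of that idea, not how pdfresurrect itself works; the function names `revision_offsets` and `extract_revision` are hypothetical. Some files (linearized ones, for example) may contain more than one marker without any update, so treat the count as a hint only.

```python
import re

_EOF = re.compile(rb"%%EOF")

def revision_offsets(data: bytes) -> list:
    """Return the offset just past every %%EOF marker in the raw bytes."""
    return [m.end() for m in _EOF.finditer(data)]

def extract_revision(data: bytes, index: int) -> bytes:
    """Return the file as it ended at the index-th %%EOF marker (0-based)."""
    return data[:revision_offsets(data)[index]]
```

Read the file with `open("file.pdf", "rb").read()`, then write `extract_revision(data, 0)` to a new file to inspect what is probably the earliest saved version.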
Name | URL |
---|---|
Malwr | https://malwr.com/submission/ |
Hybrid analysis | https://www.hybrid-analysis.com/ |
Malware Tracker | https://www.malwaretracker.com/pdf.php |
VirusTotal | http://www.virustotal.com/ |
PDF examiner | http://www.pdfexaminer.com/ |
Document Analyzer | http://www.document-analyzer.net/ |
Jotti | https://virusscan.jotti.org/ |
PDF X-ray | http://www.pdfxray.com/ |
PDF Online | https://www.pdf-online.com/ |
Extract PDF | http://www.extractpdf.com |
Char conversion | https://kt.pe/tools.html#conv/ |
Compute byte statistics (minimum and maximum entropy, ASCII count, ...) for a PDF
$ python byte-stats.py file.pdf
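To make the entropy figures easier to interpret, here is a minimal Python sketch of Shannon entropy over bytes. It only illustrates the idea and is not the implementation used by byte-stats.py; the function names and the 256-byte window size are assumptions. Values close to 8 bits per byte usually point to compressed or encrypted data, values near 0 to repetitive data.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def window_entropies(data: bytes, size: int = 256) -> list:
    """Entropy of each consecutive, non-overlapping window of `size` bytes."""
    return [shannon_entropy(data[i:i + size]) for i in range(0, len(data), size)]
```

Calling `min()` and `max()` on the result of `window_entropies(open("file.pdf", "rb").read())` gives a rough equivalent of the minimum and maximum entropy.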
Visual analysis of a PDF or a binary file
http://binvis.io
$ python pdfid.py --all --extra --force --verbose file.pdf
$ pdf-parser file.pdf | ./pdfobjflow
$ eog pdfobjflow.png
Search for:
/OpenAction and /AA specify the script or action to run automatically.
/Names, /AcroForm and /Action can also specify and launch scripts or actions.
/JavaScript specifies JavaScript to run.
/GoTo changes the view to a specified destination within the PDF or in another PDF file.
/Launch launches a program or opens a document.
/URI accesses a resource by its URL.
/SubmitForm and /GoToR can send data to a URL.
/RichMedia can be used to embed Flash in a PDF.
/ObjStm can hide objects inside an object stream.
/JavaScript can be written as /J#61vaScript: beware of obfuscation with hex codes in names.
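The hex-code trick works because a PDF name may encode any character as `#` followed by two hexadecimal digits. The sketch below, in Python, decodes those escapes before counting the keywords listed above. It is a simplified illustration, not a replacement for pdfid; `SUSPICIOUS`, `normalize_name` and `count_keywords` are hypothetical names, and the scan only sees raw bytes, so names hidden inside compressed object streams are missed.

```python
import re

SUSPICIOUS = [b"/OpenAction", b"/AA", b"/JavaScript", b"/JS", b"/Launch",
              b"/URI", b"/SubmitForm", b"/GoToR", b"/RichMedia", b"/ObjStm"]

# A name starts with "/" and runs until whitespace or a PDF delimiter.
_NAME = re.compile(rb"/[^\x00\t\n\x0c\r ()<>\[\]{}/%]+")
_HEX = re.compile(rb"#([0-9A-Fa-f]{2})")

def normalize_name(name: bytes) -> bytes:
    """Replace every #xx escape in a PDF name with the byte it encodes."""
    return _HEX.sub(lambda m: bytes([int(m.group(1), 16)]), name)

def count_keywords(data: bytes) -> dict:
    """Count suspicious names in raw PDF bytes after decoding #xx escapes."""
    counts = {kw.decode(): 0 for kw in SUSPICIOUS}
    for match in _NAME.finditer(data):
        key = normalize_name(match.group(0)).decode("latin-1")
        if key in counts:
            counts[key] += 1
    return counts
```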
With ParanoiDF
$ python paranoiDF.py -fl file.pdf
With pdf-parser
$ python pdf-parser.py -v file.pdf
With a hexadecimal editor
$ bless file.pdf
With dumppdf
$ dumppdf -a file.pdf
Search for compression
$ strings file.pdf | grep --color "/Filter"
2 ways to decompress a PDF
$ pdftk compressed.pdf output uncompressed.pdf uncompress
$ qpdf --stream-data=uncompress compressed.pdf uncompressed.pdf
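When neither tool is available, streams compressed with /FlateDecode can be inflated by hand, since that filter is plain zlib. The Python sketch below is a rough illustration under several assumptions: it ignores the object dictionaries, tries zlib on every `stream ... endstream` span, and silently skips anything that does not inflate (other filters, chained filters, predictors, encrypted files). The name `inflate_streams` is hypothetical.

```python
import re
import zlib

# Capture everything between the "stream" keyword and "endstream".
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)

def inflate_streams(data: bytes) -> list:
    """Return the inflated content of every stream that is valid zlib data."""
    results = []
    for match in _STREAM.finditer(data):
        inflater = zlib.decompressobj()
        try:
            # Trailing end-of-line bytes end up in inflater.unused_data.
            content = inflater.decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, or damaged
        if inflater.eof:  # keep only streams that inflated completely
            results.append(content)
    return results
```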
4 ways to search for embedded files/scripts inside a PDF
$ binwalk file.pdf
$ foremost -a -v file.pdf
$ hachoir-subfile file.pdf
$ scalpel file.pdf
Extract the file corresponding to an object ID, a JPG for example
$ dumppdf.py -i 32 -r file.pdf > image.jpg
Extract JavaScript from an object, for example
$ pdf-parser --object 32 --raw file.pdf > extractedObject.js
pdfextract from Origami
$ pdfextract file.pdf
PDF to Postscript
$ pdftops file.pdf
PDF to TXT
$ pdftotext file.pdf
PDF to JPG
$ convert file.pdf image.jpg
Non-exhaustive list of possible conversions
Convert a PDF to Postscript without the LZWDecode filter
$ qpdf --stream-data=uncompress original.pdf decoded.pdf # Decompress it
$ pdftops decoded.pdf decoded.ps # Convert it
PDF supports RC4 encryption (40- to 128-bit keys) and AES encryption (128-bit keys, and 256-bit keys with Adobe Extension Level 3).
Beware of empty passwords.
Brute force a PDF with pdfcrack
$ pdfcrack -w yourDictionary.txt file.pdf
With john
$ pdf2john.py file.pdf > x.hash
$ john --wordlist=yourDictionary.txt x.hash
2 ways to search for JavaScript
$ pdf-parser --search=JavaScript file.pdf
$ pdfinfo -js file.pdf
Extract an object with jsunpack
$ jsunpack-extractjs file.pdf
With pdf-parser
$ pdf-parser --object 32 --raw file.pdf > file.js
With pdfextract from Origami
$ pdfextract --js file.pdf
https://github.com/urule99/jsunpack-n
Online :
http://jsunpack.jeek.org/java/
Malzilla and SpiderMonkey can also help deobfuscate JavaScript.
Malzilla :
http://www.malzilla.org/downloads.html
SpiderMonkey :
http://www.didierstevens.com/files/software/js-1.7.0-mod.tar.gz
More details coming soon.
https://didierstevens.com/files/software/make-pdf_V0_1_6.zip
https://neonprimetime.blogspot.fr/2015/03/how-to-add-javascript-to-pdf.html
$ python pdfid.py --disarm file.pdf
Search for Flash
$ python pdf-parser.py --search flash file.pdf
Extract Flash with swf_mastah
$ python swf_mastah.py -f file.pdf -o ./
$ file *.swf
With pdf-parser
$ pdf-parser.py --object 32 --filter --raw file.pdf > flashFile.swf
$ file flashFile.swf
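The `file` checks above can identify an extracted SWF by its header. An SWF file starts with a 3-byte signature (`FWS`, `CWS` or `ZWS`), followed by a 1-byte version and a 4-byte little-endian length. The Python sketch below reads those 8 bytes; the function name `swf_header` and the returned dictionary layout are assumptions made for illustration.

```python
import struct

SWF_SIGNATURES = {
    b"FWS": "uncompressed",
    b"CWS": "zlib-compressed",
    b"ZWS": "LZMA-compressed",
}

def swf_header(data: bytes):
    """Return signature details for SWF data, or None if it is not SWF."""
    if len(data) < 8 or data[:3] not in SWF_SIGNATURES:
        return None
    (length,) = struct.unpack("<I", data[4:8])  # length of the uncompressed file
    return {
        "compression": SWF_SIGNATURES[data[:3]],
        "version": data[3],
        "length": length,
    }
```

If the function returns `None` for an extracted object, the stream was probably not decoded (try `--filter`) or the object is not Flash.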
Analyse the Flash program
$ swfdump -Ddu flashFile.swf > flashFile.txt
More details coming soon.
https://blog.didierstevens.com/category/pdf/
http://www.decalage.info/file_formats_security/pdf
https://zeltser.com/analyzing-malicious-documents/
https://code.google.com/archive/p/corkami/wikis/PDFTricks.wiki
https://www.sans.org/reading-room/whitepapers/malicious/owned-malicious-pdf-analysis-33443
https://digital-forensics.sans.org/blog/2009/12/14/pdf-malware-analysis/
http://fileformats.archiveteam.org/wiki/PDF