pdfindex

Builds a naive, compressed index over the given PDF files and searches it.

To get the raw text of a PDF, the file has to be parsed. Parsing is slow, so it makes sense to parse each PDF once, store the compressed text in an index, and search that index instead. That's where this tool steps in.

The index is saved at ~/.pdfindex and is compressed with zlib. A more advanced index structure is planned but not yet implemented. If you move or copy a PDF file, the tool recognizes it by its sha256 checksum and does not have to reparse it.
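As a rough illustration of the idea, here is a minimal sketch of storing and searching a zlib-compressed index. The on-disk layout (a JSON dictionary mapping each file path to an entry with a `text` field) and the function names `save_index`, `load_index` and `search` are assumptions made for this example; the format pdfindex.py actually uses may differ.

```python
import json
import os
import zlib


def save_index(index, path):
    """Serialize the index as JSON and write it zlib-compressed to `path`."""
    raw = json.dumps(index).encode("utf-8")
    with open(path, "wb") as handle:
        handle.write(zlib.compress(raw))


def load_index(path):
    """Read a compressed index back; return an empty index if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as handle:
        return json.loads(zlib.decompress(handle.read()).decode("utf-8"))


def search(index, query):
    """Return the sorted names of all documents whose text contains `query`."""
    needle = query.lower()
    return sorted(name for name, entry in index.items()
                  if needle in entry["text"].lower())
```

With this layout, a search only has to decompress one file and scan plain strings; no PDF is parsed at query time.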

German texts often contain umlauts, which tend to come out garbled in the extracted text. We try to fix them with some simple replacements.
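The exact replacements are not listed here, so the following is only a sketch of what such a fix could look like. It assumes the common case where an umlaut is extracted as a standalone diaeresis (`¨`) followed by the base vowel; the table `UMLAUT_REPLACEMENTS` and the function `fix_umlauts` are hypothetical names for this example.

```python
# Assumed replacement table: a standalone diaeresis (U+00A8) followed by a
# vowel is collapsed into the matching precomposed umlaut character.
UMLAUT_REPLACEMENTS = [
    (u"\u00a8a", u"\u00e4"),  # ¨a -> ä
    (u"\u00a8o", u"\u00f6"),  # ¨o -> ö
    (u"\u00a8u", u"\u00fc"),  # ¨u -> ü
    (u"\u00a8A", u"\u00c4"),  # ¨A -> Ä
    (u"\u00a8O", u"\u00d6"),  # ¨O -> Ö
    (u"\u00a8U", u"\u00dc"),  # ¨U -> Ü
]


def fix_umlauts(text):
    """Apply every replacement in turn and return the repaired text."""
    for broken, fixed in UMLAUT_REPLACEMENTS:
        text = text.replace(broken, fixed)
    return text
```

Applying the fix once at indexing time means that a query typed with real umlauts can match the stored text.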

A quick comparison between pdfgrep and pdfindex:

| | pdfgrep | pdfindex |
| --- | --- | --- |
| index format | multiple files named by sha1 of the file | one file |
| file recognition | sha1 | filename, modification time, sha256 |
| index compression | No | Yes, zlib |
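The file recognition row can be read as a two-step lookup: a cheap check on filename and modification time first, then a sha256 comparison to catch moved or copied files. The sketch below illustrates that reading. The entry fields (`mtime`, `sha256`, `text`) and the helper names `sha256_of` and `find_cached_text` are assumptions for this example, not the tool's actual code.

```python
import hashlib
import os


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_cached_text(index, path):
    """Return the cached text for `path`, or None if it must be parsed."""
    entry = index.get(path)
    # Cheap check: same filename and unchanged modification time.
    if entry is not None and entry["mtime"] == os.path.getmtime(path):
        return entry["text"]
    # Fallback: the file may have been moved or copied, so compare content.
    checksum = sha256_of(path)
    for other in index.values():
        if other["sha256"] == checksum:
            return other["text"]
    return None
```

Hashing is only needed when the cheap check fails, so unchanged files in their usual place cost a single `stat` call.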

Requirements

To parse the PDF files we use pdftotext, which is part of the poppler package. The tool is written for Python 2.7, so python2 has to be installed as well.
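For orientation, this is roughly how pdftotext can be called from Python. It is a sketch, not the code in pdfindex.py: the function names `pdftotext_command` and `extract_text` are made up for this example, and the code is written so that it also runs on Python 3.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; "-" as output file means standard output."""
    return ["pdftotext", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text as a unicode string.

    Raises subprocess.CalledProcessError if pdftotext exits with an error
    and OSError if the pdftotext binary is not installed.
    """
    output = subprocess.check_output(pdftotext_command(pdf_path))
    # pdftotext writes UTF-8 by default; undecodable bytes are replaced.
    return output.decode("utf-8", "replace")
```

If `extract_text` raises `OSError`, the most likely cause is that poppler is not installed or pdftotext is not on the `PATH`.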

Usage

./pdfindex.py query [directory|file]

If you want to search for test in the current directory, just run:

./pdfindex.py test
