PDF Data Checker

This project is a data extractor and checker for PDFs, validating the extracted data against a local database (an Excel workbook in this case).


ExtractPDF

Trying to Extract Data from not-so-perfect PDFs

About

A little project dealing with information retrieval from PDF documents.

For the time being I am only dealing with a single PDF at a time, with the goal of eventually using this model in a batch procedure. I will upload all the major iterations I go through! My actual motive is to extract marks from mark sheets photographed by students. These mark sheets might be captured in various orientations and can involve many real-world problems.

Installation

1. Dependencies

Install the dependencies using

pip install -r Dependencies.txt

2. Download Tesseract OCR

Use the https://github.com/tesseract-ocr/tesseract/releases repository (you will need Tesseract installed LOCALLY on your system for OCR). You can get the Windows installer directly from https://github.com/UB-Mannheim/tesseract/wiki.

3. PATH

Enter the path to your PDF in pdf_path in the declaration outside. Add your Tesseract location to your PATH, or point the code at the Tesseract executable explicitly in the declared field.
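For reference, here is a minimal sketch of both options. The paths are placeholders (the Windows install locations shown are assumptions; adjust them for your machine):

```python
import os

# Hypothetical paths -- adjust for your system.
pdf_path = r"C:\scans\marksheet_01.pdf"
tesseract_dir = r"C:\Program Files\Tesseract-OCR"

# Option 1: put the Tesseract directory on PATH for this process.
os.environ["PATH"] += os.pathsep + tesseract_dir

# Option 2: point pytesseract at the binary explicitly.
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = os.path.join(tesseract_dir, "tesseract.exe")
```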


Even with all this, I had trouble finding meaningful output using Tesseract OCR.

An error I received a lot was:

Error during OSD detection: (1, 'Estimating resolution as 346 Warning. Invalid resolution 0 dpi. Using 70')

(the estimated resolution, 346 here, could be any number between 0 and 400 according to my observations). I only found one solution to the issue, which is more of a "bypass" than a cure. Thanks to Esraa anyhow!

image
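One common workaround for this class of error (my own sketch, not necessarily the exact bypass shown above) is to stop Tesseract from guessing the resolution by passing an explicit DPI in the config string:

```python
# Tell Tesseract the DPI outright so it stops "Estimating resolution"
# and failing OSD on low-metadata scans. 300 is an assumed typical value.
dpi = 300
config = f"--dpi {dpi} --psm 0"  # --psm 0 = orientation and script detection (OSD)

# With pytesseract installed and a PIL image `img` loaded, you would run:
# import pytesseract
# osd = pytesseract.image_to_osd(img, config=config)
```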

Let me share a sample input and output of one such case:

Input:

image

Output:

image

Not too great, right?

4. Groq

Before I started the project, I was convinced I could (with the help of GitHub and Stack Overflow, obviously) get the PDF OCR running using Tesseract alone. I could not have been more wrong. Someone suggested I try using AI, and, well, API keys don't come cheap.

In comes Groq. I stumbled across a Reddit thread that convinced me to use Groq, and so I did. You can read the documentation at https://console.groq.com/docs/quickstart directly.

You need to run export GROQ_API_KEY=<your-api-key-here> in your terminal of choice. Use set instead of export if you are on CMD. On PowerShell I'd suggest $env:GROQ_API_KEY="<your_api_key>", because this is what worked for me.

image

Many thanks to artworkjpm!

The main purpose of Groq here is to use Llama to correct the bad output from Tesseract OCR. This is where I hit a brick wall: even with Groq, the error bound was quite substantial.
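Since Groq exposes an OpenAI-compatible API, the correction call boils down to one HTTP request. A sketch of the request shape (the endpoint and model name are from Groq's documentation; the system prompt is my own placeholder, not the one used in this repo):

```python
import json
import os
import urllib.request


def build_groq_request(prompt, model="llama3-8b-8192"):
    """Build a chat-completion request for Groq's OpenAI-compatible endpoint.

    The GROQ_API_KEY environment variable supplies the bearer token.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "Clean up this noisy OCR text."},
            {"role": "user", "content": prompt},
        ],
    }
    return urllib.request.Request(
        "https://api.groq.com/openai/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('GROQ_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request with urllib.request.urlopen (or swapping in the official groq client library) returns the corrected text in the usual chat-completion JSON.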

I had to change my approach to the OCR itself.

5. Google Cloud Vision API

The strongest tool available to us: the Google Cloud Vision API. It is famously used in image detection, most visibly in Google Lens, so I had to give it a shot. As always, you can read the documentation and the guides over at https://cloud.google.com/vision/docs/how-to.

Just like Groq, you need to set an environment variable, GOOGLE_APPLICATION_CREDENTIALS. You also need to create a service-account key from Google Cloud, which will authenticate your API use; this downloads as a .json file. GOOGLE_APPLICATION_CREDENTIALS must point to the path of that .json file — you can refer to Point 4 for the shell syntax, or simply use $env:GOOGLE_APPLICATION_CREDENTIALS="<path-to-your-json-key>". You must also specify the .json path in the credentials section of the code. For more clarification on the process, you can refer to this video: https://www.youtube.com/watch?v=TTeVtJNWdmI.
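Under the hood, a document-OCR request to Vision reduces to a small JSON body sent to the images:annotate REST endpoint (the google-cloud-vision client library builds the same thing for you; this is a sketch of the REST shape, not necessarily how this repo calls it):

```python
import base64


def build_vision_request(image_bytes):
    """Build the JSON body for POST https://vision.googleapis.com/v1/images:annotate.

    DOCUMENT_TEXT_DETECTION is the feature type for dense text such as
    scanned mark sheets; image content is sent base64-encoded.
    """
    return {
        "requests": [
            {
                "image": {"content": base64.b64encode(image_bytes).decode()},
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }
```

The response's fullTextAnnotation.text field then carries the recognized text, which is what gets handed to Groq for cleanup.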

Now, with the help of both llama3-8b and the Vision API, let me share a sample IO:

Input:

image

Output:

image

That is almost perfection! Now I just need to tweak the Groq prompt a bit and we are golden!

With this, I conclude the "ExtractPDF" portion.


DataChecker

Installation

1) Dependencies

Install the required libraries using the following command:

  pip install pandas numpy openpyxl

2) File Path

i) Specify the path to your Excel file in the variable excel_file_path in the code.

ii) Specify the path to the folder containing the OCR-corrected output text files in the variable corrected_folder.

3) Workflow and Flag Representation

For each OCR-corrected output text file, the script will:

i) Validate the NAME, REGISTRATION NUMBER, and SUBJECT MARKS by comparing them with the data in the Excel database.

ii) Generate a flag based on the comparison results, which is stored in the Excel database.

iii) Create a discrepancy log for any mismatched data.

Flag Representation System

The flag will be a 4-digit number ( _ _ _ _ ), where the first digit encodes the registration-number check, the second the name check, and the last two the marks check:

x x 0 0: OCR output contains less data due to the poor quality of the PDF.

x x 0 1: Data mismatch—subject marks may have been interchanged between OCR output and Excel data. Requires manual review.

x x 1 0: Marks match between the PDF and Excel database.

x 0 x x: The student name does not match between the OCR output and the Excel database.

x 1 x x: The student name matches between the OCR output and the Excel database.

0 x x x: The registration number does not match between the OCR output and the Excel database.

1 x x x: The registration number matches between the OCR output and the Excel database.

Example Flags:

Flag = 1010: The registration number and marks match, but the student’s name does not match.

Flag = 1101: The registration number and student’s name match, but the marks may have been interchanged.

Flag = 1110: All data match (registration number, student name, and marks).
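The flag table above can be sketched as a small helper (the function and argument names are my own for illustration, not necessarily what the script uses):

```python
def make_flag(reg_match, name_match, marks_status):
    """Compose the 4-digit flag from the three checks.

    reg_match / name_match: booleans from comparing OCR output to the
    Excel database. marks_status is one of:
      "less_data"    -> "00" (OCR output missing data, poor-quality PDF)
      "interchanged" -> "01" (possible mark swap; needs manual review)
      "match"        -> "10" (marks agree)
    """
    marks_bits = {"less_data": "00", "interchanged": "01", "match": "10"}
    return f"{int(reg_match)}{int(name_match)}{marks_bits[marks_status]}"
```

For example, a matching registration number and marks with a mismatched name yields "1010", in line with the example flags above.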

