It's designed for transmuting PDFs into HTML. Harness the power of OCR, image processing, and web technologies to unlock the secrets within your PDF documents.

OtenMoten/pdf-alchemist

👩🏼‍🔬 PDF Alchemist: A PDF to HTML Transmuter

Welcome to the realm of PDF Alchemist, where the secrets of PDFs are transmuted into HTML.

🌟 Project Overview

This Python application, lovingly named PDF Alchemist, is a sophisticated, open-source toolkit that combines the arcane arts of PDF parsing, OCR, image processing, and HTML generation. It's designed for those who seek to unlock the knowledge sealed within the enigmatic tomes we call PDFs.

This project brings together a fellowship of powerful components:

  • PDFParser: The Document Detective, powered by PyMuPDF
  • OCREngine: The Text Archaeologist, empowered by Tesseract
  • ImageProcessor: The Digital Alchemist, enhanced by Pillow
  • HTMLGenerator: The Web Illusionist, crafted with Dominate
  • ProgressTracker: The Expedition Chronicler, utilizing Python's built-in logging module
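As an illustration of the last component, here is a minimal sketch of how a progress tracker could be built on Python's logging module. The class layout, the `advance` method, and the logger name are assumptions made for this example and are not taken from the project's source.

```python
import logging

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("pdf_alchemist.progress")


class ProgressTracker:
    """Illustrative tracker: counts finished pages and logs each step."""

    def __init__(self, total_pages: int) -> None:
        if total_pages <= 0:
            raise ValueError("total_pages must be positive")
        self.total_pages = total_pages
        self.done = 0

    def advance(self, step: str) -> float:
        """Record one finished page and return the completed fraction."""
        # Clamp so extra calls never report more than 100%.
        self.done = min(self.done + 1, self.total_pages)
        fraction = self.done / self.total_pages
        logger.info(
            "%s: page %d/%d (%.0f%%)",
            step, self.done, self.total_pages, fraction * 100,
        )
        return fraction
```

Calling `logging.basicConfig(level=logging.INFO)` once at startup makes these messages visible on the console.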

✨ Capabilities

  • Unearth text and images from PDF archives
  • Decipher text using advanced OCR incantations
  • Transmute images into optimized, base64-encoded artifacts
  • Weave extracted elements into responsive HTML tapestries
  • Chronicle the expedition with detailed logs and progress tracking
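To make the base64 step concrete, the sketch below shows one way image bytes can be embedded directly into an HTML page as a data URI. It uses only the standard library. The helper names `image_to_data_uri` and `img_tag` are hypothetical, and the project itself may do this differently (for example through Pillow and Dominate).

```python
import base64
from html import escape


def image_to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes: bytes, alt: str = "", mime_type: str = "image/png") -> str:
    """Build an <img> element that scales down with its container."""
    uri = image_to_data_uri(image_bytes, mime_type)
    # Escape the alt text so quotes cannot break out of the attribute.
    safe_alt = escape(alt, quote=True)
    return f'<img src="{uri}" alt="{safe_alt}" style="max-width: 100%; height: auto;">'
```

Embedding images this way keeps each HTML page self-contained, at the cost of roughly a third more bytes than the original binary image.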

🧪 Installation

To establish your own PDF Alchemist's laboratory:

  1. Clone this arcane repository:
    git clone https://github.com/team-bitfuture/pdf-alchemist.git
    
  2. Enter the sacred circle:
    cd pdf-alchemist
    
  3. Summon the required artifacts:
    pip install -r requirements.txt
    
  4. Ensure you possess the Tesseract grimoire. If not, acquire it here.
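A quick way to confirm step 4 is to check whether a Tesseract executable is on your PATH. The snippet below is a small sketch using only the standard library. It assumes the binary is called `tesseract`, which is the usual name but may differ on your system, and the `find_tesseract` helper is hypothetical.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it before running OCR.")
    else:
        print(f"Tesseract found at: {path}")
```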

🔮 Usage

To initiate the PDF transmutation ritual:

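import os

# main() is the project's entry-point function; it must be defined or imported in this script.
if __name__ == "__main__":
    pdf_path = "input.pdf"  # PDF to convert
    output_dir = "output"  # directory that receives the generated HTML
    os.makedirs(output_dir, exist_ok=True)  # create the output directory if it is missing
    main(pdf_path, output_dir)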

This will transmute your PDF into a series of HTML pages, complete with extracted text, images, and layout information.

🧬 Running Tests

To ensure your PDF Alchemist is operating at peak efficiency:

pytest tests/

This will execute a series of arcane trials, testing each component of the PDF Alchemist.

🤝 Contributing

We welcome fellow arcane researchers to join our quest. If you wish to contribute:

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/MagicSpell)
  3. Commit your changes (git commit -m 'Add MagicSpell')
  4. Push to the branch (git push origin feature/MagicSpell)
  5. Open a Pull Request

📜 License

This project is licensed under the GPL-3.0 License - see the LICENSE.md file for details.

🧙‍♂️ Authors

See also the list of contributors who participated in this arcane project.

🌟 Connect with Team BitFuture

May your PDFs always yield their secrets, and your HTML render with perfection. 📜🌐
