TypeError: Invalid input type 'PdfDocument' #235

Open
Liu-XinYuan opened this issue Jul 21, 2024 · 6 comments
@Liu-XinYuan

I encountered the error below when running this command:
(venv) (base) MacBook-Pro-2:contract-master dylan$ marker_single /Users/dylan/xxxx.pdf /Users/dylan --language Chinese
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
  File "/Users/dylan/ai/contract-master/venv/bin/marker_single", line 8, in <module>
    sys.exit(main())
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/convert.py", line 65, in convert_single_pdf
    pages, toc = get_text_blocks(
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 75, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

@stupidcupid

same problem ;)

@iksk

iksk commented Aug 9, 2024

same problem ;)

@MissTeven

same problem ;)

@kenZhangCn

pip install pdftext==0.3.7
pip install marker_pdf==0.2.6
(mac-inter), reference #183

@mara004

mara004 commented Oct 8, 2024

Looks like some caller tries to pass a PdfDocument instance as input to a new PdfDocument, which is nonsense. If you already have a document handle, use it.

Update: see VikParuchuri's answer in VikParuchuri/pdftext#10 (comment): "I think the issues there were with mismatched pdftext/marker versions"
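
For anyone hitting this in their own code, the failure mode can be sketched without installing anything. This is only an illustration: `StubDocument` is a hypothetical stand-in that mimics the input type check seen in the traceback (it is not pypdfium2's real class, which accepts more input types), and `as_document` is a made-up helper name. The idea is simply to reuse a handle you already have instead of wrapping it again.

```python
from pathlib import Path


class StubDocument:
    """Hypothetical stand-in for a PDF document class.

    It only mimics the input type check from the traceback; it does not
    open or parse any file.
    """

    def __init__(self, input_data):
        if not isinstance(input_data, (str, Path, bytes)):
            raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
        self.source = input_data


def as_document(pdf_or_path, doc_cls=StubDocument):
    """Return an existing handle unchanged; otherwise open a new one."""
    if isinstance(pdf_or_path, doc_cls):
        return pdf_or_path  # already a document handle, so reuse it
    return doc_cls(pdf_or_path)


doc = as_document("example.pdf")  # opens a new (stub) handle from a path
same = as_document(doc)           # returns the very same handle

try:
    StubDocument(doc)             # wrapping a handle again raises the TypeError
except TypeError as exc:
    print(exc)
```

With a guard like this, a caller can accept either a path or a handle, which avoids the double wrapping described above. Whether that is appropriate for marker or pdftext themselves is a separate question; per the update, the practical fix there seems to be matching versions.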

@scottgigante-sightline

pip install pdftext==0.3.7
pip install marker_pdf==0.2.6
(mac-inter), reference #183

This is the right answer :) For those of us still using Python 3.11, I'd love to see a 0.2.6.post1 which pins pdftext to 0.3.7 :)
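
To check whether an environment actually matches the pins quoted above, something like the following might help. It is a rough sketch using only the standard library's `importlib.metadata`; `PINS` and `find_mismatches` are hypothetical names, and it assumes the distribution names are the ones used in the pip commands.

```python
from importlib import metadata

# Versions quoted in this thread; adjust if you pin something else.
PINS = {"pdftext": "0.3.7", "marker_pdf": "0.2.6"}


def find_mismatches(pins, get_version=metadata.version):
    """Map each package whose installed version differs from its pin
    to a (wanted, found) tuple; found is None if it is not installed."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    # An empty dict means both packages match the pinned versions.
    print(find_mismatches(PINS))
```

The `get_version` parameter exists only so the lookup can be swapped out when trying the function on made-up data; by default it queries the current environment.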
