TypeError: Invalid input type 'PdfDocument' #235

Liu-XinYuan · 2024-07-21T08:15:06Z

I encountered the following error when running this command:
(venv) (base) MacBook-Pro-2:contract-master dylan$ marker_single /Users/dylan/xxxx.pdf /Users/dylan --language Chinese
Loading detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loading detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loading reading order model vikp/surya_order on device mps with dtype torch.float16
Loaded texify model to mps with torch.float16 dtype
Traceback (most recent call last):
  File "/Users/dylan/ai/contract-master/venv/bin/marker_single", line 8, in <module>
    sys.exit(main())
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/convert_single.py", line 26, in main
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/convert.py", line 65, in convert_single_pdf
    pages, toc = get_text_blocks(
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/marker/pdf/extract_text.py", line 85, in get_text_blocks
    char_blocks = dictionary_output(doc, page_range=page_range, keep_chars=True)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 75, in dictionary_output
    pages = _get_pages(pdf_path, model, page_range, workers=workers)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pdftext/extraction.py", line 26, in _get_pages
    pdf_doc = pdfium.PdfDocument(pdf_path)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 78, in __init__
    self.raw, to_hold, to_close = _open_pdf(self._input, self._password, self._autoclose)
  File "/Users/dylan/ai/contract-master/venv/lib/python3.10/site-packages/pypdfium2/_helpers/document.py", line 674, in _open_pdf
    raise TypeError(f"Invalid input type '{type(input_data).__name__}'")
TypeError: Invalid input type 'PdfDocument'

stupidcupid · 2024-08-07T09:29:37Z

Same problem ;)

iksk · 2024-08-09T16:06:23Z

Same problem ;)

MissTeven · 2024-08-15T01:46:55Z

Same problem ;)

kenZhangCn · 2024-08-16T09:41:09Z

pip install pdftext==0.3.7
pip install marker_pdf==0.2.6
(Mac, Intel; see #183)
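To confirm that an environment actually has the pinned pair installed, something like the following could be run inside the venv. This is only a sketch: `EXPECTED` and `find_mismatches` are names made up for illustration, and the pins are simply the ones suggested in the comment above, not an official compatibility table.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the comment above (assumption: this pair works for you too).
EXPECTED = {"pdftext": "0.3.7", "marker_pdf": "0.2.6"}


def find_mismatches(expected, lookup=version):
    """Return {package: (installed_or_None, wanted)} for every package off its pin."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = lookup(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != wanted:
            mismatches[name] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    # Prints nothing when both packages match their pins.
    for name, (installed, wanted) in find_mismatches(EXPECTED).items():
        print(f"{name}: installed {installed}, expected {wanted}")
```

The `lookup` parameter exists only so the check can be exercised without touching the real environment; by default it queries the installed distributions.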

mara004 · 2024-10-08T18:38:19Z

Looks like some caller tries to pass a PdfDocument instance as input to a new PdfDocument, which is nonsense. If you already have a document handle, use it.

Update: see VikParuchuri's answer in VikParuchuri/pdftext#10 (comment): "I think the issues there were with mismatched pdftext/marker versions"
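To illustrate the "if you already have a document handle, use it" point, here is a minimal sketch of the pattern. `Document` is a stand-in written for this example, not the real pypdfium2 class, and `open_document` is a hypothetical helper; the stand-in only imitates the type check that raises the error in the traceback.

```python
from pathlib import Path


class Document:
    """Stand-in for a PDF document handle (NOT the real pypdfium2 class)."""

    def __init__(self, source):
        # Imitate the failure from the traceback: a document object is not valid input.
        if not isinstance(source, (str, Path, bytes)):
            raise TypeError(f"Invalid input type '{type(source).__name__}'")
        self.source = source


def open_document(source):
    """Return a Document, reusing `source` if it is already one (hypothetical helper)."""
    if isinstance(source, Document):
        return source  # already open: hand it back instead of wrapping it again
    return Document(source)


doc = open_document("example.pdf")
same = open_document(doc)  # reuses the existing handle, no TypeError
```

Calling `Document(doc)` directly would raise the same kind of `TypeError` as in the report, which is what happens when one library version hands over a document object while the other still expects a path.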

scottgigante-sightline · 2024-11-08T20:25:20Z

> pip install pdftext==0.3.7
> pip install marker_pdf==0.2.6
> (Mac, Intel; see #183)

This is the right answer :) For those of us still using Python 3.11, I'd love to see a 0.2.6.post1 which pins pdftext to 0.3.7 :)

This was referenced Oct 8, 2024:

TypeError: Invalid input type 'PdfDocument' #183 (Closed)
TypeError: Invalid input type 'PdfDocument' #137 (Closed)
Fix document loading bug VikParuchuri/pdftext#10 (Merged)
