Proposed Fix:
I think it might help to check that each element of lines_text_list is non-empty before indexing into it. The error log shows an empty string in the list, and indexing an empty string with [0] or [-1] raises an IndexError.
if lines_text_list[i][0].isdigit():
    line[ListLineTag.IS_LIST_START_LINE] = True
if lines_text_list[i][-1] in LIST_END_FLAG:
    line[ListLineTag.IS_LIST_END_LINE] = True
↓
if lines_text_list[i]:
    if lines_text_list[i][0].isdigit():
        line[ListLineTag.IS_LIST_START_LINE] = True
    if lines_text_list[i][-1] in LIST_END_FLAG:
        line[ListLineTag.IS_LIST_END_LINE] = True
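For illustration, here is a minimal, self-contained sketch of such a guard. It checks each string rather than the list as a whole, since the error log below shows an empty string inside a non-empty list. `ListLineTag`, `LIST_END_FLAG`, and `tag_list_lines` are stand-ins defined only for this example; the real definitions in magic_pdf may differ.

```python
# Stand-ins for illustration only; the real names live in magic_pdf
# and their values may differ from the ones assumed here.
class ListLineTag:
    IS_LIST_START_LINE = "is_list_start_line"
    IS_LIST_END_LINE = "is_list_end_line"


LIST_END_FLAG = (".", "。", ";")  # assumed values


def tag_list_lines(lines, lines_text_list):
    """Tag each line dict in place, skipping lines whose text is empty."""
    for i, line in enumerate(lines):
        text = lines_text_list[i]
        if not text:
            # An empty string has no first or last character, so
            # text[0] or text[-1] would raise IndexError.
            continue
        if text[0].isdigit():
            line[ListLineTag.IS_LIST_START_LINE] = True
        if text[-1] in LIST_END_FLAG:
            line[ListLineTag.IS_LIST_END_LINE] = True
    return lines


# Hypothetical input: the middle line has no text.
tagged = tag_list_lines([{}, {}, {}], ["1. text1", "", "text2."])
print(tagged)
```

With this guard the empty line is simply left untagged. Whether an empty line should instead be filtered out earlier in the pipeline is a separate design question.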
Error log:
2024-11-14 02:55:50.962 | ERROR | magic_pdf.tools.cli:parse_doc:109 - string index out of range
Traceback (most recent call last):
File "/opt/mineru_venv/bin/magic-pdf", line 8, in <module>
sys.exit(cli())
│ │ └ <Command cli>
│ └ <built-in function exit>
└ <module 'sys' (built-in)>
File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
│ │ │ └ {}
│ │ └ ()
│ └ <function BaseCommand.main at 0x7212e51113f0>
└ <Command cli>
File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
│ │ └ <click.core.Context object at 0x7212e559fc10>
│ └ <function Command.invoke at 0x7212e5111ea0>
└ <Command cli>
File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
│ │ │ │ │ └ {'path': 'example.pdf', 'output_dir': './', 'lang': 'japan', 'method': 'txt', 'debug_able': False, 'start_page_id': 0, 'end_pag...
│ │ │ │ └ <click.core.Context object at 0x7212e559fc10>
│ │ │ └ <function cli at 0x721285f3f760>
│ │ └ <Command cli>
│ └ <function Context.invoke at 0x7212e5110c10>
└ <click.core.Context object at 0x7212e559fc10>
File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
│ └ {'path': 'example.pdf', 'output_dir': './', 'lang': 'japan', 'method': 'txt', 'debug_able': False, 'start_page_id': 0, 'end_pag...
└ ()
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 115, in cli
parse_doc(path)
│ └ 'example.pdf'
└ <function cli.<locals>.parse_doc at 0x7212e533d750>
> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 96, in parse_doc
do_parse(
└ <function do_parse at 0x721285f3eb00>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 93, in do_parse
pipe.pipe_parse()
│ └ <function TXTPipe.pipe_parse at 0x721285f3ed40>
└ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pipe/TXTPipe.py", line 29, in pipe_parse
self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
│ │ │ │ │ │ │ │ │ │ └ True
│ │ │ │ │ │ │ │ │ └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
│ │ │ │ │ │ │ │ └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x721285f1b430>
│ │ │ │ │ │ │ └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
│ │ │ │ │ │ └ [{'layout_dets': [{'category_id': 1, 'poly': [515.2504272460938, 1037.1888427734375, 1195.464111328125, 1037.1888427734375, 1...
│ │ │ │ │ └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
│ │ │ │ └ b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(ja) /StructTreeRoot 682 0 R/MarkInfo<</Marked ...
│ │ │ └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
│ │ └ <function parse_txt_pdf at 0x721285f11fc0>
│ └ None
└ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/user_api.py", line 34, in parse_txt_pdf
pdf_info_dict = parse_pdf_by_txt(
└ <function parse_pdf_by_txt at 0x721285f3e560>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 15, in parse_pdf_by_txt
return pdf_parse_union(dataset,
│ └ <magic_pdf.data.dataset.PymuDocDataset object at 0x72111501cf70>
└ <function pdf_parse_union at 0x721285f3e4d0>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 630, in pdf_parse_union
para_split(pdf_info_dict, debug_mode=debug_mode)
│ │ └ True
│ └ {'page_0': {'preproc_blocks': [{'type': 'title', 'bbox': [175, 270, 418, 327], 'lines': [{'bbox': [177.13999938964844, 274.15...
└ <function para_split at 0x72128b305c60>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 309, in para_split
__para_merge_page(all_blocks)
│ └ [{'type': 'title', 'bbox': [175, 270, 418, 327], 'lines': [{'bbox': [177.13999938964844, 274.15997314453125, 417.910003662109...
└ <function __para_merge_page at 0x72128b305bd0>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 272, in __para_merge_page
block_type = __is_list_or_index_block(block)
│ └ {'type': 'text', 'bbox': [85, 527, 516, 594], 'lines': [{'bbox': [73.8239974975586, 530, 513.4500732421875, 542], 'spans': [{...
└ <function __is_list_or_index_block at 0x72128b305990>
File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 197, in __is_list_or_index_block
if lines_text_list[i][0].isdigit():
│ └ 2
└ ['text1', '', 'text2']
IndexError: string index out of range
Description of the bug
An IndexError: string index out of range occurs at MinerU/magic_pdf/para/para_split_v3.py, line 197 (commit d0558ab).
How to reproduce the bug
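No input file is attached, but the failing expression can be exercised in isolation. The sketch below assumes, based on the values shown in the traceback, that one entry of lines_text_list is an empty string. The list contents are the placeholder values from the log, not real document text.

```python
# Placeholder values copied from the traceback in the error log.
lines_text_list = ["text1", "", "text2"]

failing_indices = []
for i in range(len(lines_text_list)):
    try:
        # Same expression as line 197 of para_split_v3.py in the traceback.
        lines_text_list[i][0].isdigit()
    except IndexError as exc:
        failing_indices.append((i, str(exc)))

# Only the empty string at index 1 fails.
print(failing_indices)
```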
Operating system
Linux
Python version
3.10
Software version (magic-pdf --version)
0.9.x
Device mode
cuda