Proposal to Fix IndexError: string index out of range in magic_pdf/para/para_split_v3.py #953

Closed
HiroshigeAoki opened this issue Nov 14, 2024 · 1 comment
Labels
bug Something isn't working

HiroshigeAoki commented Nov 14, 2024

Description of the bug

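An IndexError: string index out of range occurs in __is_list_or_index_block (magic_pdf/para/para_split_v3.py, line 197 in the traceback below) when the text of a line is an empty string: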

if lines_text_list[i][0].isdigit():

Proposed Fix:

I think it might be helpful to check that each entry of lines_text_list is non-empty before indexing into it. Current code:

                    if lines_text_list[i][0].isdigit():
                        line[ListLineTag.IS_LIST_START_LINE] = True
                    if lines_text_list[i][-1] in LIST_END_FLAG:
                        line[ListLineTag.IS_LIST_END_LINE] = True

Proposed change:

                if lines_text_list[i]:
                    if lines_text_list[i][0].isdigit():
                        line[ListLineTag.IS_LIST_START_LINE] = True
                    if lines_text_list[i][-1] in LIST_END_FLAG:
                        line[ListLineTag.IS_LIST_END_LINE] = True
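
The traceback in the error log shows a non-empty list that contains an empty string (['text1', '', 'text2']), so the guard has to test the individual line text rather than the list as a whole. Below is a minimal, self-contained sketch of that idea. The constants and the function name tag_list_lines are hypothetical stand-ins for illustration, not the actual definitions in para_split_v3.py, and this is not necessarily how the maintainers fixed it.

```python
# Hypothetical stand-ins for the real constants used in para_split_v3.py.
LIST_END_FLAG = ('.', ';')
IS_LIST_START_LINE = 'is_list_start_line'
IS_LIST_END_LINE = 'is_list_end_line'


def tag_list_lines(lines, lines_text_list):
    """Tag list start/end lines, skipping lines whose text is empty."""
    for line, text in zip(lines, lines_text_list):
        if not text:
            # '' has no first or last character, so indexing it
            # would raise IndexError; leave such lines untagged.
            continue
        if text[0].isdigit():
            line[IS_LIST_START_LINE] = True
        if text[-1] in LIST_END_FLAG:
            line[IS_LIST_END_LINE] = True
    return lines


lines = [{}, {}, {}]
tag_list_lines(lines, ['1. text1', '', 'text2.'])
print(lines)
```

With this guard, the empty middle entry is skipped and the other two lines are tagged as usual.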

Error log:

2024-11-14 02:55:50.962 | ERROR    | magic_pdf.tools.cli:parse_doc:109 - string index out of range
Traceback (most recent call last):

  File "/opt/mineru_venv/bin/magic-pdf", line 8, in <module>
    sys.exit(cli())
    │   │    └ <Command cli>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           │    │     │       └ {}
           │    │     └ ()
           │    └ <function BaseCommand.main at 0x7212e51113f0>
           └ <Command cli>
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         │    │      └ <click.core.Context object at 0x7212e559fc10>
         │    └ <function Command.invoke at 0x7212e5111ea0>
         └ <Command cli>
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           │   │      │    │           │   └ {'path': 'example.pdf', 'output_dir': './', 'lang': 'japan', 'method': 'txt', 'debug_able': False, 'start_page_id': 0, 'end_pag...
           │   │      │    │           └ <click.core.Context object at 0x7212e559fc10>
           │   │      │    └ <function cli at 0x721285f3f760>
           │   │      └ <Command cli>
           │   └ <function Context.invoke at 0x7212e5110c10>
           └ <click.core.Context object at 0x7212e559fc10>
  File "/opt/mineru_venv/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
                       │       └ {'path': 'example.pdf', 'output_dir': './', 'lang': 'japan', 'method': 'txt', 'debug_able': False, 'start_page_id': 0, 'end_pag...
                       └ ()
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 115, in cli
    parse_doc(path)
    │         └ 'example.pdf'
    └ <function cli.<locals>.parse_doc at 0x7212e533d750>
> File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/cli.py", line 96, in parse_doc
    do_parse(
    └ <function do_parse at 0x721285f3eb00>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/tools/common.py", line 93, in do_parse
    pipe.pipe_parse()
    │    └ <function TXTPipe.pipe_parse at 0x721285f3ed40>
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pipe/TXTPipe.py", line 29, in pipe_parse
    self.pdf_mid_data = parse_txt_pdf(self.pdf_bytes, self.model_list, self.image_writer, is_debug=self.is_debug,
    │    │              │             │    │          │    │           │    │                      │    └ True
    │    │              │             │    │          │    │           │    │                      └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
    │    │              │             │    │          │    │           │    └ <magic_pdf.rw.DiskReaderWriter.DiskReaderWriter object at 0x721285f1b430>
    │    │              │             │    │          │    │           └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
    │    │              │             │    │          │    └ [{'layout_dets': [{'category_id': 1, 'poly': [515.2504272460938, 1037.1888427734375, 1195.464111328125, 1037.1888427734375, 1...
    │    │              │             │    │          └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
    │    │              │             │    └ b'%PDF-1.7\r\n%\xb5\xb5\xb5\xb5\r\n1 0 obj\r\n<</Type/Catalog/Pages 2 0 R/Lang(ja) /StructTreeRoot 682 0 R/MarkInfo<</Marked ...
    │    │              │             └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
    │    │              └ <function parse_txt_pdf at 0x721285f11fc0>
    │    └ None
    └ <magic_pdf.pipe.TXTPipe.TXTPipe object at 0x721285f1b5b0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/user_api.py", line 34, in parse_txt_pdf
    pdf_info_dict = parse_pdf_by_txt(
                    └ <function parse_pdf_by_txt at 0x721285f3e560>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pdf_parse_by_txt.py", line 15, in parse_pdf_by_txt
    return pdf_parse_union(dataset,
           │               └ <magic_pdf.data.dataset.PymuDocDataset object at 0x72111501cf70>
           └ <function pdf_parse_union at 0x721285f3e4d0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/pdf_parse_union_core_v2.py", line 630, in pdf_parse_union
    para_split(pdf_info_dict, debug_mode=debug_mode)
    │          │                         └ True
    │          └ {'page_0': {'preproc_blocks': [{'type': 'title', 'bbox': [175, 270, 418, 327], 'lines': [{'bbox': [177.13999938964844, 274.15...
    └ <function para_split at 0x72128b305c60>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 309, in para_split
    __para_merge_page(all_blocks)
    │                 └ [{'type': 'title', 'bbox': [175, 270, 418, 327], 'lines': [{'bbox': [177.13999938964844, 274.15997314453125, 417.910003662109...
    └ <function __para_merge_page at 0x72128b305bd0>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 272, in __para_merge_page
    block_type = __is_list_or_index_block(block)
                 │                        └ {'type': 'text', 'bbox': [85, 527, 516, 594], 'lines': [{'bbox': [73.8239974975586, 530, 513.4500732421875, 542], 'spans': [{...
                 └ <function __is_list_or_index_block at 0x72128b305990>
  File "/opt/mineru_venv/lib/python3.10/site-packages/magic_pdf/para/para_split_v3.py", line 197, in __is_list_or_index_block
    if lines_text_list[i][0].isdigit():
       │               └ 2
       └ ['text1', '', 'text2']
IndexError: string index out of range
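
For reference, the failure can be reproduced without MinerU: indexing an empty string raises IndexError, while slicing it does not. The short sketch below uses two hypothetical helper functions (not part of magic_pdf) to show the difference.

```python
def first_char_is_digit_unsafe(text):
    # Raises IndexError when text == '' because there is no index 0.
    return text[0].isdigit()


def first_char_is_digit_safe(text):
    # Slicing never raises: ''[:1] is '', and ''.isdigit() is False.
    return text[:1].isdigit()


try:
    first_char_is_digit_unsafe('')
except IndexError as exc:
    print('unsafe:', exc)

print('safe:', first_char_is_digit_safe(''))
```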

How to reproduce the bug

  1. Build and run Docker
wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
docker build -t mineru:latest .
docker run --rm -it --gpus=all mineru:latest /bin/bash
root@[container id]:/#cat magic-pdf.template.json > magic-pdf.json
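  2. Modify magic-pdf.json (lines marked "# Update" are the changed values; the markers are annotations only, since JSON does not allow comments)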
{
    "bucket_info":{
        "bucket-name-1":["ak", "sk", "endpoint"],
        "bucket-name-2":["ak", "sk", "endpoint"]
    },
    "models-dir":"/tmp/models",
    "layoutreader-model-dir":"/tmp/layoutreader",
    "device-mode":"cuda", # Update
    "layout-config": {
        "model": "layoutlmv3"
    },
    "formula-config": {
        "mfd_model": "yolo_v8_mfd",
        "mfr_model": "unimernet_small",
        "enable": false # Update
    },
    "table-config": {
        "model": "tablemaster",
        "enable": true, # Update
        "max_time": 400
    },
    "config_version": "1.0.0"
}
  3. Copy the target PDF to the container and run the magic-pdf command
docker cp example.pdf [container id]:/
root@[container id]:/#magic-pdf -p example.pdf -o ./ -l japan -m auto

Operating system

Linux

Python version

3.10

Software version (magic-pdf --version)

0.9.x

Device mode

cuda

@HiroshigeAoki HiroshigeAoki added the bug Something isn't working label Nov 14, 2024

myhloli (Collaborator) commented Nov 14, 2024

Thank you for your feedback. We have already fixed the issue in the dev branch and will release a new patched version as soon as possible.

@myhloli myhloli closed this as completed Nov 14, 2024