[BUG] column parameter of read_pdf currently needs to be list, not generic iterable #389

Closed
3 tasks done
rbubley opened this issue May 19, 2024 · 3 comments
Labels
bug, good first issue

Comments

@rbubley
Contributor

rbubley commented May 19, 2024

Summary

The docs say the columns parameter can be any iterable, but the code only works when it is a list.

Did you read the FAQ?

  • I have read the FAQ

Did you search GitHub issues?

  • I have searched the issues

Did you search GitHub Discussions?

  • I have searched the discussions

(Optional) PDF URL

No response

About your environment

Python version:
    3.12.3 (main, Apr  9 2024, 08:09:14) [Clang 15.0.0 (clang-1500.3.9.4)]
Java version:
    openjdk version "21.0.3" 2024-04-16
    OpenJDK Runtime Environment Homebrew (build 21.0.3)
    OpenJDK 64-Bit Server VM Homebrew (build 21.0.3, mixed mode, sharing)
tabula-py version: 2.9.1
platform: macOS-14.5-arm64-arm-64bit
uname:
    uname_result(system='Darwin', node='Russs-MacBook-Pro-2.local', release='23.5.0', version='Darwin Kernel Version 23.5.0: Wed May  1 20:12:58 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6000', machine='arm64')
linux_distribution: ('Darwin', '23.5.0', '')
mac_ver: ('14.5', ('', '', ''), 'arm64')

What did you do when you faced the problem?

I called the read_pdf function with a tuple as the columns parameter.

The problem arises from line 238 of util.py:

            if self.columns != sorted(self.columns):

It should presumably say:

            if list(self.columns) != sorted(self.columns):
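
The check fails for non-list input because `sorted()` always returns a list, and in Python a tuple never compares equal to a list, even when the elements match. A minimal sketch of that behavior, independent of tabula-py (the helper names `is_sorted_buggy` and `is_sorted_fixed` are hypothetical and exist only for illustration):

```python
def is_sorted_buggy(columns):
    # Mirrors the comparison quoted from util.py: the input is compared
    # directly against the list returned by sorted().
    return columns == sorted(columns)


def is_sorted_fixed(columns):
    # Mirrors the proposed fix: convert to a list before comparing.
    return list(columns) == sorted(columns)


cols_list = [26.05, 142.55, 175.47]
cols_tuple = (26.05, 142.55, 175.47)

print(is_sorted_buggy(cols_list))    # True
print(is_sorted_buggy(cols_tuple))   # False, although the tuple is sorted
print(is_sorted_fixed(cols_tuple))   # True
print(is_sorted_fixed((3.0, 1.0)))   # False, genuinely unsorted
```

Note that the `list(...)` fix would still reject one-shot iterables such as generators: `list(columns)` consumes the generator, so the later `sorted(columns)` sees nothing and the comparison fails for any non-empty input.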

Code

import tabula

# Read the PDF into a list of DataFrames
dfs = tabula.read_pdf(
    "test.pdf",
    stream=True,
    area=(145.39, 26.05, 584.21, 584.02),
    # Passing columns as a tuple rather than a list triggers the error
    columns=(26.05, 142.55, 175.47, 215.61, 252.32, 385.59, 487.44, 583.15),
    pages="all",
)

Expected behavior

The function executes successfully.

Actual behavior

Relevant part of Traceback

 File "/Users/russ/project/venv/lib/python3.12/site-packages/tabula/util.py", line 239, in build_option_list
    raise ValueError("columns option should be sorted")
ValueError: columns option should be sorted

Related issues

No response

@rbubley rbubley added the triage label May 19, 2024
@chezou chezou added the bug and good first issue labels and removed the triage label May 19, 2024
@chezou
Owner

chezou commented May 19, 2024

@rbubley Thanks for reporting. That error is unexpected, and the option should work.

I'd be happy if you could contribute a fix!

@rbubley
Contributor Author

rbubley commented May 19, 2024

Sure. Two things are possible:

  1. Change the type hint/documentation from iterable to collections.abc.Sequence, and check that it is sorted
  2. Don't bother checking if it's already sorted, just sort it within the function.

Do you have a preference?

@chezou
Owner

chezou commented May 19, 2024

Thanks for your suggestion. I prefer Option 1, since a hidden conversion could cause unintended issues.
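
Option 1 could be sketched roughly as follows. This is an illustration under stated assumptions, not tabula-py's actual code: the function name `validate_columns` and the `TypeError` message are hypothetical, and only the `ValueError` message is taken from the traceback above.

```python
from collections.abc import Sequence
from typing import Optional


def validate_columns(columns: Optional[Sequence[float]]) -> None:
    """Hypothetical helper (not tabula-py's API): accept any sorted sequence."""
    if columns is None:
        return
    # Reject strings and one-shot iterables such as generators up front.
    if isinstance(columns, (str, bytes)) or not isinstance(columns, Sequence):
        raise TypeError("columns should be a sequence of numbers, such as a list or tuple")
    # Compare list to list so tuples and other sequences are treated alike.
    if list(columns) != sorted(columns):
        raise ValueError("columns option should be sorted")


validate_columns([26.05, 142.55, 175.47])  # passes
validate_columns((26.05, 142.55, 175.47))  # passes: tuples are sequences too
```

The input is only checked, never reordered, so the caller's data is passed on exactly as given and an unsorted argument is reported rather than silently fixed.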
