Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read_csv: parameter usecols has wrong type hint #963

Open
clo-vis opened this issue Jul 24, 2024 · 17 comments
Open

read_csv: parameter usecols has wrong type hint #963

clo-vis opened this issue Jul 24, 2024 · 17 comments

Comments

@clo-vis
Copy link

clo-vis commented Jul 24, 2024

To Reproduce

from collections.abc import Sequence
from pandas import read_csv
cols1: Sequence[str] = ["a"]
def cols2(x: set[float]) -> bool:  return sum(x) < 1.0
read_csv("file.csv", usecols=cols1) # mypy: "No overload variant of "read_csv" matches argument types "str", "Sequence[str]"
read_csv("file.csv", usecols=cols2) # no error from mypy

Expected behavior:
No error for usecols=cols1, but an error for usecols=cols2

Actual behavior:
cols1 is not accepted, even though Sequence[str] is "list like" (at least I think so; the term is nowhere defined) and its elements are "strings that correspond to column names"
cols2 is accepted, even though it is not a callable that can be "evaluated against the column names, returning names where the callable function evaluates to True"

Please complete the following information:

  • OS: Microsoft Windows
  • OS Version 10.0.19045.4651
  • python version 3.11.9
  • version of type checker: mypy 1.11.0
  • version of installed pandas-stubs: 2.2.2.240603
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2024

There are 2 different issues here:

  1. With respect to cols1, when you declare that cols1 is Sequence[str], that means that cols1 could be a list of strings OR a single string. We explicitly disallow a single string as the usecols argument. So you're better off making cols1 a list[str] .
  2. With respect to cols2, we allow a Callable[[HashableT], bool] and sets are hashable. So that is why your code is accepted.

It may be that the solution is to change this line:

SequenceNotStr[Hashable] | range | AnyArrayLike | Callable[[HashableT], bool] | None

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

@clo-vis
Copy link
Author

clo-vis commented Jul 24, 2024

>>> print(hash({1.0}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'

sets are not hashable.

@clo-vis
Copy link
Author

clo-vis commented Jul 24, 2024

typing.py

class Hashable(Protocol, metaclass=ABCMeta):
    # TODO: This is special, in that a subclass of a hashable class may not be hashable
    #   (for example, list vs. object). It's not obvious how to represent this. This class
    #   is currently mostly useless for static checking.
    @abstractmethod
    def __hash__(self) -> int: ...

@clo-vis
Copy link
Author

clo-vis commented Jul 24, 2024

from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)

mypy: error: No overload variant of "read_csv" matches argument types "str", "Sequence[int]"
documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2024

from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)

mypy: error: No overload variant of "read_csv" matches argument types "str", "Sequence[int]" documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

That seems to be a mypy error. pyright accepts the above code.

Having said that, pyright accepts the example you had with a Callable, so we'll have to see what (if anything) we can do about that.

@clo-vis
Copy link
Author

clo-vis commented Jul 24, 2024

In strict mode, pyright says Type of "read_csv" is partially unknown.
So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2024

In strict mode, pyright says Type of "read_csv" is partially unknown. So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

We only test and support pyright in basic mode. Strict is really hard to make work.

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem, at least for pyright. But mypy still doesn't flag the error.

Not sure what to do when mypy has a bug.

@clo-vis
Copy link
Author

clo-vis commented Jul 24, 2024

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem,

It would probably create another problem, as

def cols(x: str) -> bool: return x in ["A", "B", "C"]
read_csv("file_csv", usecols=cols)

should be acceptable (and cols is not a Callable[[Hashable], bool])

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2024

should be acceptable (and cols is not a Callable[[Hashable], bool])

Good point. Seems like we need to change it to: Callable[[int], bool] | Callable[[str], bool]

@twoertwein
Copy link
Member

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

I think that could be too strict (but might cover typical usage).

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 24, 2024

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

So something is going on that is different with what is in pandas-stubs versus the simple test that you wrote.

@twoertwein
Copy link
Member

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Jul 25, 2024

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

I tried that out locally and it works. So a PR that changes UseColArgsType in pandas-stubs\io\parsers\readers.pyi to UseColArgsType[HashableT] would fix the second issue reported here.

@byildiz
Copy link

byildiz commented Aug 16, 2024

I have the similar issue with usecols as the type of list of a string.

import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict

Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

Diagnostics:
1. No overloads for "read_csv" match the provided arguments [reportCallIssue]
2. Argument of type "list[str]" cannot be assigned to parameter "usecols" of type "UsecolsArgType[Unknown]" in function "read_csv"
     Type "list[str]" is incompatible with type "UsecolsArgType[Unknown]"
       "list[str]" is incompatible with protocol "SequenceNotStr[Hashable]"
         "index" is an incompatible type
           Type "(value: str, start: SupportsIndex = 0, stop: SupportsIndex = sys.maxsize, /) -> int" is incompatible with type "(value: Any, /, start: int = 0, stop: int = ...) -> int"
             Missing keyword parameter "start"
             Missing keyword parameter "stop"
       "list[str]" is incompatible with "range"
       "list[str]" is incompatible with "ExtensionArray"
     ... [reportArgumentType]

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 16, 2024

I have the similar issue with usecols as the type of list of a string.

import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict

Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

I cannot reproduce this. We explicitly test for lists of strings with usecols, so maybe you are using strict pyright, which might be the cause of your issue.

@byildiz
Copy link

byildiz commented Aug 17, 2024

@Dr-Irv Thank you for your reply.

When I looked at the Pyright configuration docs, it is stated that for reportArgumentType, all three modes (basic, standard, and strict) have the same reporting severity, which is error. Therefore, I don't think Pyright's type-checking mode is the problem.

On the other hand, when I checked the SequenceNotStr definition, I noticed that the version I have is slightly different from the latest version. I also found this change done 8 months ago:

-    def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int: ...
+    def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int: ...

So, the issue seems to be the version I have. I have still the old version. However, what I don't understand is that I am using the latest version (2.2.2) of Pandas, and I still don't have the change mentioned above. I downloaded the wheel file and inspected it, and it indeed contains the old version of the stub file. This is quite surprising.

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Aug 17, 2024

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

You need to install pandas-stubs . That will properly type-check your code. It is a separate package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants