read_csv: parameter usecols has wrong type hint #963

clo-vis · 2024-07-24T13:34:57Z

To Reproduce

from collections.abc import Sequence
from pandas import read_csv
cols1: Sequence[str] = ["a"]
def cols2(x: set[float]) -> bool:  return sum(x) < 1.0
read_csv("file.csv", usecols=cols1) # mypy: "No overload variant of "read_csv" matches argument types "str", "Sequence[str]"
read_csv("file.csv", usecols=cols2) # no error from mypy

Expected behavior:
No error for usecols=cols1, but an error for usecols=cols2

Actual behavior:
cols1 is not accepted, even though Sequence[str] is "list like" (at least I think so; the term is nowhere defined) and its elements are "strings that correspond to column names"
cols2 is accepted, even though it is not a callable that can be "evaluated against the column names, returning names where the callable function evaluates to True"

Please complete the following information:

OS: Microsoft Windows
OS Version 10.0.19045.4651
python version 3.11.9
version of type checker: mypy 1.11.0
version of installed pandas-stubs: 2.2.2.240603

The text was updated successfully, but these errors were encountered:

Dr-Irv · 2024-07-24T13:58:56Z

There are 2 different issues here:

With respect to cols1, when you declare that cols1 is Sequence[str], that means that cols1 could be a list of strings OR a single string. We explicitly disallow a single string as the usecols argument. So you're better off making cols1 a list[str] .
With respect to cols2, we allow a Callable[[HashableT], bool] and sets are hashable. So that is why your code is accepted.

It may be that the solution is to change this line:

pandas-stubs/pandas-stubs/_typing.pyi

Line 529 in 48ca4b0

    
           SequenceNotStr[Hashable] | range | AnyArrayLike | Callable[[HashableT], bool] | None

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

clo-vis · 2024-07-24T14:25:33Z

>>> print(hash({1.0}))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'set'

sets are not hashable.

clo-vis · 2024-07-24T14:27:48Z

typing.py

class Hashable(Protocol, metaclass=ABCMeta):
    # TODO: This is special, in that a subclass of a hashable class may not be hashable
    #   (for example, list vs. object). It's not obvious how to represent this. This class
    #   is currently mostly useless for static checking.
    @abstractmethod
    def __hash__(self) -> int: ...

clo-vis · 2024-07-24T14:45:34Z

from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)

mypy: error: No overload variant of "read_csv" matches argument types "str", "Sequence[int]"
documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

Dr-Irv · 2024-07-24T15:15:43Z

from collections.abc import Sequence
from pandas import read_csv

cols1: Sequence[int] = [ 1 ]
read_csv("file_csv", usecols=cols1)
mypy: error: No overload variant of "read_csv" matches argument types "str", "Sequence[int]" documentation: If list-like, all elements must either be positional (i.e. integer indices into the document columns) or ...

That seems to be a mypy error. pyright accepts the above code.

Having said that, pyright accepts the example you had with a Callable, so we'll have to see what (if anything) we can do about that.

clo-vis · 2024-07-24T15:20:06Z

In strict mode, pyright says Type of "read_csv" is partially unknown.
So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

Dr-Irv · 2024-07-24T15:23:02Z

In strict mode, pyright says Type of "read_csv" is partially unknown. So if pyright doesn't know everything about read_csv, perhaps it cannot issue an error message either.

We only test and support pyright in basic mode. Strict is really hard to make work.

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem, at least for pyright. But mypy still doesn't flag the error.

Not sure what to do when mypy has a bug.

clo-vis · 2024-07-24T16:01:14Z

If I change the Callable[[HashableT], bool] to Callable[[Hashable], bool], then pyright in basic mode picks up the error. So a PR with that would solve that problem,

It would probably create another problem, as

def cols(x: str) -> bool: return x in ["A", "B", "C"]
read_csv("file_csv", usecols=cols)

should be acceptable (and cols is not a Callable[[Hashable], bool])

Dr-Irv · 2024-07-24T20:59:00Z

should be acceptable (and cols is not a Callable[[Hashable], bool])

Good point. Seems like we need to change it to: Callable[[int], bool] | Callable[[str], bool]

twoertwein · 2024-07-24T21:28:46Z

to use SequenceNotStr[str | int] and Callable[[str | int], bool], but I'd like to hear the opinion of @twoertwein on this.

I think that could be too strict (but might cover typical usage).

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

Dr-Irv · 2024-07-24T21:45:23Z

It seems to work as expected on pyright playground when inlining the TypeAlias (I think the issue is that we don't set the generic type of the TypeAlias when using it in the read_csv definition)

So something is going on that is different with what is in pandas-stubs versus the simple test that you wrote.

twoertwein · 2024-07-24T21:52:32Z

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

Dr-Irv · 2024-07-25T03:08:39Z

I think it might be sufficient to replace with UsecolArgs with UsecolArgs[HashableT] (pyright in strict mode points this out and says that it therefore assumes "unknown" as the input argument for the Callable part, which explains no error for set).

I tried that out locally and it works. So a PR that changes UseColArgsType in pandas-stubs\io\parsers\readers.pyi to UseColArgsType[HashableT] would fix the second issue reported here.

byildiz · 2024-08-16T20:39:12Z

I have the similar issue with usecols as the type of list of a string.

import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict

Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

Diagnostics:
1. No overloads for "read_csv" match the provided arguments [reportCallIssue]
2. Argument of type "list[str]" cannot be assigned to parameter "usecols" of type "UsecolsArgType[Unknown]" in function "read_csv"
     Type "list[str]" is incompatible with type "UsecolsArgType[Unknown]"
       "list[str]" is incompatible with protocol "SequenceNotStr[Hashable]"
         "index" is an incompatible type
           Type "(value: str, start: SupportsIndex = 0, stop: SupportsIndex = sys.maxsize, /) -> int" is incompatible with type "(value: Any, /, start: int = 0, stop: int = ...) -> int"
             Missing keyword parameter "start"
             Missing keyword parameter "stop"
       "list[str]" is incompatible with "range"
       "list[str]" is incompatible with "ExtensionArray"
     ... [reportArgumentType]

Dr-Irv · 2024-08-16T23:03:24Z

I have the similar issue with usecols as the type of list of a string.
import pandas as pd
df = pd.read_csv("test.csv", usecols=["a", "b"]) # this line causes type conflict
Pyright gives the following diagnostic message, but according to the pandas docs usecols can be a list of strings. I guess the type annotation of usecols is not accurate.

I cannot reproduce this. We explicitly test for lists of strings with usecols, so maybe you are using strict pyright, which might be the cause of your issue.

byildiz · 2024-08-17T01:33:53Z

@Dr-Irv Thank you for your reply.

When I looked at the Pyright configuration docs, it is stated that for reportArgumentType, all three modes (basic, standard, and strict) have the same reporting severity, which is error. Therefore, I don't think Pyright's type-checking mode is the problem.

On the other hand, when I checked the SequenceNotStr definition, I noticed that the version I have is slightly different from the latest version. I also found this change done 8 months ago:

-    def index(self, value: Any, /, start: int = 0, stop: int = ...) -> int: ...
+    def index(self, value: Any, start: int = ..., stop: int = ..., /) -> int: ...

So, the issue seems to be the version I have. I have still the old version. However, what I don't understand is that I am using the latest version (2.2.2) of Pandas, and I still don't have the change mentioned above. I downloaded the wheel file and inspected it, and it indeed contains the old version of the stub file. This is quite surprising.

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

Dr-Irv · 2024-08-17T16:13:50Z

I have probed some of the latest pandas packages from both pypi and conda and seen that all of them still contain old version of SequenceNotStr which causes the problem with latest version of pyright.

You need to install pandas-stubs . That will properly type-check your code. It is a separate package.

debonte mentioned this issue Aug 8, 2024

Panda's read_csv gives an error: Type of "read_csv" is partially unknown microsoft/pylance-release#230

Closed

byildiz mentioned this issue Aug 16, 2024

BUG: read_csv's usecols type hint isn't match with list of strings pandas-dev/pandas#59530

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

read_csv: parameter usecols has wrong type hint #963

read_csv: parameter usecols has wrong type hint #963

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

clo-vis commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

twoertwein commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

twoertwein commented Jul 24, 2024

Dr-Irv commented Jul 25, 2024

byildiz commented Aug 16, 2024 •

edited

Loading

Dr-Irv commented Aug 16, 2024

byildiz commented Aug 17, 2024 •

edited

Loading

Dr-Irv commented Aug 17, 2024

read_csv: parameter usecols has wrong type hint #963

read_csv: parameter usecols has wrong type hint #963

Comments

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

clo-vis commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

clo-vis commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

twoertwein commented Jul 24, 2024

Dr-Irv commented Jul 24, 2024

twoertwein commented Jul 24, 2024

Dr-Irv commented Jul 25, 2024

byildiz commented Aug 16, 2024 • edited Loading

Dr-Irv commented Aug 16, 2024

byildiz commented Aug 17, 2024 • edited Loading

Dr-Irv commented Aug 17, 2024

byildiz commented Aug 16, 2024 •

edited

Loading

byildiz commented Aug 17, 2024 •

edited

Loading