Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Type Hints for Python Package #7742

Merged
merged 26 commits into from
May 17, 2022
Merged

Add Type Hints for Python Package #7742

merged 26 commits into from
May 17, 2022

Conversation

bridgream
Copy link
Contributor

@bridgream bridgream commented Mar 22, 2022

Following #7707, I plan to resolve most mypy errors across the python package if not all.

We should probably populate compat.py with stubs for even more packages, such as cupy.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, please let me know if you need any help

python-package/xgboost/_typing.py Outdated Show resolved Hide resolved
missing,
nthread,
data: DataType,
missing: float,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float is actually not accurate, we can also take np.float32 and related types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about using FloatCompatible = Union[float, np.float32, np.float64] (do we also need to add np.float16?)

@bridgream
Copy link
Contributor Author

Sorry I wasn't able to follow up recently... Can I ask what cat_codes is? For example, https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L1151

DataType = Any

# xgboost accepts some other possible types in practice due to historical reason, which is
# lesser tested. For now we encourage users to pass a simple list of string.
FeatureNames = Optional[List[str]]
FeatureInfo = Optional[Sequence[str]]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we change to Sequence[str] here, and add Optional in the signatures when needed? Some functions require feature_names to be not None, for example, https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/core.py#L2562-L2563

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mypy suggests using Sequence for function input and more specific type like List and Tuple as function return types. https://mypy.readthedocs.io/en/stable/generics.html?highlight=variance#variance-of-generic-types

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we're talking about the same thing... I meant to use FeatureInfo = Sequence[str] here, and when referring to it, use Optional[FeatureInfo]
I've also drafted a commit to better illustrate my idea: bridgream@5e3d4af

This isn't something major, though...

@trivialfis
Copy link
Member

@bridgream

Sorry I wasn't able to follow up recently... Can I ask what cat_codes is? For example

No worries, mypy needs to mature and we are not rushing into it. cat code is the encoded categorical feature from cudf. Let's just mark it as a list for now.

@bridgream bridgream marked this pull request as ready for review April 21, 2022 11:46
@trivialfis
Copy link
Member

Hi, @bridgream could you please resolve the conflicts?

@bridgream
Copy link
Contributor Author

Hi @trivialfis , could you please restart the Travis CI job? It looks the job failed at random.

@trivialfis
Copy link
Member

Please ignore it for now. It's SSH policy is changed.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we are a lot closer to fully integrated typing. ;-)

@@ -1603,7 +1609,7 @@ def attr(self, key: str) -> Optional[str]:
return py_str(ret.value)
return None

def attributes(self) -> Dict[str, str]:
def attributes(self) -> Dict[str, Optional[str]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dict[str, str] should be correct. Have you found a case where only the key is available?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the case is right above this function at line 1591:
def attr(self, key: str) -> Optional[str]:
attr(key) is called in attributes() at line 1626:
return {n: self.attr(n) for n in attr_names}

@@ -132,7 +140,7 @@ def get_config():
set_config: Set global XGBoost configuration
get_config: Get current values of the global configuration
""")
def config_context(**new_config):
def config_context(**new_config: Any) -> Iterator[None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a more general generator or an iterator?

Copy link
Contributor Author

@bridgream bridgream May 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At first glance, it seems that Generator is more proper than Iterator, because config_context is implemented as a Generator here. However, as config_context is essentially a context manager, it's not wrong to annotate the return type as an Iterator. In fact, Iterator is more general than Generator, as Generator is a subclass of Iterator:
assert issubclass(Generator, Iterator)

For mypy, however, the choice remains controversial as tracked by this open issue: python/typeshed#2772

On a side note, mypy says
The return type of a generator function should be "Generator" or one of its supertypes
when I use a wrong type hint for config_context.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, Iterator is more general than Generator,

Thank you for sharing. I need to remind myself about coroutines. In that case Iterator should be fine.

python-package/xgboost/_typing.py Outdated Show resolved Hide resolved
@@ -1347,7 +1353,7 @@ class Booster:

def __init__(
self,
params: Optional[Dict] = None,
params: Optional[Union[Dict, List]] = None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you have defined an alias in the _typing module.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be fixed together with the previous comment, by introducing BoosterParam.

python-package/xgboost/core.py Outdated Show resolved Hide resolved
python-package/xgboost/training.py Outdated Show resolved Hide resolved
@bridgream
Copy link
Contributor Author

Rebased master. There are 15 mypy errors remaining but they all look a bit complex to tackle.

@trivialfis
Copy link
Member

@bridgream Hi, do you think it's ready to be merged?

@bridgream
Copy link
Contributor Author

Hi @trivialfis , thanks for the commits! Yes, I believe we're ready.

@trivialfis trivialfis merged commit 806c92c into dmlc:master May 17, 2022
@trivialfis
Copy link
Member

Great work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants