[python-package] [dask] [ci] fix predict() type hints, enforce mypy in CI #7140
base: master
Conversation
    predict_fn,
    chunks=chunks,
    meta=pred_row,
    dtype=dtype,
This dtype argument was brought over from dask-lightgbm (#3515).
The code looks like it's intended to allow setting the dtype of the output, but that's not how it works.
Passing dtype to map_blocks() does not change the dtype of the output. Consider the following:
import dask.array as da
import numpy as np
x = da.arange(6, chunks=3)
x.map_blocks(lambda x: x * 2).compute().dtype
# dtype('int64')
x.map_blocks(lambda x: x * 2, dtype=np.float64).compute().dtype
# dtype('int64')
Instead, it's just there to avoid Dask trying to infer the output dtype of whatever the function passed to map_blocks() returns.
See https://docs.dask.org/en/stable/_modules/dask/array/core.html#map_blocks
dtype
np.dtype, optional
The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.
It should be safe to allow that type inference, because we're providing the meta input which is the result of calling predict() on a single row of input.
LightGBM/python-package/lightgbm/dask.py
Line 1033 in 80ab6d3
    pred_row = predict_fn(data_row)  # type: ignore[misc]
LightGBM/python-package/lightgbm/dask.py
Lines 1040 to 1043 in 80ab6d3
    return data.map_blocks(
        predict_fn,
        chunks=chunks,
        meta=pred_row,
That's nice because it also avoids needing to encode the logic of which output dtypes match to which mix of input types and raw_score / pred_contrib / pred_leaf.
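To illustrate that (a minimal sketch using only dask.array and numpy, not LightGBM code): when meta is provided, map_blocks() takes the output dtype from it, so a separate dtype argument adds nothing.
import dask.array as da
import numpy as np
x = da.arange(6, chunks=3)
# 'meta' plays the role of pred_row above: the result of applying the function to a tiny sample
meta = np.arange(1) * 2.0
y = x.map_blocks(lambda b: b * 2.0, meta=meta)
y.dtype
# dtype('float64')   <- taken from 'meta', no dtype= needed
y.compute()
# array([ 0.,  2.,  4.,  6.,  8., 10.])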
def run_minimal_test(X_type, y_type, g_type, task, rng):
def _run_minimal_test(*, X_type, y_type, g_type, task, rng):
Just a small cosmetic change... marking this as internal and forcing the use of keyword arguments makes the calls a little stricter and clearer, in my opinion.
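Roughly what that buys us (a tiny sketch with placeholder argument values, not the actual test code):
def _run_minimal_test(*, X_type, y_type, g_type, task, rng):
    ...

# fine: every argument is named at the call site
_run_minimal_test(X_type="numpy", y_type="numpy", g_type="numpy", task="ranking", rng=None)

# TypeError: _run_minimal_test() takes 0 positional arguments ...
_run_minimal_test("numpy", "numpy", "numpy", "ranking", None)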
    nrow = preds.shape[0]
elif isinstance(data, scipy.sparse.csr_matrix):
    preds, nrow = self.__pred_for_csr(
# TODO: remove 'type: ignore[assignment]' when https://github.com/microsoft/LightGBM/pull/6348 is resolved.
After fixing the return type hints on Booster.__pred_for_csr() and similar methods, # type: ignore comments like these are necessary to fix these mypy warnings:
basic.py:1190: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
basic.py:1197: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
basic.py:1234: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
Those are necessary because mypy infers the type of preds from the first assignment to it.
This is the type of complexity that can go away once #6348 is completed (I'm planning to return to that soon).
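A rough illustration of that mypy behavior (simplified stand-ins, not LightGBM's actual signatures):
from typing import Any, List, Union

import numpy as np

def _pred_for_np2d() -> np.ndarray:
    return np.zeros(3)

def _pred_for_csr() -> Union[Any, List[Any]]:
    return [np.zeros(3)]

preds = _pred_for_np2d()   # mypy fixes the type of 'preds' as ndarray here
preds = _pred_for_csr()    # type: ignore[assignment]  # wider type on a later assignment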
def __get_num_preds(
    self,
    *,
Continuing the work I've been doing (e.g. #7111) to enforce more use of keyword-only arguments in internal functions, to make the data flow clearer.
Touching this because some calls to __get_num_preds() were implicated in mypy warnings.
self.__get_num_preds(
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    nrow=int(i),
This cast to int() and the other one like it fix these mypy warnings:
basic.py:1331: error: Argument 3 to "__get_num_preds" of "_InnerPredictor" has incompatible type "signedinteger[_32Bit | _64Bit]"; expected "int" [arg-type]
basic.py:1546: error: Argument 3 to "__get_num_preds" of "_InnerPredictor" has incompatible type "signedinteger[_32Bit | _64Bit]"; expected "int" [arg-type]
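The underlying issue, in isolation (hypothetical names, not LightGBM code): numpy integer scalars are not instances of the builtin int as far as mypy is concerned.
import numpy as np

def _get_num_preds_stub(nrow: int) -> int:
    return nrow * 2

i = np.int64(100)                  # numpy integer scalar, not a builtin int
_get_num_preds_stub(nrow=int(i))   # the int() cast satisfies the 'int' annotation; the value is unchanged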
    data_type: int,
    is_csr: bool,
) -> Union[List[scipy.sparse.csc_matrix], List[scipy.sparse.csr_matrix]]:
) -> _LGBM_PredictSparseReturnType:
This type was wrong.
The output isn't always a list:
LightGBM/python-package/lightgbm/basic.py
Lines 1415 to 1416 in 80ab6d3
if len(cs_output_matrices) == 1:
    return cs_output_matrices[0]
That change results in all the other similar changes to _LGBM_PredictSparseReturnType in this file.
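For context, a sketch of what such an alias can look like (the exact definition in basic.py may differ): the sparse-prediction helpers can return either a single matrix or a list of matrices.
from typing import List, Union

import scipy.sparse

_LGBM_PredictSparseReturnType = Union[
    scipy.sparse.csc_matrix,
    scipy.sparse.csr_matrix,
    List[scipy.sparse.csc_matrix],
    List[scipy.sparse.csr_matrix],
]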
| f"predict(X) for lightgbm.dask estimators should always return an array, not '{type(result)}', when X is a pandas Dataframe. " | ||
| "If you're seeing this message, it's a bug in lightgbm. Please report it at https://github.com/microsoft/LightGBM/issues." | ||
| ) | ||
| assert hasattr(result, "shape"), error_msg |
Resolves this:
dask.py:894: error: Item "list[Any]" of "ndarray[tuple[Any, ...], dtype[Any]] | Any | list[Any]" has no attribute "shape" [union-attr]
We know that predict() can only return a list if the input is a scipy sparse matrix, but mypy doesn't. It sees that a list is technically a possible return type, and correctly warns that a list doesn't have a .shape attribute.
This type of workaround can be removed when #6348 is completed.
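The pattern in isolation (simplified: the PR asserts hasattr(result, "shape") with a descriptive message; this sketch uses an isinstance() check to show how an assert narrows the union for mypy):
from typing import Any, List, Union

import numpy as np

def _predict_stub() -> Union[np.ndarray, List[Any]]:
    return np.zeros((4, 2))

result = _predict_stub()
assert not isinstance(result, list), "predict(X) should return an array here"
nrow = result.shape[0]   # 'list' has been ruled out, so no union-attr warning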
# use a small sub-sample (to keep the tests fast)
if output.startswith("dataframe"):
    dX_sample = dX.sample(frac=0.001)
else:
    dX_sample = dX[:1,]
dX_sample.persist()
In my local testing (macOS, dask==2024.11.2), this cut the total time for all new test cases here from 65s to around 10s.
    pred_proba: bool = False,
    pred_leaf: bool = False,
    pred_contrib: bool = False,
    dtype: _PredictionDtype = np.float32,
I started looking into all of this dtype handling in the Dask estimators because of this mypy warning:
dask.py:1293: error: Argument "dtype" to "_predict" has incompatible type "dtype[Any]"; expected "type[floating[_32Bit]] | type[float64] | type[signedinteger[_32Bit]] | type[signedinteger[_64Bit]]" [arg-type]
# * regression: float64
#
if task.endswith("classification"):
    assert preds.dtype == np.int64
On AppVeyor, these are all int32:
# default predictions:
#
# * classification: int64
# * ranking: float64
# * regression: float64
#
if task.endswith("classification"):
> assert preds.dtype == np.int64
E AssertionError: assert dtype('int32') == <class 'numpy.int64'>
E + where dtype('int32') = array([2, 0, 1, ..., 1, 1, 1]).dtype
E + and <class 'numpy.int64'> = np.int64
tests\python_package_test\test_sklearn.py:2012: AssertionError
But not on the other Windows Python jobs here: https://github.com/microsoft/LightGBM/actions/runs/21328342343/job/61389234344?pr=7140
I suspect that's not about Windows, but about the older versions of Python or numpy being used there.
GitHub Actions:
- Python 3.13
- numpy==2.4.1
- pandas==3.0.0

AppVeyor:
- Python 3.9
- numpy==1.22.4
- pandas==1.3.5
This might be specific to Windows or to something else about the AppVeyor builds. I'm not able to reproduce it using the exact same library versions on my Mac.
I can reproduce this on my Mac 😕
conda create \
-y \
-n lgb-dev-py3.9 \
--file .ci/conda-envs/ci-core-py39.txt \
python=3.9
source activate lgb-dev-py3.9
cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh --precompile install
pytest 'tests/python_package_test/test_sklearn.py::test_classification_and_regression_minimally_work_with_all_accepted_data_types'
# ==== 108 passed, 136 warnings in 5.33s ====
OK, I have most of an answer, though I'm unsure yet why this happens on AppVeyor and not GitHub Actions.
Looks to me like the output of _InnerPredictor.__inner_predict_np2d() will always be np.float64 if a pre-allocated array is not passed.
LightGBM/python-package/lightgbm/basic.py
Lines 1290 to 1291 in 80ab6d3
if preds is None:
    preds = np.empty(n_preds, dtype=np.float64)
That's only called in _InnerPredictor.__pred_for_np2d(), which does not pre-allocate and so does not change the output dtype:
LightGBM/python-package/lightgbm/basic.py
Lines 1349 to 1355 in 80ab6d3
return self.__inner_predict_np2d(
    mat=mat,
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    predict_type=predict_type,
    preds=None,
)
That's only called in _InnerPredictor.predict():
LightGBM/python-package/lightgbm/basic.py
Lines 1199 to 1205 in 80ab6d3
elif isinstance(data, np.ndarray):
    preds, nrow = self.__pred_for_np2d(
        mat=data,
        start_iteration=start_iteration,
        num_iteration=num_iteration,
        predict_type=predict_type,
    )
And the only other code that runs after it is this, which doesn't apply to normal (default) classification predictions, so it doesn't change the dtype either:
LightGBM/python-package/lightgbm/basic.py
Lines 1236 to 1244 in 80ab6d3
if pred_leaf:
    preds = preds.astype(np.int32)
is_sparse = isinstance(preds, (list, scipy.sparse.spmatrix))
if not is_sparse and (preds.size != nrow or pred_leaf or pred_contrib):
    if preds.size % nrow == 0:
        preds = preds.reshape(nrow, -1)
    else:
        raise ValueError(f"Length of predict result ({preds.size}) cannot be divide nrow ({nrow})")
return preds
That's only called in Booster.predict(), which ALSO does not change the type:
LightGBM/python-package/lightgbm/basic.py
Lines 4756 to 4765 in 80ab6d3
return predictor.predict(
    data=data,
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    raw_score=raw_score,
    pred_leaf=pred_leaf,
    pred_contrib=pred_contrib,
    data_has_header=data_has_header,
    validate_features=validate_features,
)
So the type for classification predictions must be getting changed to int32 / int64 when scikit-learn code transforms the output of Booster.predict() into classes. Here's how that goes:
LGBMClassifier.predict_proba() returns the predictions in probability form:
LightGBM/python-package/lightgbm/sklearn.py
Line 1704 in 80ab6d3
result = super().predict(
That's passed to ._le.inverse_transform():
LightGBM/python-package/lightgbm/sklearn.py
Lines 1684 to 1688 in 80ab6d3
if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
    return result
else:
    class_index = np.argmax(result, axis=1)
    return self._le.inverse_transform(class_index)
That attribute comes from scikit-learn: it's inherited from scikit-learn's classes, and it's an sklearn.preprocessing.LabelEncoder.
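A rough sketch of that conversion (current scikit-learn; the resulting dtype can vary by platform and version, which is the point of this investigation):
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(np.array([0, 1, 2]))
proba = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])          # float64, like the raw Booster.predict() output
class_index = np.argmax(proba, axis=1)       # integer indices
preds = le.inverse_transform(class_index)    # integer class labels; dtype follows le.classes_
preds.dtype
# dtype('int64') on most 64-bit platforms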
In scikit-learn 1.0, it looks like LabelEncoder.inverse_transform() called _inverse_binarize_thresholding(), which did roughly this:
y_inv = _inverse_binarize_thresholding(
    Y, self.y_type_, self.classes_, threshold
)
(scikit-learn/scikit-learn, sklearn/preprocessing/_label.py)
That used dtype=int (as of scikit-learn/scikit-learn#17687):
y = np.array(y > threshold, dtype=int)
(https://github.com/scikit-learn/scikit-learn/blame/1.0.X/sklearn/preprocessing/_label.py#L650)
I'm guessing that in numpy==1.22.4 (the version getting downloaded in the AppVeyor jobs), int on some platforms mapped to np.int32 and on others to np.int64.
Looks like a good hint here: https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer
The default integer used by NumPy is now 64bit on all 64bit systems (and 32bit on 32bit systems). For historic reasons related to Python 2 it was previously equivalent to the C long type. The default integer is now equivalent to np.intp.
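A quick way to see that difference (output depends on platform and NumPy version):
import numpy as np

np.dtype(int)
# int32 on 64-bit Windows with NumPy < 2.0 (the C 'long'); int64 there with NumPy >= 2.0
np.array([1, 2, 3]).dtype
# the same default integer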
Closes #3756
Closes #3867
This PR:
- fixes some mypy warnings and enforces mypy in CI
- fixes predict() return type hints for scipy sparse matrices
- removes the dtype argument in internal Dask prediction functions
- fixes type hints for the predict() methods on Dask and scikit-learn estimators

Fixes these mypy warnings: