[python-package] [dask] [ci] fix predict() type hints, enforce mypy in CI #7140
base: master
Conversation
    predict_fn,
    chunks=chunks,
    meta=pred_row,
    dtype=dtype,
This dtype argument was brought over from dask-lightgbm (#3515).
The code looks like it's intended to allow setting the dtype of the output, but that's not how it works.
Passing dtype to map_blocks() does not change the dtype of the output. Consider the following:
import dask.array as da
import numpy as np
x = da.arange(6, chunks=3)
x.map_blocks(lambda x: x * 2).compute().dtype
# dtype('int64')
x.map_blocks(lambda x: x * 2, dtype=np.float64).compute().dtype
# dtype('int64')
Instead, it's just there to avoid Dask trying to infer the output dtype of whatever the function passed to map_blocks() returns.
See https://docs.dask.org/en/stable/_modules/dask/array/core.html#map_blocks
dtype
np.dtype, optional
The dtype of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.
It should be safe to allow that type inference, because we're providing the meta input which is the result of calling predict() on a single row of input.
LightGBM/python-package/lightgbm/dask.py
Line 1033 in 80ab6d3
    pred_row = predict_fn(data_row)  # type: ignore[misc]
LightGBM/python-package/lightgbm/dask.py
Lines 1040 to 1043 in 80ab6d3
    return data.map_blocks(
        predict_fn,
        chunks=chunks,
        meta=pred_row,
That's nice because it also avoids needing to encode the logic of which output dtypes match to which mix of input types and raw_score / pred_contrib / pred_leaf.
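To illustrate that (a minimal sketch using only dask.array and numpy, not LightGBM code): when meta is provided, map_blocks() takes the output dtype from it, so a separate dtype argument adds nothing.
import dask.array as da
import numpy as np
x = da.arange(6, chunks=3)
# 'meta' plays the role of pred_row above: the result of applying the function to a tiny sample
meta = np.arange(1) * 2.0
y = x.map_blocks(lambda b: b * 2.0, meta=meta)
y.dtype
# dtype('float64')   <- taken from 'meta', no dtype= needed
y.compute()
# array([ 0.,  2.,  4.,  6.,  8., 10.])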
def run_minimal_test(X_type, y_type, g_type, task, rng):
def _run_minimal_test(*, X_type, y_type, g_type, task, rng):
Just a small cosmetic change... marking this as internal and forcing the use of keyword arguments makes the calls a little stricter and clearer, in my opinion.
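Roughly what that buys us (a tiny sketch with placeholder argument values, not the actual test code):
def _run_minimal_test(*, X_type, y_type, g_type, task, rng):
    ...

# fine: every argument is named at the call site
_run_minimal_test(X_type="numpy", y_type="numpy", g_type="numpy", task="ranking", rng=None)

# TypeError: _run_minimal_test() takes 0 positional arguments ...
_run_minimal_test("numpy", "numpy", "numpy", "ranking", None)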
    nrow = preds.shape[0]
elif isinstance(data, scipy.sparse.csr_matrix):
    preds, nrow = self.__pred_for_csr(
# TODO: remove 'type: ignore[assignment]' when https://github.com/microsoft/LightGBM/pull/6348 is resolved.
After fixing the return type hints on Booster.__pred_for_csr() and similar methods, # type: ignore comments like these are necessary to fix these mypy warnings:
basic.py:1190: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
basic.py:1197: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
basic.py:1234: error: Incompatible types in assignment (expression has type "Any | list[Any]", variable has type "ndarray[tuple[Any, ...], dtype[float64]]") [assignment]
Those are necessary because mypy infers the type of preds from the first assignment to it.
This is the type of complexity that can go away once #6348 is completed (I'm planning to return to that soon).
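A rough illustration of that mypy behavior (simplified stand-ins, not LightGBM's actual signatures):
from typing import Any, List, Union

import numpy as np

def _pred_for_np2d() -> np.ndarray:
    return np.zeros(3)

def _pred_for_csr() -> Union[Any, List[Any]]:
    return [np.zeros(3)]

preds = _pred_for_np2d()   # mypy fixes the type of 'preds' as ndarray here
preds = _pred_for_csr()    # type: ignore[assignment]  # wider type on a later assignment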
def __get_num_preds(
    self,
    *,
Continuing the work I've been doing (e.g. #7111) to enforce more use of keyword-only arguments in internal functions, to make the data flow clearer.
Touching this because some calls to __get_num_preds() were implicated in mypy warnings.
self.__get_num_preds(
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    nrow=int(i),
This cast to int() and the other one like it fix these mypy warnings:
basic.py:1331: error: Argument 3 to "__get_num_preds" of "_InnerPredictor" has incompatible type "signedinteger[_32Bit | _64Bit]"; expected "int" [arg-type]
basic.py:1546: error: Argument 3 to "__get_num_preds" of "_InnerPredictor" has incompatible type "signedinteger[_32Bit | _64Bit]"; expected "int" [arg-type]
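The underlying issue, in isolation (hypothetical names, not LightGBM code): numpy integer scalars are not instances of the builtin int as far as mypy is concerned.
import numpy as np

def _get_num_preds_stub(nrow: int) -> int:
    return nrow * 2

i = np.int64(100)                  # numpy integer scalar, not a builtin int
_get_num_preds_stub(nrow=int(i))   # the int() cast satisfies the 'int' annotation; the value is unchanged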
    data_type: int,
    is_csr: bool,
) -> Union[List[scipy.sparse.csc_matrix], List[scipy.sparse.csr_matrix]]:
) -> _LGBM_PredictSparseReturnType:
This type was wrong.
The output isn't always a list:
LightGBM/python-package/lightgbm/basic.py
Lines 1415 to 1416 in 80ab6d3
if len(cs_output_matrices) == 1:
    return cs_output_matrices[0]
That change results in all the other similar changes to _LGBM_PredictSparseReturnType in this file.
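For context, a sketch of what such an alias can look like (the exact definition in basic.py may differ): the sparse-prediction helpers can return either a single matrix or a list of matrices.
from typing import List, Union

import scipy.sparse

_LGBM_PredictSparseReturnType = Union[
    scipy.sparse.csc_matrix,
    scipy.sparse.csr_matrix,
    List[scipy.sparse.csc_matrix],
    List[scipy.sparse.csr_matrix],
]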
| f"predict(X) for lightgbm.dask estimators should always return an array, not '{type(result)}', when X is a pandas Dataframe. " | ||
| "If you're seeing this message, it's a bug in lightgbm. Please report it at https://github.com/microsoft/LightGBM/issues." | ||
| ) | ||
| assert hasattr(result, "shape"), error_msg |
Resolves this:
dask.py:894: error: Item "list[Any]" of "ndarray[tuple[Any, ...], dtype[Any]] | Any | list[Any]" has no attribute "shape" [union-attr]
We know that predict() can only return a list if the input is a scipy sparse matrix, but mypy doesn't. It sees that a list is technically a possible return type, and correctly warns that a list doesn't have a .shape attribute.
This type of workaround can be removed when #6348 is completed.
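The pattern in isolation (simplified: the PR asserts hasattr(result, "shape") with a descriptive message; this sketch uses an isinstance() check to show how an assert narrows the union for mypy):
from typing import Any, List, Union

import numpy as np

def _predict_stub() -> Union[np.ndarray, List[Any]]:
    return np.zeros((4, 2))

result = _predict_stub()
assert not isinstance(result, list), "predict(X) should return an array here"
nrow = result.shape[0]   # 'list' has been ruled out, so no union-attr warning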
# use a small sub-sample (to keep the tests fast)
if output.startswith("dataframe"):
    dX_sample = dX.sample(frac=0.001)
else:
    dX_sample = dX[:1,]
dX_sample.persist()
In my local testing (macOS, dask==2024.11.2), this cut the total time for all new test cases here from 65s to around 10s.
    pred_proba: bool = False,
    pred_leaf: bool = False,
    pred_contrib: bool = False,
    dtype: _PredictionDtype = np.float32,
I started looking into all of this dtype handling in the Dask estimators because of this mypy warning:
dask.py:1293: error: Argument "dtype" to "_predict" has incompatible type "dtype[Any]"; expected "type[floating[_32Bit]] | type[float64] | type[signedinteger[_32Bit]] | type[signedinteger[_64Bit]]" [arg-type]
# * regression: float64
#
if task.endswith("classification"):
    assert preds.dtype == np.int64
On AppVeyor, these are all int32:
# default predictions:
#
# * classification: int64
# * ranking: float64
# * regression: float64
#
if task.endswith("classification"):
> assert preds.dtype == np.int64
E AssertionError: assert dtype('int32') == <class 'numpy.int64'>
E + where dtype('int32') = array([2, 0, 1, ..., 1, 1, 1]).dtype
E + and <class 'numpy.int64'> = np.int64
tests\python_package_test\test_sklearn.py:2012: AssertionError
But not on the other Windows Python jobs here: https://github.com/microsoft/LightGBM/actions/runs/21328342343/job/61389234344?pr=7140
I suspect that's not about Windows, but about the older versions of Python or numpy being used there.
GitHub Actions:
- Python 3.13
- numpy==2.4.1
- pandas==3.0.0

AppVeyor:
- Python 3.9
- numpy==1.22.4
- pandas==1.3.5
This might be specific to Windows or to something else about the AppVeyor builds. I'm not able to reproduce it using the exact same library versions on my Mac.
I can reproduce this on my Mac 😕
conda create \
-y \
-n lgb-dev-py3.9 \
--file .ci/conda-envs/ci-core-py39.txt \
python=3.9
source activate lgb-dev-py3.9
cmake -B build -S .
cmake --build build --target _lightgbm -j4
sh build-python.sh --precompile install
pytest 'tests/python_package_test/test_sklearn.py::test_classification_and_regression_minimally_work_with_all_accepted_data_types'
# ==== 108 passed, 136 warnings in 5.33s ====
OK, I have most of an answer, though I'm unsure yet why this happens on AppVeyor and not GitHub Actions.
Looks to me like the output of _InnerPredictor.__inner_predict_np2d() will always be np.float64 if a pre-allocated array is not passed.
LightGBM/python-package/lightgbm/basic.py
Lines 1290 to 1291 in 80ab6d3
if preds is None:
    preds = np.empty(n_preds, dtype=np.float64)
That's only called in _InnerPredictor.__pred_for_np2d(), which does not pre-allocate and so does not change the output dtype:
LightGBM/python-package/lightgbm/basic.py
Lines 1349 to 1355 in 80ab6d3
return self.__inner_predict_np2d(
    mat=mat,
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    predict_type=predict_type,
    preds=None,
)
That's only called in _InnerPredictor.predict():
LightGBM/python-package/lightgbm/basic.py
Lines 1199 to 1205 in 80ab6d3
elif isinstance(data, np.ndarray):
    preds, nrow = self.__pred_for_np2d(
        mat=data,
        start_iteration=start_iteration,
        num_iteration=num_iteration,
        predict_type=predict_type,
    )
And the only other code that runs after it is this, which doesn't apply to normal (default) classification predictions, so it doesn't change the dtype either:
LightGBM/python-package/lightgbm/basic.py
Lines 1236 to 1244 in 80ab6d3
if pred_leaf:
    preds = preds.astype(np.int32)
is_sparse = isinstance(preds, (list, scipy.sparse.spmatrix))
if not is_sparse and (preds.size != nrow or pred_leaf or pred_contrib):
    if preds.size % nrow == 0:
        preds = preds.reshape(nrow, -1)
    else:
        raise ValueError(f"Length of predict result ({preds.size}) cannot be divide nrow ({nrow})")
return preds
That's only called in Booster.predict(), which ALSO does not change the type:
LightGBM/python-package/lightgbm/basic.py
Lines 4756 to 4765 in 80ab6d3
return predictor.predict(
    data=data,
    start_iteration=start_iteration,
    num_iteration=num_iteration,
    raw_score=raw_score,
    pred_leaf=pred_leaf,
    pred_contrib=pred_contrib,
    data_has_header=data_has_header,
    validate_features=validate_features,
)
So the type for classification predictions must be getting changed to int32 / int64 when scikit-learn code transforms the output of Booster.predict() into classes. Here's how that goes:
LGBMClassifier.predict_proba() returns the predictions in probability form:
LightGBM/python-package/lightgbm/sklearn.py
Line 1704 in 80ab6d3
result = super().predict(
That's passed to ._le.inverse_transform():
LightGBM/python-package/lightgbm/sklearn.py
Lines 1684 to 1688 in 80ab6d3
if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
    return result
else:
    class_index = np.argmax(result, axis=1)
    return self._le.inverse_transform(class_index)
That attribute comes from scikit-learn: it's inherited from scikit-learn's classes, and it's an sklearn.preprocessing.LabelEncoder.
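A rough sketch of that conversion (current scikit-learn; the resulting dtype can vary by platform and version, which is the point of this investigation):
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(np.array([0, 1, 2]))
proba = np.array([[0.1, 0.7, 0.2],
                  [0.8, 0.1, 0.1]])          # float64, like the raw Booster.predict() output
class_index = np.argmax(proba, axis=1)       # integer indices
preds = le.inverse_transform(class_index)    # integer class labels; dtype follows le.classes_
preds.dtype
# dtype('int64') on most 64-bit platforms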
In scikit-learn 1.0, it looks like LabelEncoder.inverse_transform() called _inverse_binarize_thresholding(), which did roughly this:
y_inv = _inverse_binarize_thresholding(
    Y, self.y_type_, self.classes_, threshold
)
(scikit-learn/scikit-learn, sklearn/preprocessing/_label.py)
That used dtype=int (as of scikit-learn/scikit-learn#17687):
y = np.array(y > threshold, dtype=int)
(https://github.com/scikit-learn/scikit-learn/blame/1.0.X/sklearn/preprocessing/_label.py#L650)
I'm guessing that in numpy==1.22.4 (the version getting downloaded in the AppVeyor jobs), int on some platforms mapped to np.int32 and on others to np.int64.
Looks like a good hint here: https://numpy.org/devdocs/numpy_2_0_migration_guide.html#windows-default-integer
The default integer used by NumPy is now 64bit on all 64bit systems (and 32bit on 32bit systems). For historic reasons related to Python 2 it was previously equivalent to the C long type. The default integer is now equivalent to np.intp.
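A quick way to see that difference (output depends on platform and NumPy version):
import numpy as np

np.dtype(int)
# int32 on 64-bit Windows with NumPy < 2.0 (the C 'long'); int64 there with NumPy >= 2.0
np.array([1, 2, 3]).dtype
# the same default integer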
Closes #3756
Closes #3867
This PR:
- fixes some mypy warnings and enforces mypy in CI
- fixes predict() return type hints for scipy sparse matrices
- removes the dtype argument in internal Dask prediction functions
- fixes type hints for the predict() methods on Dask and scikit-learn estimators

Fixes these mypy warnings: