DOC Clarify how mixed array input types handled in array api #31452

lucyleeow · 2025-05-29T10:54:47Z

Reference Issues/PRs

Update documentation now that we have reach consensus on these issues:

closes #31286 (I think no more work is required for this as currently all converted metrics are outputting array in the same namespace/device as input)

Towards #28668
Towards #31274

I have not said close for these 2 because we would need to implement conversion of mixed inputs.

What does this implement/fix? Explain your changes.

Clarify in docs:

everything follows X for estimators
everything follows y_pred for scoring functions
when a scoring function outputs several values, an array of the same namespace and device is output

I realise that this update to the docs is not yet true, but I think this okay as I don't think this will get into 1.7, thus will be in dev docs only, and I think we will have implemented it by next release.
Otherwise, happy to leave this PR and implement the mixed input type conversion first.

Any other comments?

I tried to give reasons as to why we made these decisions. Hopefully it is clear and concise (and correct...), but happy to change/fix.

cc @ogrisel @betatim maybe @StefanieSenger

github-actions · 2025-05-29T10:55:44Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

_{Generated for commit: 22774b5. Link to the linter CI: here}

StefanieSenger

Thanks for your work @lucyleeow!

I like the clarity of how this new addition is written and I am sure it will help many users. I have left a few minor comments, and I have some more general comments.

What about starting this section with a fast to grasp takeaway for our user's mental model, that we have taken months to uncover: To clearly state that other array inputs follow X and y_pred in terms of array library and device?
Another feedback would be to re-think the heading of this paragraph. It now not only talks about return types, but also on the internal processing of inputs that follow X and y_pred. Users interested in that might not expect this information under this heading.
And, there is something about a sub-part that I do not understand (and which might also happen to users): In lines 194-212 you showcase an example, where Xs namespace is changed during the process of going through a pipeline and there are several things unclear to me:

When a pipeline starts out on user input with X and y both being on CPU, then I would not expect scikit-learn to take any measures on automatically processing these on GPU for intermediate steps, since we don't even know if users have GPU...
At first sight it seems that the y follows X principle here is broken and upon thinking about it, I could then see that this part is to demonstrate how y being on the different namespace on the beginning of a new step of a pipeline demonstrates exactly how it follows X without that the user needs to take care of it. But right now, there is only a very subtle hint on this in this paragraph and this idea is not clearly expressed.
When you write "Note that scikit-learn pipelines do not allow transformation of y (to avoid
:ref:leakage <data_leakage>).": does that even include changes in which namespace y is stored, is that the reason you bring this up here?
I wonder, why X and y both need to be on CPU for TargetEncoder? Is there a technical reason or a more goal-oriented reason in terms of performance?

Oh my, that has gotten to become a lot of feedback. 😬 (sorry)

doc/modules/array_api.rst

StefanieSenger · 2025-05-29T11:45:20Z

doc/modules/array_api.rst

+The `predict` and `transform` method subsequently expect
 inputs from the same array library and device as the data passed to the `fit`
 method.


Maybe mention that as of now, scikit-learn does not interfere with how array libraries handle arrays from a different namespace or device passed to predict or transform. Something like:

"If arrays from a different library or on a different device are passed, behavior depends on the array library: it may raise an error or silently convert them."

Sorry I am confused here, why would an array library be doing things within our predict and transform methods? 🤔

When arrays from different namespaces are passed into a function, then the namespace that the function comes from determines the handling. For instance in xp.add(array1, array2 ) the namespace of xp could be torch and array1 is a torch array, but array2 is not.Then torch will handle this or raise an error.

Okay you've raised some good points 🤔 I'm going to raise these back in the issue

doc/modules/array_api.rst

lucyleeow · 2025-05-30T06:21:29Z

These are very good suggestions. I have tried to address them all.

then I would not expect scikit-learn to take any measures on automatically processing these on GPU for intermediate steps, since we don't even know if users have GPU...

Sorry I have made it more explicit now. The function transformer step is: FunctionTransformer(partial(torch.asarray, device="cuda:0")), - which explicitly moves X to GPU.
For reference I was mostly copying Oliviers example here: #31274 (comment)

I wonder, why X and y both need to be on CPU for TargetEncoder? Is there a technical reason or a more goal-oriented reason in terms of performance?

In my mind, the categorical (string) data in X was being target encoded to numbers in TargetEncoder so X has to be on CPU (as due to it containing string data), and y "needs" to be on the same device as X for the estimator to work (TargetEncoder also uses y) - though if we are doing everything follows X, y technically doesn't need to be on CPU. I will try to make things clearer. I was worried I was going into too much detail and making things too long 🤔

When you write "Note that scikit-learn pipelines do not allow transformation of y (to avoid
:ref:leakage <data_leakage>).": does that even include changes in which namespace y is stored, is that the reason you bring this up here?

Yes, pipelines can't touch y.

betatim · 2025-06-02T13:07:17Z

doc/modules/array_api.rst

+Estimators and scoring functions are able to accept input arrays
+from different array libraries and/or devices. When mixed input arrays are
+passed, scikit-learn silently converted to make them all consistent.


I've just realised that this means that I can/should/could call some_estimator.fit(X_torch_on_gpu0, y_numpy, sample_weights=weights_cupy_gpu1) or some other more obscure combination. Feels like mayhem, but I guess that is "fine"? Maybe somewhere in a guide we can mention that this is technically allowed but that maybe people shouldn't do this?

Suggestion (copied and slightly adjusted from the numpy docs):

.. warning:: While the mixing of array types from different array libraries may be convenient, it is not recommended. It might have unexpected behavior in corner cases. Users should prefer explicitly passing matching types whenever possible.

I had exactly the same thought. The other consideration is that, currently implementation wise:

supported arrays -> numpy (_convert_to_numpy does a reasonable job)
numpy -> supported arrays (xp.asarray is reasonable)

supported array -> different supported array (you need to convert to numpy and then convert to the desired array type)

I am wondering if we should issue a warning when doing the latter (especially as it requires 2 conversions)?

Also I have been meaning to check dlpack support in torch and cupy (I heard it is supported in cupy v13.x), which could help our decision. Though even when it is supported, what do we want to do about min versions...?

doc/modules/array_api.rst

betatim · 2025-06-02T13:11:41Z

doc/modules/array_api.rst

+For scoring functions the rule is **"everything follows `y_pred`"** - mixed array
+inputs are converted so they are all match the array library and device of `y_pred`.
+
+When a function or method has been called with Array API compatible inputs, the
 convention is to return array values of the same array container type and


How about using "array container type" as a way to make it clear that we are talking about numpy, cupy, pytorch, etc arrays and not "an array of float32s"?

The sentences above are already quite long, so I'm not sure if we should do this.

I find the expression "array container type" (which is already in the sentence) quite vague. In other parts of the docs we have talked about "array library" and "array namespace", so maybe like that (?):

Suggested change

convention is to return array values of the same array container type and

convention is to return a type from the same array library and

Thanks all, I used:

Suggested change

convention is to return array values of the same array container type and

convention is to return arrays from the same array library and on the same

device as the input data.

We rarely return more than one array but I went for plural. Suggestions welcome!

doc/modules/array_api.rst

betatim · 2025-06-02T13:19:52Z

doc/modules/array_api.rst

+This allows estimators to accept mixed input types, enabling `X` to be moved
+to a different device within a pipeline, without explicitly moving `y`.
+Note that scikit-learn pipelines do not allow transformation of `y` (to avoid
+:ref:`leakage <data_leakage>`).


Suggested change

This allows estimators to accept mixed input types, enabling `X` to be moved

to a different device within a pipeline, without explicitly moving `y`.

Note that scikit-learn pipelines do not allow transformation of `y` (to avoid

:ref:`leakage <data_leakage>`).

This behaviour enables pipelines to switch from processing on

the CPU to processing on the GPU at a specific point in the pipeline.

This is still a bit clunky :-/ I think it is enough to mention the use-case and then show the example below, which contains a longer explanation/reminder about the fact that you can't move y.

What about:
"This behaviour enables pipelines to switch from processing on
the CPU to processing on the GPU within the pipeline."
?

Maybe even:

This behaviour enables switching from processing on the CPU to processing on the GPU at any point within a pipeline.

Nice, changed to this!

doc/modules/array_api.rst

ogrisel

Some partial feedback.

doc/modules/array_api.rst

StefanieSenger

These are cool improvements, thank you @lucyleeow!
Upon reading through, I came across some minor things, that could be improved.

StefanieSenger · 2025-06-03T07:11:32Z

doc/modules/array_api.rst

+For scoring functions the rule is **"everything follows `y_pred`"** - mixed array
+inputs are converted so they are all match the array library and device of `y_pred`.
+
+When a function or method has been called with Array API compatible inputs, the
 convention is to return array values of the same array container type and


I find the expression "array container type" (which is already in the sentence) quite vague. In other parts of the docs we have talked about "array library" and "array namespace", so maybe like that (?):

Suggested change

convention is to return array values of the same array container type and

convention is to return a type from the same array library and

doc/modules/array_api.rst

lucyleeow · 2025-06-16T02:11:04Z

Thanks for all the reviews! Changes made!

StefanieSenger

(Checked through the whole change again: it looks very good!)

That's a clear enhancement to the docs that also adds reasoning and usage examples. It's going to be very helpful to the users, thank you!

betatim

Looks good to me. This is a good improvement over what we have so far

ogrisel

Thanks for the PR. I think the new sections would deserve to be moved one level up in the header hierarchy.

Besides the following comments, LGTM.

doc/modules/array_api.rst

ogrisel · 2025-06-18T14:29:26Z

doc/modules/array_api.rst

+This allows estimators to accept mixed input types, enabling `X` to be moved
+to a different device within a pipeline, without explicitly moving `y`.
+Note that scikit-learn pipelines do not allow transformation of `y` (to avoid
+:ref:`leakage <data_leakage>`).


Maybe even:

This behaviour enables switching from processing on the CPU to processing on the GPU at any point within a pipeline.

lucyleeow · 2025-06-23T05:38:13Z

Thanks all for your input and reviews. Changes made.

I am wondering if we should wait until we have actually implemented mixed array handling before merging this PR?
However, I also don't want to lose momentum and make people re-review/keep this PR updated with main.

I am okay with either option.

ogrisel · 2025-06-26T13:35:22Z

I think we should merge to avoid loosing momentum. Maybe we could open a meta issue to track implementation of what is now recommended in the doc (maybe with a new common test or an extension of the existing tests).

lucyleeow added 2 commits May 29, 2025 15:54

doc mixed input

24d8f1a

wording

e2fc189

lucyleeow added the Array API label May 29, 2025

github-actions bot added the Documentation label May 29, 2025

lucyleeow mentioned this pull request May 29, 2025

Add array API support to median_absolute_error #31406

Merged

StefanieSenger reviewed May 29, 2025

View reviewed changes

review

5833e9a

lucyleeow commented May 30, 2025

View reviewed changes

doc/modules/array_api.rst Show resolved Hide resolved

lucyleeow mentioned this pull request Jun 2, 2025

Automatically move y (and sample_weight) to the same device and namespace as X #28668

Open