ENH Add support for np.nan values in SplineTransformer #28043
Conversation
The PR looks very good, but it needs to be merged with `main` (there are conflicts in the changelog).

Also, I think the `get_feature_names_out()` method needs to be updated. The tests should be expanded accordingly, maybe to also include a test with `.set_output(transform="pandas")` (this is how I found out that there was a problem with the output feature names). I think we should add support for using those two options together.
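For illustration, a minimal sketch of the kind of test meant here, assuming the `handle_missing="zeros"` option this PR proposes (the option name is not settled):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

# Hypothetical: handle_missing="zeros" is the option proposed in this PR.
spline = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
spline.set_output(transform="pandas")

X = np.array([[1.0], [2.0], [np.nan], [4.0]])
df = spline.fit_transform(X)

# The pandas output path calls get_feature_names_out() under the hood,
# so a mismatch between the names and the produced columns errors here.
assert list(df.columns) == list(spline.get_feature_names_out())
```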
Here is a more in-depth review pass. There is indeed a fundamental problem with the current code: the missingness indicators from the training set (computed when calling `.fit` or `.fit_transform`) should not be stored as an estimator attribute and reapplied to the test set (when calling `.transform`). Instead, the missingness pattern should be extracted from the test set.

See more details below:
sklearn/preprocessing/_polynomial.py

```python
if self.include_bias:
    return XBS
return self._concatenate_indicator(XBS)
```
The missingness indicators computed from the `X` passed to `.transform` (which can be a test set) should be passed as an argument to `_concatenate_indicator` instead of reusing the mask extracted from the training set.
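For illustration, a minimal runnable sketch of the pattern being requested; the class and helper names here are hypothetical, not the PR's actual code:

```python
import numpy as np

class SketchTransformer:
    """Sketch of the fit/transform mask handling being requested."""

    def fit(self, X):
        # Do NOT store np.isnan(X) here: the training-set missingness
        # pattern is irrelevant for later calls to transform().
        return self

    def transform(self, X):
        # Derive the mask from the X currently being transformed,
        # which may be a test set with a different missingness pattern.
        missing_mask = np.isnan(X)
        XBS = np.nan_to_num(X)  # stand-in for the real spline computation
        return self._concatenate_indicator(XBS, missing_mask)

    def _concatenate_indicator(self, XBS, missing_mask):
        # The mask arrives as an argument instead of an estimator attribute.
        return np.hstack([XBS, missing_mask.astype(float)])
```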
Hey @ogrisel, thanks for your review and your help. I went through your comments and was able to resolve most of the issues.

I've named the new option `handle_missing="constant"`, but that's just an idea. I found that `indicator` doesn't fit so well anymore if we don't add an indicator column to X. Though with `constant` as well as with `zeros`, I feel the naming doesn't make it quite clear where in the process the nans become something else (before or after calculating the splines). Maybe we can find a name that conveys that information.
There are quite a few things I am a bit confused about:

Generally, I don't know whether we want SplineTransformer to change or keep its behaviour when nan values are present. If we want it to keep its behaviour, then instead of comparing equality on this test data:

```python
X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])
```

it should maybe rather be

```python
X_nan = np.array([[1, 1], [2, 2], [3, 3], [np.nan, 4], [4, 4]])
X = np.array([[1, 1], [2, 2], [3, 3], [99, 4], [4, 4]])
```

and in that case, the current implementation is wrong. Maybe you can shed some light on this so that I know how to proceed.
I will check the issue with the feature names next.
I was trying to find out about the problem with the feature names that you mentioned here, @ogrisel, but I cannot recreate it. Maybe it was resolved while I worked on the other issues? I tried reusing the code from the existing feature_name_out test, and everything behaves as it should, I believe. But maybe I didn't understand exactly what you ran into.
EDIT: I will try to answer your other questions/comments early next week.
Hey @ogrisel, can you give me some feedback? My current understanding is that if we introduce new 0-values in X_transformed (due to nan values in X), then we also expect different stats for the transformer compared to when no nan values are present. This would mean that we expect (and test for) different outputs in the two cases.
Hi @ogrisel, I have worked on what you proposed, and the tests now all pass.
```python
def __sklearn_tags__(self):
    tags = super().__sklearn_tags__()
    tags.input_tags.allow_nan = self.handle_missing == "zeros"
    return tags
```
This way of defining a tag for nan support is per instance, which I think is most user-friendly and is also needed to pass all the common tests. With a hardcoded `tags.input_tags.allow_nan = True` (or `False`), one of the common tests fails either way, so we cannot do that.
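For illustration, a sketch of how the per-instance tag behaves, assuming the `handle_missing` option proposed in this PR and the `sklearn.utils.get_tags` helper from the new tags API:

```python
from sklearn.preprocessing import SplineTransformer
from sklearn.utils import get_tags

# Hypothetical: handle_missing is the option proposed in this PR.
# The tag value depends on the hyper-parameters of the instance:
get_tags(SplineTransformer(handle_missing="zeros")).input_tags.allow_nan  # True
get_tags(SplineTransformer(handle_missing="error")).input_tags.allow_nan  # False
```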
However, `doc/sphinxext/allow_nan_estimators.py`, which automatically generates a list of nan-accepting estimators, checks the tag with the default parameter settings (almost as if the tags were defined at class level).

I would like to work on a follow-up PR to fix that generation and take cases like this into account. The question is how broad to make it. What do you think, @adrinjalali?
It's a bit sad indeed. But I think we can live with the fact that this list only reflects the default behavior, and that some estimators can be made to accept nans with specific hyper-parameters.
I think we should make sure `_construct_instances` creates at least one instance that supports nan, and instead of checking only the first instance returned by the generator, check them all here:
scikit-learn/doc/sphinxext/allow_nan_estimators.py, lines 22 to 25 in 675736a:

```python
# Here we generate the text only for one instance. This directive
# should not be used for meta-estimators where tags depend on the
# sub-estimator.
est = next(_construct_instances(est_class))
```
But certainly for a separate PR.
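A sketch of what that could look like; `_construct_instances` is a private helper whose import path has moved between versions, so treat the import as an assumption:

```python
from sklearn.utils import get_tags
# Private helper; the import location is not stable across versions.
from sklearn.utils._test_common.instance_generator import _construct_instances

def allows_nan(est_class):
    # List the estimator if any constructed instance advertises nan
    # support, instead of inspecting only the first generated instance.
    return any(
        get_tags(est).input_tags.allow_nan
        for est in _construct_instances(est_class)
    )
```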
LGTM. Thank you very much for the final push @StefanieSenger!
cc @lorentzenchr for a second review.
Overall looks good.

Feedback: the review was a bit harder than necessary due to the renaming of variables.
Thank you @lorentzenchr. I have implemented your suggestions and left comments on two that I would rather leave as they are (the `copy()` and the backticks). See my comments for the reasoning.
```python
# later.
x[mask_inv] = spl.t[self.degree]
outside_range_mask = ~inside_range_mask
x = X[:, feature_idx].copy()
```
I think we do need the copy. That was actually pretty hard for me to figure out, because I was not familiar with views on numpy arrays back then and had naively implemented everything without a copy. I have checked again: without a copy, `x[outside_range_mask] = xmin` modifies `X`, and that matters because we rely on `X` earlier. (I think it is in

```python
x = spl.t[spl.k] + (X[:, feature_idx] - spl.t[spl.k]) % (
    spl.t[n] - spl.t[spl.k]
)
```

above.) We also get a lot of errors without the copy.
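For readers unfamiliar with the pitfall, a standalone numpy illustration of the view-versus-copy behaviour described above:

```python
import numpy as np

X = np.array([[1.0, 10.0], [2.0, 20.0]])

x = X[:, 0]         # basic indexing yields a view into X
x[0] = 99.0         # ...so this assignment writes through to X
print(X[0, 0])      # 99.0 -- X was silently modified

x = X[:, 1].copy()  # an explicit copy is detached from X
x[0] = -1.0
print(X[0, 1])      # 10.0 -- X is untouched
```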
Thanks, @lorentzenchr! I have committed your suggestions. Now there is only one question left about a refactoring, where I find it unclear how far you wanted me to take it.
Reference Issues/PRs
Closes #26793
What does this implement/fix? Explain your changes.
Adds support for np.nan values in SplineTransformer.
Adds `handle_missing : {'error', 'zeros'}` to `__init__`, where `error` preserves the previous behaviour and `zeros` handles nan values by setting their spline values to all 0s.
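A hypothetical usage sketch of the proposed option (`handle_missing` is not part of any released scikit-learn):

```python
import numpy as np
from sklearn.preprocessing import SplineTransformer

X = np.array([[1.0], [2.0], [np.nan], [4.0]])

# handle_missing="error" keeps the previous behaviour (raise on nan);
# handle_missing="zeros" maps nan entries to all-zero spline values.
spline = SplineTransformer(n_knots=3, degree=2, handle_missing="zeros")
XBS = spline.fit_transform(X)
print(XBS[2])  # expected: a row of zeros for the np.nan sample
```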
(Very outdated; should have put it in a separate comment:)

Yet to solve: I believe in `_get_base_knot_positions` I have to prepare `_weighted_percentile` to exclude nan values, similarly to how `np.nanpercentile` excludes nan values for the calculation of the base knots. I tried, but it was quite tricky. Edit: Just found that `np.nanpercentile` will have a `sample_weight` option soon: PR 24254 in numpy.

Should an error also be raised in case the SplineTransformer was instantiated with `handle_missing="error"`, then fitted without missing values, and X then contains missing values in transform?
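If we want that, a minimal sketch of such a guard (hypothetical, just to make the question concrete) could be:

```python
import numpy as np

def _check_missing(X, handle_missing):
    # Hypothetical guard that transform() could apply regardless of
    # whether missing values were seen during fit.
    if handle_missing == "error" and np.isnan(X).any():
        raise ValueError(
            "X contains missing values but handle_missing='error'."
        )
```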