FEA D2 Brier Score #28971
Conversation
#22046 seems like a blocker
@lorentzenchr @ogrisel I updated this PR
Thanks for the PR. Here is a first pass of feedback.
Note: once this PR is merged, feel free to follow up with a PR to add array API support.
```python
>>> y_true = [0, 1, 1, 0]
>>> y_pred = [0.15, 0.9, 0.85, 0.25]
>>> d2_brier_score(y_true, y_pred)
0.882...
```
I think it would be interesting to see the D2 Brier score values for the exact same three toy data cases used in the D2 log loss code snippets above.
I just saw that Christian said the opposite above. For educational reasons, I find it interesting to show the three cases (good, bad, and very bad models) to demonstrate that the value can be negative.
It's very often the case that readers just read the code snippets in the doc and ignore the text. If they see a negative value, I suspect that will pique their interest and nudge them into reading why the score can be negative.
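Something like the following sketch (assuming `d2_brier_score` computes `1 - Brier(y_pred) / Brier(null)` with the null model predicting the empirical positive rate; the bad and very bad probability vectors are made up for illustration):

```python
>>> y_true = [0, 1, 1, 0]
>>> # Good model: confident and mostly correct.
>>> d2_brier_score(y_true, [0.15, 0.9, 0.85, 0.25])
0.882...
>>> # Bad model: constant prediction equal to the prevalence, i.e. the null model.
>>> d2_brier_score(y_true, [0.5, 0.5, 0.5, 0.5])
0.0
>>> # Very bad model: confidently wrong, hence worse than the null model.
>>> d2_brier_score(y_true, [0.9, 0.1, 0.2, 0.8])
-1.9...
```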
```python
if _num_samples(y_proba) < 2:
    msg = "D^2 score is not well-defined with less than two samples."
    warnings.warn(msg, UndefinedMetricWarning)
    return float("nan")
```
I think it could be helpful to expand the warning message to give the user a code snippet to either turn this warning into an exception or to silence it:
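For example, something along these lines (just a sketch of what the expanded message could point users to, using the public `warnings` filters and `sklearn.exceptions.UndefinedMetricWarning`):

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning

# Turn the warning into an exception:
warnings.filterwarnings("error", category=UndefinedMetricWarning)

# Or silence it entirely:
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)
```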
We could also expose a `replace_undefined_by` argument, but this can be done in a follow-up PR for all D2 scores.
I think this is the same warning we are raising for all `D^2` metrics and even the `R^2` score. We could change it here, but wouldn't it be better to just leave it as is for consistency? Also, if we do want to update it, how about adding an additional statement instead of a code snippet, mentioning that `y_proba` should have at least two samples for the `D^2` score to provide anything meaningful?
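For instance (the wording is only a sketch, reusing the `warnings` import and `UndefinedMetricWarning` already in scope in the snippet above):

```python
msg = (
    "D^2 score is not well-defined with less than two samples. "
    "y_proba needs at least two samples for the score to be meaningful."
)
warnings.warn(msg, UndefinedMetricWarning)
```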
```python
def test_d2_brier_score():
    sample_weight = [2, 2, 3, 1, 1, 1]
```
Maybe we could parametrize this test with `"use_sample_weight", [True, False]` and consistently pass the `sample_weight` argument everywhere, either with `None` or with `[2, 2, 3, 1, 1, 1]`, as sketched below.
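A minimal sketch of that parametrization (the test body is elided; `d2_brier_score` is the function added in this PR):

```python
import pytest

@pytest.mark.parametrize("use_sample_weight", [True, False])
def test_d2_brier_score(use_sample_weight):
    sample_weight = [2, 2, 3, 1, 1, 1] if use_sample_weight else None
    # ... build y_true / y_proba, then pass sample_weight=sample_weight
    # consistently in every d2_brier_score call and expected-value check.
```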
This will be a bit difficult because the hardcoded null model scores, and the hardcoded `y_proba` that equals the null model, are different with and without sample weights. If we have two parameter values in the test, we would need to add a bunch of if/else statements, which would get awkward. Should we separate this out into binary and multiclass cases?
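To illustrate the difficulty (the labels below are made up; only the sample weights come from the test above): the null model predicts the weighted class prevalence, so any hardcoded expectation changes as soon as weights are involved.

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1, 0])         # hypothetical labels
sample_weight = np.array([2, 2, 3, 1, 1, 1])  # weights from the test above

np.average(y_true)                         # unweighted null model: 0.5
np.average(y_true, weights=sample_weight)  # weighted null model: 0.6
```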
```python
# Also confirm that the order of the labels does not affect the d2_score
labels = [2, 0, 1]
new_d2_score = d2_brier_score(y_true=y_true, y_proba=y_proba, labels=labels)
assert new_d2_score == pytest.approx(d2_score)
```
I think this test function is a bit long. We could split the edge cases where we expect `assert d2_score == 0` into a dedicated function, along the lines of the sketch below.
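A hypothetical split could look like this (the function name and data are made up; it assumes the `1 - Brier(y_pred) / Brier(null)` definition, under which a constant prediction equal to the prevalence scores exactly 0):

```python
import pytest

from sklearn.metrics import d2_brier_score  # added in this PR

def test_d2_brier_score_null_model():
    # A constant prediction equal to the positive-class prevalence
    # matches the null model, so the D^2 score should be 0.
    y_true = [0, 1, 1, 0]
    y_proba = [0.5, 0.5, 0.5, 0.5]
    assert d2_brier_score(y_true=y_true, y_proba=y_proba) == pytest.approx(0.0)
```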
Reference Issues/PRs
Closes #20943
What does this implement/fix? Explain your changes.
Any other comments?
CC: @lorentzenchr