
FEA D2 Brier Score #28971


Open · wants to merge 15 commits into main

Conversation

OmarManzoor (Contributor)

Reference Issues/PRs

Closes #20943

What does this implement/fix? Explain your changes.

  • Adds the D2 Brier score, which is the D2 score (1 - model loss / null model loss) applied to brier_score_loss; see the sketch below.
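As an editorial illustration (not part of the PR itself), here is a minimal sketch of that definition, assuming binary labels and predicted probabilities of the positive class; it reproduces the docstring example quoted further down:

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0.15, 0.9, 0.85, 0.25])

# Null model: always predict the empirical frequency of the positive class.
y_null = np.full_like(y_pred, y_true.mean())

# D^2 = 1 - loss(model) / loss(null model); 1.0 is a perfect model,
# 0.0 is no better than the null model.
d2 = 1 - brier_score_loss(y_true, y_pred) / brier_score_loss(y_true, y_null)
print(d2)  # 0.8825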

Any other comments?

CC: @lorentzenchr


github-actions bot commented May 7, 2024

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 4b6936c. Link to the linter CI: here

@OmarManzoor changed the title from "D2 Brier Score" to "FEA D2 Brier Score" on May 7, 2024
@lorentzenchr (Member)

#22046 seems like a blocker

@OmarManzoor (Contributor Author)

@lorentzenchr @ogrisel I updated this PR

@ogrisel (Member) left a comment


Thanks for the PR. Here is a first pass of feedback.

Note: once this PR is merged, feel free to follow up with a PR to add array API support.

>>> y_true = [0, 1, 1, 0]
>>> y_pred = [0.15, 0.9, 0.85, 0.25]
>>> d2_brier_score(y_true, y_pred)
0.882...
Member

I think it would be interesting to see the D2 Brier score values for the exact same three toy data cases used in the D2 log loss code snippets above.

Member

I just saw that Christian said the opposite above. For educational reasons, I find it interesting to show all three cases (good, bad, and very bad models), to see that the value can be negative.

It's very often the case that readers just read the code snippets in the docs and ignore the text. If they see a negative value, I suspect that will pique their interest and nudge them into reading about why the score can be negative.
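As an editorial sketch of those three cases, computed directly from the D^2 definition (1 - Brier(model) / Brier(null)) rather than through the final d2_brier_score API, so the snippet eventually merged into the docs may differ:

import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
# Null model: constant prediction at the empirical positive rate.
y_null = np.full(4, y_true.mean())

models = {
    "good": np.array([0.1, 0.9, 0.8, 0.2]),
    "bad (null-like)": y_null,
    "very bad (anti-correlated)": np.array([0.9, 0.1, 0.2, 0.8]),
}
for name, y_pred in models.items():
    d2 = 1 - brier_score_loss(y_true, y_pred) / brier_score_loss(y_true, y_null)
    print(f"{name}: {d2:.2f}")
# good: 0.90, bad (null-like): 0.00, very bad (anti-correlated): -1.90

The negative value for the anti-correlated model is exactly the teaching point discussed above.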

if _num_samples(y_proba) < 2:
    msg = "D^2 score is not well-defined with less than two samples."
    warnings.warn(msg, UndefinedMetricWarning)
    return float("nan")
Member

I think it could be helpful to expand the warning message to give the user a code snippet to either turn this warning into an exception or silence it:

#29048 (comment)

We could also expose a replace_undefined_by argument, but this can be done in a follow-up PR for all D2 scores.
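For reference, a sketch of what such a snippet could look like, using only the standard warnings machinery and sklearn's existing UndefinedMetricWarning class:

import warnings
from sklearn.exceptions import UndefinedMetricWarning

# Turn the warning into an exception:
warnings.filterwarnings("error", category=UndefinedMetricWarning)

# ...or silence it entirely:
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)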

Contributor Author

I think this is the same warning we raise for all D^2 metrics and even the R^2 score. We could change it here, but wouldn't it be better to leave it as is for consistency? Also, if we do want to update it, how about adding an additional sentence instead of a code snippet, mentioning that y_proba needs at least two samples for the D^2 score to be meaningful?



def test_d2_brier_score():
    sample_weight = [2, 2, 3, 1, 1, 1]
Member

Maybe we could parametrize this test with ("use_sample_weight", [True, False]) and consistently pass the sample_weight argument everywhere, either as None or as [2, 2, 3, 1, 1, 1].
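A minimal sketch of the suggested parametrization (the test body is elided and the names are illustrative, not taken from the PR):

import pytest

@pytest.mark.parametrize("use_sample_weight", [True, False])
def test_d2_brier_score(use_sample_weight):
    sample_weight = [2, 2, 3, 1, 1, 1] if use_sample_weight else None
    # ...every call under test would then pass sample_weight=sample_weight.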

@OmarManzoor (Contributor Author) commented Jul 1, 2025

This will be a bit difficult because the hardcoded null model scores, and the y_proba values that equal the null model, differ between the weighted and unweighted cases. With two parameter values in a single test, we would need to add a bunch of if/else statements, which would get awkward. Should we instead separate this out into binary and multiclass cases?

# Also confirm that the order of the labels does not affect the d2_score
labels = [2, 0, 1]
new_d2_score = d2_brier_score(y_true=y_true, y_proba=y_proba, labels=labels)
assert new_d2_score == pytest.approx(d2_score)
Member

I think this test function is a bit long. We could split out the edge cases where we expect d2_score == 0 into a dedicated function.
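For instance, a hypothetical split along those lines, assuming the d2_brier_score added by this PR is importable from sklearn.metrics (the function name and toy data below are illustrative only):

import pytest
from sklearn.metrics import d2_brier_score

def test_d2_brier_score_null_model():
    # Edge case: the predictions equal the null model (the empirical
    # positive rate is 0.5 here), so D^2 must be exactly 0.
    y_true = [0, 1, 1, 0]
    y_proba = [0.5, 0.5, 0.5, 0.5]
    assert d2_brier_score(y_true=y_true, y_proba=y_proba) == pytest.approx(0.0)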
