
ENH Make GaussianProcessRegressor.predict faster when return_std and return_cov are false #31431


Merged
merged 10 commits into scikit-learn:main on Jul 1, 2025

Conversation

RafaAyGar (Contributor)

Reference Issues/PRs

Fixes #31374

What does this implement/fix? Explain your changes.

This PR avoids an unnecessary call to solve_triangular() inside GaussianProcessRegressor.predict() when the return_std and return_cov arguments are both False.
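For context, a minimal sketch of the gating idea (a standalone illustration, not the exact scikit-learn diff; L, alpha, and K_trans stand in for the fitted Cholesky factor, dual coefficients, and cross-kernel the estimator computes internally):

from scipy.linalg import solve_triangular

def predict_mean_only(L, alpha, K_trans, return_std=False, return_cov=False):
    # The posterior mean only needs the cross-kernel and the dual coefficients.
    y_mean = K_trans @ alpha
    if return_std or return_cov:
        # V = L^{-1} K_trans.T is only required for the (co)variance,
        # so the triangular solve is skipped when neither is requested.
        V = solve_triangular(L, K_trans.T, lower=True, check_finite=False)
        ...  # derive std/cov from V as the estimator does
    return y_mean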
A non-regression test is also added to check that y_mean is returned alone (not as a tuple) when return_std=False and return_cov=False; this behavior already existed but was not covered by the tests.
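For illustration, a minimal sketch of what such a non-regression test could look like (the test actually added in the PR may differ in name and assertions):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor

def test_predict_mean_only_returns_array():
    # With return_std=False and return_cov=False, predict should return
    # the posterior mean as a plain ndarray, not a tuple.
    X, y = make_regression(n_samples=50, n_features=3, random_state=0)
    gpr = GaussianProcessRegressor(random_state=0).fit(X, y)
    y_mean = gpr.predict(X, return_std=False, return_cov=False)
    assert isinstance(y_mean, np.ndarray)
    assert y_mean.shape == (X.shape[0],)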

Any other comments?

N/A.

github-actions bot commented May 26, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 88c5011.

@lesteve (Member) commented Jun 10, 2025

Thanks for the PR!

The fix looks fine.

Could you do a quick benchmark with some toy data to show that this PR actually fixes the performance issue reported in #31374?

@RafaAyGar (Contributor, Author)


Sure! Thanks.

@RafaAyGar (Contributor, Author) commented Jun 14, 2025


I experimented with the diabetes dataset on an x86_64 system with an Intel(R) Xeon(R) E5-2620 v3 CPU @ 2.40GHz.

Code:

import numpy as np
import time
from sklearn.datasets import load_diabetes
from sklearn.gaussian_process import GaussianProcessRegressor

diabetes = load_diabetes()

X = diabetes.data
y = diabetes.target

seeds = 30
times = []

for seed in range(seeds):
    gpr = GaussianProcessRegressor(random_state=seed)
    gpr.fit(X, y)

    # Only the call to predict() is timed; fit() is excluded.
    start_time = time.time()
    predictions = gpr.predict(X, return_std=False, return_cov=False)
    times.append(time.time() - start_time)

# The script is run once on main and once on this branch, switching the
# label in the print statement between the two runs.
# print(f"Mean time to predict along {seeds} runs in new version: {np.mean(times)} seconds")
print(f"Mean time to predict along {seeds} runs in old version: {np.mean(times)} seconds")

Output is:

Mean time to predict along 30 runs in new version: 0.0054 seconds
Mean time to predict along 30 runs in old version: 0.0114 seconds

In conclusion, skipping the solve_triangular call when it is not needed roughly halves the prediction time on this dataset (from ~0.0114 s to ~0.0054 s per call).

@lesteve (Member) commented Jun 16, 2025

Thanks for the quick benchmark. I put together a slightly different one just to double-check, and indeed solve_triangular accounts for a significant share of the time spent in .predict:

import numpy as np
import time
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor

X, y = make_regression(n_samples=5_000)

gpr = GaussianProcessRegressor(
    random_state=0
)
gpr.fit(X, y)

%timeit gpr.predict(X, return_std=False, return_cov=False)

%prun -s cumulative -l 10 gpr.predict(X, return_std=False, return_cov=False)
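(%timeit and %prun are IPython magics. For readers outside IPython, a rough standard-library equivalent of the profiling step, reusing the gpr and X objects defined above, would be:)

import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
gpr.predict(X, return_std=False, return_cov=False)
profiler.disable()
# Show the ten entries with the largest cumulative time, like %prun -s cumulative -l 10.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)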

Output on main (.predict takes 1.58s total with ~0.66s taken by solve_triangular):

1.58 s ± 12.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
         212 function calls in 1.596 seconds

   Ordered by: cumulative time
   List reduced from 99 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    1.596    1.596 {built-in method builtins.exec}
        1    0.001    0.001    1.596    1.596 <string>:1(<module>)
        1    0.004    0.004    1.595    1.595 _gpr.py:367(predict)
        1    0.036    0.036    0.926    0.926 kernels.py:931(__call__)
        1    0.128    0.128    0.868    0.868 kernels.py:1525(__call__)
        1    0.000    0.000    0.740    0.740 distance.py:2786(cdist)
        1    0.740    0.740    0.740    0.740 {built-in method scipy.spatial._distance_pybind.cdist_sqeuclidean}
        1    0.000    0.000    0.664    0.664 _basic.py:411(solve_triangular)
        1    0.664    0.664    0.664    0.664 _basic.py:503(_solve_triangular)
        1    0.000    0.000    0.022    0.022 kernels.py:1239(__call__)

Output in this PR (.predict takes 921 ms):

921 ms ± 6.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
         193 function calls in 0.916 seconds

   Ordered by: cumulative time
   List reduced from 93 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.916    0.916 {built-in method builtins.exec}
        1    0.001    0.001    0.916    0.916 <string>:1(<module>)
        1    0.005    0.005    0.915    0.915 _gpr.py:367(predict)
        1    0.036    0.036    0.910    0.910 kernels.py:931(__call__)
        1    0.127    0.127    0.851    0.851 kernels.py:1525(__call__)
        1    0.000    0.000    0.724    0.724 distance.py:2786(cdist)
        1    0.724    0.724    0.724    0.724 {built-in method scipy.spatial._distance_pybind.cdist_sqeuclidean}
        1    0.000    0.000    0.023    0.023 kernels.py:1239(__call__)
        1    0.023    0.023    0.023    0.023 numeric.py:290(full)
        1    0.000    0.000    0.001    0.001 validation.py:2856(validate_data)

@lesteve lesteve changed the title [MRG] Avoid solve_triangular computation in GaussianProcessRegressor predict when return_std and return_cov are false ENH Avoid solve_triangular computation in GaussianProcessRegressor predict when return_std and return_cov are false Jun 16, 2025
@lesteve lesteve changed the title ENH Avoid solve_triangular computation in GaussianProcessRegressor predict when return_std and return_cov are false ENH Make GaussianProcessRegressor.predict faster when return_std and return_cov are false Jun 16, 2025
@lesteve lesteve added the Quick Review For PRs that are quick to review label Jun 16, 2025
@jeremiedbb jeremiedbb added the To backport PR merged in master that need a backport to a release branch defined based on the milestone. label Jul 1, 2025
@jeremiedbb jeremiedbb added this to the 1.7.1 milestone Jul 1, 2025
@jeremiedbb (Member) left a comment


LGTM. Thanks @RafaAyGar

@jeremiedbb jeremiedbb enabled auto-merge (squash) July 1, 2025 11:05
@jeremiedbb jeremiedbb merged commit aa2131f into scikit-learn:main Jul 1, 2025
40 checks passed
Labels
module:gaussian_process
Quick Review (For PRs that are quick to review)
To backport (PR merged in master that needs a backport to a release branch defined based on the milestone)
Development

Successfully merging this pull request may close these issues.

Suggested fix: GaussianProcessRegressor.predict wastes significant time when both return_std and return_cov are False
3 participants