Skip to content

CI Avoid Windows timeout by switching to OpenBLAS #31641

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Jun 26, 2025

Conversation

glemaitre
Copy link
Member

PR to debug the issue where the Windows CI fails when a node with a single physical core is detected.

Copy link

github-actions bot commented Jun 23, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: cec9b9b. Link to the linter CI: here

@glemaitre
Copy link
Member Author

OK, so I could spot two architectures of CPU:

CPU Model: AMD EPYC 7763 64-Core Processor                
Architecture: 9
Physical Cores: 1
Logical Processors: 2
Max Clock Speed: 2445 MHz
Current Clock Speed: 2445 MHz
L2 Cache Size:  KB
L3 Cache Size: 0 KB

and

CPU Model: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Architecture: 9
Physical Cores: 2
Logical Processors: 2
Max Clock Speed: 2594 MHz
Current Clock Speed: 2594 MHz
L2 Cache Size:  KB
L3 Cache Size: 0 KB

The failure seems therefore related to MKL and the AMD CPU. I'll switch to openblas to see if the tests are passing and we might want to try reporting the issue.

@@ -297,7 +297,7 @@ def remove_from(alist, to_remove):
],
"package_constraints": {
"python": "3.10",
"blas": "[build=mkl]",
"blas": "[build=openblas]",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should rename the mkl to openblas but I wanted to test without to have to change anything else.

@ogrisel
Copy link
Member

ogrisel commented Jun 25, 2025

Could you please update the lock file only for the windows build instead? There is a commandline option in the script to do so. Use --help to find its name.

@lesteve lesteve changed the title DEBUG investigate failure Windows worker with single worker CI Avoid Windows timeout by switching to OpenBLAS Jun 26, 2025
@lesteve
Copy link
Member

lesteve commented Jun 26, 2025

I pushed a commit changing only the Windows lock-file and renaming the lock-file to have openblas rather than mkl.

Let's see what the CI says 🤞.

@lesteve
Copy link
Member

lesteve commented Jun 26, 2025

Windows CI passed and it ran on the problematic AMD EPYC 7763 64-Core Processor build log where we have seen failures and timeout occur consistently.

Copy link
Member

@lesteve lesteve left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a good enough work-around to get back to a somewhat normal CI.

@lesteve lesteve merged commit c92330f into scikit-learn:main Jun 26, 2025
37 checks passed
@lesteve
Copy link
Member

lesteve commented Jun 26, 2025

Merging this to be able to make other PRs go forward.

One possible improvement for a later PR would be to put the CPU info on Windows in the same place as we do lscpu in build_tools/test_script.sh.

In an ideal world, we would also further understand the issue but this seems time-consuming (would first need a reproducer with scipy/numpy only, and maybe eventually a reproducer in C) and if in the end if this is indeed a MKL issue, I am not sure where to report.

Note for completeness: I was able to reproduce by using action-tmate to have an interactive SSH session in a GHA runner (same AMD CPU as in Azure), see https://github.com/mxschmitt/action-tmate?rgh-link-date=2025-06-24T14%3A01%3A52Z.

@glemaitre
Copy link
Member Author

Thanks @lesteve to have finish this one.

@lorentzenchr
Copy link
Member

After some time of silence, the jury of the maintainability award judges this PR worth of the 5th trophy 🏆 (after #25069 (comment)). As the Nobel committee does, the price is split between @glemaitre and @lesteve.

Race:
@glemaitre and @lesteve both 1.5 , @jeremiedbb 1, @thomasjpfan 1.

The jury kindly asks for support about future suggestions of nominees.

@lesteve
Copy link
Member

lesteve commented Jun 27, 2025

Not sure about a maintainability award for this one, maybe more of a firefighting mode award 😅.

This may blow up in our face one day 1 but oh well, for now things are back to normal, let's hope for some time 🤞.

Footnotes

  1. see https://github.com/scikit-learn/scikit-learn/pull/31644#issuecomment-3007641735 for more details where it fails on GHA same AMD CPU but 2 physical cores instead of 1 in Azure ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants