Skip to content

Add more tests for Unicode case-insensitivity in regexes. #5198

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 25, 2025

Conversation

sjrd
Copy link
Member

@sjrd sjrd commented Jun 16, 2025

This was woefully under-tested.

assertMatches(ranges, "\u0180") // neither mapped to nor from, but in the specified range
// 0540-0550
assertMatches(ranges, "\u0547") // in range
assertMatches(ranges, "\u0577") // mapped from 0577
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong code point in comment?

Also: When you write "mapped from X" do you mean the code point in the test is folded to from that value?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yes, the comment should say "0547".

Also: When you write "mapped from X" do you mean the code point in the test is folded to from that value?

Yes, that's right. It might be clearer to write "0547 folds to 0577" in the comment?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that would be much clearer.

// FB00
assertMatches(ranges, "\uFB00") // ff LATIN SMALL LIGATURE FF
// 0175-0182 (contains 017F which folds to 's')
if (!executingInJVM) { // looks like a JVM bug
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you able to find a reference for this? (or is this why I'm not requested as a reviewer on this PR yet and I'm prematurely reviewing? :P).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I haven't been able to find a reference. The other way around does match (using r-t in the pattern, and matching "\u017F", though, so it is most likely a bug. I should add that reverse thing as a test case to demonstrate that, actually.

I should probably file a bug to OpenJDK as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug filed; pending review.

@sjrd
Copy link
Member Author

sjrd commented Jun 21, 2025

Also FYI this is in the context of scala-wasm#83.

@sjrd sjrd force-pushed the regex-more-unicode-case-insensitive-tests branch from eae9211 to 1cf0160 Compare June 24, 2025 08:22
@sjrd sjrd requested a review from gzm0 June 24, 2025 08:22
@sjrd
Copy link
Member Author

sjrd commented Jun 24, 2025

Can be reviewed again. I suggest we wait for the bug to be reviewed and added to the bug database before merging, so that we can directly include a link.

sjrd added 2 commits June 25, 2025 07:59
`Formatter` is an entire beast of its own. It is a poor fit for
the debugging output of another area of the test suite.
@sjrd sjrd force-pushed the regex-more-unicode-case-insensitive-tests branch from 1cf0160 to 5696a1a Compare June 25, 2025 05:59
@sjrd
Copy link
Member Author

sjrd commented Jun 25, 2025

Bug report accepted upstream. I inserted the link.

@sjrd sjrd enabled auto-merge June 25, 2025 06:46
@sjrd sjrd merged commit 37df9c2 into scala-js:main Jun 25, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants