Skip to content

[Yaml] Remove escaping regex #19782

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from
Closed

Conversation

GuilhemN
Copy link
Contributor

@GuilhemN GuilhemN commented Aug 29, 2016

Q A
Branch? master
Bug fix? no
New feature? no
BC breaks? no
Deprecations? no
Tests pass? yes
Fixed tickets
License MIT
Doc PR

This PR replaces regexs in Escaper by call to strcspn. It is much easier to read imo and performance aren't degraded.
I also updated the data provider of a test because it was in practice testing nothing (it is checking if values are quoted as expected but as they always began with \t it was always quoted anyway).

Edit: quotes are moved to indicators as they are supported inside plain scalars as said by @stefk:

It seems those characters are perfectly valid in unquoted strings:

http://yaml.org/spec/1.2/spec.html#id2788859
http://yaml-online-parser.appspot.com/?yaml=foo%3A+ba%22r%0Abaz%3A+q%27u%27x%27%0A&type=canonical_yaml

@@ -48,7 +52,7 @@ class Escaper
*/
public static function requiresDoubleQuoting($value)
{
return preg_match('/'.self::REGEX_CHARACTER_TO_ESCAPE.'/u', $value);
return strlen($value) !== strcspn($value, implode('', self::$escapees));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would force escaping values containing ", / and \ that weren't escaped before. Does it matter ?

@GuilhemN GuilhemN force-pushed the ESCAPING branch 3 times, most recently from b534d61 to 87bc631 Compare August 30, 2016 19:50
@@ -0,0 +1,15 @@
<?php
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure about this file :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, i should stop using -A ^^

'\\x18', '\\x19', '\\x1a', '\\e', '\\x1c', '\\x1d', '\\x1e', '\\x1f',
'\\N', '\\_', '\\L', '\\P');
private static $escapees = array(
'\\', '\\\\', '\\"', '"', '/',
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW this fixes a "bug", / was missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be done in 2.7 then?

Copy link
Contributor Author

@GuilhemN GuilhemN Sep 2, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It works in 2.7, it should be escaped but that's not mandatory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why doing it then?

Copy link
Contributor Author

@GuilhemN GuilhemN Dec 7, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually that's not clear what we should do here. Non-printable characters must be escaped but / is a printable character (the spec says \/ is available for json compatibily), so I guess it's up to us to decide what to do.

For readability / should indeed be better while \/ is more compliant with json. If you prefer / I'll revert this change.

@fabpot
Copy link
Member

fabpot commented Aug 31, 2016

As it's not a bug fix, this should be done in master.

@GuilhemN GuilhemN changed the base branch from 2.7 to master August 31, 2016 18:07
@GuilhemN GuilhemN force-pushed the ESCAPING branch 2 times, most recently from d317288 to b377df8 Compare August 31, 2016 18:14
@GuilhemN
Copy link
Contributor Author

ok, retargeted and rebased.
So no refactoring won't be merged in older branches? That's probably a good thing :)

@fabpot
Copy link
Member

fabpot commented Aug 31, 2016

That's the main idea, refactorings do not fix anything, so they should be done in master (the only exception is when it gives a huge performance improvement and the changes are trivial).

@@ -684,7 +684,7 @@ public static function evaluateBinaryScalar($scalar)

private static function isBinaryString($value)
{
return !preg_match('//u', $value) || preg_match('/[^\x09-\x0d\x20-\xff]/', $value);
return !preg_match('//u', $value) || preg_match('/[^\x00-\x1f\x20-\xff]/', $value);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these caracters are escapable in double quoted strings

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay, maybe we should add a comment with some reference to an external source or something like that to help understanding the code.

Copy link
Contributor Author

@GuilhemN GuilhemN Sep 27, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact i'm not even sure we should use !!binary when dumping, all characters are supported in double-quoted scalars anyway.
I'll try to find what the spec says about that.

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Dec 6, 2016

Well, since you changed some char ranges, this looks like a bug fix to me. About the const, removing it is a BC break.
Thus, I think I'd prefer keeping the regexp and focus on the other changes for 2.7.

@stof
Copy link
Member

stof commented Dec 6, 2016

About the const, removing it is a BC break.

@nicolas-grekas the whole class is marked as internal, so not covered by the BC promise

return preg_match('/[ \s \' " \: \{ \} \[ \] , & \* \# \?] | \A[ \- ? | < > = ! % @ ` ]/x', $value);
// First character is an indicator
// @see http://yaml.org/spec/1.2/spec.html#c-indicator
if ($value && 1 === strspn($value[0], '-?&*!|<>\'"=%@`')) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is broken for the empty string (btw, the simpler fix for that is probably to add the empty string in the array of values above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, good catch!

'slash' => array('/', '/'),
'backslash' => array('\\', '\\'),
'next-line' => array("\xC2\x85", '"\\N"'),
'non-breaking-space' => array('�', '�'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be "\_" in YAML ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, the test is wrong.

@nicolas-grekas
Copy link
Member

So, to help moving this one forward, I'm personnaly 👎 on this:

This PR replaces regexs in Escaper by call to strcspn. It is much easier to read imo and performance aren't degraded.

This is subjective, with no clear technical benefit, and will create merge conflicts that are thus not worth it.

@GuilhemN
Copy link
Contributor Author

GuilhemN commented Dec 27, 2016

@nicolas-grekas this PR mostly intents to improve readability and maintainability (do you think it's obvious what preg_match('/[ \s \' " \: \{ \} \[ \] , & \* \# \?] | \A[ \- ? | < > = ! % @ ]/x', $value); does ?).
About merge conflicts I understand your concern but this file is not modified often (last modification of a line updated here was in january 2015) so I don't think it would be a problem.

It also improves a bit scalars dumping when quotes are in the middle of the scalar (e.g. foo'bar was dumped as 'foo''bar' now it is dumped as foo'bar as it is a valid plain scalar).

Anyway, we should at least update the tests which are currently useless.

@fabpot
Copy link
Member

fabpot commented Mar 5, 2017

Closing as I agree with @nicolas-grekas here. @GuilhemN Can you submit a PR for the tests though?

@fabpot fabpot closed this Mar 5, 2017
@GuilhemN GuilhemN deleted the ESCAPING branch March 6, 2017 20:30
@GuilhemN GuilhemN mentioned this pull request Mar 6, 2017
fabpot added a commit that referenced this pull request Mar 6, 2017
This PR was merged into the 2.7 branch.

Discussion
----------

[Yaml] Fix the tests

| Q             | A
| ------------- | ---
| Branch?       | 2.7
| Bug fix?      | no
| New feature?  | no
| BC breaks?    | no
| Deprecations? | no
| Tests pass?   | yes
| Fixed tickets |
| License       | MIT
| Doc PR        |

Fix tests that are currently useless (previously part of #19782).

Commits
-------

5107b94 [Yaml] Fix the tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants