Re: PREG \d vs. [0-9] [message #181944 is a reply to message #181943]
Wed, 26 June 2013 16:02
Thomas 'PointedEars'
Messages: 701 Registered: October 2010
Christoph Michael Becker wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Tony Mountifield wrote:
>>> That's because you have an unescaped / within your regex, so it sees
>>> /^M?(([0-9]?)[ ]?([0-9])(/ followed by a ? as a regex modifier.
>>
>> Good catch. Also, in POSIX Extended Regular Expressions (ERE) this is
>> written simpler
>>
>> ^M?(([0-9]?) ?([0-9])(…
>>
>> and in Perl-Compatible Regular Expressions (PCRE) it is written simpler
>>
>> ^M?((\d?) ?(\d)(…
>
> Isn't the exact interpretation of \d locale dependent? I was not able
> to find this information on php.net and I am not able to verify this, as
> I do not have locales available, which have decimal digits other than
> 0-9. However, at least when one works with UTF-8 encoded strings and
> uses the u modifier for the regular expression, \d is not the same as
> [0-9]:
>
>>>> $zero = "\xe0\xa5\xa6" // DEVANAGARI DIGIT ZERO
>>>> preg_match('/[0-9]/u', $zero)
> 0
>>>> preg_match('/\d/u', $zero)
> 1

PHP uses Perl-Compatible Regular Expressions (PCRE) here:

<http://php.net/preg_match>
<http://php.net/pcre>

So this is described in greater detail in the PCRE documentation:

<http://pcre.org/pcre.txt>

“\d” can match more than just “0” to “9” in PCRE, but (unlike in Perl [1])
the behavior is _not_ locale-dependent by default. There is an option,
PCRE_UCP, that makes \d equivalent to \p{Nd} etc. (UCP stands for “Unicode
Character Properties”), but apparently PHP does not set it by default when
it compiles a pattern:

$ locale
LANG=de_CH.UTF-8
LANGUAGE=
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
$ php -r 'echo setlocale(LC_ALL, "de_CH.UTF-8") . "\n";
echo preg_match("/\d/", "१");'
de_CH.UTF-8
0

The “u” pattern modifier in PHP sets the PCRE_UTF8 option (as documented),
but apparently the PCRE_UCP option as well. Hence your observation:

$ php -r 'echo preg_match("/\d/u", "१");'
1
$ php -v
PHP 5.4.15-1 (cli) (built: May 12 2013 12:17:45)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies
with XCache v3.0.1, Copyright (c) 2005-2013, by mOo
with Xdebug v2.2.1, Copyright (c) 2002-2012, by Derick Rethans
with XCache Optimizer v3.0.1, Copyright (c) 2005-2013, by mOo
with XCache Cacher v3.0.1, Copyright (c) 2005-2013, by mOo
with XCache Coverager v3.0.1, Copyright (c) 2005-2013, by mOo
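
For comparison, here is a minimal sketch that runs the same subject through
four patterns side by side. The \p{Nd} pattern and the variable names are
additions for illustration and were not tested above; the sketch assumes a
PHP build whose PCRE has Unicode property support and a script that has not
called setlocale().

```php
<?php
// DEVANAGARI DIGIT ZERO (U+0966), written as its three UTF-8 bytes.
$zero = "\xe0\xa5\xa6";

$patterns = array(
    '/\d/',      // byte mode: \d is limited to what PCRE's character tables call a digit
    '/\d/u',     // UTF-8 mode; PHP apparently enables PCRE_UCP here as well
    '/[0-9]/u',  // explicit ASCII range, not affected by PCRE_UCP
    '/\p{Nd}/u', // Unicode "decimal number" property (assumed available)
);

// Collect the preg_match() result (1 = match, 0 = no match) per pattern.
$results = array();
foreach ($patterns as $pattern) {
    $results[$pattern] = preg_match($pattern, $zero);
    printf("%-12s => %d\n", $pattern, $results[$pattern]);
}
```

Only the two Unicode-aware patterns should report a match; the byte-mode \d
and the explicit range should not.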

In Perl there is the “a” modifier to let Perl regular expressions match in
ASCII mode regardless of the locale, but it is not needed with PCRE (as long
as PCRE_UCP is not set when the pattern is compiled).

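
If you need ASCII-only digit matching while keeping the “u” modifier, the
explicit range is the simple route. A minimal sketch; is_ascii_digits() is a
hypothetical helper name, not a PHP built-in:

```php
<?php
// Hypothetical helper (not from the original post): returns true only if $s
// consists entirely of the ASCII digits 0-9, even with the u modifier set.
function is_ascii_digits($s)
{
    // \A and \z anchor to the very start and end of the subject, so a
    // trailing newline is rejected too. [0-9] stays ASCII-only whether or
    // not PCRE_UCP is in effect.
    return preg_match('/\A[0-9]+\z/u', $s) === 1;
}

var_dump(is_ascii_digits("2013"));         // bool(true)
var_dump(is_ascii_digits("\xe0\xa5\xa7")); // DEVANAGARI DIGIT ONE: bool(false)
```

Replacing [0-9] with \d in that pattern would accept the Devanagari digit
again, as in the observation above.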
[1] <http://perldoc.perl.org/perlre.html>
PointedEars
--
Sometimes, what you learn is wrong. If those wrong ideas are close to the
root of the knowledge tree you build on a particular subject, pruning the
bad branches can sometimes cause the whole tree to collapse.
-- Mike Duffy in cljs, <news:Xns9FB6521286DB8invalidcom(at)94(dot)75(dot)214(dot)39>