Home » Imported messages » comp.lang.php » how to change old ereg?
Re: PREG \d vs. [0-9] [message #181944 is a reply to message #181943] Wed, 26 June 2013 16:02
Thomas 'PointedEars'
Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Tony Mountifield wrote:
>>> That's because you have an unescaped / within your regex, so it sees
>>> /^M?(([0-9]?)[ ]?([0-9])(/ followed by a ? as a regex modifier.
>>
>> Good catch. Also, in POSIX Extended Regular Expressions (ERE) this is
>> written simpler
>>
>> ^M?(([0-9]?) ?([0-9])(…
>>
>> and in Perl-Compatible Regular Expressions (PCRE) it is written simpler
>>
>> ^M?((\d?) ?(\d)(…
>
> Isn't the exact interpretation of \d locale dependent? I was not able
> to find this information on php.net and I am not able to verify this, as
> I do not have locales available, which have decimal digits other than
> 0-9. However, at least when one works with UTF-8 encoded strings and
> uses the u modifier for the regular expression, \d is not the same as
> [0-9]:
>
>>>> $zero = "\xe0\xa5\xa6" // DEVANAGARI DIGIT ZERO
>>>> preg_match('/[0-9]/u', $zero)
> 0
>>>> preg_match('/\d/u', $zero)
> 1

PHP uses Perl-Compatible Regular Expressions (PCRE) here:

<http://php.net/preg_match>
<http://php.net/pcre>

So this can be found in greater detail in the PCRE documentation:

<http://pcre.org/pcre.txt>

“\d” can match more than just “0” to “9” in PCRE, but (unlike in Perl [1])
the behavior is _not_ locale-dependent by default. There is an option,
PCRE_UCP, that makes \d equivalent to \p{Nd} etc. (UCP stands for “Unicode
Character Properties”), but apparently PHP does not set it by default when
it compiles a pattern:

$ locale
LANG=de_CH.UTF-8
LANGUAGE=
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=

$ php -r 'echo setlocale(LC_ALL, "de_CH.UTF-8") . "\n";
echo preg_match("/\d/", "१");'
de_CH.UTF-8
0

The “u” pattern modifier in PHP sets the PCRE_UTF8 option (as documented),
but apparently the PCRE_UCP option as well. Hence your observation:

$ php -r 'echo preg_match("/\d/u", "१");'
1

$ php -v
PHP 5.4.15-1 (cli) (built: May 12 2013 12:17:45)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies
with XCache v3.0.1, Copyright (c) 2005-2013, by mOo
with Xdebug v2.2.1, Copyright (c) 2002-2012, by Derick Rethans
with XCache Optimizer v3.0.1, Copyright (c) 2005-2013, by mOo
with XCache Cacher v3.0.1, Copyright (c) 2005-2013, by mOo
with XCache Coverager v3.0.1, Copyright (c) 2005-2013, by mOo
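
The two command lines above can be combined into one side-by-side check. The
following is only a sketch: digit_matches() is a hypothetical helper made up
for this illustration, and the result for “/\d/u” assumes a PHP build whose
“u” modifier also enables PCRE_UCP, as observed above.

```php
<?php
// DEVANAGARI DIGIT ZERO (U+0966), written as raw UTF-8 bytes.
$zero = "\xe0\xa5\xa6";

// Hypothetical helper for this sketch: run each pattern against $subject
// and collect the preg_match() return values, keyed by pattern.
function digit_matches($subject)
{
    $results = array();
    foreach (array('/\d/', '/[0-9]/', '/\d/u', '/[0-9]/u') as $pattern) {
        $results[$pattern] = preg_match($pattern, $subject);
    }
    return $results;
}

foreach (digit_matches($zero) as $pattern => $matched) {
    echo $pattern, ' => ', $matched, "\n";
}
```

On a build like the one shown above, only the “/\d/u” line should report 1;
the explicit range [0-9] stays ASCII-only with or without the “u” modifier.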

In Perl there is the “a” flag to make regular expressions match in ASCII
mode regardless of the locale, but it is not needed with PCRE (as long as
PCRE_UCP is not set for the pattern).
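
If a pattern needs the “u” modifier for UTF-8 subjects but must accept only
ASCII digits, spelling out the range avoids the question entirely. A minimal
sketch, where is_ascii_digits() is a hypothetical name chosen for this
example and not part of PHP:

```php
<?php
// Hypothetical helper: true only for non-empty strings of ASCII digits 0-9.
function is_ascii_digits($s)
{
    // [0-9] is an explicit range, so the u modifier does not widen it.
    // \A and \z anchor to the very start and very end of the subject.
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/\A[0-9]+\z/u', $s) === 1;
}

var_dump(is_ascii_digits("2013"));          // bool(true)
var_dump(is_ascii_digits("\xe0\xa5\xa6"));  // bool(false)
```

Using \z instead of $ also keeps a trailing newline from being accepted.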

[1] <http://perldoc.perl.org/perlre.html>


PointedEars
--
Sometimes, what you learn is wrong. If those wrong ideas are close to the
root of the knowledge tree you build on a particular subject, pruning the
bad branches can sometimes cause the whole tree to collapse.
-- Mike Duffy in cljs, <news:Xns9FB6521286DB8invalidcom(at)94(dot)75(dot)214(dot)39>