Re: Rejecting Certain Non-ASCII Characters [message #181196 is a reply to message #181166]
Sat, 20 April 2013 12:08
Thomas 'PointedEars'
Messages: 701 Registered: October 2010
Christoph Becker wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Christoph Becker wrote:
>>> In the former case you can simply replace the slashed zero with a
>>> standard zero. Assuming UTF-8 encoding:
>>>
>>> $input = str_replace('\xCC\xB8', '', $input);
>>
>> You do not consider that, ambiguity aside, even in UTF-8 there are
>> *several* ways to produce Unicode characters. See: Unicode Normalization
>> Forms.
>
> If I'm not mistaken here, this str_replace() will work for NFD and NFC,
> but indeed I had not considered NFKD and NFKC. Thanks for pointing this
> out.
[I admit that Unicode Normalization Forms are tricky. Even I did not know
the full range of their complexities before your posting, so thank you for
making me look into it more thoroughly.]
AIUI, your approach will work reliably *only* for Normalization Form D
(NFD), which is the result of “Canonical *D*ecomposition” [1].
Normalization Form C (NFC) is the result of “Canonical Decomposition,
followed by Canonical *C*omposition” (ibid.). That is, if Unicode has a
precomposed character for the sequence of base character and combining
character(s) used, then there will be no code sequence for a combining
character for you to remove.
[NFKD and NFKC are entirely different animals: The NFKD of U+1E9B (“ẛ”)
U+0323 is *U+0073* (“s”) U+0323 U+0307; its NFKC is *U+1E69* (“ṩ”).]
For example, if the original character was U+212B ANGSTROM SIGN (“Å”), then
with NFC there will be only a code sequence for U+00C5 LATIN CAPITAL LETTER
A WITH RING ABOVE (“Å”), and you will not find a code sequence for U+030A
COMBINING RING ABOVE that you could remove in order to get the US-ASCII-
compliant U+0041 LATIN CAPITAL LETTER A (“A”).
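To illustrate the difference, here is a minimal sketch, assuming UTF-8
input. The helper name strip_ring_above() is made up for this example.
Note the double quotes: PHP expands "\xCC\x8A" (U+030A in UTF-8) into bytes
only in double-quoted strings, not in single-quoted ones.

```php
<?php
// Hypothetical helper for illustration: removes U+030A COMBINING RING ABOVE
// (UTF-8 bytes CC 8A) from a UTF-8 string.
function strip_ring_above(string $input): string
{
    return str_replace("\xCC\x8A", '', $input);
}

$nfd  = "A\xCC\x8A";     // U+0041 U+030A: decomposed form (NFD)
$nfc  = "\xC3\x85";      // U+00C5: precomposed form (NFC)
$sign = "\xE2\x84\xAB";  // U+212B ANGSTROM SIGN, not normalized

var_dump(strip_ring_above($nfd) === 'A');    // bool(true)
var_dump(strip_ring_above($nfc) === $nfc);   // bool(true): nothing to remove
var_dump(strip_ring_above($sign) === $sign); // bool(true): nothing to remove
```

Only the decomposed input ends up as plain “A”; the other two pass through
unchanged because they contain no combining character at all.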
However, apparently iconv(), which I suggested, cannot help here either:
$ php -r 'echo iconv("UTF-8", "US-ASCII", "-\xE2\x84\xAB-") . "\n";'
PHP Notice: iconv(): Detected an illegal character in input string in
Command line code on line 1
PHP Stack trace:
PHP 1. {main}() Command line code:0
PHP 2. iconv() Command line code:1
-
$ php -r 'echo iconv("UTF-8", "US-ASCII//TRANSLIT", "-\xE2\x84\xAB-") .
"\n";'
-?-
$ php -r 'echo iconv("UTF-8", "US-ASCII//IGNORE", "-\xE2\x84\xAB-") . "\n";'
PHP Notice: iconv(): Detected an illegal character in input string in
Command line code on line 1
PHP Stack trace:
PHP 1. {main}() Command line code:0
PHP 2. iconv() Command line code:1
--
(Expected: -A-)
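One approach that might give the expected result, assuming the intl
extension and its Normalizer class are available, is to decompose
canonically first and then drop the combining marks. This is only a sketch;
to_ascii_base() is a hypothetical name, not an existing PHP function.

```php
<?php
// Sketch only; to_ascii_base() is a hypothetical helper name.
// Assumption: the intl extension (class Normalizer) is loaded.
function to_ascii_base(string $utf8): string
{
    if (!class_exists('Normalizer')) {
        throw new RuntimeException('intl extension (Normalizer) not available');
    }

    // Canonical decomposition: U+212B and U+00C5 both become U+0041 U+030A.
    $nfd = Normalizer::normalize($utf8, Normalizer::FORM_D);
    if ($nfd === false) {
        throw new InvalidArgumentException('Input could not be normalized');
    }

    // Remove all nonspacing combining marks (Unicode category Mn).
    $stripped = preg_replace('/\p{Mn}+/u', '', $nfd);
    if ($stripped === null) {
        throw new RuntimeException('preg_replace() failed');
    }

    return $stripped;
}

if (class_exists('Normalizer')) {
    echo to_ascii_base("-\xE2\x84\xAB-"), "\n"; // -A-
}
```

This does not guarantee US-ASCII output: characters without a canonical
decomposition, such as U+00F8 (“ø”), pass through unchanged and would still
have to be rejected or mapped separately.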
It might be possible using recode_string(), but I have not yet found out
how. My PHP was not compiled “--with-recode” and I can only get recode(1)
to print “"A” for “Ä”, which is not helpful here.
PointedEars
___________
[1] <http://www.unicode.org/reports/tr15/tr15-37.html>
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee