Re: Rejecting Certain Non-ASCII Characters [message #181196 is a reply to message #181166] Sat, 20 April 2013 12:08
Thomas 'PointedEars'
Christoph Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Christoph Becker wrote:
>>> In the former case you can simply replace the slashed zero with a
>>> standard zero. Assuming UTF-8 encoding:
>>>
>>> $input = str_replace("\xCC\xB8", '', $input);
>>
>> You do not consider that, ambiguity aside, even in UTF-8 there are
>> *several* ways to produce Unicode characters. See: Unicode Normalization
>> Forms.
>
> If I'm not mistaken here, this str_replace() will work for NFD and NFC,
> but indeed I had not considered NFKD and NFKC. Thanks for pointing this
> out.

[I admit that Unicode Normalization Forms are tricky. Even I did not know
the full range of their complexities before your posting, so thank you for
making me look into it more thoroughly.]

AIUI, your approach will work reliably *only* for Normalization Form D
(NFD), which is the result of “Canonical *D*ecomposition” [1].
Normalization Form C (NFC) is the result of “Canonical Decomposition,
followed by Canonical *C*omposition” (ibid.). That is, if Unicode has a
precomposed character for the sequence of characters used, then in NFC
there will be no code sequence for a combining character for you to remove.

[NFKD and NFKC are entirely different animals: The NFKD of U+1E9B (“ẛ”)
U+0323 is *U+0073* (“s”) U+0323 U+0307; its NFKC is *U+1E69* (“ṩ”).]
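
The four forms of that sequence can be inspected with a minimal sketch like
the following. It assumes that the intl extension (class Normalizer) is
available; the variable names are only illustrative.

```php
<?php
// Requires the intl extension (class Normalizer); assumed to be available.
// U+1E9B LATIN SMALL LETTER LONG S WITH DOT ABOVE, U+0323 COMBINING DOT BELOW
$input = "\xE1\xBA\x9B\xCC\xA3";

$forms = array(
    'NFD'  => Normalizer::FORM_D,
    'NFC'  => Normalizer::FORM_C,
    'NFKD' => Normalizer::FORM_KD,
    'NFKC' => Normalizer::FORM_KC,
);

// Collect the UTF-8 bytes of each normalization form as a hex string.
$hex = array();
foreach ($forms as $name => $form) {
    $hex[$name] = bin2hex(Normalizer::normalize($input, $form));
    echo $name, ': ', $hex[$name], "\n";
}
```

With a conforming implementation, NFKD should start with 0x73 (“s”) and NFKC
should be the three bytes of U+1E69, while NFD and NFC keep the long s.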

For example, if the original character was U+212B ANGSTROM SIGN (“Å”), then
with NFC there will be only a code sequence for U+00C5 LATIN CAPITAL LETTER
A WITH RING ABOVE (“Å”), and you will not find a code sequence for U+030A
COMBINING RING ABOVE that you could remove in order to get the US-ASCII-
compliant U+0041 LATIN CAPITAL LETTER A (“A”).
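
To make the difference concrete, here is a minimal sketch with hand-written
UTF-8 byte sequences (no extension required). The variable names are only
illustrative, and the strings must be double-quoted, because PHP does not
interpret \x escapes in single-quoted strings.

```php
<?php
// U+0041 U+030A: "A" followed by COMBINING RING ABOVE (the NFD form).
$nfd = "A\xCC\x8A";
// U+00C5: precomposed LATIN CAPITAL LETTER A WITH RING ABOVE (the NFC form).
$nfc = "\xC3\x85";

// UTF-8 encoding of U+030A COMBINING RING ABOVE.
$ring = "\xCC\x8A";

// In NFD the combining character is present and can be removed ...
$fromNfd = str_replace($ring, '', $nfd);
// ... but in NFC there is nothing to match, so the string stays unchanged.
$fromNfc = str_replace($ring, '', $nfc);

var_dump($fromNfd === 'A');   // bool(true)
var_dump($fromNfc === $nfc);  // bool(true)
```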


However, apparently iconv(), which I suggested, cannot help here either:

$ php -r 'echo iconv("UTF-8", "US-ASCII", "-\xE2\x84\xAB-") . "\n";'
PHP Notice: iconv(): Detected an illegal character in input string in
Command line code on line 1
PHP Stack trace:
PHP 1. {main}() Command line code:0
PHP 2. iconv() Command line code:1
-

$ php -r 'echo iconv("UTF-8", "US-ASCII//TRANSLIT", "-\xE2\x84\xAB-") . "\n";'
-?-

$ php -r 'echo iconv("UTF-8", "US-ASCII//IGNORE", "-\xE2\x84\xAB-") . "\n";'
PHP Notice: iconv(): Detected an illegal character in input string in
Command line code on line 1
PHP Stack trace:
PHP 1. {main}() Command line code:0
PHP 2. iconv() Command line code:1
--

(Expected: -A-)
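
One way that might produce the expected result is to decompose first and
then drop the combining characters. The following is only a sketch under
assumptions: it needs the intl extension (class Normalizer), and
strip_combining_marks() is a hypothetical helper, not a PHP function.

```php
<?php
// Hypothetical helper, not part of PHP; requires the intl extension.
function strip_combining_marks($string)
{
    // Canonical decomposition (NFD): U+212B becomes U+0041 U+030A.
    $decomposed = Normalizer::normalize($string, Normalizer::FORM_D);
    if ($decomposed === false) {
        return false; // normalization failed, e.g. invalid UTF-8 input
    }

    // Remove all nonspacing marks (general category Mn), such as U+030A.
    return preg_replace('/\p{Mn}+/u', '', $decomposed);
}

$result = strip_combining_marks("-\xE2\x84\xAB-");
echo $result, "\n";
```

This only helps for characters that have a canonical decomposition; a
character like U+00F8 (“ø”) has none and would pass through unchanged, so
the result is not guaranteed to be US-ASCII.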

It might be possible using recode_string(), but I have not found out yet
how. My PHP was not compiled “--with-recode” and I can only get recode(1)
to print “"A” for “Ä”, which is not helpful here.


PointedEars
___________
[1] <http://www.unicode.org/reports/tr15/tr15-37.html>
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee