Re: Rejecting Certain Non-ASCII Characters [message #181166 is a reply to message #181165]
Fri, 19 April 2013 21:08
Christoph Becker
Thomas 'PointedEars' Lahn wrote:
> Christoph Becker wrote:
>
>> Jim Higgins wrote:
>>> I have a problem with people entering a slashed zero vs a standard
>>> ASCII zero into HTML forms intended to store data in a MySQL database.
>>
>> Is it really a slashed zero (U+0030 U+0338) they're entering,
>
> Could also be U+0030 U+0337 or any other (allowed composition) of the at
> least 65536 Unicode characters that look(s) similar. For example, it could
> be U+2205 EMPTY SET.
Which I was referring to in an abbreviated form in the rest of the
sentence. :)
>> or do they enter some similar looking character such as the Danish Ø?
>
> That character, U+00D8 LATIN CAPITAL LETTER O WITH STROKE *and* its
> lowercase counterpart, U+00F8 LATIN SMALL LETTER O WITH STROKE, are present
> at least in Danish, Norwegian, and Faroese; they are also used in the
> International Phonetic Alphabet (IPA).
>
>> In the former case you can simply replace the slashed zero with a standard
>> zero. Assuming UTF-8 encoding:
>>
>> $input = str_replace("\xCC\xB8", '', $input);
>
> You do not consider that, ambiguity aside, even in UTF-8 there are *several*
> ways to produce Unicode characters. See: Unicode Normalization Forms.
If I'm not mistaken here, this str_replace() will work for NFD and NFC,
but indeed I had not considered NFKD and NFKC. Thanks for pointing this
out.
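
A minimal sketch of a slightly narrower variant, assuming UTF-8 input and that only the two combining overlays mentioned above (U+0337 and U+0338) are of interest. The function name is made up for illustration. Unlike the plain str_replace(), it only touches an overlay that directly follows an ASCII zero:

```php
<?php
// Hypothetical helper (the name is illustrative): strips a combining
// solidus overlay (U+0337 short, U+0338 long) that directly follows an
// ASCII zero. Assumes $input is valid UTF-8; preg_replace() returns
// null if it is not.
function strip_zero_overlay($input)
{
    // The /u modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{0337} and \x{0338} match whole code points, not bytes.
    return preg_replace('/0[\x{0337}\x{0338}]/u', '0', $input);
}
```

Look-alike characters such as Ø (U+00D8) or the empty set sign (U+2205) are left untouched by this and would still have to be rejected or handled separately.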
>>> Is there a simple way in PHP to restrict input to the ASCII Character
>>> set, specifically hex 0x20 - 0x7E ? Or a simple way to detect
>>> characters outside this range before committing them to the database?
>>
>> If you're dealing with a numeric column, you may consider checking for
>> is_numeric().
>
> There are also regular expressions. Testing against '/[^\x00-\x7F]/', and
> rejecting anything that matches, appears to be the best approach here.
For a numeric column? Why not at least trim down the possibilities of
setting illegal values for the SQL statement?
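
To make both options concrete, here is a rough sketch, assuming the form field arrives as a PHP string. The function names are mine and purely illustrative. The first check covers the range 0x20 - 0x7E asked about in the original post; the second is deliberately stricter than is_numeric(), which also accepts input such as "1e5":

```php
<?php
// Hypothetical helpers; the names are illustrative, not from any library.

// True if $s consists only of printable ASCII (0x20 through 0x7E).
// An empty string passes; reject it separately if the field is required.
function is_printable_ascii($s)
{
    // No /u modifier: the subject is matched byte by byte, so every byte
    // of a multi-byte UTF-8 sequence falls outside the allowed range.
    return preg_match('/[^\x20-\x7E]/', $s) === 0;
}

// True if $s is a non-empty run of ASCII digits and nothing else.
function is_ascii_digits($s)
{
    // \A and \z anchor to the very start and end of the string, so a
    // trailing newline is rejected as well.
    return preg_match('/\A[0-9]+\z/', $s) === 1;
}
```

Either check would run before the value gets anywhere near the SQL statement; which one fits depends on whether the column is free text or strictly numeric.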
> If
> characters actually should be converted, iconv() should be used instead of a
> hardcoded conversion.
>
> However, it is better in the long term to convert the MySQL database (or the
> relevant table and column) to utf8_general_ci, and upgrade MySQL if
> necessary (character sets and collations were not supported before MySQL 5;
> the current stable version is 5.6).
>
> <http://php.net/pcre>
> <http://dev.mysql.com/>
--
Christoph M. Becker