Processing accented characters submitted from forms [message #184469] |
Thu, 02 January 2014 15:55  |
JohnT
One of the websites that I am working on is getting a lot of interest
from countries that make a lot of use of accented characters.
Usually accented characters come through fine.
However, some are replaced by their numeric character references, e.g. &#304;
This seems to be occurring for some Turkish and Romanian characters.
Is PHP doing this?
That is, is it something I can fix with settings?
I need to send the data in non-HTML emails, so I need the original
characters.
html_entity_decode() doesn't seem to work for me.
How do people usually handle this?
Thanks
JohnT
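A minimal sketch of one way the submitted value could be turned back into real characters for a plain-text email. The sample string is hypothetical, and the sketch assumes the form data arrives as ISO-8859-1 bytes with browser-inserted numeric references and that the mbstring extension is available. It also shows a likely reason html_entity_decode() appears not to work: with an ISO-8859-1 target the reference cannot be represented, so it is left as it is.

```php
<?php
// Hypothetical form value: ISO-8859-1 bytes ("\xE9" is é) plus a numeric
// reference that the browser substituted for İ (U+0130).
$submitted = "Caf\xE9 &#304;stanbul";

// 1. Re-encode the ISO-8859-1 bytes as UTF-8 (requires the mbstring extension).
$utf8 = mb_convert_encoding($submitted, 'UTF-8', 'ISO-8859-1');

// 2. Decode references with UTF-8 as the target, which can represent U+0130.
$plain = html_entity_decode($utf8, ENT_QUOTES | ENT_HTML401, 'UTF-8');

// For comparison: with an ISO-8859-1 target the reference stays untouched,
// because İ has no ISO-8859-1 representation.
$unchanged = html_entity_decode($submitted, ENT_QUOTES | ENT_HTML401, 'ISO-8859-1');

echo $plain, PHP_EOL;
```

The decoded string is UTF-8, so the email would need to declare UTF-8 in its Content-Type header. Note that text a user typed literally as a reference is decoded as well.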
Re: Processing accented characters submitted from forms [message #184486 is a reply to message #184474] |
Fri, 03 January 2014 11:40   |
JohnT
On Thu, 02 Jan 2014 20:52:37 +0100, Christoph Michael Becker wrote:
> JohnT wrote:
>
>> One of the websites that I am working on is getting a lot of interest
>> from countries that make a lot of use of accented characters.
>>
>> Usually accented characters come through fine.
>>
>> However, some are replaced by the character codes. e.g. &#304; This
>> seems to be occurring for some Turkish and Romanian characters.
>>
>> Is PHP doing this?
>
> It seems to me that this might already have been done by the browser.
> Unfortunately, I was not able to find any normative reference, so out of
> curiosity I set up a simple form with accept-charset=ISO-8859-1, and
> entered İ (U+0130) in the contained textarea. Firefox 26 and Chrome
> 31.0 send it as the HTML entity &#304;; IE 11, however, sends it as Ý
> (U+00DD).
>
> A solution is to use UTF-8 encoding, as J.O. already mentioned.
Changing to UTF-8 is not an option, but I had already decided to use UTF-8
for any future websites as that is the PHP 5 default, and makes the
programming a lot easier.
Regards
JohnT
Re: Processing accented characters submitted from forms [message #184489 is a reply to message #184486] |
Fri, 03 January 2014 12:53   |
Thomas 'PointedEars'
JohnT wrote:
> On Thu, 02 Jan 2014 20:52:37 +0100, Christoph Michael Becker wrote:
>> JohnT wrote:
>>> One of the websites that I am working on is getting a lot of interest
>>> from countries that make a lot of use of accented characters.
>>>
>>> Usually accented characters come through fine.
>>>
>>> However, some are replaced by the character codes. e.g. &#304; This
>>> seems to be occurring for some Turkish and Romanian characters.
>>>
>>> Is PHP doing this?
>>
>> It seems to me that this might already have been done by the browser.
>> Unfortunately, I was not able to find any normative reference, so out of
>> curiosity I set up a simple form with accept-charset=ISO-8859-1, and
>> entered İ (U+0130) in the contained textarea. Firefox 26 and Chrome
>> 31.0 send it as the HTML entity &#304;; IE 11, however, sends it as Ý
>> (U+00DD).
>>
>> A solution is to use UTF-8 encoding, as J.O. already mentioned.
>
> Changing to UTF-8 is not an option,
Why not?
> but I had already decided to use UTF-8 for any future websites
Good idea.
> as that is the PHP 5 default,
How did you get this idea?
> and makes the programming a lot easier.
If only it were so. PHP 5 is still oblivious to character encoding.
PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
Re: Processing accented characters submitted from forms [message #184490 is a reply to message #184487] |
Fri, 03 January 2014 12:53   |
JohnT
On Fri, 03 Jan 2014 12:37:27 +0000, Ben Bacarisse wrote:
> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes: <snip>
>> We're already using iso-8859-1 for the whole website. It will be a lot
>> of work to change all that, so I guess we'll have to put up with the
>> odd Turkish I causing problems.
>
> It's not clear (to me at least) what's happening to the data, but as far
> as any normal set of HTML pages is concerned (PHP generated or
> otherwise) you don't have to put up with a dotted I causing problems on
> an ISO-8859-1 encoded page. You can represent any Unicode character in
> a page using character entities (browser and font support is always an
> issue but not nowadays for anything as ordinary as İ).
I think it must be the browser that is encoding the character because İ
is not supported by iso-8859-1.
It arrives in the request data as the html numeric entity code, as that
is the only way it can be transmitted.
This causes issues:
First, as I always HTML-encode user-entered data before display, it gets
encoded twice. I'll have to add the 'disable double encode' flag
throughout my code :-)
Secondly, it will be added to the database as the entity code, so this
will break searching the database etc...
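The 'disable double encode' flag mentioned above is the fourth parameter of htmlspecialchars() (and htmlentities()). A short sketch with a hypothetical stored value, assuming the page is ISO-8859-1:

```php
<?php
// Hypothetical stored value that already contains a numeric reference.
$stored = "&#304;stanbul & <b>Ankara</b>";

// Default behaviour (double_encode = true): the existing reference is
// escaped a second time.
$twice = htmlspecialchars($stored, ENT_QUOTES, 'ISO-8859-1');

// Fourth argument false: existing references are left alone, everything
// else is still escaped.
$once = htmlspecialchars($stored, ENT_QUOTES, 'ISO-8859-1', false);

echo $twice, PHP_EOL; // &amp;#304;stanbul &amp; &lt;b&gt;Ankara&lt;/b&gt;
echo $once, PHP_EOL;  // &#304;stanbul &amp; &lt;b&gt;Ankara&lt;/b&gt;
```

The flag does not distinguish browser-inserted references from ones a user typed by hand, so both are passed through unescaped.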
I think the proper fix would be to convert to UTF-8.
But that's a lot of work. For now, I think I'll just manually transliterate
the codes that cause issues.
JohnT
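A manual transliteration of the problem codes could be as simple as a lookup table. The sketch below is only an illustration: the map and the function name are hypothetical, and the list of references is deliberately incomplete.

```php
<?php
// Hypothetical, deliberately incomplete map from numeric references to
// ASCII stand-ins; a real site would extend it as new characters turn up.
$translit = [
    '&#304;' => 'I', // İ U+0130
    '&#305;' => 'i', // ı U+0131
    '&#350;' => 'S', // Ş U+015E
    '&#351;' => 's', // ş U+015F
    '&#537;' => 's', // ș U+0219
    '&#539;' => 't', // ț U+021B
];

// Hypothetical helper: replace every mapped reference in one pass.
function transliterate_references(string $text, array $map): string
{
    return strtr($text, $map);
}

echo transliterate_references('&#304;stanbul, Bra&#537;ov', $translit), PHP_EOL;
// Istanbul, Brasov
```

This loses information (İ and I become indistinguishable), which is why it only suits a stopgap.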
Re: Processing accented characters submitted from forms [message #184498 is a reply to message #184490] |
Fri, 03 January 2014 14:30   |
Ben Bacarisse
JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes:
> On Fri, 03 Jan 2014 12:37:27 +0000, Ben Bacarisse wrote:
>
>> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes: <snip>
>>> We're already using iso-8859-1 for the whole website. It will be a lot
>>> of work to change all that, so I guess we'll have to put up with the
>>> odd Turkish I causing problems.
>>
>> It's not clear (to me at least) what's happening to the data, but as far
>> as any normal set of HTML pages is concerned (PHP generated or
>> otherwise) you don't have to put up with a dotted I causing problems on
>> an ISO-8859-1 encoded page. You can represent any Unicode character in
>> a page using character entities (browser and font support is always an
>> issue but not nowadays for anything as ordinary as İ).
>
> I think it must be the browser that is encoding the character because İ
> is not supported by iso-8859-1.
Note that the browser behaviour can be altered by form attributes
(specifically accept-charset). You can have a form that accepts UTF-8
on an ISO-8859-1 served page.
> It arrives in the request data as the html numeric entity code, as that
> is the only way it can be transmitted.
>
> This causes issues:
>
> As I always htmlencode user entered data before display, it means that it
> gets encoded twice. I'll have to add the 'disable double encode' flag
> throughout my code :-)
Sure. One way or another you need to get the right encoding. This
method is not perfect since a user typing &#304; into a form may not
expect a dotted I to come out.
The best method is (probably) to:
(a) Give UTF-8 as the form's accept-charset.
(b) Encode htmlentities giving UTF-8 as the encoding. This should leave
the UTF-8 characters as UTF-8.
(c) Use mb_convert_encoding($etext, "HTML-ENTITIES", "UTF-8") to make
the string displayable in a page regardless of the page's character
encoding.
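Steps (b) and (c) can be sketched as follows, assuming the form already submits UTF-8 as in step (a). Two substitutions are made and should be read as assumptions rather than as the method proposed above: htmlspecialchars() is used for step (b) because it escapes only markup characters and leaves every other UTF-8 character alone, and mb_encode_numericentity() stands in for the "HTML-ENTITIES" conversion in step (c), which is deprecated in recent PHP releases. The function name is hypothetical.

```php
<?php
// Hypothetical helper: escape a UTF-8 string for output on a page served
// in any ASCII-compatible encoding (for example ISO-8859-1).
function escape_for_legacy_page(string $utf8): string
{
    // Step (b): escape HTML metacharacters, treating the input as UTF-8.
    $escaped = htmlspecialchars($utf8, ENT_QUOTES, 'UTF-8');

    // Step (c): turn every non-ASCII character into a numeric reference
    // (requires the mbstring extension).
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

echo escape_for_legacy_page("\u{0130}stanbul <3"), PHP_EOL;
// &#304;stanbul &lt;3
```

The result is pure ASCII, so it displays the same way whatever encoding the page declares.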
> Secondly, it will be added to the database as the entity code, so this
> will break searching the database etc...
If you take the approach of accepting UTF-8 from the form, you can put
that directly into the database.
> I think the proper fix would be to convert to UTF-8.
> But that's a lot of work. For now, I think I'll just manually translit the
> codes that cause issues.
You really only need UTF-8 in the database. The page encoding is not
that important.
--
Ben.
Re: Processing accented characters submitted from forms [message #184500 is a reply to message #184496] |
Fri, 03 January 2014 15:03   |
Thomas 'PointedEars'
JohnT wrote:
^^^^^
Please fix.
> On Fri, 03 Jan 2014 13:53:04 +0100, Thomas 'PointedEars' Lahn wrote:
>> JohnT wrote:
>>> Changing to UTF-8 is not an option,
>> Why not?
>
> It's a big site.
> It would take too much work to rebuild it all.
Looks like an inherent design flaw to me. It is rather easy to switch a
properly developed site to UTF-8. BTDT.
>>> as that is the PHP 5 default,
>>
>> How did you get this idea?
>
> http://uk1.php.net/manual/en/function.htmlentities.php
>
> says:
> Like htmlspecialchars(), htmlentities() takes an optional third
> argument encoding which defines encoding used in conversion. If omitted,
> the default value for this argument is ISO-8859-1 in versions of PHP
> prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.
>
>>> and makes the programming a lot easier.
>> If only it were so. PHP 5 still is oblivious as to character encoding.
>
> http://uk1.php.net/manual/en/book.iconv.php
That is interesting (I did not know about the new htmlentities() default),
but it does not refute my argument. First, there have been versions of
PHP 5 *before* 5.4.0. Second, so far you have to *tell* PHP 5 what encoding
you use; there is no automatism or assumed default encoding for source code
(as in some other recent programming languages) – *only* in the PHP 5.*4*
case *with* htmlentities() the default suffices. (Such an automatism is
considered for PHP *6*.)
That said, htmlentities() is insufficient to represent arbitrary Unicode
characters, encoded with UTF-8 server-side, in an HTML document if the
document encoding is not UTF-8; you would have to use htmlspecialchars()
which has the same default parameter value since PHP 5.4.0.
<http://php.net/htmlspecialchars>
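Because the default changed in PHP 5.4.0, passing the encoding explicitly avoids depending on the PHP version. The sketch below uses hypothetical sample strings to show what happens when the declared encoding does and does not match the data, and what htmlentities() does with a character that has no named entity in HTML 4.01. The "\u{...}" escapes assume PHP 7 or later.

```php
<?php
$latin1 = "caf\xE9";          // "café" as ISO-8859-1 bytes
$utf8   = "\u{E9} \u{130}";   // "é İ" as UTF-8

// Declared encoding matches the data: the byte passes through unchanged.
$ok = htmlspecialchars($latin1, ENT_QUOTES, 'ISO-8859-1');

// Declared encoding does not match: "\xE9" alone is not valid UTF-8, and
// without ENT_SUBSTITUTE or ENT_IGNORE the result is an empty string.
$empty = htmlspecialchars($latin1, ENT_QUOTES, 'UTF-8');

// htmlentities() with the HTML 4.01 table only produces named entities such
// as &eacute;; İ has none there and stays a raw UTF-8 character.
$named = htmlentities($utf8, ENT_QUOTES | ENT_HTML401, 'UTF-8');

var_dump($ok === $latin1, $empty === '', $named);
```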
PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7(at)news(dot)demon(dot)co(dot)uk> (2004)
Re: Processing accented characters submitted from forms [message #184502 is a reply to message #184498] |
Fri, 03 January 2014 15:11   |
Jerry Stuckle
On 1/3/2014 9:30 AM, Ben Bacarisse wrote:
> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes:
>
>> On Fri, 03 Jan 2014 12:37:27 +0000, Ben Bacarisse wrote:
>>
>>> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes: <snip>
>>>> We're already using iso-8859-1 for the whole website. It will be a lot
>>>> of work to change all that, so I guess we'll have to put up with the
>>>> odd Turkish I causing problems.
>>>
>>> It's not clear (to me at least) what's happening to the data, but as far
>>> as any normal set of HTML pages is concerned (PHP generated or
>>> otherwise) you don't have to put up with a dotted I causing problems on
>>> an ISO-8859-1 encoded page. You can represent any Unicode character in
>>> a page using character entities (browser and font support is always an
>>> issue but not nowadays for anything as ordinary as İ).
>>
>> I think it must be the browser that is encoding the character because İ
>> is not supported by iso-8859-1.
>
> Note that the browser behaviour can be altered by form attributes
> (specifically accept-charset). You can have a form that accepts UTF-8
> on an ISO-8859-1 served page.
>
>> It arrives in the request data as the html numeric entity code, as that
>> is the only way it can be transmitted.
>>
>> This causes issues:
>>
>> As I always htmlencode user entered data before display, it means that it
>> gets encoded twice. I'll have to add the 'disable double encode' flag
>> throughout my code :-)
>
> Sure. One way or another you need to get the right encoding. This
> method is not perfect since a user typing &#304; into a form may not
> expect a dotted I to come out.
>
> The best method is (probably) to:
> (a) Give UTF-8 as the form's accept-charset.
> (b) Encode htmlentities giving UTF-8 as the encoding. This should leave
> the UTF-8 characters as UTF-8.
> (c) Use mb_convert_encoding($etext, "HTML-ENTITIES", "UTF-8") to make
> the string displayable in a page regardless of the page's character
> encoding.
>
>> Secondly, it will be added to the database as the entity code, so this
>> will break searching the database etc...
>
> If you take the approach of accepting UTF-8 from the form, you can put
> that directly into the database.
>
>> I think the proper fix would be to convert to UTF-8.
>> But that's a lot of work. For now, I think I'll just manually translit the
>> codes that cause issues.
>
> You really only need UTF-8 in the database. The page encoding is not
> that important.
>
I beg to differ. Page encoding is important if you want the correct
characters displayed.
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================
Re: Processing accented characters submitted from forms [message #184503 is a reply to message #184500] |
Fri, 03 January 2014 16:08   |
Thomas 'PointedEars'
Thomas 'Ingrid' Lahn wrote:
> JohnT wrote:
>> On Fri, 03 Jan 2014 13:53:04 +0100, Thomas 'PointedEars' Lahn wrote:
>>> JohnT wrote:
>>>> [UTF-8] is the PHP 5 default,
>>>
>>> How did you get this idea?
>>
>> http://uk1.php.net/manual/en/function.htmlentities.php
>>
>> says:
>> Like htmlspecialchars(), htmlentities() takes an optional third
>> argument encoding which defines encoding used in conversion. If omitted,
>> the default value for this argument is ISO-8859-1 in versions of PHP
>> prior to 5.4.0, and UTF-8 from PHP 5.4.0 onwards.
>
> […]
> That said, htmlentities() is insufficient to represent arbitrary Unicode
> characters, encoded with UTF-8 server-side, in an HTML document if the
> document encoding is not UTF-8; you would have to use htmlspecialchars()
> which has the same default parameter value since PHP 5.4.0.
>
> <http://php.net/htmlspecialchars>
Actually, it is worse. In such a document, to refer to even those Unicode
characters for which there is *not* a character entity reference in HTML,
you have to use mb_encode_numericentity():
$ php -r 'echo mb_encode_numericentity("∎", array(0x0, 0x10000, 0, 0xfffff),
"UTF-8") . PHP_EOL;'
&#8718;
$ locale
LANG=de_CH.UTF-8
LANGUAGE=
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES=en_US.UTF-8
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
----------
<http://php.net/mb_encode_numericentity>
None of this is necessary if you use UTF-8 throughout.
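The second argument of mb_encode_numericentity() is a conversion map of the form (start, end, offset, mask). The sketch below compares the map used in the command above with an assumed alternative that starts at 0x80, to show how the choice of range affects ASCII characters. The sample string is hypothetical and the mbstring extension is required.

```php
<?php
$text = "A\u{220E}"; // "A" followed by ∎ (U+220E)

// Map from the post: the range starts at 0x0, so ASCII is converted too.
$all = mb_encode_numericentity($text, [0x0, 0x10000, 0, 0xfffff], 'UTF-8');

// Assumed alternative map: the range starts at 0x80, so ASCII (including
// markup already escaped with htmlspecialchars()) is left untouched.
$nonAscii = mb_encode_numericentity($text, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

echo $all, PHP_EOL;      // &#65;&#8718;
echo $nonAscii, PHP_EOL; // A&#8718;
```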
PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7(at)news(dot)demon(dot)co(dot)uk> (2004)
Re: Processing accented characters submitted from forms [message #184509 is a reply to message #184506] |
Fri, 03 January 2014 21:54   |
Thomas 'PointedEars'
Ben Bacarisse wrote:
> Jerry Stuckle <jstucklex(at)attglobal(dot)net> writes:
>> On 1/3/2014 9:30 AM, Ben Bacarisse wrote:
> <snip>
>>> You really only need UTF-8 in the database. The page encoding is not
>>> that important.
>>
>> I beg to differ. Page encoding is important if you want the correct
>> characters displayed.
>
> It's important, but not *that* important. The OP says that changing it
> is not an option, so I gave an example of how one can finesse the page
> encoding entirely -- by converting data you take from the data base to
> ASCII.
You can do that – at the risk of increasing the server-side runtime and
memory usage of the PHP program, the size of the output, the loading and
rendering time, and memory usage of the document client-side, considerably,
to no substantial advantage. It is the *World Wide* Web, and for that
reason alone UTF-8 support has been ubiquitous there for years.
PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee
Re: Processing accented characters submitted from forms [message #184511 is a reply to message #184469] |
Fri, 03 January 2014 22:01   |
Ben Bacarisse
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> writes:
> Ben Bacarisse wrote:
>
>> Jerry Stuckle <jstucklex(at)attglobal(dot)net> writes:
>>> On 1/3/2014 9:30 AM, Ben Bacarisse wrote:
>> <snip>
>>>> You really only need UTF-8 in the database. The page encoding is not
>>>> that important.
>>>
>>> I beg to differ. Page encoding is important if you want the correct
>>> characters displayed.
>>
>> It's important, but not *that* important. The OP says that changing it
>> is not an option, so I gave an example of how one can finesse the page
>> encoding entirely -- by converting data you take from the data base to
>> ASCII.
>
> You can do that – at the risk of increasing the server-side runtime and
> memory usage of the PHP program, the size of the output, the loading and
> rendering speed, and memory usage of the document client-side, considerably,
> to no substantial advantage. It is the *World Wide* Web, and for that
> reason alone UTF-8 support has been ubiquitous there for years.
Yes, it's very far from ideal in general, but the OP seems resolved to
switch to UTF-8 pages in the long run so a temporary solution might be
acceptable in this case. For one thing, switching the database to UTF-8
will be needed for the long-term solution, so some of the temporary fix
won't be wasted work.
--
Ben.
Re: Processing accented characters submitted from forms [message #184512 is a reply to message #184506] |
Sat, 04 January 2014 00:59  |
Jerry Stuckle
On 1/3/2014 3:28 PM, Ben Bacarisse wrote:
> Jerry Stuckle <jstucklex(at)attglobal(dot)net> writes:
>> On 1/3/2014 9:30 AM, Ben Bacarisse wrote:
> <snip>
>>> You really only need UTF-8 in the database. The page encoding is not
>>> that important.
>>
>> I beg to differ. Page encoding is important if you want the correct
>> characters displayed.
>
> It's important, but not *that* important. The OP says that changing it
> is not an option, so I gave an example of how one can finesse the page
> encoding entirely -- by converting data you take from the data base to
> ASCII.
>
No, it's not important if you don't care whether characters are displayed
correctly, or whether characters are sent in from the client correctly. Just
storing in the database as UTF-8 and converting to/from ASCII will not
solve these problems.
But if you DO care about such mundane things, then you need to be using
an encoding which supports those characters.
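The difference between the two positions can be illustrated with a small round-trip check. This is only a sketch with a hypothetical sample string, assuming the mbstring extension: a plain conversion to ISO-8859-1 loses İ, while an ASCII string with numeric references can be converted back without loss.

```php
<?php
$original = "\u{130}stanbul"; // "İstanbul" as UTF-8

// Plain conversion to an encoding without İ: the character is replaced or
// dropped (depending on the mbstring substitute setting) and cannot be
// recovered afterwards.
$latin1 = mb_convert_encoding($original, 'ISO-8859-1', 'UTF-8');
$back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// ASCII with numeric references: reversible with the same conversion map.
$map      = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii    = mb_encode_numericentity($original, $map, 'UTF-8');
$restored = mb_decode_numericentity($ascii, $map, 'UTF-8');

var_dump($back === $original);     // bool(false)
var_dump($restored === $original); // bool(true)
```

The reference form only helps on output. Data sent in by the client still depends on the form's accept-charset, as discussed earlier in the thread.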
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================