Re: Processing accented characters submitted from forms [message #184498 is a reply to message #184490]
Fri, 03 January 2014 14:30
Ben Bacarisse
JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes:
> On Fri, 03 Jan 2014 12:37:27 +0000, Ben Bacarisse wrote:
>
>> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes: <snip>
>>> We're already using iso-8859-1 for the whole website. It will be a lot
>>> of work to change all that, so I guess we'll have to put up with the
>>> odd Turkish I causing problems.
>>
>> It's not clear (to me at least) what's happening to the data, but as far
>> as any normal set of HTML pages is concerned (PHP generated or
>> otherwise) you don't have to put up with a dotted I causing problems on
>> an ISO-8859-1 encoded page. You can represent any Unicode character in
>> a page using character entities (browser and font support is always an
>> issue but not nowadays for anything as ordinary as İ).
>
> I think it must be the browser that is encoding the character because İ
> is not supported by iso-8859-1.
Note that the browser behaviour can be altered by form attributes
(specifically accept-charset). You can have a form that accepts UTF-8
on an ISO-8859-1 served page.
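
A minimal sketch of such a form, written as a small PHP helper. The function name render_form, the field name etext and the target save.php are placeholders of mine, not anything from your site; the only part that matters is the accept-charset attribute.

```php
<?php
// Hypothetical helper: builds a form that asks the browser to submit UTF-8,
// even though the page around it is served as ISO-8859-1.
function render_form($action)
{
    // Escape the action URL before placing it in an attribute.
    $action = htmlspecialchars($action, ENT_QUOTES, 'ISO-8859-1');

    return '<form method="post" action="' . $action . '" accept-charset="UTF-8">' . "\n"
         . '  <input type="text" name="etext">' . "\n"
         . '  <input type="submit" value="Send">' . "\n"
         . '</form>' . "\n";
}

echo render_form('save.php');
```

With this in place the submitted field should arrive as UTF-8 bytes rather than as numeric character references.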
> It arrives in the request data as the html numeric entity code, as that
> is the only way it can be transmitted.
>
> This causes issues:
>
> As I always htmlencode user entered data before display, it means that it
> gets encoded twice. I'll have to add the 'disable double encode' flag
> throughout my code :-)
Sure. One way or another you need to get the right encoding. This
method is not perfect since a user typing &#304; into a form may not
expect a dotted I to come out.
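
To illustrate the double-encoding problem and the flag you mention, here is a small sketch. I am assuming the "htmlencode" step is htmlspecialchars(), and the sample string is made up; the fourth argument (double_encode) is the one that matters.

```php
<?php
// Value as it might arrive from an ISO-8859-1 form: the browser has already
// replaced the unsupported character with a numeric character reference.
$submitted = 'Name: &#304;stanbul';

// Default behaviour escapes the ampersand again, so the page would show
// the literal text "&#304;" instead of the dotted I.
$double = htmlspecialchars($submitted, ENT_QUOTES, 'ISO-8859-1');

// With double_encode set to false, existing entities are left alone.
$single = htmlspecialchars($submitted, ENT_QUOTES, 'ISO-8859-1', false);

echo $double, "\n"; // Name: &amp;#304;stanbul
echo $single, "\n"; // Name: &#304;stanbul
```

Note that the second call cannot tell a reference produced by the browser from one the user typed by hand, which is the imperfection described above.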
The best method is (probably) to:
(a) Give UTF-8 as the form's accept-charset.
(b) Escape the text with htmlentities(), giving UTF-8 as the encoding.
This should leave characters that have no named entity (such as İ) as
UTF-8.
(c) Use mb_convert_encoding($etext, "HTML-ENTITIES", "UTF-8") to make
the string displayable in a page regardless of the page's character
encoding.
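
Steps (b) and (c) can be sketched roughly as follows. The function name to_displayable and the sample input are mine; it assumes the mbstring extension is loaded and that the input really is UTF-8. One caveat I am sure of: the "HTML-ENTITIES" pseudo-encoding is deprecated as of PHP 8.2, so newer versions emit a deprecation notice for it.

```php
<?php
// Hypothetical helper combining steps (b) and (c).
function to_displayable($etext)
{
    // (b) Escape markup characters. With the HTML 4.01 entity table used
    //     here, a character without a named entity (such as the dotted I)
    //     passes through as UTF-8.
    $escaped = htmlentities($etext, ENT_QUOTES, 'UTF-8');

    // (c) Turn the remaining non-ASCII characters into entities so the
    //     result is plain ASCII and safe under any page encoding.
    return mb_convert_encoding($escaped, 'HTML-ENTITIES', 'UTF-8');
}

// "\xC4\xB0" is the UTF-8 byte sequence for the dotted capital I (U+0130).
echo to_displayable("\xC4\xB0stanbul <b>"), "\n"; // &#304;stanbul &lt;b&gt;
```

On newer PHP versions, mb_encode_numericentity() with a conversion map covering everything above ASCII looks like the natural replacement for step (c), though I have not tried it against this exact case.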
> Secondly, it will be added to the database as the entity code, so this
> will break searching the database etc...
If you take the approach of accepting UTF-8 from the form, you can put
that directly into the database.
> I think the proper fix would be to convert to UTF-8.
> But that's a lot of work. For now, I think I'll just manually translit the
> codes that cause issues.
You really only need UTF-8 in the database. The page encoding is not
that important.
--
Ben.