DOMDocument loadHTML and double UTF8 encode [message #170254] Fri, 22 October 2010 22:09
roger21
Hi,

I use DOMDocument::loadHTML to parse pages from this forum:
http://forum.hardware.fr/ (the forum is in UTF-8). My problem is that
some pages are correctly seen as UTF-8, like this one:
http://forum.hardware.fr/hfr/Hardware/liste_sujet-1.htm
and some are not, like this one:
http://forum.hardware.fr/hfr/HardwarePeripheriques/liste_sujet-1.htm
so the second kind ends up doubly UTF-8 encoded.

So I test whether the page is doubly encoded, and if it is, I
utf8_decode the text value I want. But there are side effects: for
example, the euro sign is not doubly encoded, so it turns into garbage
when utf8_decoded... and I don't know whether other characters behave
the same way.

So I am lost (and pissed). Any idea how I should handle all this?
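
A minimal sketch of the effect described above, assuming PHP's mbstring extension is available. It uses mb_convert_encoding in place of utf8_encode/utf8_decode, which should behave the same way for ISO-8859-1. The byte strings are illustrative and are not taken from the forum pages.

```php
<?php
// "é" encoded once as UTF-8 (bytes C3 A9).
$once = "\xC3\xA9";

// Simulate a parser that mistakes those bytes for ISO-8859-1
// and converts them to UTF-8 a second time.
$twice = mb_convert_encoding($once, 'UTF-8', 'ISO-8859-1');
echo bin2hex($twice), "\n"; // c383c2a9

// Decoding once undoes the damage for this character.
$repaired = mb_convert_encoding($twice, 'ISO-8859-1', 'UTF-8');
echo bin2hex($repaired), "\n"; // c3a9

// A euro sign that was NOT double encoded (bytes E2 82 AC) has no
// ISO-8859-1 equivalent, so the same decoding step destroys it.
$euro = "\xE2\x82\xAC";
$lost = mb_convert_encoding($euro, 'ISO-8859-1', 'UTF-8');
echo bin2hex($lost), "\n"; // 3f, i.e. "?" with mbstring's default substitute character
```

This would explain the symptom: any character that reaches the DOM correctly encoded (for instance because the page wrote it as an entity, which is only a guess here) and lies outside ISO-8859-1 is lost by a blanket decode.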

Re: DOMDocument loadHTML and double UTF8 encode [message #170261 is a reply to message #170254] Sat, 23 October 2010 12:38
Robert Hairgrove
Strange, because when I load those pages, both are seen as being UTF-8
encoded by Mozilla Firefox 3.6.11 on my system (Linux Ubuntu Hardy 8.04
LTS), and everything seems to display correctly.

But, as I recently discovered, a lot depends on whether the forum's
server actually sends an HTTP Content-Type header with a UTF-8 charset
declaration (in PHP, via header()) before displaying the pages, instead
of merely using an HTML meta tag. If not, the page might still be
displayed as some other character set, for example ISO-8859-1 ... which
wouldn't be so bad if the extended/accented characters were correctly
translated as HTML entities by the forum software, and obviously they
are not (i.e. the source for the pages shows "á" in plain text instead
of "&aacute;").

Of course, the forum member's client browser might not send messages
encoded in UTF-8, and the result can be garbage -- Russian text encoded
as Windows 1251, for example, and copied and pasted from an editor into
the browser will sometimes display as ISO-8859-1 extended characters in
some forums I have seen.

One thing you might want to try is to compare some original text with
the result of a utf8_encode(utf8_decode(orig_text)) combination, or
maybe vice-versa. If they are the same, then the page is being
interpreted as UTF-8; if not, it is most likely being interpreted as
ISO-8859-1 or some other character set.
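
A hedged variant of this check, assuming the mbstring extension. A plain round trip cannot separate the two cases on its own, because text encoded once and text encoded twice both survive utf8_encode(utf8_decode(...)) unchanged as long as they use only ISO-8859-1 characters. The sketch below instead decodes one layer and asks whether the result is still valid multi-byte UTF-8. The helper name looks_double_encoded is hypothetical and the test is only a heuristic.

```php
<?php
// Hypothetical helper: guesses whether $text (valid UTF-8) was encoded twice.
function looks_double_encoded(string $text): bool
{
    if (!mb_check_encoding($text, 'UTF-8')) {
        return false; // not valid UTF-8 at all, so not a double-encoding case
    }
    // Undo one layer of encoding.
    $decoded = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    // Double-encoded text decodes to bytes that are again valid UTF-8
    // and still contain at least one non-ASCII byte.
    return preg_match('/[\x80-\xFF]/', $decoded) === 1
        && mb_check_encoding($decoded, 'UTF-8');
}

var_dump(looks_double_encoded("\xC3\xA9"));         // bool(false): "é" encoded once
var_dump(looks_double_encoded("\xC3\x83\xC2\xA9")); // bool(true): "é" encoded twice
var_dump(looks_double_encoded("\xE2\x82\xAC"));     // bool(false): "€" encoded once
```

The heuristic can misfire on text that legitimately contains sequences such as "Ã©", and a string mixing doubly encoded and correctly encoded characters is still reported as double encoded even though decoding it would lose the correct ones.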

Also, the user comments posted here might prove to be helpful:
http://ch2.php.net/manual/en/domdocument.loadhtml.php

Re: DOMDocument loadHTML and double UTF8 encode [message #170263 is a reply to message #170261] Sat, 23 October 2010 13:35
roger21
Thank you for your answer. When I say "seen", I mean seen by the
loadHTML function, which converts everything to UTF-8 if it thinks that
is not already the case. Of course both pages are in UTF-8 (according to
the headers, the meta tags and the content, hence my double-encoding
problem), and that's why I'm a bit upset (though I won't say the forum's
pages are clean; the markup is pretty crappy overall).

My problem is that when I utf8_decode my doubly encoded pages, I get
character issues that I don't get (with the same characters) when the
page has not been over-encoded by loadHTML.

And I don't want to decode the text first, because I want to stay in
UTF-8 (and I would lose characters if I decoded it first).

I already checked the comments. They all seem related, but I tried most
of them and the result is either the same or worse (which is why I'm
asking here :p).

Maybe I should try to understand why some pages are doubly encoded; they
may contain some junk that I could fix before handing them to loadHTML.
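
One possible way to fix the input before handing it to loadHTML, sketched under assumptions: it needs the mbstring extension, and it is not one of the approaches named in this thread. The idea is to replace every non-ASCII character with a numeric entity so the parser never has to guess the encoding. The HTML fragment is a made-up example, not a copy of the forum's markup.

```php
<?php
// Hypothetical page fragment: accented title BEFORE the charset meta tag.
$html = "<html><head><title>Caf\xC3\xA9</title>"
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . "</head><body><p>10 \xE2\x82\xAC</p></body></html>";

// Turn every code point from U+0080 upwards into a numeric entity
// (é becomes &#233;), leaving a pure ASCII document.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$ascii = mb_encode_numericentity($html, $map, 'UTF-8');

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($ascii);

// DOM text values come back as UTF-8, encoded exactly once.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
$para  = $doc->getElementsByTagName('p')->item(0)->textContent;
echo bin2hex($title), "\n"; // 436166c3a9     ("Café")
echo bin2hex($para), "\n";  // 313020e282ac   ("10 €")
```

Because the text stays in UTF-8 throughout, no utf8_decode step is needed afterwards and characters outside ISO-8859-1, such as the euro sign, are preserved.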

Re: DOMDocument loadHTML and double UTF8 encode [message #170265 is a reply to message #170263] Sat, 23 October 2010 14:20
roger21
OK, one of the comments seems useful: the title tag comes before the
charset meta tag, and when the title contains accents, loadHTML
over-encodes it. If I move the meta tags before the title tag, that
should fix it (according to the comment). I will try that.
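
A rough sketch of that idea, assuming the page has a literal <head> tag. Rather than moving the existing meta tag, it inserts an extra charset declaration right after <head>, so the parser meets it before the title. The helper name declare_utf8_first and the sample markup are hypothetical, and the result may depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical helper: puts a UTF-8 charset declaration directly after
// the first <head> tag so it precedes the <title>.
function declare_utf8_first(string $html): string
{
    $meta = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
    // '$0' re-emits the matched <head ...> tag; limit 1 touches only the first match.
    return preg_replace('/<head[^>]*>/i', '$0' . $meta, $html, 1);
}

// Made-up fragment with an accented title ahead of the page's own meta tag.
$html = "<html><head><title>Caf\xC3\xA9</title>"
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body></body></html>';

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML(declare_utf8_first($html));

$title = $doc->getElementsByTagName('title')->item(0)->textContent;
echo bin2hex($title), "\n"; // hex dump of the title bytes, to check for double encoding
```

If the title comes out as 43 61 66 c3 a9 ("Café" encoded once), the declaration was honoured; a c3 83 c2 a9 tail would mean the text was still encoded twice.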