Trying to decode text that is supposed to be ISO-8859-1 [message #175395]

Wed, 14 September 2011 01:56

Bart Kastermans
Messages: 1
Registered: September 2011

Karma: 0

Junior Member

I have downloaded a file that claims to be ISO-8859-1. In it (among
much else) are the bytes shown here. The first column is the
character, the second is ord(character), and the third and fourth are
its binary and hexadecimal representations, respectively.

P / 80 / 01010000 / 50
l / 108 / 01101100 / 6c
z / 122 / 01111010 / 7a
e / 101 / 01100101 / 65
\303 / 195 / 11000011 / c3
\205 / 133 / 10000101 / 85
\313 / 203 / 11001011 / cb
\206 / 134 / 10000110 / 86

This is supposed to be ISO-8859-1 encoded, and should encode the
character U+0148 (\v{n}; Latin small letter n with caron).

Does anybody have any idea how I could decode this (or how it was
encoded in the first place)? Any suggestions would be greatly
appreciated.

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175397 is a reply to message #175395]

Wed, 14 September 2011 03:42

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
> I have downloaded a file that claims to be ISO-8859-1. In it (among
> many other stuff) are the bytes shown here (first column is the
> character, the second is ord(character), the third and fourth are binary
> respectively hexidecimal representations of the character.
>
> P / 80 / 01010000 / 50
> l / 108 / 01101100 / 6c
> z / 122 / 01111010 / 7a
> e / 101 / 01100101 / 65
> \303 / 195 / 11000011 / c3
> \205 / 133 / 10000101 / 85
> \313 / 203 / 11001011 / cb
> \206 / 134 / 10000110 / 86
>
> This is supposed to be ISO-8859-1 encoded, and should encode the
> character U+0148 (\v{n}; Latin small letter n with caron).
>
> Does anybody have any idea how I could decode this (or how it was
> encoded in the first place)? Any suggestions would be greatly
> appreciated.

It's a UTF-8-encoded representation of a false ISO-8859-1 (? probably
CP1251, actually) display of the CORRECT UTF-8 for the string you're
hoping for. Someone basically doubled up on the conversion to
produce that.
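
The double conversion can be reproduced in a few lines. This is only a sketch, written in Python rather than PHP. It assumes the intended text is "Plzeň" (inferred from the bytes in the question) and tries two candidates for the misread step: Windows-1252 (Python's cp1252 codec) and CP1251. The helper name double_encode is made up for illustration.

```python
# Bytes from the question: "Plze" followed by c3 85 cb 86.
observed = bytes.fromhex("50 6c 7a 65 c3 85 cb 86")


def double_encode(text: str, wrong_codec: str) -> bytes:
    """Encode as UTF-8, misread those bytes as wrong_codec, then encode as UTF-8 again."""
    return text.encode("utf-8").decode(wrong_codec).encode("utf-8")


intended = "Plze\u0148"  # assumption: the intended text ends in U+0148

via_cp1252 = double_encode(intended, "cp1252")
via_cp1251 = double_encode(intended, "cp1251")

print(via_cp1252.hex(" "))     # 50 6c 7a 65 c3 85 cb 86
print(via_cp1252 == observed)  # True
print(via_cp1251 == observed)  # False
```

Under these assumptions only the Windows-1252 misreading reproduces the bytes in the file.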

--
82. I will not shoot at any of my enemies if they are standing in front
of the crucial support beam to a heavy, dangerous, unbalanced
structure.
--Peter Anspach's list of things to do as an Evil Overlord

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175399 is a reply to message #175395]

Wed, 14 September 2011 10:25

alvaro.NOSPAMTHANX
Messages: 277
Registered: September 2010

Karma: 0

Senior Member

On 14/09/2011 3:56, Bart Kastermans wrote:
> I have downloaded a file that claims to be ISO-8859-1. In it (among
> many other stuff) are the bytes shown here (first column is the
> character, the second is ord(character), the third and fourth are binary
> respectively hexidecimal representations of the character.
>
> P / 80 / 01010000 / 50
> l / 108 / 01101100 / 6c
> z / 122 / 01111010 / 7a
> e / 101 / 01100101 / 65
> \303 / 195 / 11000011 / c3
> \205 / 133 / 10000101 / 85
> \313 / 203 / 11001011 / cb
> \206 / 134 / 10000110 / 86
>
> This is supposed to be ISO-8859-1 encoded, and should encode the
> character U+0148 (\v{n}; Latin small letter n with caron).

Funny... I think that character (ň) does not even exist in ISO-8859-1:

http://www.fileformat.info/info/unicode/char/148/index.htm
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout

And in fact the 0x85 and 0x86 positions are empty in ISO-8859-1.

The mb_detect_encoding() function suggests that the string is actually
in UTF-8 and contains two characters, encoded as the byte pairs 0xC3 0x85
and 0xCB 0x86 (Åˆ). The "Åˆ" string is exactly what you get if you encode
"ň" in UTF-8 (0xC5 0x88) and try to display it as ISO-8859-1 (in practice
Windows-1252, where 0x88 is "ˆ"), so I guess that's what the data creator
is doing.
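
If that guess is right, the damage can be undone by running the steps backwards. The following is a minimal sketch in Python, not a tested PHP solution. The function name undo_double_utf8 and the default of Windows-1252 (cp1252) for the misread step are assumptions.

```python
def undo_double_utf8(data: bytes, wrong_codec: str = "cp1252") -> str:
    """Reverse a UTF-8 -> wrong_codec -> UTF-8 double encoding."""
    text = data.decode("utf-8")     # the mangled text, ending in U+00C5 U+02C6
    raw = text.encode(wrong_codec)  # back to the original UTF-8 bytes (... c5 88)
    return raw.decode("utf-8")      # the intended text


damaged = bytes.fromhex("50 6c 7a 65 c3 85 cb 86")
repaired = undo_double_utf8(damaged)

print(len(repaired))                 # 5
print(f"U+{ord(repaired[-1]):04X}")  # U+0148
```

This raises UnicodeEncodeError or UnicodeDecodeError on input that was not mangled in this particular way, so it is best applied only to data known to be affected. Python's cp1252 codec also leaves five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined, so characters whose UTF-8 form contains one of those bytes may not be recoverable with this exact approach.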

> Does anybody have any idea how I could decode this (or how it was
> encoded in the first place)? Any suggestions would be greatly
> appreciated.

To begin with, you cannot use ISO-8859-1 as target encoding if you want
to use U+0148.
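
A quick way to check this claim, sketched in Python with its standard codecs (the variable names are illustrative): U+0148 cannot be encoded as ISO-8859-1 at all, while ISO-8859-2 and UTF-8 both have a place for it.

```python
n_caron = "\u0148"  # LATIN SMALL LETTER N WITH CARON

try:
    n_caron.encode("iso-8859-1")
    fits_latin1 = True
except UnicodeEncodeError:
    # ISO-8859-1 only covers U+0000..U+00FF.
    fits_latin1 = False

latin2_bytes = n_caron.encode("iso-8859-2")
utf8_bytes = n_caron.encode("utf-8")

print(fits_latin1)          # False
print(latin2_bytes.hex())   # f2
print(utf8_bytes.hex(" "))  # c5 88
```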

Now, if you decide to switch to UTF-8... well, I'll report back if I
find something more precise :)

--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-humor site: http://www.demogracia.com
--

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175401 is a reply to message #175397]

Wed, 14 September 2011 12:07

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Peter H. Coffin wrote:

> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>> many other stuff) are the bytes shown here (first column is the
>> character, the second is ord(character), the third and fourth are binary
>> respectively hexidecimal representations of the character.
>>
>> P / 80 / 01010000 / 50
>> l / 108 / 01101100 / 6c
>> z / 122 / 01111010 / 7a
>> e / 101 / 01100101 / 65
>> \303 / 195 / 11000011 / c3
>> \205 / 133 / 10000101 / 85
>> \313 / 203 / 11001011 / cb
>> \206 / 134 / 10000110 / 86
>>
>> This is supposed to be ISO-8859-1 encoded, and should encode the
>> character U+0148 (\v{n}; Latin small letter n with caron).
>>
>> Does anybody have any idea how I could decode this (or how it was
>> encoded in the first place)? Any suggestions would be greatly
>> appreciated.
>
> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
> CP1251, actually) […]

Windows-125_2_ (Western) corresponds largely with ISO-8859-1. Windows-1251,
which is the proper name for that character set and encoding, is Cyrillic
above 0x7F, and corresponds largely with ISO-8859-5.

PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175404 is a reply to message #175401]

Wed, 14 September 2011 13:37

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Wed, 14 Sep 2011 14:07:27 +0200, Thomas 'PointedEars' Lahn wrote:
> Peter H. Coffin wrote:
>
>> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>>> many other stuff) are the bytes shown here (first column is the
>>> character, the second is ord(character), the third and fourth are binary
>>> respectively hexidecimal representations of the character.
>>>
>>> P / 80 / 01010000 / 50
>>> l / 108 / 01101100 / 6c
>>> z / 122 / 01111010 / 7a
>>> e / 101 / 01100101 / 65
>>> \303 / 195 / 11000011 / c3
>>> \205 / 133 / 10000101 / 85
>>> \313 / 203 / 11001011 / cb
>>> \206 / 134 / 10000110 / 86
>>>
>>> This is supposed to be ISO-8859-1 encoded, and should encode the
>>> character U+0148 (\v{n}; Latin small letter n with caron).
>>>
>>> Does anybody have any idea how I could decode this (or how it was
>>> encoded in the first place)? Any suggestions would be greatly
>>> appreciated.
>>
>> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
>> CP1251, actually) […]
>
> Windows-125_2_ (Western) corresponds largely with ISO-8859-1. Windows-1251,
> which is the proper name for that character set and encoding, is Cyrillic
> above 0x7F, and corresponds largely with ISO-8859-5.

Yeah, I know that. But there are 0x8n values in the hex that have no
representation in 8859-1 but do in CP1251. And there's a LOT more
charset-unaware stuff out there that assumes all the world is CP1251
than assumes everything is 8859-1.

--
A government big enough to give you everything you want is a government
big enough to take from you everything you have.
-- Gerald Ford in an address to Congress on August 12, 1974

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175421 is a reply to message #175404]

Tue, 20 September 2011 22:14

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Peter H. Coffin wrote:

> On Wed, 14 Sep 2011 14:07:27 +0200, Thomas 'PointedEars' Lahn wrote:
>> Peter H. Coffin wrote:
>>> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>>>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>>>> many other stuff) are the bytes shown here (first column is the
>>>> character, the second is ord(character), the third and fourth are
>>>> binary respectively hexidecimal representations of the character.
>>>>
>>>> P / 80 / 01010000 / 50
>>>> l / 108 / 01101100 / 6c
>>>> z / 122 / 01111010 / 7a
>>>> e / 101 / 01100101 / 65
>>>> \303 / 195 / 11000011 / c3
>>>> \205 / 133 / 10000101 / 85
>>>> \313 / 203 / 11001011 / cb
>>>> \206 / 134 / 10000110 / 86
>>>>
>>>> This is supposed to be ISO-8859-1 encoded, and should encode the
>>>> character U+0148 (\v{n}; Latin small letter n with caron).
>>>>
>>>> Does anybody have any idea how I could decode this (or how it was
>>>> encoded in the first place)? Any suggestions would be greatly
>>>> appreciated.
>>> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
>>> CP1251, actually) […]
>> Windows-125_2_ (Western) corresponds largely with ISO-8859-1.
>> Windows-1251, which is the proper name for that character set and
>> encoding, is Cyrillic above 0x7F, and corresponds largely with
>> ISO-8859-5.
>
> Yeah, I know that. But there's 0x8n values in the hex that don't
> represent in 8859-1 but do in CP1251. And there's a LOT more
> charset-unaware stuff out there that assumes all the world is CP1251
> than assumes everything is 8859-1.

You are missing the point. Windows-125_1_ (or "CP1251" as you put it) is
not remotely the same as ISO-8859-1; Windows-125_2_ is.

It is also misleading to state that 0x85 and 0x86 have no representation in
the rarely used ISO/IEC 8859-1, because that encoding is _not_ equivalent
to ISO-8859-1, which is what the OP stated and what you referred to. In
ISO-8859-1, 0x85 represents NEL (ISO C1 Next Line, which marks end-of-line
on some IBM mainframes) and 0x86 represents SSA (ISO C1 Start of Selected
Area, used by block-oriented terminals).
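
To make the difference concrete, here is a small sketch in Python that decodes the two bytes under each character set discussed in this thread. It relies on Python's built-in codecs: its iso-8859-1 codec maps 0x80 to 0x9F onto the C1 controls U+0080 to U+009F, and the two Windows code pages assign printable characters to 0x85 and 0x86 instead. The variable names are illustrative.

```python
data = bytes([0x85, 0x86])

# Decode the same two bytes under each single-byte encoding.
decoded = {codec: data.decode(codec) for codec in ("iso-8859-1", "cp1252", "cp1251")}

for codec, text in decoded.items():
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{codec}: {points}")

# iso-8859-1: U+0085 U+0086   (C1 controls NEL and SSA)
# cp1252: U+2026 U+2020       (horizontal ellipsis and dagger)
# cp1251: U+2026 U+2020       (horizontal ellipsis and dagger)
```

In the OP's file, however, 0x85 and 0x86 appear right after 0xC3 and 0xCB, which is where UTF-8 continuation bytes would sit, so they may not be standalone characters in any of these encodings.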

PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
