Re: UTF-8 charset [message #180854 is a reply to message #180850]
Fri, 22 March 2013 00:17 |
Thomas 'PointedEars'
M. Strobel wrote:
> On 21.03.2013 15:31, Adrian Tuddenham wrote:
>> M. Strobel <sorry_no_mail_here(at)nowhere(dot)dee> wrote:
>>> On 21.03.2013 00:13, The Cat in the Hat wrote:
>>>> Christoph Becker <cmbecker69(at)gmx(dot)de> wrote in
>>>> news:kide60$df1$1(at)speranza(dot)aioe(dot)org:
>>>> > The Cat in the Hat wrote:
>>>> >> How about omitting the smart quotes in your posts?
>>>> >>
>>>> >> Come back after you've figured out how to configure your newsreader
>>>> >> properly.
>>>> >
>>>> > Thomas' message was properly encoded as UTF-8. Isn't that acceptable
>>>> > in this NG?
>>>>
>>>> Usenet was designed to use ASCII, not UTF-8. At best, it's poor
>>>> netiquette in *any* NG.
>>>
>>> Utter nonsense.
ACK.
>> There are some computers that cannot read UTF-8.
That would be computers and software more than 20 years old now. UTF-8, one
of the encodings for the Unicode character set, was designed in 1992 CE; the
Unicode Standard, version 2.0, which specified the character set as it is
known today, was published in 1996 CE. All reasonably modern operating
systems, in effect all commonly used ones, support Unicode and provide
Unicode-capable fonts. Many have made a Unicode encoding their default
encoding; for example, NTFS encodes filenames using UTF-16, and UTF-8 has
been the default locale encoding on GNU/Linux systems for several years now.
Thus, a major criticism of PHP is that as of version 5.4 it still has no
native Unicode support, while other popular programming languages on the
Web, such as ECMAScript implementations and Python, do.
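
To see what “no native Unicode support” means in practice, consider this
sketch (mine, assuming the optional mbstring extension is loaded and the
string literal is UTF-8-encoded):

  <?php
  $s = "Füße";  // 4 characters, 6 bytes in UTF-8

  // Core string functions count *bytes*:
  var_dump(strlen($s));              // int(6)
  // Only the mbstring extension counts *characters*:
  var_dump(mb_strlen($s, 'UTF-8'));  // int(4)

  // Byte-oriented functions can even split a multi-byte sequence:
  var_dump(substr($s, 0, 2));              // "F" plus half of "ü"
  var_dump(mb_substr($s, 0, 2, 'UTF-8'));  // "Fü"
  ?>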
>> Always use the lowest common denominator if you want to communicate
>> effectively without excluding anyone.
Insisting on using *ancient*, and therefore inherently *insecure*, hardware
and software is no excuse for a claim as mind-bogglingly absurd as that
using UTF-8 encoding would be “poor netiquette in *any* NG”. First of all,
netiquette (network etiquette) is less concerned with the technical aspects
of messages than with the behavior towards each other that people exhibit
through them. That makes this clearly a case of “the pot calling the kettle
black”.
Speaking of netiquette, though, it *is* considered polite (on Usenet) to
introduce oneself with one's full name. People posting under unnecessarily
abbreviated names, pseudonyms, or plain nicknames, like “The Cat in the
Hat”, are usually frowned upon, laughed at, or simply ignored by longtime
regulars (like me). The same applies to people who violate technical
standards or quasi-standards such as RFC 5536, along with the AUPs of their
providers, by using non-addresses in address header field values:
<http://www.zedat.fu-berlin.de/NetNews-Regeln>
<http://www.kirchwitz.de/~amk/dni/netiquette> (based on a Big 8 original)
> More than 90% of the text is still readable if your usenet client only
> displays ASCII.
>
> In the anglophone world people get easily to think ASCII is sufficient,
> but in a global world even 16-bit Unicode (Basic Multilingual Plane) is
> not.
Those are good points, however:
> Some languages use UCS-16 internally, and it is not enough for all uses
> and users.
This is a common misconception. There is no “UCS-16”, and there never was.
At most there is, or was (depending on how you look at it), UCS-2 (Universal
Character Set, 2-byte form), to whose character set the Unicode character
set has been equivalent, code point for code point, since Unicode 2.0.
The underlying misconception here is confusing a character set with a
character encoding, which is probably also common because of the “charset”
parameter of Internet messages, which despite its name declares the
*character encoding* of a message.
UCS-2 is/was a standard for a *character set along with a specific,
fixed-width 16-bit encoding* covering all characters in the Basic
Multilingual Plane (BMP), and *only* those. This limitation, and the
encoding overhead for simple (purely US-ASCII-based) texts, led to the later
success of the competing standard, Unicode, with its variable-width
transformation formats.
See also: <http://en.wikipedia.org/wiki/UCS-16> (properly redirected to the
UCS article, with an explanation of why “UCS-16” is just wrong).
*UTF*-16 (Unicode Transformation Format, 16-bit) is *another* encoding, in
which each *code unit* of the sequence for a character is 16 bits wide; a
character may be encoded using more than one code unit (the same is true for
the other UTFs). The character set thus encoded is the Unicode character
set, and both are defined in the Unicode Standard. That character set
comprises, and its transformation formats can encode, a lot more than just
the BMP, although of the subsets of Unicode the BMP remains the
best-supported one, also due to pre-installed font support. There are
1'114'112 code points in Unicode (U+0000 to U+10FFFF), so in theory a text
encoded with one of the UTFs could contain 1'114'112 different characters.
(In practice, some code point ranges in the BMP are reserved for surrogates,
which UTF-16 combines in pairs to represent more than the 65536 potential
characters of the BMP while keeping the encoding relatively simple and
backwards-compatible.)
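
A sketch of the surrogate mechanism (the function name is mine, but the
arithmetic is the one defined in the Unicode Standard):

  <?php
  // Split a code point beyond the BMP into a UTF-16 surrogate pair.
  function toSurrogatePair($codePoint)
  {
      $v = $codePoint - 0x10000;      // 20 bits remain
      $high = 0xD800 + ($v >> 10);    // high (lead) surrogate
      $low  = 0xDC00 + ($v & 0x3FF);  // low (trail) surrogate
      return array($high, $low);
  }

  // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP:
  list($hi, $lo) = toSurrogatePair(0x1D11E);
  printf("U+1D11E -> 0x%04X 0x%04X\n", $hi, $lo);
  // U+1D11E -> 0xD834 0xDD1E
  ?>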
> So UTF-8 is THE solution. There must be some progress sometime.
ACK.
> ASCII, 7-bit encodings and octal are left behind.
US-ASCII *is* a 7-bit encoding. “Octal”?
The most important thing about UTF-8 is that it is equivalent to US-ASCII
for code points below U+0080 (one 8-bit code unit per character, with the
MSB always 0, encoding the same characters as US-ASCII). Therefore UTF-8,
through Unicode, allows texts with the greatest possible range of characters
(all written languages, even some extinct ones, common symbols, punctuation
etc.) while using the least amount of memory when this range is _not_ fully
used.
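
The variable-width scheme is easy to sketch (a simplification of RFC 3629;
the function name is mine):

  <?php
  // Number of UTF-8 code units (bytes) needed for a code point.
  // (Surrogate code points, which UTF-8 does not encode, are ignored.)
  function utf8Length($codePoint)
  {
      if ($codePoint < 0x80)    return 1;  // US-ASCII range, MSB 0
      if ($codePoint < 0x800)   return 2;
      if ($codePoint < 0x10000) return 3;
      return 4;                            // up to U+10FFFF
  }

  printf("U+0041 (A): %d byte(s)\n", utf8Length(0x41));     // 1
  printf("U+00FC (ü): %d byte(s)\n", utf8Length(0xFC));     // 2
  printf("U+20AC (€): %d byte(s)\n", utf8Length(0x20AC));   // 3
  printf("U+1D11E:    %d byte(s)\n", utf8Length(0x1D11E));  // 4
  ?>

So a purely US-ASCII text is byte-for-byte identical in UTF-8, while other
texts only pay the extra bytes for the characters that actually need them.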
Actually, it should not be necessary to explain all this to people
subscribed to a programming newsgroup, and to Web developers in particular,
but there you are.
<http://www.joelonsoftware.com/articles/Unicode.html>
<http://unicode.org/faq/>
HTH
PointedEars
--
> If you get a bunch of authors […] that state the same "best practices"
> in any programming language, then you can bet who is wrong or right...
Not with javascript. Nonsense propagates like wildfire in this field.
-- Richard Cornford, comp.lang.javascript, 2011-11-14