Re: Unicode support [message #180992 is a reply to message #180981]
Sun, 31 March 2013 12:10
M. Strobel
Messages: 386 Registered: December 2011
On 30.03.2013 19:04, The Natural Philosopher wrote:
> On 30/03/13 14:55, Christoph Becker wrote:
>> The Natural Philosopher wrote:
>>
>>> On 30/03/13 13:25, Thomas 'PointedEars' Lahn wrote:
>>>
>>>> The same issue exists with characters outside the BMP in ECMAScript
>>>> implementations which uses 16-bit characters (usually one UTF-16 code
>>>> unit
>>>> per character). But you can work around that rather efficiently.
>>>>
>>>
>>> The problem become 'what do you mean by strlen()' - the space the
>>> characters will occupy in an constant width font, or the storage
>>> allocated to the string?
>>>
>>> Mostly we are concerned with the latter.
>>
>> I am more concerned about the number of characters the string holds.
>> Say, I want to get the last character:
>>
>> $str = '€';
>> echo $str[2];
>>
>>> Because lack of precision in font reproduction, or even in guaranteeing
>>> which font may be selected, renders the former an 'open' question.
>>>
>>> strlen('€')===3 is in fact the correct answer.
>>
>> I suppose most *higher level languages* define the length of a string as
>> the number of characters the string holds. Cf. ECMAScript's length
>> property and TCL's [string length]. Even PHP's mb_strlen() returns the
>> number of characters.
>>
>
> so what happens in a typographic ligature like 'ᴁ'?
>
> I think you are making a rod for your back here.
>
> The storage requirements are exact specific and useful.
>
> The concept of a 'character in a text string' is really not..and if you go deep into
> typography with kerning, leading,ligature and the like and the like you will
> understand why.
>
Of course you need both: the storage requirements and direct access to characters.
Maybe programming languages should use a full 32 bits per character internally, or
compress the Unicode string and use a good library for access.

C does not even know strings, just byte arrays.
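
To make the "full 32 bits per character" idea concrete, here is a rough PHP sketch. It assumes the mbstring extension is loaded; to_utf32() and char_at() are names made up for this illustration, not existing functions. Once the string is in UTF-32, every code point occupies four bytes, so both the count and direct access reduce to plain arithmetic.

```php
<?php
// Sketch only: assumes the mbstring extension is available.
$utf8 = "a\xE2\x82\xACz";   // 'a', U+20AC EURO SIGN, 'z': 1 + 3 + 1 bytes in UTF-8

// Hypothetical helper: re-encode so every code point takes exactly 4 bytes.
function to_utf32($s) {
    return mb_convert_encoding($s, 'UTF-32BE', 'UTF-8');
}

// Hypothetical helper: fetch code point number $i by offset arithmetic alone.
function char_at($utf32, $i) {
    return mb_convert_encoding(substr($utf32, $i * 4, 4), 'UTF-8', 'UTF-32BE');
}

$utf32 = to_utf32($utf8);

echo strlen($utf8), "\n";        // 5  (storage in UTF-8)
echo strlen($utf32), "\n";       // 12 (storage in UTF-32)
echo strlen($utf32) / 4, "\n";   // 3  (number of code points)
echo char_at($utf32, 1), "\n";   // the euro sign
```

The price is storage: the same text grows from 5 to 12 bytes here, which is the trade-off between the two requirements.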
/Str.
Re: Unicode support [message #180993 is a reply to message #180992]
Sun, 31 March 2013 12:41
The Natural Philosopher
Messages: 993 Registered: September 2010
On 31/03/13 13:10, M. Strobel wrote:
> On 30.03.2013 19:04, The Natural Philosopher wrote:
>> On 30/03/13 14:55, Christoph Becker wrote:
>>> The Natural Philosopher wrote:
>>>
>>>> On 30/03/13 13:25, Thomas 'PointedEars' Lahn wrote:
>>>>
>>>> > The same issue exists with characters outside the BMP in ECMAScript
>>>> > implementations which uses 16-bit characters (usually one UTF-16 code
>>>> > unit
>>>> > per character). But you can work around that rather efficiently.
>>>> >
>>>>
>>>> The problem become 'what do you mean by strlen()' - the space the
>>>> characters will occupy in an constant width font, or the storage
>>>> allocated to the string?
>>>>
>>>> Mostly we are concerned with the latter.
>>>
>>> I am more concerned about the number of characters the string holds.
>>> Say, I want to get the last character:
>>>
>>> $str = '€';
>>> echo $str[2];
>>>
>>>> Because lack of precision in font reproduction, or even in guaranteeing
>>>> which font may be selected, renders the former an 'open' question.
>>>>
>>>> strlen('€')===3 is in fact the correct answer.
>>>
>>> I suppose most *higher level languages* define the length of a string as
>>> the number of characters the string holds. Cf. ECMAScript's length
>>> property and TCL's [string length]. Even PHP's mb_strlen() returns the
>>> number of characters.
>>>
>>
>> so what happens in a typographic ligature like 'ᴁ'?
>>
>> I think you are making a rod for your back here.
>>
>> The storage requirements are exact specific and useful.
>>
>> The concept of a 'character in a text string' is really not..and if you go deep into
>> typography with kerning, leading,ligature and the like and the like you will
>> understand why.
>>
>
> Of course you need both, the storage requirements, and direct access to characters.
> Maybe programming languages should use internally full 32 bit per char, or compress
> the unicode string using a good library for access.
>
> C does not even know strings, just byte arrays
>
Thereby avoiding the problems completely by not even pretending to
solve them.

And you can always write a unicode_strlen() to any specification you
want... the problem is...

...given that many 'characters' may take LESS than a byte (ligature) or
up to 3-4 bytes (Unicode character sets)... what specification?

And the concept of a 'character' is practically valueless anyway.
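
To show what "any specification" might look like, here is a rough PHP sketch that makes the caller name the unit. unicode_strlen() is the made-up name from this post, not an existing PHP function, and the three units are only one possible choice. It assumes valid UTF-8 input and a PCRE build with UTF-8 support.

```php
<?php
// Sketch only: unicode_strlen() is hypothetical; the $unit names are invented here.
function unicode_strlen($s, $unit = 'codepoints') {
    switch ($unit) {
        case 'bytes':       // storage size, same as strlen()
            return strlen($s);
        case 'codepoints':  // one count per Unicode code point
            return preg_match_all('/./su', $s);
        case 'graphemes':   // one count per user-perceived character (PCRE \X)
            return preg_match_all('/\X/u', $s);
        default:
            throw new InvalidArgumentException("unknown unit: $unit");
    }
}

$s = "e\xCC\x81";   // 'e' followed by U+0301 COMBINING ACUTE ACCENT

echo unicode_strlen($s, 'bytes'), "\n";       // 3
echo unicode_strlen($s, 'codepoints'), "\n";  // 2
echo unicode_strlen($s, 'graphemes'), "\n";   // 1
```

One string, three defensible answers, which is exactly the "what specification?" question.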
> /Str.
>
>
--
Ineptocracy
(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.
Re: Unicode support [message #180994 is a reply to message #180993]
Sun, 31 March 2013 17:12
M. Strobel
On 31.03.2013 14:41, The Natural Philosopher wrote:
> On 31/03/13 13:10, M. Strobel wrote:
>> On 30.03.2013 19:04, The Natural Philosopher wrote:
>>> On 30/03/13 14:55, Christoph Becker wrote:
>>>> The Natural Philosopher wrote:
>>>>
>>>> > On 30/03/13 13:25, Thomas 'PointedEars' Lahn wrote:
>>>> >
>>>> >> The same issue exists with characters outside the BMP in ECMAScript
>>>> >> implementations which uses 16-bit characters (usually one UTF-16 code
>>>> >> unit
>>>> >> per character). But you can work around that rather efficiently.
>>>> >>
>>>> >
>>>> > The problem become 'what do you mean by strlen()' - the space the
>>>> > characters will occupy in an constant width font, or the storage
>>>> > allocated to the string?
>>>> >
>>>> > Mostly we are concerned with the latter.
>>>>
>>>> I am more concerned about the number of characters the string holds.
>>>> Say, I want to get the last character:
>>>>
>>>> $str = '€';
>>>> echo $str[2];
>>>>
>>>> > Because lack of precision in font reproduction, or even in guaranteeing
>>>> > which font may be selected, renders the former an 'open' question.
>>>> >
>>>> > strlen('€')===3 is in fact the correct answer.
>>>>
>>>> I suppose most *higher level languages* define the length of a string as
>>>> the number of characters the string holds. Cf. ECMAScript's length
>>>> property and TCL's [string length]. Even PHP's mb_strlen() returns the
>>>> number of characters.
>>>>
>>>
>>> so what happens in a typographic ligature like 'ᴁ'?
>>>
>>> I think you are making a rod for your back here.
>>>
>>> The storage requirements are exact specific and useful.
>>>
>>> The concept of a 'character in a text string' is really not..and if you go deep into
>>> typography with kerning, leading,ligature and the like and the like you will
>>> understand why.
>>>
>>
>> Of course you need both, the storage requirements, and direct access to characters.
>> Maybe programming languages should use internally full 32 bit per char, or compress
>> the unicode string using a good library for access.
>>
>> C does not even know strings, just byte arrays
>>
> Thereby avoiding the problems completely by not even pretending to solve them.
>
> And you can always wrote a unicode_strlen() to any specification you want..the
> problem is..
>
> ...given that many 'characters' may take LESS than a byte (ligature) or up to 3-4
> bytes (unicode character sets).... what specification?
>
So what? How much space does a backspace take? And how much does a DEL (127) take?
This has not been a problem so far.

And a ligature can be decomposed. Of course every case has to be discussed, but
AFAIK this has been done.
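
As a rough illustration of decomposing a ligature, here is a PHP sketch. It assumes the intl extension, which provides the Normalizer class, and uses the 'fi' ligature (U+FB01) as its example because that character has a compatibility decomposition in Unicode.

```php
<?php
// Sketch only: assumes the intl extension (Normalizer class) is loaded.
$lig = "\xEF\xAC\x81";   // U+FB01 LATIN SMALL LIGATURE FI, 3 bytes in UTF-8

// NFKD applies compatibility decompositions, which splits this ligature.
$plain = Normalizer::normalize($lig, Normalizer::FORM_KD);

echo strlen($lig), "\n";     // 3
echo $plain, "\n";           // fi
echo strlen($plain), "\n";   // 2
```

Not every ligature-like letter decomposes this way: 'æ' (U+00E6) has no Unicode decomposition and comes back unchanged, so each case really does need its own decision.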
/Str.
Re: Unicode support [message #180995 is a reply to message #180979]
Tue, 02 April 2013 17:59
M. Strobel
On 30.03.2013 15:33, The Natural Philosopher wrote:
> On 30/03/13 13:25, Thomas 'PointedEars' Lahn wrote:
>
>> The same issue exists with characters outside the BMP in ECMAScript
>> implementations which uses 16-bit characters (usually one UTF-16 code unit
>> per character). But you can work around that rather efficiently.
>>
>
> The problem become 'what do you mean by strlen()' - the space the characters will
> occupy in an constant width font, or the storage allocated to the string?
>
> Mostly we are concerned with the latter.
>
> Because lack of precision in font reproduction, or even in guaranteeing which font
> may be selected, renders the former an 'open' question.
>
> strlen('€')===3 is in fact the correct answer.
>
strobel@suse123-acer:~> tclshi
% set eur "€"
€
% string length $eur
1
% string bytelength $eur
3
%
Languages have to catch up.
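
For comparison, roughly the same two questions can be asked in PHP. This is a sketch assuming the mbstring extension; the euro sign is spelled out as bytes so the result does not depend on the encoding of the source file.

```php
<?php
// Sketch only: assumes the mbstring extension is loaded.
$eur = "\xE2\x82\xAC";                        // U+20AC EURO SIGN as UTF-8 bytes

echo mb_strlen($eur, 'UTF-8'), "\n";          // 1, like [string length]
echo strlen($eur), "\n";                      // 3, like [string bytelength]
echo mb_substr($eur, -1, 1, 'UTF-8'), "\n";   // the last character, the euro sign
echo bin2hex($eur[2]), "\n";                  // ac, only the third byte
```

The last line shows why $str[2] from earlier in the thread does not yield a character: the index operator works on bytes.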
/Str.