Created on 2008-01-19.04:58:07 by stephen, last changed 2008-01-19.06:43:18 by aidan.
| File name | Uploaded | Type |
| unnamed | sperber, 2008-01-19.06:43:18 | text/plain |
| unnamed | sperber, 2008-01-19.06:43:18 | text/plain |
| msg400
|
[hidden] ([hidden]) |
Date: 2008-01-19.06:43:18 |
|
| |
Message-ID: <18317.64553.511639.895176@parhasard.net> |
On the fifteenth day of January, Aidan Kehoe wrote:
> One more thing that’s slightly depressing for me; if you open
> lisp/mule/greek.elc as binary (the default, for other, broken, reasons) the
> buffer is trucated. That shouldn’t have happened.
“Truncated.” And yay, I’m wrong, it doesn’t happen.
|
| msg399
|
[hidden] ([hidden]) |
Date: 2008-01-19.06:43:18 |
|
| |
Message-ID: <18315.63613.253989.385241@parhasard.net> |
On the fifteenth day of January, Stephen J. Turnbull wrote:
> [...] The only reason we even care about this, by the way, is that Mule
> decoding is a lossy transformation and we don't provide any way for the
> user to recover the original, and try again.
We do for UTF-8, btw. See unicode-error-sequence-regexp-range,
unicode-error-default-translation-table, #'unicode-error-translate-region.
Yes, it might be nice to extend this to support non-Unicode coding systems.
GNU’s unibyte vs. multibyte concept is similar, but about a thousand times
more of a pain in the arse to work with than our current architecture, and
conceptually not any less ugly.
One more thing that’s slightly depressing for me; if you open
lisp/mule/greek.elc as binary (the default, for other, broken, reasons) the
buffer is trucated. That shouldn’t have happened.
|
| msg398
|
[hidden] ([hidden]) |
Date: 2008-01-19.06:43:18 |
|
| |
Message-ID: <y9ld4s5q55n.fsf@deinprogramm.de> |
Mats Lidell <matsl@xemacs.org> writes:
>>>>>> Michael wrote:
>
> Michael> I touch that file outside of XEmacs and then do M-x
> Michael> revert-buffer RET.
>
> Don't know if this is related but I have this situation in a folder:
> ----------------------------------------------------------------------
> /tmp/TEST:
> totalt 28
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÄÄÄ
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÖÖÖ
> ----------------------------------------------------------------------
>
> Now I open the file 'ÄÄÄ' for editing (modeline says it is ÄÄÄ), insert
> some chars and save it. This is what I get in dired:
> ----------------------------------------------------------------------
> /tmp/TEST:
> totalt 32
> -rw-r--r-- 1 matsl users 4 23 okt 12.40 ÄÄÄ
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÄÄÄ
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÖÖÖ
> ----------------------------------------------------------------------
>
> In a shell, outside XEmacs, I get:
> ----------------------------------------------------------------------
> spencer:/tmp/TEST% ls -l
> totalt 4
> -rw-r--r-- 1 matsl users 4 23 okt 12.40 ???
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÄÄÄ
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÖÖÖ
> ----------------------------------------------------------------------
>
> The file I saved from XEmacs, which is shown as ÄÄÄ in dired, is shown
> in my shell as ???. Both XEmacs and the shell use the same locale,
> sv_SE.UTF-8.
>
> If I remove the file ??? in the shell, and update dired, XEmacs is
> smart and gets the names back readable again.
>
> Is this a bug or should it be like this?
I can reproduce the situation on a MULE XEmacs in a UTF-8 locale. (It
also breaks on a non-MULE XEmacs, but in a different way. Fixing it
there is harder.) The reason is that `default-process-coding-system' is
'undecided in the read direction, and UTF-8 fails to get detected. My
suggestion is to align the coding system in `insert-directory' with the
locale's coding system. Works for me.
Maybe the MULE experts could review? I'd appreciate that!
2008-01-13 Michael Sperber <mike@xemacs.org>
* files.el (insert-directory): Bind `coding-system-for-read'
according to the current locale where available. (Previously, the
default ended up being undecided, which doesn't work well for
UTF-8-based locales, for example.)
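A minimal Python sketch of one reading that is consistent with the `???` entry in the quoted shell listing. It assumes (this is not confirmed anywhere in the thread) that the new file name was written as three Latin-1 bytes instead of six UTF-8 bytes. The variable and function names are made up for illustration.

```python
# Assumption: the saved name was encoded as Latin-1, not UTF-8.
name = "ÄÄÄ"
utf8_name = name.encode("utf-8")      # six bytes: c3 84 c3 84 c3 84
latin1_name = name.encode("latin-1")  # three bytes: c4 c4 c4


def valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The two byte strings differ, so both can exist as separate directory entries.
distinct_entries = utf8_name != latin1_name

# A UTF-8 tool cannot decode the Latin-1 bytes; each byte becomes a placeholder.
shown_by_utf8_tool = latin1_name.decode("utf-8", errors="replace")
```

Under that assumption, a tool running in a UTF-8 locale has nothing printable to show for the three bytes, which would match three placeholder characters in the listing.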
|
| msg397
|
[hidden] ([hidden]) |
Date: 2008-01-19.06:43:18 |
|
| |
Message-ID: <y9lmyurq63o.fsf@deinprogramm.de> |
It's time to do something about this one.
Mats Lidell <matsl@xemacs.org> writes:
>>>>>> Michael wrote:
>
> Michael> I touch that file outside of XEmacs and then do M-x
> Michael> revert-buffer RET.
>
> Don't know if this is related but I have this situation in a folder:
> ----------------------------------------------------------------------
> /tmp/TEST:
> totalt 28
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÄÄÄ
> -rw-r--r-- 1 matsl users 0 20 okt 23.28 ÖÖÖ
> ----------------------------------------------------------------------
I don't even get that far. Is there a brief overview somewhere on how
locale influences the various parts of XEmacs? Specifically, how do
file names work? I have
(get-coding-system-from-locale (current-locale)) => utf-8
Yet XEmacs treats a filename called "äää" (a umlaut, a umlaut, a umlaut)
as if it were called "a box a box a box" (i.e. apparently the UTF-8
encoding of the file name).
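The symptom described above (three characters showing up as three pairs) can be reproduced outside XEmacs. The following Python sketch only illustrates the byte-level effect; it says nothing about where in XEmacs the misinterpretation happens.

```python
# The file name from the message above.
name = "äää"

# In UTF-8, each "ä" (U+00E4) is the two-byte sequence c3 a4.
utf8_bytes = name.encode("utf-8")

# Reading those bytes one byte per character (here: as Latin-1)
# yields six characters instead of three.
misread = utf8_bytes.decode("latin-1")

lengths = (len(name), len(utf8_bytes), len(misread))  # (3, 6, 6)
```

Each original character turns into a pair: U+00C3 followed by U+00A4.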
|
| msg132
|
[hidden] ([hidden]) |
Date: 2008-01-19.04:58:07 |
|
| |
Message-ID: <87y7as6w6o.fsf@uwakimon.sk.tsukuba.ac.jp> |
Michael Sperber writes:
>
> "Stephen J. Turnbull" <stephen@xemacs.org> writes:
> > There are a lot of coding systems. But basically if you have as many
> > as 3 non-ASCII characters, the chance that any natural language text
> > "looks like" UTF-8 is vanishingly small. Except at the beginning and
> > end of the string, a single byte >= 0xC0 gives you information about
> > *at least* three other bytes: the preceding one may *not* be >= 0xC0,
> > the following N bytes must be in the range 0x80 to 0xBF, and the next
> > one after that must not be in that range.
>
> I'm not sure I understand: These are conditions which must hold true for
> UTF-8. Is the presence of a valid UTF-8 3-byte encoding in a byte
> sequence enough to be able to say that it is UTF-8?
No, it's not. For one thing, in a shell buffer you might be accessing
several remote systems or cat'ing files saved from MIME mail in the
specified encoding. Let's not worry about those, though.
Consider the popular Western European languages (English, Spanish,
French, and German). Suppose the text has three non-ASCII characters
you want to encode. The accented letters (acute, grave, tilde,
umlaut) almost always occur in isolation: that can't be UTF-8, where
high-bit-set bytes occur only in groups of two or more. If they do
occur in groups, what is the chance that the first of the group has
high bits that encode the byte count of the UTF-8-like group it is in?
Note that the probability that an ISO-8859 encoding of random
characters will put a UTF-8 trailing byte in the text is fairly low by
itself, since only the range 0xA0-0xBF (1/8) satisfies both ISO-8859's
avoidance of C1 controls and the UTF-8 trailing byte restriction.
How likely it is given the constraints of natural language
vocabulary, I don't know.
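The 1/8 figure can be checked by counting byte values. This Python sketch takes the printable high half of ISO-8859 to be 0xA0-0xFF (C1 controls 0x80-0x9F excluded) and the UTF-8 trailing-byte range to be 0x80-0xBF; the second ratio is an extra computed here, not a figure from the message.

```python
from fractions import Fraction

ISO8859_PRINTABLE_HIGH = range(0xA0, 0x100)  # avoids the C1 controls 0x80-0x9F
UTF8_TRAILING = range(0x80, 0xC0)            # continuation bytes 10xxxxxx

# Byte values that satisfy both restrictions: 0xA0-0xBF.
both = [b for b in range(256) if b in ISO8859_PRINTABLE_HIGH and b in UTF8_TRAILING]

share_of_all_bytes = Fraction(len(both), 256)                          # 1/8
share_of_high_bytes = Fraction(len(both), len(ISO8859_PRINTABLE_HIGH))  # 1/3
```

So a uniformly random byte falls in the overlap 1 time in 8, while a uniformly random printable high-half ISO-8859 byte does so 1 time in 3.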
Spanish, however, has those inverted punctuation characters used at
the beginning of a sentence, and it does have words that begin with an
accented character. Do we need to worry? No! What will precede the
punctuation character? Most likely an ASCII whitespace character,
possibly NO-BREAK SPACE (NBSP). But what precedes NBSP? Probably
whitespace or ASCII punctuation. NBSP is encoded as a UTF-8 trailing
byte, so that can't be a UTF-8 character: no leading byte. How about
the inverted punctuation? We got lucky: they're both trailing bytes,
the argument holds. I believe the same analysis holds for French
guillemets, currency symbols, etc.
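The byte values behind this argument are easy to verify. The Python sketch below uses ISO-8859-1 as the example Latin encoding; the sample sentence and the helper names are invented for illustration.

```python
# ISO-8859-1 code points for the punctuation discussed above.
SAMPLES = {
    "NO-BREAK SPACE": "\u00a0",
    "INVERTED EXCLAMATION MARK": "\u00a1",
    "INVERTED QUESTION MARK": "\u00bf",
    "LEFT GUILLEMET": "\u00ab",
    "RIGHT GUILLEMET": "\u00bb",
    "CURRENCY SIGN": "\u00a4",
}


def is_utf8_trailing(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx (0x80-0xBF)."""
    return 0x80 <= byte <= 0xBF


latin1_bytes = {label: ch.encode("latin-1")[0] for label, ch in SAMPLES.items()}
all_trailing = all(is_utf8_trailing(b) for b in latin1_bytes.values())


def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is acceptable to a strict UTF-8 decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ASCII space followed by an inverted question mark: a trailing byte
# with no leading byte before it, so strict UTF-8 decoding fails.
sentence = " \u00bfPor qu\u00e9?".encode("latin-1")
sentence_is_utf8 = decodes_as_utf8(sentence)  # False
```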
So once you've got 3 non-ASCII characters encoded in ISO-8859 Latin,
the chances that some UTF-8 restriction isn't violated seem rather
slim. It's all very hand-wavy and heuristic, but seems pretty solid
to me.
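A rough Python sketch of the heuristic, tried on three invented sample phrases with at least three non-ASCII characters each. It relies on Python's strict UTF-8 decoder and is not the XEmacs detector; the function name and the threshold parameter are assumptions made for this example.

```python
def looks_like_utf8(raw: bytes, min_non_ascii: int = 3) -> bool:
    """Accept raw as UTF-8 only if it decodes strictly and contains
    at least min_non_ascii non-ASCII characters."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return sum(ord(c) > 0x7F for c in text) >= min_non_ascii


# Invented samples: Spanish, French, German.
phrases = ["¿Dónde está?", "« Où est le café ? »", "Grüße aus Köln"]

latin1_results = [looks_like_utf8(p.encode("latin-1")) for p in phrases]
utf8_results = [looks_like_utf8(p.encode("utf-8")) for p in phrases]
```

For these samples every Latin-1 encoding is rejected and every UTF-8 encoding is accepted. Three phrases prove nothing about natural language in general, of course.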
The only reason we even care about this, by the way, is that Mule
decoding is a lossy transformation and we don't provide any way for
the user to recover the original, and try again.
_______________________________________________
XEmacs-Beta mailing list
XEmacs-Beta@xemacs.org
http://calypso.tux.org/cgi-bin/mailman/listinfo/xemacs-beta
|
| msg131
|
[hidden] ([hidden]) |
Date: 2008-01-19.04:58:07 |
|
| |
Message-ID: <87zlv86xp5.fsf@uwakimon.sk.tsukuba.ac.jp> |
Aidan Kehoe writes:
> On your original question; it's laughably unlikely (outside of
> Cygwin, where this code is not used) that ls will output file names
> in a coding system that doesn't reflect the octets stored in the
> directory entries.
Well, no, it's not funny at all in Japan where I know of servers
running in EUC or UTF-8 that handle Shift-JIS stores. I wouldn't be
surprised if similar issues arise with respect to KOI-8 and Big5.
> And on OS X
> file-name-coding-system (and relatedly, the 'file-name coding system alias)
> is unconditionally UTF-8, independent of the locale coding system.
This is a good heuristic, since the most popular file system on OS X
is HFS+, which does try to enforce UTF-8 file names.
> I would suggest binding coding-system-for-read to 'file-name, not
> (get-coding-system-from-locale (current-locale)) .
Yes, that's an excellent idea; it gives the user enough control to deal
with the unusual cases described above.
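The idea of a file-name coding system that the user can set independently of the locale can be sketched in Python as follows. This is an analogy only, not XEmacs code: `decode_file_name` and its parameters are hypothetical, and the sample name is invented.

```python
from typing import Optional


def decode_file_name(raw: bytes, locale_encoding: str,
                     file_name_encoding: Optional[str] = None) -> str:
    """Decode a directory-entry name, preferring an explicit
    file-name encoding over the locale's encoding."""
    encoding = file_name_encoding or locale_encoding
    return raw.decode(encoding, errors="replace")


# A name stored as Shift-JIS on a machine whose locale is UTF-8.
stored = "日本語".encode("shift_jis")

by_locale = decode_file_name(stored, "utf-8")
by_override = decode_file_name(stored, "utf-8", file_name_encoding="shift_jis")
```

With only the locale to go on, the name comes out as replacement characters; with the override it decodes correctly.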
|
| msg130
|
[hidden] ([hidden]) |
Date: 2008-01-19.04:58:07 |
|
| |
Message-ID: <873at091pj.fsf@uwakimon.sk.tsukuba.ac.jp> |
Michael Sperber writes:
> Could you give a hint about detecting UTF-8? (I know what UTF-8 looks
> like, but not enough about the other coding systems to be able to say what
> distinguishes them.)
There are a lot of coding systems. But basically if you have as many
as 3 non-ASCII characters, the chance that any natural language text
"looks like" UTF-8 is vanishingly small. Except at the beginning and
end of the string, a single byte >= 0xC0 gives you information about
*at least* three other bytes: the preceding one may *not* be >= 0xC0,
the following N bytes must be in the range 0x80 to 0xBF, and the next
one after that must not be in that range.
However, this should all already be part of the 'undecided' coding
system. If it's not working, there's probably something tricky going
on with process buffers.
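The structural rule can be written down as a small checker. The Python sketch below is deliberately simplified and is not how the 'undecided' coding system is implemented: it looks only at leading and trailing byte patterns and does not reject overlong forms, surrogates, or code points above U+10FFFF.

```python
def utf8_structure_ok(raw: bytes) -> bool:
    """Check that every leading byte is followed by the right number
    of trailing bytes (0x80-0xBF) and that no trailing byte is stray."""
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            need = 0            # ASCII
        elif 0xC0 <= b <= 0xDF:
            need = 1            # 110xxxxx
        elif 0xE0 <= b <= 0xEF:
            need = 2            # 1110xxxx
        elif 0xF0 <= b <= 0xF7:
            need = 3            # 11110xxx
        else:
            return False        # stray trailing byte, or 0xF8-0xFF
        trailing = raw[i + 1:i + 1 + need]
        if len(trailing) != need or any(not 0x80 <= t <= 0xBF for t in trailing):
            return False
        i += 1 + need
    return True


utf8_ok = utf8_structure_ok("äää".encode("utf-8"))       # True
latin1_ok = utf8_structure_ok("äää".encode("latin-1"))   # False
```

Because each sequence is consumed whole, a trailing byte that turns up where a new character should start is reported as an error.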
|
|
| Date | User | Action | Args |
| 2008-01-19 06:43:18 | aidan | set | messages: + msg400 |
| 2008-01-19 06:43:18 | aidan | set | messages: + msg399 |
| 2008-01-19 06:43:18 | sperber | set | files: + unnamed, unnamed messages: + msg398 |
| 2008-01-19 06:43:18 | sperber | set | messages: + msg397 |
| 2008-01-19 04:58:07 | stephen | set | messages: + msg132 |
| 2008-01-19 04:58:07 | stephen | set | status: new -> chatting messages: + msg131 |
| 2008-01-19 04:58:07 | stephen | create | |