LISTSERV - MACSCRPT Archives - LISTSERV.DARTMOUTH.EDU

On Aug 18, 2007, at 4:11 AM, Emmanuel wrote:

> The unfortunate thing is that no default behavior produces UTF-8 -  
> you *have* to specify "as «class utf8»" - which is sad because 1/  
> UTF-8 is basically a superset of ISO 8859, so most often you can  
> safely read an ISO file pretending it is UTF-8, 2/ as has been said  
> here most UNIX tools use UTF-8 as their output encoding.

I think what was happening in my case is that the script was coercing  
UTF-8 output from runpsynch to UTF-16. Personally, I don't think this  
is a good default behavior, especially since AppleScript doesn't  
attach any metadata to the files it creates with the file read/write  
commands. Yet, for some reason, when I opened them in TextEdit, it  
correctly guessed their text encoding, while BBEdit, which is usually  
pretty good at guessing a file's text encoding, opened them in its  
default encoding.

This was another surprise. You'd think UTF-16 would be the easiest  
encoding to get from the file, especially if it has a BOM. These  
files have no BOM, but still, the first byte is 00, and every other  
byte thereafter is 00. Wouldn't you think that was a pretty good  
indication of 16-bitness as well as big-endianness? I can't fathom how  
BBEdit could look at that file in hex and go, "hmm... I wonder what  
character encoding THIS is." While I'm sure you could construct a  
case in which this would lead to a bad guess, I don't see why they  
don't use it.
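
To make the idea concrete, here is a rough sketch of the kind of check I mean, written in Python rather than AppleScript so it is easy to run. The function name, the 0.9 threshold, and the 0.1 cutoff are my own inventions for illustration; this is not how BBEdit or TextEdit actually guess encodings, and it only works for text that is mostly ASCII.

```python
def looks_like_utf16be(data: bytes, threshold: float = 0.9) -> bool:
    """Guess whether BOM-less bytes are mostly-ASCII text stored as UTF-16BE.

    Hypothetical heuristic: in big-endian UTF-16, every ASCII character is
    a 00 byte followed by the ASCII byte, so the even offsets are almost
    all zero and the odd offsets almost never are.
    """
    if len(data) < 2 or len(data) % 2 != 0:
        return False  # UTF-16 data is always a whole number of 16-bit units

    high_bytes = data[0::2]  # offsets 0, 2, 4, ...: high byte of each unit
    low_bytes = data[1::2]   # offsets 1, 3, 5, ...: low byte of each unit

    zero_high = high_bytes.count(0) / len(high_bytes)
    zero_low = low_bytes.count(0) / len(low_bytes)

    # Threshold values are arbitrary assumptions, not tuned on real files.
    return zero_high >= threshold and zero_low < 0.1


sample = "Hello, world".encode("utf-16-be")  # no BOM is written
print(sample[:6].hex())                      # 00480065006c
print(looks_like_utf16be(sample))            # True
print(looks_like_utf16be("Hello, world".encode("utf-8")))  # False
```

As expected, it is easy to construct input that fools this (text that is mostly non-Latin, or binary data padded with zeros), so a real editor would presumably treat it as one hint among several.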

I also don't see why Apple doesn't say they support UTF-16BE instead  
of UTF-16. If you don't prepend a BOM, you're not just assuming  
big-endianness, you're assuming everybody else assumes it too, and  
that's what UTF-16BE means.
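
For comparison, Python's codecs draw exactly this distinction, which makes it a handy way to see the difference. In the sketch below, `decode_utf16_default_be` is a hypothetical helper of my own, not part of any library: it honors a BOM when one is present and otherwise falls back to big-endian, which is the assumption the BOM-less files described above rely on.

```python
import codecs


def decode_utf16_default_be(data: bytes) -> str:
    """Decode UTF-16 bytes, assuming big-endian when no BOM is present.

    Hypothetical helper for illustration only.
    """
    if data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE):
        # The generic "utf-16" codec reads the BOM, picks the byte order
        # from it, and strips it from the result.
        return data.decode("utf-16")
    # No BOM: the byte order has to be assumed, here big-endian.
    return data.decode("utf-16-be")


text = "hi"
with_bom = text.encode("utf-16")        # BOM first, then native byte order
be_no_bom = text.encode("utf-16-be")    # fixed byte order, no BOM

print(with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))  # True
print(be_no_bom.hex())                                             # 00680069
print(decode_utf16_default_be(with_bom))                           # hi
print(decode_utf16_default_be(be_no_bom))                          # hi
```

One caveat: as far as I know, Python's plain "utf-16" decoder falls back to the machine's native byte order when there is no BOM, so on a little-endian machine the BOM-less bytes above would come out as garbage unless you name "utf-16-be" explicitly. That is the same trap in a different guise.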