On Aug 18, 2007, at 4:11 AM, Emmanuel wrote:
> The unfortunate thing is that no default behavior produces UTF-8 -
> you *have* to specify "as «class utf8»" - which is sad because 1/
> UTF-8 is basically a superset of ISO 8859, so most often you can
> safely read an ISO file pretending it is UTF-8, 2/ as has been said
> here most UNIX tools use UTF-8 as their output encoding.

I think what was happening in my case is that the script was coercing
UTF-8 output from runpsynch to UTF-16. Personally, I don't think this
is a good default behavior, especially since AppleScript doesn't
attach any metadata to the files it creates with the file read/write
commands. Yet, for some reason, when I opened them in TextEdit, it
correctly guessed their text encoding, while BBEdit, which is usually
pretty good at guessing a file's text encoding, opened them in its
default encoding.

This was another surprise. You'd think UTF-16 would be the easiest
encoding to detect from the file, especially if it has a BOM. These
files have no BOM, but still, the first byte is 00, and every other
byte thereafter is 00. Wouldn't you think that was a pretty good
indication of 16-bitness as well as big-endianness? I can't fathom how
BBEdit could look at that file in hex and go, "hmm... I wonder what
character encoding THIS is." While I'm sure you could construct a
case in which this would lead to a bad guess, I don't see why they
don't use it.
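
To make the zero-byte check concrete, here is a rough Python sketch. It
only illustrates the heuristic described above for mostly-ASCII text; the
function name and the 0.9 threshold are made up for this example, and it
is not a claim about how BBEdit or TextEdit actually guess.

```python
from typing import Optional


def guess_utf16_endianness(data: bytes, threshold: float = 0.9) -> Optional[str]:
    """Guess the byte order of BOM-less UTF-16 data that is mostly ASCII.

    Hypothetical helper: returns 'utf-16-be', 'utf-16-le', or None.
    """
    # UTF-16 text always has an even number of bytes.
    if len(data) < 2 or len(data) % 2 != 0:
        return None

    units = len(data) // 2
    # Fraction of zero bytes at even offsets (0, 2, 4, ...) and odd offsets.
    even_zeros = data[0::2].count(0) / units
    odd_zeros = data[1::2].count(0) / units

    # ASCII in big-endian UTF-16 puts the 00 byte first in each pair.
    if even_zeros >= threshold and odd_zeros < 1 - threshold:
        return "utf-16-be"
    # ASCII in little-endian UTF-16 puts the 00 byte second.
    if odd_zeros >= threshold and even_zeros < 1 - threshold:
        return "utf-16-le"
    return None
```

As noted above, it is easy to construct input that fools this (text that
is mostly non-Latin, for instance, has few 00 bytes), so a real detector
would treat it as one signal among several.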

I also don't see why Apple doesn't say they support UTF-16BE instead
of UTF-16. If you don't prepend a BOM, you're not just assuming
big-endianness, you're assuming everybody else does too, and that's
what UTF-16BE means.
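
The difference between the two labels is easy to see in Python, whose
codecs happen to follow the same naming. This is only an illustration of
the naming convention, not of what AppleScript does internally, and
encode_variants is a made-up helper.

```python
import codecs


def encode_variants(text: str) -> dict:
    """Encode text under both labels so the byte layouts can be compared."""
    return {
        # "utf-16-be": fixed big-endian order, no BOM written.
        "utf-16-be": text.encode("utf-16-be"),
        # "utf-16": a BOM is written first, then the platform's native order.
        "utf-16": text.encode("utf-16"),
    }


variants = encode_variants("A")
# The BOM-less big-endian form is just the two bytes 00 41.
print(variants["utf-16-be"].hex())
# The generic form starts with a BOM identifying the byte order.
print(variants["utf-16"][:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE))
```

A reader handed the second form can work out the byte order from the BOM;
a reader handed the first has to be told, or has to guess.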