Using UTF-8 with UltraEdit

    Aug 22, 2006 #1

    Hi folks!

    I recently bought UltraEdit 12.10a, and now I have a problem that I can't seem to solve myself. I already invested some days in this, reading the forum, saving test files, looking at files with a hex editor, and I'm not exactly happy with UltraEdit. :-/

    So, the problem is the same one Johna seems to have: I want all my PHP files to be UTF-8 (without a BOM), and I want this to be possible with as little work for me as possible. For example, I don't want to convert every file manually. I could, though, live with UltraEdit just saving every new file as UTF-8. Sure, you say, UTF-8 is a roughly ten-year-old standard, UltraEdit is one of the best editors worldwide, what can be the problem?! Well, consider this:

    I create a new empty file within UE and save it, specifying the format to be "UTF-8 NO-BOM". Now the file is empty, and if I closed it and opened it again, UE would have no way to tell whether it is supposed to be UTF-8, ISO-8859-1 or something else. But I don't close it, I just write some characters, by chance using some that are encoded differently in UTF-8 and ISO-8859-1. And when I hit "Save", the file gets saved as ISO-8859-1.

    Had I used "Save As" instead of "Save", the file would have been saved as UTF-8 instead. Had I written some special characters before first saving the file, "Save" would have written it as UTF-8 too from then on. So why not in the above example? What's the difference between "Save" and "Save As" there? As a programmer, I'd really like to understand what UE does here.

    OK, so I have to include a special character in every file before I first save it as UTF-8. No problem, all my PHP, CSS, HTML and JS files get a header comment anyway. Cool, I thought, just make a template out of it. Unfortunately it turned out that UE somehow can't insert templates that contain special characters. Although I went and saved the template file in the UltraEdit program directory explicitly as UTF-8, the characters are broken when I insert the template into my new document. Even if I save the document first as UTF-8. So that's my second question: Is UltraEdit just not able to do this, or am I missing something?

    Of course, the next possible solution might be to use Mofi's macro to load a template file when I create a new file (with a hotkey). But then I would always get the same template, while in reality I need different headers for PHP classes, PHP pages, CSS files etc. So I'd really like to use the template mechanism already built into UltraEdit. Alternatively, and that's my last question (sorry, spontaneous idea, the answer is probably in the Help): Is it possible to add menu buttons for different macros? Can I call macros with menu buttons at all?

    Greetings,
    Johannes


      Aug 26, 2006 #2

      Today I have had time to look into the UTF-8 problems you described.

      First, here is what I think is the reason why you and other users think UltraEdit has problems handling UTF-8 files without a BOM.

      A UTF-8 file without a BOM is 100% binary identical to an ASCII file if it does not contain at least one character with a code value greater than 0x7F (decimal 127), i.e. a character that must therefore be saved in a UTF-8 encoded file with 2 to 4 bytes, like the German umlauts äöü. So if a file without a BOM does not contain any multi-byte character, it is interpreted as an ASCII file, and this is 100% correct.
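      This identity is easy to check outside UE; here is a small Python illustration (my addition, not part of the original post):

```python
# Pure ASCII text encodes to exactly the same bytes in ASCII and UTF-8,
# which is why a BOM-less UTF-8 file with only ASCII text is
# indistinguishable from an ASCII file.
text = "Hello, world!"
assert text.encode("utf-8") == text.encode("ascii")

# A character above 0x7F, like the German umlaut "ä" (U+00E4), needs
# 2 bytes in UTF-8 but only 1 byte in an ANSI code page like ISO-8859-1.
print("ä".encode("utf-8"))       # b'\xc3\xa4' (2 bytes)
print("ä".encode("iso-8859-1"))  # b'\xe4'     (1 byte)
```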

      But UltraEdit handles the character encodings correctly. If the file contains either the string charset=utf-8 (HTML, PHP, ASP, ...) or encoding="utf-8" (XML) at the top of the file, within the first few KB, when using UltraEdit for Windows < v24.10 or UEStudio < v17.10, UE handles the file as a UTF-8 file independent of the existence of a UTF-8 multi-byte character. So although, for example, an English webpage may not contain any character that needs multi-byte encoding and could therefore also be an ASCII file, UE nevertheless loads and handles it as a UTF-8 file if it contains one of the 2 encoding specification strings.


      I tested saving a new file with Ctrl+S and with Save As with the format UTF-8 - NO BOM. I can't see a difference. I have done the following tests:


      1) Open a new file and save it immediately with Ctrl+S with format UTF-8 - NO BOM. As everybody can see in the status bar of UltraEdit, UE still handles the file as an ASCII file and not as a UTF-8 file. This is correct, because according to the international encoding standard the 0-byte file is still not a UTF-8 file. You may have a different opinion here because you think, "I have specified it as UTF-8, so UE should handle it as UTF-8 until saved and re-opened". But this is not correct according to the international encoding standard, because you have not really specified it as a UTF-8 file.


      2) Open a new file and save it immediately with Save As with format UTF-8 - NO BOM. Same result as in 1), the empty file is still an ASCII file.


      3) Open a new file, enter a few ASCII characters, all with a hexadecimal code lower than 0x80, and save the new file with format UTF-8 - NO BOM. According to the status bar, UE still handles it as an ASCII file. According to the international encoding standard this is correct, even if you think it is a bug. It isn't. The cursor position is not changed after this first save. There is no difference between Ctrl+S and Save As because the first save of a new file always opens the Save As dialog.


      4) Open a new file, enter a few ASCII characters and also at least one character with a hexadecimal code higher than 0x7F like Ä, and save the new file with format UTF-8 - NO BOM. According to the status bar, UE now handles it as a UTF-8 file (U8-DOS with my settings).

      You can see what really happens in this situation if you look at the file content temporarily in hex mode before saving and look at it again temporarily in hex mode after saving.

      Attention: Do not save the new file while you are in hex mode. Just enable the hex mode temporarily before save and after save.

      Before the save the file has 1 byte per character; on save it is converted to a Unicode UTF-16 LE file with BOM and 2 bytes per character. The cursor position has changed to the top of the file after the first save because of the automatic conversion in the background.
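      The bytes seen in hex mode before and after the save can be reproduced in Python (my illustration of the byte layouts involved, not UE's internal code):

```python
# Before the save: "Ä" stored as a single ANSI byte (ISO-8859-1).
print("Ä".encode("iso-8859-1"))  # b'\xc4'

# After the save, the edit buffer holds the text as UTF-16 LE with BOM:
# the BOM bytes FF FE, then 2 bytes per character.
bom_utf16_le = b"\xff\xfe"
print(bom_utf16_le + "Ä".encode("utf-16-le"))  # b'\xff\xfe\xc4\x00'
```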


      5) Open a new file, enter a few ASCII characters and also the string charset=utf-8, and save the new file with format UTF-8 - NO BOM. According to the status bar, UE now handles it as a UTF-8 file although the file does not contain any character which is really encoded as a multi-byte character. After saving, the file is also converted and temporarily handled as UTF-16 LE with BOM. The cursor position has changed to the top of the file after the first save because of the automatic conversion in the background.


      Conclusion: UltraEdit handles new files as UTF-8 files 100% correct according to the international encoding standard.

      Update: Since UE v17.30.0.1011, any conversion executed from File - Conversions requiring a change in line termination or an ASCII/ANSI to Unicode or Unicode to ASCII/ANSI conversion is done immediately on the file and no longer on next save. And since UE v19.00, the encoding of a file can be changed directly via the Encoding Type control in the status bar at the bottom of the UE main window for the active file, as long as the basic status bar is not used.

      Johna and you have 2 problems caused by "wrong" UTF-8 handling.

      It is not possible to open a file which contains neither a correct encoding specification nor at least one multi-byte character and then insert by keyboard, or paste from the clipboard, characters which must be encoded in UTF-8. If these characters don't have an ANSI equivalent in the selected code page of the currently used font (a single byte with a code value lower than hexadecimal 0x100), you will not see those characters correctly.

      The file is loaded as ASCII file according to the international encoding standard. As long as you do not convert it manually to a UTF-8 file (in real temporarily to a Unicode UTF-16 LE file), you cannot insert or paste characters which simply need 2 bytes.

      And the second, very similar problem is that you also cannot insert multi-byte encoded characters into a new file as long as it is not a real UTF-8 file according to the international encoding standard, which is correctly indicated in the status bar of UltraEdit.

      The second problem can be easily avoided. UTF-8 is a byte-optimized encoding of Unicode. So if you want to create new files in UTF-8 format most of the time, enable the option Configuration - Editor - New File Creation - Always create new files as UNICODE. Now a new file is by default a UTF-16 LE file, just as every loaded UTF-8 file is while editing. With the format UTF-8 - NO BOM in the Save As dialog, the new file is then automatically saved as you want. The 5 tests above were done with this option not checked, to make it more difficult for UE than necessary.

      It's correct that templates cannot contain characters which must be saved with 2 bytes because they have no single byte equivalent. The template file of UltraEdit is still a binary file where only single byte characters are possible. Changing the format of the template file just to support a few 2-byte characters would be hard work. You also have to take the thousands of existing template files of UltraEdit users into consideration, whose owners are already satisfied with the current format. And downward compatibility would also be lost. I think you will understand now why IDM will not change the format of the template file just because a few users think they need it.

      And you don't really need it. Write your templates for a new PHP, CSS, HTML, ... file, but don't forget to also add the correct encoding specification to the template. The templates do not need to contain a special character, only the correct encoding specification.

      Then you can use the templates on new files, and after the first save with the format UTF-8 - NO BOM, the file is automatically converted by UltraEdit to UTF-8 (internally UTF-16 LE). But don't forget: first save the new UTF-8 file with no BOM but with the encoding specification before you insert, manually or from the clipboard, a character which must be encoded with 2 bytes. It is best to use one or more macros for that job. An example:

      InsertMode
      ColumnModeOff
      HexOff
      NewFile
      Template 4
      SaveAs ""

      or without immediately saving the new file

      InsertMode
      ColumnModeOff
      HexOff
      NewFile
      ASCIItoUnicode
      Template 4

      And the Format selected in the Save As dialog is UTF-8 - NO BOM.

      ASCIItoUnicode is only needed if Always create new files as UNICODE is not checked.

      Template 4 for example contains your standard body for new PHP files with the charset=utf-8 encoding specification string in the HTML header. I should add that UltraEdit does not examine where either charset=utf-8 or encoding="utf-8" is found. If the string is for example inside a PHP comment, UltraEdit will also interpret it as a valid encoding specification. I don't know if the PHP interpreters or the browsers accept the encoding specification only in the correct environment, or also anywhere in the file like UltraEdit.
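      The described behavior amounts to a context-free scan of the first few KB. Here is a hedged Python sketch of that kind of detection (my illustration; `looks_like_utf8` is a hypothetical name, not UE's actual code):

```python
def looks_like_utf8(data: bytes, probe_size: int = 4096) -> bool:
    """Scan the first few KB for an encoding specification string,
    regardless of where it occurs -- even inside a comment."""
    head = data[:probe_size].lower()
    return b"charset=utf-8" in head or b'encoding="utf-8"' in head

# Matches a real declaration ...
print(looks_like_utf8(b'<meta charset=utf-8>'))           # True
# ... but also a mere mention inside a PHP comment.
print(looks_like_utf8(b'<?php // use charset=utf-8 ?>'))  # True
```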

      Update: With smart templates introduced with UE v18.00 it is also possible to create templates with Unicode characters as the templates are stored now in XML using a text encoding supporting all UTF-16 encoded characters.

      To your last question: No, it is not possible to add macros to the menu or a toolbar. But I have never missed it because there is the macro list view at View - Views/Lists - Macro List. Activated by a click on it in the menu, by a hotkey you have assigned to this command, or by a click on its symbol in the toolbar after you have added this command to the toolbar, it opens the macro list in a docked or floating window as you have specified on last usage. Then you will see all the macros of the currently loaded macro file, and you can run the macro you currently need to create a new PHP, CSS or other file with a double click, or with the Return/Enter key if a macro in the macro list has the focus.

      Here is what I think IDM could do to help webpage writers who use UTF-8:

      First an ASCIItoUTF8 macro command could be very helpful (available since UE v17.30).

      Second, a file loading configuration option like "Create and load ASCII files as UTF-8" would be helpful for some users like you (available undocumented since v11.10c, read below).
      With such an option checked, a new and also an existing ASCII file is automatically loaded and handled as a UTF-8 file without BOM (internally in UE as UTF-16 LE) and so also saved as a UTF-8 file without BOM. A real ASCII file without any character with a code higher than 0x7F will still be an ASCII file after closing, and not a UTF-8 file, if it still does not contain the UTF-8 encoding specification.

      I have never requested the macro command and the configuration option, because I personally don't need them. Especially the configuration option would never be checked by me, because I rarely edit or create UTF-8 files, but daily work with ASCII files with characters with a code greater than 0x7F - German characters äüöÄÖÜß with OEM or ANSI code.

      So if there are webpage writers who would need these 2 things, they all should write an appropriate feature request email to IDM support.

      My suggestions for the configuration for UTF-8 webpage writers:

      First read the FAQ about UTF-8, UTF-16, UTF-32 & BOM and the Character encodings to get the basic knowledge you need.

      Second in UltraEdit or UEStudio open Configuration - File Handling and set following options:


      Conversions

      Uncheck the 2 EBCDIC options if you are not editing EBCDIC files, but check the option On Paste convert line ending to destination type (UNIX/MAC/DOS).


      DOS/UNIX/MAC Handling

      Set the Default file type for new files to whatever you prefer. If your host server is a Linux/Unix server, you should use Unix to avoid problems while downloading or uploading via FTP. If your host server is a Windows server, use DOS.

      Set the Unix/Mac file detection/conversion to Automatically convert to DOS format to avoid problems with copy and paste with other Windows applications.

      Uncheck Only recognize DOS terminated lines (CR/LF) as new lines for editing.


      Save

      Uncheck Write UTF-8 BOM header to ALL UTF-8 files when saved.

      Whether Write UTF-8 BOM on new files created within this program (if above is not set) should be enabled or not depends on the type of Unicode files you are creating. If you create, for example, only XML and HTML type files (HTM, HTML, PHP, ASP, ...) in UTF-8, you should uncheck this option, because then the encoding should be defined inside the file with encoding="utf-8" (XML) or with content="text/html; charset=utf-8" (HTML). See the FAQ above for details about the BOM and when it should be used.

      Enable Save file as input format (UNIX/MAC/DOS). That's important because we convert every file automatically to DOS for editing, but we want to save it in the original format and not in DOS format. This option was moved from the Save to the DOS/UNIX/MAC Handling configuration dialog in v12.10 of UltraEdit!

      You can set option Trim trailing spaces on file save to whatever you prefer. Normally it is good to activate it because it can reduce the file size a little bit which is interesting for HTML files.

      Temporary Files

      Use the second option Open file without temp file but prompt for each file and set the Threshold for example to 4096 (4 MB). You can set the threshold value to a higher value if your computer has enough performance and your hard disk is fast and you often edit large files.


      Unicode/UTF-8 Detection

      Enable Auto detect UTF-8 files, Detect Unicode(UTF-16) files without BOM and Detect ASCII/ANSI files with Escaped Unicode. You can disable for example the UTF-16 detection if you are sure that you will never edit a UTF-16 file. Every enabled detection increases the file load time of normal ASCII files. But if you don't know what format your files have, it is better to let UE/UES automatically detect it.

      The 3rd option Disable automatic detection of HEX file format on reload is not important for handling Unicode files.


      And as already explained above also enable the option Always create new files as UNICODE at Editor - New File Creation.

      Last, if you download/upload the files via the FTP client of UE/UES, always use the binary transfer mode and not the text mode. If your files on your Apache (Unix/Linux) host server are already Unix files, then with the settings above UE/UES converts a file into DOS format only temporarily for editing, after loading from FTP and before opening in the editor, and converts it back to Unix before saving. So there is no need to do it while transferring the file content. Local copies are then also Unix files and so are 100% identical to the files on the server. Using binary transfer mode is faster than the text/ASCII mode. Even if you don't use the FTP client of UE/UES and use a different FTP tool, you should always create and edit files with Unix line termination and use the binary transfer mode and the automatic conversion to DOS feature of UE/UES, unless your host server is a Windows server.


      Added on 2009-11-09: I have found an undocumented setting in uedit32.exe of v11.10c and later. With manually adding to uedit32.ini

      [Settings]
      Force UTF-8=1


      you can force all non-Unicode files (not UTF-16 files) to be read/saved as UTF-8 encoded files. But new files are nevertheless created and saved either as Unicode (UTF-16 LE) or ASCII/ANSI files. So this special setting is only for already named files. However, creating a new file in ASCII/ANSI, saving it with a name, closing it and re-opening it results in a new file encoded in UTF-8. Be careful with that setting. Even real ANSI files are loaded with this setting as UTF-8 encoded files, causing all ANSI characters to be interpreted wrongly.

      Added on 2010-03-28: With UltraEdit v16.00 instead of Create new files as Unicode there are now the choices

      Create new files as ANSI
      Create new files as UTF-8
      Create new files as UTF-16

      at Advanced - Configuration - Editor - New File Creation. Therefore users of UltraEdit 16.00 and later can set the default encoding for new files to UTF-8. With this change, the option Format of the Save As dialog is no longer remembered and preset in UE v16.00 and later. Format of the Save As dialog is now always set to Default on opening of the dialog.
      Best regards from an UC/UE/UES for Windows user from Austria


        Aug 27, 2006 #3

        Hi Mofi!

        First, thanks a lot for the extensive reply! :-)
        Mofi wrote: Conclusion: UltraEdit handles new files as UTF-8 files 100% correct according to the international encoding standard.
        I'm not sure about that, but I don't know to what "international encoding standard" you are referring and you most probably know more about this topic than I. ;-)

        For me, this is intuitively wrong. A valid ASCII file is always also a valid UTF-8 file. It's just a matter of convention whether I see it as an ASCII or as a UTF-8 file. This only becomes important when I enter the first non-ASCII character. And intuitively, I would expect UltraEdit to remember a previous setting and then save the file as UTF-8. This is what I described in my first posting and what irritated me: I saved an empty file as UTF-8, wrote some special characters and hit "Save" -- and it got saved as ISO-8859-1, as if I had never told UltraEdit to save it as UTF-8. I would rather expect UE to remember settings like these *as long as I keep the file open*. But this might just be me expecting something which does not conform to standard procedure. Can you tell me more about this "international encoding standard" you mentioned?

        From your explanations, I can definitely see why using the template mechanism the way I wanted doesn't work. I'd still think that it should be possible for UE to detect that its template file is saved as UTF-8 and act accordingly. But under the hood, this surely is more difficult to accomplish than it looks for a mere user of the program.

        You also mentioned that UE treats a file as UTF-8 when it finds the charset=utf-8 declaration commonly used in HTML files. Now this is something that I find highly suspicious. As a programmer myself, I wonder who had this idea in the first place. Isn't that like *guessing* the file's encoding?! With a chance of, let's say, 9 out of 10 times to guess right? A simple test exposes strange behavior: I write a perfectly valid ISO-8859-1 HTML page (with this encoding stated in the HTML head section) and happen to mention the possibility of using "charset=utf-8" in the body of my page in the text (because I e.g. write about HTML). I save this explicitly as ANSI/ASCII, and upon reopening it, UE treats it as UTF-8. Because it obviously isn't UTF-8 and states so, all special characters are broken. Offhand, I didn't find a way to repair the view in UE, and when I saved the file as ANSI/ASCII again, the characters were permanently broken. Now that's something that doesn't concern me personally, and it might be a rare case. It's kinda creepy, nonetheless, and might be confusing for users with less knowledge about the whole encoding issue.

        But I wouldn't rely on this feature anyway, because many of my PHP files are meant to be included in a page and don't render a complete HTML page themselves -- therefore they don't include the charset statement. My PHP classes don't contain any charset information either, nor do my JavaScript files. So my solution after some days of reading and trying things out is to create my own template files for the different file types I usually use. All of them start with a comment containing one character that cannot be displayed as ISO-8859-1. I chose the Greek letter Omega for this; it makes a nice visual and immediately recognizable proof that the file indeed got saved in the right encoding. Of course I have to copy and rename these files manually, but after some thinking about it I decided that I could just as well stick with my current habit of starting a new HTML page by copying a similar existing file. Doing so is more comfortable and fast in Windows Explorer than in the Save dialog of UE, anyway. So I can live with this solution. ;-)

        As for the rest of your post: Thanks for pointing out how to access the macros in an easy way. I haven't ever used them, despite using UE for some years now. So I wasn't aware of what exactly can be done with them and how to handle them. I finally decided against using them in this case because I can't insert a custom time string with them like I could with the template mechanism.

        I certainly would like a "Create and load ASCII files as UTF-8" option in UE (I guess I'll write IDM about it). I work a lot with ISO-8859-1 files, too. But after having one trouble with encodings after the other at work (usually in projects where I have no control about e.g. the database or the operating system's default encoding), I'm more than willing to switch to UTF-8 permanently for all text files, at least at home where I have full control over my private projects. ;-)

        I also recommend reading about the whole matter to anyone who wants to switch to UTF-8 for their projects. It's a rather complex matter, but it's worth knowing the basics. And after all, it's not so hard to do the switch. A resource that I found invaluable, by the way, is this page: http://www.phpwact.org/php/i18n

        Greetings,
        Johannes


          Aug 27, 2006 #4

          Ammaletu wrote: A valid ASCII file is always also a valid UTF-8 file.
          That's not true. An ASCII file contains only single byte characters, whereas all single byte characters of a UTF-8 file must always be converted to 2-byte characters before further processing. From a 'C' programmer's view this makes an extreme difference. Typical string arrays like char Text[], where the string is terminated with a NULL byte, cannot be used for handling Unicode strings. That's the major problem.
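          The char-array problem described here can be made concrete at the byte level (a Python sketch of my own, with a simulated C strlen):

```python
# In UTF-16 LE, every ASCII character gets a 0x00 high byte, so C code
# that treats 0x00 as the string terminator truncates the text.
data = "AB".encode("utf-16-le")
print(data)  # b'A\x00B\x00'

def c_strlen(buf: bytes) -> int:
    """Simulate C's strlen: count bytes up to the first NUL byte."""
    count = 0
    while count < len(buf) and buf[count] != 0:
        count += 1
    return count

print(c_strlen(data))  # 1 -- classic char[] routines stop far too early
```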

          Unicode is relatively new in the Windows environment. Win98, for example, has very limited Unicode support. If a program should still be compatible with Win98, the writers of the program have to do much extra work to get it working with Unicode on Win98. Win2K and WinXP also still have some bugs in their Unicode handling. There are thousands of good old string arrays in programs which have not been completely rewritten to support Unicode.

          All those arrays and the routines which work with them must be rewritten, and additionally the different Unicode support of the OSes (versions and service packs) must be taken into consideration. I'm glad that I write firmware and not Windows programs which have to handle Unicode. For example, the sort of UltraEdit on Unicode files is still not working, and simple finds are also often not working on Unicode files. Sure, those are bugs of UE, but it simply needs time to find all string handling routines and fix them for Unicode.

          And I also daily work with ASCII files with the OEM character set. I don't know how old you are and if you were working with MS-DOS and the OEM character set (the OS which does not have a GUI) before Microsoft launched Windows 3.1 with the ANSI character set. The German umlauts in the OEM character set cannot be correctly replaced by a UTF-8 multi-byte character without 2 conversions (OEM to ANSI, ANSI to Unicode).
          Ammaletu wrote: Can you tell me more about this "international encoding standard" you mentioned?
          I have already linked to 2 pages with more details. You should not only read the 2 pages carefully, you should also follow the links in those pages to get more detailed info about it, and use a search engine to find more. I have read hundreds of pages, the complete HTML 4.01 standard from first to last page, the complete CSS 1.0 and 2.0 specifications, FAQs, and much more before I really started writing HTML files. Extreme, I know, but very useful in the last 5 years of writing HTML files. If you want a good starting point for more info, start at the homepage of the World Wide Web Consortium. There you will find the standards and hundreds, no, thousands of interesting articles, including the Internationalization (I18n) Activity, for which you linked above to a different site.
          Ammaletu wrote: You also mentioned that UE treats a file as UTF-8 when it finds the charset=utf-8 declaration commonly used in HTML files. Now this is something that I find highly suspicious. As a programmer myself, I wonder who had this idea in the first place. Isn't that like *guessing* the file's encoding?!
          It's a standard, so it is not guessing. Well, the people who define a standard are unfortunately mostly not technicians or programmers. I'm a programmer of firmware for protective devices for generators in power plants, transformers, etc., and so I'm daily confronted with standards which are, from a programmer's point of view, totally nonsense. You will not believe what my firmware must do to correctly transmit the 1-bit OPEN/CLOSE information of a circuit breaker to fulfill the standards. We are living in a complex world and we make it more complex daily. I'm not happy about that.
          Ammaletu wrote: I write a perfectly valid ISO-8859-1 HTML page (with this encoding stated in the HTML head section) and happen to mention the possibility of using "charset=utf-8" in the body of my page in the text (because I e.g. write about HTML). I save this explicitly as ANSI/ASCII and upon reopening it, UE treats it as UTF-8.
          Yes, I noticed this too. If you want this special string as visible text in the body, you have to escape it, just as you have to escape HTML elements that should appear as visible text. You must use HTML entities. For example, you could encode the = with &#61;.
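          A quick Python check of this workaround (my addition): after replacing `=` with its entity, the raw bytes of the page no longer contain the detection string, while a browser still renders `charset=utf-8`.

```python
visible_text = 'You can declare the encoding with "charset=utf-8".'
escaped = visible_text.replace("=", "&#61;")
print(escaped)  # ... "charset&#61;utf-8" ...

# The saved bytes no longer trigger the context-free detection; the
# browser decodes &#61; back to '=' only when rendering the page.
assert b"charset=utf-8" not in escaped.encode("ascii")
```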


          Please note: Multi-language support is relatively new because computers and computer languages were first developed in North America and Europe, where 2-byte characters are still not needed (HTML entities exist). So most of what you can read on the internet about multi-language support, Unicode and 2-byte character encoding is only a few years old, as the Asian countries with their thousands of different letters, or better symbols, only recently really started to use computers in their mother language. I can't remember knowing anything about UNICODE, UTF, ENCODING, ... 5 years ago. That was simply no topic for me, and for my private and company work it is still not necessary to know anything about it. We sell our devices also in the Asian countries, but no customer has ever requested the programs and devices in their mother language. They get them in English and that's it. The world could be much easier if all humans on it spoke the same language. No more misunderstandings anymore. A common phrase is "The world is a village". Really? I live in a small village with about 550 people (Vienna is just the town where I'm working). We all speak the same language and we do not fire missiles at our neighbors or shoot them. So I don't think the world is a village. The human race is far away from living together in peace and harmony. But we had better stop the discussion about sense and nonsense or advantages and disadvantages of UTF-8 here, because we can't change it.

          Specify in UE to create by default a new file in Unicode, use the correct format in the Save As dialog and also use the correct encoding specification, in a comment if the file doesn't have a header where the specification should be written.

          If you already have templates, add them to the Favorite Files dialog to get easy access to them. You can use the Save As dialog or the Make Copy/Backup function after loading a template. It would also be good to have a backup of the template files themselves, in case you modify a template by mistake.

          And write a feature request email to IDM support and if you find bugs, tell them also about it with as many details as possible.


          Best regards from an UC/UE/UES for Windows user from Austria


            Aug 28, 2006 #5

            Hi Mofi!

            To my statement that a valid ASCII file is always also a valid UTF-8 file you replied:
            That's not true. An ASCII file contains only single byte characters, whereas all single byte characters of an UTF-8 file must be always converted to 2 byte characters before further processing.
            Well, I still don't know that much about UTF-8, but I've read enough in the last week to understand that exactly this point is one of the big advantages of UTF-8: It preserves ASCII as a valid subset. Or to quote RFC 3629:
            UTF-8, the object of this memo, has a one-octet encoding unit.  It uses all bits of an octet, but has the quality of preserving the full US-ASCII [US-ASCII] range: US-ASCII characters are encoded in one octet having the normal US-ASCII value, and any octet with such a value can only stand for a US-ASCII character, and nothing else.
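            The quoted property can be verified mechanically; a short Python check (my addition, exercising every US-ASCII octet):

```python
# Every US-ASCII octet (0x00-0x7F) decodes to the same character under
# both the ASCII and the UTF-8 codec: ASCII is a strict subset of UTF-8.
for code in range(0x80):
    octet = bytes([code])
    assert octet.decode("ascii") == octet.decode("utf-8")
print("all 128 US-ASCII octets decode identically as UTF-8")

# A lone octet >= 0x80, such as 0xE4 ('ä' in ISO-8859-1), is by
# contrast not valid UTF-8 on its own.
try:
    bytes([0xE4]).decode("utf-8")
except UnicodeDecodeError:
    print("0xE4 alone is not valid UTF-8")
```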
            UTF-8 can contain one-byte characters as well as up to four-byte characters. This may not be what UE does internally, but frankly I don't care what UE does internally. It's not my job to know too much about that sort of thing, or else I could write my own editor. ;-)

            I don't know how old are you and if you were working with MS-DOS and OEM character set (the OS which does not have a GUI) before Microsoft launched Windows 3.1 with ANSI character set.
            Nope, started with Windows 95. *g*

            Now I still don't know which "international encoding standard" makes it necessary for UE to treat any file with no non-ASCII characters in it as ASCII, when it could just as well be treated as UTF-8, given that the user told the program to do so.

            It's a standard, so it is not guessing.
            No, wait. Parsing an HTML or XML file into a DOM and then, upon investigating the element tree and finding that it specifies its charset in the proper way at the proper place, treating the file as encoded in the specified charset -- that's true to standards. Just assuming that every file that happens to mention the string "charset=utf-8" anywhere in the content is UTF-8, even if another encoding is explicitly specified at the proper place -- that seems like guessing to me, based on the assumption that people would rarely use this string within a normal text.

            Anyway I don't want to argue with you for arguing's sake and I certainly don't mean to offend you. Just wanted to make my point clear. Besides, I've written to IDM support about this and I'm kinda curious what they're going to reply. :-)

              Sep 08, 2006 #6

              Here's the answer of the IDM support (sorry for the delay, I was out of town):

              ---
              Thank you for your message and detailed suggestions. I'm not sure what
              changes would be required to support the configuration option Mofi
              suggested, but I have asked our developers to consider this for a future
              release.

              Regarding the charset declaration in HTML files, there is a setting you
              can add to your uedit32.ini file under the [Settings] section to prevent
              this being used to determine the format of the file:

              Detect UTF-8 String = 0

              If this setting is added to the uedit32.ini file then the charset=utf-8
              declaration would not be considered when determining the format of the file.
              ---

              So this feature can be turned off. Not perfect, but a practical solution which is OK for me. Well, it's not an issue for me anyway, I just wondered... ;-)

              So, some weeks later I'm still glad I did the switch to UTF-8. It doesn't make PHP coding exactly easier, but it's possible and it solves a lot of trouble with different characters. And UltraEdit is working fine with the files, once I saved a valid UTF-8 file as a template. So, thanks again to Mofi for clearing things up, especially for helping me to understand the different conversion possibilities built into UE. Once I knew what they actually meant, I removed most of them from my menus and am now a lot less confused by them. ;-)