How to change the symbols "$" and "_" representing newlines and spaces?

Zababa · Jan 12, 2011 · #1 · 2011-01-12T12:09+00:00

Short version: Is there a way to make UE show other symbols for newlines and spaces than "$" and "_"?

Long version:
I usually edit UTF-8 documents. (To make UE recognize them as such I have to add the deprecated BOM -- Not my taste but I can live with it.) In those files where I use \r\n for line ends I see a "¶" (Pilcrow sign). Those which use only \n for new line display it like "¬" (Not sign). In both I see spaces like a "·" (Middle dot). I like it this way.

But now I am editing a document which is just plain ASCII and which has to have \n as newlines. Every time I open it, UE recognizes it as "UNIX" and displays newlines as "$" (Dollar sign) and spaces as "_" (Low line). I don't like that. I want UE to display "¬" and "·" respectively. I can achieve that temporarily when I save the document as UTF-8 NOBOM with Unix line ends (\n). Then UE says it's "U8-UNIX", but as soon as I close and reopen that document I see those dollars and low lines again, because there is no BOM and UE thinks it's ASCII (well, it's right, it is ASCII). I want UE to use "¬" and "·" in plain ASCII Unix files as well.

Mofi · Jan 13, 2011 · #2 · 2011-01-13T07:12+00:00

I explained in the topic "UTF-8 not recognized, largish file" how UltraEdit detects whether a file is encoded in UTF-8. I would be interested to know which type of files you are editing using UTF-8 encoding that have neither a UTF-8 character set declaration at top, nor UTF-8 encoded characters in the first 64 KB (when using UltraEdit for Windows < v24.10 or UEStudio < v17.10), nor (usually) a UTF-8 BOM as strongly recommended by the Unicode working group to help applications read such files correctly from the beginning. However, in the referenced topic I also posted which setting can be used to force UltraEdit to read all non UTF-16 files as UTF-8 files. With this setting all single byte encoded text files are read as UTF-8, which works for UTF-8 files and ASCII files (no character with a value greater than 127), but not for ANSI files (which contain single byte characters with a decimal value greater than 127).
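The detection order described here (BOM first, then a look at the first 64 KB) can be illustrated with a rough Python sketch. This is not UltraEdit's actual algorithm; the function name `guess_encoding` and the returned labels are made up for illustration, and the check for a character set declaration is left out.

```python
import codecs

SCAN_LIMIT = 64 * 1024  # the 64 KB window mentioned above


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: classify raw file bytes as
    'utf-8-bom', 'ascii', 'utf-8' or 'ansi'."""
    # 1. A UTF-8 BOM settles the question immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-bom"

    head = data[:SCAN_LIMIT]

    # 2. Only bytes below 0x80 in the scanned window: plain ASCII.
    if all(byte < 0x80 for byte in head):
        return "ascii"

    # 3. High bytes are present: see whether they form valid UTF-8.
    #    If the window cuts the file off, tolerate an incomplete
    #    multi-byte sequence at the very end of the window.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(head, final=len(data) <= SCAN_LIMIT)
    except UnicodeDecodeError:
        return "ansi"  # some single byte code page
    return "utf-8"
```

Note that a file whose first non-ASCII character appears after the scanned window is reported as "ascii" by this sketch, which is the kind of miss a fixed scan window produces.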

That a DOS text file contains also just \n or just \r is very, very unusual and is often caused by badly coded applications. It might be that a single \r or single \n is used to code a line break within a paragraph terminated with a \r\n byte sequence. But that is definitely very unusual for text files. The problem is that a pure text file cannot really distinguish line breaks from line terminations which are at the same time the end of a paragraph. Yes, humans read a block of consecutive lines as a paragraph, and a blank line determines where one paragraph ends and the next starts, but this is not a rule defined anywhere, so programmers cannot rely on this interpretation, and there are lots of examples showing that paragraph definitions differ.

However, I tried to reproduce what you have described and failed to reproduce it with UE v16.30.0.1003, with one exception: files with UNIX line terminations (just \n) opened without temporary conversion to DOS are displayed, with Show Line Endings enabled, using the character ¬ (decimal value 172). There are the settings "Automatically convert to DOS format" and "Save file as input format (Unix/Mac/DOS)", which can be used to temporarily convert a UNIX file to a DOS file for editing while the file is always saved as a UNIX file. Because you also work with DOS files containing some lines ending in just \n, you additionally need to check "Only recognize DOS terminated lines (CR/LF) as new lines for editing", which is perhaps already checked.

But spaces are always displayed with the character · (decimal value 183), DOS line terminations with ¶ (decimal value 182) and tabs with » (decimal value 187), and this can't be configured. Of course it depends on the font and script (code page) as defined in View - Set Font or View - Set HEX/Column Mode Font whether these bytes are displayed with the character glyphs shown here in this UTF-8 encoded HTML file. So when you are editing an ANSI file, it depends on the code page and the font/script settings whether the replacement characters are displayed as expected or with different glyphs.
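To illustrate why the same byte values can show up as different glyphs, here is a small Python sketch that decodes the four values named above under a few code pages. The choice of code pages is only an example and says nothing about which script setting UltraEdit actually applies.

```python
# Byte values named above:
# 172 = UNIX line end, 182 = DOS line end, 183 = space, 187 = tab
MARKERS = bytes([172, 182, 183, 187])


def glyphs(code_page: str) -> str:
    """Decode the marker bytes using the given code page."""
    return MARKERS.decode(code_page)


for code_page in ("cp1252", "cp1250", "cp437"):
    print(code_page, glyphs(code_page))
```

With a Windows code page such as cp1252 the bytes decode to ¬ ¶ · », while an OEM code page such as cp437 maps the same bytes to other characters.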

Zababa · Jan 13, 2011 · #3 · 2011-01-13T11:37+00:00

I see I did not explain my situation properly. The newlines and spaces are displayed as "$" and "_" only in cases where I edit ASCII files with UNIX line ends. A concrete instance of this would be what I attached (conference_trimmed.zip). It's the first lines of a document-class definition file for LaTeX. This file is supposed to use only ASCII characters (at least in its code lines) and should have Unix line endings. Whenever I (re-)open this file, UE displays it like attached in unwanted_display.png.

Thank you for mentioning your update in the other post. I remember our discussion there. Good work on finding out that if you add "Force UTF-8=1" to the [Settings] section of uedit32.ini, UE will treat all non-Unicode files as UTF-8. I tried this out and it effectively solved (or rather circumvented) my display problem. When I start UE with that setting and open the file, UE assumes it is "U8-UNIX" rather than "UNIX" and hence I get the spaces and newlines displayed just the way I want (wanted_display.png), as is the case in any other UTF-8 document.

I don't mind the forced UTF-8. In files where only the first 128 code points (0 to 127) are used, UTF-8 (without BOM) is identical to ASCII. Luckily, I don't need to work with ANSI files using national 8-bit code pages anymore. Even without forced UTF-8 in the ini settings, I never figured out why UE cannot display these files properly and why switching to another code page (View/Set Code Page ...) does not have any effect on the display of the opened file.
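The point that BOM-less UTF-8 and ASCII are byte-identical for such files can be checked with a few lines of Python. This is only an illustration; the helper name `utf8_matches_ascii` and the sample text are made up.

```python
import codecs


def utf8_matches_ascii(text: str) -> bool:
    """True if the text encodes to the same bytes in ASCII and BOM-less UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # At least one character lies outside the ASCII range.
        return False


sample = "\\documentclass{article}\n"
print(utf8_matches_ascii(sample))   # True
print(utf8_matches_ascii("·¬"))     # False

# Only the optional BOM makes a pure-ASCII file distinguishable as UTF-8.
bom_file = codecs.BOM_UTF8 + sample.encode("utf-8")
print(bom_file[:3])                 # b'\xef\xbb\xbf'
```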

To address the other topics of your thorough answer:

… which type of files you are editing using UTF-8 encoding, but does not have a UTF-8 character set declaration at top, or UTF-8 encoded characters in the first 64 KB on using UltraEdit for Windows < v24.10 or UEStudio < v17.10, nor (usually) a UTF-8 BOM

None. Really, none. Here I think you misunderstood. I wasn't complaining about anything concerning UTF-8. I just mentioned that when I work with UTF-8 files I see spaces like "·" and line ends like "¬" or "¶" (for \n or \r\n respectively), whereas when UE opens an ASCII file with UNIX line ends (i.e. \n) the spaces and line ends are displayed differently (like "_" and "$") and that was what I wanted to change.

Nevertheless I am more than happy to tell you which files I actually work on so the UE community sees that the best text editor around is not only used by programmers:

I work mostly with linguistic stuff. Lots of files in this field don't have any UTF-8 declaration, but most of them don't mind having a BOM in their first three bytes — at least as long as you work with them in Windows.

More specifically, I write XeTeX source files. Theoretically, you could include an XML-like UTF-8 declaration in the comments somewhere within the first lines there, but I haven't seen anyone doing or recommending that. It is also not necessary because XeTeX requires the source files to be in any form of Unicode and can handle both BOM and NOBOM UTF-8. (But as soon as you get to the very internals which are pure (La)TeX, you don't write any special characters into the source files, so these are ASCII)

Another kind of UTF-8 files are annotated linguistic corpora of various languages. Many of them are written for a rather primitive software where there is no place for an encoding declaration in the source files. But again, since this is for Windows, both BOM and NOBOM UTF-8 is accepted and treated properly.

And there are other files containing nothing but texts (with special characters) with no place for an encoding declaration. These may be raw parts of a text corpus or just some notes or intermediate files arising in various stages of a linguist's workflow. Again, the common practice here is using a UTF-8 BOM rather than an encoding declaration.

Typically, all these files have some special characters within their first 64 KiB, except maybe for some bits and pieces of a multi-file XeTeX document which could as well be in ASCII and of course those files concerning the very internals of (La)TeX.

I also write some scripts in Python 3 but there UE really treats UTF-8 well because there is an explicit and recommended way how to declare the encoding at the beginning of the source file. (UE still cannot fold python's code, but that's worth another topic.)

UTF-8 BOM as strongly recommended by the Unicode working group to help applications reading such files from the beginning correct

I thought we agreed that the UTF-8 BOM is deprecated and not recommended when we discussed it in the other thread. There you have it. I looked it up once again in the drafts of the Unicode 6.0 documents. The relevant passage can be found in section 2.6 under Table 2-4. Its wording has not changed: "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature."

From my point of view, it is not the UTF-8 BOM which should help the applications. Rather, the applications should help themselves, make Unicode the new default character set and assume that any text document from these days is most likely encoded in UTF-8, -16, or -32. Even the most basic things, like program source code written in 7-bit ASCII, are technically a subset of files written in UTF-8 (without BOM). It is sad that Windows Notepad and many other text editors assume one of the national 8-bit code pages as default. The same can be said about the very internals of operating systems like command line consoles, data storage and file systems. Having grown up on computers with English, Czech and German software, I have been through almost every kind of character set and character encoding problem there is (since MS-DOS times) and I still don't see the end of it.

I see that Unicode has its own drawbacks and problems which are not trivial, but still, it makes so many things for the majority of languages and their speakers so much easier.

That a DOS text file contains also just \n or just \r is very, very unusual

I agree, but again, I wasn't talking about DOS files with \n's or \r's as line ends. I think we know that there are three line ending standards in the computer world. Unix systems use \n, early Macintosh systems used \r, and Microsoft systems use \r\n. UE can treat all three kinds very well and displays which kind of line ends are being used in the document by saying "UNIX", "MAC", or "DOS" in the status line. The strange thing about it is how the newlines are being displayed.
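For reference, the three conventions can be told apart by counting byte sequences. The following Python sketch is only an illustration and not how UE decides what to show in the status line; the function name and the extra labels "MIXED" and "NONE" are assumptions.

```python
def line_ending_type(data: bytes) -> str:
    """Hypothetical helper: classify line ends as DOS, UNIX, MAC, MIXED or NONE."""
    crlf = data.count(b"\r\n")
    # Count only the \r and \n bytes that are not part of a \r\n pair.
    lone_cr = data.count(b"\r") - crlf
    lone_lf = data.count(b"\n") - crlf

    counts = {"DOS": crlf, "UNIX": lone_lf, "MAC": lone_cr}
    present = [name for name, count in counts.items() if count > 0]

    if not present:
        return "NONE"
    if len(present) > 1:
        return "MIXED"
    return present[0]


print(line_ending_type(b"first\nsecond\n"))      # UNIX
print(line_ending_type(b"first\r\nsecond\r\n"))  # DOS
print(line_ending_type(b"first\rsecond\r"))      # MAC
```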

I don't understand UE's system behind the type of newlines and the way they are displayed ("¬", "¶", or "$"). It seems to depend on so many things, including how UE should open the file (Open As, Format). I can only say with certainty that any document which UE thinks is UTF-8 shows "¬" for \n but "¶" for \r\n. A visual distinction of the newline type is a great help since there are cases where it matters. This way the difference is more apparent than with just a few letters in the status line. Since I have worked mainly with UTF-8 files, I have got so used to it that I was confused by the way newlines (and spaces) are displayed when actually editing something different.

Mofi · Jan 13, 2011 · #4 · 2011-01-13T13:49+00:00

I'm glad that with the setting you get now the behavior you want: all single byte encoded files you edit are read as UTF-8 files (those with bytes only in range of 0x00 to 0x7F or really encoded with UTF-8).

I understood you correctly and tested how spaces/tabs and line terminators are displayed for an ASCII UNIX file after enabling these special view modes. I could not reproduce what you captured and I still can't reproduce it with my settings with your file, see my images. I'm using the font Courier New with script Western. I suppose that your font with your script is responsible for the wrong display of the characters. The specified font with the specified script (code page) is also used for the characters displayed for spaces/tabs/line endings when enabling the appropriate options. UltraEdit as a text editor does not support displaying just some characters with a different font inside the document window, as Microsoft Word does. Therefore these special characters are also displayed with the specified font: wrong font/script, wrong display. When the file is a Unicode file and the specified font supports Unicode, the code page or script setting (which subset of Unicode to use) does not matter.

I explained somewhere already that the Set Code Page command is just for telling UltraEdit which code page to use for the ANSI file in case of conversion to/from Unicode. That is important when you do that manually using the Conversion commands, but happens also often in the background when pasting clipboard content from other applications, for example Microsoft Word or your browser which copy often text in UTF-16 LE to clipboard and UltraEdit has to convert them on paste to ANSI. And the code page is important when running a sort with option Use Locale. Hm, never tested what happens when using Use Locale on a Unicode file and the file contains really language specific characters. I work very rarely with Unicode files in any encoding. On changing the code page for a file UltraEdit informs the user with a message when the current font/script setting can result in wrong display of the bytes as characters.
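Why the code page matters for such a conversion can be sketched in Python. This only illustrates the principle of converting UTF-16 LE clipboard text to a single byte code page; the function name is hypothetical and nothing here reflects UltraEdit's internal implementation.

```python
def paste_as_ansi(clipboard_utf16le: bytes, code_page: str) -> bytes:
    """Hypothetical helper: convert UTF-16 LE text to single byte text.

    Characters missing from the target code page become '?'.
    """
    text = clipboard_utf16le.decode("utf-16-le")
    return text.encode(code_page, errors="replace")


clipboard = "č".encode("utf-16-le")        # as copied from Word or a browser
print(paste_as_ansi(clipboard, "cp1250"))  # b'\xe8'
print(paste_as_ansi(clipboard, "cp1252"))  # b'?'
```

The same character survives the paste with the Central European code page and is lost with the Western one, which is why the editor has to know the code page of the ANSI file.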

Zababa · Jan 13, 2011 · #5 · 2011-01-13T16:09+00:00

Thank you for writing something about how the character display works. The font I use for displaying stuff is DejaVu Sans Mono, one of the few monospaced fonts which are Unicode and have a sufficient number of glyphs to satisfy the needs of linguists.

Now I see that it is not only the code page I select but also the subset of the Unicode font I use for display which influences the characters of 8-bit encodings. Before I adopted UE as my main writing tool I worked in EmEditor, which is somewhat more Unicode-oriented in its core but lacks many of UE's features. However, changing the code page of a document there resulted in an instant reload of the document and you saw the changes of the characters immediately, because the user could define everything beforehand in EmEditor's configuration settings (which character encodings will be displayed by which fonts or their subsets).

So I forgot that in UE it's two things: 1. change the code page, 2. change the display font. Since I work with Unicode files I am happy that I don't need to care about that anymore. Everything is displayed as long as the font has the corresponding glyph.

Mofi · Jan 13, 2011 · #6 · 2011-01-13T17:04+00:00

Right, the two steps are not optimal. When answering display issues like this one, caused by wrong font/script settings, I have wondered several times why switching the code page, automatically by UE or manually by the user, does not also automatically select the correct script (code page) in the font settings, provided the font used supports Unicode or the selected code page. Knowing the code page, and knowing that the font supports Unicode, UltraEdit could look up the correct Unicode character for every byte value from 0x80 to 0xFF in the current code page. That might definitely be worth suggesting to IDM as an enhancement for a future version. But I don't do that because I always use only ANSI Latin I, the OEM character set (using the command OEM Character Set) and sometimes Unicode, and so never need to switch the code page, except when I try to reproduce a code page / display problem posted in the forums.
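The lookup suggested here, from byte values 0x80 to 0xFF in a known code page to Unicode characters, could look roughly like the following Python sketch. It only illustrates the idea and is not an existing UltraEdit feature; the function name is an assumption.

```python
def high_byte_table(code_page: str) -> dict:
    """Hypothetical helper: map each byte 0x80..0xFF to its Unicode character.

    Bytes the code page leaves undefined map to U+FFFD (replacement character).
    """
    table = {}
    for value in range(0x80, 0x100):
        table[value] = bytes([value]).decode(code_page, errors="replace")
    return table


western = high_byte_table("cp1252")
central_european = high_byte_table("cp1250")

# The same byte yields different characters depending on the code page.
print(hex(ord(western[0xE8])), western[0xE8])
print(hex(ord(central_european[0xE8])), central_european[0xE8])
```

With such a table and a Unicode font, the editor would not need a separate script setting to pick the right glyphs for an ANSI file.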