Why must many emojis in a Unicode encoded text file be deleted by pressing the BACKSPACE key twice?

abcjme · Nov 29, 2019 · #1 · 2019-11-29T11:01+00:00

I've found out that many emojis in a UTF-8 encoded text file require two BACKSPACE key presses to be deleted by UltraEdit for Windows version 26.20.0.46.

I know that some single characters are actually a combination of multiple characters and this combination is the reason why they require multiple backspaces.

Example: οͅ is a combination of ο (U+03BF) and the combining character ͅ.
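A small Python sketch of the same idea, assuming the combining character in the example is U+0345 (COMBINING GREEK YPOGEGRAMMENI). It only shows that what looks like one character on screen is stored as two code points, which is why an editor may need two key presses for it.

```python
import unicodedata

# Assumption: the example is omicron (U+03BF) followed by U+0345.
text = "\u03bf\u0345"

code_points = [f"U+{ord(ch):04X}" for ch in text]
# combining() returns a non-zero class for combining marks and 0 for base characters.
combining_flags = [unicodedata.combining(ch) != 0 for ch in text]

print(len(text))          # number of code points in the visible "character"
print(code_points)
print(combining_flags)
```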

However, I'm not talking about combined character emojis. I'm talking only about single character emojis. I've put these same emojis on Chrome text editors. Chrome deletes them after only one backspace and so I've only experienced this backspace issue with UltraEdit.

Here's an example single character emoji that requires two backspaces in UltraEdit to delete:

🙂 (U+1F642)

Mofi · Nov 29, 2019 · #2 · 2019-11-29T22:28+00:00

What do you know about character encoding, especially about Unicode encoding with the two most popular transformation formats UTF-8 and UTF-16? What do you know about the Unicode planes?

UTF-8 encoding uses one to four bytes to encode a Unicode character. So it is clear that this character encoding is a multi-byte character encoding. Every application must convert the byte stream in a UTF-8 encoded file to a character stream in memory with a fixed number of bytes per character as otherwise it would be extremely inefficient to manage in memory the characters with variable number of bytes per character depending on their code values.

Most people think (as I did too, for many years) that UTF-16 uses a fixed two bytes per character for Unicode characters. But this is true only for characters with a code value in the Basic Multilingual Plane, while all other Unicode characters with a code value in one of the supplementary planes, like emoticons or Egyptian hieroglyphs, require four bytes in a UTF-16 encoded file. So UTF-16 encoding is also a multi-byte encoding.
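The byte counts described above can be checked with a few lines of Python. This is only an illustration using Python's built-in codecs; the sample characters are chosen here and are not taken from the original file.

```python
samples = {
    "A": "A",                                # U+0041, ASCII
    "e-acute": "\u00e9",                     # U+00E9
    "euro": "\u20ac",                        # U+20AC, still in the BMP
    "slightly smiling face": "\U0001F642",   # U+1F642, supplementary plane
}

def encoded_sizes(ch):
    """Return (bytes in UTF-8, bytes in UTF-16) for a single code point."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

sizes = {name: encoded_sizes(ch) for name, ch in samples.items()}
for name, (u8, u16) in sizes.items():
    print(f"{name}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Only the last sample needs four bytes in both encodings; the BMP samples need one to three bytes in UTF-8 and always two bytes in UTF-16.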

How many bytes are used in memory for a Unicode character?

Well, that depends on the used string library for wide characters. Many string libraries use just two bytes (16 bits) for each character as this is enough for all characters in basic multilingual plane which contains the characters for almost all modern languages, and a large number of symbols. That reduces the memory needed for large character blocks like on opening a Unicode encoded text file in a text editor with more than 500 MB of Unicode characters. Other string libraries use four bytes (32 bits) per wide character which of course requires the double memory in comparison to the other implementation with just two bytes per character. An application like a web browser which most likely has to never load a file with millions of Unicode characters could use such a string library to easily support really all characters defined by Unicode Consortium.
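As a rough illustration of the memory argument, the following sketch compares the storage needed for the same BMP-only text with 16-bit and with 32-bit units. The sample text is hypothetical and the encoded byte strings merely stand in for an in-memory buffer.

```python
# Hypothetical BMP-only sample text, 14 characters repeated 1000 times.
text = "Hello, world! " * 1000

utf16_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP character
utf32_bytes = len(text.encode("utf-32-le"))  # 4 bytes per character, always

print(len(text), utf16_bytes, utf32_bytes)
```

For text without supplementary plane characters the 32-bit representation is exactly twice as large as the 16-bit one.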

Does UltraEdit for Windows v26.20.0.46 use a string library using 16 or 32 bits per Unicode character?

I don't know. This question must be sent to IDM support by email to get it answered by an UltraEdit developer. But it looks like UltraEdit uses just two bytes per Unicode character. That makes sense to me from a programmer's point of view, taking into account that UltraEdit is a text editor which supports editing text files of any size, including huge text files, while a web browser (mainly) just has to display text files which are usually smaller than a few MB.
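A minimal model of why a buffer of 16-bit units can need two BACKSPACE presses for one emoji. This is a sketch under the assumption that the editor removes exactly one 16-bit unit per key press; it is not UltraEdit's actual code, and all function names are made up for the illustration.

```python
import struct

def to_units(text):
    """Encode text as a list of 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

def is_high_surrogate(u):
    return 0xD800 <= u <= 0xDBFF

def is_low_surrogate(u):
    return 0xDC00 <= u <= 0xDFFF

def naive_backspace(units):
    """Hypothetical editor model: remove exactly one 16-bit unit."""
    return units[:-1]

def pair_aware_backspace(units):
    """Remove a whole surrogate pair if the buffer ends with one."""
    if (len(units) >= 2 and is_low_surrogate(units[-1])
            and is_high_surrogate(units[-2])):
        return units[:-2]
    return units[:-1]

buf = to_units("a\U0001F642")            # "a" followed by U+1F642
after_one = naive_backspace(buf)         # a lone high surrogate remains
after_two = naive_backspace(after_one)   # second key press removes it
aware = pair_aware_backspace(buf)        # one key press removes the pair

for label, units in [("buffer", buf), ("naive x1", after_one),
                     ("naive x2", after_two), ("pair aware", aware)]:
    print(label, [f"{u:04X}" for u in units])
```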

Does it make a difference if a text file is just displayed or can be also edited?

Yes, it does. Most users editing a text file expect that hundreds, thousands or even millions of changes can be applied on a text file, for example with a search and replace. The text editing users expect also that the used application records in memory what has been changed on the text by a user action, and that those changes can be undone step by step. So a text editor like UltraEdit used really for editing text and not just for viewing text needs more or less quickly more and more memory. Well, some users use nowadays personal computers with 16, 32, 64 or even more GB of RAM installed and for that reason don't care about memory usage by text editor on not having on same machine other applications running which use many GB like a database application with a huge database loaded. But the vast majority of computer users do not use machines with so much RAM. So memory efficiency is still important.

How many users of UltraEdit edit text files with characters with a code value in a supplementary plane?

I don't know. I just know that in 15 years of moderating the UltraEdit forums, only two users wrote a post about an issue with a Unicode character with a code value outside the Basic Multilingual Plane, and you are one of these two users. The other user also wrote about an emoji character in a text file, without knowing that the incorrectly displayed character was an emoji.

What should you do next?

That's up to you. You can report this issue to IDM support by email if you think that UltraEdit should support deletion of slightly smiling face emoji in a UTF-8 encoded text file with hitting key BACKSPACE just once like Chrome text editor. But I think, the priority of this issue will be rated very low by the UltraEdit developers as nearly no UltraEdit user uses UltraEdit to edit text files with Unicode encoded emojis.

By the way: Which font have you configured to get this emoticon displayed correctly in UltraEdit at all?

I failed to find a configuration to get the slightly smiling face emoji displayed in UltraEdit at all. Web browsers like Google Chrome, Mozilla Firefox, Microsoft Edge, Apple Safari or Opera and web pages like Facebook or Twitter and chat applications on smart mobile phones or on computers use a small image from an image library to display this emoticon and other emojis. The Unicode Consortium defines just the general parameters for the look of an emoticon being added to the Unicode character set. Therefore emoticons just look similar depending on which application displays it, but one emoji is not displayed 100% identical in all applications.

PS: I am not a fan of emoticons. Mankind needed thousands of years to reduce the number of glyphs from many thousands to just some dozens and to express everything, including emotions, with combinations of this small set of glyphs by forming words that are learned in school over several years. In recent years mankind has made up lots of new glyphs which are used instead of words to express emotions, which is for me like going back to ancient times when just a small group of people knew how to interpret a glyph.

abcjme · Dec 16, 2019 · #3 · 2019-12-16T07:00+00:00

Mofi, as always, I appreciate your very informative posts.

I was, indeed, knowledgeable of how UTF-8 & UTF-16 handle the bytes of characters. However, I didn't know how apps handle variable byte lengths until now. Indeed, I understand now the memory & processing savings of a 2 byte string library.

However, I found a whole bunch of bugs, some of them quite horrible, that are all based on using 2 byte string library. This is actually why I've taken so long to respond. I was busy testing and documenting all these bugs. I know that most people don't use 4 byte characters. However, some people use them frequently (e.g. linguists). And on considering how bad these bugs are, I wonder now whether UltraEdit can provide a byte string option to users. They can perhaps let people choose whether they want UltraEdit to use a 2 byte or 4 byte string library, or they can maybe fix these bugs while still maintaining a 2 byte string library.

Mofi wrote: Which font have you configured to get this emoticon displayed correct in UltraEdit at all?

Segoe UI Emoji which displays inconsistently some emojis and I don't know why.
See also Windows 10: Incomplete display of font's characters on Super User.

LIST OF BUGS

1A: DOUBLE BACKSPACE ON 4 BYTE CHAR

UTF-8 requires only 1 to 3 bytes to encode any character in the BMP. But it requires 4 bytes to encode any character in a plane other than the BMP. And so, first, I demonstrated that the double backspacing requirement only occurs on 4 byte characters.

https://www.dropbox.com/s/ibsj4jfa0tq8ts1/1ue~1~double~bs~on~4~byte~char.mp4

1B: DOUBLE ARROW KEY PRESS ON 4 BYTE CHAR

This also occurs similarly with arrow key presses. However, after pressing the key twice to move through a 4 byte character once, it only needs 1 key press from then on, although entering hex mode while the cursor is at the end of a 4 byte character can reinstate this double key press requirement.

https://www.dropbox.com/s/3k9a17wfpmj0g9f/1ue~1b~arrow~press~on~4~byte~char.mp4

2: 2 BYTE FLAGS TO HANDLE 4 BYTES

2 bytes can code 65536 characters (U+0000 to U+ffff).

Unicode has 137 994 assigned characters:

0 BMP: 55445 (assigned)
1 SMP: 21353 (assigned)
2 SIP: 60859 (assigned)
14 SSP: 337 (assigned)

65536 (2 bytes) - 55445 = 10091 (available characters that can be coded in 2 bytes)

21353 (SMP) + 60859 (SIP) + 337 (SSP) = 82549 (total assigned characters other than the BMP characters)

UTF-8 requires anywhere from 1 to 3 bytes to code all of the BMP characters. But parts of UTF-8's bytes are bits that act as flags to a character's location and byte length. And, indeed, without flags, all of the BMP could fit in 2 byte combinations and still 10091 byte combinations would be left. 10091 aren't enough to cover the rest of Unicode's assigned characters (82549). However, flags don't have to just be some bits of a byte. An entire 2 byte string can act instead as a flag to a character that has 4 bytes in UTF-8.
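The arithmetic on the per-plane counts can be re-run in a few lines. The counts are the ones quoted in this post and are not independently verified here.

```python
# Assigned character counts per plane, as quoted in the post.
assigned = {"BMP": 55445, "SMP": 21353, "SIP": 60859, "SSP": 337}

two_byte_values = 2 ** 16                               # 65536 possible 16-bit values
unused_in_bmp = two_byte_values - assigned["BMP"]       # 16-bit values left over
outside_bmp = sum(n for plane, n in assigned.items() if plane != "BMP")
total = sum(assigned.values())

print(unused_in_bmp, outside_bmp, total)
print(unused_in_bmp >= outside_bmp)   # can the leftovers hold all other characters?
```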

UltraEdit does seem to use its extra 2 byte combinations as flags. I input 3 different UTF-8 4 byte characters to test this. Then I did a single backspace on all of them and then I looked at the hex code of all of them and they all had the same code.

https://www.dropbox.com/s/pz4u1757z5q4ly7/1ue~2~2~byte~flags.mp4

This suggests:

That my backspace is deleting the UTF-8 4 byte characters.
But also, that it's leaving behind their 2 byte flag.
And, apparently, UltraEdit uses the same 2 byte string to flag many different UTF-8 4 byte characters.
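One possible reading of this observation, offered as an assumption and not as a confirmed description of UltraEdit's internals: if the text is held as UTF-16 code units, a single BACKSPACE removes the trailing unit of the pair and leaves the leading one, and many emojis share the same leading unit. The three test characters below are hypothetical, since the post does not name the ones that were used.

```python
# Hypothetical test characters: U+1F642, U+1F600 and U+1F44D.
examples = ["\U0001F642", "\U0001F600", "\U0001F44D"]

def utf16_units(ch):
    """Return the UTF-16 code units of ch as upper-case hex strings."""
    raw = ch.encode("utf-16-be")
    return [raw[i:i + 2].hex().upper() for i in range(0, len(raw), 2)]

units = [utf16_units(ch) for ch in examples]
# What would remain of each character after dropping its last 16-bit unit.
leading = {u[0] for u in units}

for ch, u in zip(examples, units):
    print(f"U+{ord(ch):X}", u)
print(leading)
```

All three characters start with the same 16-bit unit, so the bytes left behind after one BACKSPACE would look identical in a hex view.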

A side note: All of the UTF-8 UltraEdit bugs also occur in any other 4 byte encoding like UTF-16. But these bugs don't occur in Notepad, i.e. Notepad doesn't require 2 backspaces to delete a UTF-8 4 byte character. Thus, the bugs aren't universal to the operating system or some factor external to the app.

3A: EOF: NEWLINE BUG

Put a 4 byte character at the end of a file.
Then put ENTER (CRLF control characters).
This will disorder the end row's ordinal number.

https://www.dropbox.com/s/4y1tditajb3fbrd/1ue~3a~eof~newline~bug.mp4

3B: EOF: NEWLINE BUG VAR

Put a 4 byte character at the end of a file.
Then press ENTER.
Then press LEFT or UP.

https://www.dropbox.com/s/70lut7af1g46x0o/1ue~3b~eof~newline~bug~var.mp4

4: EOF: HEX MODE BUG

Put a 4 byte character at the end of a file.
Then activate hex edit mode.
Then exit hex edit mode.
This will cause the 4 byte character to disappear.

https://www.dropbox.com/s/qh36ue6vux5mtsi/1ue~4~eof~hex~mode.mp4

5: CURSOR POSITION CHANGE

Vertically go to a line's 4 byte character from a line's non 4 byte character.
Then activate hex edit mode.
Then exit hex edit mode.
The cursor position will now be 1 column coordinate less.

https://www.dropbox.com/s/8pqhatr5g0sy821/1ue~5~cursor~position~change.mp4

6A: 4 BYTE CHAR SPLIT WITH CHAR

Vertically go to a line's 4 byte character from a line's non 4 byte character.
Then put a character.
This will split the 4 byte character into 2 non 4 byte characters, with the character that you put being between these 2.
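A sketch of how such a split could arise if the caret is allowed to sit between the two 16-bit units of one character. This is an assumed model of the buffer, not UltraEdit's implementation, and the helper name lone_surrogates is made up for the illustration.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def lone_surrogates(units):
    """Return the indexes of surrogates that are not part of a high+low pair."""
    lone = []
    i = 0
    while i < len(units):
        u = units[i]
        if is_high(u) and i + 1 < len(units) and is_low(units[i + 1]):
            i += 2              # well-formed pair, skip both units
            continue
        if is_high(u) or is_low(u):
            lone.append(i)      # surrogate without its partner
        i += 1
    return lone

buf = [0xD83D, 0xDE42]              # U+1F642 as a surrogate pair
before = lone_surrogates(buf)       # nothing is broken yet

buf.insert(1, ord("x"))             # typing "x" with the caret inside the pair
after = lone_surrogates(buf)        # both halves are now orphaned

print(before, [f"{u:04X}" for u in buf], after)
```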

https://www.dropbox.com/s/sv3to9bec0u2doe/1ue~6a~4~byte~char~split~with~char.mp4

6B: 4 BYTE CHAR SPLIT WITH BACKSPACE

Vertically go to a line's 4 byte character from a line's non 4 byte character.
Then press BACKSPACE.
This will turn the 4 byte character into a non 4 byte character and also place the cursor position to the start of the character rather than the end of the character.

https://www.dropbox.com/s/diz3vhsz815nkay/1ue~6b~4~byte~char~split~with~bs.mp4

7: SOF CURSOR TRANSPORTATION

Put two 4 byte characters at the start of file (sof).
Then keep pressing LEFT.
The cursor will auto transport to the end of the first row.
If: you don't put any more 4 byte characters on the first row,
then: this will only happen once.
But if: you put more 4 byte characters on the first row, in any position other than the first column,
then: you can repeat this auto transportation.

https://www.dropbox.com/s/y1j52nq48mlzgxf/1ue~7~row~1~cursor~transportation.mp4

8: EOF: BACKSPACE ON 4 BYTE CHAR

Put a 4 byte character at the end of a file.
BACKSPACE once on the 4 byte character.
The 4 byte character will disappear.

https://www.dropbox.com/s/jjhdcuzp0fb8inz/1ue~8~eof~bs.mp4

9A: EOF: HEX MODE

Put a 4 byte character at the end of a file.
Enter hex edit mode.
Exit hex edit mode.
The 4 byte character will disappear.
This will also cause near permanent changes to the text that can lead to other bugs.

https://www.dropbox.com/s/z96nnofgaeazhe7/1ue~9a~eof~hex.mp4

9B: EOF: SAVE

Put a 4 byte character at the end of a file.
Save.
The 4 byte character will disappear.
This will also cause near permanent changes to the text that can lead to other bugs.

https://www.dropbox.com/s/mc35n7zcfko47mt/1ue~9b~eof~save.mp4

9C: EOF: SCROLL BAR

After doing end of file hex or save, the scroll bar will behave strangely near the end of file.

https://www.dropbox.com/s/hf35x2u8cs1uhyf/1ue~9c~eof~scroll~bar.mp4

9D, 9E, AND 9F: EOF: DOUBLE TRANSPORTATION

After doing end of file hex or save, adding a character or pressing BACKSPACE at the end of file will result in a scroll bar transportation to the start of file.
After doing this once, on any line that has a 4 byte character in the first column, putting a character at any column after the 4 byte character will cause another scroll bar transportation.
Or, on any line that has a 4 byte character, a BACKSPACE that causes the 4 byte character to become the first column of its line, or a BACKSPACE that occurs at the start of a line that has a 4 byte character in the first column, will cause cursor transportation to the start of the file.
Or, on any two or more consecutive lines that have a 4 byte character, with a 4 byte character at the start of at least the last line, then using the vertical arrow keys to move from before to after the lines will disorganize the column line number and it'll cause a cursor transportation to the start of the file.
But a non end of file transportation will only happen once after an end of file transportation.

https://www.dropbox.com/s/pwxq1akwnjn32zf/1ue~9d~eof~double~transportation~char.mp4
https://www.dropbox.com/s/bluj2z35cwb19c1/1ue~9e~eof~double~transportation~bs.mp4
https://www.dropbox.com/s/uw9knw6mlzduxrr/1ue~9f~eof~double~transportation~double~vert~4~byte~char.mp4

9G: EOF: END ENTRAPMENT

After doing end of file hex or save, pressing the end key can trap the cursor in the last 2 columns.
However, after being trapped, doing a BACKSPACE can erase the trapped 4 byte character.
Thereby, it can get rid of the semi permanence of this end of file bug.

https://www.dropbox.com/s/q1flfqihl0lnzg1/1ue~9g~eof~end~entrapment.mp4

9H: EOF: IMPOSSIBLE TO BACKSPACE

Put a 4 byte character at the end of a line.
Then press RIGHT.
Then BACKSPACE.
Depending on the scroll bar location, the 4 byte character might be impossible to BACKSPACE unless you change the cursor position.

https://www.dropbox.com/s/jdk531haetccbg6/1ue~9h~eof~impossible~to~bs.mp4

9I: EOF: 4 BYTE BS DIFFERS BECAUSE OF SCROLL BAR POSITION

BACKSPACE on a 4 byte character has a different effect depending on the scroll bar location.
The BACKSPACE might cause the character's blank grapheme to become a question mark kind of grapheme.
Or the BACKSPACE might cause the character to disappear.

https://www.dropbox.com/s/0fmdlgotilhc77f/1ue~9i~eof~4~bs~differs~cuz~of~scroll~bar.mp4

9J: EOF: ARROW DOWN GOES TO RELATIVE 2ND LINE

Pressing DOWN on the last line causes the cursor to transport to the relative second line.

https://www.dropbox.com/s/r4oqbcq8bnk9wq0/1ue~9j~eof~arrow~down~relative~2nd~line.mp4

9K: EOF: HEX CURSOR PAST END CAN ERASE THE EOF 4 BYTE CHAR

Putting the cursor at the end of file, and then pressing BACKSPACE, won't delete the end character.
But you can go to hex edit mode.
Then put the cursor past the last character.
Then exit hex edit mode.
Then press BACKSPACE.
And this will delete the end character.

https://www.dropbox.com/s/7y3f1ip71nl82z6/1ue~9k~eof~hex~erase.mp4

Mofi · Dec 21, 2019 · #4 · 2019-12-21T16:00+00:00

First, thank you for the information that the font Segoe UI Emoji is used by you to get displayed the emoticons in Unicode encoded text files at all.

I am still using at the moment only Windows XP and Windows 7 which both have not installed this font by default.
An upgrade of my main office computer to Windows 10 is planned for next month. I have to upgrade the main office computer although I don't want that. Windows 10 will not bring any advantages for my daily work. I am pretty sure that with the upgrade to Windows 10 my daily work will become less efficient on the same hardware, because Windows 10 is less efficient than Windows 7, which was already less efficient than Windows XP for my daily paid work.

You have done an impressive work to find and document all issues caused by the fact that currently latest version 26.20.0.68 of UltraEdit for Windows and currently latest version 19.20.0.44 of UEStudio do not support correct editing of Unicode characters encoded with two 16-bit code units called a surrogate pair. The currently latest versions of UltraEdit and UEStudio support correct loading, editing and saving UTF-8 and UTF-16 encoded Unicode files containing code points from the supplementary planes, as long as those characters and symbols are not deleted, modified, inserted, searched or replaced by the user in such Unicode text files. I am really impressed by how much time you invested in this work. I would have asked first IDM support by email if current version of UltraEdit supports surrogate pairs at all. This is obviously not the case.

I know that Visual C/C++ is used by IDM Computer Solutions, Inc. for UltraEdit for Windows and for UEStudio. I know that from inspecting the executable files of UltraEdit. I can see on the executables that Visual Studio 2017 is used as compiler for UE v26.20 and UES v19.20.

The Visual Studio 2017 documentation page Multibyte and Wide Characters referenced from VS2017 documentation page Multibyte Characters of VS2017 documentation chapter Characters describes that wide characters are multilingual character codes that are always 16 bits wide. I like more the Microsoft documentation page Working with Strings because it offers more useful information. This page contains similar to the other documentation page specific for Visual Studio 2017 the information:

Windows represents Unicode characters using UTF-16 encoding, in which each character is encoded as a 16-bit value.

Well, we both know that this statement is wrong nowadays. The characters from a supplementary plane are encoded with two 16-bit values using a surrogate pair. It looks like UltraEdit uses the string library of Visual Studio 2017 without any additional code to support Unicode characters encoded with a surrogate pair. I also found the discussion Wide strings vs UTF-16 strings very informative on this topic, although it is from March 2014 and therefore most likely not up-to-date anymore.
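For reference, this is how a surrogate pair is computed from a code point and back, following the UTF-16 definition. The function names are chosen for this sketch only.

```python
def encode_surrogates(code_point):
    """Split a supplementary plane code point into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # upper 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # lower 10 bits
    return high, low

def decode_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = encode_surrogates(0x1F642)       # slightly smiling face
round_trip = decode_surrogates(high, low)

# Cross-check against Python's own UTF-16 encoder.
from_codec = "\U0001F642".encode("utf-16-be")
by_hand = high.to_bytes(2, "big") + low.to_bytes(2, "big")

print(f"{high:04X} {low:04X}", f"U+{round_trip:X}", from_codec == by_hand)
```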

I don't know if Visual Studio 2019 introduces a string library using 32-bit per character. I suggest to search in world wide web for an appropriate information if you are interested in. But I doubt that because I think it would cause an immense compatibility problem with all the Windows 7/8/8.1/10 libraries including the Windows kernel libraries.

You wrote that Windows Notepad supports emoticons. But you have not written which version of Windows Notepad supports emoticons encoded in a Unicode encoded text file with four bytes. Microsoft had not made any enhancements to Windows Notepad for many years. But Microsoft implemented enhancements to Windows Notepad starting with Windows version 10.0.17666, as can be read in the Wikipedia article about Windows 10 version history. More enhancements were made by Microsoft to Windows Notepad in the Windows 10 versions 10.0.17713 and 10.0.18298. In other words, Windows Notepad of Windows 10 1809 and later Windows 10 versions is different from Windows Notepad of Windows 10 1803 and former versions of Windows 10. But even knowing the exact Windows 10 version is perhaps not enough because of the information "Notepad updates is now available via Microsoft Store" listed as a highlight of Windows 10 version 10.0.18963. I don't have access at the moment to a Windows 10 machine. But if Windows Notepad updates are now available via Microsoft Store, the users of Windows 10 with a version lower than Windows 10 1809 could have the possibility to update their Windows Notepad, too. Text writers and developers creating text files with code should take into account that not all users of Windows have the same version of Notepad installed anymore. So such people have to take care that a text file written by them, or created by an application written by them, really looks good when opened by a user with a Windows Notepad that is not the currently latest version released by Microsoft.

Note: Microsoft releases every 6 months a completely new compiled version of Windows 10. All files installed by default with Windows 10 into directory %SystemRoot% and its subdirectories are replaced by Windows 10 on upgrading the Windows 10 version. For that reason it is very often not enough to just write on reporting an issue that Windows 10 is used as operating system. It is quite often very important to know which version of Windows 10 is used by the user on reporting an issue. On Windows XP and all newer Windows versions it is possible to click on Windows Start button and execute winver. That is a very small executable with full qualified file name %SystemRoot%\System32\winver.exe which shows in a GUI window the Windows version with all additional information usually needed on reporting an issue. The command ver can be executed in a Windows command prompt window to get the exact version string of Windows which could be very important especially on using Windows 10 and reporting an issue.

Further I know from looking on files in program files folder of UltraEdit for Windows and UEStudio that the currently latest versions of UE/UES use the ICU - International Components for Unicode C++ library version 64.2 which is also used by many other software companies for their applications. However, the library uses most likely (not verified by me) standard C++ wide character strings and their appropriate string functions and so it depends on the used compiler and its string library if a wide character is handled in memory with 16 or with 32 bits. I don't use UltraEdit for Linux and UltraEdit for Mac, but if UEX and UEM are compiled with GCC, it could be that UEX and UEM support fully all Unicode characters including those in supplementary planes with including correct behavior on inserting, deleting, modifying, searching and replacing those characters.

What could be done by you now?

You have invested a lot of time to find and document many (definitely not all) issues caused by missing full support of code points encoded in UTF-16 using a surrogate pair. So I suggest to report those issues to IDM support by email.

What could be done by IDM Computer Solutions, Inc. now regarding the missing full support of code points encoded using a surrogate pair?

IDM could do nothing.
IDM could explicitly declare that Unicode characters encoded with four bytes are not fully supported for editing by UltraEdit for Windows and UEStudio and change nothing on code of UltraEdit for Windows and UEStudio.
IDM could change to a different compiler using 32-bit wide characters. But that would definitely be an extremely time consuming (months or even years) and so very expensive work. It also has the disadvantage that the memory usage for editing Unicode encoded text files (or all text files) would double, whereby it should not be forgotten that not only the file content must be kept in memory (partly on very large files), but also the undo history, all views and lists showing strings from the active file or a set of files, etc. must store strings in memory with 32 bits per character. I doubt that UltraEdit and UEStudio can be kept compatible with all the Microsoft libraries currently used by UE/UES on changing to a compiler using 32-bit wide characters. It would most likely be necessary to move nearly the entire code to a completely different framework, like all the applications which are written primarily for Linux and are ported to Windows using a framework which is not written by Microsoft for Windows (like Eclipse based on Java). That would dramatically decrease the performance of UltraEdit and UEStudio on Windows.
IDM could add lots of extra code to handle surrogate pairs in the UTF-16 text data stream. On every caret movement by the arrow keys and on commands like Goto, Select word, Select range, etc., the extra code would need to evaluate every single 16-bit value in the UTF-16 text data stream, in a range depending on the used command, to find out whether two 16-bit values build a surrogate pair of a Unicode character outside the Basic Multilingual Plane. That would result in an incredible performance loss on working with text files in UltraEdit. And there would remain the problem with Unicode characters of a supplementary plane which must be displayed in any other view than the file window or stored in any other file than the opened file, like the find/replace history stored in the INI file.
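To make the last option more concrete, here is a minimal sketch of what a surrogate aware caret step over a buffer of 16-bit units could look like. It is an illustration with made-up function names and says nothing about how such a check would fit into UltraEdit's real code or what it would cost there.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF

def is_low(u):
    return 0xDC00 <= u <= 0xDFFF

def next_caret(units, pos):
    """Move the caret one character to the right, skipping whole pairs."""
    if pos >= len(units):
        return len(units)
    if is_high(units[pos]) and pos + 1 < len(units) and is_low(units[pos + 1]):
        return pos + 2
    return pos + 1

def prev_caret(units, pos):
    """Move the caret one character to the left, skipping whole pairs."""
    if pos <= 0:
        return 0
    if pos >= 2 and is_low(units[pos - 1]) and is_high(units[pos - 2]):
        return pos - 2
    return pos - 1

buf = [0x61, 0xD83D, 0xDE42, 0x62]   # "a", U+1F642, "b" as UTF-16 code units

# Collect every caret position reached by pressing RIGHT from the start.
stops = [0]
while stops[-1] < len(buf):
    stops.append(next_caret(buf, stops[-1]))

print(stops)                 # caret never stops between the two surrogates
print(prev_caret(buf, 3))    # LEFT from behind the emoji
```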

There could be even more possibilities, but those are the four I could imagine.

Only the first two possibilities are realistic in my opinion. The other two possibilities are extremely expensive to implement (time and money if the developers are paid for the work) and very risky, as lots of bugs cannot be avoided, making users definitely not happy. They are most likely also the opposite of what 99.999999% of all UE/UES users would like, because UltraEdit/UEStudio would no longer be a very efficient native Windows text editor respectively IDE.

UltraEdit is for my daily work on text files the best text editor and UEStudio is for my daily programming tasks the best IDE with the exception that Visual Studio is the better IDE on searching for an error in code in a Windows GUI application coded by me using integrated debugger of Visual Studio. But I wrote in the last years mainly code for embedded devices where UEStudio is best for me as I can use the same customized development environment for various controllers and processors. Small Windows console applications compiled with Visual Studio are usually also debugged by me in Visual Studio if that is necessary at all which is not often the case.

UltraEdit is most likely not the best text editor for people who have to edit small Unicode encoded text files containing characters that are encoded with four bytes because they are not assigned to the Basic Multilingual Plane, and who really have to touch such characters in the Unicode encoded file. So my advice for such people is to use another application for this text editing task which supports this very special text editing task better than UltraEdit.

abcjme · Feb 02, 2020 · #5 · 2020-02-02T22:40+00:00

I'm very grateful for your detailed and informative post. I'm also very interested in this topic, and I suspect that other plausible ways to deal with these issues exist and that keeping all types of customers happy without needing to invest unrealistic amounts of time or money might be possible, or I could be wrong. I'm not confident of any opinion yet.

I was in the middle of thoroughly studying these things, but I unexpectedly got offered a lead management and development position of a new company. It's an amazing job, but I'm also having to work practically all day every day. Most of the business will be automated by March or April. Then, I'll hopefully have time to revisit this topic, but as of now, finding the time to focus on it is too difficult. But, I just wanted to say thanks and maybe I'll have more to say at a later date.