How to fix little squares displayed instead of the correct characters in copy/pasted text caused by a wrong ANSI to UTF-8 conversion?

pendle · Post #1 · Dec 26, 2017 (2017-12-26T21:57+00:00)

I'm going through about 5000 text documents and formatting them for a society's webpage. I've noticed that in some of that text there are little squares where there should be a space.

I'm wondering how to replace these en masse - or at least with little effort. I don't have the original files, these are what have been supplied, and I'm putting HTML around the text. The square isn't viewable on the UltraEdit screen, just when I look at said document (then a PHP file) on the website.

Any ideas?

Mofi · Post #2 · Dec 27, 2017 (2017-12-27T08:24+00:00)

A simple Replace in Files should remove those characters from the files. But it is necessary to know what those characters are.

Are those characters displayed at the top of the page?

In that case it could be that the files contain a byte order mark (BOM), which can also be removed by running a Replace in Files that searches for the two or three bytes of the BOM and replaces them with an empty string.
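Outside of UltraEdit, the same BOM check can be sketched in a few lines of Python. This is only an illustration and handles just the three-byte UTF-8 BOM (EF BB BF); the function name `strip_utf8_bom` is made up for this example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# A file that starts with a BOM loses its first three bytes, others are untouched.
print(strip_utf8_bom(b"\xef\xbb\xbfHello"))
print(strip_utf8_bom(b"Hello"))
```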

But if those characters are not the bytes of a BOM, we really need at least one file compressed into a ZIP or RAR archive and attached to your next post to determine what those characters are, if you cannot find that out by yourself.

pendle · Post #3 · Dec 27, 2017 (2017-12-27T13:31+00:00)

Hi - thank you for responding.

I've attached a sample transcript. If you view the file in a browser you will see the square appears between the R and S in "mothers side". Looking at the text in the file, there's nothing there.

This is a small transcript, but some are many lines and with many squares, that's why I'd like to try and find a way of dealing with these en masse.

Thank you

Mofi · Post #4 · Dec 27, 2017 (2017-12-27T15:45+00:00)

Okay, I could see the issue. The text/character encoding is not right.

Do you know anything about text/character encoding?

If not, you should read the power tip Working with Unicode in UltraEdit/UEStudio and this post, along with all other pages referenced in the power tip and the post. It is inexcusable for anyone editing HTML, XHTML and XML files in a text editor not to know what character encoding is, how it works, and what the meta tag with charset=utf-8 in the header of an HTML/XHTML file really means. At the very least, read the text below very carefully; it explains what happened because the text/character encoding was not taken into account.

The text with mother’s was originally not Unicode encoded; it most likely used code page Windows-1252. In Windows-1252 the right single quotation mark is encoded with just a single byte (8 bits) with hexadecimal value 92 (decimal 146, binary 1001 0010).
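For readers who want to verify this outside of UltraEdit, here is a small Python sketch. The sample bytes are an assumption modeled on the attached transcript; `cp1252` is Python's name for the Windows-1252 codec.

```python
# Sample bytes as they would be stored in a Windows-1252 encoded file (assumed example).
raw = b"mother\x92s"

byte_value = raw[6]                  # the single byte encoding the quotation mark
text = raw.decode("cp1252")          # interpret the bytes as Windows-1252
code_point = ord(text[6])            # Unicode code point of the decoded character

print(f"byte: hex {byte_value:02X}, decimal {byte_value}, binary {byte_value:08b}")
print(f"character: {text[6]!r}, code point U+{code_point:04X}")
```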

I think the HTML header was added next with the line:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

This line declares the file as Unicode encoded with UTF-8. But the encoding of the original text was not converted from Windows-1252 to UTF-8, so the text inside the XHTML file was in reality still Windows-1252 encoded although declared as UTF-8 encoded. That was the main mistake in the creation of the file.

Then the file, declared as UTF-8 encoded but in reality Windows-1252 encoded, was opened in UltraEdit, which recognized the UTF-8 encoding declaration and interpreted the file according to UTF-8. On the next save, the byte with hexadecimal value 92 was written out correctly UTF-8 encoded as the two bytes with hexadecimal values C2 92 and thereby became the character private use two (U+0092). This character is not supported by most fonts, so it is displayed with the font's default glyph for unsupported characters. This can be no glyph at all, a rectangle, or a thin line like -. It depends on the font how the unsupported character is displayed on screen. This behavior is correct, as private use characters are code points whose interpretation is not specified by a character encoding standard.
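The effect can be reproduced with a short Python sketch. It is only a model of what happened: the Latin-1 decode stands in for the editor mapping the stray byte 92 to the code point of the same value, which is an assumption about the mechanism and not a description of UltraEdit internals.

```python
import unicodedata

original = b"mother\x92s"                # Windows-1252 bytes, wrongly declared as UTF-8

# Strictly speaking these bytes are not valid UTF-8 at all.
try:
    original.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Model of the mis-read: byte 92 becomes code point U+0092 (same numeric value).
misread = original.decode("latin-1")
resaved = misread.encode("utf-8")        # what ends up in the file after saving

print(valid_utf8)                        # False
print(resaved)                           # b'mother\xc2\x92s'
print(unicodedata.category(misread[6]))  # 'Cc', a control character without a glyph
```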

The procedure really necessary to get text encoded in Windows-1252 correctly into a UTF-8 encoded XHTML file is as follows:

The text file with Windows-1252 encoded text is opened as ASCII/ANSI file with code page 1252 in UltraEdit.
Then the file is converted from ASCII/ANSI to UTF-8.
This step changes the character ’, encoded with a single byte (8 bits) with hexadecimal value 92, to Unicode with hexadecimal code point value 2019 (two bytes, i.e. 16 bits) in the memory of UltraEdit. This can be seen in UltraEdit by positioning the caret to the left of the character and executing the command Character Properties before and once more after the conversion to UTF-8.
Next the HTML header and footer can be inserted into the now really Unicode encoded file with the character set declaration as posted above.
On saving the Unicode file, UltraEdit runs the 8-bit Unicode transformation format procedure, which results in the character ’ being stored in the file with the three bytes with hexadecimal values E2 80 99.
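The same conversion can be sketched outside of UltraEdit in Python. This is a minimal illustration under the assumption that the source file really is Windows-1252; the function name `convert_cp1252_to_utf8` and the file names are hypothetical.

```python
from pathlib import Path


def convert_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Read src as Windows-1252 and write the same text to dst as UTF-8 without BOM."""
    # Decoding is strict: a byte that is undefined in Python's cp1252 codec
    # (81, 8D, 8F, 90, 9D) raises UnicodeDecodeError instead of being guessed.
    text = src.read_bytes().decode("cp1252")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names, used only for demonstration.
    sample = Path("sample_cp1252.txt")
    sample.write_bytes(b"mother\x92s side")
    convert_cp1252_to_utf8(sample, Path("sample_utf8.txt"))
    print(Path("sample_utf8.txt").read_bytes())
```

Working on bytes rather than in text mode avoids any line ending translation, so only the character encoding changes.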

I really recommend running this procedure on each file containing a character in Windows-1252 encoding with a value greater than 127 decimal, i.e. with a hexadecimal value in the range 80 to FF. Run a Find in Files with the Perl regular expression [\x80-\xFF] on the original Windows-1252 encoded files using the option Open matching files. Convert each opened file to UTF-8 encoding, add the XHTML header and footer and the other HTML tags to each file, and save the files (without UTF-8 BOM), now with text that really is UTF-8 encoded as declared in the header.
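As a rough equivalent of that Find in Files run, the following Python sketch lists the files that contain at least one byte in the range 80 to FF. The function name, the `*.txt` default pattern and the folder layout are assumptions made for the example.

```python
import re
from pathlib import Path

# Matches any single byte with a value from hexadecimal 80 to FF.
HIGH_BYTE = re.compile(rb"[\x80-\xFF]")


def files_with_high_bytes(folder: Path, pattern: str = "*.txt") -> list[Path]:
    """Return all files below folder that contain a byte greater than 7F."""
    return sorted(
        path
        for path in folder.rglob(pattern)
        if path.is_file() and HIGH_BYTE.search(path.read_bytes())
    )
```

Files not in the returned list are pure ASCII and are byte-for-byte identical in Windows-1252 and UTF-8, so they need no conversion.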

See also the script ConvertFilesToUtf8, which will most likely be a big help in converting all the Windows-1252 encoded files to UTF-8.

It would be possible to run a Perl regular expression Replace in Files with search string \xC2\x92 and replace string \xE2\x80\x99 to replace all occurrences of the UTF-8 encoded private use two with the UTF-8 encoded right single quotation mark. But that quick fix for this one character does not help with the other non-ASCII characters in the wrongly encoded XHTML files, like the character displayed as a question mark after mother’s. I am quite sure that in the original Windows-1252 encoded text this character was not a question mark. A question mark is the result of a character which could not be converted correctly from one text encoding to another.
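A slightly more general version of that quick fix can be sketched in Python. It assumes the damaged file is valid UTF-8 in which every code point from U+0080 to U+009F is a wrongly converted Windows-1252 byte; the function name `repair_c1_controls` is made up for this example, and the limitation described above still applies.

```python
def repair_c1_controls(data: bytes) -> bytes:
    """Map code points U+0080..U+009F back to the Windows-1252 characters they came from."""
    text = data.decode("utf-8")
    fixed = []
    for ch in text:
        if 0x80 <= ord(ch) <= 0x9F:
            try:
                # Reinterpret the code point value as a Windows-1252 byte.
                ch = bytes([ord(ch)]).decode("cp1252")
            except UnicodeDecodeError:
                pass  # byte is undefined in cp1252, leave the character unchanged
        fixed.append(ch)
    return "".join(fixed).encode("utf-8")


print(repair_c1_controls(b"mother\xc2\x92s"))   # b'mother\xe2\x80\x99s'
```

A character that was already replaced by a real question mark is not touched by this, because the information about the original character is gone.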

pendle · Post #5 · Dec 27, 2017 (2017-12-27T17:37+00:00)

Thank you so much for this information. Unicode etc is not something that I am familiar with. All these files I have were originally plain text or Word documents which have gone through various conversions.

I'll read through your links and information and hopefully resolve my problem.

Thank you again.