Converting HTML notation of Unicode characters into text

rotten · Nov 24, 2008, 20:50 UTC · #1

Hello and nice to meet you.

I pasted some text from a web page that contains Unicode characters written like this:

&#942;&#947;&#959;

How can I convert them to plain text? (I know that the text is Greek, and of course I have to use a Unicode font.)

Thank you.

Mofi · Nov 25, 2008, 13:24 UTC · #2

Those are Unicode characters, but not in a Unicode notation. They are written in HTML notation: the decimal code of each character in the ISO 10646 character table (identical to Unicode). The simplest way to get these Unicode characters as text is to put the strings into an HTML file, open this HTML file in your browser, select the text, and copy it into a Unicode text file.

For example, copy the following HTML code into an ASCII file, save it with UltraEdit under a name such as Html2Unicode.htm, and then click on Window - Show File in Browser to open (a copy of) this file in your preferred browser.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
 <title>HTML notation of Unicode characters</title>
 <meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body>

<p>&#942;&#947;&#959;</p>

</body>
</html>

jorrasdk · Nov 25, 2008, 14:08 UTC · #3

Okay, I have not tested this, but since these are decimal Unicode character code values, String.fromCharCode() in a script, combined with a code page and font suitable for displaying Greek, might solve this problem:

The script (for UE13 or above):

// Add a new function to the String prototype:
// search the string for &#nnn;, capture the decimal code, and convert it to a character
String.prototype.unescUnicodeEntities = function () {
    return this.replace(/&#(\d+);/g, function (str, charCode) {
        // charCode is the captured digit string; parse it explicitly as decimal
        return String.fromCharCode(parseInt(charCode, 10));
    });
};

// Select whole document
UltraEdit.activeDocument.selectAll();

// Retrieve selection
var sel = UltraEdit.activeDocument.selection;

// Unescape Unicode entities and write the result back into the editor
UltraEdit.activeDocument.write(sel.unescUnicodeEntities());

I would appreciate feedback on this idea!
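
The replacement logic on its own can be checked outside UltraEdit. The following is a minimal sketch in plain JavaScript (for example under Node.js), not an UltraEdit script. The function name decodeDecimalEntities is made up for illustration, and only decimal references of the form &#nnn; are handled. Whether the result is displayed correctly inside an editor still depends on the encoding of the file.

```javascript
// Hypothetical helper (not part of UltraEdit's scripting API):
// replaces decimal numeric character references such as &#942;
// with the character they stand for.
function decodeDecimalEntities(text) {
    return text.replace(/&#(\d+);/g, function (match, digits) {
        var codePoint = parseInt(digits, 10);
        // Leave references beyond the Unicode range untouched.
        if (codePoint > 0x10FFFF) {
            return match;
        }
        // String.fromCodePoint also handles code points above 0xFFFF.
        return String.fromCodePoint(codePoint);
    });
}

console.log(decodeDecimalEntities("<p>&#942;&#947;&#959;</p>")); // prints <p>ήγο</p>
```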

Mofi · Nov 25, 2008, 14:53 UTC · #4

jorrasdk, your idea is good, but unfortunately it does not work.

The problem is that these 3 characters have the hexadecimal Unicode values 0x3AE, 0x3B3 and 0x3BF. When your script runs on an ANSI file, for example with the Courier New font and the Greek code page selected, the 3 characters end up with the byte values 0xAE, 0xB3 and 0xBF (the high byte is lost), whereas the required values would be 0xDE, 0xE3 and 0xEF. In the ISO-8859-7 and Windows-1253 code pages, these 3 characters have byte codes that differ from the low byte of their Unicode values.
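
The difference between the low byte and the code page byte can be illustrated with a small sketch in plain JavaScript (run outside UltraEdit, for example under Node.js). The helper name greekCodePageByte is hypothetical, and it assumes only the range U+0390 to U+03CE, where the ISO-8859-7 and Windows-1253 byte is the code point minus 0x2D0. It is not a complete code page table.

```javascript
// Hypothetical helper: maps a Greek code point to its ISO-8859-7 byte.
// Assumption: only U+0390..U+03CE is covered (U+03A2 is unassigned),
// where the byte value equals the code point minus 0x2D0.
function greekCodePageByte(codePoint) {
    if (codePoint < 0x390 || codePoint > 0x3CE || codePoint === 0x3A2) {
        return null; // outside the range this sketch handles
    }
    return codePoint - 0x2D0;
}

// Formats a number as an uppercase hexadecimal literal.
function hex(n) {
    return "0x" + n.toString(16).toUpperCase();
}

[942, 947, 959].forEach(function (cp) {
    // The low byte (cp & 0xFF) is what remains when the high byte is lost.
    console.log(hex(cp) + ": low byte " + hex(cp & 0xFF) +
                ", code page byte " + hex(greekCodePageByte(cp)));
});
// first line printed: 0x3AE: low byte 0xAE, code page byte 0xDE
```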

So the source file would already have to be a Unicode file when the script runs. But then the line

var sel = UltraEdit.activeDocument.selection;

returns only the first character of the file, because Unicode strings cannot be stored in NULL-terminated single-byte character arrays.

One idea I had was to convert the file into an ASCII Escaped Unicode file by replacing every &#dec code; with \uhex code, then save the modified ASCII/ANSI source file, close it, and re-open it. If the detection of ASCII Escaped Unicode files is enabled, UltraEdit will now open the file as a Unicode file and the characters will be displayed correctly. A conversion to ASCII/ANSI would then turn the file correctly into a simple text file, provided the code page is set correctly.
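
The replacement step of this idea could look roughly like the following sketch in plain JavaScript. The function name entitiesToUnicodeEscapes is made up for illustration. It is also an assumption that 4 uppercase hexadecimal digits, and surrogate pairs for code points above 0xFFFF, are what the ASCII Escaped Unicode detection expects. The sketch only transforms a string and does not save or re-open any file.

```javascript
// Hypothetical helper: rewrites decimal references such as &#942;
// as \uXXXX escapes such as \u03AE.
function entitiesToUnicodeEscapes(text) {
    // Formats one 16-bit value as \u followed by 4 hex digits.
    function escape16(unit) {
        var hexDigits = unit.toString(16).toUpperCase();
        while (hexDigits.length < 4) {
            hexDigits = "0" + hexDigits;
        }
        return "\\u" + hexDigits;
    }

    return text.replace(/&#(\d+);/g, function (match, digits) {
        var codePoint = parseInt(digits, 10);
        if (codePoint > 0x10FFFF) {
            return match; // not a valid Unicode code point, keep as is
        }
        if (codePoint <= 0xFFFF) {
            return escape16(codePoint);
        }
        // Above 0xFFFF: split into a UTF-16 surrogate pair.
        var offset = codePoint - 0x10000;
        return escape16(0xD800 + (offset >> 10)) + escape16(0xDC00 + (offset & 0x3FF));
    });
}

console.log(entitiesToUnicodeEscapes("&#942;&#947;&#959;")); // prints \u03AE\u03B3\u03BF
```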

jorrasdk · Nov 25, 2008, 15:54 UTC · #5

Thanks for the feedback!

rotten · Nov 25, 2008, 19:58 UTC · #6

Thanks for replying. I will try to find a solution.