Creating a Perl regular expression string with ANSI/Unicode characters

Mofi · Sep 29, 2013 · #1 · 2013-09-29T14:04+00:00

Sometimes it is necessary to create an UltraEdit macro or script using finds/replaces for ANSI/Unicode strings.

ANSI strings consist of single-byte characters with a code value in the range 0 to 255. Which character is displayed for a value in the range 128 to 255 depends on the currently active code page. A Find/Replace in Files can therefore be a problem when the search/replace strings contain ANSI characters and not all files with single-byte characters use the same code page. But files of the same type (= same file extension) with only 1 byte per character usually use the same code page, so finds/replaces with ANSI characters are usually no problem.

More problematic are finds/replaces with Unicode characters having a code value greater than 255. An UltraEdit script must be an ASCII/ANSI file prior to UltraEdit for Windows v24.00 and UEStudio v17.00. (A UTF-8 encoded Unicode script is parsed by JavaScript in UE < v24.00 and UES < v17.00 like an ASCII/ANSI file, which means the UTF-8 coding sequences for characters with a value greater than 127 are not converted to the appropriate Unicode characters.) Also, the Edit Macro dialog of UltraEdit < v24.00 and UEStudio < v17.00 supports only ASCII/ANSI strings, and the binary macro storage format in these versions of UltraEdit/UEStudio also has problems with Unicode strings.

The solution for the rarely needed ANSI/Unicode finds/replaces in UE < v24.00 and UES < v17.00 is to use the Perl regular expression engine even for simple Unicode finds/replaces. The Perl regular expression engine supports \x[0-9a-f][0-9a-f] for specifying a character with a code value in the range 0 to 255 by a two-digit hexadecimal value, as well as \x{[0-9a-f][0-9a-f][0-9a-f][0-9a-f]} for specifying a Unicode character by a four-digit hexadecimal value. The uppercase letters A-F can be used as well.

For example the Perl regular expression [\x00-\x08\x0E-\x1F] finds a control character usually not present in a text file. The characters with code value 09 to 0D often exist in text files as these are the horizontal tab, line-feed, vertical tab (rarely used), form-feed (page break) and carriage return.

Another example is [\x{2200}-\x{22ff}] which finds mathematical operators in a UTF-16 encoded Unicode file.
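A minimal sketch of how such an expression could be used from an UltraEdit script. It relies on the scripting commands UltraEdit.perlReOn() and the findReplace properties of the document object. The helper name findWithPerlRegExp and passing the UltraEdit object in as a parameter are assumptions made for illustration and not part of UltraEdit. Note that every backslash must be doubled inside a JavaScript string literal.

```javascript
// Hypothetical helper: runs a Perl regular expression find from top of
// the active file. The object "ue" is expected to be the UltraEdit object.
function findWithPerlRegExp(ue, pattern) {
    ue.perlReOn();                         // select the Perl regular expression engine
    var doc = ue.activeDocument;
    doc.top();                             // start the search at top of the file
    doc.findReplace.mode = 0;              // search in current file
    doc.findReplace.matchCase = true;
    doc.findReplace.matchWord = false;
    doc.findReplace.regExp = true;
    doc.findReplace.searchDown = true;
    return doc.findReplace.find(pattern);  // result of the find command
}

// In a JavaScript string every backslash of the expression is doubled.
var controlChars = "[\\x00-\\x08\\x0E-\\x1F]";
var mathOperators = "[\\x{2200}-\\x{22FF}]";

// Run only inside UltraEdit/UEStudio where the UltraEdit object exists.
if (typeof UltraEdit !== "undefined" && UltraEdit.document.length > 0) {
    findWithPerlRegExp(UltraEdit, mathOperators);
}
```

Passing the UltraEdit object as a parameter keeps the helper testable outside of the editor. In a real script the global UltraEdit object could be used directly.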

UltraEdit has the command Search - Character Properties (Alt+RETURN) to get the hexadecimal code value of the character at the current position of the caret. But if a search/replace string contains multiple ANSI/Unicode characters, the manual conversion takes a lot of time.

Solution: The script UnicodeStringToPerlRegExp.js, available for download at Macros & Scripts, converts the selected string into a Perl regular expression string and copies it to the operating system clipboard. Characters in the range 128 to 255 are encoded as \x[0-9a-f][0-9a-f] and characters with a code value greater than 255 are encoded as \x{[0-9a-f][0-9a-f][0-9a-f][0-9a-f]}. ASCII characters in the range 0 to 127 are not modified, which means Perl regular expression characters are copied to the clipboard without any modification.
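The conversion rule described above can be sketched in a few lines of JavaScript. This is not the code of UnicodeStringToPerlRegExp.js. The function names toHex and toPerlRegExpString are hypothetical, and the sketch works on UTF-16 code units, so a character outside the Basic Multilingual Plane would be written as two \x{....} values.

```javascript
// Returns the code value as uppercase hexadecimal string with at least
// the given number of digits.
function toHex(code, digits) {
    var hex = code.toString(16).toUpperCase();
    while (hex.length < digits) hex = "0" + hex;
    return hex;
}

// Hypothetical conversion function following the rule from the post.
function toPerlRegExpString(text) {
    var result = "";
    for (var i = 0; i < text.length; i++) {
        var code = text.charCodeAt(i);        // UTF-16 code unit, 0 to 65535
        if (code < 128) {
            result += text.charAt(i);         // ASCII is copied unmodified
        } else if (code < 256) {
            result += "\\x" + toHex(code, 2);         // for example \xDF
        } else {
            result += "\\x{" + toHex(code, 4) + "}";  // for example \x{2200}
        }
    }
    return result;
}

// "Straße ∀x" written with escapes to keep this sketch pure ASCII.
var converted = toPerlRegExpString("Stra\u00DFe \u2200x");
// converted contains: Stra\xDFe \x{2200}x
```

The real script additionally works on the selection in the active file and puts the result on the clipboard. The sketch only shows the character conversion.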

Note: The script does not support a conversion from a character to its UTF-8 coding sequence. That would be needed only when running a Find or Replace in Files from within a script or macro on UTF-8 encoded files which are not opened.

The line and block comments can be removed from the script file by running a replace all (from top of file) searching with the Perl regular expression ^ *//.+[\r\n]+|^ */\*[\s\S]+?\*/[\r\n]+| +//.+$ and using an empty replace string. The first of the three alternatives in this OR expression matches entire lines containing only a line comment, the second matches block comments, and the third matches line comments to the right of code. Removing the comments makes the script more efficient when it is used often, because the JavaScript interpreter has to interpret fewer characters and lines.
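How the three alternatives work can be tried outside of UltraEdit with a JavaScript RegExp. This is only a sketch: it assumes that the JavaScript engine with the flags g and m behaves for this expression like the Perl regular expression engine of UltraEdit, and the sample text is made up.

```javascript
// The expression from the post as JavaScript RegExp literal. The slashes
// must be escaped in a literal, the flag m makes ^ and $ match at lines.
var commentPattern = /^ *\/\/.+[\r\n]+|^ *\/\*[\s\S]+?\*\/[\r\n]+| +\/\/.+$/gm;

// Made-up sample with all three kinds of comments.
var sample =
    "// Line comment on its own line\n" +
    "var a = 1;  // comment right of code\n" +
    "/* Block comment\n" +
    "   over two lines */\n" +
    "var b = 2;\n";

var stripped = sample.replace(commentPattern, "");
// stripped contains only the two code lines:
// var a = 1;
// var b = 2;
```

The third alternative requires at least one space before //, so a string like "http://example.com" without a space before the slashes is not touched. A string containing a space followed by // would still be damaged.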

JackTing · May 20, 2014 · #2 · 2014-05-20T07:17+00:00

Dear Mofi:

You surely can do it in this way.

But as Chinese people, we read Chinese characters and not their codes, just like Americans read "ABC" and not 0x41, 0x42, 0x43.
If you do it this way, the scripts become unreadable and hard to maintain, for Chinese people and for everyone else.

Fortunately, UltraEdit can handle such scripts natively.

All you have to do is save your script as UTF-8 without BOM (UTF-8 - NO BOM).

I have a lot of experience with this, and I can assure you that it works with both UltraEdit.perlReOn and UltraEdit.unixReOn.

The only thing users have to keep in mind is that they have to convert the data from ASCII (or DBCS) encoding to Unicode or UTF-8 encoding, and convert it back afterwards (if necessary).
The reason is: in the Big5 encoding (a DBCS), the second byte of a character may be a special ASCII character such as "\", "|", ... which has a special meaning in a Perl or Unix regular expression. Such special characters can change the meaning of the whole script.
(Not all DBCS encodings have this problem.)

But if you convert the data to UTF-8/Unicode, the problem is gone.
Also, after the conversion to UTF-8/Unicode a Chinese character is treated as ONE character and not as TWO bytes (as in DBCS), so users do not have to worry that half of a character is changed (replaced, deleted, ...) by the script.
For example: "愁" in DBCS is (0xB7, 0x54), and 0x54 is ASCII 'T', so if you replace 'T' by 'A', the character is changed. In Unicode it is not.
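The effect can be illustrated without UltraEdit. The following sketch takes the two Big5 byte values from the example above as given. The helper replaceByte is hypothetical and only simulates a replace that works on single bytes.

```javascript
// Bytes of "愁" in Big5 as given in the post: 0xB7 0x54 (0x54 is ASCII "T").
var big5Bytes = [0xB7, 0x54];

// Byte-wise replacement as it happens when the data is handled as
// single byte characters: every byte 0x54 ("T") becomes 0x41 ("A").
function replaceByte(bytes, from, to) {
    return bytes.map(function (b) { return b === from ? to : b; });
}
var damaged = replaceByte(big5Bytes, 0x54, 0x41);   // [0xB7, 0x41], no longer "愁"

// Character-wise replacement on a Unicode string: "愁" is one character
// and contains no "T", so only the real letter "T" is replaced.
var unicodeText = "愁 Test";
var replaced = unicodeText.replace(/T/g, "A");      // "愁 Aest"
```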

I have 2 desktops here right now:
1. XP Pro SP3 (Traditional Chinese version)
2. Win7 64-bit (Pro Edition) (Traditional Chinese version)

I'll double-check the result on "Windows XP SP3" (English version).
In my environment I see the Chinese characters in the script correctly, and the script runs correctly.
There will be 4 replacements in the first 3 lines.

Mofi wrote: One of my design objectives on writing a public script is: it should work on all operating systems with any version of UltraEdit if it's at all possible.

I basically agree with this concept. But for easier maintenance of the scripts, and for scripts that keep evolving, I prefer readable characters. If my scripts do not run in other language environments, I'll try some other technique to solve that (maybe an additional tool which converts the original script to the format you use).

Additionally, I set up a VM running "Windows XP SP3" English version, without any modification of the multi-language settings (not even a Unicode font installed).

I tested my script with UEStudio 06.60.1.1001 and UEStudio 14.20, without changing any setting.
Both results are identical and do not differ from what I wrote above.
(I ran the script, saved the modified data file on the VM, then examined the results on the HOST.)

So here is the procedure in summary:

1. Prepare the data in Unicode/UTF-8 encoding. If the data is originally ANSI/OEM Big5 (DBCS), convert it to Unicode with the following code:

UltraEdit.activeDocument.ASCIIToUnicode();

2. Write and save your scripts with "UTF-8 no BOM" encoding.
3. Test or run the scripts.
4. Convert the data back to ANSI/OEM Big5 if necessary with the following code:

UltraEdit.activeDocument.unicodeToASCII();
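Put together, the four steps could look like the following sketch. It assumes that the active file is an ANSI/OEM Big5 file and that the script itself is saved as UTF-8 without BOM (step 2). The helper names withUnicodeEncoding and replaceAll as well as the search and replace strings are placeholders for illustration. Only the two conversion commands and the findReplace commands come from UltraEdit.

```javascript
// Hypothetical helper: converts the file to Unicode (step 1), runs the
// given work function (step 3) and converts back on request (step 4).
function withUnicodeEncoding(doc, work, convertBack) {
    doc.ASCIIToUnicode();
    work(doc);
    if (convertBack) doc.unicodeToASCII();
}

// Hypothetical helper: non regular expression replace all in the file.
function replaceAll(doc, search, replacement) {
    doc.top();
    doc.findReplace.mode = 0;            // replace in current file
    doc.findReplace.matchCase = true;
    doc.findReplace.regExp = false;
    doc.findReplace.replaceAll = true;
    doc.findReplace.replace(search, replacement);
}

// Run only inside UltraEdit/UEStudio where the UltraEdit object exists.
if (typeof UltraEdit !== "undefined" && UltraEdit.document.length > 0) {
    withUnicodeEncoding(UltraEdit.activeDocument, function (doc) {
        replaceAll(doc, "愁", "憂");      // placeholder strings
    }, true);
}
```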

Mofi · May 23, 2014 · #3 · 2014-05-23T20:02+00:00

You are right and you are wrong.

Characters with a code value greater than 127 decimal can be used in UltraEdit scripts, if the script is stored as UTF-8 encoded file without byte order mark (BOM).

I verified that with your example files and your scripts with some additional commands with several versions of UltraEdit including v13.00, the first version with scripting support. All produced the right results on the example file encoded with UTF-8, and on a file encoded in ANSI with characters in upper half of the code page for the replace operation.

But this does not mean that UTF-8 encoded characters can be used anywhere in an UltraEdit script.

According to my tests, UTF-8 encoded characters are possible in an UltraEdit script only in comments of the script, and in search and replace strings of the commands frInFiles of the UltraEdit object and findReplace of the UltraEdit document object.

It is not possible to use UTF-8 encoded characters anywhere else in the script, neither in the write command, nor for putting Unicode characters into the active clipboard directly, nor in a regular expression of a JavaScript RegExp object, nor in an argument string of a method of the JavaScript String object, etc. The topic UltraEdit.clipboardContent not supporting Chinese characters? demonstrates such a problem.

There is one exception: the file on which the script is executed is a UTF-8 encoded file, but opened as an ASCII/ANSI file instead of as a Unicode file.

So why can UTF-8 encoded characters be used in search and replace strings of the commands findReplace and frInFiles?

The reason is most likely (I don't know the source code) that UltraEdit supports internally 2 encodings for search and replace strings: ANSI and UTF-8.

The commands findReplace and frInFiles automatically detect whether the search/replace string is an ANSI or a UTF-8 encoded string and convert the string to the encoding of the file on which the replace command is executed, or to the encoding defined by the encoding parameter.

For example, if three files are opened, the first one an ANSI file, the second one a UTF-8 encoded file (converted to UTF-16 LE on load) and the last one encoded in UTF-16 LE, and a replace all in all open files is executed by the user, a macro or a script, UltraEdit converts the search/replace strings to ANSI for the first file and to UTF-16 LE for the other two files.

The reason for the automatic support of ANSI and UTF-8 encoding for the scripting commands findReplace and frInFiles (and the corresponding macro commands) is the requirement to store search/replace strings in the histories of the commands in the INI file, which is a single-byte encoded file. Any Unicode character in a search/replace string must therefore be encoded in UTF-8 to be able to store it in the INI file. But ANSI characters (single bytes with a value greater than 127 decimal) are of course still supported on writing to and reading from the INI file.

Search/replace strings are also stored as ANSI strings in the macro file, and therefore Unicode characters are stored UTF-8 encoded in the macro file. That characters with a code value greater than 127 are stored in a macro file in UTF-8 can be seen by recording a macro which searches for example for the German character ß, stopping the recording and looking at the code of the recorded macro in the Edit/Create Macro dialog. Instead of "ß" the string "ÃŸ" is displayed, which is the two bytes of the UTF-8 encoded "ß" displayed as an ANSI string.
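The two bytes can be computed with a few lines of JavaScript. The helper utf8Bytes is a hypothetical name. It uses the standard function encodeURIComponent, which percent-encodes non-ASCII characters as their UTF-8 bytes. That the two bytes are displayed as "Ã" and "Ÿ" assumes the code page Windows-1252.

```javascript
// Returns the UTF-8 bytes of the text as array of numbers.
function utf8Bytes(text) {
    var encoded = encodeURIComponent(text);   // "ß" becomes "%C3%9F"
    var bytes = [];
    for (var i = 0; i < encoded.length; i++) {
        if (encoded.charAt(i) === "%") {
            // Two hexadecimal digits after % are one byte.
            bytes.push(parseInt(encoded.substring(i + 1, i + 3), 16));
            i += 2;
        } else {
            bytes.push(encoded.charCodeAt(i)); // unescaped ASCII character
        }
    }
    return bytes;
}

var sharpS = utf8Bytes("\u00DF");   // "ß" written as escape
// sharpS is [0xC3, 0x9F]. In code page Windows-1252 the byte 0xC3 is
// displayed as "Ã" and the byte 0x9F as "Ÿ".
```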

Conclusion:

The internal automatic detection and conversion of the encoding of the search/replace string by the UltraEdit commands Find, Replace, Find in Files and Replace in Files is the reason why your script example works.

However, as many UltraEdit scripts are just a sequence of finds/replaces, knowing that Unicode characters can be used for search/replace strings on the commands findReplace and frInFiles is definitely useful for many users of UltraEdit.

Therefore many thanks for your very useful contribution on this topic and your tests.

One more note:

I wrote my script not only for UltraEdit scripts. It is also useful when a Perl regular expression is needed in any language supporting Perl regular expressions, like JavaScript for webpages, Ruby, C++, C#, etc., where the source file must be encoded in ASCII but Unicode characters must be searched/replaced. See for example Non escaped non ASCII character in non ASCII-8BIT script.

PS: I have added a note about usage of Unicode characters in UE/UES scripts to post How to create a script from a post in the forum?

JackTing · May 30, 2014 · #4 · 2014-05-30T08:57+00:00

Thanks for the update, Mofi.
I really appreciate your efforts on the forum, which save us a lot of trial and error.

Mofi · Nov 14, 2018 · #5 · 2018-11-14T16:48+00:00

Some important things have changed since the last post on this topic. The most important one is that UltraEdit and UEStudio became fully Unicode-aware applications with UltraEdit for Windows v24.00 and UEStudio v17.00. Therefore UltraEdit for Windows ≥ v24.00 and UEStudio ≥ v17.00 support Unicode characters in macros and also in scripts, which can now be encoded in UTF-8 or UTF-16 LE and support Unicode characters everywhere.

For that reason I revised the script UnicodeStringToPerlRegExp on 2018-11-14, although it is not really needed anymore by users of UltraEdit ≥ v24.00 or UEStudio ≥ v17.00.

The improvements are:

Enhanced the script code to optionally also replace horizontal tabs, line-feeds, form-feeds and carriage returns by their corresponding escape sequences after changing the value of the variable bEscapeWhitespaces in the script from false to true.
Fixed a bug which let the script continue processing after showing the message that the script cannot be used with UltraEdit for Windows ≤ v14.20 or UEStudio ≤ 9.00, and optimized the code for this UE/UES version verification and incompatibility reporting.
Extended and improved the comments in script file.
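What the optional whitespace escaping could look like is sketched below. This is not the code of the revised script. Only the variable name bEscapeWhitespaces is taken from the post. The function name escapeWhitespaces and the choice of \t, \n, \f and \r as the escape sequences are assumptions.

```javascript
var bEscapeWhitespaces = true;   // variable name taken from the post

// Hypothetical function: replaces the four whitespace characters by
// a backslash followed by a letter.
function escapeWhitespaces(text) {
    return text.replace(/\t/g, "\\t")    // horizontal tab
               .replace(/\n/g, "\\n")    // line-feed
               .replace(/\f/g, "\\f")    // form-feed
               .replace(/\r/g, "\\r");   // carriage return
}

var input = "a\tb\r\n";
var output = bEscapeWhitespaces ? escapeWhitespaces(input) : input;
// output contains: a\tb\r\n (with the backslashes as real characters)
```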