How to search for trademark in files with various encoding (Windows-1252 and UTF-8)?

wells2207 · May 09, 2019 · #1 · 2019-05-09T03:15+00:00

I am doing some searching and also some global replaces for items with and without a trademark symbol. However, I am getting inconsistent behavior.

When I search with UltraEdit regular expression for items without a trademark using PRODUCT[~™] as search string using Find in Files, sometimes it still finds items with a ™. Any idea why?
Some of the items it finds show a ™ and others show â„¢. If I open the topics in Notepad or the XML Editor in MadCap Flare, they show as having ™. The same is true if I open the file directly in UltraEdit. The only place they show â„¢ is in Find in Files results. Any idea why?

Thanks!

Mofi · May 09, 2019 · #2 · 2019-05-09T06:07+00:00

Do you know anything about character encoding, i.e. with which bytes a character is stored in a text file?

The trade mark sign is encoded in a UTF-16 Little Endian encoded file with the two bytes 22 21. The same character is encoded in a UTF-16 Big Endian encoded file with the bytes 21 22 (reverse byte order in comparison to Little Endian). And in a UTF-8 encoded file the character is encoded with the three bytes E2 84 A2 which are displayed as â„¢ on interpreting and displaying a UTF-8 encoded file as "ANSI" file using code page Windows-1252. A text file using an encoding with just one byte per character like Windows-1252 contains the trade mark sign with the byte 99. The byte values posted here are the hexadecimal values of the bytes.
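The byte values listed above can be checked with a few lines of Python. This is only an illustrative sketch, and the variable names are made up for this example; it prints the bytes of U+2122 in each encoding and shows how the UTF-8 bytes turn into the garbled form when read as Windows-1252.

```python
# Byte sequences for the trade mark sign U+2122 in the encodings discussed above.
TM = "\u2122"

encodings = ["utf-16-le", "utf-16-be", "utf-8", "cp1252"]
tm_bytes = {name: TM.encode(name) for name in encodings}

for name, raw in tm_bytes.items():
    # Hexadecimal byte values, separated by spaces.
    print(f"{name:10} {raw.hex(' ').upper()}")

# Reading the three UTF-8 bytes as Windows-1252 gives three separate characters.
mojibake = TM.encode("utf-8").decode("cp1252")
print(mojibake)
```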

Microsoft calls all character encodings that use just one byte per character and a code page (a table mapping the 256 byte values to characters) in GUI applications "ANSI" encoding. This is not really correct because not all of these character encodings were standardized by the American National Standards Institute (for the U.S.) or the International Organization for Standardization (ISO, which defines international standards). Windows-1252 is a code page that was never defined as a real standard, but it is nevertheless supported by all applications capable of interpreting text data.

So the result depends on how you search for the trade mark sign, or rather for a character that is NOT a trade mark sign after a sequence of other characters.

A Find/Replace in the current file should always work, because UltraEdit has usually detected the character encoding automatically on opening the file and therefore knows which bytes represent the character ™ in the current file.

On using Find/Replace in Files it gets more complicated because the searched text files can be encoded differently. "ANSI" encoded files and UTF-8 encoded files with no byte order mark (BOM) are very hard for an application to distinguish, see How does automatic UTF-8 encoding detection work in UltraEdit and UEStudio? On running Find/Replace in Files, UltraEdit does not analyze the entire file to detect automatically whether a file that is not encoded as UTF-16 LE, UTF-16 BE or UTF-8 with BOM is an "ANSI" file or a UTF-8 file without BOM. That would dramatically slow down Find/Replace in Files. So it is recommended that a user running a Find/Replace in Files to search for non-ASCII characters enables the advanced find/replace in files option Use encoding and selects the right encoding, like 65001 (UTF-8) for running the Find/Replace in Files on UTF-8 encoded files.
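To illustrate why this detection is costly, here is a minimal Python sketch of one common heuristic. It is not UltraEdit's actual algorithm; the function name guess_encoding and the fallback to Windows-1252 are assumptions made for this example. A BOM can be checked on the first few bytes, but a file without BOM has to be decoded completely to see whether it is valid UTF-8.

```python
import codecs


def guess_encoding(data: bytes) -> str:
    """Guess a text encoding from raw bytes (hypothetical helper)."""
    # Cheap check: a byte order mark identifies the encoding immediately.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # Expensive check: without a BOM the whole content must be validated.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is treated as Windows-1252.
        return "cp1252"
    # Pure ASCII content also ends up here, as ASCII is valid UTF-8.
    return "utf-8"
```

A file containing only ASCII characters is reported as UTF-8 by this sketch, although it could just as well be an "ANSI" file. That ambiguity is the reason why selecting the encoding explicitly is the safer choice.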

You have unfortunately not posted which version of UltraEdit you use on which operating system, how the files on which the find/replace is executed are encoded, and which find/replace you execute. So I can't help with detailed instructions on how to run the finds/replaces for a better result.

In case some text files are Windows-1252 encoded and others are UTF-8 encoded, as it looks like, I recommend first running a Find in Files or Replace in Files with the option Use encoding checked and 65001 (UTF-8) selected, using the search string PRODUCT[~™]. In this case UltraEdit definitely knows that it has to search for the bytes 50 52 4F 44 55 43 54 (on a case-sensitive search) with the next three bytes NOT being E2 84 A2.

Next run Find in Files or Replace in Files a second time with the option Use encoding checked and this time 1252 (ANSI - Latin I) selected, as this makes it clear to UltraEdit that it has to search for the bytes 50 52 4F 44 55 43 54 (on a case-sensitive search) with the next byte NOT being 99.

Note that the second search, for Windows-1252 encoded PRODUCT[~™], is really problematic when it runs on UTF-8 encoded files: it also reports a match where ™ is present in a UTF-8 encoded file, because the byte after PRODUCT is then E2 and not 99.
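The false positive can be reproduced outside of UltraEdit with a byte-level search. The following Python sketch uses Python's re module on raw bytes, not UltraEdit's search engine, and the sample line is invented for this example.

```python
import re

# The same text stored once as UTF-8 and once as Windows-1252.
utf8_line = "PRODUCT\u2122 is new".encode("utf-8")
cp1252_line = "PRODUCT\u2122 is new".encode("cp1252")

# Windows-1252 assumption: PRODUCT followed by one byte that is not 0x99.
ansi_search = re.compile(rb"PRODUCT[^\x99]")

# In the UTF-8 data the byte after PRODUCT is 0xE2, so the search matches
# although a trade mark sign follows.
hit_in_utf8 = ansi_search.search(utf8_line)

# In the Windows-1252 data the byte after PRODUCT is 0x99, so nothing is found.
hit_in_cp1252 = ansi_search.search(cp1252_line)

print(hit_in_utf8, hit_in_cp1252)
```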

It might also work to use Use encoding with Auto-detect selected, depending on the version of UltraEdit used.

Best is perhaps a Find in Files or Replace in Files with the Perl regular expression search string PRODUCT(?!\xE2\x84\xA2|\x99) without using Use encoding. It produces a positive match on every PRODUCT that is followed neither by the three bytes E2 84 A2 nor by the byte 99.
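The intended behavior of this negative lookahead can be sketched in Python on raw bytes. This only shows what the expression is meant to do on the byte level; UltraEdit's Perl regular expression engine may interpret the hexadecimal escapes differently depending on version and encoding settings. The sample texts are invented for this example.

```python
import re

# PRODUCT not followed by the UTF-8 bytes E2 84 A2 and not followed by byte 99.
no_tm = re.compile(rb"PRODUCT(?!\xE2\x84\xA2|\x99)")

with_tm = "PRODUCT\u2122 docs"
without_tm = "<p>PRODUCT is an improved product.</p>"

samples = {
    "utf-8 with TM": with_tm.encode("utf-8"),
    "cp1252 with TM": with_tm.encode("cp1252"),
    "utf-8 without TM": without_tm.encode("utf-8"),
    "cp1252 without TM": without_tm.encode("cp1252"),
}

# True means: a PRODUCT without trade mark sign was found in this sample.
results = {label: bool(no_tm.search(raw)) for label, raw in samples.items()}

for label, found in results.items():
    print(f"{label:18} {found}")
```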

I have added a ZIP file with files that all contain the same two lines of text, but no file is binary equal to any other file. You can use these sample files to test the Find/Replace in Files you want to execute on your files. They should also help to understand what character encoding means.

wells2207 · May 09, 2019 · #3 · 2019-05-09T14:36+00:00

This is a boat-load of great information, and I really appreciate you taking the time to provide it! I have read over it several times, and I must admit I feel like I only have a vague understanding of it all.

For the record, I have UltraEdit 25.00.0.82 running on Windows 10.

When I use PRODUCT(?!\xE2\x84\xA2|\x99) with no encoding and with Perl, I find a bunch of references with the TM after it.
When I use PRODUCT[~(?!\xE2\x84\xA2|\x99)] with no encoding and with Perl, I only find references with the â„¢ version of the TM after it.

This does give some information, but I am not certain of the difference between what shows as a regular trademark symbol and the â„¢ version.
When I try to replace the â„¢ version with the regular, I get the following symbol, which seems to be a problem for my application: �

Any additional help would be appreciated!

fleggy · May 09, 2019 · #4 · 2019-05-09T15:30+00:00

Hi,

please, paste here some real text sample with lines which seem problematic to you and information about used code page (you can see it on the status line).
BTW If you want to find PRODUCT[~™] using Perl pattern provided by Mofi then you must adapt it:

PRODUCT\[~(?=\xE2\x84\xA2|\x99)[^\]]*+\]

Thanks, Fleggy

wells2207 · May 09, 2019 · #5 · 2019-05-09T16:41+00:00

Thanks! I will try to provide what is needed, but I am unsure if this is what you are looking for:

I need to be able to find:

PRODUCT (this is "PRODUCT" with a space after it or with any other character after it other than a trademark symbol)

When I tried the adapted pattern above, nothing was found. If I did a simple search for "PRODUCT " (with a space after), several examples were found, so something about the pattern above is not working for me.

The following is a sample of what was NOT found using the pattern above:

<p>PRODUCT is an improved product.</p>

I don't know what you mean by "information about used code page (you can see it on the status line)."

Thanks for your help!

fleggy · May 10, 2019 · #6 · 2019-05-10T04:33+00:00

Hi,

I didn't read your first post carefully, sorry.
Use this simple Perl pattern (paste it inside the field Find what:) containing ™ as a literal and not as a hexadecimal sequence:
PRODUCT(?!™)

I successfully tested it in ANSI and UTF-8 files using Notepad++ instead of UE because I have no access to UE at this moment.
I don't know if it is usable in Find in Files.

BR, Fleggy

Mofi · May 10, 2019 · #7 · 2019-05-10T05:44+00:00

The status bar is the bar at bottom of main application window of UltraEdit showing information like help text for a command or status information, an indication on active recording of a macro, line number, column number, clipboard number, line termination type (DOS/Unix/MAC), the encoding of active file (65001 (UTF-8) or 1252 (ANSI - Latin I)), active syntax highlighting, last modification date of file, file size or selected characters, read/write status of file, insert/overwrite mode, column/normal editing mode and status of CapsLock.

There are two variants of the status bar available. The standard status bar in UE v25.00.0.82 has drop-down items for encoding and syntax highlighting to change both for the active file. The basic status bar (a configuration setting) does not have these drop-down items and also does not show which encoding the active file has. The basic status bar should show U-DOS for a UTF-16 LE encoded file with DOS line terminators, UBE-Unix for a UTF-16 BE encoded file with Unix line terminators, U8-MAC for a UTF-8 encoded file with MAC line terminators, and just the line terminator type for a file that is not Unicode encoded. But a bug introduced with UE v25.00 results in the Unicode encoding information not being shown in the basic status bar. This bug was partly fixed with UE v26.10.0.14, which indicates UTF-16 LE and UTF-16 BE in the basic status bar again as versions prior to v25.00 did, but the UTF-8 encoding indication is still missing when the basic status bar is used.

I remembered that UE v25.00 has some bugs which older versions like v22.20, used by me on Windows XP, and newer versions like v26.10 on Windows 7 do not have. So I restored UE v25.00.0.82 from my archives and looked at the example files provided by me to find a Perl regular expression search string that matches just those PRODUCT which do not have a trade mark sign as next character, without also matching this character. The working Perl regular expression search string is PRODUCT(?!™|\xE2\x84\xA2|\x99|\xE2\x{201E}\xA2), whereby Use encoding should not be checked. For some unknown reason the character „, which has code value 84 in the Windows-1252 code page, is interpreted by UE v25.00.0.82 with its Unicode code value 201E even on running a Find in Files. That is very strange.

So we have now a Perl regular expression string to find in all Windows-1252 and UTF-8 encoded files all occurrences of PRODUCT on which next character is not a ™ encoded either with just one byte with hexadecimal value 99 (Windows-1252) or with three bytes with hexadecimal values E2 84 A2 (UTF-8). But it is not possible to use this search expression with a Replace in Files using as replace string PRODUCT™ to insert the missing trade mark sign after PRODUCT. It is not possible because some files are Windows-1252 encoded and others are UTF-8 encoded and so it depends on the encoding of each file if ™ is represented with just one byte with value 99 or with three bytes with the values E2 84 A2. So it is required that UltraEdit really knows the character encoding of each file before it can insert the trade mark sign.

For that reason I suggest running a Perl regular expression Find in Files with the search string PRODUCT(?!™|\xE2\x84\xA2|\x99|\xE2\x{201E}\xA2) and with the option Open matching files checked. Next press Ctrl+R to run a Replace with the same Perl regular expression search string and PRODUCT™ as replace string, with All open files selected. This inserts ™ after PRODUCT where it is missing in the opened files, with the correct encoding of this character for each opened file, because UltraEdit knows the character encoding of each opened file. Finally use Save all with key Alt+F12 to save all files. This procedure worked with UE v25.00.0.82 on the set of example files provided by me.
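The same idea, determining the encoding of each file first and only then inserting the character, can be sketched in Python. This is not what UltraEdit does internally. The function name add_trademark is hypothetical, and the sketch assumes that every file is either valid UTF-8 without BOM or Windows-1252 encoded.

```python
import re

TM = "\u2122"

# Works on decoded text, so ™ is a single character in both encodings.
missing_tm = re.compile(rf"PRODUCT(?!{TM})")


def add_trademark(raw: bytes) -> bytes:
    """Insert ™ after PRODUCT where missing, keeping the file's encoding."""
    try:
        encoding = "utf-8"
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # Assumption: not valid UTF-8 means Windows-1252. Python's cp1252
        # codec rejects the few byte values this code page leaves undefined.
        encoding = "cp1252"
        text = raw.decode(encoding)
    # Encode with the detected encoding: E2 84 A2 for UTF-8, 99 for cp1252.
    return missing_tm.sub(f"PRODUCT{TM}", text).encode(encoding)
```

Usage could look like `path.write_bytes(add_trademark(path.read_bytes()))` for each `pathlib.Path` to process, ideally on a copy of the files first. A file with only ASCII characters is treated as UTF-8 by this sketch, so the inserted ™ is written with three bytes even if the file was meant to be Windows-1252 encoded.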