Regex find/replace messing up diacritic characters with UltraEdit for MAC

Newbie

    19:26 - Jan 23#1

    I am working with a large text file, and need to insert a line break after every character in the text. Unlike other text tools I have tried, UltraEdit is able to handle the large amount of text really quickly (pretty much everything else I have tried times out).

    I can use (Perl) regex to find each character using either

    (.)
    or

    (\X)
    and replace with

    $1\n
    This inserts the line breaks, but UltraEdit loses every character with a diacritic in the original string, replacing it with replacement characters (�). For example, if my input string is

    Union à Dieuchez Denys l'Aréopagite
    the output of the find/replace operation is 

    U
    n
    i
    o
    n
     
    �
    �
     
    D
    i
    e
    u
    c
    h
    e
    z
     
    D
    e
    n
    y
    s
     
    l
    '
    A
    r
    �
    �
    o
    p
    a
    g
    i
    t
    e
    
    The accented characters à and é are each being replaced by a sequence of two U+FFFD REPLACEMENT CHARACTER code points.
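
One plausible reading of this symptom (an assumption, not something confirmed in this thread) is that the replace runs over the raw UTF-8 bytes rather than over decoded characters. In UTF-8, à and é each occupy two bytes, so a line break inserted after every byte leaves two orphaned bytes, and each one is displayed as U+FFFD. The Python sketch below only reproduces that pattern; it says nothing about how UltraEdit works internally.

```python
import re

sample = "Union à Dieu"

# Assumed failure mode: the pattern is applied to UTF-8 bytes,
# so "." matches a single byte instead of a whole character.
raw = sample.encode("utf-8")
bytewise = re.sub(rb"(.)", rb"\1\n", raw, flags=re.DOTALL)

# The two bytes of "à" (0xC3, 0xA0) are now separated by a line feed,
# so each one is invalid on its own and decodes to U+FFFD.
broken = bytewise.decode("utf-8", errors="replace")

# Intended behaviour: the pattern is applied to decoded characters.
charwise = re.sub(r"(.)", r"\1\n", sample)

print(repr(broken))
print(repr(charwise))
```

If the byte-wise guess is right, the number of U+FFFD characters per accented letter should equal the letter's UTF-8 byte length, which matches the two per letter seen above.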

    Is there a way to prevent this? I tested short strings like this in TextMate and Sublime Text, and they didn’t mess up the diacritics like this, but they can’t handle my large text file.
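
If the editor cannot be made to cooperate, the same transformation can be done outside it. The following is a minimal Python sketch, not an UltraEdit feature: the function name `split_chars` and the chunk size are made up for illustration, it assumes the file is UTF-8, and it works per code point, so a base letter followed by a separate combining mark would be split (unlike `\X`). Existing line feeds are copied unchanged, on the assumption that `.` is not meant to match them.

```python
import io
from typing import TextIO


def split_chars(src: TextIO, dst: TextIO, chunk_size: int = 1 << 20) -> None:
    """Copy src to dst, writing a line feed after every character except line feeds."""
    while True:
        # Text-mode read() returns whole decoded characters, never partial bytes.
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write("".join(ch if ch == "\n" else ch + "\n" for ch in chunk))


# Small in-memory demonstration. For a real file, pass file objects opened
# with open(path, encoding="utf-8", newline="") for reading and
# open(path, "w", encoding="utf-8", newline="") for writing.
demo_out = io.StringIO()
split_chars(io.StringIO("l'Aréopagite"), demo_out)
print(repr(demo_out.getvalue()))
```

Reading in fixed-size chunks keeps memory use bounded regardless of the file size.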

      Master

      21:24 - Jan 23#2

      Hi,

      I created a new UTF-8 file (Linux EOL) and pasted your sample. Then this Perl regexp Replace All worked fine:

      F: .\K
      R: \n

      Tested in UE 2023.2.0.22 64-bit for Windows

      BR, Fleggy

      Newbie

        23:50 - Jan 23#3

        Thanks. This is what I get trying that in UE 2022.0.0.19 on Mac OS (have not been prompted with an update notification for a while; will check to see what stable version is current for Mac).

        U
        ni
        on
         �
        � 
        Di
        eu
        ch
        ez
         D
        en
        ys
         l
        'A
        r�
        �o
        pa
        gi
        te
        

          Grand Master

          6:29 - Jan 24#4

          I tried the Perl regular expression replace with the search expressions (.) and (\X) and the replace expressions \1\n and $1\n in UltraEdit for Windows v2023.2.0.27, and the result was always correct. I ran the replaces on three files: an ANSI-encoded file (the status bar shows Unix 1252 (ANSI - Latin I), or 1252-DOS with the basic status bar), a UTF-8 file without BOM (Unix UTF-8, or U8-Unix with the basic status bar), and a UTF-16 file with BOM (Unix UTF-16 BOM, or UB-Unix with the basic status bar).

          It is possible that UltraEdit for Mac uses a different Perl regular expression library than UltraEdit for Windows, which uses the Boost Perl regular expression library. (The Boost library version depends on the version of UltraEdit.) In that case I suggest temporarily converting the file from UTF-8 to UTF-16, running the replace, and converting the file back to UTF-8 without BOM. This could be a workaround for the issue. If the issue is reproducible with a UTF-8 encoded file, it should be reported to UltraEdit support by email. I cannot verify that myself because I don't have a Mac and therefore don't use UltraEdit for Mac.
          Best regards from a UC/UE/UES for Windows user from Austria
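
Why the UTF-16 detour might help can be sketched in Python. This assumes, without confirmation from the thread, that the Mac build matches one encoded unit at a time. In UTF-8 the letter é takes two units (bytes), while in UTF-16 it takes a single 16-bit unit, so a break inserted after every unit no longer cuts through it. Characters outside the Basic Multilingual Plane use two UTF-16 units (a surrogate pair) and would not be protected by this.

```python
text = "Aréopagite"

# UTF-8: one unit per byte; "é" needs two of them.
utf8_units = list(text.encode("utf-8"))

# UTF-16 (little endian, no BOM): one unit per two bytes; "é" needs one.
utf16 = text.encode("utf-16-le")
utf16_units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]

# Insert an encoded line feed after every UTF-16 unit, then decode again.
line_feed = "\n".encode("utf-16-le")
split_utf16 = b"".join(unit + line_feed for unit in utf16_units)
result = split_utf16.decode("utf-16-le")

print(len(text), len(utf8_units), len(utf16_units))
print(repr(result))
```

The unit counts make the difference visible: the UTF-8 form has one more unit than the string has characters, while the UTF-16 form has exactly one unit per character for this sample.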

            Newbie

            17:22 - Jan 24#5

            Thanks for the investigation. I will report this to UE.