This sticky topic is created for building a collection of often needed finds or replaces in

  • current file,
  • all open files,
  • or all files in a directory or even folder tree.
Every post with a general find or replace which might be useful for others is welcome on this topic.

Please give each post an individual subject explaining briefly what kind of task can be done with the posted find or replace.

Append DOS line break/terminator to end of file if last line has no DOS line termination

With the Perl regular expression engine:

Find What: (.)$(?!\r\n)
Replace With: \1\r\n

With UltraEdit and Unix regular expression engines there is no expression to append a line termination to end of file.
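
For use outside the editor, the same idea can be sketched in plain JavaScript. This is only an illustration under assumptions: the function name appendDosTerminator is made up, and the JavaScript regular expression engine is not the Perl engine built into UltraEdit, so the end of the text is tested with a plain $ (without multiline flag) instead of $(?!\r\n).

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
// Appends CR+LF if the text does not end with a line terminator character.
function appendDosTerminator(text) {
  // Without the "m" flag, "$" matches in JavaScript only at the very end
  // of the input, so ([^\r\n])$ finds the last character of a last line
  // which has no line termination.
  return text.replace(/([^\r\n])$/, "$1\r\n");
}
```

An empty text and a text already ending with a line terminator are returned unchanged, just as the Perl expression above requires at least one character at the end of the last line.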



Append UNIX line break/terminator to end of file if last line has no UNIX line termination

With the Perl regular expression engine:

Find What: (.)$(?!\n)
Replace With: \1\n

This replace should never be executed on files with DOS line terminations (carriage return + linefeed). Use it only for files with UNIX line terminators (only linefeed) which are not temporarily converted to DOS on file load.

Trim trailing spaces and tabs from all lines

There is the command Format - Trim Trailing Spaces which can be used also in macros and scripts to remove all trailing spaces and tabs from all lines of active file.

But if trailing spaces and tabs must be trimmed in all files of a directory or a directory tree, it is better to use a regular expression Replace in Files for this task.

With the Perl or Unix regular expression engine:

Find What: [ \t]+$
Replace With: (leave empty)

With the UltraEdit regular expression engine:

Find What: [ ^t]+$
Replace With: (leave empty)

Please note that trailing spaces/tabs at end of last line of a file are not removed when using UltraEdit or Unix regular expression engine if the last line has no line termination. The Perl regular expression works also on last line of a file without line termination.
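
The Perl/Unix expression can also be tried outside the editor. The following is a minimal sketch in plain JavaScript with the made-up function name trimTrailingBlanks. It assumes the JavaScript engine, where $ in multiline mode matches before CR and LF and also at the end of a last line without line termination.

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
// Removes spaces and tabs at the end of every line.
function trimTrailingBlanks(text) {
  // "g" replaces all occurrences, "m" lets "$" match at the end of each line.
  return text.replace(/[ \t]+$/gm, "");
}
```

Spaces and tabs inside a line are not touched because the expression requires that the run of spaces/tabs ends at the end of a line.
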

Convert all files in a folder (tree) from DOS to UNIX

There is the command File - Conversions - DOS to UNIX which can be used also in macros and scripts. It is strongly recommended to use this command on files opened already in UltraEdit or UEStudio. The command is also available from context menu of a file tab.

But if all DOS line terminations must be converted to UNIX in all files of a directory or a directory tree, it is better to use a Replace in Files for this task.

With the Perl or Unix regular expression engine:

Find What: \r\n
Replace With: \n

With the UltraEdit regular expression engine or without using the regular expression option:

Find What: ^p
Replace With: ^n



Convert all files in a folder (tree) from UNIX to DOS

There is the command File - Conversions - UNIX/MAC to DOS which can be used also in macros and scripts. It is strongly recommended to use this command on files opened already in UltraEdit or UEStudio. The command is also available from context menu of a file tab.

But if all UNIX line terminators must be converted to DOS in all files of a folder or a folder structure, it is better to use Replace in Files for this task.

Simply use following without using the regular expression option:

Find What: ^n
Replace With: ^p

But this can easily result in a carriage return + carriage return + linefeed sequence if some of the files already contain DOS line terminations. Therefore a second Replace in Files should be executed to make sure that no such line terminator sequence remains in any file, or to correct it where it exists:

Find What: ^r^p
Replace With: ^p

This normal, non regular expression replace can be used also to fix CR CR LF sequences in files

  • created by applications opening a file in text mode and writing \r\n to the file instead of only \n as required in this case;
  • downloaded in ASCII mode from a webserver on which the files are stored already with DOS line terminations.
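
The effect of the two replaces can be checked with a small sketch in plain JavaScript. The function name unixToDosTwoPass is made up for this illustration; split/join is used as a plain, non regular expression replace of all occurrences.

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
function unixToDosTwoPass(text) {
  // 1st replace (^n -> ^p): every LF becomes CR+LF. Lines which had
  // already a DOS line termination end now with CR CR LF.
  const afterFirstPass = text.split("\n").join("\r\n");
  // 2nd replace (^r^p -> ^p): CR CR LF is corrected to CR LF.
  return afterFirstPass.split("\r\r\n").join("\r\n");
}
```

For a file with mixed line terminators like "a\r\nb\nc\n" the first pass produces "a\r\r\nb\r\nc\r\n" and the second pass corrects it to "a\r\nb\r\nc\r\n".
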

Remove / delete blank and empty lines

This article is an add-on for the IDM power tip Remove blank lines.

Before starting with the regular expression(s) to use for deleting blank lines, let us define what a blank line is in a plain text file.

Line and Line Terminator

A line is a sequence of zero or more characters ending with a line termination defined by one or two newline characters.

There are mainly 3 types of line terminators, also called end-of-line (EOL) markers:

  1. DOS/Windows
    On DOS and Windows operating systems the line terminator is defined by a carriage return (CR) plus line feed (LF) pair.
    \r\n matches a DOS/Windows line ending in Unix or Perl regular expressions and is used also in many programming languages within strings.
    In Unix regular expressions of UE/UES also \p can be used to reference CRLF.
    Either ^r^n or shorter ^p can be used in non regular expression or UltraEdit regular expression finds/replaces for CRLF.
    Note: ^p is internally replaced by ^r^n which must be taken into account when putting ^p in square brackets or appending a multiplier in an UltraEdit regular expression search string. For example ^p+ is interpreted as ^r^n+ which finds a carriage return and 1 or more line feeds and not 1 or more CR+LF pairs.
    ^p matches in find/replace strings in Microsoft Word the end of a paragraph.
  2. Unix
    On Unix operating systems (OS) as well as on Mac since OS X (Cheetah) the line terminator is just the line feed character (LF).
    \n matches a Unix line ending in Unix or Perl regular expressions and is used also in many programming languages within strings.
    ^n can be used in non regular expression or UltraEdit regular expression finds/replaces for LF.
  3. Mac
    On Mac OS prior to the Unix based OS X (Cheetah) the line terminator is just the carriage return character (CR).
    \r matches a Mac line ending in Unix or Perl regular expressions and is used also in many programming languages within strings.
    ^r can be used in non regular expression or UltraEdit regular expression finds/replaces for CR.
UltraEdit and UEStudio detect all three types of line terminators and both show the line terminator type of the active file in the status bar.
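
How UltraEdit detects the type internally is not described here. As a rough sketch only, a detection could simply look at the first line terminator found in the text. The function name detectLineTerminator and the returned strings are assumptions made for this example in plain JavaScript.

```javascript
// Hypothetical helper for illustration only. It does not describe the
// detection really used by UltraEdit or UEStudio.
function detectLineTerminator(text) {
  // CR+LF must be tested first, otherwise a DOS line ending would be
  // reported as Mac because of its leading carriage return.
  const first = /\r\n|\n|\r/.exec(text);
  if (first === null) return "none";
  if (first[0] === "\r\n") return "DOS";
  return first[0] === "\n" ? "UNIX" : "MAC";
}
```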

There are several configuration options to determine how to handle files with Unix or Mac type of line terminators at Advanced - Configuration - File Handling - DOS/Unix/Mac Handling in Windows UltraEdit and in UEStudio. The favorite settings of the author are:

  • DOS as default file type for new files.
  • Automatically convert to DOS format for files with Unix or Mac line terminators.
  • Only recognize DOS terminated lines (CR/LF) as new lines for editing not checked.
    This setting is mainly for files with mixed usage of line terminators as for CSV files with DOS/Windows line terminators at end of every data row and just line feed for a line break within a data field. But this setting is only effective if Never prompt to convert files to DOS format is selected above in the configuration dialog.
  • Save file as input format (Unix/Mac/DOS) enabled to save a Unix/Mac file automatically converted to DOS on opening as Unix/Mac file again. So the conversion to DOS is only temporary and not permanent.
On Windows it is in general advisable to convert all small files with Unix/Mac line terminators to DOS in case of copying and pasting some lines of the file to other Windows applications. But for viewing or editing very large files it is better to select Never prompt to convert files to DOS format as the temporary conversion to DOS makes sense only for small files opened with the usage of a temporary file.

An empty line is a line with no characters at all, which means in the character stream that a line terminator is immediately followed by the next line terminator.



Whitespaces

But very often a text file contains lines with no visible characters, which are nevertheless not empty lines as there are one or more whitespaces (or white-spaces or white spaces) present in the lines.

There are 2 types of whitespaces: the horizontal and the vertical whitespaces.

  • Horizontal whitespaces
    This group of whitespace characters is matched by \h in Perl regular expressions. Not all applications supporting regular expressions in Perl syntax support also \h as placeholder for any horizontal whitespace. The Perl regular expression engine inside UltraEdit for Windows supports \h since v17.10.0.1010. Version 11.00.0.1011 was the first version of UEStudio supporting \h in Perl regular expressions.

    1. SPACE
      The common space is the most often used whitespace character. It is entered by pressing the space bar on the keyboard. It has the hexadecimal value 20 which is decimal 32.
    2. TAB
      The tab character is also very often used in text files. It is entered by pressing the TAB key on the keyboard, except when UE/UES is configured to insert 1 or more spaces instead of a tab for the active file based on the file extension of the file. The tab character is more precisely the horizontal tab character (HT), also known by the name character tabulation, and has the hexadecimal value 09 which is decimal 9.
      In many programming languages \t represents a horizontal tab character in strings. \t represents also in Unix and Perl regular expression strings the horizontal tab while ^t must be used for the horizontal tab in non regular expression or UltraEdit regular expression find/replace strings.
      In Microsoft Office applications ^t can be used also in find/replace strings for a tab.
    3. NO-BREAK SPACE
      This space has the same width as the common space, but prevents wrapping text. Two words with a non breaking space between them are always kept together on the same line. This special space can be entered in UltraEdit for example via the ASCII table view/dialog as it has the hexadecimal value A0 which is decimal 160.
      This space is often coded in HTML by the entity &nbsp;
      The no-break space can be quickly entered with Ctrl+Shift+Space in Microsoft Word.
      It exists quite often in plain text files, but rarely in source code files of any programming or scripting language.
    4. Unicode spaces
      The Unicode table defines 16 additional horizontal whitespace characters which are with their hexadecimal values:

      Code: Select all
      0x1680 ... OGHAM SPACE MARK
      0x180e ... MONGOLIAN VOWEL SEPARATOR
      0x2000 ... EN QUAD
      0x2001 ... EM QUAD
      0x2002 ... EN SPACE
      0x2003 ... EM SPACE
      0x2004 ... THREE-PER-EM SPACE
      0x2005 ... FOUR-PER-EM SPACE
      0x2006 ... SIX-PER-EM SPACE
      0x2007 ... FIGURE SPACE
      0x2008 ... PUNCTUATION SPACE
      0x2009 ... THIN SPACE
      0x200a ... HAIR SPACE
      0x202f ... NARROW NO-BREAK SPACE
      0x205f ... MEDIUM MATHEMATICAL SPACE
      0x3000 ... IDEOGRAPHIC SPACE

      Most people don't know about those characters and therefore they are usually not present in Unicode encoded text files.
    In general it often makes more sense to use the expression [\t ] instead of \h in Perl regular expression search strings. Horizontal whitespaces other than common space and horizontal tab are very rare in text files, and with [\t ] every character in the file must be compared with only two characters instead of 19 characters when using \h.

  • Vertical whitespaces
    This group of whitespace characters is matched by \v in Perl regular expressions. Not all applications supporting regular expressions in Perl syntax support also \v as placeholder for any vertical whitespace. The Perl regular expression engine inside UltraEdit for Windows supports \v since v17.10.0.1010. Version 11.00.0.1011 was the first version of UEStudio supporting \v in Perl regular expressions.

    1. LINE FEED
      The line feed (LF) (or linefeed or line-feed) is the most often used vertical whitespace character as it is the line terminator in Unix text files and also part of the line termination in DOS/Windows text files. It has the hexadecimal value 0A which is decimal 10.
      In many programming languages \n represents a line feed in strings. \n represents also in Unix and Perl regular expression strings the line feed.
      ^n must be used for the line feed in non regular expression or UltraEdit regular expression find/replace strings.
      In Microsoft Word ^n can be used also in find/replace strings for column breaks.
    2. CARRIAGE RETURN
      The carriage return (CR) is also an often used vertical whitespace character as it is the line terminator in Mac text files and also part of the line termination in DOS/Windows text files. It has the hexadecimal value 0D which is decimal 13.
      In many programming languages \r represents a carriage return in strings. \r represents also in Unix and Perl regular expression strings the carriage return.
      ^r must be used for the carriage return in non regular expression or UltraEdit regular expression find/replace strings.
    3. FORM FEED
      The form feed (FF) is sometimes present in text files as it is often interpreted as page break. It has the hexadecimal value 0C which is decimal 12.
      UE/UES displays a horizontal line for the form feed character if Show Page Breaks as Lines is enabled in menu View and no other character is set as page break character in File - Print Setup/Configuration - Page Setup for Page break code.
      ^b represents in non regular expressions or UltraEdit regular expression finds/replaces the page break character.
      In Microsoft Word ^b can be used also in find/replace strings for section breaks.
      In many programming languages \f represents a form feed in strings as well as in Unix and Perl regular expression strings.
    4. VERTICAL TAB
      The vertical tab (VT), also known as line tabulation, is almost never used nowadays. It has the hexadecimal value 0B which is decimal 11.
      In Unix regular expression strings \v represents the vertical tab as also in strings in many programming languages. But in Perl regular expressions \v is a placeholder for all vertical whitespaces.
      There is no special character for a vertical tab in a non regular expression or an UltraEdit regular expression find/replace as this character is practically not present anymore in text files.
    5. Unicode vertical whitespaces
      The Unicode table defines 3 additional vertical whitespace characters which are with their hexadecimal values:

      Code: Select all
      0x0085 ... NEXT LINE (NEL)
      0x2028 ... LINE SEPARATOR
      0x2029 ... PARAGRAPH SEPARATOR

      Most people don't know about those characters and therefore they are usually not present in Unicode encoded text files.
Note: Many people think that \s matches in a Perl regular expression only (common) spaces and (horizontal) tabs and therefore use it often in find strings. Yes, \s matches spaces and tabs, but also all other whitespaces including the newline characters carriage return and line feed. An expression like \s+ does not stop at the end of the line. It also matches the line termination and continues matching whitespaces on the following lines until a character is found which is not one of the 26 whitespace characters. That is very often not the wanted behavior. The usage of \h or [\t ] is often better.
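
The difference can be seen with a short test in plain JavaScript. This is only a sketch: the helper firstMatch and the sample string are made up, and the set of characters matched by \s in JavaScript is not exactly the same as in the Perl engine of UltraEdit, but the behavior across line terminators is the same.

```javascript
// Hypothetical helper: returns the text of the first match or null.
function firstMatch(expression, text) {
  const match = expression.exec(text);
  return match === null ? null : match[0];
}

// "one" with trailing space and tab, an empty line, then indented "two".
const whitespaceSample = "one \t\r\n\r\n  two";

// \s+ runs over both line terminations and the indent of the next text line.
const matchedBySlashS = firstMatch(/\s+/, whitespaceSample);

// [\t ]+ stops at the end of the first line.
const matchedByTabSpace = firstMatch(/[\t ]+/, whitespaceSample);
```

Here matchedBySlashS is the complete whitespace run " \t\r\n\r\n  " between "one" and "two", while matchedByTabSpace is only the space and the tab at the end of the first line.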



Remove or delete blank lines

After the two lessons above we know what a line in a text file is and which invisible characters can be on a line.


What is a blank line?

A blank line is a line with zero or more horizontal whitespaces.


How to remove all blank lines in a selected block, entire file, all open files, all files in a folder or even all files in a directory tree?

Best is to use the Perl regular expression engine available since UltraEdit v12.00 and UEStudio v5.50.

Use as search string: ^(?:[\t ]*(?:\r?\n|\r))+

The replace string is an empty string in either the Replace or Replace in Files dialog.

The Regular Expressions option must be enabled and Perl must be selected, too.

The Perl regular expression engine can be selected either in the advanced options of the replace dialog opened by pressing button Advanced if not already displayed, or at Advanced - Configuration - Search - Regular Expression Engine in UltraEdit prior v14.00 and UEStudio prior v6.50.

The selected Replace Where option in the Replace respectively Replace in Files dialog determines on which part of current file or on which files the blank lines are removed by running once a Replace All.

In case you are still using an UltraEdit version prior to 12.00 or UEStudio prior to 5.50 you should think about an upgrade. However, IDM explained in detail in power tip Remove blank lines how to remove blank lines using the UltraEdit regular expression engine which is not as powerful as the Perl regular expression engine and therefore requires executing Replace All repeatedly until no blank line is removed anymore.

The search string above finds only lines containing zero or more spaces or tabs. If the non breaking space should be also included, use the search string ^(?:[\t \xA0]*(?:\r?\n|\r))+ or, depending on your version of UE or UES (see above), use ^(?:\h*(?:\r?\n|\r))+ to find really all blank lines with zero or more occurrences of any horizontal whitespace character within the line. On running a Replace in Files on UTF-8 encoded files the search string must be ^(?:[\t \xA0\xC2]*(?:\r?\n|\r))+ as the no-break space is encoded in UTF-8 with 2 bytes with the hexadecimal values C2 A0.
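
To check outside the editor what counts as a blank line, the following sketch in plain JavaScript can be used. It is an assumption-laden illustration with the made-up function name removeBlankLines. The search string is not copied 1:1 because ^ in JavaScript multiline mode also matches between CR and LF, which would damage DOS line terminations. So the text is first split into lines together with their line terminators, and the same character class [\t ]* decides whether a line is blank.

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
function removeBlankLines(text) {
  // Every array element is one line including its line terminator.
  // The last element can be a last line without line termination.
  const lines = text.match(/[^\r\n]*(?:\r\n|\n|\r)|[^\r\n]+$/g) || [];
  // Blank: zero or more tabs/spaces followed by a line terminator.
  const isBlank = /^[\t ]*(?:\r\n|\n|\r)$/;
  return lines.filter((line) => !isBlank.test(line)).join("");
}
```

Like the Perl search string, this sketch keeps a last line consisting only of spaces or tabs if that line has no line termination.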


How to keep this replace for fast usage?

Do you need to remove blank lines, for example in an entire file, quite often?

Then you should store this replace in a macro stored together with other often needed macros in a macro file which is configured at Macro - Set Auto Load for being automatically loaded on startup of UltraEdit or UEStudio. You can assign a hotkey or chord to the macro for quick execution by key or use the Macro List view opened from View - Views/Lists and double click on the macro in the list to execute it.

It is also possible to code the replace in a script which is added to the script list at Scripting - Scripts and executed by the assigned hotkey or chord, or from menu Scripting or from the Script List view opened from View - Views/Lists with a double click on the script in the list.

Two macro/script examples:

For removing all blank lines in entire file with a macro:

Code: Select all
InsertMode
ColumnModeOff
Top
PerlReOn
Find MatchCase RegExp "^(?:[\t ]*(?:\r?\n|\r))+"
Replace All ""

Note: Users of a version of UltraEdit or UEStudio where command PerlReOn switches the regular expression engine permanently to Perl instead of just temporarily for the macro execution should append either the command UnixReOff or UnixReOn for switching back to UltraEdit or Unix regular expression engine if one of the easier to use, but not so powerful legacy regular expression engines is the preferred engine.

For removing all blank lines in entire file with a script:

Code: Select all
if (UltraEdit.document.length > 0)     // Is any file currently opened?
{
   UltraEdit.insertMode();             // Define environment for the script.
   UltraEdit.columnModeOff();
   UltraEdit.activeDocument.top();
   UltraEdit.perlReOn();               // Define the parameters for the replace.
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
   {
      UltraEdit.activeDocument.findReplace.searchInColumn=false;
   }
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=false;
   // Every backslash in a JavaScript string must be escaped with a backslash to
   // pass the find string correct to the UE/UES Perl regular expression engine.
   UltraEdit.activeDocument.findReplace.replace("^(?:[\\t ]*(?:\\r?\\n|\\r))+", "");
}


For removing just consecutive blank lines in all open files with a macro:

Code: Select all
InsertMode
ColumnModeOff
PerlReOn
Find MatchCase RegExp "^[\t \xA0]*(\r\n|\n|\r)(?:[\t \xA0]*(?:\r?\n|\r))+"
Replace All AllFiles "\1"

Note: All spaces, tabs and non breaking spaces are additionally removed by this replace on each first blank line of all blocks of consecutive blank lines.

For removing just consecutive blank lines in all open files with a script:

Code: Select all
if (UltraEdit.document.length > 0)     // Is any file currently opened?
{
   UltraEdit.insertMode();             // Define environment for the script.
   UltraEdit.columnModeOff();
   UltraEdit.perlReOn();               // Define the parameters for the replace.
   UltraEdit.activeDocument.findReplace.mode=0;
   UltraEdit.activeDocument.findReplace.matchCase=true;
   UltraEdit.activeDocument.findReplace.matchWord=false;
   UltraEdit.activeDocument.findReplace.regExp=true;
   UltraEdit.activeDocument.findReplace.searchDown=true;
   if (typeof(UltraEdit.activeDocument.findReplace.searchInColumn) == "boolean")
   {
      UltraEdit.activeDocument.findReplace.searchInColumn=false;
   }
   UltraEdit.activeDocument.findReplace.preserveCase=false;
   UltraEdit.activeDocument.findReplace.replaceAll=true;
   UltraEdit.activeDocument.findReplace.replaceInAllOpen=true;
   // Every backslash in a JavaScript string must be escaped with a backslash to
   // pass the find string correct to the UE/UES Perl regular expression engine.
   UltraEdit.activeDocument.findReplace.replace("^[\\t \\xA0]*(\\r\\n|\\n|\\r)(?:[\\t \\xA0]*(?:\\r?\\n|\\r))+", "\\1");
}


It should be no problem now to create the variations you need for removing blank lines.
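
For testing such a variation outside the editor, the replace for consecutive blank lines can be imitated in plain JavaScript. This is a sketch under assumptions: the function name collapseBlankLineRuns is made up, and the text is processed line by line instead of with the single search string because ^ in JavaScript multiline mode also matches between CR and LF.

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
// Every run of 2 or more blank lines is replaced by the line terminator
// of the first blank line of the run. Single blank lines are not changed.
function collapseBlankLineRuns(text) {
  const lines = text.match(/[^\r\n]*(?:\r\n|\n|\r)|[^\r\n]+$/g) || [];
  // Tabs, spaces and no-break spaces only, then the captured terminator.
  const blank = /^[\t \xA0]*(\r\n|\n|\r)$/;
  const result = [];
  let index = 0;
  while (index < lines.length) {
    const first = blank.exec(lines[index]);
    if (first === null) {
      result.push(lines[index]);      // line with other characters
      index++;
      continue;
    }
    let next = index + 1;             // find the end of the blank run
    while (next < lines.length && blank.test(lines[next])) next++;
    if (next - index >= 2) {
      result.push(first[1]);          // same as replace string \1
    } else {
      result.push(lines[index]);      // single blank line is kept as is
    }
    index = next;
  }
  return result.join("");
}
```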

PS: Thanks to forum members mjcarman and StaticGhost who contributed information in other topics used in this article.

Hi,

What about the conversion of ASCII NO-BREAK SPACE 0xA0 (Decimal 160) into UTF-8 0xC2A0 (Decimal 49824)?
Because many of my files that I converted to UTF-8 with the script "ConvertAllFilesInDirectoryToUTF8.js" contained characters NO-BREAK SPACE 0xA0 (Decimal 160) that has been converted into 0xC2A0. In many cases, there is no problem, but for separators in some PHP functions, it generates an error or a bad result, for example with sprintf ().
How to find in what files are the UTF-8 characters 0xC2A0?
Otomatic wrote: What about the conversion of ASCII NO-BREAK SPACE 0xA0 (Decimal 160) into UTF-8 0xC2A0 (Decimal 49824)?

The Unicode code value of character NO-BREAK SPACE is also 0xA0 as in ISO-8859-1 and Windows-1252, or more precisely 0x00a0. So if a UTF-8 encoded file containing a no-break space is opened in UltraEdit, the caret is set left of this character and Search - Character Properties is executed, UltraEdit shows in the dialog window the hexadecimal value 0xa0. So for finds/replaces on already opened UTF-8 encoded files there is no difference to ANSI files regarding the no-break space.

All characters with a code value < 128 (decimal) are stored in the file with a single byte with the code value when using the special UTF-8 encoding. But all other characters with a code value >= 128 are stored in a UTF-8 encoded file with 2, 3 or even 4 bytes with each byte value being >= 128. Therefore the no-break space is stored in a UTF-8 encoded file with the two bytes 0xC2 0xA0. It is important that in this case the character is encoded with two bytes and not a single 16-bit value.
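
The byte sequences can be verified quickly with a few lines of plain JavaScript. This is just a sketch. The function name utf8HexBytes is made up, and TextEncoder is the standard encoder available in current browsers and Node.js, not something provided by UltraEdit scripting.

```javascript
// Hypothetical helper: returns the UTF-8 bytes of a string as
// hexadecimal values separated by spaces.
function utf8HexBytes(characters) {
  const bytes = new TextEncoder().encode(characters);   // always UTF-8
  return Array.from(bytes, (byte) =>
    byte.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}
```

utf8HexBytes("A") returns "41" (one byte, code value < 128), utf8HexBytes("\u00A0") returns "C2 A0" for the no-break space, and utf8HexBytes("\u20AC") returns "E2 82 AC" for the Euro sign as an example of a character stored with 3 bytes.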

Special care must be taken when running a Find/Replace in Files on files not already opened, i.e. when Find Where is not set to Open Files. This option is available only in the Find in Files dialog as a replace in all open files is done with the Replace command.

A UTF-16 encoding of a file is automatically detected on Find/Replace in Files because those Unicode files often have an appropriate BOM or the first few bytes can be used to detect a UTF-16 encoding quickly. But UTF-8 encoded files (without a BOM) and ASCII Escaped Unicode files are hard to detect as they do not contain any character different to an ASCII/ANSI file as long as no byte is found with a decimal value >= 128. So UltraEdit does not determine on Find/Replace in Files if a file which is definitely not a UTF-16 encoded file is an ASCII/ANSI, UTF-8 or ASCII Escaped Unicode file. So for a Find/Replace in Files all UTF-8 encoded files are interpreted like ANSI files.


How to find or replace a no-break space in UTF-8 encoded files?

This can be done in several ways:

  1. A simple, non regular expression Find/Replace in Files is executed with the encoding option not checked, searching for the two characters "Â " (without the double quotes) which represent the bytes 0xC2 0xA0 in ANSI. Please take care that the second character is a no-break space and not a common space.
    I do not recommend this method for Find in Files as the display of the found lines in output window will not be good because the UTF-8 byte sequences for characters with a code value >= 128 are displayed in ANSI. For Replace in Files it would be okay if the replace string is correctly encoded in UTF-8, too.
  2. A Perl regular expression Find/Replace in Files is executed with search string \xC2\xA0. This is easier to enter in the Find What field, but has the same problem as the first method: the found lines are output in ANSI which does not look good and the replace string must be correctly encoded in UTF-8 if any character has a code value >= 128.
    So I do not recommend using this method either.
  3. A non regular expression Find/Replace in Files is executed searching for the no-break space character entered in the Find What field. Even better, the character is selected in a file (ANSI or UTF-8, does not matter) before opening the Find/Replace in Files dialog, which results in having this character automatically preset as string to find. Before starting Find/Replace in Files, the advanced options are opened by pressing the button Advanced, option Use Encoding is checked and 65001 (UTF-8) is selected from the list below it (near the end of the list). UltraEdit converts on start of the Find/Replace the search and the replace string to UTF-8 byte streams and runs the find/replace with those byte streams. The found lines in case of a Find in Files are now displayed well as UltraEdit knows that the bytes on every found line must be parsed as UTF-8 byte streams.
What you enter as replace string is your choice. You can enter a common space character, or perhaps HTML entity &nbsp;.

Regular expression engines are designed to work on character streams where every character is either an 8-bit character (of type char) or a Unicode character (of type wide character) with 16 or 32-bit (depends on type of Unicode implementation of the used library). Therefore using a regular expression engine on UTF-8 encoded files on a Replace in Files can very easily result in damaged text files if characters with a code value >= 128 are replaced and not taking care of their byte sequences.

The search expression [\t \xA0] works therefore just for ASCII/ANSI and UTF-16 encoded files or UTF-8 encoded files opened already in UltraEdit (because converted in memory to UTF-16 LE), but not for a Replace in Files on UTF-8 encoded files which are not opened.

For a Replace in Files in UTF-8 encoded files [\t \xA0\xC2] is needed for deleting blank lines containing UTF-8 encoded no-break spaces. It can be expected that there are no lines containing just ANSI character Â with hexadecimal value 0xC2 with or without other whitespaces and no other non whitespace characters in the line and therefore this expression is not the best, but good enough for UTF-8 encoded files. I updated my previous post with an additional sentence.

Hello,

this two-pass replacement could be useful for simple refactoring. Once I needed to change many identifiers in all source files so I "invented" this.
I used '{' and '}' as safe delimiters but you can choose other suitable characters. These regexes work in UE19 and above (older versions do not support conditionals).

1st step: Name all search patterns and corresponding replacements.

Find What: \b(?>(<1st_pattern>)|(<2nd_pattern>)|...|(<Nth_pattern>))\b

Replace With: {\1=<1st_replace>}{\2=<2nd_replace>}...{\N=<Nth_replace>}

Maximum number of patterns is 9 (N = 9).
Or 99 if you use $N instead of \N


2nd step: This step is constant.

Find What: \{([^{}]+?)?=(?(1)([^{}]+?)|[^{}]+?)\}

Replace With: \2


For example:

\b(?>(Party_Name)|(Party_Address)|(ID_Party))\b
{\1=Party_FullName}{\2=Party_Address_ID}{\3=Party_ID}
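
The two steps can be imitated in plain JavaScript to see why the intermediate {found=replacement} markers are useful when new names overlap with old names. This is only a sketch with the made-up function name twoPassRename. JavaScript supports neither the atomic group (?>...) nor the conditional (?(1)...), so a normal non-capturing group and a callback function are used as substitutes.

```javascript
// Hypothetical helper for illustration only, not an UltraEdit function.
function twoPassRename(text) {
  // 1st step: mark every found identifier. Groups which did not take
  // part in the match are inserted as empty strings.
  const marked = text.replace(
    /\b(?:(Party_Name)|(Party_Address)|(ID_Party))\b/g,
    "{$1=Party_FullName}{$2=Party_Address_ID}{$3=Party_ID}");
  // 2nd step: keep the replacement only if something stands left of "=".
  // The callback stands in for the conditional of the Perl expression.
  return marked.replace(/\{([^{}]*)=([^{}]+)\}/g,
    (whole, found, replacement) => (found ? replacement : ""));
}
```

For the text "Party_Name, Party_Address, ID_Party" the first step produces three marker groups per identifier, for example {Party_Name=Party_FullName}{=Party_Address_ID}{=Party_ID} for the first one, and the second step reduces the text to "Party_FullName, Party_Address_ID, Party_ID".
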

Find in files blocks (containing 1 or more words) and delete them

Finding and deleting blocks with always same content is no problem as just the block must be selected before opening the replace or replace in files dialog to search for this block and replace it by nothing.

Finding and deleting blocks with varying content, but a fixed number of lines, is also quite simple with a regular expression search. Such block removing replaces are usually no big challenge even for users with little practice in regular expression finds/replaces.

But the deletion of blocks with varying content and a not determined number of lines is very often a big problem even for users with practice in regular expressions.

However, UltraEdit offers multiple methods for finding, selecting and deleting such blocks and I want to briefly explain them here.

Note 1:
All regular expressions below are for text files with DOS or UNIX line terminators. The expressions as written do not work on text files containing only carriage returns, i.e. older MAC files. But they work for files with MAC line terminators after replacing ^r++^n by just ^r (UltraEdit) or \r*\n by just \r (Unix/Perl).

Note 2:
The block starting string and block ending string in the regular expression search strings below should not contain characters which are regular expression characters for the used regular expression engine. If such characters exist nevertheless in those 2 strings, escape them with either ^ (UltraEdit) or \ (Unix/Perl) inserted left to the character with a special regular expression meaning.



Find and select everything between two strings

First, there is the Find Select feature. Holding the SHIFT key while pressing the button Previous or Next (or Find Next) in the Find dialog executes a find upwards or downwards and selects everything from the current position of the caret to the end of the found string. The Find Select feature works also on the commands Find Next and Find Previous when holding key SHIFT on execution of the command.

An existing selection is extended or reduced when using the Find Select feature.

An existing selection is extended if the caret is blinking at top of the selection because the selection was made upwards and the find is executed also upwards, or if the caret is blinking at bottom of the selection because the selection was made downwards and the find is executed also downwards.

An existing selection is reduced and perhaps completely replaced by a new selection if the find is executed in the different direction than the selection was made before whereby a new selection starts where the existing selection started before.

The Find Select feature can be used to select a block of any size. There is no limit in number of lines or number of characters (bytes).

Selecting and then deleting blocks multiple times in a file with the Find Select feature should not be done manually. It is better to write a small macro for this task. Here is a template for the macro:

InsertMode
ColumnModeOff
UnixReOff
Top
Loop 0
Find "block starting string"
IfNotFound
ExitLoop
EndIf
Key HOME
IfColNumGt 1
Key HOME
EndIf
Find RegExp Select "block ending string*^r++^n"
IfSel
Delete
Else
ExitLoop
EndIf
EndLoop
Top


This little macro runs in a loop first a simple find for the block starting string. The loop is exited if this string cannot be found anymore from current position of the caret to end of the file. Caret is moved to beginning of the line containing the block starting string. Next a second find is executed which is this time an UltraEdit regular expression find searching for the block ending string and matching also everything up to end of the line including the DOS or UNIX line terminator. This find is executed with additionally selecting everything from beginning of the line with the block starting string. The selected block is deleted if the block ending string was found in the file. Otherwise the loop is exited and caret is moved back to top of the file.
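
The logic of the macro loop can also be written down in plain JavaScript working on a string. This is a rough sketch and not an UltraEdit script: the function name deleteBlocksLikeMacro and its parameters are made up, and both strings are searched as plain text.

```javascript
// Hypothetical helper for illustration only. It imitates the macro loop
// on a string instead of using UltraEdit commands.
function deleteBlocksLikeMacro(text, startString, endString) {
  let position = 0;
  for (;;) {
    const startAt = text.indexOf(startString, position);
    if (startAt < 0) break;                    // IfNotFound ... ExitLoop
    // Key HOME: go back to the beginning of the line.
    let lineStart = startAt;
    while (lineStart > 0 && text[lineStart - 1] !== "\n" &&
           text[lineStart - 1] !== "\r") {
      lineStart--;
    }
    const endAt = text.indexOf(endString, lineStart);
    if (endAt < 0) break;                      // nothing selected: ExitLoop
    // "*^r++^n": rest of the line, any carriage returns, one line feed.
    let blockEnd = endAt + endString.length;
    while (blockEnd < text.length && text[blockEnd] !== "\r" &&
           text[blockEnd] !== "\n") {
      blockEnd++;
    }
    while (blockEnd < text.length && text[blockEnd] === "\r") blockEnd++;
    if (text[blockEnd] !== "\n") break;        // no line feed: ExitLoop
    blockEnd++;                                // include the line feed
    text = text.slice(0, lineStart) + text.slice(blockEnd);   // Delete
    position = lineStart;
  }
  return text;
}
```

As in the macro, a block is removed from the beginning of the line with the block starting string to the end of the line with the block ending string, and the loop stops as soon as one of the two strings is not found anymore.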

However, using the Find Select feature has some disadvantages in comparison to the other methods explained below:

  1. Always creating a macro for finding and replacing blocks is time consuming.
  2. This method cannot be easily extended to run on all files of a directory or all opened files.
So there is really only one advantage in comparison to the other methods: unlimited block size.



Match everything between two strings - greedy

Most often the blocks to find and delete are small and have just a few kilobytes. In this case it is more efficient to run a regular expression replace all to find and delete the blocks.

With the UltraEdit regular expression engine the search string is:

%*block starting string[~#]++block ending string*^r++^n

With the Unix/Perl regular expression engine the search string is:

^.*block starting string[^#]*block ending string.*\r*\n

The replace string is simply an empty string.

The expression [~#]++ (UltraEdit) or [^#]* (Unix/Perl) matches 0 or more characters which are not #, a character which must not exist anywhere within the block to find and delete. It does not matter which character is used in this negative character set definition. Any other character which definitely never exists within the blocks to delete can be used as well.

With such a regular expression replace it is much easier to find and delete blocks in multiple files.

But there is one big problem here: the expression [~#]++ or [^#]* is greedy.

What does greedy mean?

Let me explain this with a small XML like example.

      <resource type="text">
         <text>any text</text>
         <lang>en</lang>
      </resource>
      <resource type="bitmap">
         <bitmap>file.bmp</bitmap>
         <lang>neutral</lang>
      </resource>
      <resource type="text">
         <text>anderer Text</text>
         <lang>de</lang>
      </resource>

The goal is to delete all resource blocks of type text. With the UltraEdit regular expression engine, for example, the search string would be: %*<resource type="text"[~#]++</resource>*^r++^n

If you first run a find in the active file to verify that the blocks are correctly identified and selected by this expression, you will most likely see something you did not expect. Instead of finding first the upper text resource block and next the second text resource block, the find selects everything. So the find does not stop at the first occurrence of the string </resource>.

And that is what greedy means. A greedy expression matches as many characters as possible while still producing a positive result for the entire search expression.
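
The greedy behavior can be reproduced outside of UltraEdit, for example with Python's re module standing in for the Perl regular expression engine. This is just a sketch under assumptions: the variable names are made up for the illustration, and re.MULTILINE is used on the assumption that it mirrors how ^ matches at every line start in UltraEdit.

```python
import re

xml = (
    '      <resource type="text">\n'
    '         <text>any text</text>\n'
    '         <lang>en</lang>\n'
    '      </resource>\n'
    '      <resource type="bitmap">\n'
    '         <bitmap>file.bmp</bitmap>\n'
    '         <lang>neutral</lang>\n'
    '      </resource>\n'
    '      <resource type="text">\n'
    '         <text>anderer Text</text>\n'
    '         <lang>de</lang>\n'
    '      </resource>\n'
)

# The Unix/Perl search string from above with the XML strings filled in.
greedy = re.compile(r'^.*<resource type="text"[^#]*</resource>.*\r*\n', re.MULTILINE)

match = greedy.search(xml)
matched_lines = match.group(0).count("\n")               # lines covered by the first match
swallowed_bitmap = 'type="bitmap"' in match.group(0)     # bitmap block inside the match?
```

On this sample the very first match already covers all 12 lines, including the bitmap block that should have been kept.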

Sometimes a greedy expression is wanted, for example on a file containing just a list of file names with full path where the path should be removed from all lines. With the greedy expression %?++\ (UE) or ^.*\\ (Perl) it is very easy to remove all characters from all lines up to and including the last backslash of the path.
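
A small sketch of this wanted greedy case, again with Python's re module as an assumed stand-in for the Perl engine and with made-up file names:

```python
import re

# Hypothetical list of file names with full path, one per line.
paths = "C:\\Temp\\Test\\file1.txt\nD:\\Projects\\readme.md\n"

# Greedy .* runs to the end of each line and backs off to the LAST backslash.
names = re.sub(r'^.*\\', '', paths, flags=re.MULTILINE)
```

Because .* is greedy, everything up to the last backslash of each line is removed, and only the file names remain.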

But very often a greedy expression is not wanted, especially not on finding blocks for deletion.

What is needed very often is a non greedy expression which matches as few characters as possible while still producing a positive result for the entire search expression.

Unfortunately, it is not possible with the legacy UltraEdit and Unix regular expression engines to define a non greedy expression for finding blocks.

So the expression templates written here can be safely used only if you are 100% sure that the block to find and delete exists only once in a file and that especially the block ending string never exists more than once in a file.



Match everything between two strings - non greedy

Well, a non greedy expression for finding blocks is not possible with the legacy UltraEdit or Unix regular expression engines. But since UltraEdit version 12.00 there is a third one, the Perl compatible regular expression engine. And this regular expression engine offers a very simple method to change a greedy expression into a non greedy one: a question mark appended to the greedy multiplier (* or +) makes it non greedy.

With the Perl regular expression engine the non greedy search string is:

^.*block starting string[^#]*?block ending string.*\r*\n

That's it, really?

Yes, that's it. But with the Perl regular expression engine the expression [^#]*? can be made more generally usable by replacing it with .*? which matches any character except newline characters 0 or more times non greedy.

Except newline characters, for matching a block?

Well, by default a dot matches any character except a carriage return or a line-feed. But a different behavior is needed here as multiple lines should be matched by the expression, and therefore line ending characters must also be matched by the expression .*?

The good news: what a dot matches can be changed by using (?s) in the search string. This is a special flag which tells the Perl regular expression engine to let every dot also match line terminating characters, so that a dot now really means any character. Usually this flag is placed at the beginning of the search string.

The non greedy search string could be therefore also:

(?s)^.*?block starting string.*?block ending string.*?\r*\n

As you can see there are two more ? in the search string now. This is necessary because all dots now also match carriage returns and line-feeds. Without them the other two .*, which should match just the characters from the beginning of a line to the block starting string and from the block ending string to the end of the line, would be greedy and the result would be a match of the entire file.

Safer against a wrong match in this case, with the dot also matching line terminating characters, would be:

(?s)^[^\r\n]*block starting string.*?block ending string[^\r\n]*\r*\n

[^\r\n] matches any character except a carriage return or a line-feed and is therefore exactly what a dot matches without the flag (?s) at the beginning of the search string.
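
The three non greedy variants can be compared on the XML example with a short Python sketch. Python's re module is only an assumed stand-in for UltraEdit's Perl engine (with re.MULTILINE so that ^ matches at every line start), and the variable names are chosen just for this illustration.

```python
import re

xml = (
    '      <resource type="text">\n'
    '         <text>any text</text>\n'
    '         <lang>en</lang>\n'
    '      </resource>\n'
    '      <resource type="bitmap">\n'
    '         <bitmap>file.bmp</bitmap>\n'
    '         <lang>neutral</lang>\n'
    '      </resource>\n'
    '      <resource type="text">\n'
    '         <text>anderer Text</text>\n'
    '         <lang>de</lang>\n'
    '      </resource>\n'
)

# Variant 1: negative character set made non greedy.
v1 = r'^.*<resource type="text"[^#]*?</resource>.*\r*\n'
# Variant 2: (?s) with non greedy dots everywhere.
v2 = r'(?s)^.*?<resource type="text".*?</resource>.*?\r*\n'
# Variant 3: (?s) with [^\r\n]* keeping line start and line end on one line.
v3 = r'(?s)^[^\r\n]*<resource type="text".*?</resource>[^\r\n]*\r*\n'

after_v1 = re.sub(v1, '', xml, flags=re.MULTILINE)
after_v2 = re.sub(v2, '', xml, flags=re.MULTILINE)
after_v3 = re.sub(v3, '', xml, flags=re.MULTILINE)
```

In this sketch variants 1 and 3 leave exactly the bitmap block. Variant 2 removes the bitmap block as well: after the first text block is deleted, ^.*? may start at the first line of the bitmap block and expand across lines until it reaches the next block starting string. That is the kind of wrong match the [^\r\n]* form protects against.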



Find and delete only blocks containing a string

Especially for XML files there is often the task to find and delete blocks containing a specific string.

This additional requirement makes the block finding expression a real challenge: it must be avoided that the Perl regular expression engine starts the match at the beginning of a block X not containing the specific string and ends the match at the end of a block Y which does contain the specific string.

The Perl regular expression search string for finding a block containing a specific string is:

(?s)^[^\r\n]*block starting string(?:(?!block ending string).)*?specific string.*?block ending string[^\r\n]*\r*\n

Deleting all text resource blocks with German text on the XML example above could be done with the search string

(?s)^[^\r\n]*<resource type="text"(?:(?!</resource>).)*?<lang>de</lang>.*?</resource>[^\r\n]*\r*\n

and an empty replace string resulting in

      <resource type="text">
         <text>any text</text>
         <lang>en</lang>
      </resource>
      <resource type="bitmap">
         <bitmap>file.bmp</bitmap>
         <lang>neutral</lang>
      </resource>
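
This result can be checked with a short Python sketch, with Python's re module again only assumed to behave like UltraEdit's Perl engine for this expression and re.MULTILINE added so that ^ matches at every line start:

```python
import re

xml = (
    '      <resource type="text">\n'
    '         <text>any text</text>\n'
    '         <lang>en</lang>\n'
    '      </resource>\n'
    '      <resource type="bitmap">\n'
    '         <bitmap>file.bmp</bitmap>\n'
    '         <lang>neutral</lang>\n'
    '      </resource>\n'
    '      <resource type="text">\n'
    '         <text>anderer Text</text>\n'
    '         <lang>de</lang>\n'
    '      </resource>\n'
)

# Search string from above: only text resource blocks containing <lang>de</lang>.
german_blocks = re.compile(
    r'(?s)^[^\r\n]*<resource type="text"(?:(?!</resource>).)*?'
    r'<lang>de</lang>.*?</resource>[^\r\n]*\r*\n',
    re.MULTILINE,
)

remaining = german_blocks.sub('', xml)
```

The English text block is not touched because the part (?:(?!</resource>).)*? cannot run past the </resource> of the first block while looking for <lang>de</lang>.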

It is even possible to use a non marking OR expression to find and delete blocks containing one string of a list of specific strings.

(?s)^[^\r\n]*block starting string(?:(?!block ending string).)*?(?:string 1|string 2|string 3).*?block ending string[^\r\n]*\r*\n

I don't want to explain here how this regular expression string works although I know it. It is very difficult to understand even for regular expression experts and therefore not easy to explain to people not knowing much about Perl regular expressions.
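
For readers who are curious anyway, the parts of the expression can be annotated in a Python sketch using the verbose mode of Python's re module. This is one possible reading and not an official explanation: the helper name block_with_any and its parameters are hypothetical, the flags re.MULTILINE and re.DOTALL are assumed to correspond to the line anchors and the (?s) flag in UltraEdit, and re.escape is used so that the inserted strings are taken literally.

```python
import re

def block_with_any(start, end, needles):
    """Build a pattern for blocks from `start` to `end` containing one of `needles`."""
    s, e = re.escape(start), re.escape(end)
    alt = "|".join(re.escape(n) for n in needles)
    return re.compile(
        rf"""
        ^[^\r\n]*{s}      # from the line start up to the block starting string
        (?:(?!{e}).)*?    # step forward one character at a time, but never
                          # from a position where the block ending string begins
        (?:{alt})         # one of the specific strings, still inside this block
        .*?{e}            # shortest run up to the end of the same block
        [^\r\n]*\r*\n     # rest of that line and its line terminator
        """,
        re.MULTILINE | re.DOTALL | re.VERBOSE,
    )

xml = (
    '      <resource type="text">\n'
    '         <lang>en</lang>\n'
    '      </resource>\n'
    '      <resource type="bitmap">\n'
    '         <lang>neutral</lang>\n'
    '      </resource>\n'
    '      <resource type="text">\n'
    '         <lang>de</lang>\n'
    '      </resource>\n'
)

start, end = '<resource type="text"', '</resource>'
no_en_de = block_with_any(start, end, ['<lang>en</lang>', '<lang>de</lang>']).sub('', xml)
untouched = block_with_any(start, end, ['<lang>neutral</lang>']).sub('', xml)
```

The second call shows the purpose of the negative lookahead: <lang>neutral</lang> exists only in the bitmap block, and since no text block may be left while searching for it, nothing is deleted.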


ATTENTION:
The methods for selecting and deleting a block using a regular expression replace command do not work for blocks of unlimited size. None of the 3 offered regular expression engines supports matching very large blocks or strings. So use this method only for blocks of just some kilobytes. I once made several attempts to find out the limit, but could not even determine the limiting criteria as I saw different results depending on the search string and the content of the files. However, blocks smaller than 64 kilobytes are certainly never a problem.
Best regards from Austria
Hello,

if you need to search for <X> followed by <Y> but without <Z> in between, then you may find this Perl regex with nested lookarounds useful:

<X>(?=(?>.(?<!<Z>))*?<Y>)

Please don't forget that UltraEdit/UEStudio supports only lookbehinds with a fixed length (at the time of writing).
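
A hedged way to try this expression outside of UltraEdit is the Python sketch below. One substitution is made and should be noted: the atomic group (?>...) is replaced by a plain non-capturing group (?:...) because atomic groups are only available in newer Python versions. For this particular expression that should make no difference to what is matched, since the group holds just one character plus a lookbehind.

```python
import re

# <X>, then (looking ahead only) characters up to <Y>, none of which
# completes the string <Z>.
pattern = re.compile(r'<X>(?=(?:.(?<!<Z>))*?<Y>)')

plain = pattern.search("<X> one <Y>")               # no <Z> in between
blocked = pattern.search("<X> one <Z> two <Y>")     # <Z> in between
second = pattern.search("<X> a <Z> <X> b <Y>")      # only the second <X> qualifies
```

Only <X> itself is matched. Everything up to <Y> is merely checked by the lookahead and is not part of the match.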

BR, Fleggy
With the Perl regular expression engine the search string (?s)$(?!.) matches end of file without matching (selecting) any character.

Explanation:

(?s) ... flag to match also newline characters with dot. See the forum topic "." (dot) in Perl regular expressions doesn't include newline characters CRLF? for a detailed explanation of this flag.

$ ... end of line without matching any character. The Perl regular expression engine also interprets the end of the character stream, i.e. the end of the file, as end of line.

(?!.) ... a negative lookahead which is only true if no character follows the end of line, which is the case only at the end of the file.

With this search string the replace string defines the string to append to end of file. Searching for (?s)$(?!.) with any replace string can be used to append the replace string to end of each file using a Replace in Files.

The search expression also works for large and huge files of several hundred MB or even some GB, but it takes some time to find the end of file as the entire file is searched for newline characters, with a check after each one whether any more characters follow.
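
The behavior of this search string can be tried with a small Python sketch. Python's re module is only an assumed stand-in for the Perl engine of UltraEdit, with re.MULTILINE added so that $ also matches at every line end, and the appended strings are made up for the illustration.

```python
import re

# Matches the empty string at end of file only.
eof = re.compile(r'(?s)$(?!.)', re.MULTILINE)

# Append a line to a text that ends with a line terminator.
with_footer = eof.sub('-- end --\n', 'line 1\nline 2\n')

# Append just a line terminator to a text whose last line has none.
terminated = eof.sub('\n', 'line 1\nline 2')

# Number of positions matched in a multi-line text.
hits = len(eof.findall('line 1\nline 2\n'))
```

Although $ is tested at every line end, the lookahead rejects all of them except the position after the very last character, so the replace string is inserted exactly once.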
Best regards from Austria