Remove double or extra spaces

bnadesan · Nov 08, 2004 · #1 · 2004-11-08T20:15+00:00

Hi All,

How do I remove double or extra spaces between words or letters and replace them with a single space?

For example:

1)Hi     Sam   replace to  Hi Sam
2) A    B    C D replace it with A B C D

Mofi · Nov 09, 2004 · #2 · 2004-11-09T07:38+00:00

That is simple. Use this regular expression in UltraEdit style. Replace ° with the space character. ° is used here only for better visibility in html.

Find What: °°+
Replace With: °

This regular expression searches for two or more spaces and replaces them with a single space. Tab characters are not modified.
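For readers who want to try the same idea outside UltraEdit, here is a minimal Python sketch of the replacement above. It assumes Python's `re` module rather than the UltraEdit engine, and the function name `collapse_spaces` is made up for illustration.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with one space.

    Tabs are deliberately left alone, like in the UltraEdit expression.
    """
    return re.sub(r" {2,}", " ", text)


# The two examples from the first post.
print(collapse_spaces("Hi     Sam"))
print(collapse_spaces("A    B    C D"))
```

Single spaces are never touched, because the pattern requires at least two spaces in a row.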

YSLGuru · Mar 13, 2008 · #3 · 2008-03-13T16:50+00:00

I did a quick search on this and found the posting from Mofi above, but I'm looking for the best/easiest way to remove all excess white space from a file using UEStudio/UltraEdit. I have UEStudio, but I've posted this question for both products so as to get an answer that will work in either, as I believe this is probably a very commonly performed task by many users.

What is excess white space in a file? To me excess white space is any instance of 2 or more contiguous white space characters. I work a lot with T-SQL (SQL for the SQL Server platform) and I spend a good amount of that time cleaning up the mess that most SQL-based code/scripts are in when I get them. The easiest way to do this is to first replace all tabs with spaces, something built into UEStudio (within the menus under Format-->Tabs To Spaces). Next I remove all excess white space using UEStudio's Search & Replace, specifying 2 white space characters as the Search value and 1 white space character as the Replace value. I repeat the search until UEStudio responds with the message that the search value cannot be found.
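The manual procedure described above can be sketched in Python as follows. This is only an illustration of the repeated two-for-one replacement, not anything UEStudio does internally; the function name `squeeze_by_repeating` and the tab width of 4 are assumptions.

```python
def squeeze_by_repeating(text: str, tab_width: int = 4) -> str:
    """Convert tabs to spaces, then replace 2 spaces by 1 until none are left."""
    # Step 1: tabs to spaces (expands each tab to the next tab stop).
    text = text.expandtabs(tab_width)
    # Step 2: repeat the replace until the search value cannot be found.
    while "  " in text:
        text = text.replace("  ", " ")
    return text


print(squeeze_by_repeating("SELECT\t\tcol1,    col2"))
```

The loop always terminates, because every pass makes the text shorter as long as a double space is still present.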

What would be great is to have a macro (preferably) that will perform this same process so I don't have to keep performing the search & replace manually, or at least to come up with some regular expression that does the same recursive action of replacing 2 white spaces with 1. I'm completely open to any suggestions on how to reach this and am not set on using Search & Replace or even a macro. If there is a built-in method in UE to do this that I am unaware of, please let me know.

Lastly for me it would be great if there were a way to combine the tabs to spaces action with the removal of excess white space so both are done via a single command.

Thanks in advance to any and all replies!

YSLGuru

mjcarman · Mar 13, 2008 · #4 · 2008-03-13T17:54+00:00

You can do it all with a single replace operation:

Find What: [ \t]{2,}
Replace With: <a single space>
Replace Where: Current File
Select "Regular Expressions" and Perl as the Regular Expression Engine.
Select "Replace All is from top of file"
Unselect "Match Whole Word Only"

Or, in macro form:

InsertMode
ColumnModeOff
HexOff
GotoLine 1 1
PerlReOn
Find RegExp "[ \t]{2,}"
Replace All " "

But keep in mind that this regular expression keeps single tabs between two other characters which are neither a space nor a tab.
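The caveat above is easy to check with a small Python sketch. It uses Python's `re` module, which accepts the same pattern as the Perl engine here; the function name `squeeze_runs` is made up for illustration.

```python
import re


def squeeze_runs(text: str) -> str:
    """Replace runs of two or more spaces/tabs with a single space."""
    return re.sub(r"[ \t]{2,}", " ", text)


print(repr(squeeze_runs("a \t b")))  # mixed run of three characters is collapsed
print(repr(squeeze_runs("a\tb")))    # a lone tab is shorter than 2, so it stays
```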

Mofi · Mar 14, 2008 · #5 · 2008-03-14T09:20+00:00

With the legacy Unix regular expression engine as well as with the Perl engine, the search string [ \t]+ replaced by a single space would work too. In UltraEdit style the search string would be [ ^t]+.

That replace should not be used when running the replace step by step, because it also finds a single space and replaces it with a single space. It is the simplest regular expression to replace any occurrence of spaces/tabs by a single space and should be used only with Replace All.

What most users don't know is that other types of spaces exist as well. For example, in Western code pages (ISO-8859-*) there is the non-breaking space with code value 160 (hex. A0). You can include the non-breaking space in your search by searching for [ \t ]+ (Unix/Perl) or [ ^t ]+ (UltraEdit). You can't see it in your browser, but the second space in the expression has a different value. Copy the string into a new edit window of UltraEdit and toggle to hex edit mode, or use Search - Character Properties on the second space. You may then see that it is a different space. Some browsers convert a non-breaking space to a normal space with code value 32 (hex. 20) when copying it to the Windows clipboard. The Unicode table also contains further types of spaces like "em space", "en space", "thin space", ...
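Because the non-breaking space is invisible in a pattern, it can help to write it as an escape. The following Python sketch does that with `\u00a0`; it is an illustration using Python's `re` module, not UltraEdit syntax, and the function name `squeeze_with_nbsp` is made up.

```python
import re

# Space, tab and non-breaking space (code value 160, hex A0).
SPACE_LIKE = re.compile("[ \t\u00a0]+")


def squeeze_with_nbsp(text: str) -> str:
    """Replace any run of spaces, tabs and non-breaking spaces by one space."""
    return SPACE_LIKE.sub(" ", text)


sample = "a\u00a0\u00a0 b"
print([hex(ord(ch)) for ch in sample])   # shows 0xa0 versus 0x20
print(repr(squeeze_with_nbsp(sample)))
```

Other Unicode spaces such as the em space are not covered by this character class and would have to be added explicitly.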

And there is no difference between UltraEdit and UEStudio regarding the search and replace functions, except that UEStudio also supports search/replace in all files of a solution. UltraEdit does not support solutions, which are also known as project spaces in some IDEs. A solution contains multiple projects.

Jane · Mar 26, 2008 · #6 · 2008-03-26T19:53+00:00

The problem with using [ \t]+ is that it matches and wastes time replacing all single spaces with single spaces.

To match only runs of multiple spaces/tabs and single tabs, it would probably be better to use [ \t]{2,}|\t in the search box and replace with a single space. This will skip all single spaces. This regular expression search string works only with the Perl compatible regular expression engine. It cannot be used with the legacy Unix or UltraEdit regular expression engines.

Cordially
Jane

Added by Mofi:
With the UltraEdit regular expression engine use ^{^t[ ^t]++^}^{ [ ^t]+^} as search string and a single space as replace string to do the same as the Perl engine does with [ \t]{2,}|\t as search string.

To do the same with the legacy Unix regular expression engine, use (\t[ \t]*| [ \t]+) as search string and a single space as replace string. Of course that search string can also be used with the Perl compatible regular expression engine.
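That the two Perl-compatible search strings above give the same result can be checked with a short Python sketch. It uses Python's `re` module as a stand-in for the Perl engine; the names `normalize` and `SAMPLES` are made up, and the UltraEdit-syntax variant cannot be tested this way.

```python
import re

PERL_STYLE = re.compile(r"[ \t]{2,}|\t")
UNIX_STYLE = re.compile(r"(\t[ \t]*| [ \t]+)")

SAMPLES = ["a b", "a\tb", "a  b", "a\t \tb", "a \tb", "\tindented  line"]


def normalize(pattern: re.Pattern, text: str) -> str:
    """Replace every match of the pattern with a single space."""
    return pattern.sub(" ", text)


for sample in SAMPLES:
    # Both patterns skip single spaces and convert everything else.
    print(repr(sample), "->", repr(normalize(PERL_STYLE, sample)),
          normalize(PERL_STYLE, sample) == normalize(UNIX_STYLE, sample))
```

In the second pattern the two alternatives cannot start with the same character (one starts with a tab, the other with a space), so the order of the alternatives does not matter.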

OnlineCop · Jul 24, 2008 · #7 · 2008-07-24T03:13+00:00

If using the UltraEdit regular expression engine, I've found that [ ^t][ ^t]+ as search string and a single space as replace string is simple and effective. It finds only strings with 2 or more contiguous spaces/tabs and replaces them with a single space. Already existing single spaces or single tabs are not modified.

If you want to keep the first whitespace character (space or tab), you can use the following UltraEdit regular expression:

Find: ^([ ^t]^)[ ^t]+
Replace: ^1

This way, you're not replacing single whitespaces (spaces/tabs) like [ ^t]+ does; it only affects 2+ whitespace characters and keeps the first whitespace character (space or tab).
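A rough Python equivalent of the tagged-expression replace above is sketched below. The UltraEdit tag ^( ... ^) and the reference ^1 are expressed as a capture group and `\1`; the function name `keep_first_whitespace` is made up for illustration.

```python
import re


def keep_first_whitespace(text: str) -> str:
    """Shrink runs of 2+ spaces/tabs to their first character."""
    return re.sub(r"([ \t])[ \t]+", r"\1", text)


print(repr(keep_first_whitespace("a\t  b")))  # the run starts with a tab, so a tab is kept
print(repr(keep_first_whitespace("a   b")))   # the run starts with a space, so a space is kept
```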

ridgerunner · Aug 07, 2009 · #8 · 2009-08-07T05:19+00:00

There is a problem with OnlineCop's solution: it does not replace a single tab with a single space.
Jane's previously mentioned (Perl-style) solution does work correctly:

[ \t]{2,}|\t

which, in theory, we should be able to write using the old native UE regex syntax as follows:

^{[ ^t][ ^t]+^}^{^t^}

However, this UE native regex doesn't work under UE14.20.1.1008. It correctly matches whitespace sequences that begin with one or more spaces, but fails to match a complete sequence of whitespace if this sequence begins with a tab. UE only matches the beginning tab in the whitespace sequence.

Bug? (Not sure if older or newer versions exhibit this erroneous behavior.)

Mofi · Aug 07, 2009 · #9 · 2009-08-07T09:48+00:00

To clarify what each solution in the various posts does, I have now edited all posts in this topic and explained in more detail what the various regular expressions do.

ridgerunner,
I added working regexp search strings for UltraEdit and Unix engine at Jane's post doing the same as the Perl regexp string [ \t]{2,}|\t posted by Jane.

From what I have observed over the last years, the Unix and UltraEdit engines internally use the same code functions. I guess the Unix regexp strings are simply converted to UltraEdit syntax internally before that function is called. It looks like the UltraEdit engine uses the Microsoft syntax (as in MS Word or MS Excel), and the Unix engine was introduced only to help UltraEdit users familiar with the Perl syntax used on Unix machines to use regular expressions in UltraEdit more easily. With the introduction of the Perl compatible engine, the full power of that search engine is also available in UltraEdit.

A disadvantage of the Perl engine, which is very powerful and makes very complex search and replace operations possible, is one it shares with all very powerful programs: it contains bugs and therefore sometimes does not produce the expected results. Fixing a bug in such complex functions often means that something which worked well before no longer works correctly afterwards. I think most programmers know what I'm talking about here.

The legacy engine (UE/Unix) is not as powerful as the Perl compatible engine and of course also contains some bugs and limitations. But after years of practice with the legacy engine I know quite well how it works internally and know most of its bugs and limitations. So I can often quickly find a working solution for most search/replace tasks even with the limited legacy search engine (UE/Unix).

ridgerunner,
the non-working UE regular expression search string you posted is definitely a bug from a user's point of view. But it did not surprise me that it does not work, because I know how the OR expression works in the legacy engine, and it works completely differently from the Perl engine. How to explain the different methods of the engines? Let us look at a simple example. There is the following line:

test1 test2

And you run a UE regexp search with the search string ^{test1^}^{test^} (a stupid search string, but good for explaining the different methods). You expect that the search selects first test1 and next test from test2. But that does not happen; the UltraEdit engine selects just test both times. I don't really know what happens internally, but it looks like the UltraEdit engine evaluates character by character whether the string built so far matches one of the 2 possible OR argument expressions. So it first checks if t matches one of the 2 expressions. That is the case here for both expressions. So the engine takes the next character, builds the string te and evaluates this string again with both expressions. Two steps further, the evaluated string test is a 100% match of the second expression and the UE engine exits because a string has been found. So the UltraEdit engine treats both arguments of an OR expression at exactly the same level of importance.

Now let us do the same with the Perl compatible engine by searching for (test1|test). This engine now selects first test1 and second test from test2, which is what everyone expects. Why? Again I don't really know how the Perl engine works, but it looks like it works as follows. It takes the t and checks if it is matched by any expression in the OR expression. In this example this is true for both arguments. So it remembers the string position and now evaluates just the first expression on the entire string (byte stream). If it matches, it returns the matching string. If that were not the case, it would rewind the byte stream to the position of character t and evaluate the byte stream from this position with the second OR expression, and if that matched, it would return this matching string.
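The Perl-style behaviour described above can be reproduced with Python's `re` module, which also tries the alternatives of an OR expression one after the other from the same start position. This is only a sketch of the Perl-style side; the legacy UltraEdit engine cannot be simulated this way.

```python
import re

line = "test1 test2"

# Alternatives are tried left to right at each start position.
matches = [m.group(0) for m in re.finditer(r"test1|test", line)]
print(matches)  # ['test1', 'test']
```

At the second word the first alternative test1 fails on the character 2, so the engine goes back to the t and succeeds with the second alternative test.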

So the UltraEdit engine always evaluates a string with both expressions of the OR argument at the same time, while the Perl engine evaluates a string with one expression after the other. The main difference is that the UltraEdit engine avoids looking back on the byte stream, while the Perl engine is designed to go back on the byte stream and evaluate from that position again. Avoiding looking back makes the UltraEdit engine fast, but limits its capabilities. Supporting looking back gives the Perl engine the power it has, but can make simple searches slower. You can also observe that by modifying the line to

test12 test2

and searching for ^{test1^}^{test[0-9]+^} or (test1|test[0-9]+). Both engines now select only test1 and not test12. Independent of the search engine, it is never a good idea to use an OR expression where 2 (or more) of the arguments start with an expression matching the same characters, as done here. With the Perl engine the leftmost matching expression always "wins"; with the UltraEdit engine the expression that first returns a 100% match always "wins". These different working methods must be taken into account when the arguments of an OR expression match the same substring at the start of a matching string.
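How much the order of the alternatives matters in a Perl-style engine can be seen in the following Python sketch. It uses Python's `re` module as a stand-in for the Perl engine; the helper name `find_all` is made up for illustration.

```python
import re

line = "test12 test2"


def find_all(pattern: str, text: str) -> list:
    """Return the full text of every match, in order."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# The leftmost alternative that matches wins, even if it is shorter.
print(find_all(r"test1|test[0-9]+", line))  # ['test1', 'test2']

# Swapping the alternatives changes the result for the first word.
print(find_all(r"test[0-9]+|test1", line))  # ['test12', 'test2']
```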

Now you hopefully understand why the UE expression ^{[ ^t][ ^t]+^}^{^t^} always selects only the first tab when a whitespace string starts with a tab character, while the Perl expression [ \t]{2,}|\t works. If the Perl regexp \t|[ \t]{2,} were used, it wouldn't work either.
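The last remark about the reversed Perl expression can be verified with a small Python sketch, again using Python's `re` module in place of the Perl engine. The variable names are made up for illustration.

```python
import re

text = "a\t\t  b"  # two tabs followed by two spaces

good = re.sub(r"[ \t]{2,}|\t", " ", text)
bad = re.sub(r"\t|[ \t]{2,}", " ", text)

print(repr(good))  # 'a b'
print(repr(bad))   # 'a   b'
```

With the reversed order each tab is matched on its own by the first alternative and becomes its own space, so the run is not collapsed into one space.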

If you look at the UltraEdit and Unix regexp strings I posted in Jane's post, you see that I avoided the problem of identical starting substrings for both expressions in the OR expression. The first argument in the OR expression matches only strings starting with a tab character followed by 0 or more occurrences of tabs or spaces. The second argument matches only strings starting with a space followed by 1 or more occurrences of tabs or spaces.

ridgerunner · Aug 10, 2009 · #10 · 2009-08-10T00:13+00:00

Thank you Mofi for the detailed explanation! The "buggy" UE behaviour now makes perfect sense and is in fact NOT a bug. I have grown used to the Perl syntax, where OR alternatives are tested in sequence one by one until the first match is found. I just assumed that the UE engine works the same way.

Once again you have demonstrated your patient, thorough and helpful nature!

Cheers!

Dominicon · Mar 04, 2010 · #11 · 2010-03-04T19:20+00:00

Just to throw a wrinkle in the works...

How do you select double spaces that are not at the start of the line? I have a lot of code that uses leading spaces for white space and don't want to blow up my formatting or have to manually parse through the file.
For example (Spaces are replaced with *):

*****variable or code**=**value or more code

I want to replace the double spaces between code**=**value but not those at the start of the line.

Mofi · Mar 05, 2010 · #12 · 2010-03-05T06:36+00:00

Deleting multiple spaces/tabs except at start of a line can be done for example with the UltraEdit regular expression engine searching for ^([~ ^t^r^n][ ^t]^)[ ^t]+ and using ^1 as replace string.
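A Python sketch of the same idea is shown below. It translates the UltraEdit expression above into Perl-style syntax for Python's `re` module ([~...] becomes [^...], the tag becomes a capture group); the function name `squeeze_inner_whitespace` is made up for illustration.

```python
import re

# A non-whitespace character plus one space/tab is captured and kept;
# any further spaces/tabs directly after it are dropped.
INNER_RUN = re.compile(r"([^ \t\r\n][ \t])[ \t]+")


def squeeze_inner_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs, except indentation at the start of a line."""
    return INNER_RUN.sub(r"\1", text)


print(repr(squeeze_inner_whitespace("     variable  =  value")))
```

Leading whitespace is never matched because the pattern must start with a character that is not a space, tab or line terminator.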