Find over multiple lines in a column

Art · Jan 19, 2017 · #1 · 2017-01-19T01:14+00:00

I need to recognize and reformat email addresses (and other similar data formats) in a text file. The information in this file is divided into 2 columns separated by a tab character, but a single data item may span multiple lines within one column. The last lines of each 'record' are optional, but where present they span both columns.

Example:

2017-01-19 01_55_29-Start.png (15.96KiB)

See the attached image for an example.

I am able to recognize the email address by using the UltraEdit regular expression [^.^-_0-9a-z]+^@[^.^-_0-9a-z^p]+.[a-z]+, even over multiple lines, as long as there are no columns. But I'm curious whether I can search over multiple lines within a specific column. And if that is possible, how do I eliminate the tab or CR/LF in the email address string?

A straightforward approach could be to start by reformatting the file into one column (for example, moving the right column under the last line of the left column). The difficulty is that the last line may contain free text, which makes it hard to recognize the last line of the left column.

Does anybody have an idea?

Art

Mofi · Jan 19, 2017 · #2 · 2017-01-19T06:16+00:00

The case-insensitive UltraEdit tagged regular expression search string which might work for this task is:

^([0-9a-z.^-]+@[0-9a-z]*[.^-]^)^p^(*^t^)^([0-9a-z.^-]+^)$

The replace string to use is: ^1^3^p^2

The search string is defined to interpret a string containing @ as spanning two lines only if the last character on the first line is a dot or a dash.
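
For readers more used to Perl-style syntax, the search and replace above can be approximated with Python's `re` module. This is only a sketch, assuming `^p` stands for a line break, `^t` for a tab and `*` for "any characters on the line". The function name `join_split_address` and the sample text are made up for illustration, and `\n` is used instead of CR/LF to keep it short.

```python
import re

# Rough Python counterpart of the UltraEdit tagged expression.
# Assumptions: ^p = line break, ^t = tab, * = any characters except a line break.
SPLIT_ADDRESS = re.compile(
    r"([0-9a-z.-]+@[0-9a-z]*[.-])\n(.*\t)([0-9a-z.-]+)$",
    re.IGNORECASE | re.MULTILINE,
)


def join_split_address(text):
    """Append the second half of a wrapped address to the first half (^1^3^p^2)."""
    return SPLIT_ADDRESS.sub(r"\1\3\n\2", text)


# Hypothetical two-column sample: the address is broken after the dot.
sample = "name\tjohn.doe@example.\nstreet 1\tcom\n"
print(repr(join_split_address(sample)))
```

The second line keeps its left column and its tab, so the two-column layout stays intact after the address has been joined.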

fleggy · Jan 19, 2017 · #3 · 2017-01-19T09:21+00:00

Hi,

just for inspiration: you can modify the following Perl regex to search in columns. Column mode must be ON!

E.g. find abc and efg, both starting at col 10:
(?<=^.{10})abc.*\r\n.{10}efg
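
The lookbehind trick can also be tried outside UltraEdit. The following is a minimal sketch in Python's `re` module, which is not the engine UltraEdit uses, so treat it as an illustration of the idea only. The helper name `find_in_column` and the sample strings are hypothetical, and `\n` replaces `\r\n`.

```python
import re

# (?<=^.{10}) only succeeds when exactly 10 characters precede the match
# on the same line, which pins "abc" to a fixed column.
COLUMN_SEARCH = re.compile(r"(?<=^.{10})abc.*\n.{10}efg", re.MULTILINE)


def find_in_column(text):
    """Return the (start, end) offsets of the two-line match, or None."""
    m = COLUMN_SEARCH.search(text)
    return None if m is None else m.span()


in_column = "0123456789abc tail\n0123456789efg\n"
wrong_column = "abc3456789 tail\n0123456789efg\n"
print(find_in_column(in_column), find_in_column(wrong_column))
```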

BR, Fleggy

Art · Jan 19, 2017 · #4 · 2017-01-19T21:28+00:00

Thanks Fleggy,

I've been testing it for quite some time now. I'm not unfamiliar with regex, but I can't get the Perl regex to work for me.

kjbasjbas ew8q9rfy73476134fuo13496      email:                                  
[email protected] wqpiuqerp;b wdoiqjhwri  [email protected]                          
fqkhwf8jnbuqrhfpqweprfqiruhfquihr       def oqrjgqeprgj[]qegrjg'j               
qe[ghje[itojh[iqeqérgji[qerjgiqjrjíqrjgqpjergopjegopqj]]]]]

First I had to discover that tab positions have no influence on the column mode. A tab is simply regarded as one character and any tab setting is ignored (which is understandable, by the way, as this is an application setting).
See the example above. With your solution I should be able to select the word email: (on line 1, starting at column 40) and at least the word "test" (on line 2, starting at column 40), but I cannot get it to work. Searching the internet didn't bring up any other suggestions.

Based on your example, the find regex (?<=^.{40})email:.*\r\n.{40}test.*$ should select the entire string email:[email protected], but it only selects email and test.
The search regex (?<=^.{40})email:.*\r\n.{40}[email protected] doesn't find anything.
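
The observation that a tab counts as a single character can be illustrated with a small Python sketch. It compares the raw character offset of a word with its display column after tabs are expanded. The tab size of 8 and the helper name `visual_column` are assumptions for the illustration, not UltraEdit settings.

```python
def visual_column(line, needle, tabsize=8):
    """Return the 0-based display column of needle once tabs are expanded, or -1."""
    return line.expandtabs(tabsize).find(needle)


# Hypothetical line: a short left column followed by five tabs.
line = "left\t\t\t\t\temail:"
print(line.find("email:"), visual_column(line, "email:"))
```

A pattern such as `.{40}` counts raw characters, so on a tab-separated line it only lines up with display column 40 after the tabs have been converted to spaces.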

I hope that you can help me out. Otherwise I will consider a more rigid solution, like copying each column of each record to a new file. Searching in a single column is a lot easier.

Peter

fleggy · Jan 20, 2017 · #5 · 2017-01-20T09:30+00:00

Hi Art,

I modified your pattern a little (so that it does not select trailing whitespace characters):

(?<=^.{40})email:.*\r\n.{40}test.*?(?=\s*$)

It works for me in your sample using UE 23.20.0.43 and UE 24.00.0.10 BETA (both x64). What is your version? Try to change the setting Editor display -> Cursor/Caret -> allow positioning beyond line end (I have this option ON). Or maybe Mofi would have an idea why it doesn't work for you.

BTW I found a bug in UE connected to this case and will report it. Always do CTRL+HOME before searching this pattern.

BR, Fleggy

EDIT: I don't use TABs - always SPACEs only.
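
The effect of the `.*?(?=\s*$)` tail in the pattern above can be checked with a short Python sketch. It assumes space-padded columns of width 40, `\n` line ends and a made-up address `test@example.com`. Python's `re` is not UltraEdit's Perl engine, so this only shows what the pattern is meant to select.

```python
import re

# The lazy .*? combined with the lookahead (?=\s*$) stops the match
# before any trailing whitespace on the second line.
TRIMMED = re.compile(r"(?<=^.{40})email:.*\n.{40}test.*?(?=\s*$)", re.MULTILINE)


def select_email_block(text):
    """Return the matched text from 'email:' to the end of the address, or None."""
    m = TRIMMED.search(text)
    return None if m is None else m.group()


# Hypothetical sample with a 40-character left column and trailing spaces.
sample = (
    f"{'left column line 1':<40}email:   \n"
    f"{'left column line 2':<40}test@example.com     \n"
)
print(repr(select_email_block(sample)))
```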

Art · Jan 22, 2017 · #6 · 2017-01-22T14:50+00:00

Thanks Fleggy!

I work with the latest Windows version.

Your post helped me a lot, but unfortunately it didn't bring the solution I was looking for. The difficulty is that email addresses may span 1, 2 or even 3 lines within this column. I didn't succeed in finding a Perl regex pattern to solve this, but your pattern for searching in specific columns helped me in other search actions as well.

This is how I solved the email find.

InsertMode
ColumnModeOff
HexOff
Key Ctrl+HOME
Loop 2000
  IfEof
    ExitLoop
  Else
    Key Ctrl+HOME
    PerlReOn
    Find RegExp "(?<=^.{40})EMAIL:\r\n"
    IfFound
      GotoEndOfPrevWordSelect
      GotoLine 0 41
      Key DOWN ARROW
      StartSelect
        Key END
      EndSelect
      Cut
      Key UP ARROW
      Key END
      Paste
      ClearClipboard
      Key END
      StartSelect
        Key LEFT ARROW
      EndSelect
      PerlReOn
      Find RegExp SelectText "[\.\-@]" 'Line breaks in the selection are always after a ./-/@, so if this is the last character I have to consider the second line as well.
      IfFound
        GotoLine 0 41
        Key DOWN ARROW
        Key DOWN ARROW
        StartSelect
          Key END
        EndSelect
        Cut
        Key UP ARROW
        Key UP ARROW
        Key END
        Paste
        ClearClipboard
        Key END
        StartSelect
          Key LEFT ARROW
        EndSelect
        PerlReOn
        Find RegExp SelectText "[\.\-@]" 'Line breaks in the selection are always after a ./-/@, so if this is the last character I have to consider the third line as well.
        IfFound
          GotoLine 0 41
          Key DOWN ARROW
          Key DOWN ARROW
          Key DOWN ARROW
          StartSelect
            Key END
          EndSelect
          Cut
          Key UP ARROW
          Key UP ARROW
          Key UP ARROW
          Key END
          Paste
          ClearClipboard
          Key END
          StartSelect
            Key LEFT ARROW
          EndSelect
          PerlReOn
          Find RegExp SelectText "\w" 'The last character of the third must be alphanumeric. If not, the constructed email address is incorrect.
          IfNotFound
            'Still have to fix a rollback
          EndIf
        EndIf
      EndIf
    Else
      ExitLoop
    EndIf
  EndIf
EndLoop
Key Ctrl+HOME

IMHO it is really annoying that UE macros have no support for remarks in the code.

I did encounter other bugs as well, like the undo after a macro containing a Perl find/replace. It totally messed up the file and I had to start over again.

Mofi · Jan 23, 2017 · #7 · 2017-01-23T06:15+00:00

Art, see the sticky macro forum topic "Macro examples and reference for beginners and experts". It explains how to save a macro as text with comments, in addition to the compiled macro file, and how to get syntax highlighting and indentation for this text representation.

The command Top can be used for Key Ctrl+HOME.

The macro reference file says that IfEof should be used only if it is guaranteed that the caret will eventually reach the end of the file. When Find is used in a loop with an indefinite number of iterations, it is highly recommended to exit the loop with IfFound or its opposite IfNotFound instead of IfEof, because the caret is not moved to the end of the file when a searched string is not found.

For example your macro code could be saved into a *.uem file as follows:

InsertMode
ColumnModeOff
HexOff
PerlReOn
Top
Clipboard 9
Loop 0
    Find RegExp "(?<=^.{40})EMAIL:\r\n"
    IfNotFound
        ExitLoop
    EndIf
    GotoLine 0 41
    Key DOWN ARROW
    StartSelect
    Key END
    EndSelect
    Cut
    Key UP ARROW
    Key END
    Paste
    Key END
    StartSelect
    Key LEFT ARROW
    EndSelect
//  Line breaks in the selection are always after a ./-/@, so if this
//  is the last character I have to consider the second line as well.
    Find RegExp SelectText "[.\-@]"
    IfFound
        GotoLine 0 41
        Key DOWN ARROW
        Key DOWN ARROW
        StartSelect
        Key END
        EndSelect
        Cut
        Key UP ARROW
        Key UP ARROW
        Key END
        Paste
        Key END
        StartSelect
        Key LEFT ARROW
        EndSelect
//      Line breaks in the selection are always after a ./-/@, so if this
//      is the last character I have to consider the third line as well.
        Find RegExp SelectText "[.\-@]"
        IfFound
            GotoLine 0 41
            Key DOWN ARROW
            Key DOWN ARROW
            Key DOWN ARROW
            StartSelect
            Key END
            EndSelect
            Cut
            Key UP ARROW
            Key UP ARROW
            Key UP ARROW
            Key END
            Paste
            Key END
            StartSelect
            Key LEFT ARROW
            EndSelect
//          The last character of the third must be alphanumeric.
//          If not, the constructed email address is incorrect.
            Find RegExp SelectText "\w"
            IfNotFound
//          Still have to fix a rollback.
            EndIf
        EndIf
    EndIf
EndLoop
Top
ClearClipboard
Clipboard 0
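
The logic of this macro can be sketched in Python for testing outside UltraEdit. This is not a replacement for the macro, only an illustration under a few assumptions: space-padded lines with the right column starting at offset 40, an EMAIL: label alone in the right column, at most three wrapped fragments, and trailing whitespace stripped from each fragment. The function name `join_wrapped_emails` and the sample data are hypothetical.

```python
def join_wrapped_emails(lines, column=40, max_fragments=3):
    """Move wrapped address fragments from the lines below up to the EMAIL: line."""
    lines = list(lines)
    for i in range(len(lines)):
        # Like the Find in the macro: EMAIL: at the column offset, then line end.
        if lines[i][column:].upper() != "EMAIL:":
            continue
        last = min(i + 1 + max_fragments, len(lines))
        for j in range(i + 1, last):
            fragment = lines[j][column:].rstrip()  # "cut" the right column
            lines[j] = lines[j][:column]
            lines[i] += fragment                   # "paste" at the end of the label line
            # A line break inside an address always follows . - or @,
            # so only continue while the joined text ends with one of them.
            if not lines[i].endswith((".", "-", "@")):
                break
    return lines


sample = [
    f"{'name':<40}EMAIL:",
    f"{'street':<40}john.doe@",
    f"{'city':<40}example.",
    f"{'zip':<40}com",
]
for line in join_wrapped_emails(sample):
    print(repr(line))
```

Like the macro, the sketch has no rollback for the case where the third fragment still ends with a dot, a dash or @.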

I think the reformatting task could be done more easily using a macro with several regular expression replaces (UltraEdit tagged or Perl) using backreferences. But it is very difficult to help with this reformatting task without examples showing (nearly) real data lines before and after reformatting. Otherwise we have to make up our own example lines, and as the past has proven many times, those most likely do not represent the real data. So for better help, post two code blocks showing lines before and after reformatting with all possible variations (lines not to modify, lines with email address over two lines, lines with email address over 3 lines, etc.).

BTW: The dot in Perl/Unix syntax has no special meaning inside a character class (square brackets) and therefore need not be escaped with a backslash there. Escaping a dot so that it is interpreted as a literal character is necessary in a Perl/Unix regular expression only outside a character class.
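
The point about the dot inside a character class can be verified quickly. The sketch below uses Python's `re` module, which follows the same rule as Perl here. The helper name `same_matches` is made up for the check.

```python
import re

escaped = re.compile(r"[\.\-@]")    # dot escaped, as in the original macro
unescaped = re.compile(r"[.\-@]")   # dot unescaped, as in the rewritten macro


def same_matches(chars):
    """True if both classes accept and reject exactly the same characters."""
    return all(
        bool(escaped.fullmatch(c)) == bool(unescaped.fullmatch(c)) for c in chars
    )


print(same_matches(".-@xyz019 "))
print(unescaped.fullmatch("x"))  # the unescaped dot does not act as a wildcard
```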

Art · Feb 03, 2017 · #8 · 2017-02-03T14:12+00:00

Well, actually I managed to do it without the macro, using 4 expressions for different patterns and without the column setting. For example, this expression corrects the email address in the right column (separated by TABs) when the email address and the label are spread over 3 lines:

InsertMode
ColumnModeOff
HexOff
Top
PerlReOn
Find RegExp "\tEMAIL:\r\n^(.*?)[ \t]+([0-9a-z\-_.]{1,100}\@[0-9a-z\-_.]{1,100}\.[a-z]{2,6})"
Replace All "\t<emailnew>\L\2\E</emailnew>\r\n\1\r\n"
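
For comparison, the same find and replace can be sketched in Python. Python's `re` has no `\L...\E` case conversion in the replacement string, so the sketch lowercases the address in a replacement function instead. It assumes a case-insensitive search, and the function name `tag_email` and the sample record are invented for the illustration.

```python
import re

# Same pattern as in the macro, with re.MULTILINE so that ^ matches after \r\n.
EMAIL_BELOW_LABEL = re.compile(
    r"\tEMAIL:\r\n^(.*?)[ \t]+([0-9a-z\-_.]{1,100}@[0-9a-z\-_.]{1,100}\.[a-z]{2,6})",
    re.IGNORECASE | re.MULTILINE,
)


def tag_email(text):
    """Move the address up to the label line, lowercase it and wrap it in a tag."""
    def repl(m):
        # m.group(2).lower() stands in for \L\2\E of the Perl replace string.
        return f"\t<emailnew>{m.group(2).lower()}</emailnew>\r\n{m.group(1)}\r\n"
    return EMAIL_BELOW_LABEL.sub(repl, text)


# Hypothetical record: label on one line, address in the right column below it.
sample = "Name\tEMAIL:\r\nleft text\tJohn.Doe@Example.COM\r\n"
print(repr(tag_email(sample)))
```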

Thanks for your help!