How to use regular expressions to create a hierarchical function list for Proton Basic files?

JohnBarrat · Nov 19, 2017, 16:45 UTC · #1

I am very new to regular expressions and have just spent a couple of days trying to get to grips with the concept which is very different from languages I have used before. I still haven't discovered a way to write a regexp to extract specific data from the following type of expression:

Proc LoadTask(TaskID As Word, TaskAddr As Dword, Priority As Word, Start As Word)

I want to extract the command i.e. LoadTask
and its associated parameters i.e. TaskID, TaskAddr, Priority, Start
so I can build a function list in the form:

LoadTask
TaskID
TaskAddr
Priority
Start

Name
Param1
Param2
etc.

Is this possible? If so, I would appreciate a pointer on how to go about it.

Mofi · Nov 19, 2017, 18:09 UTC · #2

You could use, for example, the following in the *.uew file used to syntax highlight the language:

Code:

/TGBegin "Procedures"
/TGFindStr = "^[\t ]*proc[\t ]+([a-z_][0-9a-z_]*)"
/TGBegin "Parameters"
/TGFindStr = "\s*([a-z_][0-9a-z_]*)[\t ]+As[\t ]+[a-z]+,?"
/TGFindBStart = "\("
/TGFindBEnd = "\)"
/TGEnd
/TGEnd
/Regexp Type = Perl

This block contains two case-insensitive Perl regular expression search strings.

The first one is for finding procedures.

^ ... start each search at the beginning of a line.

[\t ]* ... find a horizontal tab or space zero or more times. So a procedure line can have leading tabs/spaces.

proc ... after optional leading tabs/spaces the line must have the keyword Proc.

[\t ]+ ... find a horizontal tab or space one or more times.

(...) ... capturing/tagging group. The string found by the expression inside the group should be displayed in the hierarchical function list.

[a-z_] ... find case-insensitive a single letter in range A-Z or an underscore.

[0-9a-z_]* ... find case-insensitive any digit or letter in range A-Z or underscore zero or more times.

The inner group first defines a block start and a block end condition with \( and \). The block searched with the second regular expression starts after the first opening ( found from the beginning of a matched procedure line and ends before the first closing ) found after that opening (. For the example procedure line, the block searched with the second regular expression is:

TaskID As Word, TaskAddr As Dword, Priority As Word, Start As Word

The meaning of second Perl regular expression is:

\s* ... find any whitespace character according to the Unicode standard, including newline characters, zero or more times. So the parameters of a procedure can also span multiple lines.

(...) ... once again a capturing group for getting just the string matched by the expression inside this group displayed in hierarchical function list.

[a-z_] ... find case-insensitive a single letter in range A-Z or an underscore.

[0-9a-z_]* ... find case-insensitive any digit or letter in range A-Z or underscore zero or more times.

[\t ]+ ... find a horizontal tab or space one or more times.

As ... the next word must be As, in any letter case.

[a-z]+ ... find case-insensitive any letter in range A-Z one or more times to match the type of the variable.

,? ... find a comma zero or one times, i.e. optionally matching a single comma.
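
The two-level search described above can be tried outside UltraEdit with a short Python sketch. This is only an approximation: Python's re module is not the Perl engine used by UltraEdit, and the function name list_procedures is made up for this illustration. It finds each Proc line, takes the text between the first ( and the next ), and collects the parameter names from that block.

```python
import re

# Same patterns as in the wordfile. re.IGNORECASE stands in for the
# case-insensitive search; re.MULTILINE lets ^ match at each line start.
PROC_RE = re.compile(r"^[\t ]*proc[\t ]+([a-z_][0-9a-z_]*)",
                     re.IGNORECASE | re.MULTILINE)
PARAM_RE = re.compile(r"\s*([a-z_][0-9a-z_]*)[\t ]+As[\t ]+[a-z]+,?",
                      re.IGNORECASE)


def list_procedures(source):
    """Return a list of (procedure name, [parameter names]) tuples."""
    result = []
    for proc in PROC_RE.finditer(source):
        params = []
        # Block start: first "(" after the procedure name.
        start = source.find("(", proc.end())
        # Block end: first ")" after that opening parenthesis.
        end = source.find(")", start + 1) if start != -1 else -1
        if end != -1:
            params = PARAM_RE.findall(source[start + 1:end])
        result.append((proc.group(1), params))
    return result


sample = ("Proc LoadTask(TaskID As Word, TaskAddr As Dword, "
          "Priority As Word, Start As Word)")
for name, params in list_procedures(sample):
    print(name)
    for param in params:
        print("    " + param)
```

For the sample line this prints LoadTask followed by the four indented parameter names, which is the shape of the hierarchical list asked for in the first post.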

JohnBarrat · Nov 19, 2017, 19:03 UTC · #3

Thank you very much for your quick response and looking through your explanation it is starting to make sense.
However, I added it to my uew file and saved it but I am getting nothing back when I run it against a bit of sample code.

This is the header of my uew file. I tried ^[t\ ] ahead of other TGFindStr but they disappeared from the function list.

Code:

/L20"ProtonBasic" Nocase String Chars = " File Extensions = bas inc
/TGBegin "Includes"
/TGFindStr = "^Include"
/TGEnd
/TGBegin "Declares"
/TGFindStr = "^declare"
/TGEnd
/TGBegin "Constants"
/TGFindStr = "^symbol"
/TGEnd
/TGBegin "Variables"
/TGFindStr = "dim [a-z][0-9a-z_]"
/TGEnd
/TGBegin "Aliases"
/TGFindStr = " ^ symbol"
/TGEnd
/TGBegin "Labels"
/TGFindStr = "^ [a-z][0-9a-z_]*:"
/TGEnd
/TGBegin "Procedures"
/TGFindStr = "^[\t ]*Proc[\t ]+([a-z_][0-9a-z_]*)"
/TGBegin "Parameters"
/TGFindStr = "\s*([a-z_][0-9a-z_]*)[\t ]+As[\t ]+[a-z]+,?"
/TGFindBStart = "\("
/TGFindBEnd = "\)"
/TGEnd
/TGEnd
/TGBegin "Macros"
/TGEnd
/TGBegin "Data Labels"
/TGEnd
/Regexp Type = perl
/Line Comment = '
/Line Comment Alt = ;
/Line Comment Sytle = "italic"
/Delimiters = ~!@%^&*()-+=|\/{}[]:;"' , .?
/Open Brace Strings =  "{" "(" "[" "<"
/Close Brace Strings = "}" ")" "]" ">"
/Indent Strings =     "if"    "proc"    "isr"    "do"   "while" "repeat" "for"  "select"
/Unindent Strings =   "endif" "endproc" "endisr" "loop" "wend"  "until"  "next" "endselect"
/Open Fold Strings =  "if"    "proc"    "isr"    "do"   "while" "repeat" "for"  "select" 
/Close Fold Strings = "endif" "endproc" "endisr" "loop" "wend"  "until"  "next" "endselect"
/Ignore Fold Strings ="break"

Am I missing something?

I should add, I am getting stuff back for the other FindStr expressions, albeit more than I actually need.

JohnB

Mofi · Nov 19, 2017, 20:33 UTC · #4

I posted /Regexp Type = Perl and you inserted /Regexp Type = perl into your wordfile. The keywords in syntax highlighting wordfiles are case-sensitive. So UltraEdit does not recognize perl, and the regular expressions are interpreted by the UltraEdit regular expression engine instead of the Perl regular expression engine.

I suggest using the following in your wordfile:

Code:

/L20"ProtonBasic" Nocase String Chars = " Line Comment = ' Line Comment Alt = ; File Extensions = BAS INC
/Delimiters = ! " %&'()*+,-./:;=?@[\]^{|}~
/TGBegin "Includes"
/TGFindStr = "^Include"
/TGEnd
/TGBegin "Declares"
/TGFindStr = "^declare"
/TGEnd
/TGBegin "Constants"
/TGFindStr = "^symbol"
/TGEnd
/TGBegin "Variables"
/TGFindStr = "dim [a-z][0-9a-z_]"
/TGEnd
/TGBegin "Aliases"
/TGFindStr = " ^ symbol"
/TGEnd
/TGBegin "Labels"
/TGFindStr = "^ [a-z][0-9a-z_]*:"
/TGEnd
/TGBegin "Procedures"
/TGFindStr = "^[\t ]*Proc[\t ]+([a-z_][0-9a-z_]*)"
/TGBegin "Parameters"
/TGFindStr = "\s*([a-z_][0-9a-z_]*)[\t ]+As[\t ]+[a-z]+,?"
/TGFindBStart = "\("
/TGFindBEnd = "\)"
/TGEnd
/TGEnd
/Regexp Type = Perl
/Open Brace Strings = "{" "(" "[" "<"
/Close Brace Strings = "}" ")" "]" ">"
/Indent Strings = "if" "proc" "isr" "do" "while" "repeat" "for" "select"
/Unindent Strings = "endif" "endproc" "endisr" "loop" "wend" "until" "next" "endselect"
/Open Fold Strings = "if" "proc" "isr" "do" "while" "repeat" "for" "select" 
/Close Fold Strings = "endif" "endproc" "endisr" "loop" "wend" "until" "next" "endselect"
/Ignore Fold Strings = "break"

Note: The character between " and % in the second line must be a tab character and not a space character.

The regular expression search string " ^ symbol" does not make much sense to me. ^ is interpreted as the beginning of a line, so there can never be a space character to the left of it.
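
A quick way to see this is the following Python sketch. Python's re module is used only as a stand-in for UltraEdit's Perl engine, the two sample lines are invented, and the alternative pattern with optional leading whitespace is a guess at what was intended, not something taken from the wordfile.

```python
import re

text = "Symbol LED = PORTB.0\n Symbol Flag = 1\n"

# A literal space is required immediately before the line start, which
# cannot happen: the character before a line start is a newline (or nothing).
impossible = re.compile(r" ^ symbol", re.IGNORECASE | re.MULTILINE)
print(impossible.search(text))        # None

# Allowing optional leading tabs/spaces AFTER ^ matches both lines.
leading_ws = re.compile(r"^[\t ]*symbol", re.IGNORECASE | re.MULTILINE)
print(len(leading_ws.findall(text)))  # 2
```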

JohnBarrat · Nov 19, 2017, 22:41 UTC · #5

Thank you very much, it is now looking much more like what I am looking for. I hadn't realized the keywords were case-sensitive, which explains why everything I tried seemed at odds with what I was seeing when I tried it in the RegExr tester. Hopefully, I won't have to bother you again now that you have set me off in the right direction.

JohnB

Nov 20, 2017, 19:05 UTC · #6

I have one more question.

Variables take the form: Dim VarName As word
Aliases can take the form: Dim VarName As VarName

This is the expression I have written to find just the variables.

Code:

^[\t ]*Dim[\t ]+([a-z_][0-9a-z_]*)[\t ]+as[t\ ]+[bit|byte|word|dword|float|sbyte|sword|sdword]*

This returns all Dim regardless of the text following the VarName.

What do I need to do to filter out the aliases?

JohnB

Mofi · Nov 21, 2017, 05:52 UTC · #7

I suggest using:

^[\t ]*Dim[\t ]+([a-z_][0-9a-z_]*)[\t ]+as[\t ]+(?:bit|byte|word|dword|float|sbyte|sword|sdword)\>

(?:...) ... non capturing group with an OR expression of several words. One of the words in the OR expression inside non-capturing group must be found. Question mark and colon after opening parenthesis makes this group a non-capturing group.

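[ ... ] ... is a character class definition, i.e. a list of characters from which one of the characters in this list must be found. So [bit|byte|...] in your expression matches a single character from that list and not one of the words.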

\> ... end of word, so that names which merely start with a type name, like ByteCount, are not matched as a type. \b ... word boundary could also be used.
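
The difference between the character class and the non-capturing OR group can be checked with a small Python sketch. It is an approximation only: Python's re module has no end-of-word escape, so \b is used in place of \>, and the sample Dim lines are invented.

```python
import re

TYPES = r"(?:bit|byte|word|dword|float|sbyte|sword|sdword)"
# \b replaces \> because Python's re has no end-of-word escape.
VARIABLE_RE = re.compile(
    r"^[\t ]*Dim[\t ]+([a-z_][0-9a-z_]*)[\t ]+as[\t ]+" + TYPES + r"\b",
    re.IGNORECASE | re.MULTILINE,
)

sample = """Dim Counter As Word
Dim Total As ByteCount
    Dim Ratio As Float
"""
print(VARIABLE_RE.findall(sample))  # ['Counter', 'Ratio']

# A character class matches ONE character from the set, not a word.
print(bool(re.fullmatch(r"[bit|byte|word]", "w")))     # True
print(bool(re.fullmatch(r"[bit|byte|word]", "word")))  # False
```

Total is skipped because ByteCount starts with a type name but does not end where the type name ends.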

JohnBarrat · Nov 21, 2017, 11:56 UTC · #8

Mofi · Nov 21, 2017, 12:54 UTC · #9

For aliases: ^[\t ]*Dim[\t ]+([a-z_][0-9a-z_]*)[\t ]+as[\t ]+(?!(?:bit|byte|word|dword|float|sbyte|sword|sdword)\>)

(?!...) ... negative lookahead.

Please note that a lookbehind must be of fixed length, i.e. the search function must know how many characters to look behind. For that reason + or * can't be used in a lookbehind expression. A lookahead like the one used here has no such restriction. An OR expression with each argument of fixed length is possible in both cases because the Perl regular expression engine can determine how many characters to look ahead or behind from the OR arguments.
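
The alias expression can be tried the same way in Python. Again this is only a sketch: \b stands in for \> because Python's re module has no end-of-word escape, and the sample lines are invented.

```python
import re

ALIAS_RE = re.compile(
    r"^[\t ]*Dim[\t ]+([a-z_][0-9a-z_]*)[\t ]+as[\t ]+"
    # Negative lookahead: succeed only if NO built-in type name follows.
    r"(?!(?:bit|byte|word|dword|float|sbyte|sword|sdword)\b)",
    re.IGNORECASE | re.MULTILINE,
)

sample = """Dim Counter As Word
Dim Total As ByteCount
Dim Speed As Counter
"""
print(ALIAS_RE.findall(sample))  # ['Total', 'Speed']
```

One caveat seen in this sketch: if more than one space or tab follows As, the [\t ]+ in front of the lookahead can give back a character, so that the lookahead is tested at a space and succeeds even for a built-in type. Adding the whitespace to the lookahead, as in (?![\t ]|(?:bit|byte|...)\b), is one way to guard against that.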

JohnBarrat · Nov 21, 2017, 13:47 UTC · #10

Thanks so much.

JohnB