Function List - COBOL

rmbunton · Aug 03, 2011 · #1 (2011-08-03T16:17+00:00)

I read an earlier discussion thread on COBOL function lists in UltraEdit and played around with the regex from that discussion. My goal is to have a function list for COBOL that lists all paragraphs in the procedure division. I do not have a problem with the "exit" paragraphs showing up. I started out with the regex that came with the COBOL wordfile that I downloaded from the IDM webpages; it is the first line of my current function list regex, which is shown below:

%[ ^t]+^([a-z^-]+^) ^{division^} ^{section^}
%[ ^t]++^([0-9a-z-_]++[~end-if .]^).^
% 0000-^main

I left the first regex line in because I like to have the COBOL sections and divisions as a point of reference. The second regex line seems to get most of the COBOL paragraphs to show up. For your information, I used "[~end-if .]" to get rid of a bunch of "end-if" entries showing up in my list. The third regex line is there to get the "0000-main" paragraph to show up in the list. Below is a code snippet from the COBOL program, showing only the paragraphs.

0000-main. <----- doesn't find with just the second regex line. why??
0010-server-init.
0020-assign-input-fields.
0030-assign-output-fields.
0040-standardize-sp-params.
0050-validate-logon. <----- doesn't find with all 3 regex lines. why??
0055-get-users-hipaa-unit.
0060-process-request.
0070-LOG-ACTIONS.

So, I guess I have two questions:

Why does the second line of my regex not find paragraph "0000-main."?
Why does paragraph "0050-validate-logon." not show up in my list?

Mofi · Aug 03, 2011 · #2 (2011-08-03T17:31+00:00)

Here is an explanation of the 3 UltraEdit regular expressions you used and of the problems with each.

1) %[ ^t]+^([a-z^-]+^) ^{division^} ^{section^}

% ... start the search at beginning of a line.

[ ^t]+ ... find 1 or more spaces/tabs at beginning of a line. So the line with the word division or section must have preceding whitespace. If that is not always true, you should append a second + to change the meaning to 0 or more spaces/tabs at beginning of a line.

^([a-z^-]+^) ... matches a string consisting only of letters A-Z, a-z and the hyphen character and this part of the found string is tagged and therefore only this part of the found string is displayed in the function list view.

The next character must be a single space.

Then either the word division or the word section must follow, in any letter case. The space character between ^} and ^{ should be removed because there should be no space between the 2 arguments of the OR expression.

This expression will therefore not find any of the lines in your example.

2) %[ ^t]++^([0-9a-z-_]++[~end-if .]^).^

% ... again start the search at beginning of a line.

[ ^t]++ ... find 0 or more spaces/tabs at beginning of a line, in other words preceding whitespaces are allowed and should be ignored.

[0-9a-z-_]++ ... should find a string consisting of only letters, digits, the hyphen and the underscore. The hyphen character normally has no special meaning, except inside square brackets, where it means FROM x TO y. Usually the hyphen is used for something like 0-9 or a-z in square brackets, but it can also define a range such as !-/, which means all characters in the ASCII (or ANSI or Unicode) table from the exclamation mark to the slash. Therefore, inside square brackets the hyphen should always be escaped with ^ when the hyphen character itself is meant. In this expression the hyphen is certainly read as a literal hyphen because the preceding letter z already belongs to a character range definition, but it should nevertheless be escaped with a preceding ^. Also of interest is that ++ is used instead of just +. ++ means this part of the found string can also be an empty string with no characters. I'm quite sure that this is not intended.

[~end-if .] ... is, in combination with the previous expression, the reason why the two lines you marked are not found. This expression means that the next character after 0 or more letters, digits, hyphens or underscores must NOT be the character e or E, the character n or N, one of the characters D to I in any case, the character f or F, a space or a point. [~...] does not mean NOT the string in the square brackets; it means not any character listed inside the square brackets, with character ranges also possible. Because a point must follow immediately, the character tested against this set is effectively the last character of the paragraph name, and both 0000-main and 0050-validate-logon end with n. Combined with the previous expression, the result is a definition of overlapping character classes, which is never good because the behavior looks weird to beginners although it is fully explainable by experts in regular expressions.

The next character must be a point and the final escape character ^ is completely useless here.
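
The effect of the negated set can be reproduced outside UltraEdit. The sketch below is only an approximation: it rewrites the UltraEdit expression in Python `re` syntax (`%` as `^`, `^t` as `\t`, `++` as `*`, `[~...]` as `[^...]`) and assumes Python evaluates the character set the same way UltraEdit does. The variable names are made up for illustration.

```python
import re

# Assumed Python rendering of  %[ ^t]++^([0-9a-z-_]++[~end-if .]^).
# Inside the set, "d-i" is a range, so the set excludes
# e, n, d..i, f, the space and the point.
pattern = re.compile(r"^[ \t]*([0-9a-z\-_]*[^end-if .])\.", re.IGNORECASE)

lines = [
    " 0000-main.",
    " 0010-server-init.",
    " 0020-assign-input-fields.",
    " 0030-assign-output-fields.",
    " 0040-standardize-sp-params.",
    " 0050-validate-logon.",
    " 0055-get-users-hipaa-unit.",
    " 0060-process-request.",
    " 0070-LOG-ACTIONS.",
]

found = []
missed = []
for line in lines:
    match = pattern.search(line)
    if match:
        found.append(match.group(1))
    else:
        missed.append(line.strip(" ."))

print(missed)  # ['0000-main', '0050-validate-logon']
```

Both missed names end with n, one of the letters the set rejects; every name that is found ends with s or t.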

3) % 0000-^main

Well, this regular expression simply finds lines starting with " 0000-main". The escape character ^ is useless here because the next character m has no special meaning in a regular expression, and the hyphen to its left is not inside square brackets and therefore needs no escape character either to be interpreted as a hyphen.

I don't know anything about the syntax of COBOL, but I suggest the following:

/Function String = "%[ ^t]++^([a-z^-]+^) +^{division^}^{section^}"
/Function String 1 = "%[ ^t]++^([0-9]+-[a-z][a-z^-]++^)."

The second expression means in words:

Find a string at beginning of a line,
with optionally preceding whitespaces,
with a number with at least 1 digit,
and a hyphen character after the number,
and next character is a letter,
and zero or more additional letters or hyphens,
and a point.
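
The steps listed above can be tried out in Python. This is a sketch under the assumption that the UltraEdit expression translates to the Python pattern shown in the comment; `paragraph_name` is a hypothetical helper, not part of any wordfile.

```python
import re

# Assumed Python rendering of  %[ ^t]++^([0-9]+-[a-z][a-z^-]++^).
paragraph = re.compile(r"^[ \t]*([0-9]+-[a-z][a-z\-]*)\.", re.IGNORECASE)

def paragraph_name(line):
    """Return the tagged paragraph name, or None if the line does not qualify."""
    match = paragraph.match(line)
    return match.group(1) if match else None

print(paragraph_name(" 0000-main."))            # 0000-main
print(paragraph_name(" 0050-validate-logon."))  # 0050-validate-logon
print(paragraph_name("     end-if."))           # None
print(paragraph_name(" 0100-step2."))           # None
```

The last call shows a limitation of this pattern: a name with a digit after the hyphen is rejected because only letters and hyphens are allowed there.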

This expression may already exclude strings which should be found, and may find strings which should be ignored. If you can describe in words the syntax rules for function strings in COBOL, post an example as a code block containing strings which should be found and strings which should not be found, and post in a second code block what you want to see in the function list view for this example, we can find a better regular expression.

A lookbehind to exclude end-if. is not possible with the UltraEdit regular expression engine. But there are possibly other methods to exclude them or we change to the Perl regular expression engine which supports lookbehinds to evaluate a found string already within the search for ignoring it. An extreme example of such a lookbehind usage in Perl regular expression function strings can be viewed at topic How to define a case sensitive function string search?

rmbunton · Aug 05, 2011 · #3 (2011-08-05T22:01+00:00)

Mofi, first off, thank you very much for your help.

...

Here is some information on COBOL paragraph names. Got this from googling COBOL paragraph name length:

ILE COBOL Language Reference - COBOL Words

A COBOL paragraph is a COBOL word.

COBOL words must be character-strings from the set of letters, digits, the hyphen, and the underscore. (The hyphen and the underscore cannot appear as the first or last character, however.) In the ILE COBOL language, each lowercase letter is generally equivalent to the corresponding uppercase letter.
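
The character rules quoted above can be written as a small check. This is a minimal sketch covering only those rules (allowed characters, no hyphen or underscore at either end); anything else the language reference may require, such as reserved words, is not handled. `is_cobol_word` is a hypothetical name.

```python
import re

# Letters, digits, hyphen and underscore; the first and last character
# must be a letter or a digit.
COBOL_WORD = re.compile(r"[0-9a-z](?:[0-9a-z_-]*[0-9a-z])?", re.IGNORECASE)

def is_cobol_word(text):
    """True if text satisfies the character rules quoted above."""
    return COBOL_WORD.fullmatch(text) is not None

print(is_cobol_word("0000-main"))         # True
print(is_cobol_word("0070-LOG-ACTIONS"))  # True
print(is_cobol_word("-main"))             # False
print(is_cobol_word("main_"))             # False
```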

COBOL paragraphs are executable procedures that can be invoked or referenced in a COBOL "PERFORM" statement.

...

I took your suggestion to change regex engines from UltraEdit to Perl. Here is where I am, from an implementation point of view. I have updated my UltraEdit wordfile for COBOL to support Perl. Below is a code snippet of the changes in the wordfile. BTW, I got the COBOL wordfile from the UltraEdit download webpages.

Code: Select all

/L20"Cobol" COBOL_LANG Line Comment Num = 2*  Nocase File Extensions = CBL COB CPY
/TGBegin "Function"
/TGFindStr = "^[ \t]*([a-z\-]+) +(division|section)"
/TGFindStr = "^[ \t]*([0-9a-z]+-[0-9a-z][0-9a-z\-]*)\.(?<!end-if\.)(?<!end-exec\.)(?<!end-perform\.)(?<!program-id\.)(?<!date-written\.)(?<!date-compiled\.)(?<!source-computer\.)(?<!object-computer\.)(?<!special-names\.)(?<!file-control\.)"
/TGEnd
/Regexp Type = Perl
/Delimiters = ~!@$%^&*()_+=|\/{}[]:;"'<> ,	.?/
/C1
accept access acquire actual add address advancing after all allowing alphabet alphabetic alphabetic-lower alphabetic-upper alphanumeric
...

The first regex line gets most of the standard COBOL divisions and sections. A COBOL program is logically divided into "divisions" which contain "sections". These are nice to have in a source outline.
The second regex line is my current attempt to get the COBOL "procedure" division paragraphs. These paragraphs represent the COBOL program's executable code, like the methods in a C++ object, except that there is no concept of parameter passing. All variables are global.
I have also attempted to eliminate some of the COBOL reserved words, for example "end-if", "end-exec", and "end-perform". COBOL reserved words are part of the ANSI approved syntax. So with these two regex statements things seem to work well. Now I am not sure if they are syntactically correct. But with the above descriptions of COBOL words and paragraphs you might be able to tell me if I am close or not. Thanks in advance...

...

The following COBOL code snippet is what I am using for building my test outline:

Code: Select all

 identification division.
     program-id.
     author.
     installation.
     date-written.
     date-compiled.

 environment division.

 configuration section.
     source-computer.
     object-computer.

     special-names.

 input-output section.
    file-control.

 data division.

 file section.

 working-storage section.

 linkage section.


 procedure division

 0000-main.

     if value-eof
        continue
     else
        if value-duplicate
           move zero to sqlcode
        else
           if value-failure
              move 999
           end-if
        end-if
     end-if.

 0010-server-init.

      exec sql
         select aaaa
         from bbbbbbb
      end-exec.

 0020-assign-input-fields.
 0030-assign-output-fields.
 0040-standardize-sp-params.
 0050-validate-logon.
 0055-get-users-hipaa-unit.
 0060-process-request.
 0070-LOG-ACTIONS.

Mofi · Aug 06, 2011 · #4 (2011-08-06T11:15+00:00)

Very good job! The Perl regular expressions are syntactically absolutely correct. This topic can now be helpful for other COBOL programmers using UltraEdit or UEStudio and wanting a well-working function list.

I have just a few suggestions for further improvement:

The underscore character should be added to the character classes because it is a possible character even when you don't use it.
The point at end of a paragraph can be specified in the second regular expression search string also at end of the expression which avoids the need to add the point to every negative lookbehind expression.
As long as you don't really want to use a grouped function list, it is better to use the old style function string definitions in the wordfile for backward compatibility. For you and all other users of UltraEdit v16.00 or later and UEStudio v10.00 or later it makes no difference when using the Flat List option, but it does for users of earlier versions of UE or UES, which do not support the new function string definitions.
The list of delimiter characters contains the underscore although, according to the definition of COBOL words, the underscore should not be a word delimiting character. Also, the slash character is present twice in the delimiters list.

So I suggest the following:

Code: Select all

/L20"Cobol" COBOL_LANG Line Comment Num = 2*  Nocase File Extensions = CBL COB CPY
/Delimiters = ~!@$%^&*()+=|\/{}[]:;"'<> ,	.?
/Regexp Type = Perl
/Function String = "^[ \t]*([a-z\-_]+) +(division|section)"
/Function String 1 = "^[ \t]*([0-9a-z]+-[0-9a-z][0-9a-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."
/C1
accept access acquire actual add address advancing after all allowing alphabet alphabetic alphabetic-lower alphabetic-upper alphanumeric
...

COBOL programmers copying above or below please note that the multiple spaces between , and . in the list of delimiter characters must be replaced by a tab character.

At least UltraEdit v13.10 or UEStudio v6.30 is needed for the above Perl regular expression function strings in the wordfile.
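
The Function String 1 expression above uses only constructs that Python's `re` module also understands, so it can be tried outside UltraEdit. The following is a rough sketch, assuming Python's negative lookbehind behaves like the Perl engine for this pattern; the sample text is a shortened copy of the test program from the earlier post.

```python
import re

EXCLUDED = [
    "end-if", "end-exec", "end-perform", "program-id", "date-written",
    "date-compiled", "source-computer", "object-computer",
    "special-names", "file-control",
]

# Same shape as Function String 1: one negative lookbehind per excluded
# word, followed by a single trailing point.
lookbehinds = "".join(f"(?<!{word})" for word in EXCLUDED)
function_string_1 = re.compile(
    r"^[ \t]*([0-9a-z]+-[0-9a-z][0-9a-z\-_]*)" + lookbehinds + r"\.",
    re.IGNORECASE | re.MULTILINE,
)

source = """\
 identification division.
     program-id.
     date-written.
 procedure division
 0000-main.
     if value-eof
        continue
     end-if.
 0010-server-init.
      exec sql
         select aaaa
      end-exec.
 0070-LOG-ACTIONS.
"""

print(function_string_1.findall(source))
# ['0000-main', '0010-server-init', '0070-LOG-ACTIONS']
```

Lines such as program-id. and end-if. are rejected by their lookbehind, and the engine cannot fall back to a shorter name because the point must follow the name directly.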

Perhaps it is useful to use grouped function strings for COBOL files by defining 1 group for functions, another one for sections and a third for divisions.

Code: Select all

/L20"Cobol" COBOL_LANG Line Comment Num = 2*  Nocase File Extensions = CBL COB CPY
/Delimiters = ~!@$%^&*()+=|\/{}[]:;"'<> ,	.?
/Regexp Type = Perl
/TGBegin "Sections"
/TGFindStr = "^[ \t]*([a-z\-_]+) +section"
/TGEnd
/TGBegin "Divisions"
/TGFindStr = "^[ \t]*([a-z\-_]+) +division"
/TGEnd
/TGBegin "Functions"
/TGFindStr = "^[ \t]*([0-9a-z]+-[0-9a-z][0-9a-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."
/TGEnd
/C1
accept access acquire actual add address advancing after all allowing alphabet alphabetic alphabetic-lower alphabetic-upper alphanumeric
...
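
How the three groups divide up a source file can be sketched in Python. This is an illustration only: the dictionary `GROUPS` and the function `grouped_function_list` are hypothetical, and the paragraph pattern is deliberately simplified (it requires leading digits instead of using the lookbehind exclusions).

```python
import re

FLAGS = re.IGNORECASE | re.MULTILINE

# One pattern per group, mirroring the three /TGBegin ... /TGEnd blocks.
GROUPS = {
    "Sections": re.compile(r"^[ \t]*([a-z\-_]+) +section", FLAGS),
    "Divisions": re.compile(r"^[ \t]*([a-z\-_]+) +division", FLAGS),
    # Simplified assumption: paragraph names start with digits.
    "Functions": re.compile(r"^[ \t]*([0-9]+-[0-9a-z][0-9a-z\-_]*)\.", FLAGS),
}

def grouped_function_list(source):
    """Collect the captured names of every group, in file order."""
    return {name: pattern.findall(source) for name, pattern in GROUPS.items()}

source = """\
 data division.
 working-storage section.
 procedure division
 0000-main.
     end-if.
 0010-server-init.
"""

for group, names in grouped_function_list(source).items():
    print(group, names)
```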

Perhaps you are interested in further enhancing the user contributed wordfile for COBOL, for example by adding indent/unindent and open/close fold strings to use the auto-indent and code folding features for COBOL files. Giving the color groups names would also be good. (I don't know what C3 to C5 are for.) end-exec is missing in the word list of color group 1, as I could see from your example code. And usually it is good to highlight the delimiter characters too. I have made all these enhancements (and sorted the delimiter characters) to cobol.uew from the wordfiles download page and attached the improved wordfile here.

The ILE COBOL documentation looks good for further improving the syntax highlighting wordfile by adding additional words (and perhaps removing undocumented words), at least for ILE COBOL. If you are interested in further enhancing the COBOL wordfile, then once you have finished it, please send the improved wordfile by email to IDM support with the request to replace the existing cobol.uew on their server. The wordfile sent to IDM should not contain any color and font style settings.

mturnb · Aug 01, 2014 · #5 (2014-08-01T21:42+00:00)

I found the cobol.uew very helpful and made one slight change to match most installations' standards, which require that the paragraph name start with a letter. That changed Function String 1 to:

Code: Select all

/Function String 1 = "^[ \t]*([a-z]+[0-9]+-[0-9a-z][0-9a-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."

However, I would like to know if it is possible to limit the structure such that the paragraph name must begin with a single letter and then be followed by numbers. Our standard is Annnn- for paragraph names so I would like to restrict to a single letter.

Also, instead of starting with ^[ \t]* is there any way to specify 7-10 spaces at the beginning of the line? That actually would make it much simpler than all the other items. COBOL has a requirement that only divisions, sections and paragraph names can begin in the 'A' margin which is defined as positions 8-11. If I could restrict this to just being 7-10 spaces, or really only 7 spaces, then I would not need most of the exclusions because they are not allowed in the 'A' margin.

Thank you very much for any assistance you can provide.

Mofi · Aug 02, 2014 · #6 (2014-08-02T14:43+00:00)

The requirement of a single letter at the beginning is easy to achieve by removing the + after the character set definition [a-z], as + means 1 or more. Without + there must be exactly 1 letter.

Code: Select all

/Function String 1 = "^[ \t]*([a-z][0-9]+-[0-9a-z][0-9a-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."

0-9 in the fourth and fifth character set definitions, and the entire definition [0-9], can also be replaced by \d, which likewise means any digit but is more difficult to understand for Perl beginners.

The other requirement, that 7-10 spaces must be present at the beginning of a line, is also easy to achieve.
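
The difference between [a-z]+ and [a-z], and the equivalence of [0-9] and \d, can be checked with a short Python sketch. The lookbehind exclusions are left out for brevity, and the paragraph names A1000-INIT and AB1000-INIT are invented examples of the Annnn- convention.

```python
import re

# re.ASCII limits \d to the digits 0-9, so it matches exactly what [0-9] does.
FLAGS = re.IGNORECASE | re.ASCII

one_or_more_letters = re.compile(r"^[ \t]*([a-z]+[0-9]+-[0-9a-z][0-9a-z\-_]*)\.", FLAGS)
exactly_one_letter = re.compile(r"^[ \t]*([a-z][0-9]+-[0-9a-z][0-9a-z\-_]*)\.", FLAGS)
with_backslash_d = re.compile(r"^[ \t]*([a-z]\d+-[\da-z][\da-z\-_]*)\.", FLAGS)

for line in ["       A1000-INIT.", "       AB1000-INIT."]:
    print(
        line.strip(),
        bool(one_or_more_letters.match(line)),
        bool(exactly_one_letter.match(line)),
        bool(with_backslash_d.match(line)),
    )
# A1000-INIT. True True True
# AB1000-INIT. True False False
```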

Code: Select all

/Function String 1 = "^ {7,10}([a-z]\d+-[\da-z][\da-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."

^ {7,10} means at beginning of a line there must be a minimum of 7 spaces, but not more than 10 spaces. Horizontal tabs are not allowed anymore.
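
The bounds of the {7,10} quantifier can be verified with a few generated lines. A minimal Python sketch, with the lookbehind exclusions omitted and an invented paragraph name; `starts_in_a_margin` is a hypothetical helper.

```python
import re

a_margin = re.compile(r"^ {7,10}([a-z]\d+-[\da-z][\da-z\-_]*)\.", re.IGNORECASE)

def starts_in_a_margin(line):
    """True if a paragraph name starts after 7 to 10 leading spaces."""
    return a_margin.match(line) is not None

for spaces in (6, 7, 10, 11):
    print(spaces, starts_in_a_margin(" " * spaces + "A1000-INIT."))
# 6 False
# 7 True
# 10 True
# 11 False

print(starts_in_a_margin("\tA1000-INIT."))  # False, tabs no longer count
```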

Code: Select all

/Function String 1 = "^(?: {7,10}|\t{1,2})([a-z]\d+-[\da-z][\da-z\-_]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)\."

(?: {7,10}|\t{1,2}) is an OR expression in a non marking group (because of ?: after opening parenthesis) which requires 7 to 10 spaces OR 1 or 2 horizontal tabs at beginning of a line for a positive match.
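
That the non marking group takes part in the match without producing a tagged string of its own can be seen in Python, which uses the same (?:...) syntax. A small sketch with an invented paragraph name and without the lookbehind exclusions:

```python
import re

indent = re.compile(
    r"^(?: {7,10}|\t{1,2})([a-z]\d+-[\da-z][\da-z\-_]*)\.", re.IGNORECASE
)

samples = {
    "7 spaces": " " * 7 + "A1000-INIT.",
    "1 tab": "\tA1000-INIT.",
    "2 tabs": "\t\tA1000-INIT.",
    "3 tabs": "\t\t\tA1000-INIT.",
    "space + tab": " \tA1000-INIT.",
}

for label, line in samples.items():
    match = indent.match(line)
    # groups() holds only the paragraph name; the indentation is not captured.
    print(label, match.groups() if match else None)
```

Only the first three samples match, and each yields the single group ('A1000-INIT',); a mix of spaces and tabs is rejected because each alternative allows only one kind of character.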

mturnb · Aug 11, 2014 · #7 (2014-08-11T20:09+00:00)

Thank you so much. That ended up making the function list perfect for me. I'm not a Perl programmer at all so I was struggling with getting the function list to work properly.

rmbunton · Mar 20, 2015 · #8 (2015-03-20T19:42+00:00)

Good first day of spring. I am back.

To recap, I have used my previously posted COBOL wordfile for some time now.

Recently I have started working with some IBM COBOL programs. These are older programs that have sequencing numbers in the source code. I believe they were once represented by a deck of 80 column cards. Just in case you dropped the deck of cards and needed to sort them.

The following code snippet represents a stripped down version of some COBOL program with just the lines I want in the Function List.

Code: Select all

000100 IDENTIFICATION DIVISION.                                         00010099

016800 ENVIRONMENT DIVISION.                                            01680099

016900 CONFIGURATION SECTION.                                           01690099

017200 DATA DIVISION.                                                   01720099

017300 WORKING-STORAGE SECTION.                                         01730099

062900 LINKAGE SECTION.                                                 06290099

064300 PROCEDURE DIVISION.                                              06430099

064500 0100-STARTUP-PROCESS.                                            06450099

075000 0100-EXEC-CICS-IGNORE-RETURN.                                    07500099

076300 0100-FINISHED-EXIT.                                              07630099

076700 0110-INITIALIZE-TCPIP.                                           07670099

078200 0110-INITIALIZE-EXIT.                                            07820099

078600 0130-GETHOSTBYNAME.                                              07860099

087000 0150-CONNECT-WITH-TANDEM.                                        08700099

087300 0200-COMMUNICATE-WITH-TANDEM.                                    08730099

089000 0200-COMMUNICATE-EXIT.                                           08900099

089300 0210-DETERMINE-HX-REQUEST.                                       08930099

102400 0220-INQU-OR-CALL-HX-REQUEST.                                    10240099

106100 0230-UPDT-OR-CHG-CHRG-REQUEST.                                   10610099

109100 0240-SEND-REQUEST-TO-TANDEM.                                     10910099

117800 0242-CONVERT-ASCII.                                              11780099

120900 0240-SEND-REQUEST-EXIT.                                          12090099

121200 0245-SEND-CHG-CHRG-TO-TANDEM.                                    12120099

123300 0246-CONVERT-ASCII.                                              12330099

124000 0249-SEND-DATA.                                                  12400099

126700 0245-SEND-CHG-CHRG-EXIT.                                         12670099

127000 0247-RECEIVE-TANDEM-HX-DATA.                                     12700099

127400 0248-READ-DATA.                                                  12740099

135400 0249-CONVERT-EBCDIC.                                             13540099

139600 0247-RECEIVE-EXIT.                                               13960099

139900 0248-RECV-COMPLETELY.                                            13990099

146900 0250-HLS-INFO-TO-CICS-COMMAREA.                                  14690099

156000 0250-HLS-INFO-TO-CICS-EXIT.                                      15600099

156300 0255-CLOSE-SOCKET.                                               15630099

157300  0255-CLOSE-SOCKET-EXIT.                                         15730099

157600  0260-TERMINATE-SOCKET.                                          15760099

157800  0260-TERMINATE-SOCKET-EXIT.                                     15780099

158100 0300-RETURN.                                                     15810099

160300 0320-SEND-TERMINATE-MSG.                                         16030099

164800 0320-SEND-TERMINATE-EXIT.                                        16480099

165100 0800-CHECK-CHARTLESS.                                            16510099

181400 0800-CHECK-CHARTLESS-EXIT.                                       18140099

182000  INIT-ERROR-RTN.                                                 18200099

183400  EST-SOCKET-ERROR-RTN.                                           18340099

184800  CONN-SOCKET-ERROR-RTN.                                          18480099

186200  QUERY-SOCKET-ERROR-RTN.                                         18620099

187600  SEND-DATA-ERROR-RTN.                                            18760099

189000  READ-DATA-ERROR-RTN.                                            18900099

190400  CLOSE-SOCKET-ERROR-RTN.                                         19040099

192200  GETHOST-ERROR-RTN.                                              19220099

193800  EZACIC08-ERROR-RTN.                                             19380099

195400 WRITE-ERROR-TO-QUEUE.                                            19540099

200500 GET-SEND-TIME.                                                   20050099

201500 GET-RECEIVE-TIME.                                                20150099

202600 COMPUTE-RESP-TIME.                                               20260099

203100 MOVE-RESP-DETAILS.                                               20310099

204300 NOSPACE-HLD2.                                                    20430099

205600 CBID-ERR.                                                        20560099

206700 INV-REQ.                                                         20670099

208500 COMMON-ALERT-MSG.                                                20850099

211600 COMMON-ALERT-MSG-EXIT.                                           21160099

212000 SEND-ALERT.                                                      21200099

213100 SEND-ALERT-CONTINUE.                                             21310099

213500 SEND-ALERT-EXIT.                                                 21350099

213800 SEND-ALERT2.                                                     21380099

215200 SEND-ALERT2-EXIT.                                                21520099

215500 0999-SNAP-DUMP.                                                  21550099

216300 0999-SNAP-DUMP-EXIT.                                             21630099

216600 9999-ABEND.                                                      21660099

My problem is that the sequence numbers in the first 6 columns of the source code are throwing the regex into a tailspin. I essentially get nothing back for a function list. I have played around with some newer regexes, with mixed results. Anyway, I was wondering if someone could help me get a function list for code that has sequence numbers. I am willing to use 2 different wordfiles if necessary to support this. My first wordfile was used for HP Tandem COBOL programs. This new request is for IBM server COBOL programs.

Below is a snippet from my newest wordfile.

Code: Select all

/TGBegin "Function"
/TGFindStr = "^(?: {1,6}|[0-9]{1,6})*([a-z\d\-]+) +(division|section)"
/TGFindStr = "^(?: {1,6}|[0-9]{1,6})*([0-9a-z]+-[0-9a-z][0-9a-z\-]*)\.(?<!end-if\.)(?<!end-exec\.)(?<!end-perform\.)(?<!program-id\.)(?<!date-written\.)(?<!date-compiled\.)(?<!source-computer\.)(?<!object-computer\.)(?<!special-names\.)(?<!file-control\.)(?<!exit Section\.$)"
/TGEnd

The following are my current results for the Function List. Bottom line, I am missing the bulk of the function list.

IDENTIFICATION DIVISION
ENVIRONMENT DIVISION
CONFIGURATION SECTION
DATA DIVISION
WORKING-STORAGE SECTION

Mofi · Mar 21, 2015 · #9 (2015-03-21T19:00+00:00)

I suggest using

Code: Select all

/Regexp Type = Perl
/TGBegin "Function"
/TGFindStr = "^(?:\d{6} | {1,6})?([a-z\d\-]+) +(division|section)"
/TGFindStr = "^(?:\d{6} | {1,6})?([a-z\d]+-[a-z\d][a-z\d\-]*)(?<!end-if)(?<!end-exec)(?<!end-perform)(?<!program-id)(?<!date-written)(?<!date-compiled)(?<!source-computer)(?<!object-computer)(?<!special-names)(?<!file-control)(?<!exit Section)\."
/TGEnd

^(?:\d{6} | {1,6})? means:

^ ... begin each search at beginning of a line.

(?: ... )? ... a non marking group containing an expression which is optionally applied. Optionally means that the expression inside the non marking group need not return a positive match, but if it does match a string, it is applied only once. Using ? meaning 0 or 1 times is better than * meaning 0 or more times. (Yes, I learned more about Perl regular expressions in the last 4 years.)

\d{6} | {1,6} ... is an OR expression with first argument being 6 digits and a space character and second argument being 1 to 6 spaces.

The space after \d{6} makes the difference to [0-9]{1,6} in your expression: there is always a space after the 6-digit number, but your expression neither requires nor allows a space after the digits at the beginning of a line.

But why does ^(?: {1,6}|[0-9]{1,6})*([a-z\d\-]+) +(division|section) match the 3 divisions and 2 sections listed by you at all?
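
The optional prefix can be tried on a few lines of the posted sample with Python's `re` module. This is a sketch under assumptions: the lookbehind exclusions are left out for brevity, trailing columns are shortened, and `variant` is a hypothetical alternative that is not part of the suggestion above.

```python
import re

# Prefix as suggested: 6 digits plus exactly one space, or 1 to 6 spaces.
suggested = re.compile(
    r"^(?:\d{6} | {1,6})?([a-z\d]+-[a-z\d][a-z\d\-]*)\.", re.IGNORECASE
)
# Hypothetical variant: allow more than one space after the 6 digits.
variant = re.compile(
    r"^(?:\d{6} +| {1,6})?([a-z\d]+-[a-z\d][a-z\d\-]*)\.", re.IGNORECASE
)

lines = [
    "064500 0100-STARTUP-PROCESS.           06450099",
    "195400 WRITE-ERROR-TO-QUEUE.           19540099",
    "157300  0255-CLOSE-SOCKET-EXIT.        15730099",
    "182000  INIT-ERROR-RTN.                18200099",
]

def names(pattern):
    """Captured paragraph name per line, or None where the line is skipped."""
    result = []
    for line in lines:
        match = pattern.match(line)
        result.append(match.group(1) if match else None)
    return result

print(names(suggested))
print(names(variant))
```

In this Python approximation the suggested prefix finds the first two names but skips the two lines that have two spaces after the sequence number, while the variant finds all four. Whether UltraEdit's Perl engine behaves identically is an assumption worth checking in the editor.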

Well, let us look how this expression is applied for example on string 000100 IDENTIFICATION DIVISION

There is no space at beginning of the line and therefore first argument of OR expression in non marking group does not match anything.

But the second argument of the OR expression in the non marking group matches the number 000100, but without the space.

* after the non marking group means 0 or more times. As the space character after the number 000100 is not a letter, digit or hyphen character, the Perl regular expression engine tries to apply the OR expression once again on the string after the number 000100.

The first argument of the OR expression matches 1 or more spaces. Fine because this matches now the space character after number 000100.

The OR expression cannot be applied a third time, but the string IDENTIFICATION DIVISION is matched by the remaining expression of the search string. So by luck those 3 divisions and 2 sections are found although the expression works differently than intended.
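
The walk-through above can be made visible in Python by wrapping the starred group in an extra capturing group. That wrapper is added here purely for illustration and is not part of the original expression.

```python
import re

# Starred group from the question, wrapped in one additional capturing
# group so that the text it consumed can be inspected.
starred = re.compile(
    r"^((?: {1,6}|[0-9]{1,6})*)([a-z\d\-]+) +(division|section)",
    re.IGNORECASE,
)

match = starred.match("000100 IDENTIFICATION DIVISION.")
print(repr(match.group(1)))  # '000100 '
print(match.group(2))        # IDENTIFICATION
print(match.group(3))        # DIVISION
```

The consumed prefix is the number plus the following space, which confirms that the OR expression was applied twice: once for the digits and once for the space.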