Subgroups & Template-Toolkit

fenway · Aug 18, 2011 · #1 · 2011-08-18T13:11+00:00

Hello,

I'm looking to use the function list for syntax that looks like this:

[% PROCESS tabrow  name='Fred'  email='[email protected]' %]

This is from template toolkit.

Basically, each tag starts with [%, ends with %], has a name (tabrow, in this case) and a series of attributes (name, email).

Whitespaces and linebreaks are entirely optional -- it can be spread out over multiple lines if need be.

I can't seem to figure out how to get the subgroups to handle this situation -- it's not like C where there are parentheses demarcating the start/end of the parameter list.

Anyone?

Mofi · Aug 18, 2011 · #2 · 2011-08-18T17:11+00:00

Try the following UltraEdit regular expression function strings in your wordfile:

For listing the tags with the attributes and values separately in subgroups:

/TGBegin "Tags"
/TGFindStr = "^[^% +[a-z]+ +^([a-z0-9_^-]+^)[~%]+^%^]"
/TGBegin "Attributes"
/TGFindStr = "^([a-z0-9_^-]+^)="
/TGFindBStart = "^["
/TGFindBEnd = "^]"
/TGEnd
/TGBegin "Values"
/TGFindStr = "=['"]^([~'"]+^)['"]"
/TGFindBStart = "^["
/TGFindBEnd = "^]"
/TGEnd
/TGEnd

For listing the tags with the attributes and their values as subgroups:

/TGBegin "Tags"
/TGFindStr = "^[^% +[a-z]+ +^([a-z0-9_^-]+^)[~%]+^%^]"
/TGBegin "Attributes"
/TGFindStr = "^([a-z0-9_^-]+=['"][~'"]+['"]^)"
/TGFindBStart = "^["
/TGFindBEnd = "^]"
/TGEnd
/TGEnd

For listing the tags with the attributes only in subgroups:

/TGBegin "Tags"
/TGFindStr = "^[^% +[a-z]+ +^([a-z0-9_^-]+^)[~%]+^%^]"
/TGBegin "Attributes"
/TGFindStr = "^([a-z0-9_^-]+^)="
/TGFindBStart = "^["
/TGFindBEnd = "^]"
/TGEnd
/TGEnd

After copying one of the 3 blocks into your wordfile (opened in the same instance of UltraEdit as the syntax highlighted file) and saving the wordfile, you need to switch to the syntax highlighted file and press key F8 to execute Search - Function List, which reparses the syntax highlighted file with the changed regular expression strings.

fenway · Aug 19, 2011 · #3 · 2011-08-19T03:20+00:00

Using v17.10.0.1015, they all work!

Very impressive, Mofi -- I've been trying to fight with this for days now.

So let me make sure I understand what you've done.

The top-level "Tag" TGFindStr is straightforward -- but basically, there's nothing wrong with matching the entire thing. Can I assume that it will work multi-lined as well? Haven't tried yet, but maybe I'll only have to change the "space" to a "whitespace" character set?

As for the second-level "Attribute" TGFindStr, I'm guessing that UE simply runs this as many times as possible, similar to /g, between TGFindBStart and TGFindBEnd? If so, it makes sense that '[' and ']' are the boundaries. That is, the second-level needs to be "within" the top-level TGFindStr matching string -- is that correct?

Just a question -- if I add multiple second-level TGFindStr lines, will it re-scan the matching part? There are many different variants, and it's difficult to capture them all in a single one, especially with the UE regex limitations.

I'm just trying to understand where the BStart and BEnd literals appear relative to the top-level FindStr match.

Thanks in advance.

P.S. Looks like there is some issue with some variation of this wordfile that breaks the subgroups in some cases -- perhaps the minor version wasn't to blame -- I'm trying to figure that one out. Is it possible that there's a maximum file size for subgroups to be in effect?

Mofi · Aug 19, 2011 · #4 · 2011-08-19T05:14+00:00

fenway wrote: There's nothing wrong with matching the entire thing. Can I assume that it will work multi-lined as well? Haven't tried yet, but maybe I'll only have to change the "space" to a "whitespace" character set?

You can also use

/TGBegin "Tags"
/TGFindStr = "^[^%[ ^t]+^([a-z]+[ ^t]+[a-z0-9_^-]+[~%]+^)^%^]"
/TGEnd

to get the entire tags with their attributes displayed in the function list. You can also move ^( and ^) to any position to display only the string found by the expression inside them, or remove them completely to display the entire found string in the function list view. Multiline strings are no problem because this expression also finds multiline strings and UltraEdit converts every line ending to a space for display. But the beginning, like [% PROCESS tabrow, must always be within a single line with that expression; line terminators are ignored only in the following string because of the negative character set definition [~%]+.
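
For readers who think in Perl-style syntax, the multi-line behaviour described here can be tried out with a small Python sketch. This is only an analogy: the pattern is an approximate translation of the function string above, Python's `re` module is not UltraEdit's engine, the whitespace collapsing merely imitates the "line ending becomes a space" display, and the sample tags are made up.

```python
import re

# Approximate Perl-style counterpart of
# "^[^%[ ^t]+^([a-z]+[ ^t]+[a-z0-9_^-]+[~%]+^)^%^]" (illustration only).
TAG_RE = re.compile(r"\[%[ \t]+([a-z]+[ \t]+[a-z0-9_-]+[^%]+)%\]", re.IGNORECASE)

multiline = "[% PROCESS tabrow\n   name='Fred'\n   city='Boston'\n%]"
match = TAG_RE.search(multiline)

# Collapse each run of whitespace to one space, roughly imitating how a
# multi-line match is shown on a single line in the function list.
display = " ".join(match.group(1).split())
print(display)

# The start of the tag must still sit on one line: [ \t]+ does not match "\n",
# while the negated set [^%]+ happily runs across line ends.
broken = "[% PROCESS\ntabrow name='Fred' %]"
print(TAG_RE.search(broken))
```

The first tag is found although it spans four lines; the second is not, because the line break falls between the keyword and the tag name.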

fenway wrote: As for the second-level "Attribute" TGFindStr, I'm guessing that UE simply runs this as many times as possible, similar to /g, between TGFindBStart and TGFindBEnd? That is, the second-level needs to be "within" the top-level TGFindStr matching string -- is that correct?

Twice correct.
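
The two confirmed points (the second level is applied repeatedly, and only inside the text found by the first level) can be sketched in Python. This is an analogy under assumptions, not a description of UltraEdit internals: the patterns are rough Perl-style translations of the "Tags" and "Attributes" function strings, matching is assumed to be case-insensitive, and `list_tags` and the sample line are hypothetical.

```python
import re

# Rough Perl-style equivalents of the wordfile expressions (an assumption,
# not what UltraEdit runs internally).
TAG_RE = re.compile(r"\[% +[a-z]+ +([a-z0-9_-]+)[^%]+%\]", re.IGNORECASE)
ATTR_RE = re.compile(r"([a-z0-9_-]+)=", re.IGNORECASE)


def list_tags(text):
    """Return (tag name, [attribute names]) for every tag found."""
    result = []
    for tag in TAG_RE.finditer(text):        # first level: scans the whole text
        inner = tag.group(0)                 # everything from "[" to "]"
        attrs = ATTR_RE.findall(inner)       # second level: repeated, like /g
        result.append((tag.group(1), attrs))
    return result


sample = "[% PROCESS tabrow  name='Fred'  city='Boston' %]"
print(list_tags(sample))
```

Each top-level match becomes one entry, and the attribute pattern only ever sees the text between the brackets of that match.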

fenway wrote: If I add multiple second-level TGFindStr lines, will it re-scan the matching part?

Yes, using multiple /TGFindStr = lines on any level results in searching the entire matching part again (the entire file on the first level, just the block defined by /TGFindBStart = and /TGFindBEnd = on the lower levels).

fenway wrote: Is it possible that there's a maximum file size for subgroups to be in effect?

I don't know of such a limitation. But I know that there are several bugs in grouped function string scanning, which I reported a few weeks ago, that produce unexpected results for the regular expression strings used. I hope the IDM developers will fix them soon.

fenway · Aug 19, 2011 · #5 · 2011-08-19T12:41+00:00

Thanks for your detailed responses.

Turns out the "size limitation" wasn't the issue -- it was actually the presence of an open script tag (<script). FYI, HTML_LANG is in this wordfile.

So it looks like something about the open script tag is messing up the subgroups -- that is, it seems to engage the Function Strings of the Javascript.uew file, and those can't seem to co-exist with TGFindStr. Is that the problem?

If I remove HTML_LANG, it's just fine. If I change <script to <foo, it's fine. In fact, if <script is present ANYWHERE in the file, subgroups are broken.

Also, I'm just wondering about nesting of such tags, for example:

[% BLOCK footer name1='value1' %]
   Copyright 2000.
   [% INCLUDE company name2='value2' %]
[% END %]

How does it handle matching TGFindBStart/TGFindBEnd pairs? This gets complicated with multi-lined versions, too.

Mofi · Aug 19, 2011 · #6 · 2011-08-19T14:03+00:00

The language marker HTML_LANG enables special HTML related functions in the UltraEdit syntax highlighting engine. With HTML_LANG, strings starting with < (opening tag) or </ (closing tag) and ending with > (tags) or = (attributes) are interpreted specially according to the HTML specification. This keyword also enables multi-language syntax highlighting.

<script in a file highlighted with a wordfile containing HTML_LANG in the first line results in using, for the block up to </script>, the wordfile containing the language marker JSCRIPT_LANG. So everything in a script block is interpreted according to the syntax highlighting definitions in the wordfile for Javascript. You can see that by looking at the 4th box in the status bar at the bottom, where the name of the currently active syntax highlighting language for the text at the current cursor position is displayed.

Perhaps it would be better not using HTML_LANG for the template files.

Alternatively, you could also use the function strings from the template wordfile in the wordfile for Javascript, which should result in displaying these tags in the function list view as well. Of course, the function strings from the template wordfile should be added to the Javascript wordfile without replacing the function strings for real Javascript code. So you need to convert the old-style function strings in javascript.uew to the new grouped function string definition.

/TGBegin "Functions"
/TGFindStr = "%[ ^t]++function[ ^t]++^([a-zA-Z_][a-zA-Z_0-9]++[^t ]++(*)^)"
/TGFindStr = "%[ ^t]++^(*^)=[ ^t]++function[ ^t]++(*)"
/TGFindStr = "%[ ^t]++^(*^):[ ^t]++function[ ^t]++(*)"
/TGEnd

Nested tags are no problem for finding the tags, but you will not see them nested in the function list view. The function list view is not designed for displaying the found strings according to a document structure, which a general text editor can't know. It could be possible to get them displayed partly structured in the function list view: require for top-level tags that [% is at the start of a line without preceding spaces, and use the same group of function strings again with the difference that [% must now follow X spaces/tabs at the beginning of the line. But I don't think this is really worth the effort. It would be easier to turn off alphabetic sorting of the function list and include the preceding spaces and the keyword before the tag name in the displayed "tag" string, so that the document structure is more or less visible in the function list view as well.

\t is the special character code for the tab character in Unix and Perl syntax, but for the UltraEdit regular expression engine and for non regular expression searches you have to use ^t for the tab character. See the table "Special character summary", for example on the help page for the Find command.

fenway · Aug 19, 2011 · #7 · 2011-08-19T19:31+00:00

Yes, but I do want JSCRIPT_LANG within script blocks (highlighting, keywords, etc.) -- I'm not all that interested in the function list, and like you said, I can easily add them if I want to.

I'm trying to figure out why I can't have subgroups and multi-language syntax highlighting at the same time.

One more thing: I can't seem to figure out how to write a UE-style regex that captures any number of any character -- including multi-line.

[ ^t^p^r^n]+

Will give me any whitespace.

But I can't seem to add "?" into the character class.

Suggestions?

Mofi · Aug 20, 2011 · #8 · 2011-08-20T20:27+00:00

You made a mistake caused by missing documentation. I wrote about the special meaning of ^p, ^r and ^n in syntax highlighting wordfiles in this post.

The character set definition [^t^p -ÿ]+ matches in a function string in syntax highlighting wordfiles all tabs, line terminators of any type and all characters in ANSI table from space character to character ÿ which is the last one with hexadecimal value 0xFF in code page 1252.

But usually a negative character set definition is used as for example [~0-9a-z]+ which matches any character including line terminators except letters (in any case) and digits. I used [~%]+ in the regular expression finding the tags to match everything up to the percentage character.

Inside a [...] all characters are interpreted as literal characters in any regular expression engine, without the special meanings they have outside, except

] because this is the character that ends the character set definition and must therefore be escaped when it should be part of the character set,
- which is interpreted as "from the character left of - to the character right of -" and must therefore be escaped when it should be part of the character set,
and all special character codes with the escape character like ^p, ^t, ^b respectively \t, \n, etc.

So regular expression characters like ?, +, $, etc. lose their special meaning inside [...].
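
A quick way to convince yourself of this rule is a few lines of Python. Note the assumptions: Python uses Perl-style syntax, so the escape character is \ rather than UltraEdit's ^, and the test strings are invented for the example.

```python
import re

# Inside [...] the characters ?, + and $ are ordinary characters.
literal_set = re.compile(r"[?+$]+")
literal_hits = literal_set.findall("a?b++c$$$d")
print(literal_hits)

# "]" and "-" keep their special role and are escaped here to be members.
with_specials = re.compile(r"[\]\-]+")
special_hits = with_specials.findall("x]-]y")
print(special_hits)
```

So a ? placed inside a character set only ever matches a literal question mark; it cannot act as "any character" there.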

fenway · Aug 21, 2011 · #9 · 2011-08-21T12:23+00:00

Mofi wrote: The character set definition [^t^p -ÿ]+ matches in a function string in syntax highlighting wordfiles all tabs, line terminators of any type and all characters in ANSI table from space character to character ÿ which is the last one with hexadecimal value 0xFF in code page 1252.

But usually a negative character set definition is used as for example [~0-9a-z]+ which matches any character including line terminators except letters (in any case) and digits. I used [~%]+ in the regular expression finding the tags to match everything up to the percentage character.

That makes sense -- however, I was trying to capture everything up to a particular string, not a particular character, and I see no way to do that in UE-style regex.

Incidentally, that character set doesn't seem to work for me -- or maybe it's the greediness? I'm not certain.

Let me give you a specific example:

[%MYTAG block %]
    ...contents...

  [%INNERTAG1 %]
    ...contents...
  [%INNERTAG2 %]
    ...contents...
  [%INNERTAG3 %]
    ...contents...

[%/MYTAG %]

I wanted to collect each of the inner tags as a subgroup of MYTAG. This is complex, since the start/end strings of each tag appear dozens of times -- so I figured if I wrote a regex to match, across lines, until the closing tag, I would be able to match them. But then I'm confused about what to use the TGFindBStart/End for the subgroups... very confused.

Any ideas?

fenway wrote: Yes, but I do want JSCRIPT_LANG within script blocks (highlighting, keywords, etc.) -- I'm not all that interested in the function list, and like you said, I can easily add them if I want to.

I'm trying to figure out why I can't have subgroups and multi-language syntax highlighting at the same time.

Any idea why this would break subgroups????

Mofi · Aug 21, 2011 · #10 · 2011-08-21T16:28+00:00

Yes, UltraEdit regular expressions like string1[^t^p -ÿ]+string2 or string1[~^b]+string2 are greedy and there is no way to make them not greedy as the Perl regular expression engine offers.
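
To make the difference concrete, here is a small Python sketch of greedy versus non-greedy matching. It shows the Perl-style behaviour only; `[^\f]` is used as a stand-in for the UltraEdit set [~^b] (everything except a page break), and the sample text is made up.

```python
import re

text = "string1 AAA string2 BBB string2"

# Greedy: the set swallows as much as possible, so the match runs to the
# LAST "string2". This corresponds to the UltraEdit behaviour described above.
greedy = re.search(r"string1[^\f]+string2", text).group(0)

# Non-greedy (Perl-style "+?"): stops at the FIRST "string2".
lazy = re.search(r"string1[^\f]+?string2", text).group(0)

print(greedy)
print(lazy)
```

With several closing tags in a file, the greedy form therefore runs to the last closing tag rather than the nearest one.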

I think you are the first user who tries to get a grouped function list for a multi-language highlighted file. And I think IDM has not designed the grouped function strings for multi-language highlighted files. It makes absolute sense to me from a programmer's point of view that whenever the syntax highlighting language changes, a new function string parsing session starts. You must think of a multi-language HTML file as a set of individual files packed together into a single file, like in an archive file. Each block has its own syntax highlighting file with its own definitions. Why should anybody think that functions in a Javascript block should be listed in the function list view as a subgroup of a tags block for something like the following?

[%MYTAG block %]

  [%INNERTAG1 %]

   <script type="text/javascript">
       // Javascript code
   </script> 

   [%INNERTAG2 %]

[%/MYTAG %]

The Javascript block is a completely different language interpreted by a completely different software module than the lines outside. So for the UltraEdit function string parser this block is read as

File 1 - HTML:

[%MYTAG block %]

  [%INNERTAG1 %]

   <script type="text/javascript">

File 2 - Javascript:

       // Javascript code

File 3 - HTML:

</script> 

   [%INNERTAG2 %]

[%/MYTAG %]

I hope this makes clear why the Javascript code breaks function string parsing into 3 parts for this example. It would be possible to read a multi-language HTML file differently. I could imagine interpreting an HTML file with multiple languages by splitting it into one file per language, each containing all text from the source file belonging to that language. With such an algorithm the example above would result internally in 2 files to parse.

File 1 - HTML:

[%MYTAG block %]

  [%INNERTAG1 %]

   <script type="text/javascript">
</script> 

   [%INNERTAG2 %]

[%/MYTAG %]

File 2 - Javascript:

       // Javascript code

But it looks like this is not the way UltraEdit parses a HTML file. I don't know if browsers do that, but I don't think so because of small inline codes of other languages which need the context like <p style="color:red"> (HTML, CSS, HTML).
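
The "3 files" reading can be imitated with a short Python sketch. Everything here is a guess at the observable behaviour, not UltraEdit's actual code: `split_by_language` is a hypothetical helper, the split points are simply the script tags, and `BLOCK_RE` stands in for any function string that needs both the opening and the closing tag of a block.

```python
import re

# Hypothetical splitter: cut the text at <script ...> and </script>, keeping
# the tags themselves in the HTML pieces as in the "File 1/2/3" view above.
SCRIPT_RE = re.compile(r"(<script\b[^>]*>)(.*?)(</script>)",
                       re.IGNORECASE | re.DOTALL)


def split_by_language(text):
    """Return (language, chunk) pieces in document order."""
    pieces, pos = [], 0
    for m in SCRIPT_RE.finditer(text):
        pieces.append(("HTML", text[pos:m.end(1)]))   # up to and incl. <script ...>
        pieces.append(("Javascript", m.group(2)))     # script body
        pos = m.start(3)                              # resume at </script>
    pieces.append(("HTML", text[pos:]))
    return pieces


sample = (
    "[%MYTAG block %]\n"
    "  [%INNERTAG1 %]\n"
    '  <script type="text/javascript">\n'
    "    // Javascript code\n"
    "  </script>\n"
    "  [%INNERTAG2 %]\n"
    "[%/MYTAG %]\n"
)
pieces = split_by_language(sample)

# A block-level pattern needs both the opening and the closing tag.
BLOCK_RE = re.compile(r"\[%MYTAG.*?\[%/MYTAG", re.DOTALL)
found_in_whole_file = BLOCK_RE.search(sample) is not None
found_in_any_piece = any(BLOCK_RE.search(chunk) for _, chunk in pieces)

for language, chunk in pieces:
    print(language, repr(chunk))
print(found_in_whole_file, found_in_any_piece)
```

The block pattern matches the whole file but none of the individual pieces, which would explain why a single <script anywhere between the block tags is enough to lose the subgroups.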

My problem with helping you here is that you want general regular expression answers to questions about very content-specific problems. I'm not able to give them without the content. I need the content to help you effectively. It would be best to pack a good example file, the wordfiles needed to syntax highlight this example file, and a text file showing what you want to see in the function list view and in which structure, into a ZIP or RAR file and upload this archive as an attachment to your next post. Then I could help you much better, or at least tell you that what you want is not possible.

Please note that the function list feature is in general not designed for files like HTML, XHTML, XML and similar where tags can exist within same or other tags. As pietzcker wrote in this post (and some others): Regular expressions are not able to deal with arbitrarily nested structures. That's why for XML Manager a special XML parser is used and why for HTML Close Tag feature of UEStudio the IntelliTips feature is necessary which are coded specially to handle nesting of tags. Regular expression finds/replaces are not designed for taking into account a content structure. They are designed for character streams without any structure because finds/replaces do not split a file content into hierarchical blocks or objects.

fenway · Aug 22, 2011 · #11 · 2011-08-22T13:02+00:00

I understand the limitations of regular expressions -- there will definitely be things that I cannot capture.

The example I gave earlier -- with a block tag and a bunch of INNERTAGs -- is a really good example. As I alluded to earlier, each one can have name=value pairs on the open tag.

[%BLOCKTAG findme one=yes two=no %]

[%INNERTAG green three=maybe %]
... any arbitrary amount of html code, and possible other [% TAGS %], but not INNERTAG, and not %/BLOCKTAG
[%INNERTAG blue four=go %]
... any arbitrary amount of html code, and possible other [% TAGS %], but not INNERTAG, and not %/BLOCKTAG
[%/BLOCKTAG %]

The function list should be

findme
--Attributes
----one=yes
----two=no
--Inners
----green
----Attributes
------three=maybe
----blue
----Attributes
------four=go

Not sure how the wordfile will help -- I'm designing one from scratch -- so the only thing present at the moment is the function group expressions that we've been talking about this whole time.

Mofi · Aug 25, 2011 · #12 · 2011-08-25T06:14+00:00

Sorry, I was busy due to beta testing UE v17.20.

A wordfile like

/L20"Templates" HTML_LANG Nocase Noquote File Extensions = TXT
/TGBegin "Blocks"
/TGFindStr = "%^[^%BLOCKTAG[ ^t]+^([a-z0-9_^-]+^)[~%]+^%^]"
/TGBegin "Attributes"
/TGFindStr = "^([a-z0-9_^-]+=['"]++[a-z0-9_^-]+['"]++^)"
/TGFindBStart = "^[^%"
/TGFindBEnd = "^%^]"
/TGEnd
/TGBegin "Inners"
/TGFindStr = "%[ ^t]+^[^%INNERTAG[ ^t]+^([a-z0-9_^-]+^)[~%]+^%^]"
/TGFindBStart = "^[^%BLOCKTAG"
/TGFindBEnd = "^[^%/BLOCKTAG"
/TGBegin "Attributes"
/TGFindStr = "^([a-z0-9_^-]+=['"]++[a-z0-9_^-]+['"]++^)"
/TGFindBStart = "^[^%"
/TGFindBEnd = "^%^]"
/TGEnd
/TGBegin "Others"
/TGFindStr = "%[ ^t]+^[^%[ ^t]+^([a-z0-9_^-]+^)[~%]+^%^]"
/TGFindBStart = "^[^%INNERTAG"
/TGFindBEnd = "^[^%^{INNER^}^{/BLOCK^}TAG"
/TGEnd
/TGEnd
/TGEnd

results for a *.txt file containing

[%BLOCKTAG findme one=yes two=no %]

 [%INNERTAG green three=maybe %]

 [% test1 %]

 [% test2 %]

 [%INNERTAG blue four=go %]

 [% test3 %]

[%/BLOCKTAG %]

[%BLOCKTAG second attrib1="xxx" attrib2='yyy' %]

 [%INNERTAG red three=maybe %]

 [%INNERTAG green four=go %]

[%/BLOCKTAG %]

in the following display in the function list view (all expanded) with UE v17.20 Beta 2.

[Image: captured function list output]

fenway · Aug 26, 2011 · #13 · 2011-08-26T12:39+00:00

So far so good -- is there any way to get more than 2 options in a UE-style alternation? Or do I simply need to create multiple FindStr blocks instead?

Mofi · Aug 26, 2011 · #14 · 2011-08-26T13:21+00:00

The OR expression of the UltraEdit regexp engine supports only 2 arguments. If character set definition(s) can't be used instead of the OR expression, you need multiple function strings.

Let me explain with an example what I mean by character set(s) instead of an OR expression. I used in the function strings

^{INNER^}^{/BLOCK^}TAG

to find INNERTAG and /BLOCKTAG.

But a regular expression string like /++[BI][LN][ON][CE][KR]TAG finds these 2 strings too. Of course this expression could find also other strings. But when other strings matching this expression never exist in the file in the context this part of a larger regular expression is applied on, it does not matter. Perhaps even /++[BCEIKLNOR]+TAG could be used too.

Well, as the author of a wordfile using such an expression in a function string to find a group of well-defined words, I would document in a separate text file which words should be found by this expression. Nobody, including the author, would later (easily) understand what the expression /++[BCEIKLNOR]+TAG or /++[BI][LN][ON][CE][KR]TAG is for if it is not explained anywhere.

fenway · Aug 26, 2011 · #15 · 2011-08-26T13:26+00:00
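
How much extra the character set variants accept can be checked with a Python sketch. The three patterns are Perl-style translations of the expressions above (/++ becomes /*), so this only illustrates the matching logic, not the UltraEdit engine; the word list is invented for the test.

```python
import re

patterns = {
    # counterpart of ^{INNER^}^{/BLOCK^}TAG
    "two-way OR": re.compile(r"(?:INNER|/BLOCK)TAG"),
    # counterpart of /++[BI][LN][ON][CE][KR]TAG
    "per-position sets": re.compile(r"/*[BI][LN][ON][CE][KR]TAG"),
    # counterpart of /++[BCEIKLNOR]+TAG
    "one big set": re.compile(r"/*[BCEIKLNOR]+TAG"),
}

words = ["INNERTAG", "/BLOCKTAG", "BLOCKTAG", "BNOCKTAG", "ROCKTAG", "OTHERTAG"]

# For each pattern, list the words it accepts as a whole.
accepted = {
    name: [w for w in words if rx.fullmatch(w)]
    for name, rx in patterns.items()
}
for name, hits in accepted.items():
    print(name, hits)
```

The OR expression accepts exactly the two intended words, the per-position sets also accept BLOCKTAG and the nonsense word BNOCKTAG, and the single big set additionally accepts ROCKTAG. Whether that matters depends, as said above, on which strings can actually occur in the files.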

That's too bad. The repeated character sets would only work if the lengths of the prefixes matched, and the giant character set would need to impose length restrictions (min/max), making it almost impossible.