Convert File with Tags to a Delimited Format

misscrf · Mar 08, 2017#12017-03-08T20:32+00:00

I have a file, which has a bunch of tags in it, because it is a component file. I need to get this information into an Excel spreadsheet, so I can organize the info for planning purposes. I'm looking for a way to convert this file, so I can prevent the need from having to do it manually.

The file looks like this:

Code: Select all

		<hd:text name="CityTown TE">
			<hd:defFormat>Like This</hd:defFormat>
			<hd:prompt>Property City or Town</hd:prompt>
		</hd:text>
		<hd:number name="Zip NU" maximum="99999">
			<hd:defFormat>99999</hd:defFormat>
			<hd:prompt>Zip Code</hd:prompt>
		</hd:number>
		<hd:date name="MonthYear DocPrepared DT">
			<hd:defFormat>June, 1990</hd:defFormat>
			<hd:prompt>Date Document was Prepared:</hd:prompt>
			<hd:fieldWidth widthType="calculated"/>
		</hd:date>
		<hd:trueFalse name="StateForm TF">
			<hd:defFormat>true/false</hd:defFormat>
			<hd:prompt>State Form (Add more text)</hd:prompt>
		</hd:trueFalse>
		<hd:multipleChoice name="County MC">
			<hd:defFormat>Like This</hd:defFormat>
			<hd:prompt>County</hd:prompt>
			<hd:options defMergeText="Albany/Allegany/Bronx/Broome/Cattaraugus/Cayuga/Ch">
				<hd:option name="Albany"/>
				<hd:option name="Allegany"/>
				<hd:option name="Bronx"/>
				<hd:option name="Broome"/>
				<hd:option name="Cattaraugus"/>
			</hd:options>
			<hd:singleSelection style="dropDownList"/>
		</hd:multipleChoice>
		<hd:computation name="OrgID CT" resultType="text">
			<hd:script>IF HaveOrgIDNONE TF = TRUE
	OrgID TE
ELSE
	""
END IF
</hd:script>
		</hd:computation>
		<hd:dialog name="Title ABC" linkVariables="false">
			<hd:contents>
				<hd:item name="Field1"/>
				<hd:item name="Field2" onPreviousLine="true"/>
				<hd:item name="Field3" onPreviousLine="true"/>
				<hd:item name="Field4"/>
				<hd:item name="Field5" onPreviousLine="true"/>
				<hd:item name="Field6"/>
				<hd:item name="Field7" onPreviousLine="true"/>
				<hd:item name="Field8" onPreviousLine="true"/>
			</hd:contents>
			<hd:script>HIDE Field1 
HIDE Field2

IF Field3 ="xyz"
	SHOW Field1 
	SHOW Field2
ELSE
	HIDE Field1 
	HIDE Field2
END IF</hd:script>
		</hd:dialog>

I'm looking to get this all into a spreadsheet where the first column tells us the type of variable it is. This is the first part of the opening tag, e.g. text, number, date, multiple choice, computation, dialog. Then we need it's name, prompt, default format, and format, if those are available. If there is a script, that can be an extra column as well.

Anyone know a good way to make this conversion happen more efficiently than by doing it manually? Not only will that take forever, I will have to periodically re-do it all.

Thank you for any help!

Mofi · Mar 09, 2017#22017-03-09T06:26+00:00

If you would post for the posted input example also the expected output we could most likely quickly write the macro for you using mainly UltraEdit tagged regular expression or Perl regular expression using backreferences Replace All commands to reformat the file to a CSV file.

misscrf · Mar 09, 2017#32017-03-09T13:22+00:00

The format is in the original code snippet I posted. I'm attaching a screenshot of the output I am looking to achieve as displayed in Excel and a ZIP file with the CSV file. I asked some colleagues, and it turns out, it is important to note that this has an XML tag at the top. I had stripped out the top tags, to get to the meat of the tags I need. One of the issues, is that I need the key of the tags, as those specify the data type of the variable.

This is at the top of the file:

Code: Select all

<?xml version="1.0" encoding="UTF-8"?>

I'm looking at some XML converters, but they don't seem to grab that key, which I need. I know I will have to do this again and again, as I plan, re-organize, update, revise, etc. My hope is definitely that I can work with a script in UltraEdit, because it's so powerful, and can quickly do this kind of task.

Let me know if you need any other info. I really appreciate it!

Mofi · Mar 13, 2017#42017-03-13T20:43+00:00

Okay, here is the macro which converts the posted XML file into a CSV file.

Code: Select all

InsertMode
ColumnModeOff
PerlReOn
Top
Find "<?xml version="1.0" encoding="UTF-8"?>"
IfNotFound
ExitMacro
EndIf
"Variable Type,Variable Name,Default Format,Prompt,OptionNames,Default Merge Text,Options,Script"
Find MatchCase RegExp "^[\t +]*<hd:(\w+) name="(.+?)".*?>\s+<hd:defFormat>(.+?)</hd:defFormat>\s+<hd:prompt>(.+?)</hd:prompt>\s+(?:<hd:fieldWidth widthType="calculated"/>\s+)?</hd:\1>[\t ]*$"
Replace All "¿\1¿,¿\2¿,¿\3¿,¿\4¿,,,,¶"
Top
Find MatchCase RegExp ""/>\s+<hd:option name=""
Replace All ";"
Top
Find MatchCase RegExp "^[\t +]*<hd:(\w+) name="(.+?)">\s+<hd:defFormat>(.+?)</hd:defFormat>\s+<hd:prompt>(.+?)</hd:prompt>\s+<hd:options defMergeText="(.+?)">\s+<hd:option name="(.+?)"/>\s+</hd:options>\s+<hd:(\w+) style="(.+?)"/>\s+</hd:\1>[\t ]*$"
Replace All "¿\1¿,¿\2¿,¿\3¿,¿\4¿,¿\6¿,¿\5¿,¿\7 - \8¿,¶"
Top
Find MatchCase RegExp "^[\t +]*<hd:(computation) name="(.+?)".*?>\s+<hd:script>([\s\S]+?)</hd:script>\s+</hd:\1>[\t ]*$"
Replace All "¿\1¿,¿\2¿,,,,,,¿\3¿¶"
Top
Find MatchCase RegExp "(<hd:item name=".+?)" (\w+)=.+?/>"
Replace All "\1 - \2"/>"
Top
Find MatchCase RegExp ""/>\s+<hd:item name=""
Replace All ";"
Top
Find MatchCase RegExp "^[\t +]*<hd:(dialog) name="(.+?)".*?>\s+<hd:contents>\s+<hd:item name="(.+?)"/?>\s+</hd:contents>\s+<hd:script>([\s\S]+?)</hd:script>\s+</hd:\1>[\t ]*$"
Replace All "¿\1¿,¿\2¿,,,,,¿\3¿,¿\4¿¶"
Top
Find MatchCase RegExp "\s+¿¶$"
Replace All "¿¶"
Top
Find MatchCase """
Replace All """"
Find MatchCase "¿"
Replace All """
Find MatchCase RegExp "(?<!¶)\r"
Replace All ""
Top
Find MatchCase "¶"
Replace All ""

The last 5 lines could be also:

Code: Select all

HexOn
Find "0D 0A"
Replace All "0A"
Find "B6 0A"
Replace All "0D 0A"

Some additional information:

All field values are text values. Therefore it is advisable to enclose all non empty values in double quotes as Excel interprets values in double quotes as text value instead of text or integer/floating point number or date/time depending on value.
The field values could contain also a comma. For that reason enclosing the values in double quotes is also advisable.
Each double quote within a field value must be escaped with a double quote in entire field value enclosed in double quotes. This is the reason for using first ¿ as a placeholder for an enclosing double quote which is finally replaced by a double quote after duplicating all double quotes in field values. ¿ can be also a different character like a control character with a code value lower than 32 (space character, hexadecimal 20). Important is only that this placeholder character never exists in the XML file.
Comma-separated values files can contain also newline characters within a field value enclosed in double quotes.

Excel uses a special encoding for line breaks within a field value. A data row is terminated with carriage return + line-feed, but a line break within a field value is written with only a line-feed into the CSV file. So the CSV file contains CR+LF as well as just LF.

This is one reason for UltraEdit configuration setting Only recognize DOS terminated lines (CR/LF) as new lines for editing.

The macro marks each end of a data row with placeholder character ¶ which is later used to replace CR+LF without a preceding ¶ by just LF and finally remove all ¶. Again any other character than ¶ never existing in XML file could be used as placeholder including a control character.

I could verify that the macro produces byte by byte exactly the file which German Excel 2007 (with Windows Region and Languages settings set for German countries) also produces on saving the Excel file with the contents of the XML file as also shown on posted image as CSV file.

But I failed to import the data from the CSV file into an empty Excel sheet from the CSV file with German Excel 2007 using Text Import Wizard. The line-feeds within the double quoted multi-line field values (scripts code) were interpreted as end of a data row. It looks like this is a known issue with Excel. You might workaround this issue with special VBA code, see for example Import multiline CSV files into Excel internationally or other solutions found on searching in world wide web for CSV Excel import multiline values.

Please let me know if the macro code works as expected and if you would like it when I explain the macro code (regular expressions).

Last note: The macro should need only one Top command because a Replace All should never modify the caret position. But there are versions of UltraEdit which move caret on a Perl regular expression Replace All executed from within a macro/script. This is the reason why I added after each Perl regular expression Replace All the command Top. I don't know if the version of UltraEdit used by you is affected by this bug.

misscrf · Mar 14, 2017#52017-03-14T12:31+00:00

Thank you very much for this!