Got 'Premature end of data in tag...' in XML Manager

rblock · Sep 12, 2014 · #1 · 2014-09-12T14:39+00:00

Hi,

I don't know why but with the attached config file of Notepad++ I always got the following error message in XML Manager:

Premature end of data in tag NotepadPlus line 2
«* Line: 58, Column:57

A screenshot is attached and the config.xml file as well (click on the screenshot to see the whole picture, please).

Because there is no column 57 in this line, I really have no idea what the problem is. Perhaps I'm just blind, though; it would be nice if somebody could help me find my way back to the light.

Blind greetings

Reiner

Mofi · Sep 13, 2014 · #2 · 2014-09-13T11:24+00:00

It was really interesting to solve this problem.

I opened config.xml in English UE v21.20.0.1014 and opened XML Manager view. No problem. Parsing was successful. So problem was not reproducible.

Therefore I next executed Format - XMLlint Tool and could see immediately that XMLlint did not find an error in the structure, as the output window already showed the lines from config.xml instead of an error message. But UltraEdit was not responsive anymore. I opened Windows Task Manager and could see that XMLlint was still running and consumed the complete power of one CPU core. In other words, XMLlint had run into an endless loop. I killed the XMLlint process with Windows Task Manager.

To find out what is wrong with this XML file, I removed many lines at bottom with keeping structure and executed Format - XMLlint Tool. No problem anymore for XMLlint to parse the shortened file. So I used Ctrl+Z to undo the lines deletion, removed a smaller block at bottom of the file, and executed once again XMLlint. After one more step I could see at bottom of output window the error message:

I/O error : encoder error

I looked on the last lines of shortened config.xml and could see

        <Find name="am zweiten Wochenende, auf dem r&#xFFFC;ck" />

Yes, this is the encoding error as German ü has hexadecimal Unicode value 00FC and not FFFC.

U+FFFC belongs to group of Unicode specials, see Wikipedia article Specials (Unicode block). The object replacement character has a very special meaning. It was now explainable for me why XMLlint tool and XML parser in your version of UltraEdit have problems on parsing the XML file.

To fix this problem, replace &#xFFFC; by ü in config.xml, or remove the entire find history line containing this invalidly encoded ü, or the entire find/replace history block.
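
A minimal Python sketch for locating such references in a file, assuming they are written as hexadecimal character references (`&#x...;`) as in the line shown above. The helper name `find_special_refs` is made up for illustration; it flags any reference into the Unicode Specials block (U+FFF0 to U+FFFF).

```python
import re
import unicodedata

# Hexadecimal XML character references such as &#xFFFC;
CHAR_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def find_special_refs(text):
    """Return (line_number, code_point, name) for refs in U+FFF0..U+FFFF."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in CHAR_REF.finditer(line):
            code_point = int(match.group(1), 16)
            if 0xFFF0 <= code_point <= 0xFFFF:
                name = unicodedata.name(chr(code_point), "<unnamed>")
                hits.append((line_no, code_point, name))
    return hits


sample = '<Find name="am zweiten Wochenende, auf dem r&#xFFFC;ck" />'
for line_no, code_point, name in find_special_refs(sample):
    print(f"line {line_no}: U+{code_point:04X} {name}")

# For comparison, the character that was actually intended:
print(unicodedata.name("\u00fc"))
```

Run against the whole config.xml text, this should point directly at the find history entry without having to shorten the file step by step.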

rblock · Sep 13, 2014 · #3 · 2014-09-13T12:27+00:00

Hi Mofi,

Wow! You put a lot of work in it but ...

I changed it to 00FC but nothing changed. Of course it seems to be a bug in Notepad++ and I'll send a bug report that there is a problem with the UTF-8 conversion.

So I'll keep investigating.

Investigating greetings

Reiner

Mofi · Sep 13, 2014 · #4 · 2014-09-13T13:19+00:00

Hm, I don't understand why you still have a parsing problem in XML manager view. The structure of config.xml is correct.

What happens if you remove all comments on a copy of config.xml by running a Perl regular expression replace with search string (?s)^[\t ]*<!--.*?-->[\t ]*\r\n and an empty replace string?
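
The same idea can be sketched in Python for testing outside UltraEdit. This is an assumption-laden equivalent, not the UltraEdit replace itself: it assumes comments use the standard `<!-- ... -->` syntax, start at the beginning of a line, and are followed only by whitespace before the line break. The function name `strip_comment_lines` is made up for illustration.

```python
import re

# DOTALL lets .*? span line breaks inside a comment; MULTILINE lets ^ match
# at the start of every line. \r?\n accepts DOS and Unix line endings.
COMMENT_LINE = re.compile(r"^[\t ]*<!--.*?-->[\t ]*\r?\n", re.DOTALL | re.MULTILINE)


def strip_comment_lines(xml_text):
    """Remove comments that occupy whole lines, including their line break."""
    return COMMENT_LINE.sub("", xml_text)


sample = (
    '<?xml version="1.0"?>\r\n'
    "<!-- a\r\n multi-line comment -->\r\n"
    "<NotepadPlus>\r\n"
    "  <!-- one line -->\r\n"
    '  <Find name="x" />\r\n'
    "</NotepadPlus>\r\n"
)
print(strip_comment_lines(sample))
```

A comment followed by other content on the same line is not handled cleanly by this pattern, so it is best run on a copy of the file.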

rblock · Sep 13, 2014 · #5 · 2014-09-13T14:47+00:00

Hi Mofi,

That was it! Though, it seems that the XML parser of XML-Manager has some problems with those XML comments because the ...

Now it is really strange! I loaded the original file with comments and ... the XML-Manager is working.

I have now made some tests and it seems to be the following problem:

The config.xml file is codepage 1252. When it is first time loaded the codepage on the status bar is set to UTF-8. Therefore I changed it back from 1252 to UTF-8 and closed UE without saving the file. Then I restarted UE and again UTF-8 was selected and XML-Manager does not work. Again I changed it to 1252 and restarted UE without saving the file. After restart 1252 was automatically selected and the XML-Manager is working. All the time the codepage selected by menu with 'View/Select codepage...' was 1252.

That's really something I don't like with UE. One codepage for editing and one for the view. Therefore it seems that the XML-Manager takes the codepage for edit, what was UTF-8, and got in trouble because the file is just 1252.

BTW, I really hate those dialog for selecting the codepage because the entries are not sorted and it is not possible to search for substrings to find the right one.

Oops, I just saw that I still had not sent this. In the meantime I already sent the bug report to sourceforge for Notepad++.

And, funnily enough, I also had to send a bug report to IDM: UE crashes each time I open a certain XML file. Notepad++ has no problems with it, and neither do oXygen 12 or WebStorm 7 ...

Sighing greetings

Reiner

Mofi · Sep 13, 2014 · #6 · 2014-09-13T17:03+00:00

I have downloaded and installed (for the first time) also the German version of UltraEdit and could reproduce what you have found out with the German as well as with the English version of UE. Selecting UTF-8 in the status bar for a Windows-1252 encoded file results in a conversion from 1252 to UTF-8 encoding, of course without changing the encoding information in the first line of the XML file. The XML manager now fails to parse the file correctly. Switching back in the status bar from UTF-8 to code page 1252 and pressing F5 in the XML manager view results in a correctly parsed file.

I usually do not use the enhanced (standard) status bar, but the basic status bar, which does not offer switching the encoding. Therefore I had not noticed before that switching the code page in the status bar from Windows-1252 to UTF-8 results in execution of the command ASCII to UTF-8. So the file is changed, as indicated on the file tab. That is very interesting as English UltraEdit describes on help page Status Bar:
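
Why bytes and declared encoding must agree can be shown with a minimal Python sketch. It uses the word rück from the find history entry and Python's `cp1252` codec as a stand-in for Windows-1252; the helper name `decode_as` is made up for illustration and nothing here reflects how UltraEdit works internally.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode bytes with the given encoding, reporting failures as text."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return f"<decode error: {exc.reason}>"


text = "r\u00fcck"                 # "rück"
as_1252 = text.encode("cp1252")    # ü stored as the single byte 0xFC
as_utf8 = text.encode("utf-8")     # ü stored as the two bytes 0xC3 0xBC

print(decode_as(as_1252, "cp1252"))  # bytes and encoding agree
print(decode_as(as_utf8, "cp1252"))  # UTF-8 bytes read as 1252: two wrong characters
print(decode_as(as_1252, "utf-8"))   # 1252 bytes read as UTF-8: invalid byte sequence
```

The second case corresponds to a file that was converted to UTF-8 while its first line still declares Windows-1252; the third corresponds to a 1252 file being treated as UTF-8.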

English UE help wrote:Encoding Type
The Encoding Type control allows users to change the encoding used to display the active file. This does not actually affect the underlying content of the file. No conversion is done. This merely changes the encoding used to display the file in the editor.

That is definitely not correct as selecting UTF-8 for a single byte encoded text file (Windows-1252 is not really an ANSI standard) results in execution of conversion command ASCII to UTF-8. I will report this to IDM by email.

Further, View - Set Code Page is identical to code page in enhanced status bar. But if a Unicode encoding is selected and the file is therefore really a Unicode encoded file, the code page setting is of no importance for the display of the text or the editing. What is selected in View - Set Code Page does not matter if the file itself is encoded with UTF-8, UTF-16 little endian, UTF-16 big endian, or ASCII Escaped Unicode. This setting is only important for a Unicode file if File - Conversions - UTF-8 to ASCII or one of the other Unicode to ASCII conversion commands is executed as in this case UltraEdit needs to know to which code page the Unicode file should be converted. This is not described at all in help of UltraEdit.

I will send another email to IDM support with the suggestion to explain View - Set Code Page better, especially what this setting is for if the file is a Unicode file.

There is not really one code page for editing and another one for displaying of the text. There is just one code page per file which is only important for text files not being Unicode encoded (as long as no conversion is done). But as not every font supports every code page, the user has to select a font or the right "script" (code page) when switching the code page for a single byte encoded text file. Of course for Unicode encoded files it is also necessary to select an appropriate font. Most Unicode encoded text files in Eastern Asian languages need a different font than usually selected in Western European countries for viewing/editing the text as their characters are simply not present in Windows-1252 code page or in most fonts installed on Windows computers in Western European countries.

On Linux the code page versus font problem is nowadays hidden by always using UTF-8 (Unicode) editing and by having a default font set that covers all Unicode characters typically used by North American and Western European users. The same is true for Windows in North American and Western European countries. A commonly configured font like Courier New or Consolas contains all Unicode characters usually used in those countries, so most users there do not even know that the bytes in a file are not the same as the characters displayed and that there are many, many different encodings. What is an encoding? A text file is a text file, isn't it? No, it is not, as IDM explained in the power tip Working with Unicode in UltraEdit/UEStudio.

What I do not understand is why UTF-8 was selected for config.xml, as this was not the case when I opened this file in English and German UltraEdit.

Have you manually selected UTF-8 although the encoding declaration in the first line of config.xml is Windows-1252?

Note: UltraEdit remembers in uedit32.ini a manually selected code page for a file for next opening. This information can be removed by using button Clear history at Advanced - Configuration - Toolbars / Menus - Miscellaneous which clears this manual code page selection information as well as all other histories stored in uedit32.ini. I use this button at least once per month.

rblock wrote:BTW, I really hate those dialog for selecting the codepage because the entries are not sorted and it is not possible to search for substrings to find the right one.

Do you mean View - Set Code Page?

The list in this very small dialog is sorted alphabetically, but not numerically. The list is more or less the same as when looking at the code page conversion tables in the region and language settings of Windows, with the following exceptions:

The currently set code page for the file is always at the top of the list. I don't know the reason. But this behavior is good in case the user selects something different in the list and then decides: No, I do not want to change the code page. What was selected before? Ah, yes, the code page at the top of the list.
Code page 28603 which is ISO/IEC 8859-13. This is listed on my German Windows XP as the last but two entry. Interesting is that this code page is not listed at all in the Windows region and language settings, which is most likely the reason why this code page is not between code page 28599 and 28605 in the list.
The "code pages" 65000 (UTF-7) and 65001 (UTF-8) which are listed at the end as they are not really code pages, see Is codepage 65001 and utf-8 the same thing?

I agree that the grouped listing as used for encoding selection via status bar is better. But I think, this list is hard coded in UltraEdit while the list in code page selection dialog is filled by calling a Windows kernel function.

rblock · Sep 13, 2014 · #7 · 2014-09-13T18:54+00:00

Hi Mofi,

I already went crazy over the codepage stuff during the last months, because there was always a difference between the status bar and the menu selection.

I wanted to translate the documents of WriteItNow from English into German and convert them from 8859-1 to UTF-8. But at a certain point I stopped using UE and turned to WebStorm because it chooses the right code page for XML files from the declaration at the beginning of the file...

Crap! What is it with the spell checker here in this editor? The stupid thing checks only for German.

I already found those entries in the UE config file some months ago. Quite a lot of them.

Even if the list in code page selection Dialog is filled by calling a Windows kernel function it would be possible to fill a list, dictionary, struct or something else first before adding this as data source to the combo box.
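
A minimal Python sketch of that idea: collect the entries first, then sort them numerically and filter by substring before handing them to the control. The entries and display names below are illustrative placeholders, not the actual list UltraEdit or Windows returns, and the function names are made up.

```python
def sort_code_pages(entries):
    """Sort (number, name) pairs by code page number instead of by text."""
    return sorted(entries, key=lambda entry: entry[0])


def search_code_pages(entries, needle):
    """Case-insensitive substring search over number and display name."""
    needle = needle.lower()
    return [
        entry for entry in entries
        if needle in entry[1].lower() or needle in str(entry[0])
    ]


# Hypothetical sample data in the order an enumeration might deliver it.
entries = [
    (65001, "UTF-8"),
    (1252, "ANSI - Latin I"),
    (28603, "ISO 8859-13 Baltic"),
    (437, "OEM - United States"),
]

print(sort_code_pages(entries))
print(search_code_pages(entries, "iso"))
```

Sorting on the integer avoids the text ordering in which "1252" comes before "437".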

But it depends on the abilities of the developer. And when I recall how often I still see dialogs that have a fixed size... really nice if you are browsing for a folder or file, or searching for something with long names, and 80 % of it is not visible, with no scroll bars and no help text showing the full value. These are moments when even I'd like to use a baseball bat.

Sighing greetings

Reiner

Mofi · Sep 13, 2014 · #8 · 2014-09-13T20:02+00:00

I suppose that 99.9% of all UltraEdit users never set code page manually and therefore not much time was spent by the IDM developers to make the dialog easier to use. I think, for the status bar a different solution than a simple list was necessary because the entire list does not fit on many screens. Grouping the code pages was a solution to get the lists smaller on screen in menus instead of a list box.

There is configuration setting Auto code page detection at Advanced - Configuration - File Handling - Code Page Detection. UltraEdit can detect the code page of a HTML, XHTML or XML file automatically if this setting is enabled and the file is really encoded with this code page.

UTF-8 detection is automatic independent on this setting as long as Auto detect UTF-8 files is enabled at Advanced - Configuration - File Handling - Unicode/UTF-8 Detection.

Of course all those detection features do not help if the encoding/charset declaration does not match the encoding really used in file. I have seen so many HTML and XML files with a UTF-8 encoding declaration in header although the file was single byte encoded in Windows-1252. Why? The creators of those HTML and XML files copied the line with UTF-8 encoding declaration into their files without knowing what this really means. And then often the text editor or the web browser is blamed for not displaying the content of the file correct or having a file with mixed encodings used instead of the creator of the file not reading more about character set or encoding declaration before using it in the files. Encoding declaration not matching real encoding is one reason why it is possible to set the code page manually.
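
Such a mismatch can be checked with a minimal Python sketch. It assumes the file has no byte order mark, that the XML declaration sits at the very start of the file, and that the declared name is one Python's codecs know; the function names are made up for illustration. It only catches mismatches that produce invalid byte sequences, so a file declared as Windows-1252 but containing UTF-8 bytes usually passes unnoticed.

```python
import re

# Encoding name inside an XML declaration at the very start of the data.
DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def declared_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, or the default."""
    match = DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def matches_declaration(data: bytes) -> bool:
    """True if the bytes can be decoded with the declared encoding."""
    try:
        data.decode(declared_encoding(data))
    except (UnicodeDecodeError, LookupError):
        return False
    return True


doc = '<?xml version="1.0" encoding="UTF-8"?><a>r\u00fcck</a>'
print(matches_declaration(doc.encode("utf-8")))   # declaration is honest
print(matches_declaration(doc.encode("cp1252")))  # declared UTF-8, stored as 1252
```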

With German version of UltraEdit just the German dictionary files are packed to the installer package and installed. The English dictionary files are packed by default just with English UltraEdit. From Downloads - Extras - Dictionaries other Aspell dictionaries can be downloaded and installed.

After installing dictionaries for other languages like English or Russian, it is possible to switch at Advanced - Configuration - Spell Checker - Dictionary to a different language for spell-checking the files in running instance of UltraEdit with the selected dictionary.

The Check Spelling dialog has the button Options to open the configuration dialog for quickly selecting a different dictionary if text files in different languages are opened in same instance of UltraEdit. For a translator it is most likely better to have 2 instances of UltraEdit open with spell-checking being enabled by default: one with dictionary X and the other one with dictionary Y selected. And in first instance the files in language X are opened and in the other instance the files in language Y.

And yes, the spell-checking feature is not as good as in word processing applications which even support the usage of multiple languages within same file, but spell-checking is done rarely in text editors as they are mainly used for code writers where the big majority do not spell check ever in my experience.

(20 code writers are in my department, but only one is using spell-checking, guess who. The others write comments that terrify everybody reading them. Good luck on searching for a word in the comments in those source code files. It is better to use regular expression searches with lots of wildcards. Of course indications of wrongly written words in MS Word documents or MS Outlook emails are ignored by my colleagues, too. I think the red wavy line is already filtered out by their brains as unimportant information.)

rblock · Sep 14, 2014 · #9 · 2014-09-14T08:50+00:00

Hi Mofi,

Mofi wrote:There is configuration setting Auto code page detection at Advanced - Configuration - File Handling - Code Page Detection. UltraEdit can detect the code page of a HTML, XHTML or XML file automatically if this setting is enabled and the file is really encoded with this code page.

UTF-8 detection is automatic independent on this setting as long as Auto detect UTF-8 files is enabled at Advanced - Configuration - File Handling - Unicode/UTF-8 Detection.

I know this and tried different settings to avoid wrong codepage recognition. But I didn't find a satisfying solution.

Mofi wrote:Of course all those detection features do not help if the encoding/charset declaration does not match the encoding really used in file. I have seen so many HTML and XML files with a UTF-8 encoding declaration in header although the file was single byte encoded in Windows-1252. Why? The creators of those HTML and XML files copied the line with UTF-8 encoding declaration into their files without knowing what this really means. And then often the text editor or the web browser is blamed for not displaying the content of the file correct or having a file with mixed encodings used instead of the creator of the file not reading more about character set or encoding declaration before using it in the files. Encoding declaration not matching real encoding is one reason why it is possible to set the code page manually.

But this can't be professionals, can they?

Mofi wrote:For a translator it is most likely better to have 2 instances of UltraEdit open with spell-checking being enabled by default: one with dictionary X and the other one with dictionary Y selected. And in first instance the files in language X are opened and in the other instance the files in language Y.

And yes, the spell-checking feature is not as good as in word processing applications which even support the usage of multiple languages within same file, but spell-checking is done rarely in text editors as they are mainly used for code writers where the big majority do not spell check ever in my experience.

That's not what I meant. I was annoyed because of the problems with the codepage. The original files of WriteItNow are ISO-8859-1 (and ISO-8859-15), and the developer is currently trying to convert the whole application to UTF-8 but still has trouble with it. Therefore I tried to convert them for the translation too, first by changing, of course, the encoding attribute of the XML file. But it still didn't work until I understood the meaning of the status bar selector and the view selector. As long as I didn't switch both to UTF-8, it wasn't saved as I wanted. And sometimes it still opened in 8859-1 or 1252 until I deleted the corresponding lines in the config file.

Mofi wrote:(20 code writers are in my department, but only one is using spell-checking, guess who. The others write comments that terrify everybody reading them. Good luck on searching for a word in the comments in those source code files. It is better to use regular expression searches with lots of wildcards. Of course indications of wrongly written words in MS Word documents or MS Outlook emails are ignored by my colleagues, too. I think the red wavy line is already filtered out by their brains as unimportant information.)

I wasn't angry about the spell checking in UE but about the text editor here in the forum. This was, I guess, because yesterday I was using IE10 instead of Firefox 32, and there a spell checker for German is activated, and every two or three words it capitalized the first letter of a word I wrote.

And there is no possibility to add a button to the button bar to switch between the installed languages. It sticks to the default language, in my case German.

O.k., yesterday I wanted to get something done, but all day until the evening I just cleaned up my software installations. I uninstalled software that is no longer needed, like the different MS SQL Server versions. And perhaps I'll back up my system, additionally create a VM image, and then uninstall the whole Visual Studio stuff because I really don't need it anymore.

So now I have to start and do some work today to go on with my project.

Wow, at this moment there was a small glimpse of sunlight outside!

Eager greetings

Reiner

Mofi · Sep 14, 2014 · #10 · 2014-09-14T17:36+00:00

rblock wrote:But this can't be professionals, can they?

Professionals? No. The large majority of webpages are not created by professionals. Even the developers of web editors have often not read the HTML and CSS standards; at least this is my assumption from looking at HTML files produced for example by RoboHelp and some other WYSIWYG editors producing HTML files. (Example: <td width="125px"> - there was never a unit px defined for HTML attribute width, just for CSS property width.)

rblock wrote:Therefore I tried to convert them for the translation too, first by changing, of course, the encoding attribute of the XML file. But it still didn't work until I understood the meaning of the status bar selector and the view selector.

Ah, I see. Yes, changing just the declaration of the encoding in header has no effect if not converting at the same time also the file to UTF-8. Both must be changed, the declaration and of course the encoding of the file at the same time.

With selecting Unicode - UTF-8 in status bar, UltraEdit runs a conversion to UTF-8 like when using File - Conversions - ASCII to UTF-8, and next the declaration in header of the XML file must be changed, too.

As many HTML, XHTML and XML writers do not know that, I have written a script to convert files in a folder to UTF-8 which additionally changes also the charset and encoding declaration in HTML, XHTML and XML files.
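
The two steps that must happen together can be sketched in Python. This is not the script mentioned above, only a minimal illustration for a single XML file held in memory; it assumes the source is Windows-1252 (Python codec `cp1252`) and that the file starts with an XML declaration carrying an encoding attribute. The function name `to_utf8` is made up.

```python
import re

# Captures everything up to the opening quote of the encoding value (group 1)
# and the closing quote (group 2), so only the value itself is replaced.
ENCODING_ATTR = re.compile(r'(<\?xml[^>]*encoding=["\'])[^"\']+(["\'])')


def to_utf8(data: bytes, source_encoding: str = "cp1252") -> bytes:
    """Re-encode the bytes as UTF-8 and update the XML encoding declaration."""
    text = data.decode(source_encoding)
    text = ENCODING_ATTR.sub(r"\g<1>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


original = (
    '<?xml version="1.0" encoding="Windows-1252" ?>\r\n'
    "<a>r\u00fcck</a>"
).encode("cp1252")

converted = to_utf8(original)
print(converted.decode("utf-8"))
```

Changing only the declaration, or only the bytes, leaves the file in exactly the inconsistent state described in this thread.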

BTW: The help page with title Unicode and UTF-8 Support (last item in Getting Started section on Contents tab) describes briefly (and not 100% precisely) how UTF-8 detection works in UltraEdit, which is the most difficult one, as a UTF-8 encoded file with no character with a code value greater than 127 is 100% identical to an ASCII encoded file. Therefore an editor supporting single byte encoded text files as well as Unicode encoded text files needs to know from other sources like BOM or charset/encoding declaration that Unicode encoding with special storage format UTF-8 should be used for a file.

rblock wrote:I wasn't angry about the spell checking in UE but about the text editor here in the forum. This was, I guess, because yesterday I was using IE10 instead of Firefox 32, and there a spell checker for German is activated, and every two or three words it capitalized the first letter of a word I wrote. :) And there is no possibility to add a button to the button bar to switch between the installed languages. It sticks to the default language, in my case German. :(
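
The ambiguity can be made concrete with a minimal Python sketch. It is not UltraEdit's detection algorithm, only an illustration of what the bytes alone can and cannot tell; the function name `sniff` and its result strings are made up.

```python
import codecs


def sniff(data: bytes) -> str:
    """Classify bytes by what can be concluded without a declaration."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8 (BOM)"
    if all(byte < 128 for byte in data):
        # Pure ASCII bytes are identical in ASCII, UTF-8 and Windows-1252.
        return "ascii-compatible (ambiguous)"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "single-byte code page (not valid UTF-8)"
    return "utf-8 (valid multi-byte sequences)"


print("plain".encode("ascii") == "plain".encode("utf-8"))
print(sniff(b"plain text"))
print(sniff("r\u00fcck".encode("utf-8")))
print(sniff("r\u00fcck".encode("cp1252")))
print(sniff(codecs.BOM_UTF8 + b"plain text"))
```

For the ambiguous case an editor has to fall back on a BOM, a charset/encoding declaration, or a user setting.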

Yes, the only dictionaries installed by default for spell-checking in Internet Explorer are those of the operating system language. I know that as we use at work only English Windows and therefore also only English applications, but I must write Wiki pages in German in IE. So I needed to install the German spell-checking language support and select German as default spell-checking language. This can be done from within IE and does not cost anything (in comparison to installing other languages for spell-checking in Office 2010). How can I change the spell check and auto-correction language of IE10/Windows8? and ieSpell - Spell Checker add-on for Internet Explorer might be interesting for you.