Why are accented French characters written into an HTML file not displayed correctly in the browser?

    Mar 29, 2017#1

    Hi folks,
    I was using an old version of UE, and suddenly I had trouble with the encoding of my old files.
    Previously I was using a French ISO encoding and everything was working, but suddenly, when I opened files in UE via FTP, the accents were wrong. So I decided to change my encoding to UTF-8, but then, even though it looked okay on my webpage (after changing the charset), the accents are wrong in UE.
    I thought I had made a mistake somewhere, so I downloaded the trial of the currently latest version of UE.
    I opened a new file, converted it to UTF-8, and when I type the French accented characters (é, è, à, etc.), they all appear as squares in UE.

    Where can it come from?

    Thanks for any help

      Mar 30, 2017#2

      It is nearly impossible to help you without having the file. Please pack this file into a ZIP or RAR archive and upload the archive as an attachment to your next post. Then we can look at how the characters are really encoded and whether the charset declaration matches the text encoding actually used.

      UltraEdit shows the detected encoding of the file and the line terminator type in the status bar at the bottom. For example, in older versions of UltraEdit, U8-DOS means a UTF-8 encoded file with DOS/Windows line terminators. In the currently latest UltraEdit, the default status bar has an encoding selector which not only displays the current encoding of the file, but also makes it possible to convert the file to a different encoding with a few clicks. The line terminator type is shown in a separate box in the status bar.

      For displaying the characters, the configured font matters. The font must support the code page, or respectively the Unicode characters, used in the active file. A square is shown if the font does not support a character. All fonts designed for text display support Western European characters. Only fonts designed for drawing, like Wingdings, or for special purposes, like Terminal, do not. Which font have you configured for normal text editing (a proportional or a fixed width font can be used there), and which font for column/hex mode editing (where only a fixed width font can be used)?
      Best regards from an UC/UE/UES for Windows user from Austria

        Mar 30, 2017#3

        Hi again, here it is:

        I made a new file in UE and converted it to UTF-8 / DOS.
        I wrote the characters éàèù, and I got 4 squares.
        I saved it, rar'ed it, and attached it here.

        About the font I use, where can I find it? All the other characters have no problems.

        I saved my file via FTP, on a Linux server, and then I opened this file by SSH.

        é became <82>
        à became <85>
        è became <8a>

        I hope this is helpful.

        Edit: Attached file was deleted later after issue was solved.

          Mar 30, 2017#4

          I can see the problem now. I suggest that you first read about Unicode text and Unicode files in UltraEdit/UEStudio, especially the brief overview of Unicode.

          You are using a region dependent OEM code page like code page 850 when entering the text. By default, Windows uses the OEM code pages only in console / command prompt windows, not in applications with a graphical user interface (GUI applications). You can open a command prompt window and run the command chcp to display which code page is used in the console for your user account according to your Windows Region and Language settings.

          But you let UltraEdit think that the text file is encoded using the region dependent standard code page for GUI applications, like Windows-1252. The entered OEM text is displayed correctly most likely because an OEM font like Terminal is used, perhaps because the font setting Use OEM fixed pitch font is enabled. UltraEdit can automatically detect the various Unicode encodings, but it cannot automatically detect which code page is used in a text file that uses only 1 byte per character and contains no information about its encoding. The purpose of charset= in HTML/XHTML files and encoding= in XML files is to tell an application reading the bytes of the file how to interpret them. Of course, the charset/encoding declaration must match the text encoding really used for the characters in the file.
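
To illustrate why a single-byte file cannot be identified automatically, here is a minimal Python sketch (not something UltraEdit runs). It takes the four bytes discussed in this thread and decodes them with two different code pages. The variable names are only for illustration.

```python
# The four bytes from this thread, with no encoding information attached.
raw = bytes([0x82, 0x85, 0x8A, 0x97])

# Interpreted with OEM code page 850 (the console code page reported by chcp).
as_oem = raw.decode("cp850")

# Interpreted with ANSI code page Windows-1252 (the GUI default).
as_ansi = raw.decode("cp1252")


def code_points(text):
    """Return the Unicode code points of text in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(as_oem))   # U+00E9 U+00E0 U+00E8 U+00F9 (the accented letters)
print(code_points(as_ansi))  # U+201A U+2026 U+0160 U+2014 (punctuation and S with caron)
```

Both decodings succeed without error, so nothing in the bytes themselves says which code page was intended.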

          So you have entered éàèù with an OEM code page like OEM 850, which results in storing the bytes with the hexadecimal values 82 85 8A 97.

          Then you converted those 4 OEM encoded characters to UTF-8 without first converting them to ANSI and without setting the code page really used. This resulted in storing the bytes C2 82 C2 85 C2 8A C2 97 in the file. But those 4 characters, now UTF-8 encoded, are not éàèù in Unicode. They are control characters in Unicode, as can be seen, for example, at UTF-8 encoding table and Unicode characters.

          éàèù correctly encoded in UTF-8 would be the byte stream C3 A9 C3 A0 C3 A8 C3 B9. This would be the result if the 4 characters had been entered using, for example, the ANSI code page Windows-1252, which stores the 4 bytes E9 E0 E8 F9 in the file, and the Windows-1252 encoded file had then been converted to UTF-8.
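
The byte sequences above can be reproduced with a short Python sketch. It is an illustration only: the `latin-1` step is an assumption that models a conversion treating each byte value as the Unicode code point with the same number, which matches the bytes reported in this thread. The `hexdump` helper is made up for the example.

```python
text = "\u00e9\u00e0\u00e8\u00f9"  # the four characters é à è ù


def hexdump(data):
    """Format bytes as space-separated upper-case hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


oem_bytes = text.encode("cp850")    # as typed with OEM code page 850
ansi_bytes = text.encode("cp1252")  # as typed with ANSI code page Windows-1252

# Wrong path: OEM bytes taken as code points U+0082, U+0085, ... and then
# written as UTF-8 (assumption: modelled here with the latin-1 codec).
wrong_utf8 = oem_bytes.decode("latin-1").encode("utf-8")

# Right path: the real characters written as UTF-8.
right_utf8 = text.encode("utf-8")

print(hexdump(oem_bytes))   # 82 85 8A 97
print(hexdump(ansi_bytes))  # E9 E0 E8 F9
print(hexdump(wrong_utf8))  # C2 82 C2 85 C2 8A C2 97
print(hexdump(right_utf8))  # C3 A9 C3 A0 C3 A8 C3 B9

# Under the same assumption, the damage can be undone by reversing the steps.
repaired = wrong_utf8.decode("utf-8").encode("latin-1").decode("cp850")
```

Reversing the steps only works as long as the wrongly converted file has not been edited or converted again in the meantime.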

          In UltraEdit < v23.20, and in UE >= v23.20 using toolbar/menu mode with traditional menus, the fonts can be configured by clicking in menu View on the menu items Set Font and Set HEX/Column Mode Font.

          In UltraEdit >= v23.00 using ribbon mode, click on ribbon tab View on the down arrow of the leftmost symbol Fonts, which opens a popup menu with the items Set font and Set hex / column mode font to change the fonts.

          In UltraEdit >= v23.00 using toolbar/menu mode with contemporary menus, click in menu View on the submenu Fonts at the top, which contains the menu items Set font and Set hex / column mode font.

          The GUI mode can be changed at any time in UE v24.00.0.56 (currently latest UE) by right clicking on the ribbon or toolbar and clicking on the appropriate context menu item.

          The default font is Consolas Regular 10 on first start of UltraEdit with no existing configuration settings.

          For HTML editing with UTF-8, I would strongly recommend configuring UltraEdit to create new files in UTF-8 by default, which results in Unicode editing from the beginning. In UE for Windows v24.00, open Advanced - Settings/Configuration - File Handling - Encoding and select UTF-8 in the first list box. I think the default ANSI code page for a French user is 1252 (ANSI - Latin I), as it is for me (English and German OS, but German locale on all computers).
          Best regards from an UC/UE/UES for Windows user from Austria

            Mar 30, 2017#5

            The command chcp is indeed returning:

            Active code page: 850
            So if you say this OEM code page is used in the console / command prompt, why is it used in UE?
            Why did it change suddenly?
            I had my whole website done and working in ISO-8859-1 and suddenly, out of the blue, each time I try to modify a file, I can't enter accents properly anymore. So I tried to change to Unicode, but now they don't display properly in UE.
            I'm kinda lost.
            Did I change something by mistake? Can it be a virus?
            How can I go back to my old ISO-8859-1, which was working like a charm?

            If I change my font, like you said, to Consolas Regular 10, I only get blank characters.
            Thanks for your help.

              Mar 31, 2017#6

              Here is what I think happened on your computer.

              While using the old UltraEdit, you clicked by mistake on menu item OEM Character Set in menu View. With this option enabled, UltraEdit automatically inserts characters with a code value greater than 127 decimal, like éàèù (French) or äöüÄÖÜß (German), with the code value according to the system OEM code page defined by the Windows Region and Language settings.

              But OEM Character Set is a command that toggles an option. It does not convert the characters in the active file. It just enables writing text from now on using the system OEM code page instead of the system ANSI code page used by default.

              This option is very useful for writing batch files where the code must be written using the OEM code page, for example when text containing éàèùäöüÄÖÜß should be output into the console window with the command echo on batch file execution. But this option is definitely not useful for writing/editing HTML files.

              The option toggled by OEM Character Set is set per file extension group. The file extension groups can be configured at Advanced - Settings/Configuration - Editor - Word Wrap/Tab Settings. I suppose you have only the group Default. Therefore this option is now enabled for all non Unicode files.

              UltraEdit also has the commands OEM to ANSI and ANSI to OEM to convert everything in the active file to ANSI or OEM respectively. Those two commands are, for example, in menu File in submenu Conversions in UE < v23.00 or in UE v24.00 with traditional menus.

              However, since UE v14.10 the command OEM Character Set still exists in UltraEdit, but is no longer available in any menu or toolbar, or in the customization dialogs for menu, toolbar and ribbon. I suppose that IDM removed this command because too many users enabled it by mistake, although it is very useful for users like me who often write batch files. See the forum topics Manual customization of command OEM Character Set and UE/UES configuration to edit batch files (*.bat, *.cmd) by default with OEM character set.

              You do not need this option as it is not useful for editing HTML files. So you need to toggle this option off in the old version of UltraEdit by clicking once again in menu View on menu item OEM Character Set.

              But if you are now using UE v24.00 with the configuration taken over from the old version of UltraEdit, you have to toggle this option off by editing the INI file, as the command does not exist anymore in any menu/toolbar/ribbon of UE v24.00. I wrote instructions on how to add this command to the toolbar, but I suggest not doing all those steps as you do not really need this command.

              Do the following to disable OEM Character Set in the INI file of UltraEdit v24.00:
              1. Exit all running instances of UltraEdit.
              2. Copy %APPDATA%\IDMComp\UltraEdit to the clipboard, paste this folder path with Ctrl+V into the address bar of Windows Explorer and press RETURN to open this folder, which is hidden by default. There is at least 1, but more likely several uedit*.ini files. 64-bit UltraEdit v24.00 uses only uedit64u.ini and 32-bit UltraEdit v24.00 uses only uedit32u.ini. Any other uedit*.ini files are from previous versions of UltraEdit.
              3. Open the INI file used by UltraEdit v24.00 in Windows Notepad.
              4. Search for the setting Force OEM, which exists as Force OEM=, Force OEM2=, Force OEM3=, ... in section [Settings].
              5. Change each Force OEM value from 1 to 0. I suppose there is only Force OEM=1, which needs to be changed to Force OEM=0.
              6. Save and close the edited INI file.
              7. Repeat steps 3 to 6 for the other INI files if you plan to uninstall UE v24.00 and reinstall the old UltraEdit version.
              8. Start UltraEdit. The accented characters are now inserted into non Unicode files using the system ANSI code page again, i.e. Windows-1252 in your case.
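
For anyone who prefers a script to editing by hand, steps 4 and 5 could be sketched in Python roughly as follows. This is a hypothetical helper, not an IDM tool. It works on the INI content as a string, because the file encoding of the uedit*.ini files is not stated in this thread, so reading and writing the file (and making a backup first) is left to the user. The function name and the sample content are invented for illustration.

```python
import re


def disable_force_oem(ini_text: str) -> str:
    """Return ini_text with every 'Force OEM*=1' in [Settings] set to 0."""
    result = []
    in_settings = False
    for line in ini_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # A new section header starts; remember whether it is [Settings].
            in_settings = stripped.lower() == "[settings]"
        elif in_settings:
            # Matches Force OEM=1, Force OEM2=1, Force OEM3=1, ...
            line = re.sub(r"^(Force OEM\d*=)1(\s*)$", r"\g<1>0\2", line)
        result.append(line)
    return "".join(result)


# Made-up sample content, only to show the effect.
sample = "[Settings]\nForce OEM=1\nForce OEM2=0\nForce OEM3=1\n[Other]\nForce OEM=1\n"
print(disable_force_oem(sample))
```

Only lines inside [Settings] are touched, and all other lines, including their line terminators, are passed through unchanged.
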
              Note:

              ISO 8859-1 defines no printable characters in the code value range 7F to 9F, while Windows-1252 contains Western European characters in this range, like the currency symbol €.

              So be careful with charset=iso-8859-1 in the header of an HTML file: do not insert a character of this range without encoding it with the appropriate HTML entity, i.e. use &euro; instead of €. Otherwise your HTML file is not really valid because it uses characters not defined in the declared character set. It would be better to use charset=windows-1252 when inserting € not encoded as an HTML entity into an ANSI encoded HTML file.

              In practice, for displaying the HTML file in a browser, it does not really matter if € is in the HTML file as a single byte with hexadecimal value 80 while the file contains the character set declaration charset=iso-8859-1. All browsers interpret an HTML file with charset=iso-8859-1 identically to an HTML file with charset=windows-1252 and convert the bytes correctly to Unicode for display. The browser manufacturers know that most HTML writers do not know what charset= really means and that there are differences between ISO 8859-1 and Windows-1252.
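
The difference between the two character sets for byte 80 can be seen in a small Python sketch. Note that Python's `latin-1` codec follows ISO 8859-1 strictly and does not apply the browser behaviour described above, which is exactly what makes the difference visible here. The variable names are only for illustration.

```python
euro_byte = b"\x80"

# Windows-1252 defines byte 80 as the euro sign U+20AC.
as_windows_1252 = euro_byte.decode("cp1252")

# Python's strict ISO 8859-1 codec maps byte 80 to the control character U+0080.
as_iso_8859_1 = euro_byte.decode("latin-1")

print(f"U+{ord(as_windows_1252):04X}")  # U+20AC
print(f"U+{ord(as_iso_8859_1):04X}")    # U+0080
```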

              Important to know for every HTML/XHTML writer:

              The character set declaration in an HTML file defines how to interpret the bytes in the file. It does not define which character set to use for displaying the contents. All browsers convert the text to Unicode for display/print when loading the HTML file. An HTML file with German text containing only ASCII characters, because äöüÄÖÜ߀ and other characters with a code value greater than 7F are encoded with HTML entities, can be declared with charset=us-ascii and is nevertheless displayed correctly by any browser, as the browser converts the byte stream with the HTML entities correctly to Unicode for displaying the text.
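
As a rough illustration of the us-ascii case, the following Python sketch mimics the two steps a browser performs: first the bytes are interpreted using the declared charset, then the HTML entities are resolved to Unicode characters. It uses the standard library function `html.unescape` and a made-up text fragment. It is not a model of a real browser's parser.

```python
import html

# Made-up German fragment using only ASCII characters plus HTML entities.
markup = "Gr&uuml;&szlig;e, 5 &euro;"

# Step 1: the file on disk is pure ASCII, so decoding as us-ascii succeeds.
raw = markup.encode("ascii")
decoded = raw.decode("ascii")

# Step 2: entities are resolved to Unicode characters for display.
text = html.unescape(decoded)

print([f"U+{ord(ch):04X}" for ch in text if ord(ch) > 0x7F])
# ['U+00FC', 'U+00DF', 'U+20AC']
```
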
              Best regards from an UC/UE/UES for Windows user from Austria

                Mar 31, 2017#7

                You are a lifesaver! :)
                Thanks for those precise and valuable details!
                From now on, I will always use UTF-8 everywhere! ;)
                But for this particular website, it was way too much work to change all my files/DB.

                I wonder how I toggled this option by mistake tho.
                Thanks again, and if IDM doesn't pay you, they should. :D