Finding invalid characters

Newbie

    May 23, 2019 #1

    I opened a file in UltraEdit 25.00.0.82 (Windows 10, 64-bit) and received the following error:

    The file, File.htm, has been detected as UTF-8 but includes invalid UTF-8 characters.

    Invalid characters will display as: � (may vary depending on font)

    Editing this file as UTF-8 may result in file corruption.

    Is there a way to search through a directory of files for any that contain invalid characters?

    Thanks in advance!!

    Grand Master

      May 23, 2019 #2

      How much do you know about character encoding, especially in HTML files, where a character set declaration tells every application interpreting the bytes of the file which character encoding was used to store the characters?

      I suppose your knowledge about character encoding is limited, so I suggest first reading carefully, from top to bottom, my second post in "Script to convert special characters to HTML" and everything linked from that post.

      Further, I suppose the HTML files are declared as UTF-8 encoded, but everything in them is in reality ANSI encoded with Windows-1252 or another code page, depending on the default ANSI code page set by Windows according to the configured country. In this case the right fix is to correct the character set declaration in the HTML files.
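To illustrate why such a file triggers the warning, here is a minimal Python sketch. It is not something UltraEdit runs, and the sample word and the helper name `is_valid_utf8` are made up for the example. In Windows-1252 the character é is stored as the single byte 0xE9, which is not a complete UTF-8 sequence on its own.

```python
# Hypothetical sample text: "café" as saved with Windows-1252 versus UTF-8.
ansi_bytes = "café".encode("cp1252")   # b'caf\xe9'
utf8_bytes = "café".encode("utf-8")    # b'caf\xc3\xa9'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(ansi_bytes))    # False: 0xE9 starts a sequence that never completes
print(is_valid_utf8(utf8_bytes))    # True
print(ansi_bytes.decode("cp1252"))  # café, once the correct code page is named
```

The same bytes read fine as soon as they are interpreted with the code page they were written in, which is why correcting the declaration is enough when the whole file is ANSI encoded.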

      A Find in Files that finds characters in HTML files which are declared as UTF-8 encoded and really contain UTF-8 encoded characters, but also contain ANSI encoded characters, is not really possible, as you hopefully know after reading everything written by me. It must at least be known which ANSI encoding is used for those characters in the UTF-8 declared and partly UTF-8 encoded HTML files in order to define one or more Perl regular expression search strings that Find in Files can use to find the ANSI instead of UTF-8 encoded characters.
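Outside UltraEdit, a different technique can at least list the affected files: try to decode each file strictly as UTF-8 and report the ones that fail. The following Python sketch does that. The function name `find_invalid_utf8` and the extension filter are assumptions for illustration. It only reports that a file is not valid UTF-8 and where the first bad byte sits; it cannot tell which ANSI code page those bytes were written in.

```python
import os


def find_invalid_utf8(root, extensions=(".htm", ".html")):
    """Yield (path, byte_offset) for each matching file under root
    whose content is not valid UTF-8."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # Read raw bytes so that no encoding is assumed up front.
            with open(path, "rb") as handle:
                data = handle.read()
            try:
                data.decode("utf-8")
            except UnicodeDecodeError as error:
                # error.start is the offset of the first invalid byte.
                yield path, error.start
```

It could be used as `for path, offset in find_invalid_utf8("."): print(path, offset)`. Files stored as UTF-16 would most likely be reported as well, so the list still needs a look by eye.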
      Best regards from an UC/UE/UES for Windows user from Austria

      Newbie

        May 23, 2019 #3

        Your supposition of my knowledge of encoding, while a bit condescending 😳, is absolutely correct! However, your reference to "configured country" does go along with a suspicion I had. I worked for a few months with a couple of contractors based out of Belfast. We were using the same tools, primarily Microsoft Word and MadCap Flare, but I suspected there may have been some setting on their side that was different and resulted in these changes. It would be helpful to know where to look for these settings (Windows Control Panel, or some other place), so we can avoid these problems in the future. Any thoughts? You mention the "default ANSI code page set by Windows according to configured country". Where would I find that?

        Grand Master

          May 24, 2019 #4

          For each user account the user can set the country and the language. This is usually done on the first start of Windows and on creation of a user account, but the settings can be changed at any time. On Windows 7 open Control Panel and click on Region and Language. Look at every tab and click on every button that opens another window. At the top of the tab Formats there is the setting Format. Changing this setting changes the UserLocale. On the tab Administrative, click on the button Change system locale... to change the SystemLocale. See also Locale.

          After changing Format and the system locale, Windows may change the OEM code page used in console windows and the ANSI code page used in GUI windows for non-Unicode text, depending on which country and which locale language were selected.

          The OEM code page used by default after changing Format can be seen, for example, by opening a Windows command prompt window after making the modification and running chcp (change code page), which shows the active code page for the command process.

          The more important ANSI code page for GUI applications cannot be seen easily because Windows does not display it anywhere. It can be determined by opening a PowerShell window and executing a PowerShell command, see the answer on How can I manually determine the CodePage and Locale of the current OS?
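As an alternative to the PowerShell command from the linked answer (which is not reproduced here), both code pages can also be read with a few lines of Python. This is a minimal sketch that assumes a standard Python installation on Windows; the function name `active_code_pages` is made up for the example. It calls the Win32 functions GetACP and GetOEMCP through ctypes and returns None on other operating systems.

```python
import sys


def active_code_pages():
    """Return the ANSI and OEM code page numbers on Windows, else None."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported here because windll exists only on Windows

    kernel32 = ctypes.windll.kernel32
    return {
        "ansi": kernel32.GetACP(),   # code page for non-Unicode GUI text
        "oem": kernel32.GetOEMCP(),  # default code page for console windows
    }


print(active_code_pages())
```

On a machine configured for a Western European country the result would typically contain 1252 for ANSI. The OEM value may differ from what chcp reports if the code page of that console window was changed with chcp beforehand.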

          In UltraEdit open Advanced - Settings or Configuration - File handling - Encoding and look at the settings. By default UltraEdit uses the default code page for ANSI encoding according to what is configured for the active user account. But please note that, once set, the default code page for ANSI encoding does not change when Format is changed in the Region and Language settings.
          Best regards from an UC/UE/UES for Windows user from Austria