How to convert all files in a folder to UTF-8?

Otomatic · Jul 11, 2013 · #1 · 2013-07-11T14:47+00:00

Hi,

I want to convert all *.php and *.txt files in a folder from ISO-8859-1 (or 1252 - ANSI Latin1) to UTF-8 except ones that are already UTF-8 encoded. Is there a way to do that with Replace in Files?

Thanks.

Mofi · Jul 11, 2013 · #2 · 2013-07-11T17:51+00:00

You could run several case-sensitive replaces (in the worst case 128 of them), each searching for a single ANSI character with a byte value greater than 127 and replacing it with the appropriate UTF-8 byte sequence. But in files that are already UTF-8 encoded, that would damage the existing character encodings.

So it would be better to do this with a macro as posted at Convert multiple files to UTF-8.
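
For illustration, the byte mapping behind such replaces can be sketched in JavaScript. This is a minimal sketch that assumes the source encoding is ISO-8859-1, where the byte value equals the Unicode code point. Windows-1252 assigns different characters to the bytes 0x80 to 0x9F and would need a lookup table for that range. The function name latin1ByteToUtf8 is made up for this example.

```javascript
// Hypothetical helper: map one ISO-8859-1 byte to its UTF-8 byte sequence.
// Assumption: the byte value is identical to the Unicode code point,
// which holds for ISO-8859-1 but not for bytes 0x80-0x9F in Windows-1252.
function latin1ByteToUtf8(byteValue) {
    if (byteValue < 0 || byteValue > 255) {
        throw new RangeError("Expected a byte value from 0 to 255.");
    }
    if (byteValue < 128) {
        return [byteValue];                     // ASCII stays a single byte
    }
    // Code points 128 to 255 need two bytes: 110xxxxx 10xxxxxx
    return [0xC0 | (byteValue >> 6), 0x80 | (byteValue & 0x3F)];
}

// 0xE9 is "é" in ISO-8859-1 and becomes the two bytes 0xC3 0xA9.
var eAcute = latin1ByteToUtf8(0xE9);
```

Applying the same mapping to a file that is already UTF-8 would encode the bytes 0xC3 and 0xA9 a second time (as 0xC3 0x83 and 0xC2 0xA9), which is the kind of damage mentioned above.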

But nowadays with UltraEdit for Windows ≥ v17.30 or UEStudio ≥ v11.20 it is even better to use a script for this task.

Edited on 2013-12-05:

Because most of the script code for this task was already publicly available, and some script newbies needed additional help putting my script code snippets together into a working script, I decided to write a complete, full-featured script for converting text files to UTF-8 and publish it.

On the Macros & Scripts download page there is Convert files to UTF-8 with a link to the script file ConvertFilesToUtf8.js, which was last updated on 2018-11-14.

This script file can be downloaded, opened in UltraEdit or UEStudio, and executed by clicking Run active script in the Scripting menu (toolbar/menu mode with traditional menus) or by clicking Play script in the Script group on the Advanced ribbon tab (ribbon mode).

Attention: Users of a non-English version of UltraEdit/UEStudio must first edit the two strings sSummaryInfo and sResultsDocTitle in the script as explained in the script's comments.

The script asks the user whether it should run on the files in a directory or on a list of files, such as all open files or all project files.

If the script is run on all files of a directory, the user has to enter the path to the directory, the file type specification (which can contain wildcards and allows multiple strings separated by semicolons), and whether all matching files in all subdirectories of the specified directory should be processed, too.
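
How a file type specification such as *.php;*.txt selects files can be sketched as follows. This is only an illustration of the matching rule, not code from ConvertFilesToUtf8.js. The helper names wildcardToRegExp and matchesFileSpec are invented for this example, and the sketch assumes case-insensitive matching as is usual for file names on Windows.

```javascript
// Hypothetical helper: convert one wildcard pattern such as "*.php"
// into a regular expression ("*" = any characters, "?" = one character).
function wildcardToRegExp(pattern) {
    // Escape characters with a special meaning in regular expressions.
    var escaped = pattern.replace(/[.+^${}()|[\]\\]/g, "\\$&");
    var body = escaped.replace(/\*/g, ".*").replace(/\?/g, ".");
    // Assumption: file names are compared case-insensitively.
    return new RegExp("^" + body + "$", "i");
}

// Hypothetical helper: test a file name against a semicolon-separated
// list of wildcard patterns.
function matchesFileSpec(fileName, fileSpec) {
    var patterns = fileSpec.split(";");
    for (var i = 0; i < patterns.length; i++) {
        var pattern = patterns[i].replace(/^\s+|\s+$/g, "");  // trim spaces
        if (pattern !== "" && wildcardToRegExp(pattern).test(fileName)) {
            return true;
        }
    }
    return false;
}

var isPhpOrTxt = matchesFileSpec("index.php", "*.php;*.txt");
```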

The script first creates a list of files to process. Note: Files with Unicode characters in the name are not supported!

Then the script processes each file in the list and writes information about its activities to the output window.

Finally the script writes to the output window how many files were processed in total, how many were converted to UTF-8, how many were not modified because they were already Unicode files (encoded in UTF-8), and how many were skipped because they are binary files.

In every file converted to UTF-8, the script additionally searches for a character set declaration, as usually present in the head section of an HTML or XHTML file, and for an encoding declaration, as usually present in the first line of an XML file, and changes those declarations to UTF-8 to match the new encoding. The summary in the output window also reports how many character set declarations and how many encoding declarations were modified.
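
The idea of updating the declarations can be sketched in plain JavaScript. The regular expressions below are simplified assumptions written for this example and are not the ones used in ConvertFilesToUtf8.js. The function name updateDeclarations and the shape of its return value are hypothetical as well.

```javascript
// Hypothetical helper: set charset/encoding declarations to UTF-8 and
// count how many declarations of each kind were changed.
function updateDeclarations(text) {
    var result = { text: text, charsetCount: 0, encodingCount: 0 };

    // XML declaration at the very beginning of the file, for example
    // <?xml version="1.0" encoding="ISO-8859-1"?>
    result.text = result.text.replace(
        /^(<\?xml\b[^>]*?\bencoding\s*=\s*["'])([^"']+)(["'])/i,
        function (match, before, name, quote) {
            if (name.toLowerCase() === "utf-8") { return match; }
            result.encodingCount++;
            return before + "UTF-8" + quote;
        });

    // HTML/XHTML meta element, long form or short HTML5 form, for example
    // <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
    // <meta charset="windows-1252">
    result.text = result.text.replace(
        /(<meta\b[^>]*?\bcharset\s*=\s*["']?)([\w-]+)/gi,
        function (match, before, name) {
            if (name.toLowerCase() === "utf-8") { return match; }
            result.charsetCount++;
            return before + "utf-8";
        });

    return result;
}

var updated = updateDeclarations('<meta charset="windows-1252">');
```

Declarations that already name UTF-8 are left alone and not counted, so the two counters reflect only real modifications.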

Special hint:
If the script is to be run on all open files, it is better to add the script file to the list of scripts via Scripting - Scripts (traditional menus) or Advanced - All scripts (contemporary menus or ribbon mode) and run it from the Scripting menu (traditional menus) or from the Script List window opened with View - Views/lists - Script list (traditional menus) or Layout - Script list (contemporary menus or ribbon mode), instead of opening the script file in UE/UES and running it with the command Run active script (Alt+Shift+R).
The script has no code to skip itself on conversion to UTF-8, so on execution it would convert itself to UTF-8. Usually that does not matter: ConvertFilesToUtf8.js does not contain any character with a code value greater than 127, as long as localizing the two strings does not add ANSI characters. The conversion to UTF-8 therefore changes nothing in the byte content of the script file, and the script remains an ASCII file as required by the JavaScript interpreter. But that is only true if UE/UES is configured not to write a byte order mark (BOM) when saving UTF-8 encoded files; see the comments in the script for details about BOM behavior.
If a UTF-8 BOM is added by mistake to ConvertFilesToUtf8.js, the JavaScript interpreter writes a syntax error to the output window on execution, and the output window is not opened automatically in this case.
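
The two conditions described in this hint can be checked on the raw bytes of a file. The following sketch works on an array of byte values, however they were obtained (for example with a separate tool outside UltraEdit). It is not part of ConvertFilesToUtf8.js, and the function names hasUtf8Bom and isPureAscii are invented for this example.

```javascript
// Hypothetical helper: true if the bytes start with the UTF-8 BOM EF BB BF.
function hasUtf8Bom(bytes) {
    return bytes.length >= 3 &&
           bytes[0] === 0xEF && bytes[1] === 0xBB && bytes[2] === 0xBF;
}

// Hypothetical helper: true if no byte has a value greater than 127.
// For such a file a conversion to UTF-8 without BOM changes no byte.
function isPureAscii(bytes) {
    for (var i = 0; i < bytes.length; i++) {
        if (bytes[i] > 127) {
            return false;
        }
    }
    return true;
}

// The bytes of "//" preceded by a UTF-8 BOM.
var withBom = [0xEF, 0xBB, 0xBF, 0x2F, 0x2F];
var bomFound = hasUtf8Bom(withBom);
```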

The line and block comments can be removed from the script file by running a Replace All (from top of file) with a Perl regular expression, searching for ^ *//.+[\r\n]+|^ */\*[\s\S]+?\*/[\r\n]+| +//.+$ and using an empty replace string. The first of the three alternatives in this OR expression matches entire lines containing only a line comment, the second matches block comments, and the third matches line comments to the right of code. Removing the comments makes the script more efficient when it is used often, because the JavaScript interpreter has fewer characters and lines to interpret.
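
The effect of the search expression above can be tried outside UltraEdit with plain JavaScript. This is a sketch under one assumption: the m flag is used so that ^ and $ match at line boundaries, which is how the expression is meant to work in the Perl regular expression replace. The function name stripComments is made up for this example.

```javascript
// The same three alternatives as in the Perl search expression, written
// as a JavaScript regular expression literal (slashes must be escaped).
var commentPattern = /^ *\/\/.+[\r\n]+|^ *\/\*[\s\S]+?\*\/[\r\n]+| +\/\/.+$/gm;

// Hypothetical helper: remove all matches, like Replace All with an
// empty replace string.
function stripComments(source) {
    return source.replace(commentPattern, "");
}

var sample = "// header comment\n" +
             "var a = 1; // trailing comment\n" +
             "/* block\n" +
             "   comment */\n" +
             "var b = 2;\n";

var stripped = stripComments(sample);
```

Note that the third alternative matches any // preceded by a space up to the end of the line, so it would also cut a string literal containing such a sequence. Check the result before using it on other scripts.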

Otomatic · Jul 16, 2013 · #3 · 2013-07-16T08:54+00:00

Hi,

My complete site, 3122 files (php, txt, html) in 1064 folders across four levels of folders, is now entirely UTF-8 (converted from ISO-8859-1).

44 minutes to run the script ConvertFilesToUtf8.js
five minutes to:
- replace iso-8859-1 with utf-8 in four files
- add SET NAMES 'utf8' in two files for database connections
fifteen minutes to UPDATE CHARSET of the database (only 35 tables of 55)
twelve minutes to transfer modified files by FTP

And voila! This is done, as we say in France: "In two shots ladle" (En deux coups de cuillère à pot)

Many, many thanks to you, Mofi. Without the script to transcode all files automatically, I would not have had the courage to do it by hand.

jaaktmgi · Dec 04, 2013 · #4 · 2013-12-04T10:18+00:00

Thank you, Mofi, for all these useful scripts!

Could you please give some guidelines/hints about enhancing the script in such a way that it would convert UTF-16 encoded files as well?

I have a mixed set of ASCII and UTF-16 (with BOM) encoded files that I would all like to be converted to UTF-8.
I am not sure myself how to implement the conversion part, because of my limited knowledge of JavaScript.

Many thanks in advance,
Jaak

Mofi · Dec 04, 2013 · #5 · 2013-12-04T15:16+00:00

It is very easy to modify the script so that it converts all files in a folder or folder tree to UTF-8, except files that are already encoded in UTF-8.

     if ((UltraEdit.activeDocument.encoding != 65001) &&   // not UTF-8
         (UltraEdit.activeDocument.encoding != 1200)  &&   // not UTF-16 LE
         (UltraEdit.activeDocument.encoding != 1201))      // not UTF-16 BE

needs to be changed to

     if (UltraEdit.activeDocument.encoding != 65001)      // not UTF-8

That's it.

The command UltraEdit.activeDocument.ASCIIToUTF8(); also works for UTF-16 encoded files, which is why there is no separate scripting command for converting from UTF-16 to UTF-8.
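
The difference between the two conditions can be written as a small function that can be tested without UltraEdit. This is only a sketch: the function name shouldConvert and its skipUtf16 parameter are invented for this example, while the code page numbers 65001, 1200 and 1201 are the ones used in the conditions above.

```javascript
var CP_UTF8 = 65001;      // UTF-8
var CP_UTF16_LE = 1200;   // UTF-16 LE
var CP_UTF16_BE = 1201;   // UTF-16 BE

// Hypothetical helper: decide from the code page number of a document
// whether it should be converted to UTF-8.
// skipUtf16 = true  corresponds to the original three-line condition,
// skipUtf16 = false corresponds to the shortened condition.
function shouldConvert(encoding, skipUtf16) {
    if (encoding === CP_UTF8) {
        return false;                   // already UTF-8
    }
    if (skipUtf16 && (encoding === CP_UTF16_LE || encoding === CP_UTF16_BE)) {
        return false;                   // leave UTF-16 files as they are
    }
    return true;
}

// Inside an UltraEdit script the function could be used like this:
// if (shouldConvert(UltraEdit.activeDocument.encoding, false)) {
//     UltraEdit.activeDocument.ASCIIToUTF8();
// }
var convertUtf16 = shouldConvert(CP_UTF16_LE, false);
```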

The three lines of the if statement are commented out in the script I published on 2013-12-05. So the complete, full-featured public version of the script, described briefly in my edited first post, really converts all text files not already encoded in UTF-8 to UTF-8.

jaaktmgi · Dec 05, 2013 · #6 · 2013-12-05T09:06+00:00

Thank you, Mofi, once more for your help!

The modified script works perfectly!

Kind regards,
Jaak

Mofi · Dec 05, 2013 · #7 · 2013-12-05T17:19+00:00

Thanks, Jaak.

I have improved the script once more and sent it by email to IDM support for uploading it on their server.

The script ConvertFilesToUtf8.js now additionally skips binary files. It can also be executed on all open files, all project files, all favorite files or all files of a solution (UEStudio only). Finally, the improved script also creates a report in the output window of what was done on each file matching the search criteria, plus a summary of the modifications.

Jul 09, 2018 · #8 · 2018-07-09T19:19+00:00

The script file ConvertFilesToUtf8.js using function GetListOfFiles was updated on 2018-06-24. Lots of comments were updated and improved for hopefully easier reading by non-native English speaking users.

But there are also the following functional improvements:

- The script detects many more variants of HTML/XHTML charset and XML encoding declarations, including short HTML5 character set declarations. See this post for a list of the character encoding declarations supported by the script released on 2018-07-09 when converting HTML, XHTML or XML files to UTF-8.
- The script released on 2018-07-09 keeps the line ending type of a file when converting it to UTF-8, independent of what is configured as the preferred line ending for new files.
- The information written to the output window about a file not converted to UTF-8 was improved and now includes information about UTF-8 or UTF-16 encoding. Some other messages were improved, too.
- A new, unnamed file converted by the script to UTF-8 is no longer automatically saved and closed, as this is not really possible for a new file without a file name.
- The script file ConvertFilesToUtf8.js can be opened in UltraEdit/UEStudio and executed without adding it to the script list, as long as the file name is not modified by the user. The script ignores the file ConvertFilesToUtf8.js when it is open, even when the script is used to convert all currently opened files to UTF-8.
- Fully qualified file names with one or more Unicode characters are handled correctly by UltraEdit for Windows ≥ v24.00 or UEStudio ≥ v17.00 when reading them from a Unicode encoded file using the scripting command UltraEdit.activeDocument.selection. For that reason the included function GetListOfFiles no longer converts the UTF-16 LE encoded results file with the file names from Unicode to ANSI, which means file names or file paths can also contain characters not available in the system code page for non-Unicode-aware applications.
- Users of UltraEdit for Windows < v24.00 or UEStudio < v17.00 should change the value of the variable bNoUnicode from false to true at the top of the function GetListOfFiles. The results file with all the fully qualified file names is then converted to ANSI using the system code page, as used by UE/UES by default, to handle file names/paths containing non-ASCII characters as well as possible.
- The code of function GetListOfFiles that displays debug messages in a message box or writes them to the output window was optimized.

Nov 14, 2018 · #9 · 2018-11-14T15:54+00:00

The script file ConvertFilesToUtf8.js was updated once again on 2018-11-14 because the last comment line at the top of the script was not inside the block comment. In addition, some small improvements were made to the code with no functional effect on execution.