Creating definition file for delimited files

Creating definition file for delimited files

2
NewbieNewbie
2

    Aug 14, 2015#1

    UltraEdit v21.20.0.10009
    Windows 7

    With that out of the way, I'm trying to find a way to use scripts to accomplish something at my company, where we do data processing and direct mailing. UE seems like a good candidate since we use it for a lot of basic file manipulation but I know it can be utilized more.

    What I want to accomplish is simple enough but I'm not familiar with Javascript or UE scripting. Say I have 50 delimited .csv files. I can convert these already to fixed-length files via another program but the problem is I need definition files along with them that describe the fields. There are several different layouts in this group of .csv files, so I need a way to batch this process up. The conversion program doesn't have an easy way of creating those definition files so the fastest way I've found is to open each file, close it, and say 'Yes' to create the definition.

    The format of the definition file is simple, here's an example:

    TopOffset,500
    Field Delimiter= 044
    Record Delimiter= 013 010
    Field Framing Character= 034
    ex_Field1,,C
    ex_Field2,,C
    ex_Field3,,C
    ex_Field4,,C
    ex_Field5,,C

    So essentially the TopOffset is just the length of the first row ( the header for the .csv ), and the next 3 lines specify what the delimiter specifications are. In this case - comma, CRLF, and double quote. Then each field name is listed.

    What I need to do is basically count the length of the header row including CRLF for the TopOffset, and then format the rest of the file according to the above example, not necessarily detecting the delimiters or anything because I can specify them manually. Below the delimiters is then just a parsed, vertical version of the header row with ',,C' appended to it. Once formatted, output to a .dmt file with the same name as the corresponding input file.

    I'm open to suggestions for simple ways to do this and thank you for your time.

    6,675585
    Grand MasterGrand Master
    6,675585

      Aug 15, 2015#2

      I hope, the script below works for your CSV files as expected. It is written to get all names of CSV files found in a directory and process all those CSV files.

      You can add the script with Scripting - Scripts to list of scripts for easy execution from within UltraEdit via menu Scripting or via the Script List.

      The script can be also executed by opening it in UltraEdit and using Scripting - Run Active Script.

      And the script could be also executed from command line (or a batch file) with

      start "Create DMT Files" /wait /min "%ProgramFiles(x86)%\IDM Computer Solutions\UltraEdit\Uedit32.exe" /fni /s,e="script file name with full path"

      This example is for running a new instance of UltraEdit from within a batch file with minimized window in a new instance for processing the CSV files with the script below and automatic exit of UltraEdit when all CSV files have been processed on Windows x64 with UltraEdit installed into default program files directory.

      There are once some additions and changes necessary before the code below can be used the first time.
      1. Code of function GetListOfFiles without the explaining comments at top of script file and without the code demonstrating the usage of this function at bottom must be copied also into the script file. Please don't forget to configure the variables sSummaryInfo and sResultsDocTitle according to language of used UltraEdit and configuration for Find Output Format when not using English UE with default configuration for find output format.
      2. Directory name "C:\\Temp\\" in first line of code below must be adapted to real directory name on your machine. For each backslash in path one more backslash is necessary because the backslash character is the escape character in a JavaScript string.
      3. The character assigned to variable sFieldDelimiter needs to be adapted to what your CSV files use as field delimiter. For a horizontal tab character use "\t".
      Note: The script file should be saved as ASCII/ANSI file with DOS line terminators.

      Read the comments, especially those regarding assumptions on format of the header rows - no double quote and no field delimiter in a field value. Let me know if the names of the columns contain also a double quote and/or a field delimiter to extend the script code for supporting also something like "Column name with "" and with ," in header row.

      Code: Select all

      if (GetListOfFiles(0,"C:\\Temp\\","*.csv",false))
      {
         // Define the field delimiter character.
         var sFieldDelimiter = ",";
      
         // Get decimal code value of delimiter as string with 1 or 2 leading zeros.
         var sCodeFieldDelimiter = sFieldDelimiter.charCodeAt(0).toString(10);
         while (sCodeFieldDelimiter.length < 3)
         {
            sCodeFieldDelimiter = '0' + sCodeFieldDelimiter;
         }
      
         // If there was no error and files are found in the directory, the
         // function made the search results output with the file names active.
         // Select all lines in the file and load the file names into an array.
         UltraEdit.activeDocument.selectAll();
         var sLineTerm = "\r\n";   // Default line terminator type is DOS.
         var nLineTermPos = UltraEdit.activeDocument.selection.search(/\r\n|\n|\r/);
         if (nLineTermPos >= 0)    // Any line terminator found?
         {
            // The list file is a Unix file if first character found is a line-feed.
            if (UltraEdit.activeDocument.selection[nLineTermPos] == '\n') sLineTerm = "\n";
            // The list file is a Mac file if first character found is a carriage
            // return and the next character is not a line-feed as in a DOS file.
            else if (UltraEdit.activeDocument.selection[nLineTermPos+1] != '\n') sLineTerm = "\r";
         }
         var asFileNames = UltraEdit.activeDocument.selection.split(sLineTerm);
      
         // The list is not needed anymore and therefore the results window is closed.
         UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
         asFileNames.pop();  // Remove empty string at end of the list.
      
         // Now open one file after the other and process each file.
         for (var nFileIndex = 0; nFileIndex < asFileNames.length; nFileIndex++)
         {
            var sHeaderRow = ""; // Clear header row string from previous loop run.
      
            // Check if the CSV file to process is opened already in UltraEdit/UEStudio.
            for (var nDocIndex = 0; nDocIndex < UltraEdit.document.length; nDocIndex++)
            {
               if (asFileNames[nFileIndex] == UltraEdit.document[nDocIndex].path)
               {
                  // Load current line and column number into variables.
                  var nLineNumber = UltraEdit.document[nDocIndex].currentLineNum;
                  var nColumnNumber = UltraEdit.document[nDocIndex].currentColumnNum;
                  // The next line is only needed for UE < v16.00 as those versions
                  // of UE return a column number with 0 for first number instead of 1.
                  if (typeof(UltraEdit.activeDocumentIdx) == "undefined") nColumnNumber++;
      
                  // Move caret to top of file.
                  UltraEdit.document[nDocIndex].top();
      
                  // Select the first line in file and copy it to string variable.
                  UltraEdit.document[nDocIndex].selectLine();
                  sHeaderRow = UltraEdit.document[nDocIndex].selection;
      
                  // Restore initial position of caret in the file.
                  UltraEdit.document[nDocIndex].gotoLine(nLineNumber,nColumnNumber);
                  break;
               }
            }
      
            // Was the file to process not opened already and therefore the string
            // variable for header row is still an empty string with length 0?
            if (sHeaderRow.length == 0)
            {
               // Open the file to process.
               UltraEdit.open(asFileNames[nFileIndex]);
      
               // Select the first line in file and copy it to string variable.
               UltraEdit.document[nDocIndex].selectLine();
               sHeaderRow = UltraEdit.document[nDocIndex].selection;
      
               // Close the just opened file without saving it.
               UltraEdit.closeFile(UltraEdit.activeDocument.path,2);
            }
      
            // Get length of header row with line termination
            // and convert the number into a decimal string.
            var sTopOffset = sHeaderRow.length.toString(10);
      
            // Get type of line termination, define record delimiter
            // accordingly and remove the line termination form the string.
            if (sHeaderRow.lastIndexOf("\r\n") >= 0)
            {
               var sRecordDelimiter = "013 010";
               sHeaderRow = sHeaderRow.substr(0,sHeaderRow.length-2);
            }
            else if (sHeaderRow[sHeaderRow.length-1] == "\n")
            {
               var sRecordDelimiter = "010";
               sHeaderRow = sHeaderRow.substr(0,sHeaderRow.length-1);
            }
            else
            {
               var sRecordDelimiter = "013";
               sHeaderRow = sHeaderRow.substr(0,sHeaderRow.length-1);
      
            }
      
            // Remove all double quotes from header row. This would be wrong
            // if a double quote would be part of a field name in header row.
            sHeaderRow = sHeaderRow.replace(/\"/g,"");
      
            // Split header row into an array of strings using field delimiter.
            // This produces wrong result if a field string is quoted and contains
            // additionally the field delimiter. However, the header row does
            // usually not contain a field delimiter within a field value.
            var asFields = sHeaderRow.split(sFieldDelimiter);
      
            // Create in a string variable the lines for the definition file.
            var sDefinitionFile = "TopOffset," + sTopOffset + "\r\n";
            sDefinitionFile += "Field Delimiter= " + sCodeFieldDelimiter + "\r\n";
            sDefinitionFile += "Record Delimiter= " + sRecordDelimiter + "\r\n";
            sDefinitionFile += "Field Framing Character= 034\r\n";
            for (nFieldIndex = 0; nFieldIndex < asFields.length; nFieldIndex++)
            {
               sDefinitionFile += asFields[nFieldIndex] + ",,C\r\n";
            }
      
            // Create a new file and make sure it's an ASCII file with DOS
            // line terminators. Then write the lines into the new file.
            UltraEdit.newFile();
            UltraEdit.activeDocument.unixMacToDos();
            UltraEdit.activeDocument.unicodeToASCII();
            UltraEdit.activeDocument.write(sDefinitionFile);
      
            // Save the new file with name of CSV file after replacing
            // the file extension "csv" case-insensitive by "dmt".
            UltraEdit.saveAs(asFileNames[nFileIndex].replace(/csv$/i,"dmt"));
            UltraEdit.closeFile(UltraEdit.activeDocument.path,0);
         }
      }
      
      One more note: If all CSV files contain in header row a string not existing in the other lines below (too often), it would be also possible to run from within the script a Find in Files to get names of all CSV files and their first lines into a new file which is processed by the script to create the definition files without the need to open each CSV file. That would be faster, but type of line termination for record delimiter could not be determined by the script in this case.

      Last note: Record delimiter is only correct detected by script above if Never prompt to convert files to DOS format is selected at Advanced - Configuration - File Handling - DOS/Unix/Mac Handling for Unix/Mac file detection/conversion. The script does not use lineTerminator property introduced with UE v16.00 as I do not know which version of UltraEdit you currently use.
      Best regards from an UC/UE/UES for Windows user from Austria

      2
      NewbieNewbie
      2

        Aug 20, 2015#3

        I haven't had a chance to give this a try but will try it out soon. I'll also try to let you know how it goes. Thanks for the reply and potential solution!

        EDIT: I'm blown away right now. I ran it on 174 .csv files and every single .dmt was created within a minute or two, and not only that but I don't see any layouts that are off. Good work, seriously.

        For future reference - if we need to modify the delimiters, for example for pipe delimiting or if there are also field surrounding characters ( ex. "field1","field2","etc") what is the general process behind that? I'm assuming if the files were only delimited with no surrounding characters it's as easy as changing the definition of sFieldDelimiter.

        6,675585
        Grand MasterGrand Master
        6,675585

          Aug 21, 2015#4

          I explained handling of quotes in header row already in my previous post. As long as the header row is something like

          Code: Select all

          ID,"First Name","Last Name",Address
          the script processes the header row correct as it removes simply all double quotes before splitting the line into separate strings using character defined sFieldDelimiter. See scripting code

          Code: Select all

                // Remove all double quotes from header row. This would be wrong
                // if a double quote would be part of a field name in header row.
                sHeaderRow = sHeaderRow.replace(/\"/g,"");
          
                // Split header row into an array of strings using field delimiter.
                // This produces wrong result if a field string is quoted and contains
                // additionally the field delimiter. However, the header row does
                // usually not contain a field delimiter within a field value.
                var asFields = sHeaderRow.split(sFieldDelimiter);
          For a header row like

          Code: Select all

          ID,"""First, Last Name""",Address
          the *.dmt file produced would be

          Code: Select all

          TopOffset,35
          Field Delimiter= 044
          Record Delimiter= 013 010
          Field Framing Character= 034
          ID,,C
          First,,C
          Last Name,,C
          Address,,C
          instead of

          Code: Select all

          TopOffset,35
          Field Delimiter= 044
          Record Delimiter= 013 010
          Field Framing Character= 034
          ID,,C
          "First, Last Name",,C
          Address,,C
          So for such headers the scripting code block with the 2 commands to remove all double quotes using a simple replace and then splitting the line up needs to be replaced by a function which really parses the header row according to CSV specification for double quoted field values. That would be easy to code as well. But if this is worth to code depends on support by the application for quoted field values with field delimiter in value, i.e. is "First, Last Name",,C correct interpreted at all. Also the script as is would not produce the right result for a header with 1 or more line breaks within quoted field values. But does the application reading the *.dmt files support quoted field names with 1 or more line breaks in name?
          Best regards from an UC/UE/UES for Windows user from Austria

          9

            Sep 08, 2015#5

            It's too bad that this new code doesn't seem to work (or didn't for me in a few tests):

            Code: Select all

            var lt = UltraEdit.activeDocument.lineTerminator;
            from power tip UltraEdit v16.00 Scripting Enhancements.

            That would help shorten some of the line delimiter code. Anyone else have success with it?

            EDIT:
            D'oh, I was trying to use it to return actual the line terminator (e.g. "\r\n"). Here's the specification from the help:
            UE help wrote:Supported values:
            -2 = MAC but content of file contains currently DOS line terminators
            -1 = UNIX but content of file contains currently DOS line terminators
            0 = DOS
            1 = UNIX
            2 = MAC
            Might be useful for the above code?

            Code: Select all

            var lt = UltraEdit.activeDocument.lineTerminator;
            var lineBreak = "\r\n";             // Default line terminator type is DOS; use for DOS or mixed content
            if (lt == 1) lineBreak = "\n";      // pure Unix file - uses a line-feed.
            else if (lt == 2) lineBreak = "\r"; // pure Mac file - uses a carriage return.
            Thanks for all you guys do and post.