Update references in all XML files with data found for each reference in corresponding XML files

    Aug 21, 2014 #1

    sandeep originally asked the following on Stack Overflow:

    I am working on two sets/folders of XML files.
    • Folder A (On the left - articles related to medical microbiology)
    • Folder B (On the right - Corresponding author and their affiliations)


    Now, a portion of file 988_66-2-121.xml has the following content:

    Code:

    <p>......the number of pertussis cases in California was again as high in 2010 as it was in 1947 (Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/>. Despite a close to 85% worldwide...</p>
    The portion Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/> is of interest because the author Kuehn has the id fim988-bib-0047 in the corresponding author-affiliation file 988_66-2-121_REF.xml.

    Code:

    <ref id="fim988-bib-0047"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuehn</surname><given-names>BM</given-names></name></person-group> (<year>2010</year>) <article-title>Panel backs wider pertussis vaccination to curb outbreaks, prevent deaths</article-title>. <source>JAMA</source> <volume>304</volume>: <fpage>2684</fpage>&#x2013;<lpage>2686</lpage>.</citation></ref>
    What I want to do is:
    • Open the corresponding author-affiliation file, e.g. 988_66-2-121_REF.xml.
    • Search by author id, i.e. fim988-bib-0047, take the value within the <year> tag, in this case 2010.
    • Go back to 988_66-2-121.xml and replace line 1 below with line 2. I need to do this for all the files.
    What would be a good way to achieve the task?

    Code:

    Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/>
    <xref ref-type="bibr" rid="fim988-bib-0047">Kuehn, 2010</xref>
    I searched by the pattern <xref ref-type="fig" rid="fim[0-9]+-bib- on all the files and it gave me a count of 1031 occurrences in 14 files.

    So, I need to fix/replace all those 1031 occurrences. Please help.

    Here is an UltraEdit script for this task. Copy and paste the code into a new ASCII (not UTF-8) file and save it with the file extension js.

    The folder paths near the top of the script must be set; see these lines:

    Code:

      // Define folder path with the XML files to modify.
       var sFolderXML = "C:\\Temp\\FolderA\\";
       // Define folder path with the XML reference files.
       var sFolderREF = "C:\\Temp\\FolderB\\";
    Note: The script does not verify that a corresponding *_REF.xml exists for each *.xml file. It requires that every *.xml file has one.
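Because the script does not check for missing reference files, such a check could be run beforehand. The following is only a sketch in plain JavaScript, assuming the naming convention described above (name of the article file plus _REF.xml). The functions buildRefPath and findMissingRefFiles are hypothetical helpers; they are not part of the script and not UltraEdit commands, and how the two lists of file names are obtained is left open.

```javascript
// Hypothetical helper (not part of the script): derive the full name of the
// reference file from the full name of an article file.
function buildRefPath(sXmlPath, sFolderREF)
{
   // Remove the folder part, i.e. everything up to the last backslash or slash.
   var sName = sXmlPath.replace(/^.*[\\\/]/, "");
   // Remove the file extension, if there is one.
   sName = sName.replace(/\.[^.]*$/, "");
   return sFolderREF + sName + "_REF.xml";
}

// Hypothetical helper: return all article files for which the expected
// reference file is not in the list of reference files. The comparison is
// case-insensitive because Windows file names are not case-sensitive.
function findMissingRefFiles(asXmlPaths, asRefPaths, sFolderREF)
{
   var asMissing = [];
   for (var i = 0; i < asXmlPaths.length; i++)
   {
      var sExpected = buildRefPath(asXmlPaths[i], sFolderREF).toLowerCase();
      var bFound = false;
      for (var j = 0; j < asRefPaths.length; j++)
      {
         if (asRefPaths[j].toLowerCase() == sExpected)
         {
            bFound = true;
            break;
         }
      }
      if (!bFound) asMissing.push(asXmlPaths[i]);
   }
   return asMissing;
}
```

Every file reported by findMissingRefFiles would have to get its reference file, or be moved out of the folder, before the script is run.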

    Furthermore, two additional functions must be appended to the script:
    • GetFileName
    • GetListOfFiles
      Also read the comments at the top of the script file with GetListOfFiles if you are not using the English version of UltraEdit.
    The script is written for Windows paths and for DOS line terminators in the results file.

    After saving the *.js file with the folder path variables adjusted and the two functions appended, run it by clicking Run Active Script in the Scripting menu. No other file may be open when the script is executed.

    The script can also be added to the Scripts list, so that it can be executed in UltraEdit with no file open and without opening the script file itself.

    The script writes process information to the output window: which file is currently being processed, how many references were found in a file and how many of them could be updated, the total numbers of found and updated references, and any reference identification found in a *.xml file that could not be found in the corresponding *_REF.xml file.

    The script is written to work on XML files of any size. It is therefore not optimized for maximum speed, which would mean doing as much as possible in memory.

    Code:

    //== Main ==================================================================
    
    // Verify existence of function GetListOfFiles in this script.
    if(typeof(GetListOfFiles) != "function")
    {
       UltraEdit.messageBox("Function GetListOfFiles is missing in script.\n\nSee UltraEdit forum topic\n\nhttp://forums.ultraedit.com/viewtopic.php?f=52&t=5442");
    }
    // Verify existence of function GetFileName in this script.
    else if(typeof(GetFileName) != "function")
    {
       UltraEdit.messageBox("Function GetFileName is missing in script.\n\nSee UltraEdit forum topic\n\nhttp://forums.ultraedit.com/viewtopic.php?f=52&t=6762");
    }
    else
    {  // The two folder paths below must end with a backslash!
    
       // Define folder path with the XML files to modify.
       var sFolderXML = "C:\\Temp\\FolderA\\";
       // Define folder path with the XML reference files.
       var sFolderREF = "C:\\Temp\\FolderB\\";
    
       // Define environment for this script.
       UltraEdit.insertMode();
       UltraEdit.columnModeOff();
    
       // Define document index numbers for the XML file to modify and
       // the corresponding reference file as they will have later in
       // array of opened documents of UltraEdit/UEStudio.
       var g_nXmlFile = UltraEdit.document.length + 1;
       var g_nRefFile = g_nXmlFile + 1;
       var nListFile  = UltraEdit.document.length;
    
       // Global counters used for summary information.
       var g_nTotalFound   = 0;
       var g_nTotalUpdated = 0;
    
       // Enable debug messages for GetListOfFiles and GetFileName.
       var g_nDebugMessage = 2;
    
       // Get the list of file names of all XML files in specified folder.
       if (GetListOfFiles(0,sFolderXML,"*.xml",false))
       {
          // Creation of list was successful. Select the complete list and
          // load it into an array of strings for opening each file in a loop.
          UltraEdit.activeDocument.selectAll();
          var asFileNames = UltraEdit.activeDocument.selection.split("\r\n");
          if (asFileNames[asFileNames.length-1] == "") asFileNames.pop();
          // Cancel the selection as it looks better on script execution.
          UltraEdit.activeDocument.top();
    
          // Prepare output window for script process information output.
          UltraEdit.outputWindow.clear();
          if (UltraEdit.outputWindow.visible == false)
          {
             UltraEdit.outputWindow.showWindow(true);
          }
    
          // Select once Perl regular expression engine.
          UltraEdit.perlReOn();
    
          // Process now the XML file pairs in a loop.
          for (var nFile = 0; nFile < asFileNames.length; nFile++)
          {
             UltraEdit.open(asFileNames[nFile]);
             UltraEdit.open(sFolderREF+GetFileName(asFileNames[nFile])+"_REF.xml");
    
             // Always make list file the active file to avoid display updates
             // if the document windows are maximized which reduces total time
             // needed to finish this script.
             UltraEdit.document[nListFile].setActive();
    
             // Show in output window which file is currently processed.
             UltraEdit.outputWindow.write("Processing file "+asFileNames[nFile]+" ...");
    
             // Do the actual find and replace in the two XML files.
             FindAndReplace();
    
             // Close the unmodified reference file without saving and the
             // modified XML file with saving. Attention: The order is important here!
             UltraEdit.closeFile(UltraEdit.document[g_nRefFile].path,2);
             UltraEdit.closeFile(UltraEdit.document[g_nXmlFile].path,1);
          }
    
          // Close now also the file with the file names without saving.
          UltraEdit.closeFile(UltraEdit.document[nListFile].path,2);
    
          // Output a short summary information in output window.
          UltraEdit.outputWindow.write("Summary:\n\nTotal number of references found:   "+g_nTotalFound);
          UltraEdit.outputWindow.write("Total number of references updated: "+g_nTotalUpdated);
          UltraEdit.outputWindow.showStatus=false;
       }
    }
    
    //== FindAndReplace ========================================================
    
    function FindAndReplace()
    {
       var nRefFound = 0;
       var nRefUpdated = 0;
       var bTopPosition = true;
       var bFirstMissing = true;
    
       // Define once all parameters for the Perl regular expression
       // finds used in both XML files to find the references.
       UltraEdit.document[g_nXmlFile].findReplace.mode=0;
       UltraEdit.document[g_nXmlFile].findReplace.matchCase=true;
       UltraEdit.document[g_nXmlFile].findReplace.matchWord=false;
       UltraEdit.document[g_nXmlFile].findReplace.regExp=true;
       UltraEdit.document[g_nXmlFile].findReplace.searchDown=true;
       if (typeof(UltraEdit.document[g_nXmlFile].findReplace.searchInColumn) == "boolean")
       {
          UltraEdit.document[g_nXmlFile].findReplace.searchInColumn=false;
       }
    
       UltraEdit.document[g_nRefFile].findReplace.mode=0;
       UltraEdit.document[g_nRefFile].findReplace.matchCase=true;
       UltraEdit.document[g_nRefFile].findReplace.matchWord=false;
       UltraEdit.document[g_nRefFile].findReplace.regExp=true;
       UltraEdit.document[g_nRefFile].findReplace.searchDown=true;
       if (typeof(UltraEdit.document[g_nRefFile].findReplace.searchInColumn) == "boolean")
       {
          UltraEdit.document[g_nRefFile].findReplace.searchInColumn=false;
       }
    
       while(UltraEdit.document[g_nXmlFile].findReplace.find('\\(.*?, *?<xref ref-type="fig" rid="fim[0-9]+-bib-[0-9]+"/>'))
       {
          nRefFound++;
          var sXmlFound = UltraEdit.document[g_nXmlFile].selection;
          var sRID = sXmlFound.replace(/^.*rid="(fim\d+-bib-\d+)"\/>/,"$1");
    
          var sRefFound = "";
          if(UltraEdit.document[g_nRefFile].findReplace.find('<ref id="'+sRID+'">.*?<year>[0-9]+</year>'))
          {
             sRefFound = UltraEdit.document[g_nRefFile].selection;
          }
          else if(!bTopPosition)
          {  // Identification not found downwards from last found string
             // position. Search in entire file from top to bottom.
             bTopPosition = true;
             UltraEdit.document[g_nRefFile].top();
             if(UltraEdit.document[g_nRefFile].findReplace.find('<ref id="'+sRID+'">.*?<year>[0-9]+</year>'))
             {
                sRefFound = UltraEdit.document[g_nRefFile].selection;
             }
          }
          if(sRefFound.length)
          {
             bTopPosition = false;
             var sYear = sRefFound.replace(/^.*?<year>(\d+)<\/year>/,"$1");
             var sNewRef = sXmlFound.replace(/^\((.*?,) *?<xref ref-type="fig"( rid=".*")\/>/,'(<xref ref-type="bibr"$2>$1 '+sYear+"</xref>");
             UltraEdit.document[g_nXmlFile].write(sNewRef);
             nRefUpdated++;
          }
          else
          {
             if(bFirstMissing)
             {
                bFirstMissing = false;
                UltraEdit.outputWindow.write("");
             }
             UltraEdit.outputWindow.write("Line "+UltraEdit.document[g_nXmlFile].currentLineNum+': "'+sRID+'" not found.');
          }
       }
       g_nTotalFound += nRefFound;
       g_nTotalUpdated += nRefUpdated;
       UltraEdit.outputWindow.write("\nNumber of references found:   "+nRefFound);
       UltraEdit.outputWindow.write("Number of references updated: "+nRefUpdated+"\n");
    }
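For files that comfortably fit into memory, the same lookup and replacement can be tried on plain strings. The following sketch is not the script above and uses no UltraEdit commands. The names buildYearTable and updateReferences are hypothetical, and the sketch assumes that each <ref> element is on a single line, as in the sample from the question.

```javascript
// Build a lookup table from reference id to year from the content of a
// *_REF.xml file. Assumption: each <ref> element is on one line.
function buildYearTable(sRefContent)
{
   var oYears = {};
   var rRef = /<ref id="(fim\d+-bib-\d+)">.*?<year>(\d+)<\/year>/g;
   var aMatch;
   while ((aMatch = rRef.exec(sRefContent)) !== null)
   {
      oYears[aMatch[1]] = aMatch[2];
   }
   return oYears;
}

// Replace every "(Author, <xref ref-type="fig" rid="..."/>" in the content
// of an article file by "(<xref ref-type="bibr" rid="...">Author, year</xref>".
// References whose id is not in the table are counted, but left unchanged.
function updateReferences(sXmlContent, oYears)
{
   var nFound = 0;
   var nUpdated = 0;
   var rXref = /\((.*?,) *?<xref ref-type="fig"( rid="(fim\d+-bib-\d+)")\/>/g;
   var sResult = sXmlContent.replace(rXref, function (sAll, sAuthor, sRidAttr, sRID)
   {
      nFound++;
      if (!Object.prototype.hasOwnProperty.call(oYears, sRID)) return sAll;
      nUpdated++;
      return '(<xref ref-type="bibr"' + sRidAttr + ">" + sAuthor + " " + oYears[sRID] + "</xref>";
   });
   return { text: sResult, found: nFound, updated: nUpdated };
}
```

The table is built once per reference file, so each reference in the article file costs only one lookup instead of a search in the reference file. The price is that both files must be held completely in memory as strings.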
    
    Best regards from an UC/UE/UES for Windows user from Austria

      Aug 21, 2014 #2

      Thank you. The task is done with your script. :)
      Regards,
      Sandeep
      It is easy to be born, it is difficult to be a human being.:)

        Aug 29, 2014 #3

        Mofi, you are awesome. I am guessing that with this work, you could implement a search for an entity in the associated xsd schema files fairly easily.
        • Start up XMLManager.
        • Find an element
          <nc:DocumentDescriptionText>Motion for misc relief</nc:DocumentDescriptionText>
        • Highlight and copy the element into the clipboard.
        • Search the xsds included.
          For example, in this case xmlns:nc="http://niem.gov/niem/niem-core/2.0"
        • Find the DocumentDescriptionText and open a link to it at the spot you find it.
        Whenever I have to decide between two evils, I always choose the one I haven't tried before. -Mae West

          Aug 30, 2014 #4

          There are no scripting commands to work with XML Manager. The XML Manager view can only be operated by the UltraEdit user with keyboard and mouse (or similar devices). That would not be a problem here, though, because an element can also be found with the Find command.

          But UltraEdit is a text editor and not a web browser. It cannot download an XSD file from a web source and then search this file for an element. Therefore there is no scripting solution for this task, unless a third-party tool like wget is configured as a user tool in UltraEdit to download the XSD file via HTTP to a fixed location on the hard disk, from which the UltraEdit script opens the file for further processing.
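Once an XSD file is available on the hard disk, the search step itself is simple string work. The following is only a sketch in plain JavaScript under that assumption. The functions parseElementName and findDeclarationLines are hypothetical and not UltraEdit commands, and the sketch assumes that the XSD declares elements with a name attribute on the same line as the opening element tag.

```javascript
// Hypothetical helper: split the start tag of a copied element such as
// <nc:DocumentDescriptionText>...</nc:DocumentDescriptionText> into the
// namespace prefix and the local element name.
function parseElementName(sElement)
{
   var aMatch = /^\s*<(?:([A-Za-z_][\w.\-]*):)?([A-Za-z_][\w.\-]*)/.exec(sElement);
   if (aMatch === null) return null;
   return { prefix: aMatch[1] || "", name: aMatch[2] };
}

// Hypothetical helper: return the 1-based numbers of all lines in the XSD
// content on which an element with the given local name is declared.
function findDeclarationLines(sXsdContent, sName)
{
   var asLines = sXsdContent.split(/\r?\n/);
   // Escape characters with a special meaning in a regular expression.
   var sEscaped = sName.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
   var rDecl = new RegExp('<(?:\\w+:)?element\\b[^>]*\\bname="' + sEscaped + '"');
   var anLines = [];
   for (var i = 0; i < asLines.length; i++)
   {
      if (rDecl.test(asLines[i])) anLines.push(i + 1);
   }
   return anLines;
}
```

In an UltraEdit script, the line numbers found could then be used to position the caret in the opened XSD file.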

            Aug 30, 2014 #5

            Thanks.

            You replied to my other post on xmllint. I appreciate your taking the time to respond. Your knowledge of UltraEdit is awesome.