sandeep originally asked the following on Stack Overflow:
I am working on two sets (folders) of XML files.
- Folder A (on the left): articles related to medical microbiology
- Folder B (on the right): the corresponding authors and their affiliations
Now, a portion of the file 988_66-2-121.xml has the following content:
Code:
<p>......the number of pertussis cases in California was again as high in 2010 as it was in 1947 (Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/>. Despite a close to 85% worldwide...</p>
The portion Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/> is of interest because the author Kuehn has the id fim988-bib-0047 in the corresponding author-affiliation file 988_66-2-121_REF.xml:
Code:
<ref id="fim988-bib-0047"><citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kuehn</surname><given-names>BM</given-names></name></person-group> (<year>2010</year>) <article-title>Panel backs wider pertussis vaccination to curb outbreaks, prevent deaths</article-title>. <source>JAMA</source> <volume>304</volume>: <fpage>2684</fpage>–<lpage>2686</lpage>.</citation></ref>
What I want to do is:
- Open the corresponding author-affiliation file, e.g. 988_66-2-121_REF.xml.
- Search for the author id, i.e. fim988-bib-0047, and take the value within the <year> tag, in this case 2010.
- Go back to 988_66-2-121.xml and replace line 1 below with line 2. I need to do this for all the files.
Code:
Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/>
<xref ref-type="bibr" rid="fim988-bib-0047">Kuehn, 2010</xref>
I searched all the files for the pattern <xref ref-type="fig" rid="fim[0-9]+-bib- and it found 1031 occurrences in 14 files.
So I need to fix/replace all those 1031 occurrences. Please help.
Here is an UltraEdit script for this task. Copy and paste the code into a new ASCII (not UTF-8) file and save it with the file extension js.
The folder paths near the top of the script must be set; see these lines:
Code:
// Define folder path with the XML files to modify.
var sFolderXML = "C:\\Temp\\FolderA\\";
// Define folder path with the XML reference files.
var sFolderREF = "C:\\Temp\\FolderB\\";
Note: The script does not verify that a corresponding *_REF.xml exists for each *.xml file. It requires that every *.xml file has its *_REF.xml file.
In addition, two functions must be appended to the script:
- GetFileName
- GetListOfFiles
If you are not using the English version of UltraEdit, also read the comments at the top of the script file with GetListOfFiles.
After saving the *.js file with the folder path variables adjusted and the two functions appended, run it by clicking Run Active Script in the Scripting menu. No other file may be open when this script is executed.
The script can also be added to the Scripts list, so that it can be executed in UltraEdit with no file open and without opening the script file itself.
The script writes process information to the output window: which file is currently being processed, how many references were found in a file and how many of them could be updated, the total number of found and updated references, and any reference identification found in a *.xml file that could not be found in the corresponding *_REF.xml file.
The script is written to work on XML files of any size and is therefore not optimized for maximum speed by doing as much as possible in memory.
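Since the script relies on every *.xml file having its *_REF.xml partner, it may be worth checking the pairs before running it. The following is only a minimal sketch in plain JavaScript, not part of the UltraEdit script: `findMissingRefFiles` is a hypothetical helper, and it assumes you already have the file names of both folders as arrays (without folder paths). The comparison ignores case because Windows file names are case-insensitive.

```javascript
// Hypothetical helper, not part of the UltraEdit script:
// returns the XML file names that have no matching *_REF.xml.
// xmlNames and refNames are plain file names without folder paths.
function findMissingRefFiles(xmlNames, refNames) {
  // Lower-case everything so the lookup is case-insensitive.
  const available = new Set(refNames.map((name) => name.toLowerCase()));
  return xmlNames.filter((name) => {
    // Strip the .xml extension, then build the expected reference name.
    const base = name.replace(/\.xml$/i, "");
    return !available.has((base + "_REF.xml").toLowerCase());
  });
}

// Usage with one real name from the question and one made-up name.
const missingRefs = findMissingRefFiles(
  ["988_66-2-121.xml", "example-without-ref.xml"],
  ["988_66-2-121_REF.xml"]
);
console.log(missingRefs); // [ 'example-without-ref.xml' ]
```

An empty result means every article file has a reference file to look up years in.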
Code:
//== Main ==================================================================
// Verify existence of function GetListOfFiles in this script.
if(typeof(GetListOfFiles) != "function")
{
   UltraEdit.messageBox("Function GetListOfFiles is missing in script.\n\nSee UltraEdit forum topic\n\nhttp://forums.ultraedit.com/viewtopic.php?f=52&t=5442");
}
// Verify existence of function GetFileName in this script.
else if(typeof(GetFileName) != "function")
{
   UltraEdit.messageBox("Function GetFileName is missing in script.\n\nSee UltraEdit forum topic\n\nhttp://forums.ultraedit.com/viewtopic.php?f=52&t=6762");
}
else
{  // The two folder paths below must end with a backslash!
   // Define folder path with the XML files to modify.
   var sFolderXML = "C:\\Temp\\FolderA\\";
   // Define folder path with the XML reference files.
   var sFolderREF = "C:\\Temp\\FolderB\\";
   // Define environment for this script.
   UltraEdit.insertMode();
   UltraEdit.columnModeOff();
   // Define document index numbers for the XML file to modify and
   // the corresponding reference file as they will have later in
   // array of opened documents of UltraEdit/UEStudio.
   var g_nXmlFile = UltraEdit.document.length + 1;
   var g_nRefFile = g_nXmlFile + 1;
   var nListFile = UltraEdit.document.length;
   // Global counters used for summary information.
   var g_nTotalFound = 0;
   var g_nTotalUpdated = 0;
   // Enable debug messages for GetListOfFiles and GetFileName.
   var g_nDebugMessage = 2;
   // Get the list of file names of all XML files in specified folder.
   if (GetListOfFiles(0,sFolderXML,"*.xml",false))
   {
      // Creation of list was successful. Select the complete list and
      // load it into an array of strings for opening each file in a loop.
      UltraEdit.activeDocument.selectAll();
      var asFileNames = UltraEdit.activeDocument.selection.split("\r\n");
      if (asFileNames[asFileNames.length-1] == "") asFileNames.pop();
      // Cancel the selection as it looks better on script execution.
      UltraEdit.activeDocument.top();
      // Prepare output window for script process information output.
      UltraEdit.outputWindow.clear();
      if (UltraEdit.outputWindow.visible == false)
      {
         UltraEdit.outputWindow.showWindow(true);
      }
      // Select once Perl regular expression engine.
      UltraEdit.perlReOn();
      // Process now the XML file pairs in a loop.
      for (var nFile = 0; nFile < asFileNames.length; nFile++)
      {
         UltraEdit.open(asFileNames[nFile]);
         UltraEdit.open(sFolderREF+GetFileName(asFileNames[nFile])+"_REF.xml");
         // Always make list file the active file to avoid display updates
         // if the document windows are maximized which reduces total time
         // needed to finish this script.
         UltraEdit.document[nListFile].setActive();
         // Show in output window which file is currently processed.
         UltraEdit.outputWindow.write("Processing file "+asFileNames[nFile]+" ...");
         // Do the actual find and replace in the two XML files.
         FindAndReplace();
         // Close not modified reference file and modified XML file
         // with saving. Attention: The order is important here!
         UltraEdit.closeFile(UltraEdit.document[g_nRefFile].path,2);
         UltraEdit.closeFile(UltraEdit.document[g_nXmlFile].path,1);
      }
      // Close now also the file with the file names without saving.
      UltraEdit.closeFile(UltraEdit.document[nListFile].path,2);
      // Output a short summary information in output window.
      UltraEdit.outputWindow.write("Summary:\n\nTotal number of references found: "+g_nTotalFound);
      UltraEdit.outputWindow.write("Total number of references updated: "+g_nTotalUpdated);
      UltraEdit.outputWindow.showStatus=false;
   }
}
//== FindAndReplace ========================================================
function FindAndReplace()
{
   var nRefFound = 0;
   var nRefUpdated = 0;
   var bTopPosition = true;
   var bFirstMissing = true;
   // Define once all parameters for the Perl regular expression
   // finds used in both XML files to find the references.
   UltraEdit.document[g_nXmlFile].findReplace.mode=0;
   UltraEdit.document[g_nXmlFile].findReplace.matchCase=true;
   UltraEdit.document[g_nXmlFile].findReplace.matchWord=false;
   UltraEdit.document[g_nXmlFile].findReplace.regExp=true;
   UltraEdit.document[g_nXmlFile].findReplace.searchDown=true;
   if (typeof(UltraEdit.document[g_nXmlFile].findReplace.searchInColumn) == "boolean")
   {
      UltraEdit.document[g_nXmlFile].findReplace.searchInColumn=false;
   }
   UltraEdit.document[g_nRefFile].findReplace.mode=0;
   UltraEdit.document[g_nRefFile].findReplace.matchCase=true;
   UltraEdit.document[g_nRefFile].findReplace.matchWord=false;
   UltraEdit.document[g_nRefFile].findReplace.regExp=true;
   UltraEdit.document[g_nRefFile].findReplace.searchDown=true;
   // Same check for the reference file (not for the XML file again).
   if (typeof(UltraEdit.document[g_nRefFile].findReplace.searchInColumn) == "boolean")
   {
      UltraEdit.document[g_nRefFile].findReplace.searchInColumn=false;
   }
   while(UltraEdit.document[g_nXmlFile].findReplace.find('\\(.*?, *?<xref ref-type="fig" rid="fim[0-9]+-bib-[0-9]+"/>'))
   {
      nRefFound++;
      var sXmlFound = UltraEdit.document[g_nXmlFile].selection;
      var sRID = sXmlFound.replace(/^.*rid="(fim\d+-bib-\d+)"\/>/,"$1");
      var sRefFound = "";
      if(UltraEdit.document[g_nRefFile].findReplace.find('<ref id="'+sRID+'">.*?<year>[0-9]+</year>'))
      {
         sRefFound = UltraEdit.document[g_nRefFile].selection;
      }
      else if(!bTopPosition)
      {  // Identification not found downwards from last found string
         // position. Search in entire file from top to bottom.
         bTopPosition = true;
         UltraEdit.document[g_nRefFile].top();
         if(UltraEdit.document[g_nRefFile].findReplace.find('<ref id="'+sRID+'">.*?<year>[0-9]+</year>'))
         {
            sRefFound = UltraEdit.document[g_nRefFile].selection;
         }
      }
      if(sRefFound.length)
      {
         bTopPosition = false;
         var sYear = sRefFound.replace(/^.*?<year>(\d+)<\/year>/,"$1");
         var sNewRef = sXmlFound.replace(/^\((.*?,) *?<xref ref-type="fig"( rid=".*")\/>/,'(<xref ref-type="bibr"$2>$1 '+sYear+"</xref>");
         UltraEdit.document[g_nXmlFile].write(sNewRef);
         nRefUpdated++;
      }
      else
      {
         if(bFirstMissing)
         {
            bFirstMissing = false;
            UltraEdit.outputWindow.write("");
         }
         UltraEdit.outputWindow.write("Line "+UltraEdit.document[g_nXmlFile].currentLineNum+': "'+sRID+'" not found.');
      }
   }
   g_nTotalFound += nRefFound;
   g_nTotalUpdated += nRefUpdated;
   UltraEdit.outputWindow.write("\nNumber of references found: "+nRefFound);
   UltraEdit.outputWindow.write("Number of references updated: "+nRefUpdated+"\n");
}
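The string handling inside FindAndReplace can also be tried outside UltraEdit on small samples. The following is a sketch under stated assumptions, not the script's own code: `collectYears` and `rewriteCitations` are hypothetical helpers written in plain JavaScript, they work on strings already held in memory, and they assume each <ref> element sits on a single line as in the sample from the question. The author name is matched more narrowly here (no parentheses or angle brackets) than in the script's `.*?`.

```javascript
// Hypothetical helper: build a lookup from reference id to year.
// Assumes every <ref> element is on one line ("." does not match newlines).
function collectYears(refText) {
  const years = new Map();
  const refPattern = /<ref id="(fim\d+-bib-\d+)">.*?<year>(\d+)<\/year>/g;
  let match;
  while ((match = refPattern.exec(refText)) !== null) {
    years.set(match[1], match[2]);
  }
  return years;
}

// Hypothetical helper: rewrite '(Author, <xref ref-type="fig" rid="ID"/>'
// into '(<xref ref-type="bibr" rid="ID">Author, YEAR</xref>'.
// Citations whose id has no known year stay unchanged and are reported.
function rewriteCitations(xmlText, years) {
  const missing = [];
  const citePattern = /\(([^()<>]*?), *<xref ref-type="fig" rid="(fim\d+-bib-\d+)"\/>/g;
  const text = xmlText.replace(citePattern, (whole, author, rid) => {
    if (!years.has(rid)) {
      missing.push(rid);
      return whole;
    }
    return '(<xref ref-type="bibr" rid="' + rid + '">' + author + ", " + years.get(rid) + "</xref>";
  });
  return { text, missing };
}

// Usage with shortened versions of the sample lines from the question.
const sampleRef = '<ref id="fim988-bib-0047"><citation citation-type="journal"><name><surname>Kuehn</surname></name> (<year>2010</year>)</citation></ref>';
const sampleXml = 'as it was in 1947 (Kuehn, <xref ref-type="fig" rid="fim988-bib-0047"/>. Despite';
const result = rewriteCitations(sampleXml, collectYears(sampleRef));
console.log(result.text);
// as it was in 1947 (<xref ref-type="bibr" rid="fim988-bib-0047">Kuehn, 2010</xref>. Despite
```

Working on whole strings like this is the in-memory approach the script deliberately avoids, so it is suited to testing the patterns rather than to processing very large files.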