Deleting duplicate glosses on a line

dictdoc · Aug 18, 2013 10:46 UTC · #1

I am working on an Urdu to Hindi dictionary and I have created the following file structure:

Headword=Gloss1,Gloss2,Gloss3
i.e. glosses delimited by a comma.

In some cases (over 6,000 in a file of more than 200,000 entries), glosses are duplicated.
Since this is likely to happen again, could a macro or a script check the glosses on the right-hand side of each line and, if there are duplicates, remove them so that each gloss appears only once?
An example will make this clear:
Input

a=b,c,b
d=p,q,p
e=z,y,g,z,g,y

The expected output would be:

a=b,c
d=p,q
e=g,y,z
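
Note that the expected output for the third line lists the glosses in sorted order (g,y,z), whereas simply dropping repeats would keep the order of first occurrence (z,y,g). A minimal sketch of both variants in plain JavaScript follows; `dedupeGlosses` and its `sortGlosses` flag are hypothetical names for illustration, not part of the UltraEdit scripting API, and a line without "=" is treated entirely as a gloss list.

```javascript
// Hypothetical helper, not part of the UltraEdit API.
// Removes duplicate glosses from one "Headword=Gloss1,Gloss2" line.
// sortGlosses is an assumed option: true sorts the remaining glosses.
function dedupeGlosses(line, sortGlosses) {
	var pos = line.indexOf("=") + 1;      // glosses start after the first "="
	var fields = line.substring(pos).split(",");
	var seen = {};                        // glosses already met on this line
	var unique = [];
	for (var i = 0; i < fields.length; i++) {
		if (!Object.prototype.hasOwnProperty.call(seen, fields[i])) {
			seen[fields[i]] = true;
			unique.push(fields[i]);
		}
	}
	if (sortGlosses) unique.sort();       // plain code-unit order
	return line.substring(0, pos) + unique.join(",");
}

var input = ["a=b,c,b", "d=p,q,p", "e=z,y,g,z,g,y"];
var kept = [], sorted = [];
for (var n = 0; n < input.length; n++) {
	kept.push(dedupeGlosses(input[n], false));
	sorted.push(dedupeGlosses(input[n], true));
}
// kept   is ["a=b,c", "d=p,q", "e=z,y,g"]
// sorted is ["a=b,c", "d=p,q", "e=g,y,z"]
```

The sort here uses JavaScript's default string comparison, which may not match dictionary order for Hindi; keeping first-occurrence order avoids that question.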

In case live data is needed, here is a sample:

آبادِیوں=आबादिओं,आबादियों
آبادی=जनसंख्या,आबादी
آبجیکشن=ऑबजेक्शन,ऑब्जेक्शन
آبلا=अबला,उबला
آبو=आबू,आबो
آتشک=आतशक,आतिशक
آتم=आतम,आतम,आत्म,आत्म
آتون=आतून,आतोन
آتیں=आतीं,आतें,आतें,आतीं
آجا=आ जा,आजा
آجاتی=आ जाती,आजाती
آجانا=आ जाना,आजाना
آجکل=आज कल,आजकल
آخری=अंतिम,आख़री
آد=आद,आद,आदि

I am using UltraEdit v15.20.
Many thanks for a macro or a script.

Jaretin · Aug 23, 2013 08:30 UTC · #2

Here is a slow one.
It works line by line, rewriting each line in place.
Processing blocks of lines at a time would improve speed.

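UltraEdit.insertMode();

// Jump to the end of the file to get the number of lines.
UltraEdit.activeDocument.bottom();
var allLines = UltraEdit.activeDocument.currentLineNum;

// Default to CRLF; switch to LF or CR depending on the file's line terminator type.
var sLineTerminator = "\r\n";
if (UltraEdit.activeDocument.lineTerminator == 1)
	sLineTerminator = "\n";
else if (UltraEdit.activeDocument.lineTerminator == 2)
	sLineTerminator = "\r";

for ( var i=1; i<=allLines; i++ ) {
	UltraEdit.activeDocument.gotoLine( i, 1 );
	UltraEdit.activeDocument.selectLine();
	var sLine = UltraEdit.activeDocument.selection;
	if ( sLine.length == 0 ) continue;	// e.g. the empty line after the last line terminator
	// Cut off the line terminator, but only if the selected line really ends with one.
	var sEnd = "";
	var cut = sLine.length - sLineTerminator.length;
	if ( cut >= 0 && sLine.substring( cut ) == sLineTerminator ) {
		sEnd = sLineTerminator;
		sLine = sLine.substring( 0, cut );
	}
	var pos = sLine.indexOf( "=" ) + 1;	// glosses start after the first "="
	var asFields = sLine.substring( pos ).split( "," );
	var words = {};	// glosses already seen on this line
	var asUnique = [];
	for ( var j=0; j<asFields.length; j++ ) {
		if ( !Object.prototype.hasOwnProperty.call( words, asFields[ j ] ) ) {
			words[ asFields[ j ] ] = true;
			asUnique.push( asFields[ j ] );
		}
	}
	// Overwrite the selected line with the headword and the unique glosses.
	UltraEdit.activeDocument.write( sLine.substring( 0, pos ) + asUnique.join( "," ) + sEnd );
}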
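
The block-wise idea mentioned at the top of this post could look roughly like the sketch below: read the text once, dedupe every line in memory, and write the result back once. This is an untested assumption about what would be faster, not a measured result. `dedupeBlock` is a hypothetical helper, the "\r\n" terminator is assumed (the detected terminator should be passed instead), and the approach assumes the whole file fits into a single selection.

```javascript
// Hypothetical helper: removes duplicate glosses on every line of a text block.
function dedupeBlock(text, terminator) {
	var lines = text.split(terminator);
	for (var i = 0; i < lines.length; i++) {
		var pos = lines[i].indexOf("=") + 1;   // glosses start after the first "="
		var fields = lines[i].substring(pos).split(",");
		var seen = {}, unique = [];
		for (var j = 0; j < fields.length; j++) {
			if (!Object.prototype.hasOwnProperty.call(seen, fields[j])) {
				seen[fields[j]] = true;
				unique.push(fields[j]);
			}
		}
		lines[i] = lines[i].substring(0, pos) + unique.join(",");
	}
	return lines.join(terminator);
}

// Inside UltraEdit: one read and one write for the whole file.
// The guard lets the function be tried outside UltraEdit as well.
if (typeof UltraEdit !== "undefined") {
	UltraEdit.activeDocument.selectAll();
	UltraEdit.activeDocument.write(dedupeBlock(UltraEdit.activeDocument.selection, "\r\n"));
}
```

For very large files it may be safer to apply the same function to chunks of a few thousand lines rather than to the whole file at once.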

dictdoc · Aug 27, 2013 01:21 UTC · #3

Sorry for the late reply; I had given up hope of finding a script to do the job. The script works perfectly. Many thanks.

Mofi · Aug 27, 2013 17:22 UTC · #4

The script works fine for the simplified example, but on my computer it fails on the example with Urdu and Hindi strings. The duplicate strings are removed, but after the script has run, the Unicode file contains only ? in place of every Urdu/Hindi character. Before running the script, the Urdu/Hindi strings were displayed as in the browser window, using the font @Arial Unicode MS.

Perhaps Windows itself must be configured to support those languages, which is definitely not the case on my computer. That could also be the reason why all my quick attempts at this task failed.

Thanks to Jaretin for developing and posting the script, especially as it also works for dictdoc on Urdu and Hindi strings.