Deleting duplicate glosses on a line


    Aug 18, 2013 #1

    I am working on an Urdu to Hindi dictionary and I have created the following file structure:

    Code: Select all

    Headword=Gloss1,Gloss2,Gloss3
    i.e. glosses delimited by a comma.
    
    It so happens that in some cases (around 6,000+ entries in a file of 200,000+) the glosses are duplicated.
    Since this is likely to recur, could a macro or a script check the glosses on the right-hand side of each line and, where there are duplicates, remove them so that each gloss is kept only once?
    An example will make this clear:
    Input

    Code: Select all

    a=b,c,b
    d=p,q,p
    e=z,y,g,z,g,y
    
    The expected output would be:

    Code: Select all

    a=b,c
    d=p,q
    e=g,y,z
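
    The transformation asked for above can be sketched as a plain JavaScript function working on one line at a time. This is only an illustration under assumptions: dedupeGlosses and its sortGlosses parameter are hypothetical names, not part of UltraEdit, and the function knows nothing about the editor itself.

```javascript
// Hypothetical helper for illustration; not part of the UltraEdit API.
// Removes duplicate glosses from one "Headword=Gloss1,Gloss2" line.
// Pass sortGlosses = true to also sort the glosses that remain.
function dedupeGlosses(line, sortGlosses) {
    var pos = line.indexOf("=");
    if (pos < 0) return line;                // no "=": return the line unchanged
    var head = line.substring(0, pos + 1);   // headword including "="
    var glosses = line.substring(pos + 1).split(",");
    var seen = {};                           // glosses already met on this line
    var kept = [];                           // glosses in order of first occurrence
    for (var i = 0; i < glosses.length; i++) {
        var key = "#" + glosses[i];          // prefix avoids clashes with built-in property names
        if (seen[key] !== true) {
            seen[key] = true;
            kept.push(glosses[i]);
        }
    }
    if (sortGlosses) kept.sort();            // default sort compares UTF-16 code units
    return head + kept.join(",");
}
```

    Without the second argument the glosses stay in order of first occurrence, so the third input line becomes e=z,y,g. Passing true sorts them and gives e=g,y,z as in the expected output.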
    
    In case live data is needed, here is a sample:

    Code: Select all

    آبادِیوں=आबादिओं,आबादियों
    آبادی=जनसंख्या,आबादी
    آبجیکشن=ऑबजेक्शन,ऑब्जेक्शन
    آبلا=अबला,उबला
    آبو=आबू,आबो
    آتشک=आतशक,आतिशक
    آتم=आतम,आतम,आत्म,आत्म
    آتون=आतून,आतोन
    آتیں=आतीं,आतें,आतें,आतीं
    آجا=आ जा,आजा
    آجاتی=आ जाती,आजाती
    آجانا=आ जाना,आजाना
    آجکل=आज कल,आजकल
    آخری=अंतिम,आख़री
    آد=आद,आद,आदि
    
    I am using UltraEdit v15.20.
    Many thanks for a macro or a script.


      Aug 23, 2013 #2

      Here is a slow one.
      It works line by line, rewriting each line in place.
      Processing blocks of lines at a time would improve the speed.

      Code: Select all

      UltraEdit.insertMode();
      
      // Jump to the end of the file to get the number of lines.
      UltraEdit.activeDocument.bottom();
      var allLines = UltraEdit.activeDocument.currentLineNum;
      
      // Line terminator of the active file: DOS by default, 1 = UNIX, 2 = MAC.
      var sLineTerminator = "\r\n";
      if (UltraEdit.activeDocument.lineTerminator == 1)
      	sLineTerminator = "\n";
      else if (UltraEdit.activeDocument.lineTerminator == 2)
      	sLineTerminator = "\r";
      
      for ( var i=1; i<=allLines; i++ ) {
      	UltraEdit.activeDocument.gotoLine( i, 1 );
      	UltraEdit.activeDocument.selectLine();
      	var sLine = UltraEdit.activeDocument.selection;
      	var pos = sLine.indexOf( "=" ) + 1;
      	if ( pos == 0 ) continue;	// no "=" on this line (e.g. empty last line): leave it unchanged
      	var sOut = sLine.substring( 0, pos );
      	// Remove the line terminator, if present, before splitting the glosses.
      	sLine = sLine.substring( pos ).replace( /[\r\n]+$/, "" );
      	var asFields = sLine.split( "," );
      	var words = {};	// glosses already seen on this line
      	var asKept = [];	// glosses in order of first occurrence
      	for ( var j=0; j<asFields.length; j++ ) {
      		// Prefix the key so a gloss can never clash with a built-in property name.
      		if ( words[ "#" + asFields[ j ] ] !== true ) {
      			asKept.push( asFields[ j ] );
      			words[ "#" + asFields[ j ] ] = true;
      		}
      	}
      	UltraEdit.activeDocument.write( sOut + asKept.join( "," ) + sLineTerminator );
      }
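
      The remark that working on blocks of lines would be faster could be sketched roughly as follows: take a whole block of text as one string, remove the duplicates on every line in a single pass, and write the result back once. dedupeText is a hypothetical helper name, and the UltraEdit calls in the trailing comment are an untested assumption about how it might be wired up, not something verified on v15.20.

```javascript
// Sketch only: dedupeText is a hypothetical helper, not an UltraEdit function.
// It removes duplicate glosses on every "Headword=Gloss1,Gloss2" line of a text block.
function dedupeText(text) {
    // With the m flag, ^ and $ match at line boundaries, so the line
    // terminators (\r\n, \n or \r) are never touched by the replacement.
    return text.replace(/^([^=\r\n]*=)([^\r\n]*)$/gm, function (match, head, glosses) {
        var fields = glosses.split(",");
        var seen = {};                      // glosses already met on this line
        var kept = [];                      // glosses in order of first occurrence
        for (var i = 0; i < fields.length; i++) {
            var key = "#" + fields[i];      // prefix avoids clashes with built-in property names
            if (seen[key] !== true) {
                seen[key] = true;
                kept.push(fields[i]);
            }
        }
        return head + kept.join(",");
    });
}

// Assumed use inside an UltraEdit script (untested):
//   UltraEdit.activeDocument.selectAll();
//   var text = UltraEdit.activeDocument.selection;
//   UltraEdit.activeDocument.write( dedupeText( text ) );
```

      Lines without "=" are left exactly as they are. For a very large file it may be safer to feed the function a few thousand lines at a time rather than the whole file.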
      


        Aug 27, 2013 #3

        Sorry, I had given up hope of finding a script to do the job. The script works perfectly. Many thanks.


          Aug 27, 2013 #4

          The script works fine for the simplified example, but fails on my computer for the example with Urdu and Hindi strings. The script does remove the duplicate strings, but after it has run, the Unicode file contains only ? in place of every Urdu/Hindi character, whereas before running the script the Urdu/Hindi strings were displayed just as in the browser window, using the font @Arial Unicode MS.

          Well, perhaps Windows itself must be configured to support those languages, which is definitely not the case on my computer. That could also be the reason why all my quick attempts at solving this task failed, too.

          Thanks to Jaretin for developing and posting the script, especially as it also works for dictdoc on Urdu and Hindi strings.