How to find lines containing a number stored in a list into another file?

8
NewbieNewbie
8

    Apr 22, 2013#1

    Hi Mofi.

    Thanks for the guidance so far.

    Assume I have 50 million records with field names as below:

    Code: Select all

    ACCOUNT|MOBILENO|NAME|ADD1|ADD2|ADD3|CITY|ZIP|STATE|NEWID|OTHERID|OLDID|REGDATE|STATUS|GENDER|BIRTHDATE|RACE|EMAIL
    10100801|0192168465|MRS RAJESWARY A/P BHASKARAN|NO 42|KG PINANG||HULU BERNAM|35900|PERAK|871119085122||||18-SEP-09|Active|Female|19-NOV-87|Indian|
    10100841|0192311363|CIK ZANEZA BINTI MOHAMMED ZAIKE|NO 16 JALAN 40 DESA JAYA KEPONG|||KUALA LUMPUR|51200|W PERSEKUTUAN KUALA LUMPUR|771002145314|||A3848390|18-SEP-09|Active|Female|02-OCT-77|Malay|
    10102691|0193176085|MR MOHD ZAREMDEEN BIN MOHD ZAMAN|NO 44A 1ST FLOOR JLN TUN MOHD FUAD|SATU TMN TUN DR ISMAIL||KUALA LUMPUR|60000|W PERSEKUTUAN KUALA LUMPUR|701008035171|||A1618225|18-SEP-09|Active|Male|08-OCT-70|Malay|
    10103091|0135333198|LOH KIEN SENG|102 JLN AMAN JELAPANG|||IPOH|30020|PERAK|700218085551|||A1557532|18-SEP-09|Active|Male|18-FEB-70|Chinese|
    10104261|0133920594|PUAN HUSNA BINTI OSMAN|KUARTERS KLINIK KESIHATAN TEKEK|KAMPUNG TEKEK, PULAU TIOMAN||KUALA ROMPIN|26800|PAHANG|860111335584||||18-SEP-09|Active|Female|11-JAN-86|Malay|
    10104911|0165342333| ENCIK.AZHARI BIN ABDULLAH|SLIM PANTAS ENTERPRISE|ESSO FILLING STATION|JALAN BESAR|SLIM RIVER|35800|PERAK|630627085615|||7076892|18-SEP-09|Active|Male|27-JUN-63|Malay|
    10100631|0126360984| PUAN NOR HAYATI BINTI MUSA|434|JALAN MARGOSA 15|TAMAN BUKIT MARGOSA|AMPANGAN|70400|NEGERI SEMBILAN|650728055464|||A0208356|18-SEP-09|Active|Female|28-JUL-65|Malay|
    10102841|0132878737| ENCIK MUHD SYAHAMIRUL EDININ BIN ABD HAMID|C-2-1 DANAU VILLA APT.|JALAN 5/23E TAMAN DANAU KOTA||KL|53200|W PERSEKUTUAN KUALA LUMPUR|721014125411|||A2308960|18-SEP-09|Active|Male|14-OCT-72|Malay|
    What is the best method if I want to search for multiple records (more than 10k) using the NEWID number as the unique key?
    E.g. for the IDs listed below I want to extract all the information as shown above.

    Code: Select all

    851021086081
    730421045333
    850828016469
    781115065635
    770630115226
    800204065259
    881218035189
    790905035304
    620225015308
    730704035311
    840501055627
    870418295109
    641016015136
    870401055512
    590225105958
    600923055106
    620912125998
    760805055095
    870506015314
    890423145920
    571010055531
    800211025570
    751003065190
    691031025055
    860203055125
    790218085043
    740713085280
    610711085740
    850328085775
    880608016210
    880817025308
    800215085255
    Thanks.

    6,675585
    Grand MasterGrand Master
    6,675585

      Apr 22, 2013#2

      I offer 2 solutions:

      1. Make a copy of the large file and run a Perl regular expression Replace All from the top of the file with the search string ^(?:[^\r\n|]*\|){9}([^\r\n|]+)\|.*$ and the replace string \1 to delete from every line all data except the value of field 10 (NEWID). For an explanation of the search string see How to delete lines in CSV file if a certain value is found in defined data field?

      2. Use the script FindStringsToNewFileExtended.js with the search string \n(?:[^\r\n|]*\|){9}([^\r\n|]+) and $1 as the output format string. This solution ignores the first line because no line-feed precedes it, but that should be no problem as the first line is the header line. As explained in the readme of the scripts collection, you have to execute the script several times to get the final output because it processes only 800.000 lines per execution. With 50 million lines it is most likely better to use the first solution, as starting the script so many times manually is no fun.
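To see what the first search string captures, here is a minimal sketch in plain JavaScript (run outside UltraEdit, for example with Node.js). The function name `extractFieldTen` and the `sample` variable are made up for illustration. The `m` flag is an assumption needed so that ^ and $ match per line in JavaScript, and `$1` is JavaScript's spelling of the `\1` replace string.

```javascript
// Plain JavaScript sketch (not an UltraEdit script): keep only field 10 of each line.
// Same pattern as the search string above; "g" replaces all lines, "m" makes ^ and $ per-line.
const fieldTenPattern = /^(?:[^\r\n|]*\|){9}([^\r\n|]+)\|.*$/gm;

function extractFieldTen(text) {
  // "$1" inserts the first capture group, i.e. the tenth field value.
  return text.replace(fieldTenPattern, "$1");
}

// Two data lines taken from the first post.
const sample =
  "10100801|0192168465|MRS RAJESWARY A/P BHASKARAN|NO 42|KG PINANG||HULU BERNAM|35900|PERAK|871119085122||||18-SEP-09|Active|Female|19-NOV-87|Indian|\n" +
  "10103091|0135333198|LOH KIEN SENG|102 JLN AMAN JELAPANG|||IPOH|30020|PERAK|700218085551|||A1557532|18-SEP-09|Active|Male|18-FEB-70|Chinese|";

console.log(extractFieldTen(sample));
```

Lines whose tenth field is empty do not match the pattern and would be left unchanged, because `([^\r\n|]+)` requires at least one character.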

      8
      NewbieNewbie
      8

        Apr 22, 2013#3

        Hi Mofi.

        Thanks for the info. Sorry, I'm not sure if you got me right.

        I have a database of 50 million records with field names as below:

        ACCOUNT|MOBILENO|NAME|ADD1|ADD2|ADD3|CITY|ZIP|STATE|NEWID|BUSREG|OTHERID|OLDID|REGDATE|STATUS|GENDER|BIRTHDATE|RACE|EMAIL
        10100801|0192168465|MRS RAJESWARY A/P BHASKARAN|NO 42|KG PINANG||HULU BERNAM|35900|PERAK|871119085122||||18-SEP-09|Active|Female|19-NOV-87|Indian|
        10100841|0192311363|CIK ZANEZA BINTI MOHAMMED ZAIKE|NO 16 JALAN 40 DESA JAYA KEPONG|||KUALA LUMPUR|51200|W PERSEKUTUAN KUALA LUMPUR|771002145314|||A3848390|18-SEP-09|Active|Female|02-OCT-77|Malay|
        10102451|0177703799| MR.AZHAR BIN ATAN|NO 57 SELASIH 2|TMN PASIR PUTIH||PASIR GUDANG|81700|JOHOR|650507015987|||A0117302|18-SEP-09|Active|Male|07-MAY-65|Malay|
        10102571|0169659889| MR.SOO YEOW LOONG|LOT 623 JLN TELOK|BUNUT||BANTING|42700|SELANGOR|790408105973||||18-SEP-09|Active|Male|08-APR-79|Chinese|
        10102691|0193176085|MR MOHD ZAREMDEEN BIN MOHD ZAMAN|NO 44A 1ST FLOOR JLN TUN MOHD FUAD|SATU TMN TUN DR ISMAIL||KUALA LUMPUR|60000|W PERSEKUTUAN KUALA LUMPUR|701008035171|||A1618225|18-SEP-09|Active|Male|08-OCT-70|Malay|
        10102781|0195906523|MR ISMAIL BIN AWANG|11 PESARA KELEBANG JAYA 12, TAMAN KELEBANG JAYA|||CHEMOR|31200|PERAK|511203075163|||4211178|18-SEP-09|Active|Male|03-DEC-51|Malay|
        10102821|0133991147|MS SITI NOORBAYA BINTI MOHD YUNUS|NO. 1,|JALAN PJU 1A/21,|ARA DAMANSARA,|PETALING JAYA|47301|SELANGOR|750308115238||||18-SEP-09|Active|Female|08-MAR-75|Malay|
        10103091|0135333198|LOH KIEN SENG|102 JLN AMAN JELAPANG|||IPOH|30020|PERAK|700218085551|||A1557532|18-SEP-09|Active|Male|18-FEB-70|Chinese|
        10104261|0133920594|PUAN HUSNA BINTI OSMAN|KUARTERS KLINIK KESIHATAN TEKEK|KAMPUNG TEKEK, PULAU TIOMAN||KUALA ROMPIN|26800|PAHANG|860111335584||||18-SEP-09|Active|Female|11-JAN-86|Malay|
        10104911|0165342333| ENCIK.AZHARI BIN ABDULLAH|SLIM PANTAS ENTERPRISE|ESSO FILLING STATION|JALAN BESAR|SLIM RIVER|35800|PERAK|630627085615|||7076892|18-SEP-09|Active|Male|27-JUN-63|Malay|
        10100631|0126360984| PUAN NOR HAYATI BINTI MUSA|434|JALAN MARGOSA 15|TAMAN BUKIT MARGOSA|AMPANGAN|70400|NEGERI SEMBILAN|650728055464|||A0208356|18-SEP-09|Active|Female|28-JUL-65|Malay|
        10101341|0192823387|ENCIK MOHAMMAD MIZAN BIN MOHAMMAD ARIF|3Q EQUESTRAIN SG SERAI|KUANG||RAWANG|48050|SELANGOR|870303565105||||18-SEP-09|Active|Male|03-MAR-87|Malay|
        10102271|0135333688|MR MUHUZAI BIN MUSTAFA|DM 138|KAMPUNG TELUK BARU|TANJUNG LUMPUR|KUANTAN|26060|PAHANG|821014065429||||18-SEP-09|Active|Male|14-OCT-82|Malay|
        10102311|0132407812|MS KHAIRUNNISA BINTI RAMLI|NO 304 LRG ANGGERIK 9|BDR SUNGGALA||PORT DICKSON|71050|NEGERI SEMBILAN|881011055692||||18-SEP-09|Active|Female|11-OCT-88|Malay|
        10102791|0137987794|MR MD BALYA HIDIR BIN MD SALEH|PTD 1032 TAMAN MAS SURIA PESERAI|||BATU PAHAT|83000|JOHOR|831216016325||||18-SEP-09|Active|Male|16-DEC-83|Malay|
        10102841|0132878737| ENCIK MUHD SYAHAMIRUL EDININ BIN ABD HAMID|C-2-1 DANAU VILLA APT.|JALAN 5/23E TAMAN DANAU KOTA||KL|53200|W PERSEKUTUAN KUALA LUMPUR|721014125411|||A2308960|18-SEP-09|Active|Male|14-OCT-72|Malay|
        10103001|0133999479|MR DOL FATAH BIN ABDUL WAHAB|NO 6 JALAN 14/1 FASA 5|TAMAN CHERAS JAYA||CHERAS|43200|SELANGOR|790512036023||||18-SEP-09|Active|Male|12-MAY-79|Malay|
        10103251|0195898406|ENCIK MOHD FAKRULRAZI BIN ABD KADIR|NO 20-C LORONG KENANGA|KAMPUNG BARU||KUALA NERANG|06300|KEDAH|850513025635||||18-SEP-09|Active|Male|13-MAY-85|Malay|
        10103531|0194092982|CIK SITI ZAHIDAH BINTI AYOB|102 JLN AU2A/14|TAMAN SRI KERAMAT||KUALA LUMPUR|54200|W PERSEKUTUAN KUALA LUMPUR|821128045020||||19-SEP-09|Active|Female|28-NOV-82|Malay|[email protected]
        10100411|0199447300|MR KUMARASAMY A/L RAJAGOPAL|NO 7 TAMAN SRI MAKMUR|||JERANTUT|27000|PAHANG|710426065685|||A1970518|18-SEP-09|Active|Male|26-APR-71|Indian|
        10100441|0193855300|MR KUMARASAMY A/L RAJAGOPAL|NO 7 TAMAN SRI MAKMUR|||JERANTUT|27000|PAHANG|710426065685|||A1970518|18-SEP-09|Active|Male|26-APR-71|Indian|
        10100591|0192630066| MADAM.KHOR KIM SAM|31 BU 11/4|BANDAR UTAMA||PETALING JAYA|47800|SELANGOR|690907086078|||A1376748|18-SEP-09|Active|Female|07-SEP-69|Chinese|
        10100931|0122042936| MR.LOW CHAN WENG|81-6-5|RESOURCE SPRINGS|JALAN AYER PANAS|SETAPAK|53200|W PERSEKUTUAN KUALA LUMPUR|570825105929|||5365094|18-SEP-09|Active|Male|25-AUG-57|Chinese|
        10101141|0199315703|MR WAN AZHAR BIN WAN YUSOFF|NO 16 JALAN IM 2/89|BANDAR INDERA MAHKOTA||KUANTAN|25200|PAHANG|670701035963|||A0723015|18-SEP-09|Active|Male|01-JUL-67|Malay|
        10101801|0122543735| MR.FU POH SING|3-12D|JALAN DESA 2/2|DESA AMAN PURI|KEPONG|52100|W PERSEKUTUAN KUALA LUMPUR|640529107357|||7308347|18-SEP-09|Active|Male|29-MAY-64|Chinese|
        10102651|0122852585| MADAM.RUSLINA BINTI ABU HASSAN|NO 27|JALAN SG 10/12|TAMAN SERI GOMBAK|BATU CAVES|68100|SELANGOR|610418095034|||6172441|18-SEP-09|Active|Female|18-APR-61|Malay|

        Now I want to run multiple searches:
        match 100k NEWIDs against the 50 million records and get all the relevant fields.

        Sorry about my explanation. Hope you get me. Thanks

        Cheers

        6,675585
        Grand MasterGrand Master
        6,675585

          Apr 23, 2013#4

          Let me try to explain what I understood.

          You have a text file opened in UltraEdit which contains a list of ID numbers line by line.

          You want to search in a directory tree for CSV files, using the character | as separator, that contain one or more of the ID numbers.

          In a new file you want all the lines from any of the CSV files whose tenth field value is an ID number listed in the opened file.

          The output file should contain only the found lines, with no other information such as the file in which a line was found or its line number.

          You are currently using UltraEdit v??.??.??.????

          Is that the description for the task to do?

          8
          NewbieNewbie
          8

            Apr 23, 2013#5

            Hi Mofi,

            I'm using UltraEdit Professional Text/HEX Editor Version 19.00.0.1022

            The task:

            I have a CSV file open in UltraEdit which contains fields as below:

            Code: Select all

            MOBILE_NO|NAME|NEW_IC|OLD_ID|OTHER_IC|ADD1|ADD2|ADD3|CITY|STATE|POSTCODE
            
            0198183831|CHUNG MIANG POH|450226135249|K570346||LOT 886 JALAN GUBAH BINTAWA|||KUCHING|SARAWAK|93450
            0198331936|NGU TAI HONG|630205135635|K0000380||P O BOX 60925|||TAWAU|SABAH|91019
            0195896368|MR LIM SIEW SENG|561031075233|5111429||G 12,TAMAN SEGAR JAYA,BAGAN LALANG,|||BUTTERWORTH|PULAU PINANG|13400
            0195191722|ENCIK BASRIZAL BIN CHE BAHAROM|770708026101||T720988|NO 51 JALAN PONDOK TG BEDIL|SUNGAI BARU GUNUNG||ALOR SETAR|KEDAH|05150
            0198528442|DATU BASRUN BIN  DATU MANSOR|560830125079|H0065240||LOT 11 TAMAN PARK|PUTATAN||KOTA KINABALU|SABAH|88100
            0132284857|MR MOHAMAD LUKHMAN NOOR HAKIM BIN JAAFAR|871216105121|||LOT 3482 JALAN MERBAU|KAMPUNG MELAYU SUBANG||SHAH ALAM|SELANGOR|40150
            0123656579| MR MUSTAPA BABA|620628045085|||453-1 KM 6 KAMPUNG DUYONG MELAKA|||MELAKA|MELAKA|75460
            0196888721|ENCIK KHAIRUL AFIF BIN KAMRIN|870805305047|||NO 3|JALAN USJ 3A/1||SUBANG JAYA|SELANGOR|47610
            0192863660|MR MOHD MIZUAR BIN MOHD YUSOF|770918055955|A3754477||NO 384 RUMAH RAKYAT PANCHOR PAROI|||SEREMBAN|NEGERI SEMBILAN|70400
            0199157131|LUKEMA MERI BIN SALLEH@ ABDUL LATIF|660326115371|A0369041||JKR 229 KUARTERS KERAJAAN|JALAN SULTAN MAHMUD|BATU BURUK|KUALA TERENGGANU|TERENGGANU|20400
            0198049008|MR CHUA SENG NYEP|650729105013|A0196788||TB 10553 LORONG 7/2|TAMAN MEGAH JAYA JALAN APAS BATU  3 1/2||TAWAU|SABAH|91000
            In a new file I have 100k ID numbers only, without any other details. How do I search for them and map them to the above data so that I get all the details?

            ID NO without details

            450226135249
            630205135635
            561031075233


            After search and mapping

            0198183831|CHUNG MIANG POH|450226135249|K570346||LOT 886 JALAN GUBAH BINTAWA|||KUCHING|SARAWAK|93450
            0198331936|NGU TAI HONG|630205135635|K0000380||P O BOX 60925|||TAWAU|SABAH|91019
            0195896368|MR LIM SIEW SENG|561031075233|5111429||G 12,TAMAN SEGAR JAYA,BAGAN LALANG,|||BUTTERWORTH|PULAU PINANG|13400

            Thanks. Cheers

            6,675585
            Grand MasterGrand Master
            6,675585

              Apr 24, 2013#6

              Your input data keeps changing. Here is a script which searches for lines containing one of the listed IDs, based on the data in your last post.

              Open the large CSV file with all the data as the first file (leftmost on the open file tabs bar).

              Open the file with the IDs, one per line, as the second file.

              Open, or create and save, the script file as the third file; it must be the active one. Use Scripting - Run Active Script.

              The script file can also be added to the list of scripts and executed from the menu or the Script List view.

              Code: Select all

              if (UltraEdit.document.length > 1)  // Are at least two files opened?
              {
                 // Define environment for this script.
                 UltraEdit.insertMode();
                 if (typeof(UltraEdit.columnModeOff) == "function") UltraEdit.columnModeOff();
                 else if (typeof(UltraEdit.activeDocument.columnModeOff) == "function") UltraEdit.activeDocument.columnModeOff();
                 UltraEdit.perlReOn();
              
                 var CsvFile = UltraEdit.document[0];   // First file (most left) must be the large CSV file.
                 var ListFile = UltraEdit.document[1];  // Second file must be the file with the list of IDs.
              
                 // Load all IDs into an array of strings.
                 ListFile.selectAll();
                 var asIDs = ListFile.selection.split("\r\n");
                 ListFile.top();
                 // Remove last string if it is an empty string because the list file ended with a line termination.
                 if (asIDs[asIDs.length-1] == "") asIDs.pop();
              
                 // Open output window for showing progress.
                 UltraEdit.outputWindow.clear();
                 UltraEdit.outputWindow.showWindow(true);
              
                 // Define the parameters for the multiple Perl regular expression finds.
                 CsvFile.findReplace.mode=0;
                 CsvFile.findReplace.matchCase=true;
                 CsvFile.findReplace.matchWord=false;
                 CsvFile.findReplace.regExp=true;
                 CsvFile.findReplace.searchDown=true;
                 CsvFile.findReplace.searchInColumn=false;
              
                 // Use user clipboard 9 for collecting the found data.
                 UltraEdit.selectClipboard(9);
                 UltraEdit.clearClipboard();
              
                 CsvFile.top();
                 var nFoundCount = 0;
              
                 for (var nID = 0; nID < asIDs.length; nID++)
                 {
                    var sSearch = "^(?:[^\\r\\n|]*\\|){2}" + asIDs[nID] + "\\|.+$";
                    if (CsvFile.findReplace.find(sSearch))
                    {
                       UltraEdit.clipboardContent += CsvFile.selection + "\r\n";
                       CsvFile.top();
                       UltraEdit.outputWindow.write(asIDs[nID]+" found.");
                       nFoundCount++;
                    }
                    else UltraEdit.outputWindow.write(asIDs[nID]+" not found.");
                 }
              
                 // Output found lines into a new file if anything was found at all.
                 if (nFoundCount)
                 {
                    // Create a new file for the results.
                    UltraEdit.newFile();
                    UltraEdit.activeDocument.unixMacToDos();
                    UltraEdit.activeDocument.paste();
                    UltraEdit.clearClipboard();
                    UltraEdit.activeDocument.top();
                 }
                 UltraEdit.selectClipboard(0);  // Select Windows clipboard.
                 // Display a short summary message prompt.
                 UltraEdit.messageBox("Found "+nFoundCount+" ID"+(nFoundCount!=1 ? "s":"")+" of "+asIDs.length+" ID"+(asIDs.length!=1 ? "s.":"."));
              }

              8
              NewbieNewbie
              8

                Apr 25, 2013#7

                Hi Mofi

                I followed your instructions: I opened a first file with 30 million records, then a second file containing only 6 million ID numbers. I opened a third file, copied the script into it and ran the active script. I also executed it from the script menu.
                The ID numbers in the second file were highlighted in blue, but nothing happened. Did I do it correctly? Please advise. Cheers

                6,675585
                Grand MasterGrand Master
                6,675585

                  Apr 25, 2013#8

                  What I did to verify that the script works:
                  1. I started UltraEdit resulting in having only an empty new file open.
                  2. I copied the block from your previous post with the data and pasted them into this new file.
                  3. I pressed Ctrl+N to open one more new file, copied the 3 lines with the IDs from your previous post and pasted them into the second new file.
                  4. I created a third new file, saved it as Test.js, wrote the script code and saved the modifications with Ctrl+S.
                  5. So there are now 3 files opened: first one is a new file with the input data, second file is a new file with the 3 IDs, third file is Test.js which is the active file.
                  6. Now I executed Scripting - Run Active Script and the script produced a new file with same output as you posted.
                  That is exactly what you should do first too. Check if the script does what you requested in your previous post.

                  The script will fail to find anything if the CSV file contains data not exactly as you specified in your previous post.

                  Please note that UltraEdit will most likely need several minutes to finish on your huge file with 30 million lines, especially when the IDs cannot be found at all because the CSV file is formatted differently than posted.

                  If the list file really contains 6 million IDs, the script will most likely need hours to finish. An out of memory situation can also easily occur, as every string found and selected during script execution is copied to memory twice. A 32-bit application like UltraEdit can address only 2 GB (signed addressing) or 4 GB (unsigned addressing) of memory. When no more memory can be allocated, the script terminates or, worse, UltraEdit crashes. It would be better to divide the huge list of IDs into smaller parts, for example 50.000 per script run, to avoid the out of memory situation.
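Dividing the list can be sketched in plain JavaScript as follows, assuming the IDs are already loaded into an array. `chunkIds` is a hypothetical helper, not part of the script above or of the UltraEdit scripting API. Each part would then be saved as its own list file and used for one script run.

```javascript
// Split a list of IDs into parts of at most `size` entries each.
function chunkIds(ids, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  const parts = [];
  for (let start = 0; start < ids.length; start += size) {
    // slice() copies the range and stops at the end of the array on its own.
    parts.push(ids.slice(start, start + size));
  }
  return parts;
}

// Tiny demonstration with a part size of 2 instead of 50000.
const demoIds = ["450226135249", "630205135635", "561031075233"];
console.log(chunkIds(demoIds, 2));
```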

                  And please install the latest hotfix of UltraEdit, as v19.00.0.1022 was the first public build of UE v19.00, which has an updated Perl regular expression engine compared to v18.20. There were some bugs in the Perl regexp engine implementation and most of them are fixed in the currently latest release, v19.00.0.1031. I tested the script with build 1031 of UE v19.00.

                  8
                  NewbieNewbie
                  8

                    Apr 27, 2013#9

                    Hi Mofi.
                    It works well with the data I posted.
                    When I run it with other data an error occurs, which reads as below:

                    Code: Select all

                    Running script: J:\SERVER FILE\Script\test.js
                    =======================================================================
                    An error occurred on line 13:
                    �� 
                    Script failed.

                    I tried the script with the ID in the 8th column.

                    When I run the script, the message says no ID was found.

                    In the first data set the ID was in the 3rd column, and it worked well.

                    Should I move the ID to the 3rd column in all files?

                    Thanks

                    6,675585
                    Grand MasterGrand Master
                    6,675585

                      Apr 28, 2013#10

                      The error is caused by an out of memory situation. As I wrote already, limit the number of IDs in the list to 50.000, 100.000 or 200.000 per script run.

                      The script contains the line

                      var sSearch = "^(?:[^\\r\\n|]*\\|){2}" + asIDs[nID] + "\\|.+$";

                      Change the number from 2 to 7 and the IDs are searched for in the eighth data column.
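The relation between the column and the number in the search string (number = column - 1) can be sketched as a small helper. `buildSearch` is a hypothetical function for illustration and is not part of the posted script; the escaping step is an added precaution, assuming some IDs might contain characters other than digits.

```javascript
// Hypothetical helper: build the Perl regular expression search string for an
// ID expected in the given 1-based data column of a |-separated line.
function buildSearch(id, column) {
  if (!Number.isInteger(column) || column < 1) {
    throw new RangeError("column must be a positive integer");
  }
  // Escape regex metacharacters in case an ID is not purely numeric.
  const escapedId = String(id).replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Skip (column - 1) fields, then require the ID followed by a separator.
  return "^(?:[^\\r\\n|]*\\|){" + (column - 1) + "}" + escapedId + "\\|.+$";
}

console.log(buildSearch("450226135249", 3)); // same string the script builds with {2}
console.log(buildSearch("450226135249", 8)); // variant for the eighth column, with {7}
```

In the script, the line assigning `sSearch` could then call such a helper with `asIDs[nID]` and the wanted column.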

                      8
                      NewbieNewbie
                      8

                        Apr 29, 2013#11

                        Hi Mofi..

                        You were right. The script takes a long time to run. I tried 50k IDs per script run and after almost 10 hours it is still running.
                        I have 6 million IDs to process. Is there any other way to do this task?
                        Also, how do I combine data? I have hundreds of folders of data with the field names jumbled up: every folder has the fields in different columns from the others. How do I make the data uniform?
                        What is the maximum number of records I can open in UltraEdit? Is it possible to open 30 million records?

                        Cheers

                        6,675585
                        Grand MasterGrand Master
                        6,675585

                          May 01, 2013#12

                          UltraEdit can be used to edit files of any size. But for large and huge files it is strongly recommended to configure UltraEdit for working with large files as explained in the IDM power tip Large file text editor.

                          Editing large files can in general be done efficiently only if memory usage is reduced to a minimum or, the total opposite, everything is loaded into memory and all modifications are done in memory. Loading everything into memory requires, for huge files, a computer with 4, 8, 16 or even more GB of RAM and of course a 64-bit application which can really make use of (= address) so much RAM. Although computers nowadays have more and more RAM, a really large block of free RAM is nevertheless often rare. UltraEdit as a 32-bit application cannot make use of more than 2 GB RAM (or 4 GB with special coding) at all.

                          The problem here is that you are not using UltraEdit to edit the huge file. What you want is to extract/copy data from a huge file based on a list which also contains a very large number of strings. That's a task a text editor is not designed for. It requires special handling of the data to use as little memory as possible while still processing the data efficiently.

                          Have you ever calculated how many string (not integer) compares the program has to do for your task in the worst case? 30.000.000 x 6.000.000, which is 180.000.000.000.000 string compares. No wonder this takes very long, especially as the script has to run complex Perl regular expression finds and not only simple string compares.

                          A more efficient way to do this task would be:
                          1. Load one ID string after the other and keep in memory only the ID converted to an unsigned integer, not the string. So memory holds a list of ID integers, not a list of ID strings.
                          2. Next, in a loop, load one line after the other from the huge data file.
                          3. For every line, extract the ID string and convert it to an unsigned integer as well.
                          4. In a second, inner loop, compare the ID from the loaded line against every number in the ID number list.
                          5. If there is a match, append the line still in memory to an output buffer; once this output buffer contains, for example, 1000 lines, write it to an output file and release the memory for the found lines.
                          6. If the IDs are unique, so that no 2 lines can contain the same ID, it would be best to remove the matching ID from the ID number list so that the number of integer compares for the following lines is reduced by one.
                          7. The main loop continues by releasing the memory used for the current line if it does not contain an ID of interest, and loads the next line. So we are back at step 3.
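The steps above can be sketched in plain JavaScript meant to run outside UltraEdit (for example with Node.js). This is only a sketch under stated assumptions: a hash set (`Set`) of ID strings is swapped in for the inner compare loop and the integer conversion, output buffering is left out, and `filterLinesById`, `idsAreUnique` and the demo variables are hypothetical names.

```javascript
// Return the lines whose field at `columnIndex` (0-based) is one of the wanted IDs.
// Assumption: a Set lookup replaces the inner loop over the whole ID list.
function filterLinesById(lines, ids, columnIndex, idsAreUnique = false) {
  const wanted = new Set(ids);
  const found = [];
  for (const line of lines) {
    const id = line.split("|")[columnIndex];
    if (wanted.has(id)) {
      found.push(line);
      if (idsAreUnique) {
        // No other line can carry this ID, so it never needs to be checked again.
        wanted.delete(id);
        if (wanted.size === 0) break; // every ID was found; stop reading lines
      }
    }
  }
  return found;
}

// Demonstration with lines and IDs from post #5; the ID is in the third column (index 2).
const demoLines = [
  "0198183831|CHUNG MIANG POH|450226135249|K570346||LOT 886 JALAN GUBAH BINTAWA|||KUCHING|SARAWAK|93450",
  "0198331936|NGU TAI HONG|630205135635|K0000380||P O BOX 60925|||TAWAU|SABAH|91019",
  "0123656579| MR MUSTAPA BABA|620628045085|||453-1 KM 6 KAMPUNG DUYONG MELAKA|||MELAKA|MELAKA|75460",
  "0195896368|MR LIM SIEW SENG|561031075233|5111429||G 12,TAMAN SEGAR JAYA,BAGAN LALANG,|||BUTTERWORTH|PULAU PINANG|13400",
];
const demoWanted = ["450226135249", "630205135635", "561031075233"];
console.log(filterLinesById(demoLines, demoWanted, 2, true).join("\n"));
```

The function accepts any iterable of lines, so for a real file with millions of lines the lines would come from a streaming reader rather than an array, and the found lines would be written out in batches as described in step 5.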
                          In general the method described here could be coded in an UltraEdit script, but in practice it will not work. The reason is that UltraEdit copies every selected text accessed via the document property selection into memory as a JavaScript string object and keeps this string object until the script terminates, at which point all memory used during script execution is released. That is of course bad memory management, and I reported it a few weeks ago after detecting this behavior while writing a script for another user who was also searching for lots of data in a huge file. As every selected text is kept in memory until the script terminates, it is inevitable that sooner or later an out of memory situation occurs when running a data extraction task on a very large or huge file. So UltraEdit scripts are at the moment not really useful for such special tasks involving a very large amount of data. I hope the IDM developers soon improve memory management for selected text by removing the string object of the current selection from memory immediately when the selection is canceled or replaced by another selection.

                          I have an idea how your task with getting 6 millions of lines from a file with 30 millions of lines could be done by modifying a copy of the file with 30 millions of lines using an UltraEdit macro. As UltraEdit macros do not copy strings to memory except when using clipboard(s) or ^s in Finds/Replaces (which is not kept in memory up to macro termination), this approach would make it possible to achieve the task without running into an out of memory situation. But UltraEdit macros do not support variables and conditions for doing something line by line. So although the macro could do the job, it would take very long and would stress your hard disk extremely as all string compares are done with heavy access of the file data on the hard disk.

                          As even the optimal solution described in the numbered list will take very long to finish, it would definitely be best to write a C++/C# application for this task, especially because the job parallelizes very well across threads using all cores of the CPU. A C++/C# application using worker threads to compare the ID numbers against the ID from a loaded line could do the task much quicker than any UE script or macro ever could.

                          Another method doing what you want is using a database application like Access, MySQL, ... Database applications are optimized for such tasks.

                          If you have multiple CSV files with different data structures, you need to reformat the files for example with tagged regular expression replaces so that all CSV files have finally the same data structure. Then it would be possible to merge them. I wrote a script for merging CSV files which can be downloaded from user-submitted scripts page. But the script is written for merging a set of small CSV files to a large CSV file. It is not written for merging several large CSV files to a huge CSV file.
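As an alternative to tagged regular expression replaces, the reordering can be sketched in plain JavaScript run outside UltraEdit. This assumes every file starts with a header line and that the same field carries the same header name in all files; `reorderColumns` and the demo data are hypothetical and not the merge script mentioned above.

```javascript
// Hypothetical helper: rewrite |-separated lines so the columns follow `targetHeader`.
// The first input line must be the header naming the file's current columns.
function reorderColumns(lines, targetHeader) {
  const sourceHeader = lines[0].split("|");
  // For every target column, find where it sits in the source (-1 if absent).
  const positions = targetHeader.map((name) => sourceHeader.indexOf(name));
  const result = [targetHeader.join("|")];
  for (const line of lines.slice(1)) {
    const fields = line.split("|");
    // Columns missing from the source file become empty fields.
    const reordered = positions.map((p) =>
      p >= 0 && fields[p] !== undefined ? fields[p] : ""
    );
    result.push(reordered.join("|"));
  }
  return result;
}

// Demonstration: a file with three columns in a different order than the target layout.
const demoFile = [
  "NAME|NEW_IC|MOBILE_NO",
  "NGU TAI HONG|630205135635|0198331936",
];
console.log(reorderColumns(demoFile, ["MOBILE_NO", "NAME", "NEW_IC", "CITY"]).join("\n"));
```

Once all files share one column layout, the header line of every file except the first can be dropped and the files concatenated.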

                          8
                          NewbieNewbie
                          8

                            Jul 08, 2013#13

                            Hi Mofi,

                            If my id number is in the first column, how do I change the script:

                            var sSearch = "^(?:[^\\r\\n|]*\\|){2}" + asIDs[nID] + "\\|.+$";

                            Code: Select all

                            IDNO|OLDIDNO|NAME|ADDRESS1|ADDRESS2|ADDRESS3|STATE|MOBILE|
                            440521065029|3532232|MUHAMAD BIN KACHA|L |KG BUKIT LEPAS|SARANG TIONG|TIOMAN|ROMPIN|0197416284
                            450521015055|3995972|DEWA A/L WAJI|L |KG BUKIT LEPAS|SARANG TIONG|TIOMAN|ROMPIN|0139053276
                            460521065043|3231984|LION BIN KASIM|L |KG BUKIT LEPAS|SARANG TIONG|TIOMAN|ROMPIN|0137338501
                            511016055351|4101344|CHEE HEE FONG|L |KG BUKIT LEPAS|SARANG TIONG|TIOMAN|ROMPIN|0197051883
                            600803065377|A0473349|SEFRAN BIN MUHAMAD|L |KG BUKIT LEPAS|SARANG TIONG|TIOMAN|ROMPIN|0197607652

                            6,675585
                            Grand MasterGrand Master
                            6,675585

                              Jul 08, 2013#14

                              With the ID number in the first column the search string is: "^" + asIDs[nID] + "\\|.+$";