Merge data from file 1 with data in file 2 ignoring accents

Merge data from file 1 with data in file 2 ignoring accents

9
NewbieNewbie
9

    Nov 15, 2008#1

    Hello,

    I'm looking for a solution with UltraEdit before to buy it.

    Is it possible to do this and how to do it ?

    In my database (Mysql) I would like to replace some stuffs

    Example :

    In my file, I have 5000 rows
    zipcode, 'city', 'Right_url_website',
    means:
    zipcode, 'Chicago', 'www.website-chicago.com',

    and in this same file I have 8000 different rows
    (id, 'department', 'area', zipcode, 'city', 'Not yet url website', 'address'),
    means
    (id, 'department', 'area', zipcode, 'Chicago', 'Not yet url website', 'address'),

    I would like replace 'No_url_website' by 'Right_url_website'
    It means:
    Search 'chicago' ('city' = Column 2) if found 'chicago' ('city' = Columns 5) with 'No_url_website' (Columns 6) Replace 'No_url_website' (Columns 6) by 'www.website-chicago.com' ('Right_url_website' Columns 3)

    Many thanks in advance for your answer
    (I hope my English was not to bad because I'm French ;-) )

    6,675585
    Grand MasterGrand Master
    6,675585

      Nov 15, 2008#2

      Okay, the macro worked on your example files. It is a variant of the macro posted at Replace text in file A with info from file B.

      It is important that you have only the 2 files open and the text file has the focus when you start the macro. The main difference is the red code. This code reformats your text file to the format of the sql file.

      First it deletes all white-spaces at end of every line and then moves the cursor to start of the first line with an URL. Next it replaces 2 or more spaces with a single space. These 2 actions are just for security.

      The 2 following regular expression replaces in UltraEdit syntax are the tricky one. They convert every line to the format used in the sql file.

      The reformatted block is copied into user clipboard 9. Because of these changes the text file is always modified after executing the macro. So you can see after macro execution what the regular expresssion replaces have done in the text file. Use File - Revert To Saved or close the text file without saving if you don't want to keep this format. You could also replace the command NextWindow by CloseFile NoSave to close the modified text file without saving inside the macro.

      The first regular expression Find command in the loop selects the zip_postal_code and the city in the line from the text file. This string is used for the replace in the loop which searches for that string plus the empty url string '' and replaces it with the whole line from the text file. I thought, it is better to include also the zip postal code for security if two different cities have the same name.

      But unfortunately this is not enough to get the correct result in the sql file because in your text file the cities are always completely in uppercase and so the cities in the SQL file would be also completely in uppercase after macro execution. To avoid that I have inserted 2 additional regular expression replaces before and after the loop. The first one before the loop creates a copy of the city names in the sql file at start of every line (column is copied). After the loop this column is moved back to the correct position in the sql file replacing the uppercase city names from the text file.


      The macro property Continue if search string not found must be checked for this macro.

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Clipboard 9
      Top
      TrimTrailingSpaces
      Find "http"
      EndSelect
      Key HOME
      Find RegExp " +"
      Replace All " "
      Find RegExp " ^([0-9]+^)$"
      Replace All "|^1"
      Find RegExp "%^(*^) ^([~|]+^)|^([0-9]+^)$"
      Replace All "^3, '^2', '^1'"
      SelectToBottom

      Copy
      NextWindow
      Top
      Paste
      IfColNumGt 1
      "
      #
      "
      Else
      "#
      "
      EndIf
      Find RegExp "%^(*, '*', '*', '*', [0-9]+, [0-9]+, ^)^('*'^)"
      Replace All "^2, ^1^2"

      Top
      Loop
      IfCharIs "#"
      DeleteLine
      ExitLoop
      EndIf
      Find RegExp "%*, '*',"
      Copy
      EndSelect
      Key HOME
      StartSelect
      Key END
      Find "^c ''"
      Replace All "^s"
      DeleteLine
      EndLoop
      Find RegExp "%^('*'^), ^((*, '*', '*', '*', [0-9]+, [0-9]+, ^)'*'"
      Replace All "^2^1"

      ClearClipboard
      Clipboard 0
      Best regards from an UC/UE/UES for Windows user from Austria

      9
      NewbieNewbie
      9

        Nov 15, 2008#3

        Excellent, thank you so much for your time.
        I'm very lucky, you have spent a lot of time for help me, I really appreciate.

        Actually in France it's 9.00pm I will try your advises tomorrow morning, and let you know asap if I have understood everything.
        Mofi wrote:it is better to include also the zip postal code for security if two different cities have the same name.
        In fact no, it's the opposite, in France we can have 5 same zip postal for 5 different cities but not two different cities have the same name

        Thanks again

        6,675585
        Grand MasterGrand Master
        6,675585

          Nov 16, 2008#4

          Okay, no problem. The slightly modified macro ignoring the zip postal code. The red lines are the lines where I made changes.

          InsertMode
          ColumnModeOff
          HexOff
          UnixReOff
          Clipboard 9
          Top
          TrimTrailingSpaces
          Find "http"
          EndSelect
          Key HOME
          Find RegExp " +"
          Replace All " "
          Find RegExp " ^([0-9]+^)$"
          Replace All "|^1"
          Find RegExp "%^(*^) ^([~|]+^)|[0-9]+$"
          Replace All "'^2', '^1'"

          SelectToBottom
          Copy
          CloseFile NoSave
          Top
          Paste
          IfColNumGt 1
          "
          #
          "
          Else
          "#
          "
          EndIf
          Find RegExp "%^(*, '*', '*', '*', [0-9]+, [0-9]+, ^)^('*'^)"
          Replace All "^2, ^1^2"
          Top
          Loop
          IfCharIs "#"
          DeleteLine
          ExitLoop
          EndIf
          Find RegExp "%'*',"
          Copy
          EndSelect
          Key HOME
          StartSelect
          Key END
          Find "^c ''"
          Replace All "^s"
          DeleteLine
          EndLoop
          Find RegExp "%^('*'^), ^((*, '*', '*', '*', [0-9]+, [0-9]+, ^)'*'"
          Replace All "^2^1"
          ClearClipboard
          Clipboard 0
          Best regards from an UC/UE/UES for Windows user from Austria

          9
          NewbieNewbie
          9

            Nov 16, 2008#5

            I have spent more than 5 hours without success :-(
            Of course the problem is from me not from your code I'm sure ;-)

            To be more simple I have changed the file (attached)

            convert in sql file and deleted ', and replace by : |

            Like this, It will be easier for me to understand, and this will give me the possibility to use your code for other things.

            Would you like to make a new one for me :oops:

            For more explanations :
            This macro doesn't add the url adress at the right place and I don't understand why.

            That's why, I have asked you to make a new macro with less parameters, in this way it will be easier for me to find the right way.

            I really want to learn something and discover how to use UltraEdit, I really want to understand and not just get the solution without the possibily to do something else in the future with this software.

            Many thank's in advance

            Attached ZIP file deleted by Mofi to save privacy.

            6,675585
            Grand MasterGrand Master
            6,675585

              Nov 17, 2008#6

              Please note that the lines

              |start|http://mairie.pagespro-orange.fr/charno ... tm|Charnoz|Sur|Ain|end|
              |start|http://www.ville-laboisse.fr|La|Boisse|end|
              |start|http://www.villette-sur-ain.fr|Villette|Sur|Ain|end|

              in file base_with_www17nov.sql are not correct. That must be corrected or these 3 lines will be wrong interpreted by the macro. I used the UltraEdit regular expression search string |start|*|*|*|*| to find those 3 lines with too many | characters.

              Because file base_with_www17nov.sql contains the city names now in correct case, the macro is much easier now.

              InsertMode
              ColumnModeOff
              HexOff
              UnixReOff
              Clipboard 9
              Top
              Find RegExp "%|start|^(*|^)^(*|^)end|"
              Replace All "|^2^1"
              SelectAll
              Copy
              CloseFile NoSave
              Top
              Paste
              IfColNumGt 1
              "
              #
              "
              Else
              "#
              "
              EndIf
              Find RegExp "%^(|*|*|*|*|*|*|*|*|*|*|*|*|*|^)^(*^)|"
              Replace All "^2^1^2|"

              Top
              Loop
              IfCharIs "#"
              DeleteLine
              ExitLoop
              EndIf
              Find RegExp "%|*|"
              Copy
              EndSelect
              Key HOME
              StartSelect
              Key END
              Find "^cwebsite|"
              Replace All "^s"
              DeleteLine
              EndLoop
              Find RegExp "%^(*|^)^(*|*|*|*|*|*|*|*|*|*|*|*|*|^)*|"
              Replace All "|^2^1"

              ClearClipboard
              Clipboard 0

              The same macro again as UEM file (UltraEdit macro code). There is a syntax highlighting wordfile for such files - see Macro examples and reference for beginners and experts. I have added in this version lots of comments to make it easier for you to understand the macro.

              Code: Select all

                  InsertMode
                  ColumnModeOff
                  HexOff
              /*! This macro uses the UltraEdit regular expression engine. But any other
                  engine could be used too if the regular expression search strings would
                  be translated for the Unix or Perl regular expression engine. !*/
                  UnixReOff
              //  Use user clipboard 9 for copying data and as variable storage buffer.
                  Clipboard 9
                  Top
              /*! File "base_with_www17nov.sql" must have the focus. The format
                  of the data in this file is:
                  
              |start|website|city name|end|
              
                  But needed is only the city name and the website in the reverse order.
                  So a regular expression replace all is used to transform the data into
                  the format:
              
              |city name|website|
              
                  The regular expression search string means following:
                  
                  % ......... string must be found at start of a line.
                  |start| ... just a simple string with no special meaning.
                  ^(*|^) .... following |start| 0 or more characters can occur of any
                              type except the new line characters CR and LF until the
                              character | is found. This string - the website - is
                              tagged (stored in a temporary variable 1) for the
                              usage in the replace string referenced by ^1.
                  ^(*|^) .... same as above, but now finds the city name and stores
                              this found string in a temporary variable 2 referenced
                              by ^2 in the replace string.
                  end| ...... again just a simple string with no special meaning. !*/
                  Find RegExp "%|start|^(*|^)^(*|^)end|"
                  Replace All "|^2^1"
              /*! The data in file "base_with_www17nov.sql" contains the data now
                  in the required format. Select all the content, copy it into user
                  clipboard 9 and close the file without saving the changes. That
                  changes also the focus to the file "base_without_www_17nov.sql". !*/
                  SelectAll
                  Copy
                  CloseFile NoSave
              /*! Paste the content of file "base_with_www17nov.sql" now at top of
                  file "base_without_www_17nov.sql". The cursor is now at the end
                  of the inserted data. If the cursor is not at start of the first
                  line of the real content of "base_without_www_17nov.sql", the last
                  line of file "base_with_www_17nov.sql" has had no line termination.
                  So insert a line termination + character # in an additional line.
                  Otherwise insert just a line with character #. !*/
                  Top
                  Paste
                  IfColNumGt 1
                      "
                      #
                      "
                  Else
                      "#
                      "
                  EndIf
              /*! This regex replace creates a copy of every city name at start
                  of every line in file "base_without_www_17nov.sql". See end of
                  the macro for the reason why this is done. !*/
                  Find RegExp "%^(|*|*|*|*|*|*|*|*|*|*|*|*|*|^)^(*^)|"
                  Replace All "^2^1^2|"
                  Top
              /*! Now run from top of the file in a loop the required replaces. The loop
                  is exited when the cursor reaches the special line inserted before
                  starting with character #. This line is deleted when reached. !*/
                  Loop
                      IfCharIs "#"
                          DeleteLine
                          ExitLoop
                      EndIf
              /*! Inside the loop first select the |city name| of the current line copied
                  from the first file "base_with_www17nov.sql" and copy this string into
                  user clipboard 9. !*/
                      Find RegExp "%|*|"
                      Copy
              /*! Stop selection mode, move cursor back to start of the line, start
                  selection mode and move cursor to end of the line. Now the whole
                  line   |city name|website|   is selected. !*/
                      EndSelect
                      Key HOME
                      StartSelect
                      Key END
              /*! Use a non regular expression replace to find all occurences of the
                  just copied   |city name|   with the dummy text   website|   following
                  and replace all occurences from current cursor position to end of file
                  with the current selection. So with this method all cities present in
                  file "base_with_www17nov.sql" will have then also the website string
                  in file "base_without_www_17nov.sql". !*/
                      Find "^cwebsite|"
                      Replace All "^s"
              /*! Replace all does not change the cursor position. So the cursor is still
                  at the end of the selected line from first file which is no longer
                  needed and therefore deleted. That stops also the selection mode and
                  moves the cursor to start of the next line. Of course the cursor is
                  always at top of the file after deleting a line. !*/
                      DeleteLine
                  EndLoop
              /*! This regex replace replaces all city names in the city column with the
                  original city names at start of the line to avoid changes in the city
                  names caused by different spelling of the city names in the first file. !*/
                  Find RegExp "%^(*|^)^(*|*|*|*|*|*|*|*|*|*|*|*|*|^)*|"
                  Replace All "|^2^1"
              /*! All lines from the first file are processed and the special marker
                  line is deleted too. So now only the original lines of the second
                  file are present, but with the added websites. Clear user clipboard 9
                  to free RAM and switch back to the Windows clipboard. !*/
                  ClearClipboard
                  Clipboard 0
              Best regards from an UC/UE/UES for Windows user from Austria

              9
              NewbieNewbie
              9

                Nov 17, 2008#7

                YOU ARE MAGIC :P

                This is a great help to learn

                Thank you so much

                I will work on tomorrow morning ;) and let you know after

                We don't need to you use something special in the regular expressions?

                Because the ANSI characters with hex code 0xC0 and higher could result in some problems.

                  Nov 18, 2008#8

                  Hello Mister Magic ;-)

                  I have got a strange result with this macro because :
                  Action is :
                  Delete |start| and |end| in file : base_with_www17nov.sql
                  after all informations from file : base_without_www_17nov.sql are paste in file : base_with_www17nov.sql
                  that's it not more....

                  The goal is :
                  Search in file: without_www_17nov.sql colum |cities| and find the same city in file: base_with_www17nov.sql
                  if the same citiy was found import from file: base_with_www17nov.sql colum |www| and insert into file: base_without_www_17nov.sql into colum|website|
                  else
                  if the same city was not found
                  result= pass over

                  Maybe I didn't use ultraEdit in the right way ? because this macro doesn't gave me the result that I was waiting.

                  6,675585
                  Grand MasterGrand Master
                  6,675585

                    Nov 18, 2008#9

                    Special ANSI characters are no problem for the regular expression replaces. They are not modified in any way.

                    According to your description it looks like you have forgotten to check the macro property Continue if search string not found for this macro. Do this now by opening Macro - Delete Macro/Modfiy Properties, select the macro and check the property. It is important that the macro continues even if a city specified in file base_with_www17nov.sql is not found in file base_without_www_17nov.sql.

                    I opened again the files "base_without_www_17nov.sql" and "base_with_www17nov.sql" (with the 3 lines corrected). And only these 2 files were opened in UltraEdit. "base_with_www17nov.sql" was the active one - had the focus as I run my macro. The result was file "result_17nov.sql". A file content compare with "base_without_www_17nov.sql" showed me that the websites were correct inserted - see Compare_result.png. But some city names are slightly modified because of different case of some letters.

                    So I modified my macro again - see the today inserted red lines in my last macro. I have explained already once the technique to keep the city names in file "base_without_www_17nov.sql" even if they are in file "base_with_www_17nov.sql" written slightly different.

                    I opened again the files "base_without_www_17nov.sql" and "base_with_www17nov.sql" (with the 3 lines corrected). And only these 2 files were opened in UltraEdit again. "base_with_www17nov.sql" was the active one - had the focus as I run my modified macro. The result was file "better_result_17nov.sql". A file content compare with "base_without_www_17nov.sql" showed me that the websites were correct inserted - see Compare_better_result.png. But this time no city name is modified in any way.

                    So the macro works perfect and produces what you want if I have understood you correct. The method used is not exactly the method you suggest, but my approach is easier for UltraEdit and faster.

                    The 2 result files and the 2 PNG screenshots are packed in the attached ZIP file (already deleted).
                    Best regards from an UC/UE/UES for Windows user from Austria

                    9
                    NewbieNewbie
                    9

                      Nov 18, 2008#10

                      I have updated your new macro and I've got result :D
                      (the result from you is always much better than I'm waiting, many thanks I have seen your attached files with .png file)

                      Just a small problems, it doesn't recognize these F....... French accents

                      Example:
                      from file: base_with_www17nov.sql (In this file, colum |cities| must be "slave" and not "master" in the final result)
                      row 3 : |start|http://www.cc-belley-bas-bugey.com/fr/a ... p|Ambl[b]e[/b]on|end|

                      from file: base_without_www_17nov.sql (in this file, colum |cities| should be "master" and do not change for the final result)
                      row 5 : |id|5|space1|1|space2|Ain|space2|Rhône Alpes|space3|01006|space4|01300|space1|Ambléon|website|timestamp|end|

                      The macro doesn't make changes in this case and for many row with accents.

                      The thing that I can do, is making a new file (base_with_www17nov.sql) with everything in uppercase - see attached file (already deleted).
                      pls let me know if you prefer or if it doesn't help and do not make change for the accent's problem

                      Many thanks in advance for your help

                      6,675585
                      Grand MasterGrand Master
                      6,675585

                        Nov 19, 2008#11

                        Sorry, but 'e' is a different character than 'é'. The macro ignores differences in case, so 'A' = 'a', 'B' = 'b', etc. for ASCII characters. But the ANSI character 'é' has the hexadecimal byte code 0xE9 and 'e' has the hex code 0x65. For a computer program these bytes are different and therefore the city name is not equal and so "Ambléon" is not equal "Ambleon". Language specific "similarities" are not handled by the search and replace function.

                        But I found a solution for this problem. In the updated macro hopefully all special ANSI characters in the file "base_with_www17nov.sql" are replaced by their ASCII equivalents before copying the content into file "base_without_www_17nov.sql".

                        In file "base_without_www_17nov.sql" after copying all city names to start of every line also hopefully all special ANSI characters in the city names in the real city column are replaced by their ASCII equivalents. The city names at start of every line are not modified.

                        So now the loop searching for a city name in the correct column and replace it with city name + website works with only ASCII characters and therefore it does not matter any more if a city name in the 2 files uses different spellings for characters with accents.

                        If I compare now file "better_result_17nov.sql" from yesterday with the new result file "even_better_result_17nov.sql", I can see that 35 additional cities have now a website URL in the new result file. (ZIP archive with the 2 files already deleted.)

                        Don't forget to enable macro property Continue if search string not found for the new macro.
                        Best regards from an UC/UE/UES for Windows user from Austria

                        9
                        NewbieNewbie
                        9

                          Nov 19, 2008#12

                          Hello Mister Magic :)

                          YOU ARE THE BEST, you gave me so much help.... thanks a lot.

                          Your macro is working very well.

                          the file: base_with_www17nov.sql has 156 row and your macro has past 153 rows in file: base_without_www_17nov.sql

                          Only 3 is missing and I know why.
                          Because for example
                          from the file: base_with_www17nov.sql row 1 colum |Amberieu Bugey|
                          from the file: base_without_www_17nov.sql row 3 |Ambérieu en Bugey|
                          |Amberieu Bugey| is different than |Ambérieu en Bugey|

                          Of course we cannot correct this problem with one macro :D

                          Sorry to ask you something more :oops:
                          Do you have in your mind a solution or one possibility to get as a "report" for the row doesn't pasted from the file base_with_www17nov.sql?

                          The file base_with_www17nov.sql has in reality 10118 rows and I don't have any idea how to check the rows doesn't pasted :( in the file base_with_www17nov.sql = 36490 rows

                          If I have a solution to get as a "report" I will do it by hand myself.

                          Thanks again

                          6,675585
                          Grand MasterGrand Master
                          6,675585

                            Nov 20, 2008#13

                            I have made some changes on the ANSI to ASCII replaces to include more special ANSI characters (hopefully correct because I'm not familiar with these characters) and made all replaces now case-sensitive except one, which must ingore the case of letters. Case-sensitive searches are faster.

                            I have added also a few lines to collect in clipboard 8 which cities from file "base_with_www17nov.sql" were not found in file "base_without_www_17nov.sql". After the main loop a new file is opened and these "not found strings" are pasted into this file. If all cities were found, the new file will be immediately closed without saving. So no new file with strings at end of the macro means all cities found and their websites inserted. Otherwise you can see which cities from "base_with_www17nov.sql" were not found and check these city names.
                            Best regards from an UC/UE/UES for Windows user from Austria

                            9
                            NewbieNewbie
                            9

                              Nov 22, 2008#14

                              Guten Tag Her Magic ;-)

                              The Bug report give me 734 errors :( It's not a lot with one file with 11 000 rows, but replace myself by hand 734 rows is not easy....

                              Fortunately you have found a solution to make this report.

                              É give us a problem, it doesn't want to do : Find MatchCase RegExp "[ÈÉÊË]" Replace All "E"

                              Other problems (it's my fault :oops: I forgot to tell you this)

                              Example the city |Aiglun| is 4 time in the file : base_without_www_17nov.sql and one time in the file: base_with_www17nov.sql
                              This is means |Aiglun|website| was pasted 4 time.

                              It's like this for many other cities

                              We didn't use the zipcode for have a macro more simple but unfortunately we should use it

                              Now the row from base_with_www17nov.sql will be like this:
                              |start|http://www.ville-aiglun.fr|Aiglun|06910|end|

                              And the file base_without_www_17nov.sql doesn't change and will stay the same
                              |id|1556|space1|4|space2|Alpes de Hautes Provence|space2|Provence Alpes Cote Azur|space3|04001|space4|04510|space1|Aiglun|website|timestamp|end|
                              |id|1933|space1|6|space2|Alpes Maritimes|space2|Provence Alpes Cote Azur|space3|06001|space4|06910|space1|Aiglun|website|timestamp|end|
                              (Please do not confuse the colum in this color. It's absolutly not a zip code)

                              But we shouldn't use only the zip code, because one zipcode can be the same for 3 or 5 different cities, it means the macro must search |city| must be = |zipcode| if not = do not replace.

                              I tried to change your macro myself, I have understood what you did, but these is too much parameter inside, I didn't get success :(

                              Many thanks in advance for your futur new help.

                              Have a nice day

                              6,675585
                              Grand MasterGrand Master
                              6,675585

                                Nov 22, 2008#15

                                MonWeb wrote:É give us a problem, it doesn't want to do : Find MatchCase RegExp "[ÈÉÊË]" Replace All "E"
                                I can't reproduce here any problem. For testing I have inserted 'É' in both files on a city and the macro replaced it in both files with an 'E'. You have to find out by yourself what could be the problem. Run the case-sensitive regular expression replace with the UltraEdit engine manually (Ctrl+R) to find out what the problem is.

                                And the macro is now nearly back to what I have written first. It uses again zip code + city name + website to identify the lines in file 2 which should be replaced by zip code + city name + URL of the city. Of course I could not really test it because base_with_www17nov.sql does not contain any zip code data. So I inserted always the same zip code for all cities and run the macro. The macro produced the correct result for the city with the correct zip code and all other cities were listed in the report file.
                                Website_Copy.zip (1.82 KiB)   286
                                The ZIP archive contains the updated source code and the macro which uses again the zip code.
                                Best regards from an UC/UE/UES for Windows user from Austria

                                Read more posts (1 remaining)