How to join Korean line with Latin line?

Newbie

    Jun 22, 2017 #1

    Hi.

    I am just wondering if I can do it with the replace command. I have a file which lists a lot of scientific and common names of mushrooms. Each common name is written in Korean and is followed on the next line by the scientific name.

    For example:

    Code:

    #@%@#$#$   // <- Let's say this is Korean.
    Lycoperdon echinatum Pers
    
    %%$!%$%$  // <- Let's say this is Korean.
    Boletus violaceofuscus Chiu
    This pattern continues.

    I just want to put all the scientific names right after the common names like this:

    Code:

    #@%@#$#$ Lycoperdon echinatum Pers
    %%$!%$%$  Boletus violaceofuscus Chiu
    Can I do it with the replace command?

    If not, how can I do it?

    Any help will be greatly appreciated.

    Thank you so much.

    Grand Master

      Jun 23, 2017 #2

      In other words, you do not want to move a line up or down; you want to join each odd line with the next even line.

      Run the command Trim trailing spaces, then run a Perl regular expression Replace All using backreferences with the search string ^(.+)(?:\r?\n|\r)[\t ]* and the replace string "\1 ". The double quotes are not part of the replace string; they are added here only to show that it ends with a single space character. You can use a tab character or whatever else you want between each two joined lines.

      Leading spaces and tabs on even lines are also removed automatically by this Perl regular expression replace because of [\t ]*.

      (?:\r?\n|\r) is an OR expression in a non-capturing group that matches a DOS (carriage return + line-feed), UNIX (just line-feed) or MAC (just carriage return) line termination. Unfortunately, you have not posted which line termination type your file has, although it is displayed for the active file in the status bar at the bottom of the UltraEdit main window.
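As a quick check of how that alternation behaves, here is a minimal sketch in Python. It uses Python's `re` module rather than UltraEdit's Perl engine, and the helper name `line_ending_kinds` is made up for illustration. It reports which of the three line termination types occur in a string.

```python
import re

# Same alternation as in the search string: CR+LF or LF first, then a lone CR.
EOL = re.compile(r"\r?\n|\r")

# Labels follow the naming used in the post above.
NAMES = {"\r\n": "DOS", "\n": "UNIX", "\r": "MAC"}


def line_ending_kinds(text: str) -> set[str]:
    """Return the set of line termination types found in text."""
    return {NAMES[m.group()] for m in EOL.finditer(text)}


if __name__ == "__main__":
    print(line_ending_kinds("one\r\ntwo\r\n"))
```

Because `\r?\n` is tried before the lone `\r`, a CR+LF pair is consumed as one terminator and is not counted as a MAC ending followed by a UNIX ending.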

      Of course, this regular expression works only if every odd line of the file contains Korean text and every even line contains Latin text.
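The same idea can be tried outside UltraEdit. The following is only a sketch in Python under the assumption stated above (strict alternation of Korean and Latin lines); the function name `join_line_pairs` is hypothetical. Unlike the search string above, this pattern also captures the second line, so that each match consumes one complete pair and the next match starts at the following pair.

```python
import re


def join_line_pairs(text: str, sep: str = " ") -> str:
    """Join each odd line with the following even line."""
    # Normalize DOS and MAC line terminations to a plain line-feed.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Equivalent of "Trim trailing spaces".
    text = re.sub(r"[ \t]+$", "", text, flags=re.M)
    # Group 1 is the odd line, group 2 the even line without its
    # leading spaces/tabs, which are matched by [\t ]* and dropped.
    return re.sub(
        r"^(.+)\n[\t ]*(.+)$",
        lambda m: m.group(1) + sep + m.group(2),
        text,
        flags=re.M,
    )


if __name__ == "__main__":
    sample = "가시말불버섯\nLycoperdon echinatum Pers\n"
    print(join_line_pairs(sample), end="")
```

Passing `sep="\t"` would put a tab between the two joined lines instead of a space.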
      Best regards from an UC/UE/UES for Windows user from Austria

      Master

        Jun 23, 2017 #3

        I would like to see a real sample because the Find regex should be able to distinguish between Korean and Latin text to join only the desired lines.

        Newbie

          Jun 23, 2017 #4

          Fleggy, will this be enough?

          Code:

          가시말불버섯
          Lycoperdon echinatum Pers
          가지색그물버섯
          Boletus violaceofuscus Chiu
          갓버섯
          큰갓버섯 Parasol mushroom
          Macrolepiota procera 
          개덕다리겨울우산버섯
          Polyporellus squamosus (Huds.) P. Karst
          Thank you very much.

          Master

            Jun 23, 2017 #5

            Great :)
            So would it be enough to simply join a Korean line with the following Latin line?
            Your sample after transformation:

            Code:

            가시말불버섯 Lycoperdon echinatum Pers
            가지색그물버섯 Boletus violaceofuscus Chiu
            갓버섯
            큰갓버섯 Parasol mushroom
            Macrolepiota procera
            개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst
            Does it look good to you?

            EDIT: What encoding do you prefer? UTF-8 or UTF-16?

              Jun 23, 2017 #6

              Hi Evanesce,

              Please try the following Perl regex:

              The whole first line must be Korean, the following line must be Latin:

              Find what: ^([\x{0100}-\x{FFFF}]++)$\r\n([\t\x{0020}-\x{00FF}]++)$
              Replace with: \1 \2

              Or, if you want to join any line starting with Korean text with the following Latin line:

              Find what: ^([\x{0100}-\x{FFFF}]++[\t\x{0020}-\x{00FF}]*+)$\r\n([\t\x{0020}-\x{00FF}]++)$
              Replace with: \1 \2
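For anyone who wants to try both variants outside UltraEdit, here is a rough Python sketch. It is a translation under a few assumptions, not the UltraEdit engine: `\x{0100}` is written as `\u0100`, the possessive `++` and `*+` are replaced by plain greedy quantifiers (the character classes do not overlap, so the matches are the same here), and line terminations are normalized to line-feed first. The name `join_korean_latin` is made up.

```python
import re

KOREAN = r"[\u0100-\uFFFF]"    # stands in for [\x{0100}-\x{FFFF}]
LATIN = r"[\t\u0020-\u00FF]"   # stands in for [\t\x{0020}-\x{00FF}]

# Whole first line Korean, following line Latin.
STRICT = re.compile(rf"^({KOREAN}+)\n({LATIN}+)$", re.M)
# First line only has to start with Korean, following line Latin.
LOOSE = re.compile(rf"^({KOREAN}+{LATIN}*)\n({LATIN}+)$", re.M)


def join_korean_latin(text: str, loose: bool = False) -> str:
    """Join a Korean line with the following Latin line."""
    text = text.replace("\r\n", "\n")
    pattern = LOOSE if loose else STRICT
    return pattern.sub(r"\1 \2", text)


if __name__ == "__main__":
    sample = "갓버섯\n큰갓버섯 Parasol mushroom\nMacrolepiota procera"
    print(join_korean_latin(sample))
    print(join_korean_latin(sample, loose=True))
```

With the strict pattern the three sample lines stay untouched; with the loose pattern the second and third lines are joined.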

              Hope that will help you.

              BR, Fleggy

              Newbie

                Jun 23, 2017 #7

                Mofi, I am just wondering if there is any regex problem you can't help others with.

                Well, I have never studied Regex thoroughly before.

                I wanted to study it, but one day I asked myself, "Do I need to study it? No. It's hard and I don't even have the time to study the hard stuff. I am not a computer programmer! Finding the code I need online will be enough."
                That's all I have done so far.

                But somehow I always wanted to study regex because I knew that it is REAL MAGIC!!!
                ...
                Just borrowed 2 books from a public library this evening.


                Hope to learn a lot from them. Looks like they are all translated books.

                You know what?
                Sometimes I want to spend a few hours helping others like you.
                That would be so great and fun!

                Anyway, thanks a million, Mofi!

                  Jun 23, 2017 #8

                  The file looks like this:
                  Every 2nd (or 3rd or 4th) line has scientific name(s).
                  That's because some mushrooms have more than 2 common and scientific names.

                  Code:

                  가시말불버섯
                  Lycoperdon echinatum Pers
                  가지색그물버섯
                  Boletus violaceofuscus Chiu
                  갓버섯
                  큰갓버섯 Parasol mushroom
                  Macrolepiota procera
                  개덕다리겨울우산버섯
                  Polyporellus squamosus (Huds.) P. Karst
                  Lines 5 to 7 form one set: 2 common names and 2 scientific names.

                  When I followed Mofi's instructions and ran the replace multiple times, the result looked like this, with everything on one long line:

                  Code:

                  가시말불버섯Lycoperdon echinatum Pers가지색그물버섯Boletus violaceofuscus Chiu갓버섯큰갓버섯 Parasol mushroomMacrolepiota procera개덕다리겨울우산버섯Polyporellus squamosus (Huds.) P. Karst
                  I want it to look like this:

                  Code:

                  가시말불버섯 Lycoperdon echinatum Pers
                  가지색그물버섯 Boletus violaceofuscus Chiu
                  갓버섯 큰갓버섯 Parasol mushroomMacrolepiota procera 갓버섯, 큰갓버섯
                  개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst
                  On line 3 I can put a comma between the two later using one more regular expression replace.

                  How can I do that?

                  Each line should hold the common name(s) followed by the scientific name(s), and the same pattern repeats on the next line.
                  There can be more than two common or scientific names.

                  I think the best way to handle this is to put the Unicode and ASCII code together.

                  How can I do that?

                  Master

                    Jun 23, 2017 #9

                    Hi Evanesce,
                    1. Join all adjacent ASCII lines (replace CR/LF with <comma><space> if the preceding and following characters are ASCII):

                      Find what: "(?<=[\x{0020}-\x{00FF}])\r\n(?=[\x{0020}-\x{00FF}])"
                      Replace with: ", "
                    2. Join all adjacent Korean lines (replace CR/LF with <space> if the preceding and following characters are non-ASCII Unicode characters):

                      Find what: "(?<=[\x{0100}-\x{FFFF}])\r\n(?=[\x{0100}-\x{FFFF}])"
                      Replace with: " "
                    3. Join remaining Korean and ASCII lines:

                      Find what: "^([\x{0100}-\x{FFFF}]++[\t\x{0020}-\x{00FF}]*+)$\r\n([\t\x{0020}-\x{00FF}]++)$"
                      Replace with: "\1 \2"
                    I delimited all regular expressions with double quotes to make all whitespace visible. Do not use these double quotes in the actual replace.

                    Here is my test after the three replaces above.

                    Input:

                    Code:

                    가시말불버섯
                    Lycoperdon echinatum Pers
                    가지색그물버섯
                    Boletus violaceofuscus Chiu
                    갓버섯
                    큰갓버섯 Parasol mushroom
                    Macrolepiota procera
                    Another name just for testing
                    개덕다리겨울우산버섯
                    Polyporellus squamosus (Huds.) P. Karst
                    Output:

                    Code:

                    가시말불버섯 Lycoperdon echinatum Pers
                    가지색그물버섯 Boletus violaceofuscus Chiu
                    갓버섯 큰갓버섯 Parasol mushroom, Macrolepiota procera, Another name just for testing
                    개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst
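The three replaces can also be chained in a small script. The following is a sketch in Python, not what UltraEdit runs internally: `\x{....}` is written as `\u....`, the possessive quantifiers are replaced by greedy ones, CR/LF is normalized to line-feed first, and the function name `three_step_join` is made up.

```python
import re

ASCII_RANGE = r"[\u0020-\u00FF]"    # stands in for [\x{0020}-\x{00FF}]
KOREAN_RANGE = r"[\u0100-\uFFFF]"   # stands in for [\x{0100}-\x{FFFF}]
LATIN_LINE = r"[\t\u0020-\u00FF]"   # stands in for [\t\x{0020}-\x{00FF}]


def three_step_join(text: str) -> str:
    """Apply the three replaces above in order."""
    text = text.replace("\r\n", "\n")
    # 1. Line break between two ASCII characters -> comma + space.
    text = re.sub(rf"(?<={ASCII_RANGE})\n(?={ASCII_RANGE})", ", ", text)
    # 2. Line break between two Korean characters -> space.
    text = re.sub(rf"(?<={KOREAN_RANGE})\n(?={KOREAN_RANGE})", " ", text)
    # 3. Remaining Korean line + following ASCII line -> one line.
    text = re.sub(
        rf"^({KOREAN_RANGE}+{LATIN_LINE}*)\n({LATIN_LINE}+)$",
        r"\1 \2",
        text,
        flags=re.M,
    )
    return text


if __name__ == "__main__":
    sample = "갓버섯\n큰갓버섯 Parasol mushroom\nMacrolepiota procera"
    print(three_step_join(sample))
```

The order matters: steps 1 and 2 first merge the runs of same-script lines, so step 3 only has to deal with the simple Korean line plus ASCII line pairs that are left.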
                    BR, Fleggy

                    Newbie

                      Jun 23, 2017 #10

                      Wow, this is what I want! Unbelievably amazing! The RegExp is Art! Wow,.. wOw... woW...

                      I had never seen things like \x{4-digit number} before. \x{0020} means space.
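To see which code point a character has and which of the two ranges from the regular expressions it falls into, a tiny Python sketch like the following can help. The function name `describe` and the range labels are made up for illustration.

```python
def describe(ch: str) -> str:
    """Return the code point of ch and the regex range it belongs to."""
    cp = ord(ch)
    if ch == "\t" or 0x0020 <= cp <= 0x00FF:
        kind = "Latin range"          # [\t\x{0020}-\x{00FF}]
    elif 0x0100 <= cp <= 0xFFFF:
        kind = "Korean/other range"   # [\x{0100}-\x{FFFF}]
    else:
        kind = "outside both ranges"
    return f"U+{cp:04X} {kind}"


if __name__ == "__main__":
    for ch in (" ", "P", "가"):
        print(repr(ch), describe(ch))
```

The upper range is not specific to Korean. It covers every character from U+0100 to U+FFFF, which is enough here because the file contains only Korean and Latin text.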

                      I can't thank you enough.

                      BTW, have you read this book?



                      I mean I am just wondering how you learned the RegExp.

                      Well, when I see a person like you, I feel so humble.
                      And I mutter to myself. "What kind of a person is he? He is on a totally different level." lol
                      Never mind!

                      I want to say thank you to everyone who helped me. I am going to buy UE Suite or UEStudio unlimited upgrades in August.

                      If no one helped novices like me, there would be no reason to buy any.

                      The UE forum is really helpful. 5/5!!!

                      Grand Master

                        Jun 24, 2017 #11

                        Learning regular expressions in Perl syntax is like learning anything else:
                        1. Reading - no - really studying books like Mastering Regular Expressions published by O'Reilly, which is very good, or information on websites like those referenced in the Find/Replace/Regular Expressions forum announcement topic Readme for the Find/Replace/Regular Expressions forum.
                        2. Applying what was read, and not giving up on mistakes, errors or wrong results until they are solved.
                        3. Studying more and using more over months to become better and better.
                        4. Using tools which make the work easier, if available at all. For example, UltraEdit has a very simple regular expression builder inside; click on the appropriate button(s) in the Find/Replace window. But for regular expressions with Perl syntax there are even more powerful tools which help users to create complex regular expressions for complex tasks and which can be integrated in UE/UES via a user tool configuration, see also Readme for the Find/Replace/Regular Expressions forum.
                        5. After several months of nearly daily use of what was read, a person usually has advanced knowledge.
                        6. After several years of nearly daily advanced usage, without stopping reading, learning and applying, a person becomes an expert in whatever the person is doing.
                        My UltraEdit/Unix regular expression skills are at expert level. I know everything about those legacy regular expression engines in UltraEdit. But my Perl regular expression skills are just at advanced level. I do not often need to solve really complex tasks with a Perl regular expression find/replace, so I may never become a real expert on Perl regular expressions. But I learn something new about Perl regular expressions at least once per month. Fleggy is one reason for my increasing knowledge about Perl regular expressions, as nothing is better than seeing how a complex regular expression find/replace works on a practical example. Thanks Fleggy.
                        Best regards from an UC/UE/UES for Windows user from Austria

                        Master

                          Jun 24, 2017 #12

                          You're welcome, Mofi :)