How to join Korean line with Latin line?

evanesce · Jun 22, 2017 · #1 · 16:36 UTC

Hi.

I am just wondering if I can do it with the replace command. I have a file which lists a lot of scientific and common names of mushrooms. The common names are written in Korean, and each one is followed by the scientific name on the next line.

For example:

Code:

#@%@#$#$   // <- Let's say this is Korean.
Lycoperdon echinatum Pers

%%$!%$%$  // <- Let's say this is Korean.
Boletus violaceofuscus Chiu

This pattern continues.

I just want to put all the scientific names right after the common names like this:

Code:

#@%@#$#$ Lycoperdon echinatum Pers
%%$!%$%$  Boletus violaceofuscus Chiu

Can I do it with the replace command?

If not, how can I do it?

Any help will be greatly appreciated.

Thank you so much.

Mofi · Jun 23, 2017 · #2 · 05:59 UTC

In other words, you do not want to move a line up or down; you want to join each odd line with the next even line.

Run the command Trim trailing spaces, then run a Perl regular expression Replace All using backreferences with the search string ^(.+)(?:\r?\n|\r)[\t ]* and the replace string "\1 ". The double quotes are not part of the replace string; they are added here only to show that it ends with a single space character. Instead of the space you can use a tab character or whatever else you want between each two joined lines.

Leading spaces/tabs on even lines are also removed automatically by this Perl regular expression replace because of [\t ]*.

(?:\r?\n|\r) is an OR expression in a non-capturing group that matches a DOS (carriage return + line-feed), UNIX (just line-feed) or MAC (just carriage return) line termination. Unfortunately, you have not posted which line termination type your file has, although it is displayed for the active file in the status bar at the bottom of the UltraEdit main window.
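
As a rough illustration outside UltraEdit, the same alternation can be used in Python to check which line termination a text contains. This is only a sketch: it uses Python's re module rather than UltraEdit's Perl regular expression engine, and the names LINE_END, NAMES and count_line_endings are made up for this example. A file would have to be opened with newline="" so that Python does not translate the line endings while reading.

```python
import re
from collections import Counter

# Same alternation as in the search string: \r?\n matches CRLF or LF,
# the second alternative matches a lone CR.
LINE_END = re.compile(r"\r?\n|\r")
NAMES = {"\r\n": "DOS", "\n": "UNIX", "\r": "MAC"}


def count_line_endings(text: str) -> Counter:
    """Count how often each line termination type occurs in text."""
    return Counter(NAMES[m.group()] for m in LINE_END.finditer(text))


if __name__ == "__main__":
    sample = "first\r\nsecond\r\nthird\n"
    print(dict(count_line_endings(sample)))
```

A file with only one non-zero count has consistent line terminations; more than one indicates mixed line endings.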

Of course, this regular expression works only if the file contains Korean text on each odd line and Latin text on each even line.
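
For anyone who wants to try the odd/even join outside UltraEdit, here is a minimal Python sketch of the two steps (trim trailing blanks, then join each odd line with the next even line). It is an adaptation, not the search string above: Python's re.sub scans the original string in a single pass, so this sketch matches both lines of a pair at once to keep the next match from starting on the even line. The function name join_line_pairs and the separator parameter are hypothetical, and the sketch assumes LF or CR+LF line terminations only.

```python
import re

# Step 1: spaces/tabs directly before a line termination or the end of the text.
TRAILING_BLANKS = re.compile(r"[ \t]+(?=\r?\n|\Z)")

# Step 2: an odd line, its line termination, optional leading blanks of the
# even line, and the even line itself (assumption: LF or CR+LF only).
LINE_PAIR = re.compile(r"^([^\r\n]+)\r?\n[\t ]*([^\r\n]*)", re.MULTILINE)


def join_line_pairs(text: str, separator: str = " ") -> str:
    """Join every odd line with the following even line."""
    text = TRAILING_BLANKS.sub("", text)
    # A function is used as replacement so that the separator is taken literally.
    return LINE_PAIR.sub(lambda m: m.group(1) + separator + m.group(2), text)


if __name__ == "__main__":
    sample = "가시말불버섯  \nLycoperdon echinatum Pers\n가지색그물버섯\n\tBoletus violaceofuscus Chiu\n"
    print(join_line_pairs(sample), end="")
```

Like the replace described above, this only gives a sensible result when common and scientific names strictly alternate line by line.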

fleggy · Jun 23, 2017 · #3 · 15:20 UTC

I would like to see a real sample because the Find regex should be able to distinguish between Korean and Latin text to join only the desired lines.

evanesce · Jun 23, 2017 · #4 · 15:43 UTC

Fleggy, will this be enough?

Code:

가시말불버섯
Lycoperdon echinatum Pers
가지색그물버섯
Boletus violaceofuscus Chiu
갓버섯
큰갓버섯 Parasol mushroom
Macrolepiota procera 
개덕다리겨울우산버섯
Polyporellus squamosus (Huds.) P. Karst

Thank you very much.

fleggy · Jun 23, 2017 · #5 · 16:11 UTC

Great

So would it be enough to simply join a Korean line with the following Latin line?
Your sample after transformation:

Code:

가시말불버섯 Lycoperdon echinatum Pers
가지색그물버섯 Boletus violaceofuscus Chiu
갓버섯
큰갓버섯 Parasol mushroom
Macrolepiota procera
개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst

Does it look good to you?

EDIT: What encoding do you prefer? UTF-8 or UTF-16?

fleggy · Jun 23, 2017 · #6 · 16:39 UTC

Hi Evanesce,

please, try the following Perl regex:

The whole first line must be Korean, the following line must be Latin:

Find what: ^([\x{0100}-\x{FFFF}]++)$\r\n([\t\x{0020}-\x{00FF}]++)$
Replace with: \1 \2

Or if you want to join any line starting with Korean text with the following Latin line:

Find what: ^([\x{0100}-\x{FFFF}]++[\t\x{0020}-\x{00FF}]*+)$\r\n([\t\x{0020}-\x{00FF}]++)$
Replace with: \1 \2
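
The two variants above can be tried outside UltraEdit with a small Python sketch. Several things here are assumptions and not part of the original expressions: Python's re module does not accept \x{0100}, so \u0100 is used instead; the possessive ++ and *+ are swapped for plain + and *, which should give the same matches here because the character classes do not overlap; and the text is assumed to use "\n" line terminations, as it does after reading a file in Python's default text mode. The names STRICT, LOOSE and join_korean_latin are made up for this example.

```python
import re

KOREAN = r"[\u0100-\uFFFF]"    # any character above the Latin-1 range
LATIN = r"[\t\u0020-\u00FF]"   # tab plus the characters from space to U+00FF

# Variant 1: the whole first line is Korean, the following line is Latin.
STRICT = re.compile(rf"^({KOREAN}+)\n({LATIN}+)$", re.MULTILINE)

# Variant 2: the first line only has to start with Korean text.
LOOSE = re.compile(rf"^({KOREAN}+{LATIN}*)\n({LATIN}+)$", re.MULTILINE)


def join_korean_latin(text: str, loose: bool = False) -> str:
    """Join a Korean line with the following Latin line using a single space."""
    pattern = LOOSE if loose else STRICT
    return pattern.sub(r"\1 \2", text)


if __name__ == "__main__":
    sample = "갓버섯\n큰갓버섯 Parasol mushroom\nMacrolepiota procera\n"
    print(join_korean_latin(sample), end="")              # nothing is joined
    print(join_korean_latin(sample, loose=True), end="")  # lines 2 and 3 are joined
```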

Hope that will help you.

BR, Fleggy

evanesce · Jun 23, 2017 · #7 · 16:50 UTC

Mofi, I am just wondering if there is any Regex prob you can't help others with.

Well, I have never studied Regex thoroughly before.

I wanted to study it but one day I asked myself like "Do I need to study it over? No. It's hard and I don't even have the time to study the hard stuff. I am not a computer programmer! Finding the code I need online will be enough."
That's all I have done so far.

But somehow I always wanted to study regex b/c I knew it is REAL MAGIC!!!
...
Just borrowed 2 books from a public library this evening.

Hope to learn a lot from them. Looks like they are all translated books.

You know what?
Sometimes I want to spend a few hours helping others like you.
That would be so great and fun!

Anyway, thanks a million, Mofi!

evanesce · Jun 23, 2017 · #8 · 17:32 UTC

The file looks like this:
Every 2nd (or 3rd or 4th) line has scientific name(s).
That's because some mushrooms have more than 2 common and scientific names.

Code:

가시말불버섯
Lycoperdon echinatum Pers
가지색그물버섯
Boletus violaceofuscus Chiu
갓버섯
큰갓버섯 Parasol mushroom
Macrolepiota procera
개덕다리겨울우산버섯
Polyporellus squamosus (Huds.) P. Karst

The 3 lines 5 to 7 are one set. 2 common names and 2 scientific names.

When I followed Mofi's instructions and ran the replace multiple times, everything ended up on one long line like this:

Code:

가시말불버섯Lycoperdon echinatum Pers가지색그물버섯Boletus violaceofuscus Chiu갓버섯큰갓버섯 Parasol mushroomMacrolepiota procera개덕다리겨울우산버섯Polyporellus squamosus (Huds.) P. Karst

I want it to look like this:

Code:

가시말불버섯 Lycoperdon echinatum Pers
가지색그물버섯 Boletus violaceofuscus Chiu
갓버섯 큰갓버섯 Parasol mushroomMacrolepiota procera 갓버섯, 큰갓버섯
개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst

On line 3 I can put a comma between the two later using one more regular expression replace.

How can I do that?

The common name(s) and scientific name(s) and then, the same pattern in the next line.
There can be more than two common or scientific names.

I think the best way to handle this is to put the Unicode and ASCII code together.

How can I do that?

fleggy · Jun 23, 2017 · #9 · 18:01 UTC

Hi Evanesce,

Join all adjacent ASCII lines (replace CR/LF with <comma><space> if previous and following letters are ASCII):

Find what: "(?<=[\x{0020}-\x{00FF}])\r\n(?=[\x{0020}-\x{00FF}])"
Replace with: ", "

Join all adjacent Korean lines (replace CR/LF with <space> if previous and following symbols are UNICODE):

Find what: "(?<=[\x{0100}-\x{FFFF}])\r\n(?=[\x{0100}-\x{FFFF}])"
Replace with: " "

Join remaining Korean and ASCII lines:

Find what: "^([\x{0100}-\x{FFFF}]++[\t\x{0020}-\x{00FF}]*+)$\r\n([\t\x{0020}-\x{00FF}]++)$"
Replace with: "\1 \2"

I delimited all regular expressions with double quotes to make all whitespace visible. Do not enter these double quotes in the Find what and Replace with fields.

Here is my test after the three replaces above.

Input:

Code:

가시말불버섯
Lycoperdon echinatum Pers
가지색그물버섯
Boletus violaceofuscus Chiu
갓버섯
큰갓버섯 Parasol mushroom
Macrolepiota procera
Another name just for testing
개덕다리겨울우산버섯
Polyporellus squamosus (Huds.) P. Karst

Output:

Code:

가시말불버섯 Lycoperdon echinatum Pers
가지색그물버섯 Boletus violaceofuscus Chiu
갓버섯 큰갓버섯 Parasol mushroom, Macrolepiota procera, Another name just for testing
개덕다리겨울우산버섯 Polyporellus squamosus (Huds.) P. Karst
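
The three replaces can also be chained in a short Python sketch that reproduces the test above. It is an approximation under stated assumptions, not the UltraEdit expressions themselves: \u0100 stands in for \x{0100} because Python's re module does not accept the braces form, the possessive quantifiers are replaced by plain greedy ones, and the line terminations are assumed to be "\n" instead of CR/LF. The names ASCII_JOIN, KOREAN_JOIN, MIXED_JOIN and join_names are made up for this example.

```python
import re

# 1) Line break between two characters of the range U+0020 to U+00FF.
ASCII_JOIN = re.compile(r"(?<=[\u0020-\u00FF])\n(?=[\u0020-\u00FF])")

# 2) Line break between two characters of the range U+0100 to U+FFFF.
KOREAN_JOIN = re.compile(r"(?<=[\u0100-\uFFFF])\n(?=[\u0100-\uFFFF])")

# 3) A line starting with Korean text followed by a purely Latin line.
MIXED_JOIN = re.compile(
    r"^([\u0100-\uFFFF]+[\t\u0020-\u00FF]*)\n([\t\u0020-\u00FF]+)$",
    re.MULTILINE,
)


def join_names(text: str) -> str:
    """Apply the three replaces in the same order as described above."""
    text = ASCII_JOIN.sub(", ", text)    # adjacent Latin lines -> ", "
    text = KOREAN_JOIN.sub(" ", text)    # adjacent Korean lines -> " "
    return MIXED_JOIN.sub(r"\1 \2", text)  # remaining Korean + Latin pairs


if __name__ == "__main__":
    sample = "갓버섯\n큰갓버섯 Parasol mushroom\nMacrolepiota procera\nAnother name just for testing\n"
    print(join_names(sample), end="")
```

The order matters: the first two replaces collapse runs of lines written in the same script, so the third one only has to deal with simple Korean/Latin pairs.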

BR, Fleggy

evanesce · Jun 23, 2017 · #10 · 21:36 UTC

Wow, this is what I want! Unbelievably amazing! The RegExp is Art! Wow,.. wOw... woW...

I have never seen things like \x{4-digit hexadecimal number} before. \x{0020} means space.
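
The number inside \x{...} is the hexadecimal code point of a character, which is easy to check in Python. The small sketch below is only an illustration; the function name char_class and its three labels are made up here and simply mirror the two ranges used in the expressions above.

```python
def char_class(ch: str) -> str:
    """Tell which of the two character ranges from the regexes a character falls in."""
    code = ord(ch)
    if ch == "\t" or 0x0020 <= code <= 0x00FF:
        return "latin"   # [\t\x{0020}-\x{00FF}]
    if 0x0100 <= code <= 0xFFFF:
        return "wide"    # [\x{0100}-\x{FFFF}], which includes the Korean syllables
    return "other"       # e.g. carriage return and line-feed


if __name__ == "__main__":
    print(hex(ord(" ")))    # 0x20
    print(hex(ord("가")))   # 0xac00
    print(char_class("가"), char_class("P"), char_class("\n"))
```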

I can't thank you enough.

BTW, have you read this book?

I mean I am just wondering how you learned the RegExp.

Well, when I see a person like you, I feel so humble.
And I mutter to myself. "What kind of a person is he? He is on a totally different level." lol
Never mind!

I want to say thank you to everyone who gave me help. I am going to buy UE Suite or UEStudio unlimited upgrades in August.

If no one helped novices like me, there would be no reason to buy either.

The UE forum is really helpful. 5/5!!!

Mofi · Jun 24, 2017 · #11 · 09:32 UTC

Learning regular expressions in Perl syntax is like learning anything else:

Reading - no - really studying books like Mastering Regular Expressions from O'Reilly, which is very good, or the information on websites like those referenced in the Find/Replace/Regular Expressions forum announcement topic Readme for the Find/Replace/Regular Expressions forum.
Applying what has been read, without giving up on mistakes, errors or wrong results until they are solved.
Studying more and using more over months to become better and better.
Using tools that make the work easier, if available at all. For example, UltraEdit has a very simple regular expression builder inside; click on the appropriate button(s) in the Find/Replace window. For regular expressions in Perl syntax there are even more powerful tools that help users create complex regular expressions for complex tasks, and they can be integrated in UE/UES via a user tool configuration; see also Readme for the Find/Replace/Regular Expressions forum.
After several months of nearly daily use of what has been read, a person usually has advanced knowledge.
After several years of nearly daily advanced use, without stopping reading, learning and applying, a person becomes an expert in whatever that person is doing.

My UltraEdit/Unix regular expression skills are at expert level. I know everything about those legacy regular expression engines in UltraEdit. But my Perl regular expression skills are just at advanced level. I do not often need to solve really complex tasks using a Perl regular expression find/replace, which would be necessary to ever become a real expert on Perl regular expressions. But at least once per month I learn something new about Perl regular expressions. Fleggy is one reason my knowledge about Perl regular expressions keeps increasing, as nothing is better than seeing how a complex regular expression find/replace works on a practical example. Thanks Fleggy.

fleggy · Jun 24, 2017 · #12 · 10:33 UTC

You're welcome, Mofi