Different matches between UltraEdit and regex101

Gabarito · Post #1 · 2024-09-12T00:59+00:00

I ran into a very strange situation.

After copying and pasting text from a web page, I realized that I would need to do some regular expression tweaking to fix the problem.

The page is this:

https://www.letras.mus.br/michel-legran ... ducao.html

When I copy the lyrics of the song and its translation and paste them into UltraEdit, the original lyrics are attached to the last letter of the translation, all on the same line.

The original should appear on the line below the translation.

Like this:

Como uma pedra que se atira
Comme une pierre que l'on jette

Na corrente de um riacho
Dans l'eau vive d'un ruisseau

E que deixa atrás dela
Et qui laisse derrière elle

But it's coming like this:

Como uma pedra que se atiraComme une pierre que l'on jette
Na corrente de um riachoDans l'eau vive d'un ruisseau
E que deixa atrás delaEt qui laisse derrière elle

I'm struggling to find a regular expression that fixes this.
I came up with this one:

^([A-Z])([a-z|\s]*)([A-Z])

Which captures the first and second group up to the first capital letter, to keep that part on the first line.
Then it captures the third and fourth group up to the end of the line to throw them on the line below.
I would write the rest of the expression after I had the groups defined correctly.

If I test it on regex101.com, it appears to be fine: https://regex101.com/r/OlzOly/
But UltraEdit finds a different match.

This expression of mine is capturing everything up to the apostrophe (').

Como uma pedra que se atiraComme une pierre que l

It ignores the second capital letter.
What's wrong?

And I realized that I have to foresee another situation, when there is an apostrophe.

I would also have another question:
Why do certain web pages not respect line breaks?

Mofi · Post #2 · 2024-09-12T05:16+00:00

You have most likely not checked the option Match case, which is very important here: otherwise [A-Z] and [a-z] match the same set of characters, namely all ASCII letters regardless of case.
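The effect of the Match case option can be reproduced outside UltraEdit. The following is a minimal sketch in Python, assuming that `re.IGNORECASE` behaves like an unchecked Match case option for this pattern; it is an illustration, not UltraEdit's own engine.

```python
import re

# The expression from the first post, unchanged.
pattern = r"^([A-Z])([a-z|\s]*)([A-Z])"
line = "Como uma pedra que se atiraComme une pierre que l'on jette"

# Case-sensitive search: comparable to Match case being checked.
case_sensitive = re.search(pattern, line)

# Case-insensitive search: comparable to Match case being unchecked.
# [A-Z] and [a-z] now both match any ASCII letter, so the greedy second
# group runs on until the apostrophe stops it.
case_insensitive = re.search(pattern, line, flags=re.IGNORECASE)

print(case_sensitive.group(0))    # Como uma pedra que se atiraC
print(case_insensitive.group(0))  # Como uma pedra que se atiraComme une pierre que l
```

The second result is the same text that UltraEdit selected in the first post.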

A better Perl regular search expression would be ^[A-Z][^\r\nA-Z]+?\K(?=[A-Z]) and \r\n as replace expression. The replace option Match case must be checked for this Perl regular expression replace.

\K Resets the start location of $0 to the current text position: in other words everything to the left of \K is "kept back" and does not form part of the regular expression match.

(?=[A-Z]) is a positive look-ahead to check without matching (selecting) if the next character is an upper case ASCII letter and not a carriage return or line feed.
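For readers who want to try the idea outside UltraEdit, here is a rough Python sketch. Python's `re` module has no `\K`, so this version captures the part that `\K` would keep back and writes it out again in the replacement. It also assumes `\n` line endings instead of `\r\n`; both choices are assumptions of this sketch, not part of the expression above.

```python
import re

text = (
    "Como uma pedra que se atiraComme une pierre que l'on jette\n"
    "Na corrente de um riachoDans l'eau vive d'un ruisseau\n"
    "E que deixa atrás delaEt qui laisse derrière elle"
)

# Group 1 stands in for everything to the left of \K.
# (?=[A-Z]) checks for the next upper case letter without consuming it.
fixed = re.sub(
    r"^([A-Z][^\r\nA-Z]+?)(?=[A-Z])",
    r"\1\n",
    text,
    flags=re.MULTILINE,  # let ^ match at the start of every line
)

print(fixed)
```

Each pasted line is split in front of its second upper case ASCII letter.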

Gabarito wrote: Why do certain web pages not respect line breaks?

The reason is that the authors of those web pages do not know the HTML specification. The whitespace characters carriage return and line feed are interpreted like a normal space when an HTML file is parsed, and a sequence of multiple normal spaces, horizontal tabs, carriage returns and line feeds is reduced to a single normal space. The exception is text with multiple normal spaces and newline characters inside preformatted text, i.e. text within <pre> and </pre>. Text with line breaks inside a normal paragraph within <p> and </p> and other elements must be formatted with the tag <br> (HTML) or <br /> (XHTML).
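The whitespace rule described above can be approximated in a few lines of Python. This is only a simplified sketch of the collapsing step: the function name `collapse_html_whitespace` is made up for illustration, and real browsers apply further rules (for example for <pre> and for CSS) that are not modelled here.

```python
import re

def collapse_html_whitespace(text: str) -> str:
    """Reduce each run of spaces, tabs, CRs and LFs to one normal space."""
    return re.sub(r"[ \t\r\n]+", " ", text)

# Hypothetical source text as it might be typed into an HTML file.
source = "Como uma pedra\r\n   que se atira\n\tComme une pierre"

print(collapse_html_whitespace(source))
# Como uma pedra que se atira Comme une pierre
```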

Hint: When you view a web page with obviously wrongly formatted text and want to copy the text with the right formatting, it often helps to press Ctrl+U in the web browser to open a window with the source code of the HTML file, search for the first line of the text in the source code window, select the text there, and copy it to the clipboard. The source code window most often contains the text as it was pasted into the HTML file, including the newline characters that are interpreted according to the HTML specification as normal whitespace and therefore removed when the text is displayed, as described above.

The referenced source page is very badly formatted HTML with lots of syntax mistakes. That can be seen by saving the HTML file as is, opening the saved file in UltraEdit or UEStudio, and running HTML Tidy. Browsers must automatically correct 124 mistakes (the number of warnings output by HTML Tidy) when parsing this HTML file. The page also ignores standard rules for document writing. Using heading level 3 for text that is definitely not a heading, just to get it displayed larger in the browser window, is awful. A paragraph with the appropriate CSS property font-size should be used instead of a heading level 3.

Gabarito · Post #3 · 2024-09-12T09:11+00:00

Everything was very well explained.
Yes, indeed, I had not checked the Match case checkbox.
And I was not even aware of the need to use the "kept-back" or "look-ahead" features.

Regarding the HTML page code, the enigma has finally been clarified.
I have come across this type of problem before and could never understand how the page displayed the line break, but the text copied and pasted into an editor came without it.

Thread SOLVED.

Thank you very much, Mofi, for the detailed explanations.

fleggy · Post #4 · 2024-09-12T11:41+00:00

Hi,

I would prefer this Perl regexp because it handles UTF-8 characters and not only A-Z (\l = any lowercase character, \u = any uppercase character).

F: \l(\r\n)?\K(?=\u)
R: \r\n

BR, Fleggy
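Python's standard `re` module has neither `\l`, `\u` nor `\K`, so the sketch below imitates the find and replace pair above with `str.islower()` and `str.isupper()`. The function name `split_pairs` and the `\n` line ending are assumptions made for this illustration; it approximates the behaviour and is not a port of UltraEdit's engine.

```python
def split_pairs(text: str, newline: str = "\n") -> str:
    """Insert a line break wherever a lowercase letter, optionally followed
    by one existing line break, is directly followed by an uppercase letter."""
    out = []
    i, n = 0, len(text)
    while i < n:
        out.append(text[i])
        if text[i].islower():
            j = i + 1
            # Optional existing line break, like (\r\n)? in the expression.
            if text.startswith(newline, j):
                j += len(newline)
            if j < n and text[j].isupper():
                out.append(text[i + 1:j])  # keep the existing break, if any
                out.append(newline)        # the inserted line break
                i = j
                continue
        i += 1
    return "".join(out)

pasted = (
    "Com seus cabelos de estrelasAvec ses chevaux d'étoiles\n"
    "Como um anel de SaturnoComme un anneau de Saturne\n"
    "Um balão de carnavalUn ballon de carnaval"
)
print(split_pairs(pasted))
```

A capital letter after a space, as in "de Saturno", is left alone because the character before it is not a lowercase letter.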

Gabarito · Post #5 · 2024-09-12T12:03+00:00

Thank you, Fleggy.

Your expression works very well too.

I would say it works even better, because it puts a blank line between the pairs of phrases.
It also handles cases where a word is capitalized before the end of the phrase and the line should not be broken at that point.

Like here, where "Saturno" is capitalized:

Com seus cabelos de estrelasAvec ses chevaux d'étoiles
Como um anel de SaturnoComme un anneau de Saturne
Um balão de carnavalUn ballon de carnaval

And it becomes this:

Com seus cabelos de estrelas
Avec ses chevaux d'étoiles

Como um anel de Saturno
Comme un anneau de Saturne

Um balão de carnaval
Un ballon de carnaval

Post #6 · 2024-09-12T12:15+00:00

I would like to ask you both one more thing.

How can I swap the original and translated phrases?
I mean, how do I start with this?

Com seus cabelos de estrelasAvec ses chevaux d'étoiles
Como um anel de SaturnoComme un anneau de Saturne
Um balão de carnavalUn ballon de carnaval

...and end with this?

Avec ses chevaux d'étoiles
Com seus cabelos de estrelas

Comme un anneau de Saturne
Como um anel de Saturno

Un ballon de carnaval
Um balão de carnaval

fleggy · Post #7 · 2024-09-12T12:23+00:00

If you remove the bold markers, then this should work:

F: ^(.+?\l)(\r\n)?(\u.+)
R: $3$2\r\n$1\r\n

BR, Fleggy

Gabarito · Post #8 · 2024-09-12T12:28+00:00

Excellent!
Perfect!

Thank you.

fleggy · Post #9 · 2024-09-12T12:33+00:00

BTW It can be simplified to:

F: ^(.+?\l)(\u.+)
R: $2\r\n$1\r\n

And Match case is not needed...
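The simplified swap can be imitated in Python as well. Since `re` lacks `\l` and `\u`, this sketch scans each line for the first lowercase letter directly followed by an uppercase letter. The names `swap_line` and `swap_pairs` and the `\n` line ending are assumptions for illustration only.

```python
def swap_line(line: str) -> str:
    """Approximate ^(.+?\\l)(\\u.+) replaced with $2\\n$1\\n for one line."""
    # Group 1 needs at least two characters and group 2 at least two,
    # which limits where the split position i may be.
    for i in range(1, len(line) - 2):
        if line[i].islower() and line[i + 1].isupper():
            return line[i + 1:] + "\n" + line[:i + 1] + "\n"
    return line  # no match: the line is left unchanged

def swap_pairs(text: str) -> str:
    # The original line break stays after each replaced line, which is
    # what produces the blank line between the pairs.
    return "\n".join(swap_line(line) for line in text.split("\n"))

pasted = (
    "Com seus cabelos de estrelasAvec ses chevaux d'étoiles\n"
    "Como um anel de SaturnoComme un anneau de Saturne\n"
    "Um balão de carnavalUn ballon de carnaval"
)
print(swap_pairs(pasted))
```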

Gabarito · Post #10 · 2024-09-12T12:59+00:00

fleggy wrote:
BTW It can be simplified to:

F: ^(.+?\l)(\u.+)
R: $2\r\n$1\r\n

And Match case is not needed...

Why do you use "$" instead of "\"?

Excuse me for asking:
Could you share your e-mail address? How can I contact you?
Is it allowed to share e-mail addresses on this forum?

fleggy · Post #11 · 2024-09-12T13:54+00:00

I prefer to use $ in replacements because you are not limited to a maximum of 9 groups (\1 .. \9). You are virtually unlimited using $. I successfully tested groups like $531 and similar high numbers. Unfortunately, $ is not allowed as a backreference in a Perl regexp itself; at least I don't know how to use it. But you can use \gNN or \g{NN}.
Here is an artificial example of how to use such groups (group 11, group 10, and group 1 followed by a zero):

(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\g11\g10\g{1}0

matches

X23456789sZZsX0oooo
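The same example can be tried in Python, with the caveat that Python's `re` module spells these references differently from UltraEdit's Perl engine. In this sketch, `\11` and `\10` inside the pattern refer to groups 11 and 10, `(?:\1)0` keeps group 1 separate from the literal zero, and `\g<N>` is the unambiguous form in a replacement string. These are Python conventions and say nothing about what UltraEdit accepts.

```python
import re

# Eleven single-character groups, then group 11, group 10,
# and group 1 followed by a literal zero.
pattern = r"(.)" * 11 + r"\11\10(?:\1)0"

m = re.match(pattern, "X23456789sZZsX0oooo")
matched = m.group(0)
print(matched)  # X23456789sZZsX0

# In a replacement, \g<1>0 means "group 1, then a zero";
# a plain \10 would be read as group 10 instead.
replaced = re.sub(r"(a)(b)", r"\g<2>\g<1>0", "ab")
print(replaced)  # ba0
```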

BTW I'd prefer not to publish my email, sorry
Fleggy