Different matches between UltraEdit and regex101

    0:59 - Sep 12#1

    I ran into a very strange situation.

    After copying and pasting text from a web page, I realized that I would need to do some regular expression tweaking to fix the problem.

    The page is this:

    https://www.letras.mus.br/michel-legran ... ducao.html

    When I copy the lyrics of the song and its translation and paste them into UltraEdit, the original lyrics are joined directly to the last letter of the translation, all on the same line.

    The original should instead appear on the line below the translation.

    Like this:

    Code:

    Como uma pedra que se atira
    Comme une pierre que l'on jette
    
    Na corrente de um riacho
    Dans l'eau vive d'un ruisseau
    
    E que deixa atrás dela
    Et qui laisse derrière elle

    But it comes out like this:

    Code:

    Como uma pedra que se atiraComme une pierre que l'on jette
    Na corrente de um riachoDans l'eau vive d'un ruisseau
    E que deixa atrás delaEt qui laisse derrière elle

    I'm struggling to find a regular expression that fixes this.
    I came up with this one:

    Code:

    ^([A-Z])([a-z|\s]*)([A-Z])

    It captures the first and second groups up to the next capital letter, to keep that part on the first line.
    The plan is then to capture a third and fourth group up to the end of the line, to move them to the line below.
    I would write the rest of the expression once I had the groups defined correctly.

    If I test it on regex101.com, it appears to be fine: https://regex101.com/r/OlzOly/
    But UltraEdit finds a different match.

    In UltraEdit, my expression captures everything up to the apostrophe ('):

    Code:

    Como uma pedra que se atiraComme une pierre que l

    It ignores the second capital letter.
    What's wrong?

    I also realized that I have to account for another situation: lines that contain an apostrophe.

    I also have another question:
    Why do certain web pages not respect line breaks?

      5:16 - Sep 12#2

      You have most likely not checked the option Match case, which is very important here, as otherwise [A-Z] and [a-z] match the same set of characters: all ASCII letters independent of case.
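
The effect of the Match case option can be reproduced outside UltraEdit. The following is a minimal Python sketch, assuming that Python's re.IGNORECASE flag behaves like an unchecked Match case option for these ASCII character classes. The variable names are illustrative only.

```python
import re

# The expression from the first post, unchanged.
pattern = r"^([A-Z])([a-z|\s]*)([A-Z])"
line = "Como uma pedra que se atiraComme une pierre que l'on jette"

# Case-sensitive search: comparable to Match case being checked.
sensitive = re.search(pattern, line).group(0)

# Case-insensitive search: comparable to Match case being unchecked.
insensitive = re.search(pattern, line, re.IGNORECASE).group(0)

print(sensitive)
print(insensitive)
```

With case-insensitive matching, the second group accepts letters of either case and only stops at the apostrophe, which is the match reported in the first post.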

      A better Perl regular expression search string would be ^[A-Z][^\r\nA-Z]+?\K(?=[A-Z]) with \r\n as the replace string. The option Match case must be checked for this Perl regular expression replace.

      \K resets the start location of $0 to the current text position: in other words, everything to the left of \K is "kept back" and does not form part of the regular expression match.

      (?=[A-Z]) is a positive look-ahead to check without matching (selecting) if the next character is an upper case ASCII letter and not a carriage return or line feed.
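
For readers who want to try the same idea outside UltraEdit, here is a minimal Python sketch. Python's standard re module does not support \K, so this sketch swaps in a capturing group that is written back in the replacement. It also inserts "\n" instead of "\r\n"; both choices are assumptions of the sketch, not part of the expression above.

```python
import re

text = (
    "Como uma pedra que se atiraComme une pierre que l'on jette\n"
    "Na corrente de um riachoDans l'eau vive d'un ruisseau\n"
    "E que deixa atrás delaEt qui laisse derrière elle"
)

# Group 1 plays the role of the part "kept back" by \K: it is re-emitted,
# followed by the inserted line break. Matching is case-sensitive by default.
pattern = re.compile(r"^([A-Z][^\r\nA-Z]+?)(?=[A-Z])", re.MULTILINE)
fixed = pattern.sub(r"\1\n", text)
print(fixed)
```

The lazy quantifier stops at the first uppercase ASCII letter after the start of the line, so each fused line is split exactly once.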

      Gabarito wrote: Why do certain web pages not respect line breaks?
      The reason is that the authors of those web pages do not know the HTML specification. When an HTML file is parsed, the whitespace characters carriage return and line feed are interpreted like a normal space, and a sequence of multiple normal spaces, horizontal tabs, carriage returns and line feeds is reduced to a single normal space. The exception is preformatted text, i.e. text within <pre> and </pre>, where multiple normal spaces and newline characters are preserved. Line breaks in text inside a normal paragraph within <p> and </p>, and inside other elements, must be created with the tag <br> (HTML) or <br /> (XHTML).
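
The whitespace rule described above can be approximated in a few lines of Python. This is only a rough sketch: the function name collapse_whitespace is hypothetical, and it ignores <pre> elements, CSS and every other detail of real HTML rendering.

```python
import re


def collapse_whitespace(source: str) -> str:
    """Reduce each run of spaces, tabs, CR and LF to a single space."""
    return re.sub(r"[ \t\r\n]+", " ", source)


source = "Como uma pedra que se atira\r\nComme une pierre   que l'on jette"
print(collapse_whitespace(source))
```

The line break typed into the source disappears in the output, which is why a newline in the HTML file alone does not produce a visible line break.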

      Hint: When viewing a web page with obviously wrongly formatted text and wanting to copy the text with the right formatting, it often helps to press Ctrl+U in the web browser to open a window with the source code of the HTML file, search for the first line of the text in the source code window, select the text there and copy it to the clipboard. The source code window most often contains the text as it was pasted into the HTML file, including the newline characters. The browser interprets those newline characters as normal whitespace according to the HTML specification and therefore does not display them as line breaks, as described above.

      The referenced source page is very badly formatted HTML. It has lots of HTML syntax mistakes. That can be seen by saving the HTML file as is, opening the saved HTML file in UltraEdit or UEStudio and running HTML Tidy. The browsers must automatically correct 124 mistakes (the number of warnings output by HTML Tidy) on parsing this HTML file. The page also ignores standard rules for document writing. Using heading level 3 for text which is definitely not a heading at all, just to get the text displayed larger in the browser window, is awful. A paragraph with the appropriate CSS property font-size should be used to get the text displayed larger, not a heading level 3.

      Best regards from an UC/UE/UES for Windows user from Austria

        9:11 - Sep 12#3

        Everything was very well explained.
        Yes, indeed, I had not checked the Match case checkbox.
        And I was not even aware of the need to use the "kept-back" or "look-ahead" features.

        Regarding the HTML page code, the mystery has finally been cleared up.
        I have come across this type of problem before and could never understand how the page displayed the line break while the text copied and pasted into an editor arrived without it.

        Thread SOLVED.

        Thank you very much, Mofi, for the detailed explanations.

          11:41 - Sep 12#4

          Hi,

          I would prefer this Perl regexp because it handles UTF-8 characters and not only A-Z (\l = any lowercase character, \u = any uppercase character):

          F: \l(\r\n)?\K(?=\u)
          R: \r\n
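
The same rule can be sketched in Python for comparison. Python's standard re module has neither \l and \u nor \K, so this sketch swaps the regular expression for a plain loop using str.islower() and str.isupper(), which are Unicode-aware. The function name split_fused_lines is hypothetical.

```python
def split_fused_lines(text: str, newline: str = "\r\n") -> str:
    """Insert a newline after a lowercase character (and an optional
    existing newline) whenever an uppercase character follows."""
    result = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        result.append(ch)
        i += 1
        if ch.islower():
            # Skip over an existing newline directly after the character.
            j = i + len(newline) if text.startswith(newline, i) else i
            if j < n and text[j].isupper():
                result.append(text[i:j])  # keep the existing newline, if any
                result.append(newline)    # add the new line break
                i = j
    return "".join(result)


sample = (
    "Como um anel de SaturnoComme un anneau de Saturne\r\n"
    "Um balão de carnavalUn ballon de carnaval"
)
print(split_fused_lines(sample))
```

Because a lowercase character must come directly before the uppercase one, the space before "Saturno" prevents a break there, and the existing line ends gain a second newline that separates the pairs.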

          BR, Fleggy

            12:03 - Sep 12#5

            Thank you, Fleggy.

            Your expression works very well too.

            I would say it works even better, because it puts a blank line between two pairs of phrases.
            Not only that, but there are also cases where a word is capitalized before the end of the phrase, and the line should not be broken at that point.

            Like here, where "Saturno" starts with an uppercase character:

            Code:

            Com seus cabelos de estrelasAvec ses chevaux d'étoiles
            Como um anel de SaturnoComme un anneau de Saturne
            Um balão de carnavalUn ballon de carnaval

            And it becomes this:

            Code:

            Com seus cabelos de estrelas
            Avec ses chevaux d'étoiles
            
            Como um anel de Saturno
            Comme un anneau de Saturne
            
            Um balão de carnaval
            Un ballon de carnaval

              12:15 - Sep 12#6

              I would like to ask you both one more thing.

              How can I swap the original and translated phrases?
              I mean, how do I go from this?

              Code:

              Com seus cabelos de estrelasAvec ses chevaux d'étoiles
              Como um anel de SaturnoComme un anneau de Saturne
              Um balão de carnavalUn ballon de carnaval
              

              ...and end up with this?

              Code:

              Avec ses chevaux d'étoiles
              Com seus cabelos de estrelas
              
              Comme un anneau de Saturne
              Como um anel de Saturno
              
              Un ballon de carnaval
              Um balão de carnaval
              

                12:23 - Sep 12#7

                If you remove the bold markers, then this should work:

                F: ^(.+?\l)(\r\n)?(\u.+)
                R: $3$2\r\n$1\r\n

                BR, Fleggy

                  12:28 - Sep 12#8

                  Excellent!
                  Perfect!


                  Thank you.

                    12:33 - Sep 12#9

                    BTW It can be simplified to:

                    F: ^(.+?\l)(\u.+)
                    R: $2\r\n$1\r\n

                    And Match case is not needed...
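
For comparison, the swap can be sketched in Python. Since Python's standard re module lacks \l and \u, this sketch finds the first lowercase character followed directly by an uppercase one with str.islower() and str.isupper(). The function names split_pair and swap_pairs are hypothetical, and "\n" is used instead of "\r\n".

```python
def split_pair(line):
    """Return (first_part, second_part), or None if no boundary is found."""
    # Start at 2 so that the first part has at least two characters,
    # as required by ^(.+?\l).
    for i in range(2, len(line)):
        if line[i - 1].islower() and line[i].isupper():
            return line[:i], line[i:]
    return None


def swap_pairs(text):
    out = []
    for line in text.splitlines():
        pair = split_pair(line)
        if pair is None:
            out.append(line)  # leave lines without a boundary untouched
        else:
            first, second = pair
            out.extend([second, first, ""])  # swapped pair plus a blank line
    return "\n".join(out)


sample = (
    "Como um anel de SaturnoComme un anneau de Saturne\n"
    "Um balão de carnavalUn ballon de carnaval"
)
print(swap_pairs(sample))
```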

                      12:59 - Sep 12#10

                      fleggy wrote:
                      12:33 - Sep 12
                      BTW It can be simplified to:

                      F: ^(.+?\l)(\u.+)
                      R: $2\r\n$1\r\n

                      And Match case is not needed...

                      Why do you use "$" instead of "\"?

                      Excuse me for asking:
                      Could you share your e-mail address? How can I contact you?
                      Is it allowed to share an e-mail address here on this forum?

                        13:54 - Sep 12#11

                        I prefer to use $ in replacements because you are not limited to a maximum of 9 groups (\1 .. \9). You are virtually unlimited using $. I successfully tested groups like $531 and similar high numbers. Unfortunately, $ is not allowed as a backreference in a Perl regexp itself. At least I don't know of a way to use it. But you can use \gNN or \g{NN}.
                        Here is an artificial example of how to use such groups (group 11, group 10 and group 1 followed by zero):

                        (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\g11\g10\g{1}0

                        matches

                        X23456789sZZsX0oooo
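
The example can be checked with a short Python sketch. Note that this uses Python's re syntax, which differs from the Perl syntax above: inside a pattern Python writes backreferences as \11 and \10, and (?:\1)0 is used here to separate group 1 from the literal zero. The \g<N> form exists in Python only in replacement strings.

```python
import re

# Eleven single-character groups, then group 11, group 10, and group 1
# followed by a literal "0". (?:\1)0 keeps the digit from being read as
# part of the group number.
pattern = r"(.)" * 11 + r"\11\10(?:\1)0"
m = re.search(pattern, "X23456789sZZsX0oooo")
print(m.group(0))

# In a replacement string, \g<1>0 means group 1 followed by a literal "0".
replaced = re.sub(r"(a)(b)", r"\g<1>0", "ab")
print(replaced)
```

The match covers "X23456789sZZsX0"; the trailing "oooo" is not part of it.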

                        BTW I'd prefer not to publish my email, sorry
                        Fleggy