Different matches between UltraEdit and regex101

Power User

    0:59 - 7 days ago#1

    I ran into a very strange situation.

    After copying and pasting text from a web page, I realized that I would need to do some regular expression tweaking to fix the problem.

    The page is this:

    https://www.letras.mus.br/michel-legran ... ducao.html

    When I copy the lyrics of the song and its translation and paste them into UltraEdit, each line of the original lyrics ends up attached directly to the last letter of the translation, all on the same line.

    The original line should come on the line below the translation.

    Like this:

    Como uma pedra que se atira
    Comme une pierre que l'on jette
    
    Na corrente de um riacho
    Dans l'eau vive d'un ruisseau
    
    E que deixa atrás dela
    Et qui laisse derrière elle

    But it's coming like this:

    Como uma pedra que se atiraComme une pierre que l'on jette
    Na corrente de um riachoDans l'eau vive d'un ruisseau
    E que deixa atrás delaEt qui laisse derrière elle

    I'm struggling to find a regular expression that fixes this.
    I came up with this one:

    ^([A-Z])([a-z|\s]*)([A-Z])

    The first and second groups capture everything up to the next capital letter, to keep that part on the first line.
    The third group, plus a fourth one still to be written, would capture the rest of the line so that it can be moved to the line below.
    I would write the rest of the expression once I had the groups defined correctly.

    If I test it on regex101.com, it appears to be fine: https://regex101.com/r/OlzOly/
    But UltraEdit finds a different match.

    In UltraEdit, this expression of mine matches everything up to the apostrophe ('):

    Como uma pedra que se atiraComme une pierre que l

    It ignores the second capital letter.
    What's wrong?

    I also realized that I have to handle another case: lines that contain an apostrophe.

    I would also have another question:
    Why do certain web pages not respect line breaks?

Grand Master

      5:16 - 7 days ago#2

      You have most likely not checked the option Match case. It is very important here, because otherwise [A-Z] and [a-z] match the same set of characters: all ASCII letters regardless of case.
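The effect can be reproduced outside UltraEdit. The following is a minimal sketch in Python, assuming that its `re.IGNORECASE` flag behaves like a search with Match case unchecked; it is not UltraEdit's Perl engine, and the variable names are only illustrative.

```python
import re

# The expression from post #1. Note that "|" inside [...] is a literal pipe.
PATTERN = r"^([A-Z])([a-z|\s]*)([A-Z])"
LINE = "Como uma pedra que se atiraComme une pierre que l'on jette"

# Case-sensitive search (comparable to Match case checked).
case_sensitive = re.match(PATTERN, LINE).group(0)

# Case-insensitive search (assumed comparable to Match case unchecked):
# [A-Z] and [a-z] now both accept any ASCII letter.
case_insensitive = re.match(PATTERN, LINE, re.IGNORECASE).group(0)

print(case_sensitive)    # Como uma pedra que se atiraC
print(case_insensitive)  # Como uma pedra que se atiraComme une pierre que l
```

In the case-insensitive run the second group swallows every letter and space, so the match only stops at the apostrophe, which is the result reported for UltraEdit.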

      A better Perl regular expression for the search would be ^[A-Z][^\r\nA-Z]+?\K(?=[A-Z]) with \r\n as the replace expression. The option Match case must be checked for this Perl regular expression replace.

      \K resets the start location of $0 to the current text position: in other words, everything to the left of \K is "kept back" and does not form part of the regular expression match.

      (?=[A-Z]) is a positive look-ahead to check without matching (selecting) if the next character is an upper case ASCII letter and not a carriage return or line feed.
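For readers who want to try the idea outside UltraEdit, here is a rough Python equivalent. Python's `re` module has no `\K`, so this sketch swaps in a capture group that is written back by the replacement; the look-ahead is unchanged. It uses `\n` instead of `\r\n` for brevity, and the names `TEXT` and `SPLIT` are only illustrative.

```python
import re

TEXT = (
    "Como uma pedra que se atiraComme une pierre que l'on jette\n"
    "Na corrente de um riachoDans l'eau vive d'un ruisseau\n"
    "E que deixa atrás delaEt qui laisse derrière elle"
)

# Group 1 plays the role of the text "kept back" by \K: it is matched and
# then re-inserted unchanged. (?=[A-Z]) checks for the next upper case
# ASCII letter without consuming it.
SPLIT = re.compile(r"^([A-Z][^\r\nA-Z]+?)(?=[A-Z])", re.MULTILINE)

result = SPLIT.sub(r"\1\n", TEXT)
print(result)
```

The search is case-sensitive here because no `re.IGNORECASE` flag is passed, which corresponds to having Match case checked.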

      Gabarito wrote: Why do certain web pages not respect line breaks?
      The reason is that the authors of those web pages do not know the HTML specification. When an HTML file is parsed, the whitespace characters carriage return and line feed are interpreted like a normal space, and a sequence of multiple normal spaces, horizontal tabs, carriage returns and line feeds is reduced to a single normal space. The exception is preformatted text, i.e. text within <pre> and </pre>, where multiple spaces and newline characters are preserved. Line breaks inside a normal paragraph within <p> and </p>, and inside other elements, must be written with the tag <br> (HTML) or <br /> (XHTML).
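The collapsing rule can be imitated in a few lines of Python. This is a deliberately simplified sketch of the rule as described above, not a model of a real browser, which also takes block elements and CSS into account; `collapse_html_whitespace` is a hypothetical helper name.

```python
import re


def collapse_html_whitespace(text: str) -> str:
    """Reduce each run of spaces, tabs and newline characters to one space."""
    return re.sub(r"[ \t\r\n]+", " ", text)


# Two lines as they might be typed into an HTML file without <br> or <pre>.
source = "Como uma pedra que se atira\r\n    Comme une pierre que l'on jette"
rendered = collapse_html_whitespace(source)
print(rendered)  # Como uma pedra que se atira Comme une pierre que l'on jette
```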

      Hint: When a web page shows obviously badly formatted text and you want to copy the text with the right formatting, it often helps to press Ctrl+U in the web browser to open a window with the source code of the HTML file. Search for the first line of the text in the source code window, select the text there and copy it to the clipboard. The source code window usually contains the text as it was pasted into the HTML file, including the newline characters which, according to the HTML specification, are interpreted as normal whitespace and therefore removed when the text is displayed, as described above.

      The referenced source page is very badly formatted HTML. It has lots of HTML syntax mistakes. This can be seen by saving the HTML file as is, opening the saved file in UltraEdit or UEStudio and running HTML Tidy. Browsers must automatically correct 124 mistakes (the number of warnings output by HTML Tidy) when parsing this HTML file. The page also ignores standard rules for document writing. Using heading level 3 for text which is definitely not a heading, just to get it displayed larger in the browser window, is awful. A paragraph with the appropriate CSS property font-size should be used to display the text larger, not a heading level 3.

      Best regards from an UC/UE/UES for Windows user from Austria

      Power User

        9:11 - 7 days ago#3

        Everything was very well explained.
        Yes, indeed, I had not checked the Match case checkbox.
        And I was not even aware of the need to use the "kept-back" or "look-ahead" features.

        Regarding the HTML page code, the enigma has finally been clarified.
        I have come across this type of problem before and could never understand how the page displayed the line break, but the text copied and pasted into an editor came without it.

        Thread SOLVED.

        Thank you very much, Mofi, for the detailed explanations.

        Master

          11:41 - 7 days ago#4

          Hi,

          I would prefer this Perl regexp because it also handles non-ASCII (UTF-8) letters and not only A-Z (\l = any lowercase character, \u = any uppercase character):

          F: \l(\r\n)?\K(?=\u)
          R: \r\n

          BR, Fleggy
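Python's `re` module supports neither `\l`, `\u` nor `\K`, so the following sketch swaps in `str.islower()` and `str.isupper()` to get the same Unicode-aware behavior. It is an illustration of the idea under those assumptions, not the UltraEdit search itself; `split_on_case_boundary` is a hypothetical helper name and `\n` stands in for `\r\n`.

```python
def split_on_case_boundary(text: str, newline: str = "\n") -> str:
    """Insert a newline wherever a lowercase letter, optionally followed by
    one line break, is followed by an uppercase letter."""
    out = []
    i = 0
    while i < len(text):
        out.append(text[i])
        if text[i].islower():
            j = i + 1
            # Step over an optional existing line break, like (\r\n)? does.
            if text.startswith(newline, j):
                j += len(newline)
            if j < len(text) and text[j].isupper():
                out.append(text[i + 1:j])  # keep the existing break, if any
                out.append(newline)        # the inserted line break
                i = j
                continue
        i += 1
    return "".join(out)


lyrics = (
    "Com seus cabelos de estrelasAvec ses chevaux d'étoiles\n"
    "Como um anel de SaturnoComme un anneau de Saturne\n"
    "Um balão de carnavalUn ballon de carnaval"
)
print(split_on_case_boundary(lyrics))
```

Because a break is only inserted directly after a lowercase letter, a capitalized word that follows a space is left alone, and an existing line break gains a second one, which produces a blank line between pairs.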

          Power User

            12:03 - 7 days ago#5

            Thank you, Fleggy.

            Your expression works very well too.

            I would say it works even better, because it puts a blank line between each pair of phrases.
            And not only that: there are cases where a word is capitalized before the end of the phrase, and the line should not be broken at that point.

            Like here, where "Saturno" starts with an uppercase letter:

            Com seus cabelos de estrelasAvec ses chevaux d'étoiles
            Como um anel de SaturnoComme un anneau de Saturne
            Um balão de carnavalUn ballon de carnaval

            And it becomes like this:

            Com seus cabelos de estrelas
            Avec ses chevaux d'étoiles
            
            Como um anel de Saturno
            Comme un anneau de Saturne
            
            Um balão de carnaval
            Un ballon de carnaval

              12:15 - 7 days ago#6

              I would like to ask you both one more thing.

              How can I swap the original and translated phrases?
              I mean, how do I start with this:

              Com seus cabelos de estrelasAvec ses chevaux d'étoiles
              Como um anel de SaturnoComme un anneau de Saturne
              Um balão de carnavalUn ballon de carnaval
              

              ...and end with this?

              Avec ses chevaux d'étoiles
              Com seus cabelos de estrelas
              
              Comme un anneau de Saturne
              Como um anel de Saturno
              
              Un ballon de carnaval
              Um balão de carnaval
              

              Master

                12:23 - 7 days ago#7

                If you remove the bold markers, then this should work:

                F: ^(.+?\l)(\r\n)?(\u.+)
                R: $3$2\r\n$1\r\n

                BR, Fleggy

                Power User

                  12:28 - 7 days ago#8

                  Excellent!
                  Perfect!


                  Thank you.

                  Master

                    12:33 - 7 days ago#9

                    BTW It can be simplified to:

                    F: ^(.+?\l)(\u.+)
                    R: $2\r\n$1\r\n

                    And Match case is not needed...
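The swap can also be sketched in Python. Since Python's `re` has no `\l` and `\u`, this version swaps in `str.islower()` and `str.isupper()` to find the first lowercase-to-uppercase boundary on each line; `swap_halves` is a hypothetical helper name and `\n` stands in for `\r\n`.

```python
def swap_halves(line: str) -> str:
    """Mimic F: ^(.+?\\l)(\\u.+) with R: $2\\r\\n$1\\r\\n for a single line."""
    # Like the regex, require at least one character before the lowercase
    # letter and at least one character after the uppercase letter.
    for i in range(1, len(line) - 2):
        if line[i].islower() and line[i + 1].isupper():
            first, second = line[:i + 1], line[i + 1:]
            return f"{second}\n{first}\n"
    return line  # no boundary found: leave the line unchanged


lyrics = (
    "Com seus cabelos de estrelasAvec ses chevaux d'étoiles\n"
    "Como um anel de SaturnoComme un anneau de Saturne\n"
    "Um balão de carnavalUn ballon de carnaval"
)
swapped = "\n".join(swap_halves(line) for line in lyrics.split("\n"))
print(swapped)
```

The case tests are explicit here, just as `\l` and `\u` are explicit case classes in the search expression, so no case-sensitivity option is involved in this sketch.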

                    Power User

                      12:59 - 7 days ago#10

                      fleggy wrote:
                      12:33 - 7 days ago
                      BTW It can be simplified to:

                      F: ^(.+?\l)(\u.+)
                      R: $2\r\n$1\r\n

                      And Match case is not needed...

                      Why do you use "$" instead of "\"?

                      Excuse me for asking:
                      Could you share your e-mail address? How can I contact you?
                      Is it allowed to share e-mail addresses on this forum?

                      Master

                        13:54 - 7 days ago#11

                        I prefer to use $ in replacements because you are not limited to a maximum of 9 groups (\1 .. \9). You are virtually unlimited using $. I successfully tested groups like $531 and similar high numbers. Unfortunately, $ is not allowed as a backreference in a Perl regexp itself. At least, I don't know of a way to use it. But you can use \gNN or \g{NN}.
                        Here is an artificial example of how to use such groups (group 11, group 10 and group 1 followed by a zero):

                        (.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\g11\g10\g{1}0

                        matches

                        X23456789sZZsX0oooo
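The same example can be tried in Python, with the caveat that Python's syntax differs from UltraEdit's Perl engine: inside a pattern, Python's `re` writes these backreferences as `\11` and `\10`, and the `\g<NN>` form is only available in replacement templates. The sketch below is an assumption-labeled translation, not the expression as UltraEdit accepts it.

```python
import re

# Eleven single-character groups, then backreferences to group 11, group 10
# and group 1 followed by a literal zero. (?:\1)0 keeps "\1" and "0" apart,
# because "\10" would mean group 10.
pattern = re.compile(r"(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)\11\10(?:\1)0")

match = pattern.match("X23456789sZZsX0oooo")
print(match.group(0))  # X23456789sZZsX0

# In a replacement template, \g<NN> separates the group number from
# any digits that follow it.
print(match.expand(r"\g<11>\g<10>\g<1>0"))  # ZsX0
```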

                        BTW I'd prefer not to publish my email, sorry
                        Fleggy

                        Power User

                          17:23 - 7 days ago#12

                          fleggy wrote:
                          13:54 - 7 days ago
                          I rather use $ in replacements because you are not limited to max 9 groups (\1 .. \9). You are virtually unlimited using $.

                          I didn't know that.
                          👍

                          fleggy wrote:
                          13:54 - 7 days ago
                          BTW I'd prefer not to publish my email, sorry
                          Fleggy

                          All right.

                          I don't feel comfortable sharing my email either.
                          But I can post it here for a short time so that you can take note of it.

                          [deleted]

                          If you want to stay in touch, please send me a message in the next few hours.

                          When I receive it, I'll come back here and delete my email from the post.
                          If I don't receive a confirmation within the next 2 hours, I'll also go back and delete the email.

                          Thank you.