Possible bug? - File with long lines

    May 24, 2020#1

    Hi.

    I'm using UltraEdit Text/Hex Editor (x64) Version 25.20.0.88.

    I'm trying to use the following search/replace with Regular Expression (Perl):
    Search        (<video muted="1" preload="auto".*?src=")(.*?)(\.mp4.*?</video>)
    Replace        <img src="\2.jpg">

    It's because I want to convert this:
    <video muted="1" preload="auto" style="" class="_ox1 _21y0" data-video-width="261" data-video-height="465" id="u_37_3" src="video-1572541365.mp4" width="261" height="465"></video>

    into this:

    <img src="video-1572541365.jpg">
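
For readers who want to try the expression outside UltraEdit, here is a minimal sketch using Python's `re` module (an assumption for illustration; the thread itself uses UltraEdit's Perl regex engine). The pattern, replacement and sample tag are copied from the post above.

```python
import re

# Search and replace strings copied from the post; \2 is the second capture group.
SEARCH = r'(<video muted="1" preload="auto".*?src=")(.*?)(\.mp4.*?</video>)'
REPLACE = r'<img src="\2.jpg">'

# Sample tag from the post.
sample = ('<video muted="1" preload="auto" style="" class="_ox1 _21y0" '
          'data-video-width="261" data-video-height="465" id="u_37_3" '
          'src="video-1572541365.mp4" width="261" height="465"></video>')

result = re.sub(SEARCH, REPLACE, sample)
print(result)
```

On this single, well-formed tag the expression behaves as intended, so the pattern itself is not the cause of a wrong selection here.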


    It works for some parts of the file, but then UltraEdit suddenly starts to behave unpredictably and selects completely wrong text.
    I don't know whether the error is in my expression or whether it is a bug in the program.

    If the HTML file is small and/or has short lines, the expression is found every time.
    But if the file has long lines with very many columns (characters), the problem shows up.

    To provide a sample for your analysis, I have attached two files.
    File #1 contains some personal dialogues.
    If I remove those dialogues, the problem no longer shows up (file #2).

    I use this to remove personal dialogues:

    Search        (<span class="_3oh- _58nk">)(.*?)(</span>)
    Replace        \1\3

    To remove things like this:

    <span class="_3oh- _58nk">Some personal text</span>
    becomes
    <span class="_3oh- _58nk"></span>
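
The same clean-up can be sketched in Python's `re` module (an assumption for illustration; `strip_span_text` is a hypothetical helper name, not something from UltraEdit). The pattern and replacement are the ones given above.

```python
import re

# Search and replace strings copied from the post.
SPAN_SEARCH = r'(<span class="_3oh- _58nk">)(.*?)(</span>)'
SPAN_REPLACE = r'\1\3'


def strip_span_text(html):
    """Empty every matching span while keeping its opening and closing tags."""
    return re.sub(SPAN_SEARCH, SPAN_REPLACE, html)


one = strip_span_text('<span class="_3oh- _58nk">Some personal text</span>')
two = strip_span_text('<span class="_3oh- _58nk">A</span>'
                      '<span class="_3oh- _58nk">B</span>')
print(one)
print(two)
```

Because `.*?` is non-greedy, two spans on the same line are emptied separately instead of being merged into one match.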

    I'd like to ask that the attached file #1 be removed after your analysis.


    So I'd like to know whether I'm doing something wrong or whether it's actually a program bug.

    Thank you.



    Edited:
    For some mysterious reason, file #2 started to show the same problem.
    So I have removed file #1 myself.
    sample #2 - Teste de bug (sem nomes e sem imagem).zip (30.83 KiB)

      May 26, 2020#2

      Yes. With UltraEdit v25.20.0.88, running this replace (or find) in single-step mode, i.e. first searching for the next occurrence, viewing the found and selected text, and then replacing it, shows a wrong selection after the first two replaces. However, running Replace all from the top of the file produces the correct result as far as I could see. To check this, I inserted newline characters into the input and output files after the replace so that I could compare them and look at the nine differences made by the correct Replace all.
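
The comparison trick of inserting newlines so that a one-line file becomes diffable can be sketched in Python. This is only an illustration under assumptions: the post does not say where the newlines were inserted, so splitting at every `><` boundary is a guess, and `split_tags` and the tiny `before` string are hypothetical.

```python
import difflib
import re

# Search and replace strings from the first post.
SEARCH = r'(<video muted="1" preload="auto".*?src=")(.*?)(\.mp4.*?</video>)'
REPLACE = r'<img src="\2.jpg">'


def split_tags(html):
    """Break the text at every '><' boundary and return the resulting lines."""
    return html.replace('><', '>\n<').splitlines()


# Hypothetical one-line input standing in for the real file.
before = '<div><video muted="1" preload="auto" src="a.mp4"></video><p>hi</p></div>'
after = re.sub(SEARCH, REPLACE, before)

# Keep only the lines that differ between the two split versions.
changed = [line for line in difflib.ndiff(split_tags(before), split_tags(after))
           if line.startswith(('- ', '+ '))]
for line in changed:
    print(line)
```

With one tag per line, a diff tool only has to show the lines that the replace actually touched.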

      So UltraEdit has problems displaying the file contents correctly for this HTML file, which is more than 750 KB in size and contains no newline character at all. That is extremely unusual. I have never seen such an HTML file in 20 years, and in my opinion no browser should display such HTML files, to force the producer of the file to format it usefully.

      The currently latest UltraEdit, v27.00.0.24, shows the same behavior. Replace all works fine, but a step-by-step replace works incorrectly after the second replace.

      You can report this issue to IDM support by email. But I suppose it will be rated as a very low priority issue, because such a text file is really unusual and a nightmare for every text editor.

      BTW: I recommend not using capturing/marking groups for parts of the found string which are not back-referenced at all, and avoiding .*? as much as possible when searching/replacing in an HTML file, to avoid matching too much.

      PS: I really hate webpages which embed binary data such as images and fonts inside the HTML file, data that is usually stored in separate files referenced with a link. The reason is that I use an internet connection on which I have to pay for each MB. So I have disabled loading of images and fonts in my browser configuration for all websites except a few that I configured differently by hand. But I still have to pay for downloading binary data embedded in the HTML file, even though it is not displayed at all. The connection speed is also often very low (less than 20 KB/s), so I have to wait a long time for a page which shows just a few bytes of text but contains several hundred KB of unused binary data, because the data is embedded in the HTML file instead of being stored in separate files linked from it. I ban all websites which use such techniques and never load pages from them again.
      Best regards from an UC/UE/UES for Windows user from Austria

        May 26, 2020#3

        Mofi wrote: However, running Replace all from the top of the file produces the correct result as far as I could see. To check this, I inserted newline characters into the input and output files after the replace so that I could compare them and look at the nine differences made by the correct Replace all.
        You're right.
        Replace all does the job without errors.

        Mofi wrote: So UltraEdit has problems displaying the file contents correctly for this HTML file, which is more than 750 KB in size and contains no newline character at all. That is extremely unusual. I have never seen such an HTML file in 20 years, and in my opinion no browser should display such HTML files, to force the producer of the file to format it usefully.
        That's right.
        I am very frustrated at having to deal with this type of file.
        I usually compare my edits with Beyond Compare, but it seems unable to handle this file either.
        The sample is a small piece of a 9 MB HTML file.
        The file is the result of Firefox's Save As action.
        But Firefox can display it correctly after loading it.
        It's completely crazy, but I found single lines with thousands of columns in it.
        It's very annoying to work with.

        Mofi wrote: The currently latest UltraEdit, v27.00.0.24, shows the same behavior. Replace all works fine, but a step-by-step replace works incorrectly after the second replace.

        You can report this issue to IDM support by email. But I suppose it will be rated as a very low priority issue, because such a text file is really unusual and a nightmare for every text editor.
        A nightmare...
        LOL.

        Mofi wrote: BTW: I recommend not using capturing/marking groups for parts of the found string which are not back-referenced at all, and avoiding .*? as much as possible when searching/replacing in an HTML file, to avoid matching too much.
        Good advice.
        But I'm not a RegExp expert.
        Could you please suggest how to match the string I used?

        Mofi wrote: PS: I really hate webpages which embed binary data such as images and fonts inside the HTML file, data that is usually stored in separate files referenced with a link. The reason is that I use an internet connection on which I have to pay for each MB. So I have disabled loading of images and fonts in my browser configuration for all websites except a few that I configured differently by hand. But I still have to pay for downloading binary data embedded in the HTML file, even though it is not displayed at all. The connection speed is also often very low (less than 20 KB/s), so I have to wait a long time for a page which shows just a few bytes of text but contains several hundred KB of unused binary data, because the data is embedded in the HTML file instead of being stored in separate files linked from it. I ban all websites which use such techniques and never load pages from them again.
        More good advice.
        I agree. HTML is supposed to contain just tags, scripts and text.
        Images come as separate files.
        I'll pay more attention to that.

          May 26, 2020#4

          I would have used in this case the search expression:

          <video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)\.mp4"[^>]*?></video>

          The matching replace string would be: <img src="\1.jpg">

          [^>]+? ... matches non-greedily any character that is not a closing angle bracket, including newline characters, which can exist within a tag, and stops once the string src=" is found. So after a positive match on <video muted="1" preload="auto", matching stops at the latest at the > of the video tag. If a video tag starts with <video muted="1" preload="auto" but does not contain src=", the search does not go beyond the > of this video tag up to the next src=" found anywhere in the file.

          [^\t\r\n."]+? ... this expression stops matching characters on finding a dot (the dot of the file extension .mp4), a double quote (which marks the end of the value of the src attribute), or a horizontal tab, carriage return or line-feed, which a file name cannot contain. So if the referenced video file does not have the file extension .mp4 but, for example, .mpeg, this expression does not continue searching up to the next occurrence of .mp4 on the same line (in this special case nearly the entire file) as .*? would do. So [^\t\r\n."]+? stops matching characters much earlier than .*? would for a video file without the file extension .mp4.

          [^>]*? ... this non-greedy negated character set stops matching characters before the next closing angle bracket, which is the closing > of the <video tag.

          With Replace all, this Perl regular expression also replaces the nine video tags with image tags, but it does not help with the step-by-step find and replace on this special HTML file.
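
The difference between the two expressions can be sketched with Python's `re` module (an assumption for illustration; the thread uses UltraEdit's Perl regex engine). The hypothetical sample below contains one video that is not an .mp4 file followed by one that is.

```python
import re

# Original expression from the first post and the tighter one suggested above.
LAZY_DOT = r'(<video muted="1" preload="auto".*?src=")(.*?)(\.mp4.*?</video>)'
NEGATED = (r'<video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)'
           r'\.mp4"[^>]*?></video>')

# Hypothetical input: the first video has the extension .webm, the second .mp4.
sample = ('<video muted="1" preload="auto" src="clip-a.webm"></video>'
          '<p>text</p>'
          '<video muted="1" preload="auto" src="clip-b.mp4"></video>')

lazy_result = re.sub(LAZY_DOT, r'<img src="\2.jpg">', sample)
negated_result = re.sub(NEGATED, r'<img src="\1.jpg">', sample)

print(lazy_result)     # the match runs from the first tag into the second one
print(negated_result)  # only the .mp4 tag is replaced
```

With `.*?` the second group keeps growing until some `.mp4` is found, so it swallows the end of the first tag and the paragraph in between. The negated character classes cannot cross `.`, `"` or `>`, so the first tag simply fails to match and is left alone.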
          Best regards from an UC/UE/UES for Windows user from Austria

            May 27, 2020#5

            Thank you, Mofi, for your detailed explanations and suggestions.
            I'll run some tests and will return later to give feedback.

              May 29, 2020#6

              Your suggested RegExp works very well.
              And I learned more about greedy/non-greedy matching, back-references and other things.

              Thank you.

                May 29, 2020#7

                Well, I don't want to look like a nit-picker, but if you want to learn more about quantifiers, I suggest this little optimization using a possessive quantifier:

                <video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)\.mp4"[^>]*?></video>

                <video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)\.mp4"[^>]*+></video>

                The part [^>]*> means: match all characters other than ">" and then match the following ">". So the text matched by [^>]* cannot contain any ">", and thus it can be thrown away as a whole during backtracking when the last subpattern </video> does not match. To express this property of a quantifier, the possessive modifier + is used (*+ or ++ or ?+). Some engines automatically treat the quantifier in patterns like [^X]*X as possessive: [^X]*+X . However, I think it is useful to use a possessive quantifier explicitly wherever it makes sense.
                Because, as one wise regexp guru said, a really good regexp must fail as soon as possible. If there is no match at the current location, of course ;)
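
A small sketch of the two variants in Python (an assumption for illustration; the thread uses UltraEdit's Perl regex engine). Python's `re` module accepts possessive quantifiers only from version 3.11 on, so the sketch skips that variant on older interpreters. The sample tag is a shortened, hypothetical version of the one in the first post.

```python
import re
import sys

# The two expressions from this post: non-greedy *? versus possessive *+.
LAZY = (r'<video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)'
        r'\.mp4"[^>]*?></video>')
POSSESSIVE = (r'<video muted="1" preload="auto"[^>]+?src="([^\t\r\n."]+?)'
              r'\.mp4"[^>]*+></video>')
REPLACE = r'<img src="\1.jpg">'

sample = ('<video muted="1" preload="auto" id="u_37_3" '
          'src="video-1572541365.mp4" width="261" height="465"></video>')

lazy_result = re.sub(LAZY, REPLACE, sample)

# Possessive quantifiers need Python 3.11 or later; skip them otherwise.
if sys.version_info >= (3, 11):
    possessive_result = re.sub(POSSESSIVE, REPLACE, sample)
else:
    possessive_result = None

print(lazy_result)
print(possessive_result)
```

Both variants match the same text. The possessive form only changes how quickly the engine gives up when `</video>` does not follow the closing `>`.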

                BR, Fleggy

                  May 29, 2020#8

                  Thank you, Fleggy.
                  It's always a good time to learn about RegExps.
                  I took note of your suggestions and will use them too.
                  Have a nice day.
                  🙂