Regular expression to extract URL strings from a XUL file

1032
Power UserPower User
1032

    Apr 29, 2018#1

    Hi.

    I use the Custom Buttons Firefox extension in my daily work.
    Due to Firefox updates, that extension has become obsolete and is refused by newer versions of the browser.
    It uses a XUL format file to store the URLs.
    I'd like to capture just my bookmarked URLs from it.
    Regular expressions seem to be the best solution, because the file has become a little big.

    I tried many variations to grab the URL strings after the "loadURI" key, without success.

    Here is a piece of the file from which I'm trying to extract only the URLs:

    Code:

    <overlay id="custombuttons-profile-overlay" xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
      <toolbarpalette id="BrowserToolbarPalette">
        <toolbarbutton id="custombuttons-button0" label="Google" tooltiptext="Google" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/google3-20x20.jpg" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        
        <toolbarbutton id="custombuttons-button5" label="FreeDictionary" tooltiptext="FreeDictionary" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Dictionary-36x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.thefreedictionary.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        
        <toolbarbutton id="custombuttons-button10" label="SpeedTest" tooltiptext="SpeedTest" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SpeedTest-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        <toolbarbutton id="custombuttons-button11" label="SuaLíngua" tooltiptext="SuaLíngua" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SuaL%C3%ADngua-16x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://wp.clicrbs.com.br/sualingua/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        <toolbarbutton id="custombuttons-button14" label="Modem" tooltiptext="Modem" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/M-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
    </toolbarpalette>
      <toolbarpalette id="MailToolbarPalette"/>
      <toolbarpalette id="MsgComposeToolbarPalette"/>
      <toolbarpalette id="calendarToolbarPalette"/>
      <toolbarpalette id="NvuToolbarPalette"/>
    </overlay>
    So, the output I want is something like this:

    Code:

    http://www.google.com
    http://www.thefreedictionary.com
    http://www.speedtest.net/
    http://wp.clicrbs.com.br/sualingua/
    http://192.168.100.1/
    The regular expression has to discard everything up to and including loadURI(&quot; and everything from &quot;)" cb-init onward.
    Everything between them should go to the output.

    11327
    MasterMaster
    11327

      Apr 29, 2018#2

      Find what: (?<=loadURI\(&quot;).*(?=&)
      Check Highlight all items found
      Check Regular expression and choose Perl

      Press Next.

      Press Ctrl + , (Convert Highlighted to selection command)
      Now you can copy the selection to a new file via Copy/Paste

      HTH
      It's impossible to lead us astray for we don't care even to choose the way.
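For readers who want to try the lookaround idea outside UltraEdit, here is a minimal Python sketch. It is an illustration under assumptions, not Ovg's exact expression: the greedy `.*` and `(?=&)` are swapped for a lazy `.*?` and `(?=&quot;)` so that each match stops at the first closing `&quot;`. The `sample` text is shortened from post #1.

```python
import re

# Two shortened lines in the style of the XUL file from post #1 (assumed sample).
sample = (
    '<toolbarbutton label="Google" '
    'cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-mode="0"/>\n'
    '<toolbarbutton label="SpeedTest" '
    'cb-oncommand="/*CODE*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-mode="0"/>\n'
)

# Lookbehind: the match must start right after loadURI(&quot;
# Lookahead: the match must end right before the next &quot;
pattern = re.compile(r'(?<=loadURI\(&quot;).*?(?=&quot;)')

urls = pattern.findall(sample)
print("\n".join(urls))
```

Because both lookarounds consume nothing, each match is the bare URL, which is what the highlight-and-copy steps above rely on.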

      6,686585
      Grand MasterGrand Master
      6,686585

        Apr 30, 2018#3

        First run a Perl regular expression Replace All from the top of the file with the search string ^(?:(?!loadURI\(&quot;).)*$\r?\n and an empty replace string to delete all lines not containing loadURI(&quot;.

        Then run a Perl regular expression Replace All from the top of the file with the search string ^.*loadURI\(&quot;(.+?)&quot.*$ and \1 as the replace string to delete everything around the URL on each line.

        Delete the last line of the file if it does not contain a URL; such a line can remain when the file has no line termination at its end.

        Use the Save As command to save the file, which now contains only a list of URLs, under a new name.
        Best regards from an UC/UE/UES for Windows user from Austria
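The two Replace All steps can be tried in Python as a rough sketch. The search strings are the ones from the post; the `text` sample, the variable names, and the use of `re.MULTILINE` to get per-line `^` and `$` are assumptions made for illustration.

```python
import re

# Shortened sample (assumed); note there is no line termination after </overlay>.
text = (
    '<overlay id="custombuttons-profile-overlay">\n'
    '  <toolbarbutton label="Google" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-mode="0"/>\n'
    '\n'
    '  <toolbarbutton label="Modem" cb-oncommand="/*CODE*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-mode="0"/>\n'
    '</overlay>'
)

# Step 1: delete every terminated line that does not contain loadURI(&quot;
step1 = re.sub(r'^(?:(?!loadURI\(&quot;).)*$\r?\n', '', text, flags=re.MULTILINE)

# Step 2: on the remaining lines, keep only the captured URL.
step2 = re.sub(r'^.*loadURI\(&quot;(.+?)&quot.*$', r'\1', step1, flags=re.MULTILINE)

# Step 3: the unterminated last line survives step 1, so drop it if it has no URL.
lines = step2.splitlines()
if lines and 'loadURI(&quot;' not in step1.splitlines()[-1]:
    lines.pop()

print("\n".join(lines))
```

The closing `</overlay>` line is the leftover that the manual "delete last line" step takes care of.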

        1032
        Power UserPower User
        1032

          Apr 30, 2018#4

          Excellent!

          Both solutions, Ovg's and Mofi's, work very well.
          This topic is more about learning a little bit about regular expressions than about solving my task.
          My buttonsoverlay.xul file is not so big and I could do the job manually.
          But the ways you suggested are far better.

          Thank you both.

          🙂

          11327
          MasterMaster
          11327

            Apr 30, 2018#5

            I've found that my solution only works for files with fewer than 3000 links, so Mofi's solution is much better.

            1032
            Power UserPower User
            1032

              Apr 30, 2018#6

              That's right.
              I realized the same right now too.
              And I was wondering what happened and why your string (?<=loadURI\(&quot;).*(?=&) did not work beyond some point.
              🤔

              11327
              MasterMaster
              11327

                Apr 30, 2018#7

                The regex works fine, but UE can't convert too many highlights to a selection. I don't know why.

                1032
                Power UserPower User
                1032

                  Apr 30, 2018#8

                  I found another problem, even with Mofi's solution:
                  The last line of my working file has no CR/LF after the closing of each <toolbarbutton> tag, so the settings of many buttons are stuck together on one line.
                  Like this:

                  Code:

                  <toolbarbutton id="custombuttons-button55" label="uouo" tooltiptext="uouo" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Estad%C3%A3o.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.estadao.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button52" label="Proxy Site" tooltiptext="Proxy Site" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Proxy%20Site-16x20.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button13" label="GoogleTradutor" tooltiptext="GoogleTradutor" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Google-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button7" label="Yahoo Calendar" tooltiptext="Yahoo Calendar" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Calendar-23x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://calendar.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button1" label="Youtube" tooltiptext="Youtube" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Youtube-48x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;www.youtube.com&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                  This is just a piece of it (example). The actual line is much longer than this.


                  So Mofi's solution won't work on it as it is now. It grabs just the last URL.
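The behaviour described here can be reproduced with a short Python sketch. The one-line sample is shortened from this post and the variable names are made up for illustration. The leading `^.*` in the line-based search string is greedy, so it runs to the last loadURI on the line before the capture group starts; searching for each occurrence separately avoids that.

```python
import re

# Three buttons stuck together on a single line (shortened, assumed sample).
line = (
    '<toolbarbutton label="uouo" cb-oncommand="loadURI(&quot;http://www.estadao.com&quot;)"/>'
    '<toolbarbutton label="Proxy Site" cb-oncommand="loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;"/>'
    '<toolbarbutton label="Yahoo Calendar" cb-oncommand="loadURI(&quot;http://calendar.google.com&quot;)"/>'
)

# Whole-line replace: the greedy ^.* swallows every loadURI except the last one.
whole_line = re.sub(r'^.*loadURI\(&quot;(.+?)&quot.*$', r'\1', line)
print(whole_line)

# Per-occurrence search: one capture per loadURI, wherever it sits on the line.
per_match = re.findall(r'loadURI\(&quot;(.+?)&quot;', line)
print(per_match)
```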

                  19476
                  MasterMaster
                  19476

                    Apr 30, 2018#9

                    Hi,

                    Try this Perl regex, please. But you must use Notepad++, because I believe there is a bug in the UE/UES regex engine.

                    F: (?s)(?:(?<!loadURI\(&quot;).)++(http(?:(?!&quot;).)++)|.*
                    R: \1\r\n

                    BR, Fleggy

                    1032
                    Power UserPower User
                    1032

                      Apr 30, 2018#10

                      Thank you, Fleggy, for your suggestion.

                      Your RegExp string works only on my previous example from post #8, where the file has just one line.
                      And yes, it worked fine using UltraEdit.
                      But it did not work when I loaded my whole XUL file, which has CR/LF between lines like my first example in post #1.
                      It did not work in Notepad++ either.

                      Please, consider using the following example (I added post #8 to post #1):

                      Code:

                      <overlay id="custombuttons-profile-overlay" xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
                        <toolbarpalette id="BrowserToolbarPalette">
                          <toolbarbutton id="custombuttons-button0" label="Google" tooltiptext="Google" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/google3-20x20.jpg" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          
                          <toolbarbutton id="custombuttons-button5" label="FreeDictionary" tooltiptext="FreeDictionary" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Dictionary-36x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.thefreedictionary.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          
                          <toolbarbutton id="custombuttons-button10" label="SpeedTest" tooltiptext="SpeedTest" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SpeedTest-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          <toolbarbutton id="custombuttons-button11" label="SuaLíngua" tooltiptext="SuaLíngua" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SuaL%C3%ADngua-16x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://wp.clicrbs.com.br/sualingua/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          <toolbarbutton id="custombuttons-button14" label="Modem" tooltiptext="Modem" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/M-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                      
                          <toolbarbutton id="custombuttons-button55" label="uouo" tooltiptext="uouo" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Estad%C3%A3o.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.estadao.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button52" label="Proxy Site" tooltiptext="Proxy Site" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Proxy%20Site-16x20.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button13" label="GoogleTradutor" tooltiptext="GoogleTradutor" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Google-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button7" label="Yahoo Calendar" tooltiptext="Yahoo Calendar" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Calendar-23x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://calendar.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button1" label="Youtube" tooltiptext="Youtube" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Youtube-48x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;www.youtube.com&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                      </toolbarpalette>
                      
                        <toolbarpalette id="MailToolbarPalette"/>
                        <toolbarpalette id="MsgComposeToolbarPalette"/>
                        <toolbarpalette id="calendarToolbarPalette"/>
                        <toolbarpalette id="NvuToolbarPalette"/>
                      </overlay>

                      6,686585
                      Grand MasterGrand Master
                      6,686585

                        Apr 30, 2018#11

                        The first thing I always do with a file containing lines with multiple strings of interest on the same line is to split those lines up into multiple lines. For example, search first for loadURI and replace all found occurrences with \r\nloadURI to insert a carriage return + line-feed before each URL, so that there is only one URL per line.
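A small Python sketch of this split-first approach, with an assumed one-line sample shortened from post #8. After the split, every line that starts with loadURI holds exactly one URL, so a simple anchored match per line is enough.

```python
import re

# Buttons stuck together on one line (shortened, assumed sample).
line = (
    '<toolbarbutton label="uouo" cb-oncommand="loadURI(&quot;http://www.estadao.com&quot;)"/>'
    '<toolbarbutton label="Proxy Site" cb-oncommand="loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;"/>'
    '<toolbarbutton label="Youtube" cb-oncommand="loadURI(&quot;www.youtube.com&quot;)&#xA;"/>'
)

# Insert a line break before every loadURI, as in the replace described above.
split_text = line.replace('loadURI', '\r\nloadURI')

# re.match anchors at the start of each line, so only lines beginning with loadURI match.
urls = []
for piece in split_text.splitlines():
    found = re.match(r'loadURI\(&quot;(.+?)&quot;', piece)
    if found:
        urls.append(found.group(1))

print("\n".join(urls))
```

Since this search does not require the target to start with http, the `www.youtube.com` button is picked up as well.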

                        Another possibility is to add the script file FindStringsToNewFile.js from "Find strings with a regular expression and output them to new file" to the Script list and then run this UltraEdit script on the active XUL file with the search expression loadURI\(&quot;(.+?)&quot and $1 as the replace expression.
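As a rough stand-in for what such a find-to-new-file script does, here is a Python sketch. It is not the code of FindStringsToNewFile.js; the function name `find_strings_to_new_file`, the output file name, and the sample content are hypothetical. Python writes the group reference as `\1` where the UltraEdit script takes `$1`.

```python
import re
import tempfile
from pathlib import Path


def find_strings_to_new_file(src: Path, dst: Path, search: str, template: str) -> int:
    """Write one expanded match per line to dst and return the number of matches."""
    text = src.read_text(encoding="utf-8")
    results = [m.expand(template) for m in re.finditer(search, text)]
    dst.write_text("".join(r + "\n" for r in results), encoding="utf-8")
    return len(results)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "buttonsoverlay.xul"
    dst = Path(tmp) / "urls.txt"  # hypothetical output name
    src.write_text(
        '<toolbarbutton cb-oncommand="loadURI(&quot;http://www.estadao.com&quot;)"/>'
        '<toolbarbutton cb-oncommand="loadURI(&quot;www.youtube.com&quot;)&#xA;"/>\n',
        encoding="utf-8",
    )
    # Search expression from the post; \1 plays the role of $1.
    count = find_strings_to_new_file(src, dst, r'loadURI\(&quot;(.+?)&quot', r'\1')
    output = dst.read_text(encoding="utf-8")
    print(count)
    print(output, end="")
```

The source file is left untouched, which is the main difference from the Replace All approaches earlier in the thread.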

                        1032
                        Power UserPower User
                        1032

                          Apr 30, 2018#12

                          @Fleggy,

                          Please, ignore my previous post (#10).
                          I made a mistake and, yes, your string works very well, even on the complete file.
                          And it also works with UltraEdit.


                          @Mofi,

                          Thank you for your suggestions.
                          Everything is fine now.

                          19476
                          MasterMaster
                          19476

                            Apr 30, 2018#13

                            Gabarito, what is your version? I tried this regex in UE 25.00.0.68 x64 and UES 18.00.0.10 x64 (Windows 10) and it did not work correctly. Bug reported and confirmed by IDM.

                            Thanks, Fleggy

                            1032
                            Power UserPower User
                            1032

                              Apr 30, 2018#14

                              fleggy wrote: Gabarito, what is your version? I tried this regex in UE 25.00.0.68 x64 and UES 18.00.0.10 x64 (Windows 10) and it did not work correctly. Bug reported and confirmed by IDM.

                              Thanks, Fleggy
                              Now I'm really confused.

                              As I said in post #10, I could not get correct results using your RegExp string.
                              When I applied the string to the sample from that same post, surprise! It ran correctly.

                              Now I'm applying the very same string to the sample from post #10 AND to my actual XUL file, which is a little bigger.
                              For the sample, your RegExp string works well, but for the actual file it gives me wrong results.
                              😱

                              My UE version is 24.20.0.62.

                              Results for the sample from post #10, using UltraEdit 24.20.0.62 on Windows 7 64-bit:

                              Code:

                              http://www.google.com
                              http://www.thefreedictionary.com
                              http://www.speedtest.net/
                              http://wp.clicrbs.com.br/sualingua/
                              http://192.168.100.1/
                              http://www.estadao.com
                              https://www.proxysite.com/
                              http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|
                              http://calendar.google.com
                              Exactly what I want.
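The translate.google.com.br entry in the list above still carries the XML entity `&amp;`, because the URL is copied as it is stored inside the attribute value. If the decoded form is wanted (an assumption; the thread does not ask for it), Python's standard `html.unescape` can convert it. A small sketch:

```python
import html

# URL exactly as it appears in the extracted list above.
raw = 'http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|'

# Decode &amp; (and any other character references) to plain characters.
decoded = html.unescape(raw)
print(decoded)
```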

                              19476
                              MasterMaster
                              19476

                                Apr 30, 2018#15

                                Well, I played with my pattern and got a crash. There is definitely a bug in the UE/UES regex engine, and only God knows which versions are not affected.

                                @Mofi: could you please test this scenario using some older version?
                                Run Replace All on Gabarito's samples with
                                F: (?s)(?:(?<!loadURI).)++\(&quot;(http(?:(?!&quot;\)).)++)
                                R: \1\r\n
                                and then UNDO. I got a crash on my desktop. Thank you
