Regular expression to extract URL strings from a XUL file

    Apr 29, 2018#1

    Hi.

    I use the Custom Buttons Firefox extension in my daily work.
    After recent Firefox updates the extension became obsolete, and newer versions of the browser refuse to load it.
    It stores its URLs in a XUL file.
    I'd like to extract just my bookmarked URLs from that file.
    Regular expressions seem to be the best solution, because the file has become rather large.

    I tried many variations to grab the URL strings after the "loadURI" key, without success.

    Here is a piece of the file from which I'm trying to extract only the URLs:

    <overlay id="custombuttons-profile-overlay" xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
      <toolbarpalette id="BrowserToolbarPalette">
        <toolbarbutton id="custombuttons-button0" label="Google" tooltiptext="Google" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/google3-20x20.jpg" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        
        <toolbarbutton id="custombuttons-button5" label="FreeDictionary" tooltiptext="FreeDictionary" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Dictionary-36x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.thefreedictionary.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        
        <toolbarbutton id="custombuttons-button10" label="SpeedTest" tooltiptext="SpeedTest" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SpeedTest-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        <toolbarbutton id="custombuttons-button11" label="SuaLíngua" tooltiptext="SuaLíngua" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SuaL%C3%ADngua-16x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://wp.clicrbs.com.br/sualingua/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
        <toolbarbutton id="custombuttons-button14" label="Modem" tooltiptext="Modem" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/M-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
    </toolbarpalette>
      <toolbarpalette id="MailToolbarPalette"/>
      <toolbarpalette id="MsgComposeToolbarPalette"/>
      <toolbarpalette id="calendarToolbarPalette"/>
      <toolbarpalette id="NvuToolbarPalette"/>
    </overlay>

    So, the output I want is something like this:

    http://www.google.com
    http://www.thefreedictionary.com
    http://www.speedtest.net/
    http://wp.clicrbs.com.br/sualingua/
    http://192.168.100.1/

    The regular expression has to discard everything before loadURI(&quot; (including loadURI(&quot; itself) and everything from &quot;)" cb-init onward (inclusive),
    and output only what lies between them.

      Apr 29, 2018#2

      Find what: (?<=loadURI\(&quot;).*(?=&)
      Check Highlight all items found
      Check Regular expression and choose Perl

      Press Next.

      Press Ctrl + , (the Convert Highlighted to selection command).
      Now you can copy the selection to a new file via Copy/Paste.

      HTH
      It's impossible to lead us astray for we don't care even to choose the way.
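
For readers who want to try the lookbehind/lookahead idea outside UltraEdit, here is a minimal Python sketch. It is only an illustration under stated assumptions: the two sample lines are shortened from post #1, Python's `re` module stands in for UltraEdit's Perl engine, and the lazy `.*?` with the narrower `(?=&quot;)` lookahead is a variation, not the exact expression from the post.

```python
import re

# Two lines shortened from the sample in post #1 (most attributes omitted).
SAMPLE = (
    '<toolbarbutton label="Google" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-mode="0"/>\n'
    '<toolbarbutton label="SpeedTest" cb-oncommand="/*CODE*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-mode="0"/>\n'
)

# Lookbehind: the match must be preceded by loadURI(&quot;
# Lookahead:  the match must be followed by &quot;
# The lazy .*? makes each match stop at the first closing &quot;
# (assumption: this differs from the greedy .*(?=&) used in the post).
URL_PATTERN = re.compile(r'(?<=loadURI\(&quot;).*?(?=&quot;)')

urls = URL_PATTERN.findall(SAMPLE)
print("\n".join(urls))
```

Because neither lookaround consumes text, each match is the bare URL, which is why the highlighted matches can be copied as-is.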

        Apr 30, 2018#3

        First run a Perl regular expression Replace All from the top of the file with the search string ^(?:(?!loadURI\(&quot;).)*$\r?\n and an empty replace string to delete all lines not containing loadURI(&quot;.

        Then run a second Perl regular expression Replace All from the top of the file with the search string ^.*loadURI\(&quot;(.+?)&quot.*$ and \1 as the replace string to delete everything around the URL on each line.

        Delete the last line of the file if it does not contain a URL; such a line can remain when the file has no line termination at its end.

        Use the Save As command to save the file, which now contains only a list of URLs, under a new name.
        Best regards from an UC/UE/UES for Windows user from Austria
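
The steps above can be sketched in Python roughly as follows. This is an illustration, not the UltraEdit procedure itself: the sample text is a shortened, hypothetical excerpt of the XUL file, `re.MULTILINE` is assumed to play the role of UltraEdit's line-anchored `^` and `$`, and the leftover last line is dropped before the second replace (rather than after it) simply because the `loadURI(&quot;` marker is still present at that point.

```python
import re

# Shortened, hypothetical excerpt; the last line has no line terminator.
SAMPLE = (
    '<toolbarpalette id="BrowserToolbarPalette">\n'
    '  <toolbarbutton label="Google" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-mode="0"/>\n'
    '  \n'
    '  <toolbarbutton label="Modem" cb-oncommand="/*CODE*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-mode="0"/>\n'
    '</toolbarpalette>'
)

MARKER = 'loadURI(&quot;'

# Step 1: delete every terminated line that does not contain the marker.
step1 = re.sub(r'^(?:(?!loadURI\(&quot;).)*$\r?\n', '', SAMPLE, flags=re.MULTILINE)

# Step 3 (done early): an unterminated last line without a URL survives
# step 1, so drop it by hand.
lines = step1.split('\n')
if MARKER not in lines[-1]:
    lines.pop()

# Step 2: replace each remaining line with the captured URL only.
step2 = re.sub(r'^.*loadURI\(&quot;(.+?)&quot.*$', r'\1',
               '\n'.join(lines), flags=re.MULTILINE)

urls = step2.split('\n')
print(step2)
```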

          Apr 30, 2018#4

          Excellent!

          Both solutions, Ovg's and Mofi's, work very well.
          I started this topic more to learn a little bit about regular expressions than to solve my task.
          My buttonsoverlay.xul file is not that big, and I could do the job manually.
          But the approaches you suggested are far better.

          Thank you both.

          🙂

            Apr 30, 2018#5

            I've found that my solution works only for files with fewer than 3000 links, so Mofi's solution is much better.

              Apr 30, 2018#6

              That's right.
              I just noticed the same thing.
              And I was wondering what happened and why your string (?<=loadURI\(&quot;).*(?=&) did not work beyond some point.
              🤔

                Apr 30, 2018#7

                The regex works fine, but UE can't convert too many highlights to a selection. I don't know why.

                  Apr 30, 2018#8

                  I found another problem, even with Mofi's solution:
                  The last line of my working file has no CR/LF after the end of each <toolbarbutton> element, so the settings of many buttons are stuck together on one line.
                  Like this:

                  <toolbarbutton id="custombuttons-button55" label="uouo" tooltiptext="uouo" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Estad%C3%A3o.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.estadao.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button52" label="Proxy Site" tooltiptext="Proxy Site" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Proxy%20Site-16x20.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button13" label="GoogleTradutor" tooltiptext="GoogleTradutor" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Google-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button7" label="Yahoo Calendar" tooltiptext="Yahoo Calendar" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Calendar-23x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://calendar.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button1" label="Youtube" tooltiptext="Youtube" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Youtube-48x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;www.youtube.com&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/>

                  This is just a piece of it, as an example. The actual line is much longer.

                  So Mofi's solution won't work on it as it is now. It can grab only the last URL on the line.
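
A small Python sketch may help show why only the last URL survives. It assumes Python's `re` behaves like the Perl engine for this pattern and uses a shortened, hypothetical version of the long line: the greedy `^.*` first runs to the end of the line and then backtracks only as far as the last `loadURI(&quot;`, so the single capture group can hold just one URL per line. Collecting every match instead avoids that.

```python
import re

# Three buttons stuck together on one line (attributes shortened).
LINE = (
    '<toolbarbutton label="uouo" cb-oncommand="loadURI(&quot;http://www.estadao.com&quot;)"/>'
    '<toolbarbutton label="Proxy Site" cb-oncommand="loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;"/>'
    '<toolbarbutton label="Yahoo Calendar" cb-oncommand="loadURI(&quot;http://calendar.google.com&quot;)"/>'
)

# Whole-line replace: the greedy ^.* leaves only the LAST URL on the line.
last_only = re.sub(r'^.*loadURI\(&quot;(.+?)&quot.*$', r'\1', LINE)

# Collecting every occurrence does not depend on line breaks at all.
all_urls = re.findall(r'loadURI\(&quot;(.+?)&quot;', LINE)

print(last_only)
print(all_urls)
```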

                    Apr 30, 2018#9

                    Hi,

                    Please try this Perl regex. But you must use Notepad++, because I believe there is a bug in the UE/UES regex engine.

                    F: (?s)(?:(?<!loadURI\(&quot;).)++(http(?:(?!&quot;).)++)|.*
                    R: \1\r\n

                    BR, Fleggy
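
As a rough illustration of the inner part of this pattern, here is a Python sketch. It is not a port of the replace-all above: possessive quantifiers (`++`) are only available in recent Python versions, so this assumed variant uses a plain `+` and collects matches with `re.findall` instead of replacing. What it keeps is the tempered token `(?:(?!&quot;).)+`, which consumes characters one at a time as long as `&quot;` does not start at the current position, and the literal `http` at the start of the capture group.

```python
import re

# Hypothetical excerpt: two buttons on one line, one more on the next line.
TEXT = (
    '<toolbarbutton label="GoogleTradutor" cb-oncommand="loadURI(&quot;'
    'http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|&quot;)"/>'
    '<toolbarbutton label="Youtube" cb-oncommand="loadURI(&quot;www.youtube.com&quot;)&#xA;"/>\n'
    '<toolbarbutton label="Modem" cb-oncommand="loadURI(&quot;http://192.168.100.1/&quot;)"/>\n'
)

# (?:(?!&quot;).)+ stops right before the closing &quot; but passes over
# other entities such as &amp; inside the URL.
PATTERN = re.compile(r'loadURI\(&quot;(http(?:(?!&quot;).)+)', re.DOTALL)

urls = PATTERN.findall(TEXT)
print("\n".join(urls))
```

Note that a target without the `http` prefix, such as www.youtube.com, is not captured by a group that starts with the literal `http`.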

                      Apr 30, 2018#10

                      Thank you, Fleggy, for your suggestion.

                      Your RegExp string works only on my previous example from post #8, where the file has just one line.
                      And yes, it worked fine in UltraEdit.
                      But it did not work when I loaded my whole XUL file, which has CR/LF between lines like my first example in post #1.
                      It did not work in Notepad++ either.

                      Please consider the following example (I appended the sample from post #8 to the one from post #1):

                      <overlay id="custombuttons-profile-overlay" xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul">
                        <toolbarpalette id="BrowserToolbarPalette">
                          <toolbarbutton id="custombuttons-button0" label="Google" tooltiptext="Google" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/google3-20x20.jpg" cb-oncommand="loadURI(&quot;http://www.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          
                          <toolbarbutton id="custombuttons-button5" label="FreeDictionary" tooltiptext="FreeDictionary" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Dictionary-36x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.thefreedictionary.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          
                          <toolbarbutton id="custombuttons-button10" label="SpeedTest" tooltiptext="SpeedTest" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SpeedTest-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.speedtest.net/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          <toolbarbutton id="custombuttons-button11" label="SuaLíngua" tooltiptext="SuaLíngua" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/SuaL%C3%ADngua-16x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://wp.clicrbs.com.br/sualingua/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                          <toolbarbutton id="custombuttons-button14" label="Modem" tooltiptext="Modem" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/M-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://192.168.100.1/&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                      
                          <toolbarbutton id="custombuttons-button55" label="uouo" tooltiptext="uouo" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Estad%C3%A3o.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://www.estadao.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button52" label="Proxy Site" tooltiptext="Proxy Site" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Proxy%20Site-16x20.png" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button13" label="GoogleTradutor" tooltiptext="GoogleTradutor" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Google-20x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button7" label="Yahoo Calendar" tooltiptext="Yahoo Calendar" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Calendar-23x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;http://calendar.google.com&quot;)" cb-init="/*Código de inicialização*/" cb-mode="0"/><toolbarbutton id="custombuttons-button1" label="Youtube" tooltiptext="Youtube" class="toolbarbutton-1 chromeclass-toolbar-additional" context="custombuttons-contextpopup" image="file:///B:/Meus%20documentos/Firefox/%C3%8Dcones/Youtube-48x20.jpg" cb-oncommand="/*CÓDIGO*/&#xA;loadURI(&quot;www.youtube.com&quot;)&#xA;" cb-init="/*Código de inicialização*/" cb-mode="0"/>
                      </toolbarpalette>
                      
                        <toolbarpalette id="MailToolbarPalette"/>
                        <toolbarpalette id="MsgComposeToolbarPalette"/>
                        <toolbarpalette id="calendarToolbarPalette"/>
                        <toolbarpalette id="NvuToolbarPalette"/>
                      </overlay>

                        Apr 30, 2018#11

                        The first thing I always do with a file containing lines with multiple strings of interest on the same line is to split those lines up into multiple lines. For example, first search for loadURI and replace all occurrences found with \r\nloadURI to insert a carriage return + line-feed before each URL, so that there is only one URL per line.

                        Another possibility is to add the script file FindStringsToNewFile.js from the topic "Find strings with a regular expression and output them to new file" to the Script list and then run this UltraEdit script on the active XUL file with the search expression loadURI\(&quot;(.+?)&quot and $1 as the replace expression.
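
The line-splitting idea from the first paragraph can be sketched in Python like this. It is a minimal illustration on a shortened, hypothetical line, not the UltraEdit replace or the FindStringsToNewFile.js script: a line break is inserted before every loadURI, after which each URL sits at the start of its own line and a simple per-line search is enough.

```python
import re

# Two buttons stuck together on one line (attributes shortened).
LINE = (
    '<toolbarbutton label="uouo" cb-oncommand="loadURI(&quot;http://www.estadao.com&quot;)"/>'
    '<toolbarbutton label="Proxy Site" cb-oncommand="loadURI(&quot;https://www.proxysite.com/&quot;)&#xA;"/>'
)

# Insert CR+LF before each loadURI so there is only one URL per line.
split_text = LINE.replace('loadURI', '\r\nloadURI')

urls = []
for line in split_text.splitlines():
    # After the split, a line of interest begins with loadURI(&quot;
    match = re.search(r'^loadURI\(&quot;(.+?)&quot', line)
    if match:
        urls.append(match.group(1))

print("\n".join(urls))
```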

                          Apr 30, 2018#12

                          @Fleggy,

                          Please ignore my previous post (#10).
                          I made a mistake and, yes, your string works very well, even on the complete file.
                          And it works with UltraEdit too.


                          @Mofi,

                          Thank you for your suggestions.
                          Everything is fine now.

                            Apr 30, 2018#13

                            Gabarito, what is your version? I tried this regex in UE 25.00.0.68 x64 and UES 18.00.0.10 x64 (Windows 10) and it did not work correctly. Bug reported and confirmed by IDM.

                            Thanks, Fleggy

                              Apr 30, 2018#14

                              fleggy wrote: Gabarito, what is your version? I tried this regex in UE 25.00.0.68 x64 and UES 18.00.0.10 x64 (Windows 10) and it did not work correctly. Bug reported and confirmed by IDM.

                              Thanks, Fleggy

                              Now I'm really confused.

                              As I said in post #10, I could not get correct results using your RegExp string.
                              When I applied the string to the sample from that same post, surprise! It ran correctly.

                              Now I'm applying the very same string to the sample from post #10 AND to my actual XUL file, which is a little bigger.
                              For the sample, your RegExp string works well, but for the actual file it gives me wrong results.
                              😱

                              My UE version is 24.20.0.62.

                              Results for the sample from post #10 using UltraEdit 24.20.0.62 on Windows 7 64-bit:

                              http://www.google.com
                              http://www.thefreedictionary.com
                              http://www.speedtest.net/
                              http://wp.clicrbs.com.br/sualingua/
                              http://192.168.100.1/
                              http://www.estadao.com
                              https://www.proxysite.com/
                              http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|
                              http://calendar.google.com

                              Exactly what I want.
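
One detail of the list above: the URLs are copied straight out of an XML attribute, so they still carry entities such as `&amp;`. If they are to be used outside the XUL file, they would presumably need decoding. A minimal Python sketch, assuming the standard-library `html.unescape` is an acceptable stand-in for XML entity decoding on these particular strings:

```python
from html import unescape

# Two entries taken from the result list above.
extracted = [
    'http://www.google.com',
    'http://translate.google.com.br/?hl=pt-br&amp;ie=UTF-8&amp;tab=bT#en|pt|',
]

# Turn &amp; back into a plain & so the URL can be used as-is.
decoded = [unescape(url) for url in extracted]
print("\n".join(decoded))
```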

                                Apr 30, 2018#15

                                Well, I played with my pattern and got a crash. There is definitely a bug in the UE/UES regex engine, and only God knows which versions are not affected.

                                @Mofi: could you please test this scenario using some older version?
                                Replace all on Gabarito's samples
                                F: (?s)(?:(?<!loadURI).)++\(&quot;(http(?:(?!&quot;\)).)++)
                                R: \1\r\n
                                and then UNDO. I got a crash on my desktop. Thank you
