How to find if there is an upper case letter immediately following a dot in a text file with some exceptions?

    Jul 19, 2017 #1

    I'm trying to find any upper case letter that immediately follows a dot in a file, except where it falls inside either of the tags <given-names>...</given-names> or <uri>...</uri>.

    Sample text:

    <title-group>
    <article-title>The coreceptor mutation CCR5&#x0394;32 influences the dynamics of HIV epidemics and is selected for by HIV</article-title>
    </title-group>
    <contrib-group>
    <contrib contrib-type="author">
    <name> <surname>Sullivan</surname> <given-names>Amy D.V</given-names> </name>
    <xref ref-type="author-notes" rid="FN150">&#x002A;</xref>
    </contrib>
    <contrib contrib-type="author">
    <name> <surname>Wigginton</surname> <given-names>Janis</given-names> </name>
    </contrib>
    <contrib contrib-type="author">
    <name> <surname>Kirschner</surname> <given-names>Denise A.K.</given-names> </name>
    <xref ref-type="author-notes" rid="FN151">&#x2020;</xref>
    </contrib>
    </contrib-group>
    <aff>Department of Microbiology and Immunology, University of Michigan Medical School, Ann Arbor, MI 48109-0620 <uri>http://www.Amazon.in/b?node=11962098031</aff>
    <author-notes>
    <body>
    <p>Nineteen million people have died of AIDS since the discovery of HIV in the 1980s.In 1999 alone, 5.4 million people were newly infected with HIV (ref. <xref ref-type="bibr" rid="B1">1</xref> and <ext-link ext-link-type="url" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="http://www.unaids.org/epidemicupdate/report/Epireport.html">http://www.unaids.org/epidemicupdate/report/Epireport.html</ext-link>). (For brevity, HIV-1 is referred to as HIV in this paper.) Sub-Saharan Africa has been hardest hit, with more than 20&#x0025; of the general population HIV-positive in some countries (<xref ref-type="bibr" rid="B2">2</xref>, <xref ref-type="bibr" rid="B3">3</xref>). In comparison, heterosexual epidemics in developed, market-economy countries have not reached such severe levels. Factors contributing to the severity of the epidemic in economically developing countries abound, including economic, health, and social differences such as high levels of sexually transmitted diseases and a lack of prevention programs.However, the staggering rate at which the epidemic has spread in sub-Saharan Africa has not been adequately explained. The rate and severity of this epidemic also could indicate a greater underlying susceptibility to HIV attributable not only to sexually transmitted disease, economics, etc., but also to other more ubiquitous factors such as host genetics (<xref ref-type="bibr" rid="B4">4</xref>, <xref ref-type="bibr" rid="B5">5</xref>).</p>
    <p>To exemplify the contribution of such a host genetic factor to HIV prevalence trends, we consider a well-characterized 32-bp deletion in the host-cell chemokine receptor CCR5, CCR5&#x0394;32. When HIV binds to host cells, it uses the CD4 receptor on the surface of host immune cells together with a coreceptor, mainly the CCR5 and CXCR4 chemokine receptors (<xref ref-type="bibr" rid="B6">6</xref>). Homozygous mutations for this 32-bp deletion offer almost complete protection from HIV infection, and heterozygous mutations are associated with lower pre-AIDS viral loads and delayed progression to AIDS (<xref ref-type="bibr" rid="B7">7</xref>&#x2013;<xref ref-type="bibr" rid="B14">14</xref>). CCR5&#x0394;32 generally is found in populations of European descent, with allelic frequencies ranging from 0 to 0.29 (<xref ref-type="bibr" rid="B13">13</xref>). African and Asian populations studied outside the United States or Europe appear to lack the CCR5&#x0394;32 allele, with an allelic frequency of almost zero (<xref ref-type="bibr" rid="B5">5</xref>, <xref ref-type="bibr" rid="B13">13</xref>). Thus, to understand the effects of a protective allele, we use a mathematical model to track prevalence of HIV in populations with or without CCR5&#x0394;32 heterozygous and homozygous people and also to follow the CCR5&#x0394;32 allelic frequency.</p>
    <p>We hypothesize that CCR5&#x0394;32 limits epidemic HIV by decreasing infection rates, and we evaluate the relative contributions to this by the probability of infection and duration of infectivity. To capture HIV infection as a chronic infectious disease together with vertical transmission occurring in untreated mothers, we model a dynamic population (i.e., populations that vary in growth rates because of fluctuations in birth or death rates) based on realistic demographic characteristics (<xref ref-type="bibr" rid="B18">18</xref>). This scenario also allows tracking of the allelic frequencies over time. This work considers how a specific host genetic factor affecting HIV infectivity and viremia at the individual level might influence the epidemic in a dynamic population and how HIV exerts selective pressure, altering the frequency of this mutant allele.</p>
    <p>CCR5 is a host-cell chemokine receptor, which is also used as a coreceptor by R5 strains of HIV that are generally acquired during sexual transmission (<xref ref-type="bibr" rid="B6">6</xref>, <xref ref-type="bibr" rid="B19">19</xref>&#x2013;<xref ref-type="bibr" rid="B25">25</xref>). As infection progresses to AIDS the virus expands its repertoire of potential coreceptors to include other CC-family and CXC-family receptors in roughly 50&#x0025; of patients (<xref ref-type="bibr" rid="B19">19</xref>, <xref ref-type="bibr" rid="B26">26</xref>, <xref ref-type="bibr" rid="B27">27</xref>). CCR5&#x0394;32 was identified in HIV-resistant people (<xref ref-type="bibr" rid="B28">28</xref>). Benefits to individuals from the mutation in this allele are as follows. Persons homozygous for the CCR5&#x0394;32 mutation are almost nonexistent in HIV-infected populations (<xref ref-type="bibr" rid="B11">11</xref>, <xref ref-type="bibr" rid="B12">12</xref>) (see ref. <xref ref-type="bibr" rid="B13">13</xref> for review). Persons heterozygous for the mutant allele (CCR5 W/&#x0394;32) tend to have lower pre-AIDS viral loads. Aside from the beneficial effects that lower viral loads may have for individuals, there is also an altruistic effect, as transmission rates are reduced for individuals with low viral loads (as compared with, for example, AZT and other studies; ref. <xref ref-type="bibr" rid="B29">29</xref>). Finally, individuals heterozygous for the mutant allele (CCR5 W/&#x0394;32) also have a slower progression to AIDS than those homozygous for the wild-type allele (CCR5 W/W) (<xref ref-type="bibr" rid="B7">7</xref>&#x2013;<xref ref-type="bibr" rid="B10">10</xref>), remaining in the population 2 years longer, on average. 
Interestingly, the dearth of information on HIV disease progression in people homozygous for the CCR5&#x0394;32 allele (CCR5 &#x0394;32/&#x0394;32) stems from the rarity of HIV infection in this group (<xref ref-type="bibr" rid="B4">4</xref>, <xref ref-type="bibr" rid="B12">12</xref>, <xref ref-type="bibr" rid="B28">28</xref>). However, in case reports of HIV-infected CCR5 &#x0394;32/&#x0394;32 homozygotes, a rapid decline in CD4<sup>&#x002B;</sup> T cells and a high viremia are observed, likely because of initial infection with a more aggressive viral strain (such as X4 or R5X4) (<xref ref-type="bibr" rid="B30">30</xref>) <uri>www.Wordmvp.com/lo</uri>, <uri>istec.com.Au</uri> and <uri>google.com</uri></p>
    <sec>
    <title>The Model</title>
    <p>Estimates for rates that govern the interactions depicted in Fig. <xref ref-type="fig" rid="F1">1</xref> were derived from the extensive literature on HIV. Our parameters and their estimates are summarized in Tables <xref ref-type="table" rid="T2">2</xref>&#x2013;<xref ref-type="table" rid="T4">4</xref>. The general form of the equations describing the rates of transition between population classes as depicted in Fig. <xref ref-type="fig" rid="F1">1</xref> are summarized as follows: <disp-formula id="E1">
    <tex-math id="M1">\documentclass[12pt]{minimal} \usepackage{wasysym} \usepackage[substack]{amsmath} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage[mathscr]{eucal} \usepackage{mathrsfs} \DeclareFontFamily{T1}{linotext}{} \DeclareFontShape{T1}{linotext}{m}{n} { <-> linotext }{} \DeclareSymbolFont{linotext}{T1}{linotext}{m}{n} \DeclareSymbolFontAlphabet{\mathLINOTEXT}{linotext} \begin{document} $$ \frac{dS_{i,j}(t)}{dt}={\chi}_{i,j}(t)-{\mu}_{j}S_{i,j}(t)-{\lambda}_{\hat {\imath},\hat {},\hat {k}{\rightarrow}i,j}S_{i,j}(t), $$ \end{document} </tex-math>
    </disp-formula>
    </p>
    </sec>
    </body>
    <back>
    <title>Reference</title>
    <ref id="ref1"><mixed-citation><string-name> <surname>Kirschner</surname> <given-names>Denise A.K.</given-names> </string-name>, <string-name> <surname>Bennet</surname> <given-names>D.K.</given-names> </string-name> <source>Insomnia</source>. <volume>2</volume> <uri>https://www.gcflearnfree.org/access2007/</uri></mixed-citation>

    The search should find only two matches in the above example: .I from 1980s.In 1999 and .H from programs.However.

    I have tried \.\u(?!([^<>]?</given-names>|[^<>]?</uri>)) but it doesn't work, and I can't think of any other way either.

    Can anyone help? :(

      Jul 19, 2017 #2

      Hi Don,

      your regular expression

      \.     # match a dot
      \u     # match an upper letter
      (?!                # not followed by
       (                 # either
        [^<>]*+          # optionally any symbol but < or >
        </given-names>   # and closing tag given-names
        |                # or
        [^<>]*+          # optionally any symbol but < or >
        </uri>           # and closing tag uri
       )
      )

      cannot find anything because, as you can perhaps see now, the negative lookahead is applied immediately after the matched upper case letter.

      I have modified your regular expression a little:

      (?x)
      \.\u++
      (?=                                        # the rest of text in lookahead
       [^<>]*+                                   # find the closest closing tag
       (?!</given-names>|</uri>)                 # the additive negative lookahead to check ending tag
      )                                          # end of lookahead

      You were very close :)

      BR, Fleggy

      EDIT:
      Your sample contains an unclosed <uri> tag (<uri>http://www.Amazon.in/b?node=11962098031</aff>), so .A is matched as well.
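
For anyone who wants to check the modified expression outside UltraEdit, here is a rough Python sketch. It rests on a few assumptions that are not part of the thread: Python's `re` module does not use `\u` for upper case letters, so `[A-Z]` stands in for it, and the possessive `[^<>]*+` is written as `[^<>]*(?![^<>])`, which pins the run to its full length and also works on Python versions older than 3.11. The sample strings are shortened fragments of the text posted above.

```python
import re

# Stand-in for \.\u(?=[^<>]*+(?!</given-names>|</uri>)):
# [A-Z] replaces \u, and (?![^<>]) forces the run of non-bracket
# characters to extend up to the next "<", ">" or end of text.
PATTERN = re.compile(r"\.[A-Z](?=[^<>]*(?![^<>])(?!</given-names>|</uri>))")

SAMPLE = (
    "<given-names>Amy D.V</given-names>\n"
    "<given-names>Denise A.K.</given-names>\n"
    "<p>discovery of HIV in the 1980s.In 1999 alone, 5.4 million people "
    "(ref. <xref rid=\"B1\">1</xref>) lack of prevention programs.However, "
    "the rate</p>\n"
    "<uri>www.Wordmvp.com/lo</uri>, <uri>istec.com.Au</uri>\n"
)

matches = PATTERN.findall(SAMPLE)
print(matches)  # ['.I', '.H']

# The mismatched <uri>...</aff> from the original sample is still reported.
typo = "<uri>http://www.Amazon.in/b?node=11962098031</aff>"
typo_matches = PATTERN.findall(typo)
print(typo_matches)  # ['.A']
```

The second check reproduces the extra .A hit: the run after the letter ends at </aff>, which is neither of the two excluded closing tags.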

        Jul 20, 2017 #3

        Thanks Fleggy :mrgreen:

        The incomplete tag <uri>http://www.Amazon.in/b?node=11962098031</aff> was a typo; </aff> should be </uri>.

        My bad :lol:

        BTW: What does (?x) mean in the regex?

          Jul 20, 2017 #4

          It turns free-spacing mode on. In other words, all whitespace in the pattern (new lines, spaces, tabs) is ignored, so you can format the expression to make it more legible.
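
A small Python sketch of the same idea, for illustration only. Python's `re` module also accepts an inline `(?x)` at the start of a pattern; the names `compact` and `spaced` are made up for this example, and `[A-Z]` is used in place of UltraEdit's `\u`.

```python
import re

# Compact form, with [A-Z] standing in for UltraEdit's \u.
compact = re.compile(r"\.[A-Z]")

# The same pattern in free-spacing mode: whitespace and # comments
# inside the pattern are ignored by the regex engine.
spaced = re.compile(r"""(?x)
    \.       # a literal dot
    [A-Z]    # one upper case letter
""")

text = "since the 1980s.In 1999 alone, 5.4 million"
compact_hits = compact.findall(text)
spaced_hits = spaced.findall(text)
print(compact_hits)  # ['.I']
print(spaced_hits)   # ['.I']
```

In this Python sketch, a literal space that should be matched in free-spacing mode has to be written as `\ ` or `[ ]`, because a bare space is dropped.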

            Jul 21, 2017 #5

            Hi Fleggy,
            Is ++ required in the expression \.\u++? I only need to find whether an upper case letter immediately follows a dot; what follows that letter is irrelevant.
            So I used \.\u(?=[^<>]*+(?!</given-names>|</uri>)) on the sample that I posted in this topic, and it seems to work. Is it okay to use it this way?

              Jul 21, 2017 #6

              Hi Don,

              of course it is OK. You can even use this modification which matches only the dot (the rest is inside the lookahead):

              \.(?=\u[^<>]*+(?!</given-names>|</uri>))

              BR, Fleggy
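
To see what "matches only the dot" changes in practice, here is a hedged Python sketch comparing the two variants. As assumptions for this example, `[A-Z]` replaces `\u`, `[^<>]*(?![^<>])` replaces the possessive `[^<>]*+`, and the names `TAIL`, `dot_and_letter` and `dot_only` are invented. The final substitution is a hypothetical use that the thread itself does not ask for.

```python
import re

# Shared tail: run to the next bracket, then reject the two closing tags.
TAIL = r"[^<>]*(?![^<>])(?!</given-names>|</uri>)"
dot_and_letter = re.compile(r"\.[A-Z](?=" + TAIL + ")")
dot_only = re.compile(r"\.(?=[A-Z]" + TAIL + ")")

text = "<p>prevention programs.However, the rate</p>"

m1 = dot_and_letter.search(text)
m2 = dot_only.search(text)
print(m1.group())  # .H
print(m2.group())  # .

# Hypothetical use: insert the missing space without retyping the letter.
fixed = dot_only.sub(". ", text)
print(fixed)  # <p>prevention programs. However, the rate</p>
```

Both variants find the same position; the difference is only in what ends up selected, which matters once the match is used in a replacement.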

                Jul 26, 2017 #7

                Hi Fleggy,
                Is there an alternative to the expression [^<>]*+? The old version of UE that I use for work reports an invalid regex, and I think the *+ portion is the main issue.
                Thanks in advance. :mrgreen:

                  Jul 26, 2017 #8

                  Hi Don,

                  replace the part
                  [^<>]*+
                  with
                  (?>[^<>]*)

                  I hope your version supports atomic groups ;)

                  BR, Fleggy

                    Jul 27, 2017 #9

                    Thanks Fleggy. :mrgreen:
                    Btw, does this modified version (using an atomic group) have any issues or restrictions, or does it do exactly what the previous expression did? :|

                      Jul 27, 2017 #10

                      Hi Don,

                      Both expressions do exactly the same thing, don't worry :)

                      BR, Fleggy
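
The reason the possessive quantifier or the atomic group matters at all can be shown with a short Python sketch. This is an illustration under assumptions, not a statement about UltraEdit's engine: `[A-Z]` replaces `\u`, the names `greedy` and `emulated` are made up, and the lookahead-plus-backreference form is a commonly used emulation of an atomic group for engines that lack one. Python only accepts `*+` and `(?>...)` from version 3.11 on, so that part is guarded by a version check.

```python
import re
import sys

TEXT = "<given-names>Denise A.K.</given-names>"
TAGS = r"(?!</given-names>|</uri>)"

# Plain greedy run: the engine may give back the trailing "." and
# re-test the negative lookahead one character earlier, where it succeeds.
greedy = re.compile(r"\.(?=[A-Z][^<>]*" + TAGS + ")")

# Emulated atomic group: capture the run inside a lookahead, then
# consume exactly that capture with a backreference.
emulated = re.compile(r"\.(?=[A-Z](?=([^<>]*))\1" + TAGS + ")")

greedy_hit = greedy.search(TEXT) is not None
emulated_hit = emulated.search(TEXT) is not None
print(greedy_hit)    # True: a false positive inside <given-names>
print(emulated_hit)  # False

# Python 3.11+ accepts both spellings from the thread directly.
if sys.version_info >= (3, 11):
    possessive = re.compile(r"\.(?=[A-Z][^<>]*+" + TAGS + ")")
    atomic = re.compile(r"\.(?=[A-Z](?>[^<>]*)" + TAGS + ")")
    print(possessive.search(TEXT) is None, atomic.search(TEXT) is None)
```

With a plain `[^<>]*` the .K in A.K. would be reported, which is the kind of hit the original question wants to exclude; the possessive and atomic spellings both prevent that backtracking.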