Finding an XML/HTML tag whose opening occurs twice or more before the closing tag?


    Feb 28, 2016 #1

    Sample

    Code: Select all

    <p>REFERENCES</p>
    <p>Agassiz, Louis, 1840, Etudes sur les glaciers: Neuchatel, privately published, 346 p.</p>
    <p>Andrews, Edmund, 1870, The North American lakes considered as chronometers of post-glacial time: Chicago Acad. Sci. Trans., v. 2, 1870, p. 1-23</p>
    <p>Baker, F. C, 1920, The life of the Pleistocene or glacial period as recorded in the deposits laid down by the great ice sheets: Univ. Illinois Bull., v. 17, no. 41, 476 p.</p>
    <p>Capps, S. R., 1915, <p>An estimate of the age of the last great glaciation in Alaska: Washington Acad. Sci. J., v. 5, p. 108-115</p>
    <p>Chamberlin, T. C, 1877, Geology of eastern Wisconsin, in Geology of Wisconsin, survey of 1873-1877, v. 2: Madison, Commissioners of Public Printing, p. 97-246</p>
    <p>—1878, on the extent and significance of the Wisconsin kettle moraine: Wisconsin Acad. Sci., Arts Lett. Trans., v. 4 (1876-77), p. 201-234</p>
    <p>—1883, Terminal moraine<p> of the second <p>glacial epoch: U.S. Geol. Surv. Ann. Rep. 3, p. 291-402</p>
    <p>—1888, The rock-scorings of the great ice invasions: U.S. Geol. Surv., Ann. Rep. 7, p. 147-248</p>
    <p>—1895, The classification of American glacial deposits: J. Geol., v. 3, p. 270-277</p>
    <p>—1896, Nomenclature of glacial formations: J. Geol., v. 4, p. 872-876</p>
    <p>Conrad, T. A., 1839, Notes on American geology: Amer. J. Sci., v. 35, p. 237-251</p>
    <p>Daly, R. A., 1910, Pleistocene glaciation and the coral reef problem: Amer. J. Sci., v. 30, p. 297-308</p>
    <p>—1915, The glacial-control theory of coral reefs: Amer. Acad. Arts Sci. Proc, v. 51, p. 157-251</p>
    <p>Dana, J. D., 1863, Manual of geology: Philadelphia, Theodore Bliss & Co., 1st ed., 798 p.
    dfsfds
    <p>—1873, On the Glacial and Champlain eras in New England: Amer. J. Sci., v. 5, p. 198-211</p>
    <p>—1875, Manual of geology: New York, Ivison, Blake-man, Taylor & Co., 2d. ed., 828 p.</p>
    <p>—1895, Manual of geology: New York, American Book Co., 4th ed., 1087 p.</p>
    <p>Darton, N. H., 1902, Description of the Norfolk quadrangle: U.S. Geol. Surv. Geol. Atlas, Folio 80, 4 p.</p>
    <p>Dobson, Peter, 1826, <c><p>Remarks on bowlders</p></c>: Amer. J. Sci., v. 10, p. 217-218</p>
    <p>Gale, H. S., 1914, Salines in the Owens, Searles, and Pana-mint basins, southeastern California: U.S. Geol. Surv. Bull. 580, p. 251-323</p>
    <p>Geikie, James, 1874, The great ice age and its relation to the antiquity of man: London, W. Isbister, 575 p.</p>
    <p>—1894, The great ice age and its relation to the antiquity of man: London, Stanford, 3d. ed., 850 p.</p>
    <p>Gilbert, G. K., 1871, On certain glacial and post-glacial phenomena of the Maumee Valley: Amer. J. Sci., v. 1, p. 339-345</p>
    <p>—1890, Lake Bonneville: U.S. Geol. Surv. Monogr. 1, 438 p.</p>
    <p>Gray, Asa, 1878, Forest geography and archaeology: Amer. J. Sci., v. 16, p. 85-94, 183-196</p>
    <p>Hay, O. P., 1914, The Pleistocene mammals of Iowa: Iowa Geol. Surv., v. 42, Ann. Rep. for 1912, p. 1-662</p>
    <p>—1923, 1924, 1927, The Pleistocene of North America and its vertebrated animals . . . : Carnegie Instn. Publ. 322, 322A, 322B (3 v.)</p>
    <p>Hitchcock, C. H., 1878, Surface geology, in The geology of New Hampshire: Concord, v. 3, pt. 3, 340 p.</p>
    <p>Hitchcock, Edward, 1841, First anniversary address before the Association of American Geologists . . . : Amer. J. Sci., v. 41, p. 232-275</p>
    
    I am looking for a regex search that finds only the strings "<p>...<p>" that contain no "</p>" tag between the two opening tags.

    I'm using the Perl expression "<p>.*(?<!</p>)?<p>", which partially does the job, but it doesn't find the multi-line string below, even after I add "(?s)" in front of the expression:

    Code: Select all

    <p>Dana, J. D., 1863, Manual of geology: Philadelphia, Theodore Bliss & Co., 1st ed., 798 p.
    dfsfds
    <p>
    It also doesn't work properly on the string

    Code: Select all

    <p>—1883, Terminal moraine<p> of the second <p>
    where I want it to find "<p>—1883, Terminal moraine<p>" and "<p> of the second <p>" as two separate matches, not as one combined match.

    Can anyone help me with this? :|


      Feb 28, 2016 #2

      The Perl regular expression search string (?s)<p\b(?:.(?!</p>)(?!<p\b))+.(?:</p>)? finds paragraphs with or without </p> at the end.

      The Perl regular expression search string (?s)<p\b(?:.(?!</p>)(?!<p\b))+.(?=<p\b) is the one you need: it finds a paragraph that has no end tag before the start tag of the next paragraph.

      (?s) ... makes . also match newline characters.

      <p\b ... a paragraph start tag with or without attributes. The character p must be a complete word, which is verified by \b (word boundary).

      (?:...)+ ... the expression in this non-capturing (non-marking) group must match 1 or more times.

      .(?!</p>)(?!<p\b) ... find any character that is followed by neither a paragraph end tag nor a new paragraph start tag. This is verified by two negative lookaheads, which consume no characters.

      . ... the last character before the paragraph end / start tag must additionally be matched in any case, because the group above stops one character before that tag.

      On first search string only:

      (?:</p>)? ... also match the end tag if it follows immediately; the tag is optional.

      On second search string only:

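      (?=<p\b) ... the search only succeeds if the last character of the paragraph is followed by a paragraph start tag. This is verified by a positive lookahead, which consumes no characters, so that start tag is not part of the match and the next search can begin at it.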
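
The behaviour of both search strings can be checked outside the editor. The following is only a minimal sketch in Python, assuming that Python's re module treats these particular constructs ((?s), \b, non-capturing groups, lookaheads) the same way as the editor's Perl regular expression engine. The names SAMPLE, find_unclosed and find_paragraphs are made up for this illustration, and the sample is a shortened, hypothetical version of the reference list from the question.

```python
import re

# The two search strings from the post above, unchanged.
WITH_OR_WITHOUT_END = re.compile(r"(?s)<p\b(?:.(?!</p>)(?!<p\b))+.(?:</p>)?")
UNCLOSED = re.compile(r"(?s)<p\b(?:.(?!</p>)(?!<p\b))+.(?=<p\b)")

# Hypothetical, shortened sample modelled on the reference list in the question.
SAMPLE = (
    "<p>Agassiz, 1840</p>\n"
    "<p>Terminal moraine<p> of the second <p>glacial epoch</p>\n"
    "<p>Dana, 1863\n"
    "dfsfds\n"
    "<p>Dana, 1873</p>\n"
)


def find_unclosed(text):
    """Return every paragraph that runs into the next <p without a </p>."""
    return [m.group(0) for m in UNCLOSED.finditer(text)]


def find_paragraphs(text):
    """Return every paragraph, whether or not it ends with </p>."""
    return [m.group(0) for m in WITH_OR_WITHOUT_END.finditer(text)]


if __name__ == "__main__":
    for hit in find_unclosed(SAMPLE):
        # repr() makes the newlines inside a multi-line match visible.
        print(repr(hit))
```

On this sample the second search string reports the two unclosed pieces of the "Terminal moraine" line separately, and the multi-line "Dana, 1863" paragraph as one match. Each match ends just before the next <p, because the lookahead does not consume the start tag.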
      Best regards from an UC/UE/UES for Windows user from Austria