Friday, 7 July 2017

Python, regex and html: match final tag on line



I'm confused about Python's greedy/non-greedy quantifiers.



"Given multi-line html, return the final tag on each line."



I would think this would be correct:




re.findall('<.*?>$', html, re.MULTILINE)


I'm irked because I expected a list of single tags like:



"", "
    ", "".


My O'Reilly's Pocket Reference says that *? will "match 0 or more times, but as few times as possible."




So why am I getting 'greedier' matches, i.e., more than one tag in some (but not all) matches?


Answer



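Your problem stems from the end-of-line anchor ('$'). A regex engine tries the leftmost possible starting position first, so the match begins at the first '<' on the line. The non-greedy '.*?' then consumes as few characters as it can, but '.' also matches '>', and the '>' in your pattern is constrained by the $ anchor to sit at the end of the line. When the first '>' is not followed by the end of the line, the engine keeps extending '.*?' until it reaches a '>' that is. So a non-greedy * is no different from a greedy * in this situation: on any line that ends in '>', both match from the first '<' to the end of the line.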
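To make this concrete, here is a minimal sketch comparing the lazy and greedy forms with the anchor in place. The sample `html` string is a made-up input, not the one from the question.

```python
import re

# Hypothetical sample input (not taken from the original question).
html = "<p><b>bold</b></p>\n<ul><li>item</li>\n<br>"

# Lazy quantifier plus end-of-line anchor, as in the question.
lazy = re.findall(r'<.*?>$', html, re.MULTILINE)

# Greedy quantifier with the same anchor, for comparison.
greedy = re.findall(r'<.*>$', html, re.MULTILINE)

print(lazy)    # ['<p><b>bold</b></p>', '<ul><li>item</li>', '<br>']
print(greedy)  # same list: the anchor forces both to run to the end of the line
```

Only the last line yields a single tag, because there the first '<' on the line already belongs to the final tag.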



Since you cannot remove the '$' from your RE (you are looking for the final tag on a line), you will need to take a different tack...see @Mark's answer. '<[^><]*>$' will work: the character class cannot match '<' or '>', so the match can never span more than one tag.
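A short sketch of the negated-character-class pattern in use. The sample `html` string is again a made-up input chosen for illustration.

```python
import re

# Hypothetical sample input (not taken from the original question).
html = "<p><b>bold</b></p>\n<ul><li>item</li>\n<br>"

# [^><]* cannot match '<' or '>', so each match is exactly one tag,
# and the $ anchor (with re.MULTILINE) pins it to the end of a line.
final_tags = re.findall(r'<[^><]*>$', html, re.MULTILINE)

print(final_tags)  # ['</p>', '</li>', '<br>']
```

Note that a line which does not end in '>' (for example, one with trailing whitespace or text after the last tag) produces no match with this pattern.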

