When you use regular expressions to grab text from files, especially HTML files, you often run into a case where the easiest way to find the specific information you are looking for is to use the HTML tags as qualifiers.
For example:
<STRONG>Here is the text I want to grab</STRONG>
In a case like the one above, the only good qualifiers are the HTML tags.
But you do not want a list of matches that still includes the tags, like:
<STRONG>Here is the first occurrence</STRONG>
<STRONG>This could be the second occurrence</STRONG>
Instead you want only the text inside the specific HTML tags. To accomplish that, you can use an advanced regular expression feature called zero-width positive lookbehind.
With zero-width positive lookbehind you can use one part of the expression as a qualifier and another part to grab the text. Let's make a RegEx that handles the case above.
(?<=<STRONG>)[a-zA-Z ]*
The magic part of the RegEx is the zero-width positive lookbehind, which requires that the text <STRONG> appears immediately before the match without becoming part of the match. It is written as (?<=plain text). Many regex engines only accept a fixed-length pattern inside the lookbehind, so plain text is the safest thing to put there.
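The second part of the RegEx matches any number of characters from a-z and A-Z plus ' ' (space). Because the character < is NOT one of those characters, the match stops right before </STRONG>, so the closing tag is not included. Digits and punctuation are not covered either, so extend the character class if the text you want contains them.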
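To see the difference in practice, here is a minimal sketch in Python using the standard re module. The language choice, the sample input string, and the variable names are my own assumptions for illustration; the lookbehind pattern is the one shown above.

```python
import re

# Sample input (assumed for illustration): two tagged pieces of text.
html = (
    "<STRONG>Here is the text I want to grab</STRONG>\n"
    "<STRONG>Here is some more text</STRONG>\n"
)

# Matching the tags as part of the pattern returns them in every result.
with_tags = re.findall(r"<STRONG>[a-zA-Z ]*</STRONG>", html)

# The lookbehind requires <STRONG> right before the match but does not
# include it in the result; the character class stops at the next "<".
text_only = re.findall(r"(?<=<STRONG>)[a-zA-Z ]*", html)

print(with_tags)
# ['<STRONG>Here is the text I want to grab</STRONG>', '<STRONG>Here is some more text</STRONG>']
print(text_only)
# ['Here is the text I want to grab', 'Here is some more text']
```

Other regex engines that support lookbehind should behave the same way for this pattern, but the exact syntax for running it will differ.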
I hope this is useful to you.