Thursday, April 10, 2008

How to find the invisible





When you use regular expressions to grab text from files, especially HTML files, you often run into a case where the easiest way to find the specific information you are looking for is to use the HTML tags as qualifiers.

For example:

<STRONG>Here is the text I want to grab</STRONG>


In a case like the one above, the only good qualifier is the pair of HTML tags.

But you do not want a list of results that still includes the tags, like this:

<STRONG>Here is the first occurrence</STRONG>
<STRONG>This could be the second occurrence</STRONG>

Instead you want only the text inside the specific HTML tags. To accomplish that, you can use an advanced regular expression feature called zero-width positive lookbehind.

With zero-width positive lookbehind you can use one part of the expression as a qualifier and another part to grab the text. Let's make a RegEx which handles the above.

(?<=<STRONG>)[a-zA-Z ]*

The magic part of the RegEx is the zero-width positive lookbehind, written as (?<=plain text). Here it specifies that the match must be immediately preceded by <STRONG>, without making <STRONG> part of the match itself. Many regex engines restrict what you can put inside a lookbehind (often it must have a fixed length), so plain text is the safe choice.
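To see what "zero-width" means in practice, here is a minimal sketch in Python (my choice of language for illustration; the post itself does not name a regex engine). It compares a pattern that simply starts with the tag against the lookbehind version. The variable names are made up for this example.

```python
import re

html = "<STRONG>Here is the text I want to grab</STRONG>"

# Without lookbehind, the opening tag is consumed and becomes part of the match.
with_tag = re.search(r"<STRONG>[a-zA-Z ]*", html).group()

# With lookbehind, <STRONG> must precede the match but is not part of it.
text_only = re.search(r"(?<=<STRONG>)[a-zA-Z ]*", html).group()

print(with_tag)   # <STRONG>Here is the text I want to grab
print(text_only)  # Here is the text I want to grab
```

The lookbehind only checks what comes before the current position, so the match starts right after the tag.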

The second part of the RegEx matches any number of characters from a-z and A-Z plus ' ' (space). Because the character < is NOT in that set, the match stops before </STRONG>.
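As a rough illustration of how the character class decides where the match ends, the following Python sketch runs the same RegEx over several tags at once. The sample input, including the "Price 42" line, is invented for this example and is not from any real page.

```python
import re

# Hypothetical sample input with three <STRONG> elements.
html = (
    "<STRONG>Here is the first occurrence</STRONG>\n"
    "<STRONG>This could be the second occurrence</STRONG>\n"
    "<STRONG>Price 42</STRONG>"
)

# findall returns every stretch of letters and spaces that follows <STRONG>.
matches = re.findall(r"(?<=<STRONG>)[a-zA-Z ]*", html)

print(matches)
# ['Here is the first occurrence', 'This could be the second occurrence', 'Price ']
```

Note the last result: the match stops at the first character outside a-z, A-Z and space, so digits and punctuation cut it short just as < does. If your text can contain those, the character class would need to be widened.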

I hope this is useful to you.
