Maintainers Guild (MG): How to find the invisible

When you use regular expressions to grab text from files, especially HTML files, you often run into cases where the easiest way to find the specific information you are looking for is to use the HTML tags as qualifiers.

As an example:

Here is the text I want to grab

In a case like the one above, the only good qualifier is the HTML tags.

But you do not want a list like

Here is the first occurrence
This could be the second occurrence

Instead, you want only the text inside the specific HTML tags. To accomplish that, you should use an advanced regular expression feature called zero-width positive lookbehind.

With zero-width positive lookbehind you can use one part of the expression as the qualifier and another part to grab the text. Let's make a RegEx that handles the case above.

(?<=)[a-zA-Z ]*

The magic part of the RegEx is the zero-width positive lookbehind, which specifies that the match must be immediately preceded by the opening HTML tag. The tag itself is checked but does not become part of the match. It is written as (?<=plain text). In many regex engines the lookbehind part is limited to plain, fixed-length text.
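Whether the lookbehind really has to be plain text depends on the regex engine, which this post does not name. As a small sketch, assuming Python's built-in re module (an assumption, not necessarily the engine used here), a fixed-width lookbehind compiles while a variable-width one is rejected. The helper name lookbehind_compiles is made up for this illustration:

```python
import re


def lookbehind_compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python raises re.error for a lookbehind that is not fixed-width.
        return False


# Plain, fixed-length text inside the lookbehind is accepted.
fixed_ok = lookbehind_compiles(r"(?<=abc)[a-zA-Z ]*")

# A quantifier makes the lookbehind variable-width, which Python rejects.
variable_ok = lookbehind_compiles(r"(?<=a+)[a-zA-Z ]*")

print(fixed_ok, variable_ok)
```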

The second part of the RegEx matches any number of characters from a-z and A-Z plus ' ' (space). Because the character < is NOT in that set, the match stops before the closing HTML tag, so the closing tag is not included.
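The two parts can be tried together in a short sketch. It assumes Python's re module and uses <b> as a stand-in for the qualifier tag, since the actual tag does not appear in the text above. The sample string is also invented for the illustration:

```python
import re

# Hypothetical sample input; the <b> tag is an assumed stand-in for the qualifier tag.
html = (
    "<p>Here is the first occurrence</p>"
    "<b>Here is the text I want to grab</b>"
    "<p>This could be the second occurrence</p>"
)

# (?<=<b>) requires "<b>" directly before the match without consuming it.
# [a-zA-Z ]* then grabs letters and spaces and stops at "<" of the closing tag.
pattern = re.compile(r"(?<=<b>)[a-zA-Z ]*")

match = pattern.search(html)
grabbed = match.group(0) if match else None
print(grabbed)

# The same pattern applied to a string with several qualifying tags.
several = pattern.findall("<b>one</b> and <b>two</b>")
print(several)
```

Because the pattern uses *, it can also produce an empty match when the tag has no content; replacing * with + would require at least one character.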

I hope this is useful to you.

Thursday, April 10, 2008
