Monday, April 28, 2008

Beginners RegEx - part 2 - negation and zero-width positive lookbehind






We're going to concentrate on negating ranges and a regex feature called zero-width positive lookbehind. Don't let the words scare you. It is harmless.

Negation tells the regex to match any character which is NOT in a range []. Zero-width positive lookbehind makes the regex require certain characters right before the stuff we are looking for, without including those characters in the result. The latter sounds really cryptic, but is quite useful.

Zero-width positive lookbehind. Have a look at the regex below:

(?<=.H1.)[^<]*

The first part of the regex is the zero-width positive lookbehind (?<=text). The second is a range [characters] and the last is *, which matches zero or more of the preceding range. Let's take it from the top.

(?<=text) says that we are looking for text, but if we find a match then we do not want the text "printed" as part of the result. In this case we are looking for .H1. which, if we were searching an html file, would match <H1> (header size 1). Remember that . (dot) matches any character, so the dot before H1 matches the < and the dot after H1 matches the >. The only real restriction when using (?<=text) is that, in many regex engines, the text has to be of fixed length. So in those engines it is not possible to put ? and * inside a (?<=text), because these could change the length of the zero-width lookbehind sub expression.
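To see the "zero-width" part in action, here is a minimal sketch in Python. Python's re module is only my choice for illustration (the post does not name an engine), and the variable names are made up. It shows that the lookbehind itself consumes no characters, and that this particular engine refuses a lookbehind containing *.

```python
import re

html = "<H1>here is some header text</H1>"

# Without a lookbehind, .H1. is an ordinary pattern and the tag is the match.
with_tag = re.search(r".H1.", html).group()

# With a lookbehind, the regex only checks what sits BEHIND the current
# position. The match itself is empty: it is just a position in the string.
position_match = re.search(r"(?<=.H1.)", html)
pos = position_match.start()          # index right after "<H1>"
matched_text = position_match.group() # empty string, hence "zero-width"

# A * inside the lookbehind makes its length variable.
# Python's re module rejects that when compiling the pattern.
try:
    re.compile(r"(?<=.H1.*)x")
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(with_tag, pos, repr(matched_text), variable_width_ok)
```

Other engines may behave differently on the last part, so treat the fixed-length rule as something to check for the tool you are using.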

[^<]* is a little tricky to understand. First off, it is a range we specify. Normally, anything inside the [ and ] is what we are looking for. The * after the range tells us that whatever we find that matches the range can be repeated zero or more times. Inside the range we use one metacharacter, the ^.

^ as the very first character in a range means "anything but". So [^a] means that we want to match anything but the letter a. With [^<] we look for anything but the character <, and as we have a * succeeding the range, [^<]* looks for zero or more occurrences of characters which are NOT <. Practically we get a match on an html string like this:


<H1>here is some header text</H1>


The first part of the full regex finds <H1>, but omits it from the result. The second part starts right after the > of <H1> and matches everything until it reaches a <. So the result is that we get the text between the html tags <H1> and </H1>.
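The walkthrough above can be tried out with a short sketch. Again, Python's re module and the variable names are my own choices for illustration, not something the regex itself depends on.

```python
import re

# The full regex from this post: lookbehind for .H1. followed by
# zero or more characters that are not <.
pattern = re.compile(r"(?<=.H1.)[^<]*")

html = "<H1>here is some header text</H1>"

match = pattern.search(html)
header_text = match.group()

print(header_text)  # here is some header text
```

One thing to keep in mind: because the dots match any character, /H1> at the end of the string also satisfies the lookbehind. If you ask for all matches instead of just the first, expect an extra empty match after the closing tag. Writing the lookbehind as (?<=<H1>) would be stricter.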

You can negate more than one character when using ranges. A regex like [^<=@] tells us that we want anything besides the characters <, = and @.
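As a quick illustration of negating several characters at once, here is a small Python sketch. The sample string is invented, and I use + instead of * after the range purely to keep empty matches out of the result.

```python
import re

# Made-up sample text containing all three excluded characters.
text = "user=name@example<end"

# [^<=@]+ matches runs of one or more characters that are
# none of <, = or @.
pieces = re.findall(r"[^<=@]+", text)

print(pieces)  # ['user', 'name', 'example', 'end']
```

The three excluded characters act as separators here: each run of "anything else" comes back as its own match.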
