Monday, April 28, 2008

Beginners RegEx - part 3 - metacharacters





In "Beginners RegEx - part 1" we saw some character classes (the [a-z] style) used to match any character between a and z. We also saw that you can combine ranges to include more characters, using styles like [a-z0-9] to match every letter from a to z and every digit from 0 to 9. And if we wanted to allow the uppercase versions of the letters as well, we'd have to write something like [a-zA-Z0-9].

Today we'll look at some metacharacters of the form \char (a backslash followed by a character), which can make life easier when using character classes [ characters ].

Here is a short list of some character class metacharacters:

\t = Tab
\n = Newline
\r = Carriage return
\s = Any whitespace character, i.e. space, tab, newline, form feed etc.
\S = Any character which is NOT whitespace
\w = Word character, which is basically [a-zA-Z0-9_]
\W = Any character which is NOT a word character
\d = Digit, which is basically [0-9]
\D = Any character which is NOT a digit
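
To make the list above concrete, here is a small sketch in Python (the language is my choice, not something the series prescribes) that runs a few of the metacharacters against a made-up sample string. The re.ASCII flag is used on the assumption that we want \w and \d to behave like the plain [a-zA-Z0-9_] and [0-9] classes described above; without it, Python 3 also accepts non-ASCII letters and digits.

```python
import re

# Made-up sample text containing a space, a tab and a newline.
sample = "abc 123\tx_y\n"

# \d matches one digit at a time.
digits = re.findall(r"\d", sample, re.ASCII)

# \w+ matches runs of word characters (letters, digits, underscore).
words = re.findall(r"\w+", sample, re.ASCII)

# \s matches each whitespace character: the space, the tab, the newline.
spaces = re.findall(r"\s", sample)

# \D+ is the opposite of \d: runs of anything that is not a digit.
non_digits = re.findall(r"\D+", sample, re.ASCII)

# The shorthand and the spelled-out class find the same things here.
same_as_class = words == re.findall(r"[a-zA-Z0-9_]+", sample)

print(digits)
print(words)
print(spaces)
print(non_digits)
print(same_as_class)
```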

So now we can construct some tighter regular expressions, which will eventually be easier to read once you grow accustomed to them.

So if we have a situation in an HTML file which looks like this:


Then we can construct a regular expression which matches the URL part of the HTML by writing the following regex:
http://[\w\s/.-]+
This will give us a result like:

We see a new feature of regular expressions in the regex above, one we haven't discussed yet: the + (plus), which tells the engine to match one or more occurrences of the preceding character or character class. This is in contrast to the * (star), which means match zero or more occurrences of the preceding character or character class.
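
The difference between + and * is easiest to see when the repeated part is missing entirely. The following is a minimal Python sketch; the patterns ab+ and ab* and the test strings are invented for illustration only.

```python
import re

# "ab+" needs an 'a' followed by at least one 'b'.
one_or_more = re.compile(r"ab+")

# "ab*" needs an 'a' followed by any number of 'b's, including none.
zero_or_more = re.compile(r"ab*")

# A lone "a" has zero 'b's, so only the * version accepts it.
plus_accepts_a = one_or_more.fullmatch("a") is not None
star_accepts_a = zero_or_more.fullmatch("a") is not None

# With several 'b's present, both versions accept the string.
plus_accepts_abbb = one_or_more.fullmatch("abbb") is not None
star_accepts_abbb = zero_or_more.fullmatch("abbb") is not None

print(plus_accepts_a, star_accepts_a)
print(plus_accepts_abbb, star_accepts_abbb)
```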

Let's take the regex apart and explain what is going on.
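
One way to take the pattern apart is to write it in Python's verbose mode, where each piece can carry its own comment. This is only a sketch: the HTML snippet and the path after the host name are hypothetical stand-ins, since only the host w4nd0rn.blogspot.com and the .html ending are known from the text.

```python
import re

# The same regex as above, split into its two parts with re.VERBOSE.
url_pattern = re.compile(r"""
    http://        # the literal text "http://"
    [\w\s/.-]+     # one or more of: word char, whitespace, slash, dot, hyphen
""", re.VERBOSE)

# Hypothetical HTML line; the path /2008/04/example.html is made up.
html = '<a href="http://w4nd0rn.blogspot.com/2008/04/example.html">part 3</a>'

match = url_pattern.search(html)
url = match.group(0) if match else None
print(url)
```

The match stops at the closing double quote, because " is not one of the characters allowed inside the character class. Note that \s is in the class, so in free-running text the match would also continue across spaces.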


One of the things you may wonder about, hopefully, is that inside the character class we wrote . (dot) to allow matching of the dots between the parts of the host name w4nd0rn.blogspot.com and the final . just before the file type (.html).

Inside a character class [ ] the . (dot) does not mean any character whatsoever. In a character class it means just what it is, i.e. a literal . (dot).
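
A quick way to convince yourself of this is to compare the dot outside and inside a character class. The short Python sketch below uses the made-up string "a.b".

```python
import re

text = "a.b"

# Outside a character class, the dot matches any character (except newline).
any_char = re.findall(r".", text)

# Inside a character class, the dot matches only a literal dot.
dot_in_class = re.findall(r"[.]", text)

# An escaped dot outside a class behaves the same as [.].
escaped_dot = re.findall(r"\.", text)

print(any_char)
print(dot_in_class)
print(escaped_dot)
```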
