This is an addition to the previous article about metacharacters. It introduces a small new feature that lets us state how many occurrences of a certain character we want to allow.
We have seen that * means zero or more, and that + means one or more, of the preceding character.
In this case we will use dates as the target for our example. Consider some dates like the following.
30 April, 2008
14 May 2008
30. April 2008
1 - May - 2008 Description
2nd May 2008
Date: 12TH - 4 - 2008
02-05-2008
All are valid dates to my eye, so we want a regular expression that reliably finds such dates when we look for them. Let's outline some of the rules for dates. I will skip real validation of impossible dates like the 65th of February and consider only the format.
- Day comes first
- Month comes second
- Year comes last
- Day is 1 or 2 digits long
- Month is letters, or a number that is 1 or 2 digits long
- Year is 4 digits long
- We have some kind of separators between day, month and year
Repeats - minimum and maximum
Those are the basic rules for the date formats we wish to accept and match. The first thing we need to ensure is that the digit versions of day, month and year do not exceed their natural limits of 1-2 digits, 1-2 digits and 4 digits. This can be done by setting a minimum and a maximum repeat count.
[0-9]{1,2}
The character class [0-9] must be repeated at least 1 and at most 2 times in sequence, so every number between 0 and 99 is good. The general form is {minimum,maximum}, and it applies to the preceding character or character class. There is also a variant of the {minimum,maximum} specifier that takes a single number. It is used in a case like this:
[0-9]{4}
Repeats - exact count
Using just {exact count} is useful when we want to ensure that the year in our dates is exactly 4 digits.
One other thing that will be useful is to find the boundaries of the dates we wish to match. What we want is to check from the starting boundary of the date to its ending boundary. Let's consider a simple version which could match one of our dates.
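The two repeat forms can be tried out in isolation. The following is a minimal sketch in Python (the article does not name a language; the helper names `is_day_or_month` and `is_year` are made up for illustration). It uses `fullmatch` so that the whole string has to fit the expression.

```python
import re

# {1,2}: the preceding class must appear once or twice.
day_or_month = re.compile(r"[0-9]{1,2}")
# {4}: the preceding class must appear exactly four times.
year = re.compile(r"[0-9]{4}")


def is_day_or_month(text):
    """Return True if the whole string is one or two digits."""
    return day_or_month.fullmatch(text) is not None


def is_year(text):
    """Return True if the whole string is exactly four digits."""
    return year.fullmatch(text) is not None


print(is_day_or_month("7"))    # True
print(is_day_or_month("31"))   # True
print(is_day_or_month("123"))  # False, three digits is one too many
print(is_year("2008"))         # True
print(is_year("208"))          # False, only three digits
```

Note that this only checks the number of digits, so 99 still passes as a day, in line with the decision above to look at the format only.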
[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}
Boundaries
That one matches 02-05-2008 successfully. But what if the text wasn't actually a date, but part of a calculation like 2002-05-20089124? Then we would still get a match on part of it, namely the 02-05-2008 in the middle of 2002-05-20089124, which is not what we want. For this purpose regular expressions have a metacharacter called \b (boundary). \b matches at a position where a word character ( \w = [a-zA-Z0-9_] ) sits next to anything else, including the start or end of the text. So if we rewrite the previous regular expression into:
\b[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}\b
Then we get the desired effect. We are saying that there must be a boundary first, then digits and hyphens, and finally a boundary again. This keeps us from mistaking part of 2002-05-20089124 for a date.
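To see the difference the boundaries make, here is a small Python sketch (Python is an assumption, and the variable names `plain` and `bounded` are only illustrative). It runs the expression with and without \b against the calculation and against a real date.

```python
import re

# The same expression, first without and then with word boundaries.
plain = re.compile(r"[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}")
bounded = re.compile(r"\b[0-9]{1,2}-[0-9]{1,2}-[0-9]{4}\b")

calculation = "2002-05-20089124"
date = "02-05-2008"

# Without \b the search finds a date-shaped piece inside the calculation.
print(plain.search(calculation).group())  # 02-05-2008

# With \b there is no match, because the digits continue on both sides.
print(bounded.search(calculation))        # None

# A real date still matches.
print(bounded.search(date).group())       # 02-05-2008
```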
Multiple choices
There is also the issue of the month being spelled out instead of written with digits. This means we have to adjust our month part a bit to allow either digits (as now, [0-9]{1,2}) or letters ( [a-zA-Z]+ ). The subexpression [a-zA-Z]+ says letters a-z and A-Z, with a minimum of 1 letter and no maximum, but only letters. So we adjust our regex a little again. This time we use the logical OR construction ( thispart | orthispart ). The expression will look like this:
\b[0-9]{1,2}-([a-zA-Z]+|[0-9]{1,2})-[0-9]{4}\b
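A quick way to check the OR construction is to look at what the parenthesised group captured. This is a sketch in Python (an assumption, as is the helper name `month_of`); the group number 1 refers to the first pair of parentheses in the expression.

```python
import re

# Month is either letters or one to two digits.
date_re = re.compile(r"\b[0-9]{1,2}-([a-zA-Z]+|[0-9]{1,2})-[0-9]{4}\b")


def month_of(text):
    """Return the month part of the first date found, or None."""
    match = date_re.search(text)
    return match.group(1) if match else None


print(month_of("02-May-2008"))  # May
print(month_of("02-05-2008"))   # 05
print(month_of("02-5x-2008"))   # None, a mix of digits and letters is neither branch
```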
Metacharacters for digits
All the 0-9 ranges get a bit confusing next to the 1,2 numbers, so we rewrite the expression using the \d (digit) metacharacter.
\b\d{1,2}-([a-zA-Z]+|\d{1,2})-\d{4}\b
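The \d version can be checked against the same inputs as before. The sketch below uses Python, which is an assumption on my part, and it also shows one detail that is specific to that choice: on text strings Python's \d accepts decimal digits beyond 0-9 unless the `re.ASCII` flag is given. Other regex flavours may behave differently.

```python
import re

# Same expression as before, written with \d instead of [0-9].
date_re = re.compile(r"\b\d{1,2}-([a-zA-Z]+|\d{1,2})-\d{4}\b")

print(bool(date_re.search("02-05-2008")))   # True
print(bool(date_re.search("02-May-2008")))  # True

# Python-specific detail: \d is wider than [0-9] on str patterns.
fullwidth_year = "\uff12\uff10\uff10\uff18"  # the digits 2008 in fullwidth form
print(bool(re.fullmatch(r"\d{4}", fullwidth_year)))            # True
print(bool(re.fullmatch(r"\d{4}", fullwidth_year, re.ASCII)))  # False
print(bool(re.fullmatch(r"[0-9]{4}", fullwidth_year)))         # False
```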
Separators between day, month and year
We're getting there, but a few things are still missing, namely the separating characters. We have to accept comma, dot, hyphen, space, tab etc. as separating characters. If you read the previous article you know that \s is space, tab, newline etc., so a character class like [\s,.-] will cover our possible separators. And since we need at least 1 separator between our day, month and year, we tweak the character class to [\s,.-]+ and modify the expression to look like below.
\b\d{1,2}[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b
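At this stage it is worth running the expression over the sample dates from the top of the article. The following Python sketch does that (the language and the helper name `find_date` are assumptions). Two of the samples are expected to fail for now, because the day carries a text extension that the expression does not yet know about.

```python
import re

date_re = re.compile(r"\b\d{1,2}[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b")


def find_date(text):
    """Return the first date-shaped piece of text, or None."""
    match = date_re.search(text)
    return match.group(0) if match else None


samples = [
    "30 April, 2008",
    "14 May 2008",
    "30. April 2008",
    "1 - May - 2008 Description",
    "2nd May 2008",           # not matched yet: "nd" follows the day
    "Date: 12TH - 4 - 2008",  # not matched yet: "TH" follows the day
    "02-05-2008",
]

for sample in samples:
    print(repr(sample), "->", repr(find_date(sample)))
```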
Extensions to day
It is almost complete: we can now have any number of separators between our day, month and year. All we still need is to allow days like 1st, 2nd, 3rd, 4th etc. We basically need to allow some text extensions on the day part of our expression. We could do this with a simple subexpression like (st|nd|rd|th). The trick is that the dates may or may not contain the extension, so we have to make the subexpression optional. So we add a ? to the subexpression and get (st|nd|rd|th)?. Our expression looks like this:
\b\d{1,2}(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b
So what else do we need to ensure? Basically the only thing left is that the dates MAY have some kind of separation between the day and the extension, such as 1 st or 2 nd. So let's inject our separator class [\s,.-] in between and make it optional, that is, allowed zero or more times. We do that with a *, so we get:
\b\d{1,2}[\s,.-]*(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b
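The optional extension and the optional separator in front of it can be checked with a few variations of the same date. This is a Python sketch under the same assumptions as before (the helper name `matches` is illustrative). The last input fails on purpose, since the extension is still case-sensitive at this point.

```python
import re

date_re = re.compile(
    r"\b\d{1,2}[\s,.-]*(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b"
)


def matches(text):
    """Return True if the text contains a date in an accepted format."""
    return date_re.search(text) is not None


print(matches("2 May 2008"))     # True, no extension at all
print(matches("2nd May 2008"))   # True, extension directly after the day
print(matches("2 nd May 2008"))  # True, extension after a separator
print(matches("2ND May 2008"))   # False, upper-case extension is not covered yet
```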
Case-insensitive
What else? What if the day was written like 1 St? There are some case-sensitivity issues. We could extend our subexpression (st|nd|rd|th)? into (st|nd|rd|th|St|Nd|Rd|Th)?, or we could make the whole thing case-insensitive. That is done with (?i), so now the expression looks like:
\b\d{1,2}[\s,.-]*(?i)(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b
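Finally, the complete expression can be run over all the sample dates. One caveat for this Python sketch: Python's `re` module does not accept a global (?i) in the middle of a pattern (recent versions raise an error), so the sketch leaves the (?i) out and passes the `re.IGNORECASE` flag instead. Since the rest of the expression does not depend on case, the effect should be the same. The helper name `find_date` is again an assumption.

```python
import re

# (?i) is replaced by the IGNORECASE flag; see the note above.
date_re = re.compile(
    r"\b\d{1,2}[\s,.-]*(st|nd|rd|th)?[\s,.-]+([a-zA-Z]+|\d{1,2})[\s,.-]+\d{4}\b",
    re.IGNORECASE,
)


def find_date(text):
    """Return the first date-shaped piece of text, or None."""
    match = date_re.search(text)
    return match.group(0) if match else None


samples = [
    "30 April, 2008",
    "14 May 2008",
    "30. April 2008",
    "1 - May - 2008 Description",
    "2nd May 2008",
    "Date: 12TH - 4 - 2008",
    "02-05-2008",
    "1 St May 2008",
    "2002-05-20089124",  # the calculation, which should not match
]

for sample in samples:
    print(repr(sample), "->", repr(find_date(sample)))
```

All samples except the calculation on the last line should come back with the date part only, for example 12TH - 4 - 2008 without the leading Date: label.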
And I believe that is as good as it gets today. Take time to work through the regular expression. There are a few nitty-gritty things one can do to make it really tight.