Friday, April 25, 2008

Beginners RegEx





Here's a short article that could be called "RegEx for dummies" (or for beginners). The most important thing is to read the primer below.

PRIMER: Regular expressions (RegEx) are used to find text that matches certain criteria. A RegEx does NOT find words or sentences; it finds single characters. So a regular expression like "gin" does not find the word "gin"; it finds the letter "g" followed by the letter "i" followed by the letter "n". Make a note of this, because everything that follows becomes much easier once you think of it this way.
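
To make the primer concrete, here is a minimal sketch using Python's built-in re module (Python is my choice for illustration; the article itself uses EditPad Pro). The sample words are made up. It shows that "gin" is found wherever those three characters appear in a row, whether or not they form a word.

```python
import re

# "gin" is three literal characters in a row, not a word.
# The sample words are chosen only for illustration.
words = ["gin", "Beginners", "imagine", "grin"]

# Map each word to the (start, end) position of the match, or None.
spans = {}
for word in words:
    m = re.search(r"gin", word)
    spans[word] = m.span() if m else None
    print(word, spans[word])
```

"grin" contains a g, an i and an n, but not directly after each other, so it is the only word here that does not match.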

Here are a few examples, followed by an explanation of some of the metacharacters in RegEx.

RegEx: "gin" matches Beginners.

RegEx: "gin?" matches Begixxers, because the n is optional, so "gi" alone is enough (? = the preceding character is optional)

RegEx: "(ht|f)tp://" matches both http:// and ftp:// (| = a logical OR between the enclosed character sequences)

RegEx: "[1-2][0-9]" matches any two-character sequence where the 1st digit is 1 or 2 and the following digit is between 0 and 9. So it matches the numbers 10-29, but it also matches inside 10000, 294.51 or 14degrees, because nothing in the expression says where the number must end.

RegEx: "/[a-z0-9]/" matches any sequence where a / is followed by exactly one character, either a letter from a-z or a digit from 0 to 9, and then followed by a /

RegEx: "s.x" matches any sequence where an s is followed by ANY character and then followed by an x. So sex, six, sux, sax are all good matches.

RegEx: "http:.*\.zip" matches the letters http: followed by any character repeated 0 or more times, and then finally .zip. The * means repetition 0 or more times, . means any character, and \. means the literal dot itself.
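
The examples above can be tried out with a short script. This is only a sketch in Python's re module, not the EditPad Pro search the article uses, and the sample texts (including the host name) are invented for illustration. The helper first_match is a hypothetical name of my own.

```python
import re

# (pattern, sample text) pairs; the sample texts are invented for illustration.
samples = [
    (r"gin", "Beginners"),
    (r"gin?", "Begixxers"),
    (r"(ht|f)tp://", "ftp://host"),
    (r"[1-2][0-9]", "14degrees"),
    (r"/[a-z0-9]/", "a/b/c"),
    (r"/[a-z0-9]/", "a/bc/d"),   # two characters between the slashes
    (r"s.x", "six"),
    (r"http:.*\.zip", "get http://host/file.zip now"),
]

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group() if m else None

results = [first_match(p, t) for p, t in samples]
for (p, t), r in zip(samples, results):
    print(f"{p!r} in {t!r} -> {r!r}")
```

Note that "/[a-z0-9]/" finds nothing in "a/bc/d": the square brackets stand for a single character, so two characters between the slashes do not match.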


So here is a tiny explanation:

? = The preceding character is optional. It can be there or not.
(x|y) = A sub group where either x or y is required. One could make them both optional by (x|y)?, see ?
[a-z0-9/] = A set for exactly 1 character, which can be any letter a-z, any digit 0-9, or a /.
. = Means any single character whatsoever (in most tools, anything except a line break).
* = Like ?, it applies to the preceding character: that character can be repeated 0 or more times, with no upper limit.
\. = The literal dot itself. It is "escaped" by the \ to tell that we mean . and not the .(any char)
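
Each of these metacharacters can be checked in isolation. The sketch below again assumes Python's re module rather than EditPad Pro; the helper name whole and the test strings are mine. It uses re.fullmatch, which only succeeds when the entire text matches the expression, so each line tests exactly one rule.

```python
import re

def whole(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : the preceding character is optional
print(whole(r"gin?", "gi"), whole(r"gin?", "gin"))            # True True
# (x|y) : either alternative; (x|y)? makes the whole group optional
print(whole(r"(x|y)z", "xz"), whole(r"(x|y)?z", "z"))          # True True
# [a-z0-9/] : exactly one character from the set
print(whole(r"[a-z0-9/]", "/"), whole(r"[a-z0-9/]", "ab"))     # True False
# . : any single character; \. : only a literal dot
print(whole(r"a.c", "abc"), whole(r"a\.c", "abc"))             # True False
# * : zero or more repetitions of the preceding character
print(whole(r"ab*", "a"), whole(r"ab*", "abbb"))               # True True
```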

That's all. Download EditPad Pro and load/write some text, use the search (Ctrl+F) and write your first regular expressions to try searching.

In later articles we shall construct more "game on" regular expressions and explore tools like GREP, which has a slightly different and less fully featured regex flavor than EditPad Pro.
