Greppie - Regular Expressions


One of the most powerful features of grep (the utility built into Mac OS X that Greppie provides an interface for) is the ability to search using regular expressions. These allow you to search for patterns in text, as opposed to literal text. For example, you can search for a word when you don't know one of its letters, or for all words that begin with "w" and end with "s". Of course, most beginners don't want to have to learn a complex set of commands just to search, which is why Greppie provides a number of useful searches for you (some of which are very complicated to write by hand, but not with Greppie!).

However, there are always those who want to know more, so here is my attempt at a tutorial on regular expressions.

Regular Expressions Tutorial

We have all searched files for simple words. It's easy to do. If you want to know whether your document contains any information about bears, you simply search for the word "bear" using whatever program is most handy (Greppie, TextEdit, Word). In this case we are also lucky: the search finds anything that mentions "bears" as well, since "bears" contains "bear".

Anything trickier becomes much more complicated, much more quickly. What if you want to find repeated words in a document (so you don't end up with something like this this)? Or what if you want to search for a phone number, but you can't remember the number? Then you're in deep trouble!

This is where regular expressions come in handy. Regular expressions allow you to search not just for text in documents, but for patterns as well. So, while you couldn't search for a phone number without knowing the phone number, you could search for the pattern of a phone number. In the US, it might look something like (###) ###-####, where # can be any digit.

However, making regular expressions that are very flexible can also be very difficult, which is why Greppie includes numerous useful regular expressions. In this tutorial, we won't delve into making complex regular expressions, but will instead just cover the basics: how to search for missing characters, how to search for numbers and letters, etc.
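To make the phone-number pattern mentioned above concrete, here is a minimal sketch. It uses Python's re module as a stand-in for grep (an assumption; Greppie itself is used through its interface), and it relies on syntax that is explained later in this tutorial. The sample text is made up for the illustration.

```python
import re

# The (###) ###-#### shape: the literal parentheses are escaped with a
# backslash, and [0-9] stands for any single digit.
PHONE = re.compile(r"\([0-9][0-9][0-9]\) [0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")

# Hypothetical sample text, invented for this example.
text = "Call the office at (555) 123-4567 before noon."

match = PHONE.search(text)
found = match.group(0) if match else None
print(found)
```

The pattern finds the number without knowing any of its digits, which is exactly what a plain text search cannot do.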

Metacharacters

First, we must understand that in order to search for patterns as well as text, we must have a way of telling the program that we are searching for patterns. Using regular expressions, this is done through defining special characters, called "metacharacters". The characters

\  ^  $  .  [  ]  *  +  ?  (  )  |

all have special meaning in regular expressions. If the checkbox "Use regular expressions" in the options drawer is not checked, any of these characters can be used in a search, as normal. However, if we are using regular expressions, and we want to search for one of these characters, we must prefix it with the \ character. So, to search for "$3", our search term would be "\$3".
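The effect of the backslash can be tried out with a short sketch. It uses Python's re module as a stand-in for grep (an assumption; the two treat '$' and '\$' the same way), and the sample line is made up.

```python
import re

line = "The ticket costs $3 today."

# Unescaped, '$' means "end of line", so "$3" asks for a 3 that comes
# after the end of the line. That can never be found.
unescaped = re.search(r"$3", line)

# Escaped, '\$' is an ordinary dollar sign.
escaped = re.search(r"\$3", line)

# re.escape adds the backslashes for you when the search text should be
# taken literally.
auto = re.escape("$3")

print(unescaped, escaped.group(0), auto)
```

Here the unescaped search finds nothing, while the escaped one finds "$3".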

Matching single characters

If these characters have special meaning, what are these special meanings? We'll start with the simplest, the '.' character, which stands for any single character in the position where you put it. The table below shows how we can use this character in our search:

Search string: message.

Matches         Doesn't match
message5        message
messageA        massage
messages

Note that the last item in the "Doesn't match" list is "massage" with an 'a', which I put in to show that the '.' character only matches in the place where you put it. If you wanted to search for both "message" and "massage", you would use "m.ssage". Note that you can use as many dots in an expression as you care to, and even use two in a row.
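The table above can be reproduced with a small sketch. It uses Python's re module as a stand-in for grep (an assumption; the '.' character behaves the same way in both), and the helper name matches is made up for the example.

```python
import re

def matches(pattern, line):
    """Return True if the pattern is found anywhere in the line."""
    return re.search(pattern, line) is not None

words = ["message5", "messageA", "messages", "message", "massage"]

# "message." needs one more character after "message".
one_after = [w for w in words if matches(r"message.", w)]

# "m.ssage" accepts any character in the second position.
either_vowel = [w for w in words if matches(r"m.ssage", w)]

print(one_after)     # -> ['message5', 'messageA', 'messages']
print(either_vowel)  # -> all five words
```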

Matching multiple characters

The character '+' means match 1 or more of the previous character, the character '?' means match 0 or 1 of the previous character, and the character '*' means match 0 or more of the previous character. These can be confusing since they all do similar things, but they are distinct. For simple searching, '?' and '*' at the end of a search string find the same lines, since both allow zero occurrences (the difference matters when you are doing replace operations).

Search string: message1+

Matches         Doesn't match
message1        message
message111      message2

Search string: message1?

Matches         Doesn't match
message         massage
message1        messag
message2

In the second set of examples, note how the pattern matches "message2": it contains "message" followed by zero '1' characters, and '?' matches zero or one of the previous character.
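The following sketch runs all three characters against the same list, again using Python's re module as a stand-in for grep (an assumption; '+', '?' and '*' mean the same thing in both). The helper name grep_lines is made up. The last two lines show where '?' and '*' differ: in how much text they actually cover, which is what matters for replace operations.

```python
import re

lines = ["message", "message1", "message111", "message2", "massage", "messag"]

def grep_lines(pattern, candidates):
    """Keep the candidates in which the pattern is found somewhere."""
    return [c for c in candidates if re.search(pattern, c)]

plus = grep_lines(r"message1+", lines)   # one or more '1'
maybe = grep_lines(r"message1?", lines)  # zero or one '1'
star = grep_lines(r"message1*", lines)   # zero or more '1'

# Same lines found, but a different amount of text covered:
short = re.search(r"message1?", "message111").group(0)
long_ = re.search(r"message1*", "message111").group(0)

print(plus)          # -> ['message1', 'message111']
print(maybe == star) # -> True
print(short, long_)  # -> message1 message111
```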

Who gets the last word

The characters '^' and '$' represent the beginning and end of a line (paragraph), respectively. You place these characters at the beginning or end of the expression you want to match, as shown in the table.

Search string: ^message

Matches               Doesn't match
message me today      send a message to me
messages get sent     send a message

Search string: message$

Matches               Doesn't match
send a message        message me today
new message           send a message to me
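A short sketch of both anchors, using Python's re module as a stand-in for grep (an assumption). grep looks at one line at a time, so each string in the list below plays the role of one line.

```python
import re

lines = [
    "message me today",
    "messages get sent",
    "send a message to me",
    "send a message",
    "new message",
]

# '^' ties the pattern to the start of the line.
starts = [l for l in lines if re.search(r"^message", l)]

# '$' ties the pattern to the end of the line.
ends = [l for l in lines if re.search(r"message$", l)]

# Both together: the line must be exactly "message".
exact = [l for l in lines + ["message"] if re.search(r"^message$", l)]

print(starts)  # -> ['message me today', 'messages get sent']
print(ends)    # -> ['send a message', 'new message']
print(exact)   # -> ['message']
```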

The sequence "\b" represents the edge (beginning or end) of a word.

Search string: age\b

Matches         Doesn't match
message         messages
massage         message4
age
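The word-edge search can be checked with a small sketch that uses Python's re module as a stand-in for grep (an assumption; both understand "\b"). Note that digits count as part of a word, which is why "message4" is not found.

```python
import re

words = ["message", "massage", "age", "messages", "message4"]

# "age" must be followed by the edge of a word.
ends_in_age = [w for w in words if re.search(r"age\b", w)]

# An edge on both sides: "age" must be a whole word by itself.
whole_word = [w for w in words if re.search(r"\bage\b", w)]

print(ends_in_age)  # -> ['message', 'massage', 'age']
print(whole_word)   # -> ['age']
```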

Ranges of characters

The characters '[' and ']' enclose a set of characters, any one of which may appear at that position (for programmers, this looks suspiciously like array access). You can list all of the characters explicitly, such as "[abcdef]", or you can use '-' to specify a range of characters, "[a-f]".

Search string: message[123]

Matches         Doesn't match
message1        message
message2        message4
message3

A nifty trick is that you can use this expression to search for things that don't contain the characters you specify by starting the list with the '^' character.

Search string: message[^123]

Matches         Doesn't match
message4        message
messageA        message2
                message3
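Both tables can be reproduced with the sketch below, which uses Python's re module as a stand-in for grep (an assumption; square brackets, '-' ranges and a leading '^' work the same way in both).

```python
import re

words = ["message1", "message2", "message3", "message4", "messageA", "message"]

# One character that is 1, 2 or 3.
in_set = [w for w in words if re.search(r"message[123]", w)]

# The same set written as a range.
in_range = [w for w in words if re.search(r"message[1-3]", w)]

# One character that is anything except 1, 2 or 3. There still has to be
# a character there, so plain "message" is not found.
not_in_set = [w for w in words if re.search(r"message[^123]", w)]

print(in_set)      # -> ['message1', 'message2', 'message3']
print(not_in_set)  # -> ['message4', 'messageA']
```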

There are also special character sets defined that can be very useful. If you want all letters and numbers, you might have to type something like the following as your search string:

	[a-zA-Z0-9]

This would work, but only in English. Because these kinds of searches are performed often, there are some pre-defined sets of characters that you can use. The table below shows these sets (for ASCII English characters):

[:alnum:] The alphanumeric set includes all letters (uppercase and lowercase) as well as all numbers.
[:alpha:] The alphabet, containing all letters (uppercase and lowercase), but not the numbers.
[:digit:] The numbers 0-9.
[:graph:] Any character that, when put on screen, is actually visible. This would include letters, numbers, symbols, but not whitespace (such as space or return).
[:lower:] Any lowercase letter.
[:punct:] Any punctuation, such as comma (',') or period ('.').
[:space:] Any character taking space on the screen, but not actually printing anything. Includes spaces, tabs and returns.
[:upper:] Any uppercase letter.

These special character sets are used inside a set of square brackets ('[' and ']'), so they end up with double brackets when used in an expression:

	[[:space:]]
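These named sets belong to grep; not every regular expression tool understands them. Python's re module, for example, does not. The sketch below is one way to spell out the ASCII equivalents from the table above, by rewriting each name into an explicit set before searching. The names POSIX_ASCII and expand_posix are made up for this example, and the ASCII ranges are an assumption that matches the "ASCII English characters" case only.

```python
import re
import string

# ASCII stand-ins for the named sets; each value is meant to sit inside [ ].
POSIX_ASCII = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "digit": "0-9",
    "graph": "!-~",                      # every visible ASCII character
    "lower": "a-z",
    "punct": re.escape(string.punctuation),
    "space": r" \t\n\r\f\v",
    "upper": "A-Z",
}

def expand_posix(pattern):
    """Replace each [:name:] in a grep-style pattern with its ASCII stand-in."""
    return re.sub(r"\[:([a-z]+):\]", lambda m: POSIX_ASCII[m.group(1)], pattern)

expanded = expand_posix("message[[:digit:]]")
print(expanded)  # -> message[0-9]
```

With this, a search string such as "[[:space:]]" can be tried in Python by passing expand_posix("[[:space:]]") to re.search.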

Combining expressions example

So now that we know some basic regular expression syntax, how do we combine them to get something useful? Well, at the beginning of the document, I mentioned something about finding repeated words, so let's try and figure that one out first.

Obviously we can't just type in that we're looking for "this this", because the repeated word could just as well be "is is", so this is the perfect area for using regular expressions. Well, first we have to figure out how we might find a word. We have seen that "\b" marks the edge of a word; grep also offers "\<" and "\>", which match only the start and only the end of a word, respectively, so those are likely to be useful. But somehow we have to express that there is something in the middle of the word, not just the start and end of a word.

Since we don't know what characters we are looking for, it stands to reason that we'll have to use the '.' character, which stands for any character. And since we are looking for words of any length, we are probably going to use '+' since it looks for one or more of something, and we want words with one or more letters.

Now, how do we put them together? It would make sense to look for the beginning of a word first, so "\<" should be the beginning of our expression. Then we want to look for characters ('.'), and 1 or more of them ('+'). Then we want to look for the end of a word ("\>"). So, now we have the expression:

	\<.+\>

This certainly doesn't look very pretty, but we know that it finds words. Now the question is how do we find duplicate words? Well, it might make sense at first just to put that search expression again with a space in between:

	\<.+\> \<.+\>

But unfortunately, this will match any place where there are two words in a row. And we want two of the same words.

I tricked you, because I haven't told you how to do what you need to do! It turns out there is a way to refer back, later in your expression, to the text that an earlier part of it matched (this is how regular expressions get so complicated so quickly!). If you put part of an expression in parentheses, "()", and then write \n, where n is a number, \n stands for whatever text the nth parenthesized part matched (if your head is exploding, don't worry, it's hard to grasp). As an example, let's see how we would detect a repeating word:

	(\<.+\>) \1

So first we search for a word, then a space, then "\1", which means the text matched by the first parenthesized expression, which is the word. So we search for word, space, same word. And voila! We have found places in the document where words are repeated. This is not a complete solution, though: if there are two spaces between the words, it won't match, and text such as "a mother mothering her child" will match even though it shouldn't. Greppie includes a duplicate word search function that takes these subtleties into account.
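The behaviour described above, including both shortcomings, can be seen in the sketch below. It uses Python's re module as a stand-in for grep, which is an assumption with one visible consequence: Python does not understand "\<" and "\>", so "\b" is used in their place. The second pattern is one possible way to handle the two problem cases; it is an illustration only and is not Greppie's actual duplicate word search.

```python
import re

# The tutorial's expression, with \b standing in for \< and \>.
NAIVE = re.compile(r"(\b.+\b) \1")

# A stricter variant (an assumption, not Greppie's own search):
# \w+ keeps the match to a single word, \s+ allows any amount of
# whitespace, and the final \b stops "mother" from matching "mothering".
STRICTER = re.compile(r"\b(\w+)\s+\1\b")

def repeated(pattern, text):
    """Return the repeated text found by the pattern, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

print(repeated(NAIVE, "like this this."))                  # -> this
print(repeated(NAIVE, "this  this"))                       # -> None
print(repeated(NAIVE, "a mother mothering her child"))     # -> mother
print(repeated(STRICTER, "this  this"))                    # -> this
print(repeated(STRICTER, "a mother mothering her child"))  # -> None
```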
