A brief introduction to Regular Expressions

A regular expression is a set of characters, a string of characters if you will, that specifies a pattern. Ever used the grep command? It makes use of them. The ‘grep’ command is very handy when one needs to look for certain things inside a text file, or for some specific pattern in the output of another command.

In this brief introduction to regular expressions you will see how they can help you find specific patterns. For example all the sentences beginning with the ‘A’ character, any number with a certain number of digits, selecting all the numbers and skipping all the letters of a file or output, etc. Once you have found these patterns you can do things with them such as writing them to a file, reusing them in another command via a pipe or even modifying them (or bits of them) with another command such as sed.

If you find the articles in Adminbyaccident.com useful to you, please consider making a donation.

Use this link to get $200 credit at DigitalOcean and support Adminbyaccident.com costs.

Get $100 credit for free at Vultr using this link and support Adminbyaccident.com costs.

Mind Vultr supports FreeBSD on their VPS offer.

Regular expressions can also be found inside programming languages. You can find them in Java, Visual Basic, JavaScript, C, Perl, Python, PHP… A long list. They are a necessary tool to have whether one is a system administrator or a programmer, and they help tremendously in both cases. Because of their nature they can be found in very different tools: aside from the UNIX ones already mentioned, you can find regular expressions inside word processors and in databases too.

One key to understanding regular expressions and making good use of them is the distinction between metacharacters and literals. Whaaaat? Think of metacharacters as modifiers, the collection of characters that mean something, a verb if you will. Literals are just that: literals. They mean what the regular convention we’ve learnt at school says they mean, so the character ‘a’ is still an ‘a’, and the numeral ‘1’ is still a ‘1’. The combination of both makes regular expressions very useful, although the metacharacters are the real key since those are the ones pulling the lever.

Computers in general are stupid machines, very useless pieces of metal that have no purpose without software on them. Software makes the computer useful and software is written by humans. We teach, we instruct, we command computers through software. I say this because to a literate human a blank space is usually interpreted as the end of a word and the beginning of another. Or the end of a number and the beginning of the next. It is the clue for separation. Computers do not have that sense and the very basic software that makes them run doesn’t have it either. Yes, it tells them there are spaces, blanks, but to a computer a blank is no different from any other type of character. It’s just a character to it. Then comes software such as regular expressions and the languages that make use of them.

But why do you bring this ‘computers are clueless’ thing right now? ‘grep’ and similar utilities do the pattern match search line by line. They can’t make a distinction of words. In fact for them the only known boundary is the line. The content of the line is indifferent to them. It’s just a bunch of characters, may those be numbers, symbols, letters or blank spaces. But to you and I characters have a meaning and the boundaries between them are important. The word ‘cat’ is different than ‘implication’ although the latter contains the former. Cat. Impli-cat-ion. See? Regular expressions allow us to make such distinction so we only get the animal for ‘cat’ if that is what we are looking for. * 1

I’ve already mentioned metacharacters. And dumb computers. Let’s imagine we’ve got a text file and we need to pick out all the words beginning with the letter ‘a’. Enter anchors. Anchors are those metacharacters that fix a position in a line or in a word (yes, you can do that) and allow a more detailed search. For example the ‘^’ character, also named ‘caret’, can be used to find characters at the beginning of the line. The ‘$’, also known as dollar, can be used to find characters at the end of the line. There are also character classes, written between square brackets, which match any one of the characters listed inside them. Inside a character class the ‘-’ (dash) specifies a range, like the numbers between zero and nine: [0-9]. Using the dash you don’t need to specify all the numbers one by one like: [0123456789]. The same goes for the whole lowercase alphabet, where you can content yourself with [a-z]. Be aware character classes can modify the behaviour of metacharacters. So while the expression ‘^A’ means ‘match every line beginning with a capital A’, another expression such as ‘[^A]’ means ‘match any single character that is not a capital A’, wherever it sits on the line.
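The anchors and character classes described above can be tried out with a minimal sketch like the one below. The sample file, its contents and the variable names are all made up for illustration; only the grep patterns come from the text.

```shell
# Throwaway sample file; the contents are invented for this sketch.
sample=$(mktemp)
printf '%s\n' 'Apple pie' 'banana' 'A' 'room 101' 'Avocado 7' > "$sample"

# '^A'     : lines that begin with a capital A.
starts_with_A=$(grep '^A' "$sample")

# '[0-9]$' : lines that end with a digit (the dash marks a range).
ends_with_digit=$(grep '[0-9]$' "$sample")

# '[^A]'   : lines holding at least one character that is NOT a capital A.
# -c counts matching lines; only the line made of a lone "A" is left out.
has_non_A=$(grep -c '[^A]' "$sample")

printf '%s\n' "$starts_with_A" '---' "$ends_with_digit" '---' "$has_non_A"
rm -f "$sample"
```

Note how ‘^A’ and ‘[^A]’ share the caret but have nothing else in common: outside the brackets it anchors, inside them it negates.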

As a reminder: computers look at characters individually. You can tell them to make groups of them, but still they are clueless. In the expression ‘^cat’ the computer will look at the beginning of the line because of the metacharacter ‘^’ and then will look for the literals ‘c’, then ‘a’, and finally ‘t’. But this is at the beginning of the line, or at the end of the line if we use the ‘$’ metacharacter. Words? You might ask. Enter metasequences. For example if you want to look for the beginning of a word you will use ‘\<’, you will use ‘\>’ for the end of a word, and ‘\<word\>’ to look for the word ‘word’, excluding all other words that may contain the word ‘word’ such as ‘wordiness’. You can think of ‘\<’ and ‘\>’ as anchors for words: just as ‘^’ anchors the beginning of the line and ‘$’ anchors the end of the line, the metasequence ‘\<’ anchors the beginning of a word and ‘\>’ anchors the end of a word.
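A small sketch of the word anchors in action follows. The word list is invented for the occasion. Also bear in mind that ‘\<’ and ‘\>’ are an extension offered by GNU grep and other implementations rather than part of the POSIX standard, so this assumes a grep that supports them.

```shell
# Throwaway word list; the contents are invented for this sketch.
words=$(mktemp)
printf '%s\n' 'word' 'wordiness' 'a word here' 'sword' 'password1' > "$words"

# '\<word'   : "word" sitting at the beginning of a word.
begins=$(grep '\<word' "$words")

# 'word\>'   : "word" sitting at the end of a word.
# "password1" is out because the digit 1 still counts as part of the word.
ends=$(grep 'word\>' "$words")

# '\<word\>' : "word" as a whole word, nothing glued to either side.
whole=$(grep '\<word\>' "$words")

printf '%s\n' "$begins" '---' "$ends" '---' "$whole"
rm -f "$words"
```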

Other metacharacters are the dot, represented as a simple ‘.’, the bar ‘|’, also known as ‘pipe’ although it doesn’t work as such here, and the parentheses, represented as ‘()’. Dots are used to match any single character, so if you are looking for an unknown character in a specific position inside a line, a word or a number, the dot can be very useful. For example, dates are written with forward slashes between the numbers, but dashes are also used. Neither of them is a metacharacter outside square brackets (the dash only gains a special meaning inside a character class, where it marks a range), so you could match each separator literally. A dot in the separator position will match both, and anything else sitting there too, when you’re looking for a specific date. Bars work as separators, conjunctions if you will, like ‘or’ in the spoken word. So if you are looking for two different words and you just need to match either of those, you can use the bar: ‘cat|dog’. Parentheses, among other uses, can contain the set of words one has to match, separated by bars, as in ‘(dog|cat|rabbit)’. In grep, the bar and the parentheses are part of the extended syntax, enabled with ‘grep -E’ or ‘egrep’.

A few examples follow in this brief introduction to regular expressions:
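Here is a minimal sketch of the dot, the bar and the parentheses at work. Both sample files are made up for illustration, and the ‘15x12x1980’ line is there on purpose to show how permissive the dot is.

```shell
# Two throwaway files; their contents are invented for this sketch.
dates=$(mktemp)
pets=$(mktemp)
printf '%s\n' '15/12/1980' '15-12-1980' '15x12x1980' > "$dates"
printf '%s\n' 'my dog barks' 'a cat purrs' 'the rabbit hops' 'a cow moos' > "$pets"

# The dot accepts ANY character as separator, even the 'x'.
any_separator=$(grep -c '15.12.1980' "$dates")

# A character class accepts only the separators we list.
# The dash goes first inside the brackets so it is taken literally.
dash_or_slash=$(grep '15[-/]12[-/]1980' "$dates")

# Alternation inside parentheses; -E enables the extended syntax.
some_pets=$(grep -E '(dog|cat|rabbit)' "$pets")

printf '%s\n' "$any_separator" '---' "$dash_or_slash" '---' "$some_pets"
rm -f "$dates" "$pets"
```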

We’ve got a simple file called shopping which is a list of items we want to buy. When using the ‘cat’ command we’ll get the whole content of the file displayed.

[albert@Workstation ~/awk]$ cat shopping

eggs

cucumber

tomatoes

onions

soap

rice

olive oil

toilet paper

garbage bags

ham

meat

potatoes

Tuna

[albert@Workstation ~/awk]$

Let’s use some simple examples for anchors. Let’s say we want the items of that list that begin with the letter ‘r’. We’ll use the metacharacter ‘^’ to do so.

[albert@Workstation ~/awk]$ grep '^r' shopping

rice

[albert@Workstation ~/awk]$

Notice the only match is ‘rice’, which is the only line in our file that begins with ‘r’.

Let’s now use the same anchor type again and look for lines starting with the letter ‘t’.

[albert@Workstation ~/awk]$ grep '^t' shopping

tomatoes

toilet paper

[albert@Workstation ~/awk]$

It all seems right but as you can see at the bottom of the list there was some ‘Tuna’ to buy and it didn’t come out. This is because the character ‘t’ is not the same as ‘T’: the grep command is case sensitive by default. For this particular case we have two solutions. The easy one is to issue the ‘grep’ command with the ‘-i’ flag, which makes it case insensitive so it finds both ‘t’ and ‘T’.

[albert@Workstation ~/awk]$ grep -i '^t' shopping

tomatoes

toilet paper

Tuna

[albert@Workstation ~/awk]$

But you can of course make use of the power of regular expressions and use a formula like the following:

[albert@Workstation ~/awk]$ grep '^[tT]' shopping

tomatoes

toilet paper

Tuna

[albert@Workstation ~/awk]$

We’ve told the ‘grep’ program to look either for ‘t’ or ‘T’ at the beginning of the line.

Let’s now use another anchor and look for characters at the end of the line. For that we’ll use the ‘$’ metacharacter. And this time we’ll look for the lines ending in ‘s’.

[albert@Workstation ~/awk]$ grep 's$' shopping

eggs

tomatoes

onions

garbage bags

potatoes

[albert@Workstation ~/awk]$

Notice I said lines, because before ‘bags’ there is another word. See, when using these anchors, the output is the whole line containing what we were looking for, not just the word at the end.

As a bonus we can now talk about another UNIX utility named ‘nl’. Or in other words: number lines. This is not a part of regular expressions but it may expose how useful they are.

[albert@Workstation ~/awk]$ nl shopping | grep 's$'

1 eggs

3 tomatoes

4 onions

9 garbage bags

12 potatoes

[albert@Workstation ~/awk]$

By using nl we have first numbered all the lines in the file. We have then ‘piped’ the output (the pipe is represented by the bar ‘|’ symbol) into the ‘grep’ command, which searched for a specific pattern using a regular expression with the ‘$’ end of line anchor. This allows us to know where certain things sit in a file. See how this numbering perfectly matches the position of the words in the unfiltered output of nl:

[albert@Workstation ~/awk]$ nl shopping

1 eggs

2 cucumber

3 tomatoes

4 onions

5 soap

6 rice

7 olive oil

8 toilet paper

9 garbage bags

10 ham

11 meat

12 potatoes

13 Tuna

[albert@Workstation ~/awk]$
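As a side note, grep can number the matching lines on its own with the ‘-n’ flag. The sketch below uses a shortened copy of the shopping list; the temporary file and the variable name are just for illustration.

```shell
# Shortened copy of the shopping list, written to a throwaway file.
list=$(mktemp)
printf '%s\n' 'eggs' 'cucumber' 'tomatoes' 'onions' 'soap' > "$list"

# -n prefixes every matching line with its line number and a colon.
numbered=$(grep -n 's$' "$list")

printf '%s\n' "$numbered"
rm -f "$list"
```

One difference to keep in mind: by default nl only numbers non-empty lines, whereas ‘grep -n’ counts every line of the file, so the two numberings can drift apart on files with blank lines.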

Let’s move on and work with other metacharacters. Now instead of using our shopping list file we’ll make use of one of the paragraphs you’ve already read in this very article, the one marked with an asterisk followed by the number 1 at the end of it. But I’ve added three more lines to show you just one other little thing.

[albert@Workstation ~/awk]$ nl catfile

1 But why do you bring this 'computers are clueless' thing right now?

2 'grep' and similar utilities do the pattern match line by line.

3 They can't make a distinction of words.

4 In fact for them the only known boundary is the line.

5 The content of the line is indifferent to them.

6 It's just a bunch of characters, may those be numbers, symbols, letters or blank spaces.

7 But to you and I, characters have a meaning and the boundaries between them are important.

8 The word cat is different than implication although the latter contains the former.

9 cat

10 implication

11 category

12 See?

13 Regular expressions allow us to make such distinction so we get only the animal for cat if that is what we are looking for.

14 The importance of this is great.

15 Therefore you may want to learn this forever.

[albert@Workstation ~/awk]$

We will now make use of the ‘\<’ metasequence so we can find specific beginnings of words. Remember until now we’ve looked at the beginning and at the end of a line. We’ll now do the same with words instead of lines.

Let’s look for lines that contain words starting with ‘ch’.

[albert@Workstation ~/awk]$ grep -i '\<ch' catfile

It's just a bunch of characters, may those be numbers, symbols, letters or blank spaces.

But to you and I, characters have a meaning and the boundaries between them are important.

[albert@Workstation ~/awk]$

The word starting with ‘ch’, which is the pattern we are matching, is ‘characters’ in both lines.

Let’s now do the same but in reverse. Instead of looking for words beginning with ‘ch’ let’s see the ones which end with ‘ch’.

[albert@Workstation ~/awk]$ grep -i 'ch\>' catfile

'grep' and similar utilities do the pattern match line by line.

It's just a bunch of characters, may those be numbers, symbols, letters or blank spaces.

Regular expressions allow us to make such distinction so we get only the animal for cat if that is what we are looking for.

[albert@Workstation ~/awk]$

Now let’s go for the ‘little thing’ I said I wanted to show you. As a reminder, in this ‘catfile’ there are some words like: cat, implication and category, which all contain the character string ‘c·a·t’. So if we just looked for the pattern ‘c·a·t’ all we’d have to do is:

[albert@Workstation ~/awk]$ grep cat catfile

The word cat is different than implication although the latter contains the former.

cat

implication

category

Regular expressions allow us to make such distinction so we get only the animal for cat if that is what we are looking for.

[albert@Workstation ~/awk]$

Now if we want to look for the words starting with ‘c·a·t’ on each line we issue the following:

[albert@Workstation ~/awk]$ grep -i '\<cat' catfile

The word cat is different than implication although the latter contains the former.

cat

category

Regular expressions allow us to make such distinction so we get only the animal for cat if that is what we are looking for.

[albert@Workstation ~/awk]$

Notice the word ‘implication’ is no longer matched: it is not highlighted in the first line, and its own line, which previously sat between ‘cat’ and ‘category’, is gone. It does not appear because what we were looking for was lines containing words that start with the specific string ‘c·a·t’. Notice too that the word ‘category’ is still highlighted, for the very same reason: it is a word starting with the string ‘c·a·t’.

Let’s now look for the lines containing the string ‘c·a·t’ as a whole word only.

[albert@Workstation ~/awk]$ grep -i '\<cat\>' catfile

The word cat is different than implication although the latter contains the former.

cat

Regular expressions allow us to make such distinction so we get only the animal for cat if that is what we are looking for.

[albert@Workstation ~/awk]$

Not even ‘category’ shows up now since we’ve been more specific and looked for the string with both of its boundaries delimited. With the help of the metasequences ‘\<’ and ‘\>’ we have contained the string. Without this containment we would have seen 5 lines of output, as in the first example for this specific case. With it only 3 lines are shown, the ones containing the word ‘cat’.

Moving on to other metacharacters. We can now make use of the bar ‘|’ to look for lines matching any one of a series of items. I have an email file where some information appears but I only want to display very specific parts of it such as: from, subject and to. The bar and the parentheses belong to the extended syntax, hence ‘egrep’, which is the same as ‘grep -E’.
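Two grep flags go nicely with the word anchors, sketched here on a cut-down version of the file. Both ‘-w’ (match whole words) and ‘-o’ (print only the matched part) are extensions that, as far as I know, GNU grep and the BSD greps provide but POSIX does not require, so take this as an assumption about your grep. The file contents are a made-up subset.

```shell
# Cut-down stand-in for catfile; contents chosen for this sketch.
catfile=$(mktemp)
printf '%s\n' 'cat' 'implication' 'category' 'the cat sat' > "$catfile"

# Explicit word anchors, as used in the article.
explicit=$(grep '\<cat\>' "$catfile")

# -w asks grep to match the pattern only as a whole word.
with_w=$(grep -w 'cat' "$catfile")

# -o prints just the matched text instead of the whole line, which
# shows what would be highlighted on a colour terminal.
only_matches=$(grep -o '\<cat[[:alpha:]]*' "$catfile")

printf '%s\n' "$explicit" '---' "$with_w" '---' "$only_matches"
rm -f "$catfile"
```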

[albert@Workstation ~/awk]$ egrep -i '^(From|Subject|To): ' mail

From: Charlie Root <root@Soviet>

To: root@Soviet

Subject: Soviet daily run output

[albert@Workstation ~/awk]$

So here we have the desired lines. We now know Charlie Root sent an email to the root user with the daily run output of the Soviet node. Disclaimer: I use funny node names, some to piss off right wingers, some to make left wingers believe I am one of them. Bad jokes aside we have also made use of the parentheses to mark the boundaries of that word list, so that both the ‘^’ before it and the ‘: ’ after it apply to every alternative.

We’ll now make use of negated character classes. This is very simple. We have a file with a list of words. All but two of those words contain ‘qu’. Let’s first take a look at the content of the file:
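To see why the parentheses matter, here is a small sketch comparing the grouped expression with an ungrouped one. The three header lines are the ones shown above; the ‘Reply-To’ line, the body line and the example.invalid address are invented for illustration.

```shell
# Throwaway mail file; Reply-To and the body line are made up.
mail=$(mktemp)
printf '%s\n' \
  'From: Charlie Root <root@Soviet>' \
  'To: root@Soviet' \
  'Reply-To: nobody@example.invalid' \
  'Subject: Soviet daily run output' \
  '' \
  'Body text' > "$mail"

# Grouped: '^' and ': ' apply to each of the three alternatives.
grouped=$(grep -E -i '^(From|Subject|To): ' "$mail")

# Ungrouped: this reads as '^From' OR 'Subject' OR 'To: ', so
# "To: " is accepted anywhere on the line and Reply-To sneaks in.
ungrouped_count=$(grep -E -i -c '^From|Subject|To: ' "$mail")

printf '%s\n' "$grouped" '---' "$ungrouped_count"
rm -f "$mail"
```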

[albert@Workstation ~/awk]$ cat wordlist

Iraq

Iraqui

Iraquian

miqura

quasida

quintar

quoph

zaquum

Qantas

[albert@Workstation ~/awk]$

We will now look for the words where a ‘q’ is followed by something other than a ‘u’. The ‘u’ is the negated character, so the ‘q’ must be combined with some other character that is not ‘u’.

[albert@Workstation ~/awk]$ grep -i 'q[^u]' wordlist

Qantas

[albert@Workstation ~/awk]$

As you can see the only line containing a word that combines the ‘q’ character with some other that is not ‘u’ is the last one where you can read ‘Qantas’. Notice the word Iraq does not appear either since the condition was the ‘q’ character followed by some other one but ‘u’. Iraq does not contain any other character after the ‘q’.
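If you also wanted ‘Iraq’, where the ‘q’ is followed by nothing at all, one possible approach is to accept either a character that is not ‘u’ or the end of the line. This is a sketch on a shortened copy of the word list, not something taken from the original example.

```shell
# Shortened copy of the word list, written to a throwaway file.
wordlist=$(mktemp)
printf '%s\n' 'Iraq' 'Iraqui' 'miqura' 'quoph' 'Qantas' > "$wordlist"

# 'q' followed by one character that is not 'u' (the article's pattern).
followed_by_other=$(grep -i 'q[^u]' "$wordlist")

# 'q' followed by a non-'u' character OR by the end of the line.
# The alternation needs the extended syntax, hence -E.
other_or_end=$(grep -E -i 'q([^u]|$)' "$wordlist")

printf '%s\n' "$followed_by_other" '---' "$other_or_end"
rm -f "$wordlist"
```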

Let’s see another example of the same concept. We’ve now got a file containing dates, all written in different formats. You know, people use dots, dashes or forward slashes to mark the boundaries between days, months and years. The file looks like this:

[albert@Workstation ~/awk]$ cat dates

15.12.1980

23-02-1984

15/01/2010

[albert@Workstation ~/awk]$

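Let’s imagine we want to show all the lines containing dates but not those marked with dots. A negated character class gets us there for this file: ‘[^0-9.]’ matches any single character that is neither a digit nor a dot, so only the lines using some other separator come out.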

[albert@Workstation ~/awk]$ grep '[^0-9.]' dates

23-02-1984

15/01/2010

[albert@Workstation ~/awk]$

If we hadn’t added the range ‘0-9’ the result would’ve been different. Inside square brackets the dot is a literal dot, so ‘[^.]’ means ‘any character that is not a dot’. Every line in the file contains digits, which are not dots, so every line matches:

[albert@Workstation ~/awk]$ grep '[^.]' dates

15.12.1980

23-02-1984

15/01/2010

[albert@Workstation ~/awk]$
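Another way to leave out the lines with dots, sketched below, is to match the dot and then invert the result with the standard ‘-v’ flag. This is an alternative to the negated character class, not what the example above does; the temporary file mirrors the dates file.

```shell
# Same three dates as in the article, written to a throwaway file.
dates=$(mktemp)
printf '%s\n' '15.12.1980' '23-02-1984' '15/01/2010' > "$dates"

# '[.]' matches a literal dot; -v prints the lines that do NOT match.
no_dots=$(grep -v '[.]' "$dates")

printf '%s\n' "$no_dots"
rm -f "$dates"
```

The difference is in what gets tested: ‘[^0-9.]’ asks whether the line has some character other than digits and dots, while ‘-v’ with ‘[.]’ asks whether the line has no dot at all.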

One last example. We’ll make use of a similar dates file but with the same date repeated three times all using different conventions, dots, dashes and forward slashes.

[albert@Workstation ~/awk]$ cat dates2

15.12.1980

15-12-1980

15/12/1980

[albert@Workstation ~/awk]$

Let’s now get as output the ones with dots. One may try it first by just specifying the dot. This is the result:

[albert@Workstation ~/awk]$ grep '.' dates2

15.12.1980

15-12-1980

15/12/1980

[albert@Workstation ~/awk]$

Wrong. Remember that outside square brackets the dot stands in for any single character, much like the ‘?’ wild card in shell file name patterns. Therefore if we just put one dot it will match every line that contains at least one character, which here is everything in the file. We need another way.

[albert@Workstation ~/awk]$ grep '[.]' dates2

15.12.1980

[albert@Workstation ~/awk]$

By placing the dot inside a character class [], written with square brackets, we change its behaviour: it no longer means any character but a literal dot, so grep shows us only the lines containing dots.

The same method applies to get the dashed lines.

[albert@Workstation ~/awk]$ grep '[-]' dates2

15-12-1980

[albert@Workstation ~/awk]$

But what if we want two of them? How about dots and dashes only?

[albert@Workstation ~/awk]$ grep '[-.]' dates2

15.12.1980

15-12-1980

[albert@Workstation ~/awk]$

Yet again the square brackets eliminate the dot as a ‘whatever character’, and the dash placed first inside them is taken literally rather than as a range. The expression only looks for lines that contain a dot or a dash.
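Square brackets are not the only way to get a literal dot. The sketch below shows two more that should give the same result: escaping the dot with a backslash, and the ‘-F’ flag, which tells grep to treat the whole pattern as a fixed string. The temporary file mirrors dates2.

```shell
# Same three lines as dates2, written to a throwaway file.
dates2=$(mktemp)
printf '%s\n' '15.12.1980' '15-12-1980' '15/12/1980' > "$dates2"

# A bare dot matches any character: every line counts.
any_char=$(grep -c '.' "$dates2")

# Three ways of asking for a literal dot.
bracketed=$(grep '[.]' "$dates2")   # dot inside a character class
escaped=$(grep '\.' "$dates2")      # dot escaped with a backslash
fixed=$(grep -F '.' "$dates2")      # pattern taken as a fixed string

printf '%s\n' "$any_char" "$bracketed" "$escaped" "$fixed"
rm -f "$dates2"
```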

Let’s add another few lines to our dates2 file. We’ll add some combinations with letters, dots, dashes, and some lines with only letters, not any number. It looks like this:

[albert@Workstation ~/awk]$ cat dates2

15.12.1980

15-12-1980

15/12/1980

December 15th 1980

December-15th-1980

December/15th/1980

December.15th.1980

December fifteenth nineteen eighty

December-fifteenth nineteen-eighty

December/fifteenth nineteen/eighty

December.fifteenth nineteen.eighty

[albert@Workstation ~/awk]$

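We can now play a bit more. Let’s look for numbers and dots. Keep in mind a character class matches one single character, so what we really get is every line containing at least one digit or one dot.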

[albert@Workstation ~/awk]$ grep '[0-9.]' dates2

15.12.1980

15-12-1980

15/12/1980

December 15th 1980

December-15th-1980

December/15th/1980

December.15th.1980

December.fifteenth nineteen.eighty

[albert@Workstation ~/awk]$

Dashes and numbers? Same story: every line containing at least one digit or one dash.

[albert@Workstation ~/awk]$ grep '[0-9-]' dates2

15.12.1980

15-12-1980

15/12/1980

December 15th 1980

December-15th-1980

December/15th/1980

December.15th.1980

December-fifteenth nineteen-eighty

[albert@Workstation ~/awk]$
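To get the lines made up of nothing but digits and dots (or digits and dashes), one option is to anchor the character class at both ends of the line and let it repeat. The sketch below assumes the extended syntax (‘-E’) for the ‘+’ quantifier, meaning ‘one or more’, and works on a shortened copy of dates2.

```shell
# Shortened copy of dates2, written to a throwaway file.
dates2=$(mktemp)
printf '%s\n' \
  '15.12.1980' \
  '15-12-1980' \
  '15/12/1980' \
  'December.15th.1980' \
  'December-fifteenth nineteen-eighty' > "$dates2"

# From start (^) to end ($) the line may hold only digits and dots.
only_digits_dots=$(grep -E '^[0-9.]+$' "$dates2")

# Same idea with the dash; placed last in the brackets it is literal.
only_digits_dashes=$(grep -E '^[0-9-]+$' "$dates2")

printf '%s\n' "$only_digits_dots" '---' "$only_digits_dashes"
rm -f "$dates2"
```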

So with this I believe there are enough simple examples in use to conclude this brief introduction to regular expressions.
