An Introduction to Regular Expressions

Introduction

Regular expressions (regex) are patterns that define sets of strings. Informally, we can think of a regular expression as a set of instructions for identifying, and possibly replacing, strings.

Consider the following sentence: "Catastrophically, the cat almost fell on the concrete, but fortunately his owner managed to catch him in time." Two regular expressions we could apply to the words of this sentence are:

  • The string "cat". Ignoring case, this specifies the set of words that contain it: {"Catastrophically", "cat", "catch"}
  • The string " cat ", i.e. "cat" as an isolated word (delimited by white space characters). This specifies the set {"cat"}
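
A minimal sketch of these two patterns in R, assuming we first split the sentence into words. The variable names (sentence, words, contains_cat, only_cat) are illustrative only. Note that "Catastrophically" is matched only if case is ignored, and that the isolated word is written here with the anchors ^ and $ (described under Syntax) rather than with surrounding spaces.

```r
sentence <- "Catastrophically, the cat almost fell on the concrete, but fortunately his owner managed to catch him in time."

# Split on runs of white space and punctuation to get the individual words
words <- strsplit(sentence, "[[:space:][:punct:]]+")[[1]]

# Words containing "cat" anywhere, ignoring case
contains_cat <- grep("cat", words, value = TRUE, ignore.case = TRUE)

# Words that are exactly "cat": ^ and $ anchor the match to the whole string
only_cat <- grep("^cat$", words, value = TRUE)

print(contains_cat)
print(only_cat)
```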

There are several uses for regular expressions when it comes to analyzing data in R:

1. Manipulating text data (e.g. selecting all rows where a column matches a given expression; parsing a hyperlink; renaming fields).

2. Manipulating model objects (e.g. dropping a subset of predictors from a linear model).

3. Navigating R functions (e.g. finding all functions that have a given argument).

Using regular expressions often leads to shorter, more flexible code.
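
The uses listed above can be sketched roughly as follows. The data frame dat, its column names, and the formula y ~ a * b * c are hypothetical and chosen only for illustration. For the third use, the sketch searches function names with apropos(), which is a simpler task than searching by argument.

```r
# Hypothetical data: sample IDs with a group prefix
dat <- data.frame(id = c("ctl_01", "trt_01", "trt_02"),
                  y.old = c(1.2, 3.4, 5.6),
                  stringsAsFactors = FALSE)

# 1. Select the rows whose id starts with "trt"
treated <- dat[grepl("^trt", dat$id), ]

# 1. Rename fields: strip a ".old" suffix from the column names
names(dat) <- sub("\\.old$", "", names(dat))

# 2. Find the three-way and higher-order terms of a model formula
term_labels <- attr(terms(y ~ a * b * c), "term.labels")
high_order <- grep(":.*:", term_labels, value = TRUE)

# 3. List the functions on the search path whose names start with "grep"
grep_funs <- apropos("^grep", mode = "function")

print(treated)
print(names(dat))
print(high_order)
```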

Syntax

A regular expression is a string composed of characters and metacharacters. Metacharacters are special characters reserved for the purpose of describing patterns:

$ * + . ? [ ] ^ { } | ( ) \

The meanings of the different metacharacters are tabulated below.

Metacharacter  Description  Example

.  Match any single character.
> wald(fit0, ":.*:")  # test all 3-way and higher-order interactions in a model

?  Match the preceding item at most once.
> sub("^[[:space:]]?", "", "  a string")  # trim 1 initial white space
[1] " a string"

*  Match the preceding item any number of times (including zero).
> sub("^[[:space:]]*", "", "  a string")  # trim all initial white space
[1] "a string"

+  Match the preceding item at least once.
> grep("([[:alpha:]]+)", c("word", "1", "/"), value=TRUE)  # match all words
[1] "word"

^  Match the empty string at the BEGINNING of a line. If used at the start of a character class (below), match anything EXCEPT the characters that follow.
> grep("^m", c("m$x1", "m$x2", "m3", "xm", "y"), value=TRUE)
[1] "m$x1" "m$x2" "m3"

$  Match the empty string "" at the END of a line.
> sub(" +$", "", " a string     ")  # trim white space at end
[1] " a string"

|  OR (infix) operator. regex1|regex2 matches regex1 or regex2.
> gsub("^ *| *$", "", "   a string     ")  # remove white space at beginning and end
[1] "a string"

( )  Group characters into a single item.
> gsub("([[:alpha:]]+)[[:space:]]\\1", "\\1", "the the cat")  # delete repeating words
[1] "the cat"

[ ]  Character class brackets. Match any one of the characters in the brackets.
> grep("[aeiou]", c("cat", "tlk", "x"), value=TRUE)  # match words with vowels
[1] "cat"

The second-to-last example is particularly interesting. The pattern we matched was a word (one or more alphabetic characters) followed by a white space character, followed by the same word. To specify this pattern, we used the backreference "\\1", which denotes "the string matched by the first set of parentheses". (Similarly, if we have two sets of parentheses, "\\2" denotes the string matched by the second set, and so on.)
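
As a further illustration of backreferences, here is a minimal sketch. The input strings are made up for this example. It uses "\\1" and "\\2" in a replacement, and "\\1" inside the pattern itself.

```r
# Swap "Last, First" into "First Last": \\1 is the first group, \\2 the second
swapped <- sub("([[:alpha:]]+), ([[:alpha:]]+)", "\\2 \\1", "Doe, Jane")

# Inside a pattern, \\1 must repeat whatever the first group matched:
# keep the strings in which some letter appears twice in a row
doubled <- grep("([[:alpha:]])\\1", c("letter", "word", "book"), value = TRUE)

print(swapped)
print(doubled)
```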

Sometimes we want to treat a metacharacter as a regular character. To do this, precede it with "\\". For example,

> grep("$", c("m$x1", "m$x2", "m3"), val=T) #match everything with empty string at end
[1] "m$x1" "m$x2" "m3"  
> grep("\\$", c("m$x1", "m$x2", "m3"), val=T)  #match everything with a $
[1] "m$x1" "m$x2"

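One exception: to treat "\" as a regular character, use "\\\\", since each of the two backslashes the regular expression needs must itself be escaped inside an R string.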
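
A short sketch of the escaping rules, using hypothetical file names. The last line uses the fixed = TRUE argument of grep(), which is not discussed in this article; it treats the whole pattern as a literal string and is shown only as an alternative to escaping.

```r
files <- c("data.csv", "data_csv", "C:\\temp\\data.csv")

# Unescaped "." matches any character, so "data_csv" matches too
any_char <- grep("data.csv", files, value = TRUE)

# Escaped "\\." matches only a literal period
literal_dot <- grep("data\\.csv", files, value = TRUE)

# A literal backslash needs "\\\\" in the pattern
forward <- gsub("\\\\", "/", files[3])

# Alternative: fixed = TRUE switches off all metacharacters
literal_dollar <- grep("$", c("m$x1", "m3"), value = TRUE, fixed = TRUE)

print(literal_dot)
print(forward)
print(literal_dollar)
```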

Character classes can be used to define a rich set of patterns. Here are some useful examples:

Expression Meaning
"[[:digit:]]" Digits
"[[:alpha:]]" Alphabetic characters
"[[:space:]]" Any white space character
"[[:punct:]]" Punctuation Characters
"[[:alnum:]]" Alphanumeric characters ( [:alpha:] and [:digit:] )
"[0-9]" Digits
"[a-z]" Lower-case letters
"[A-Z]" Upper-case letters
"[a-zA-Z]" Alphabetic characters
"[^a-zA-Z]" Non-alphabetic characters
"[a-zA-Z0-9]" Alphanumeric characters
"[]$*+.?[^{|(\\#%&~_/<=>'!,:;`\")}@-]" Punctuation characters
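
The classes above can be combined with anchors and quantifiers. The sketch below uses a made-up vector x. The last line uses the brace quantifier {n} ("match the preceding item exactly n times"), which appears in the list of metacharacters but is not tabulated above.

```r
x <- c("abc123", "ABC", "42", "hello world!", "")

has_digit  <- grepl("[[:digit:]]", x)     # contains at least one digit
all_alpha  <- grepl("^[[:alpha:]]+$", x)  # consists only of letters
no_letters <- grepl("^[^a-zA-Z]+$", x)    # non-empty and free of ASCII letters

# Several classes can share one pair of outer brackets:
# remove every punctuation character and every digit
cleaned <- gsub("[[:punct:][:digit:]]", "", x)

# Exactly two digits and nothing else
two_digits <- grepl("^[0-9]{2}$", x)

print(cleaned)
```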

Beyond the Basics

R has two engines for processing regular expressions: (1) POSIX extended regular expressions (the default), and (2) Perl-like regular expressions (set perl=TRUE).

While we can do a lot with the default engine, using Perl-like regexes can provide additional functionality. For example, we can selectively convert between lower and upper case:

> txt <- "a test of capitalizing"
> gsub("(\\w)(\\w*)", "\\U\\1\\L\\2", txt, perl=TRUE) #capitalize the first letter of every word
[1] "A Test Of Capitalizing"

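Note that here "\\w" matches a single word character (a letter, digit, or underscore), so "(\\w)(\\w*)" matches a whole word. For purely alphabetic words, the same match could be obtained with "[[:alpha:]]" in place of "\\w".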
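
Two further variations on the case-conversion example, as a sketch. They reuse the same txt string; the names shout and second_word are illustrative. The "\\E" marker, which ends a case conversion, is documented for perl=TRUE replacements but is not used elsewhere in this article.

```r
txt <- "a test of capitalizing"

# \\U upper-cases everything that follows it in the replacement
shout <- gsub("(\\w+)", "\\U\\1", txt, perl = TRUE)

# \\E ends the conversion, so only the second word is upper-cased
second_word <- sub("(\\w+) (\\w+)(.*)", "\\1 \\U\\2\\E\\3", txt, perl = TRUE)

print(shout)
print(second_word)
```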

More advanced examples will be added here in the near future.

References

Regular Expressions as used in R. Official R help page; not very user friendly.

Introduction to String Matching and Modification in R Using Regular Expressions, by Svetlana Eden.

Regular Expressions in R, by Roger Peng (lecture from the Computing for Data Analysis course).

Programmer's Niche: Little Bits of String, by Thomas Lumley (in R News Vol 3/3, Dec 2003).
