ISBN 10: 0-596-52812-4/ISBN 13: 9780596528126 [Pages: 542]
by Jeffrey E. F. Friedl
Reviewed by Robert Delwood
The mundane text search has long been overlooked. DOS users remember the DIR command to find files (such as DIR *.txt), Windows users have the Start > Search feature, and Microsoft Word users have the Find dialog. These are all convenient tools, but they are limiting in that you generally have to know exactly what to look for. A generalized pattern search is not always possible. The scope of the problem grows with increasingly huge disks and file proliferation. How do you find information across many file locations and formats, especially when you don’t know the specific text? That’s where regular expressions (also called regex) come in. These are extremely flexible search or replace methods capable of finding text patterns in a file or files.
As a simple example, you’d like to make sure that a team member’s name, Jeffrey, is spelled correctly each time. Forms you might want to look for include Jeff, Jeffrey, Jeffery, Geoffery, and Geoffrey. In Word, you could do a find, or more likely five finds, for a single file and still end up with mismatches such as jeffreyi or Jeffers. Next, you want to search all the files on the Web server. Later, you want to change all those occurrences to Jeffrey. Finally, you want to make sure that Jeffrey always appears with his full name, Jeffrey Smith. Even the simplest request is impractical with 32,000 files, and the last three requests are not possible. A regex can do this. For a more complex example, you need to check all your files for matching <p> and </p> statements. These are real-world examples of problems that need to be solved. The last example can be solved with a regex such as: % perl -0ne 'print "$ARGV\n" if s/<p>//ig != s/<\/p>//ig' *
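For readers who do not use Perl, the same check can be sketched in Python. This is a minimal illustration rather than anything taken from the book: the function name has_unbalanced_p_tags is hypothetical, and like the one-liner it counts only bare <p> and </p> tags, ignoring tags that carry attributes.

```python
import re


def has_unbalanced_p_tags(html: str) -> bool:
    """Return True when the number of <p> tags differs from the number of </p> tags."""
    # re.IGNORECASE mirrors the /i flag in the Perl one-liner.
    opening = len(re.findall(r"<p>", html, flags=re.IGNORECASE))
    closing = len(re.findall(r"</p>", html, flags=re.IGNORECASE))
    return opening != closing


print(has_unbalanced_p_tags("<p>one</p><P>two"))  # True
print(has_unbalanced_p_tags("<p>one</p>"))        # False
```

To scan many files, the function could be applied to the contents of each file in turn, printing the names of the files for which it returns True.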
This statement likely doesn’t make sense right now, but it does show the power of a single line. The examples are limitless. You can search for date patterns, turn URLs into links (a common enough task to get its own name: urlify), change HTML tags, or extract an IP address. Regex is an invaluable skill, and you may wonder how you ever got along without it. TiVo users already know this feeling.
A series of books attempts to demystify these tools. Mastering Regular Expressions (third edition) by Jeffrey E. F. Friedl tries to make sense of them. Part tutorial and part cookbook, it’s mostly a story introducing a regex state of mind. You gain new perspectives for creating solutions, along with a deeper understanding. This is consistent with the author’s “teach a man to fish” approach. Regular Expression: Pocket Reference by Tony Stubblebine is a small reference book (which does fit into a shirt pocket) that concisely lists the syntax and features of 11 common implementations, including Perl, .NET, Java, the vi editor, and shell tools.
There are several things to understand first.
- Regex is not a singular entity. It is a generalized concept with individual implementations such as Unix, POSIX, .NET, Java, Perl, Visual Basic, and VBScript, among others. Each has its own version and differs slightly.
- You will need an application to use regex. Windows has many downloadable versions (many of them free). eGrep is a common tool available at http://www.gnu.org/directory/grep.html. Windows has two older command line versions still available: FIND and FINDSTR.
- Regexes are programming languages ranging from simple to complex. Programmer-writers and programmers can make the most of these tools, but all users can write their own expressions. If nothing else, eGrep is a better replacement for the Windows Search.
You can start with simple forms of finding exact text such as “cat,” which finds all words with the letters c-a-t in that order. This includes cat, catalogue, and vacation. Regexes are not word based. It’s better to think in terms of patterns than words.
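The point that regexes match patterns rather than words can be seen in a short sketch. Python's re module is used here only as one convenient implementation; the \b word-boundary anchor in the second search is an addition not discussed above, shown as one way to restrict the match to the whole word.

```python
import re

words = ["cat", "catalogue", "vacation", "dog"]

# A bare pattern matches anywhere inside the text, not only whole words.
substring_hits = [w for w in words if re.search(r"cat", w)]

# \b marks a word boundary, so this pattern matches only the standalone word.
whole_word_hits = [w for w in words if re.search(r"\bcat\b", w)]

print(substring_hits)   # ['cat', 'catalogue', 'vacation']
print(whole_word_hits)  # ['cat']
```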
Matching exact text is only the starting point. You can match any one of a group of characters by using brackets ([]). <H[123]> finds HTML headings 1 (<H1>), 2 (<H2>), and 3 (<H3>), but no others. A dash indicates a range: <H[1-3]> has the same results. You can group alternatives by using parentheses and the vertical bar. The Jeffrey problem from above can be solved a number of ways, including (Geo|Je)ff(re|er)y. A caret at the start of a bracket excludes characters. <H[^4-9]> finds all headings except 4 through 9 (strictly, any tag of that shape whose middle character is not 4 through 9).
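The bracket, range, grouping, and caret forms can be tried out in a few lines. This is a sketch using Python's re module; the sample HTML string and the variable names are invented for illustration.

```python
import re

html = "<H1>Title</H1> <H2>Intro</H2> <H4>Detail</H4> <HR>"

listed = re.findall(r"<H[123]>", html)    # explicit list of characters
ranged = re.findall(r"<H[1-3]>", html)    # same set written as a range
negated = re.findall(r"<H[^4-9]>", html)  # anything except 4 through 9

print(listed)   # ['<H1>', '<H2>']
print(ranged)   # ['<H1>', '<H2>']
print(negated)  # ['<H1>', '<H2>', '<HR>']

# Alternation inside parentheses covers the four long spellings of the name.
name_pattern = re.compile(r"(Geo|Je)ff(re|er)y")
spellings = ["Jeff", "Jeffrey", "Jeffery", "Geoffery", "Geoffrey", "Jeffers"]
accepted = [s for s in spellings if name_pattern.fullmatch(s)]

print(accepted)  # ['Jeffrey', 'Jeffery', 'Geoffery', 'Geoffrey']
```

Note that the negated form also picks up <HR>, because R is simply "not 4 through 9", and that this particular name pattern does not cover the short form Jeff.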
Taking pattern matching one step further, the dot character (.) matches any single character (similar to the ? symbol in DOS); followed by an asterisk (.*), it matches any sequence of characters (similar to the * symbol in DOS). Editors sloshing through Word-generated HTML may appreciate those matches within a span statement.
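A brief sketch of the dot in Python's re module, where a single dot stands for exactly one character (any character except a newline, by default) and a dot followed by an asterisk stands for a run of any length. The sample strings are made up for illustration.

```python
import re

# One dot, one character: "coat" has two letters between c and t, so it is skipped.
three_letters = re.findall(r"c.t", "cat cot cut coat")

# The dot does not care what the character is, so <HR> matches as well as <H1>.
short_tags = re.findall(r"<H.>", "<H1> <H2> <HR> <H10>")

# Dot plus asterisk runs to the last ">" on the line; here there is only one.
span_tags = re.findall(r"<span.*>", 'a <span lang="en"> b')

print(three_letters)  # ['cat', 'cot', 'cut']
print(short_tags)     # ['<H1>', '<H2>', '<HR>']
print(span_tags)      # ['<span lang="en">']
```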
Repetition quantifiers match multiple occurrences. The asterisk (*) matches zero or more of the immediately preceding character (it would be all right not to find any matches). For instance, the <HR> tag can take several forms, including having superfluous spaces, such as <HR●> (● represents a space character). A basic search of <HR> would not find those instances. A better search, <HR●*>, matches any number of trailing spaces. The <HR> tag can also take the form <HR SIZE=14> (along with any number of extra spaces). The best search, <HR(●+SIZE●*=●*[0-9]+)?●*>, catches any HR statement regardless of the SIZE value or the spacing inside the tag. The more you know about the target pattern, the better the searches you can construct. Of course, there is a book’s worth of additional options.
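The three HR searches can be compared side by side. The sketch below uses Python's re module with ordinary spaces in place of the ● marker; the sample tags and variable names are assumptions chosen for illustration.

```python
import re

basic = re.compile(r"<HR>")                          # exact text only
spaces = re.compile(r"<HR *>")                       # allows trailing spaces
full = re.compile(r"<HR( +SIZE *= *[0-9]+)? *>")     # optional SIZE attribute

samples = ["<HR>", "<HR  >", "<HR SIZE=14>", "<HR  SIZE = 14 >", "<HR SIZE=big>"]

# For each sample tag, record which of the three patterns matches the whole tag.
results = {
    tag: (
        bool(basic.fullmatch(tag)),
        bool(spaces.fullmatch(tag)),
        bool(full.fullmatch(tag)),
    )
    for tag in samples
}

for tag, flags in results.items():
    print(f"{tag!r:22} basic={flags[0]} spaces={flags[1]} full={flags[2]}")
```

Only the third pattern accepts the SIZE forms, and none of the three accepts a non-numeric SIZE value such as "big".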
The book’s 515 pages are divided into three distinct parts:
The first three chapters address the semantics and creation of individual statements and will have the most interest to readers new to regex. Reading these chapters should give new users a solid foundation and give them the ability to write their own searches.
The next three chapters contain details about how the expressions work. This information about the search engines aims at optimizing the statements for speed and accuracy. Experienced users may find these three chapters the most applicable because they explain engine types, matching hierarchies, anchors, and order of precedence.
The last four chapters are each specific to one implementation: .NET, PHP, Java, and Perl.
It’s hard to write a book for a wide range of users. In many cases, such a book doesn’t satisfy any set of readers. However, Mastering Regular Expressions does succeed in serving the different goals of different users. In short, new users get a proper introduction to the power and versatility of regex, and experienced users will be able to optimize their searches.