by Robert Delwood, STC Houston Webmaster
Locating text within files has always been a problem. Windows has a search feature, but even in Vista, it has several limitations. It returns only a file list, and it cannot restrict a search to whole words. That is, a search for east also finds feast. In many instances, you might want to search for patterns instead. For example, you might want to find document names like document A-54.6 when you don’t know the exact numbers. The good news is that generalized pattern matching (called wildcards or regular expressions) is already available on many computers.
Regular Expressions
With regular expressions, you can match text patterns rather than exact text. The Windows search feature supports only the * and ? wildcards. The asterisk matches any number of characters, and the question mark matches a single character. For example, the command DIR File?.txt finds File1.txt and FileA.txt, but not File10.txt or FileAB.txt.
However, expanding this capability enables a remarkably versatile search. Two common implementations are Microsoft Word’s wildcard Find feature and FINDSTR, a utility of the command interpreter (previously known as DOS).
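The difference between the two wildcards can be tried outside Windows as well. The following is a minimal sketch in Python, whose standard fnmatch module follows the same ? and * conventions; the helper name match_names is made up for this illustration.

```python
from fnmatch import fnmatchcase

# The four file names used in the example above.
names = ["File1.txt", "FileA.txt", "File10.txt", "FileAB.txt"]


def match_names(pattern, candidates):
    """Return the candidates that match a ?/* wildcard pattern (case-sensitive)."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


# ? stands for exactly one character.
single = match_names("File?.txt", names)
# * stands for any number of characters.
any_length = match_names("File*.txt", names)

print(single)
print(any_length)
```

The question mark pattern keeps only File1.txt and FileA.txt, while the asterisk pattern keeps all four names.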
Microsoft Word’s Wildcards
Wildcards are Word’s implementation of regular expressions. To use them, open the Find and Replace dialog box, click More, and select Use Wildcards. After you select the wildcard option, the Match Case and Find Whole Words Only options become unavailable, which indicates that pattern matching, and not word matching, is being used. This difference might confuse you the first few times. For example, the wildcard pattern th? matches the, tho (as part of the word thought), and th followed by a space. This kind of matching sounds only marginally useful, and perhaps it is, but that impression shows you are still thinking in terms of words. Of the familiar ? and * wildcards, the asterisk is the more powerful. The search for c*t matches cat, but without word boundaries, it also matches can fly unless weat in the phrase as fast as it can fly unless weather grounds the plane. Therefore, the asterisk alone might not limit the search enough. To start thinking in terms of patterns, use a combination of the notations.
You can gain control by limiting the search to specific sets of characters by using brackets, [ ], to define character sets. The search for document [ABCDE] makes a case-sensitive match of document A and document D, but not document F. You can shorten the set to a range by using the hyphen, for example, [A-E]. You can exclude characters with the negation character (^, the caret), as in document [^F-Z], which excludes document F through document Z. Parentheses group an exact sequence of characters to match. The search for (ing) matches all instances of ing. To match text at the beginning or end of a word, use < or >, respectively. The search (ing)> matches ing only at the end of a word.
Also, repetition options find patterns occurring a specified number of times. Use curly brackets, { }, to enclose the number of times to find the preceding character or pattern. For example, <10{2}> searches for two occurrences of the zero, as in 100 but not 1000. The angle brackets are necessary to find whole-word occurrences; otherwise, the pattern would also be found inside longer numbers. You can expand this search by adding a maximum number of repetitions. For example, <[0-9]{1,4}> matches all numbers of one to four digits, that is, 0 through 9999.
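The runaway asterisk is easy to reproduce. The sketch below uses Python's re module rather than Word, so the notation is an assumption on my part: .*? is used as a rough stand-in for Word's asterisk, and \b stands in for a word boundary.

```python
import re

phrase = "as fast as it can fly unless weather grounds the plane"

# Without word boundaries, the match runs from the first c to the next t.
loose = re.search(r"c.*?t", phrase).group()

# With word boundaries, only whole words that start with c and end with t match.
bounded_in_phrase = re.findall(r"\bc\w*t\b", phrase)
bounded_in_sample = re.findall(r"\bc\w*t\b", "the cat can fly")

print(loose)
print(bounded_in_phrase)
print(bounded_in_sample)
```

The unbounded pattern returns can fly unless weat, while the bounded pattern finds nothing in the phrase and only cat in the second sample.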
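For readers who want to experiment with sets, ranges, and negation away from Word, here is a small sketch in Python's re module. Brackets and the caret behave much the same there; the end-of-word marker is written \b instead of >, and the sample sentences are invented for the illustration.

```python
import re

text = "document A, document D, document F"

# A range inside brackets: any one letter from A through E.
in_range = re.findall(r"document [A-E]", text)

# A negated range: any one character that is not F through Z.
not_f_to_z = re.findall(r"document [^F-Z]", text)

# ing only at the end of a word (\b plays the role of Word's > marker).
ing_words = re.findall(r"\w*ing\b", "singing a song in the ring")

print(in_range)
print(not_f_to_z)
print(ing_words)
```

Both bracket patterns find document A and document D but skip document F, and the last pattern finds singing and ring but not song.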
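Repetition counts can be checked the same way. This sketch again assumes Python's re module, where the curly brackets work as described here and \b replaces the < and > word markers; the sample numbers are arbitrary.

```python
import re

# Exactly two zeros after the 1, as a whole word.
exactly_100 = re.findall(r"\b10{2}\b", "10 100 1000")

# One to four digits, as a whole word.
up_to_four_digits = re.findall(r"\b[0-9]{1,4}\b", "7 42 9999 12345")

# Without the word boundaries, the pattern is also found inside 1000.
unbounded = re.findall(r"10{2}", "10 100 1000")

print(exactly_100)
print(up_to_four_digits)
print(unbounded)
```

The bounded patterns return only 100 in the first case and 7, 42, and 9999 in the second, while the unbounded pattern also picks up the 100 inside 1000.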
An HTML Example
You might notice that Word produces atrocious HTML code. Wildcards can help you create a clean file in relatively few steps. The following example uses a Word 2003 .htm file containing only “Hello, Word.”
Open the .htm document in a browser, display the HTML code, and copy it into a Word document. The code is treated as text in this case. The relevant text might look similar to the following, which is typical of the class and span proliferation.
<p class=MsoNormal style='margin-bottom:12.0pt;line-height:150%'><span
style='font-size:8.5pt;line-height:150%;font-family:Verdana'>Hello, Word.<o:p></o:p></span></p>
To clean up the code, use the following steps:
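- In the Word document, open the Find and Replace dialog box.
- Select Use Wildcards.
- To clean up the p tags, find \<p*\> and replace with <p>. In wildcard mode, < and > mark the beginning and end of a word, so each needs a preceding backslash to match the literal character.
- To remove the span tags, find \<span*\> and replace with nothing.
- To remove the /span tags, find \</span\> and replace with nothing.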
You can also remove head, div, and other annoying tags. This example shows the versatility of the asterisk, especially with the p and span tags.
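The same three replacements can be sketched as a script for anyone who would rather not paste code into Word. This is only an illustration in Python's re module, not Word's wildcard syntax: [^>]* is used in place of the asterisk so that a match stops at the end of the tag, and \b keeps the p pattern from touching other tags that merely begin with p.

```python
import re

# The sample markup from the example above, including its line break.
html = (
    "<p class=MsoNormal style='margin-bottom:12.0pt;line-height:150%'><span\n"
    "style='font-size:8.5pt;line-height:150%;font-family:Verdana'>"
    "Hello, Word.<o:p></o:p></span></p>"
)

# Step 1: reduce any opening p tag, with all its attributes, to a bare <p>.
cleaned = re.sub(r"<p\b[^>]*>", "<p>", html)
# Step 2: remove opening span tags together with their attributes.
cleaned = re.sub(r"<span\b[^>]*>", "", cleaned)
# Step 3: remove closing span tags.
cleaned = cleaned.replace("</span>", "")

print(cleaned)
```

The result is <p>Hello, Word.<o:p></o:p></p>; the leftover o:p tags could be removed with one more replacement of the same kind.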
Command Interpreter, FINDSTR
Word’s wildcards are useful for finding strings within a file, but for finding strings within more than one file, especially closed files, the FINDSTR (Find String) utility is still useful. DOS is not dead, and the capabilities of FINDSTR are not present in other Windows functions.
To use the FINDSTR utility:
- Open the command window: choose Start -> Run, type CMD, and press Enter.
- At the DOS prompt, type FINDSTR /? to display the Help for the utility, which describes the command, its parameters, and its syntax.
The command requires only two parameters: the string to find and the files to search.
A basic search might be for literal text, such as seven-point Roman font HTML tags in any HTML file in the current folder:
findstr /m /c:"<span style=""""font:7.0pt 'Times New Roman'; """">" *.htm
The /m option displays only the names of files containing a match, the /c: option specifies a literal search string, and the double sets of quotation marks indicate that quotation marks are actually in the search string.
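On a computer without FINDSTR, the idea of listing only the names of files that contain a literal string can be approximated in a few lines. The sketch below is in Python and is not a reimplementation of the utility; the function name files_containing and its parameters are invented for the illustration, and it assumes the files are text that can be read as UTF-8.

```python
from pathlib import Path


def files_containing(literal, folder=".", glob="*.htm"):
    """Return the sorted names of files in folder whose text contains literal."""
    hits = []
    for path in sorted(Path(folder).glob(glob)):
        if not path.is_file():
            continue
        # Undecodable bytes are skipped rather than treated as errors.
        text = path.read_text(encoding="utf-8", errors="ignore")
        if literal in text:  # plain, case-sensitive substring test
            hits.append(path.name)
    return hits


if __name__ == "__main__":
    # Names of .htm files in the current folder that use the seven-point font.
    print(files_containing("font:7.0pt 'Times New Roman'"))
```

Because the test is a plain substring comparison, no characters in the search string have a special meaning, which is the same idea as a literal search.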
The versatility comes with regular expressions. FINDSTR’s notations are slightly different from Word’s. Brackets, negated brackets, and ranged brackets are the same as Word’s wildcards. The beginning- and end-of-word markers are almost the same: \< and \>. However, to match any character, use the period (.), and to repeat the preceding character zero or more times, use the asterisk (*). Neither has minimum or maximum options. To make the previous search more general and find any occurrences regardless of font size, use the following form:
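findstr /r /c:"font:.*pt 'Times New Roman'" *.htm
The /r option specifies a regular expression search, meaning the period and asterisk before the pt act as wildcards that match a font size of any length. The /c: option keeps the whole string together as one pattern; without it, FINDSTR treats the spaces as separators between alternative search strings.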
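A pattern in which a period followed by an asterisk stands in for the font size can be tested on a few sample lines before running it against real files. The sketch below uses Python's re module, where the period and asterisk have the same meaning; the three sample lines are made up for the test.

```python
import re

# A period followed by an asterisk matches any run of characters before "pt".
pattern = re.compile(r"font:.*pt 'Times New Roman'")

lines = [
    "<span style=\"font:7.0pt 'Times New Roman'; \">",
    "<span style=\"font:12.0pt 'Times New Roman'; \">",
    "<span style=\"font:7.0pt 'Verdana'; \">",
]

# Keep only the lines in which the pattern occurs somewhere.
matches = [line for line in lines if pattern.search(line)]

for line in matches:
    print(line)
```

The first two lines match regardless of font size, and the Verdana line does not.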