Microsoft Regex
A Guide to Find and Replace with Wildcards in Microsoft Word
Search String Algorithms for “Find and Replace”
Searching for specific wording or concepts in text is a common task when editing or drafting documents, legal or otherwise. The usual technique is to hit CTRL + f and type a word you expect to appear in the run of text you hope to find. Ideally this word or phrase appears only in the run you want and nowhere else in the document, since hitting return selects the first occurrence of the phrase. Subsequent searches step through the remaining matches in order.
If you are using software such as Microsoft Word, you may also have the option to replace the found selection, either one selection at a time or all selections at once. Ideally you would remember or find the exact text you would need to edit, but that is not always the case. Similarly, when attempting to replace all, it would be preferable if all cases followed the exact same character pattern. The ideal scenario, however, is all that the traditional Find and Replace algorithm, found as default in Word and many similar programs, can handle.
In the 1950s, before Microsoft and ubiquitous personal computing, researchers and mathematicians were already thinking of solutions to exactly this problem. This is when mathematician Stephen Cole Kleene formalized the first description of a regular language, essentially a language that follows a finite number of rules. With the introduction of Unix, the operating system family behind most systems that are not Windows, regular expressions (regex for short) built on Kleene’s work to become a standard way of searching and replacing information in plain text files.
But what is so special about regex? Simply put, regex allows the user to go far beyond a simple find and replace for a literal string. Anchors allow you to find only runs that start or end with certain words. Quantifiers allow you to find sets of letters that appear zero, one, or multiple times. OR operators allow you to search for words with multiple spellings, such as (mouse|mice). Lookaheads and lookbehinds even let you select text that precedes or follows certain other text.
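For readers who would like to see these features outside of Word, here is a minimal sketch using Python's standard `re` module. The sample sentence and the variable names are invented for illustration; they do not come from any particular document.

```python
import re

# Invented sample text for illustration.
text = "The mouse saw two mice. Mice like cheese, and the mouse ate the cheese."

# Anchor: match "The" only at the very start of the text.
starts_with_the = re.findall(r"^The", text)

# Quantifier: the letter "e" repeated two or more times in a row.
double_e = re.findall(r"e{2,}", text)

# OR operator: either spelling, ignoring case.
rodents = re.findall(r"(?:mouse|mice)", text, flags=re.IGNORECASE)

# Lookahead: "the" only when it is followed by " cheese".
the_before_cheese = re.findall(r"the(?= cheese)", text)

# Lookbehind: the word that comes right after "two ".
after_two = re.findall(r"(?<=two )\w+", text)

print(starts_with_the, double_e, rodents, the_before_cheese, after_two)
```

Each pattern returns only the text it matched; the lookahead and lookbehind conditions are checked but not included in the result.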
One of the most important concepts, however, is capture groups. Capture groups allow you to collect one or more runs of text from a search and replace them independently. To learn more about regex, you can pick up the Regular Expressions Cookbook from O’Reilly or play around with an online regex tester, of which there are several.
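As a small illustration of capture groups, the sketch below swaps two captured runs of text using Python's `re.sub`. The name used as input is made up; the point is only that each parenthesized group can be placed independently in the replacement.

```python
import re

# Invented sample: a name written as "surname, given name".
text = "Smith, John"

# Group 1 captures the surname, group 2 the given name.
# In the replacement, \2 and \1 insert the groups in swapped order.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", text)

print(swapped)  # John Smith
```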
Microsoft is Special
As noted in the above section, Microsoft did not evolve from Unix and, thus, does not necessarily use Unix-based regex paradigms throughout its programs. This is especially true of the Office suite. The idea of regex is so useful in drafting and editing, though, that Microsoft could not get by with just a simple find and replace algorithm in its Office suite. Enter Wildcards. First, this article will show you how to enable them, then it will show you how to use them in searches.
Turn on Wildcards
In newer versions of Microsoft Word, hitting the standard CTRL + f will bring up the Navigation Pane, a panel normally docked to the left of the document. This will have a search box and three tabs: Headings, Pages, and Results. This is a great pane if you want to see surrounding context for a simple search quickly, but it hides the real power of Word’s Wildcard find and replace. Instead, try CTRL + h, which will bring up the old find and replace control. Alternatively, the Find and Replace buttons on the home tab will open this form. They are to the right of the Styles gallery.
Now that the form is up, look for a button on the bottom left labeled More > >. This expands the find and replace form to show a series of checkboxes, as well as some options you can use to either find words with specific formatting or insert formatting along with the replacement text. Halfway down the left column of checkboxes is “Use wildcards.” Selecting this option will grey out several of the other options, but the correct syntax in the Find textbox will allow you to replace that functionality as well as expand on it. You’re now working with Word’s version of regex.
Using the Syntax
Back in the Find textbox, you will want to start your query. Traditionally, you would put the single word or phrase you were looking for, but with Wildcards you can now use OR operators, symbols to mark the start or end of words, or repetition markers. Let’s look at an example.
In the figure below, I have the plain text of Shakespeare’s Sonnets. There are plenty of words here, but let’s look for one that may be a bit old.
The Navigation Pane highlights wherever “thou” appears, but what if we only wanted to find where it starts a line?
We’ve found the new line starting with ‘Thou’ but missed ’thou’ four lines ahead because that line starts with ‘But.’ Now we can replace “Thou” with something a bit more familiar. But we don’t want to include the line break, right? Not a problem.
In the Find textbox we used (^l)Thou to create what is called a capture group around the symbol representation of a line break. In the Replace textbox we used \1You to insert the captured group and then replace the rest of the found text with “You.”
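The same capture-and-reinsert idea can be sketched with a conventional regex engine. In the Python example below, `"\n"` stands in for Word's manual line break (`^l`), which is an assumption about how the text was exported, and the three sample lines are invented rather than taken from the Sonnets.

```python
import re

# Invented sample; "\n" is an assumed stand-in for Word's ^l line break.
text = "When I am gone,\nThou wilt remember.\nBut thou art here."

# Group 1 captures the line break so it can be put back unchanged;
# only the word after it is replaced.
result = re.sub(r"(\n)Thou", r"\1You", text)

print(result)
```

The lowercase "thou" on the last line is left alone because it does not directly follow a line break.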
That’s great, but what if we want to replace everywhere ’thou’ does not start a line? We can simply indicate that we want everything that does not start with a line break.
As a last item, suppose you want to highlight wherever ’the,’ its iambic abbreviation ’th’,’ or ’thee’ appears throughout the text.
As you can see, using a mix of OR (['e] and [a-z]), one or many ({1,}), and Not operators ([!a-z]) allows you to find all words that start with ’th’ and end with either an apostrophe or any number of ’e’s. If you had preferred, you could have replaced the ending Not operator ([!a-z]) with the end-of-word operator (>) and achieved the same result.
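For comparison, here is a rough regex analog of that search, written in Python. It is not the Word search string itself: the lookbehind and lookahead play the role of the `[!a-z]` guards without consuming the neighboring characters, and the sample text is invented.

```python
import re

# Invented sample containing the three target forms plus two near misses.
text = "the th' thee then other"

# "th" followed by one or more apostrophes or e's, with no lowercase
# letter immediately before or after the match.
pattern = re.compile(r"(?<![a-z])th['e]{1,}(?![a-z])")

matches = pattern.findall(text)
print(matches)  # ['the', "th'", 'thee']
```

"then" is rejected because a letter follows the "e", and "other" is rejected because a letter precedes the "th".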
The unfortunate side effect of the syntax is that the resulting string of characters may look incomprehensible before you grow used to using it. This is one area where Word did not deviate from regex. Rest assured, though, that with time and practice, the legibility of Wildcard searches greatly improves.
Limitations
As you saw with the syntax example finding where “Thou” starts a new line, Word cannot perform real negative lookbehinds. The workaround is to capture the “out” group and insert it back by referencing it with a backslash and the group number in the Replace textbox.
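To make the difference concrete, the Python sketch below replaces "thou" only where it does not start a line, first with a true negative lookbehind and then with the capture-and-reinsert workaround described above. The sample text is invented, and `"\n"` is an assumed stand-in for Word's line break.

```python
import re

# Invented sample; the second line starts with "thou".
text = "Thou art kind,\nthou art fair, and thou art wise."

# Regex engines with lookbehind can test the preceding character
# without including it in the match.
with_lookbehind = re.sub(r"(?<!\n)thou", "you", text)

# Workaround in the style of Word wildcards: capture the preceding
# non-line-break character and write it back with \1.
with_capture = re.sub(r"([^\n])thou", r"\1you", text)

print(with_lookbehind)
print(with_capture)
```

One difference worth knowing: the capture version needs a character in front of the match, so it would skip a "thou" at the very start of the text, while the lookbehind version would not.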
Another fundamental flaw of Wildcard find and replace is its inability to process alternation. For example, searching for ‘a’ or ’the’ cannot be handled using (a|the) as it would be with regex. This is because each operation, aside from the “or many” operations, is associated with a single character. The second character in ’the’ would not match with ‘a’ as ‘a’ has no second character. Instead, you would have to search for ‘a’ and ’the’ separately.
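The sketch below shows the two approaches side by side in Python: one pass using regex alternation, and two separate passes that mimic what you would have to do in Word. The sample sentence and the "ARTICLE" placeholder are invented for illustration.

```python
import re

# Invented sample sentence.
text = "a cat sat on the mat"

# Regex: one pass with alternation, limited to whole words.
one_pass = re.sub(r"\b(a|the)\b", "ARTICLE", text)

# Word-style workaround: one search per alternative.
first = re.sub(r"\ba\b", "ARTICLE", text)
two_pass = re.sub(r"\bthe\b", "ARTICLE", first)

print(one_pass)
print(two_pass)
```

When running separate passes, check that the replacement text from the first pass cannot be matched by the second.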
Wildcard find and replace also cannot handle nested capture groups.
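For readers unfamiliar with the term, a nested capture group is a group that sits inside another group. The short Python sketch below shows how a regex engine numbers them; the date string is an invented example.

```python
import re

# Group 1 wraps groups 2 and 3; groups are numbered by the position
# of their opening parenthesis, left to right.
match = re.match(r"((\d{4})-(\d{2}))-(\d{2})", "1609-05-20")

groups = match.groups()
print(groups)  # ('1609-05', '1609', '05', '20')
```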
Operators List
Now that you’ve had a brief overview of how Wildcards work in Word, you probably want a cheat sheet for each operator. Here it is. Happy Wildcarding.
Character | Operation | Description |
---|---|---|
? | Any Character | Returns any single character |
[-] | Character in Range | [A-z] returns any letter |
< | Beginning of a Word | <ed* returns ’education’ but not ‘started’ |
> | End of a Word | *ed> returns ‘started’ but not ’education’ |
() | Expression | Used to create capture groups, which can be inserted in Replace text with \ and the number of their group from left to right |
[!] | Not | Selects any character that is not in the group |
{m,n} | Number of Occurrences | Selects where the previous character is repeated between m and n times |
@ | Previous 1 or More | Selects where the previous character is repeated one or more times |
* | Zero or More | Selects zero or more of any character |
^t | Tab | Returns the white space that is inserted when the Tab key is hit |
^^ | Caret | Returns the caret character (^) wherever it is entered in the text |
^n | Column Break | Returns a manually inserted column break, which is most commonly used when drafting something with multiple columns like a newspaper article |
^+ | Em-Dash | Returns the em dash character (—) |
^= | En-Dash | Returns the en dash character (–) |
^l | Manual Line Break | Returns the manual break that is inserted when the user hits SHIFT + return |
^m | Page/Section Break | Returns a manually inserted page or section break |
^~ | Non-Breaking Hyphen | Returns only those hyphens that will not break across a line |
^s | Non-Breaking Space | Returns only those spaces that will not break across a line |
^- | Optional Hyphen | The optional hyphen only appears when the word in which it is inserted will break across a line. This returns that character. |
^13 | Line Return | Returns the manual carriage return (what happens when a user hits return normally) |
\ | Escape Character | To be used in front of any character that would normally be used in Wildcard syntax (\? will search for a question mark in the text) |
Note: many of these and other special characters can be found at the bottom of the find and replace dialog by clicking the button marked ‘Special.’
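If you ever need to reuse a Wildcard search outside of Word, the operators above map fairly closely onto conventional regex. The Python sketch below translates a small subset of them. Everything in it is an assumption made for illustration: the function name `wildcard_to_regex` is hypothetical, the plain-text stand-ins for the caret codes depend on how the document was exported, and treating `*` as a lazy match is a guess at Word's behavior rather than a documented rule.

```python
import re

# ASSUMPTION: plain-text stand-ins for a few of Word's caret codes.
CARET_CODES = {
    "t": "\t",      # tab
    "l": "\n",      # manual line break (stand-in)
    "^": "^",       # literal caret
    "s": "\u00a0",  # non-breaking space
    "+": "\u2014",  # em dash
    "=": "\u2013",  # en dash
}

# Single-character operators with a rough regex counterpart.
SIMPLE = {
    "?": ".",           # any single character
    "*": ".*?",         # any run of characters (lazy here, an assumption)
    "@": "+",           # one or more of the previous item
    "<": r"\b(?=\w)",   # start of a word
    ">": r"(?<=\w)\b",  # end of a word
}


def wildcard_to_regex(pattern: str) -> str:
    """Translate a small subset of Word wildcard syntax to a Python regex."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped character: match it literally.
            out.append(re.escape(pattern[i + 1]))
            i += 2
        elif ch == "^" and pattern[i + 1:i + 3] == "13":
            # ^13 is treated as a carriage return.
            out.append(r"\r")
            i += 3
        elif ch == "^" and pattern[i + 1:i + 2] in CARET_CODES:
            out.append(re.escape(CARET_CODES[pattern[i + 1]]))
            i += 2
        elif ch == "[" and "]" in pattern[i:]:
            # Copy the bracket class; a leading "!" becomes "^" (Not).
            end = pattern.index("]", i)
            body = pattern[i + 1:end]
            if body.startswith("!"):
                body = "^" + body[1:]
            out.append("[" + body + "]")
            i = end + 1
        elif ch == "{" and "}" in pattern[i:]:
            # {m,n} repetition counts use the same notation in regex.
            end = pattern.index("}", i)
            out.append(pattern[i:end + 1])
            i = end + 1
        elif ch in SIMPLE:
            out.append(SIMPLE[ch])
            i += 1
        elif ch in "()":
            # Capture groups carry over unchanged.
            out.append(ch)
            i += 1
        else:
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


# Usage: the line-start example from earlier in this article.
find = wildcard_to_regex("(^l)Thou")
print(re.sub(find, r"\1You", "Hark!\nThou art here."))
```

This is a starting point rather than a full converter. It does not handle every caret code in the table, and characters that are special inside a regex bracket class are copied through as-is.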