Autotagging with Regular Expressions (Regex)

Regular Expression Matching (Regex)

When we need to convert plain text or other digital text files into XML, we look for strategies to convert patterns into markup. For example, there may be clear signals in the text to show us divisions between sections (as in chapter breaks in a book, or act and scene divisions in a play), and we might be able to tell from patterns of line breaks where paragraph divisions fall. To help us identify, match, and locate all of these in a file at once (instead of one at a time), we use regular expressions, which are basically patterns to match strings of text. There are many slightly different varieties of regular expressions used in different coding and programming environments, and we will be using one of these that is standard for our XML editing work and the <oXygen/> editor we are using.

We use regular expression matching in what we call up-conversion from text to XML, and we also use it sometimes when we write XSLT to transform XML-to-XML, when we need to add markup based on particular patterns we can locate in the text. (For example, we might find that all the dates in a document are written in the same format and wrapped in square brackets, and we can quickly use regular expression matching to distinguish dates from other kinds of square-bracketed material by identifying the brackets and a pattern of numbers and hyphens. We locate and alter those dates with regular expressions either while coding an XML file or in up-converting a plain text file.)

In <oXygen/>, look at the Find/Replace window, select the checkbox next to “Regular Expressions” in the Options menu, and try typing a backslash character ( \ ) into the Find window to bring up a short scrollable list of regular expression patterns. There are many others we can use, and we tend to look these up and deploy them as needed (rather than memorizing a long list). We use this handy Regular Expressions Info Quick Start Guide very frequently, and it’s a great place for you to start learning and looking up regular expression patterns. The regular expressions listed on this page are those we use frequently in our projects. There are other convenient listings online, such as The Regular Expression Library at RegExLib.com, or Wikipedia’s Regular Expression page, which may also be helpful. In the next section, we’ll discuss some basic starting points and procedures we commonly use in our up-conversion work.

Autotagging: Up-conversion from Plain Text

When we begin converting text files to XML, we start in the <oXygen/> window, and we try to show all the special formatting characters in the document. In <oXygen/>, go to Options -> Preferences -> Editor: Whitespaces: and mark to Show TAB and SPACE marks.

We then go to the Find/Replace window (CTRL+F on a PC, or from the “Find” dropdown menu), and do the following:

Select Case sensitive
Select Wrap around
Select Regular expression
Important: At least at first, we suggest you deselect "Dot matches all." The “dot” represents any character, and it can be very powerful or a little unwieldy! When “Dot matches all” is selected, the dot also matches newline characters, so if you wrote .+ to match one or more characters, it could match an entire document, because the + is greedy and takes as much as it can. When we deselect “Dot matches all,” the dot matches any character within a line, and is typically easier to maneuver! That said, there will be times that "Dot matches all" is useful, in combination with other expressions.
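To see the difference that "Dot matches all" makes, here is a minimal sketch in Python rather than in <oXygen/>. We are assuming that Python's re.DOTALL flag behaves like the "Dot matches all" checkbox; the sample text is made up.

```python
import re

# Made-up two-line sample.
text = "first line\nsecond line"

# Default behavior: the dot stops at line breaks, so .+ matches line by line.
per_line = re.findall(r".+", text)

# With re.DOTALL the dot also matches newlines, so .+ swallows the whole text.
whole = re.findall(r".+", text, flags=re.DOTALL)

print(per_line)  # ['first line', 'second line']
print(whole)     # ['first line\nsecond line']
```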

We typically do the following in the Find/Replace window, first working on changing special characters not permitted in XML content, working with ampersands first. (The order is important here, because you don’t want to change ampersands twice when you’re working on the angle bracket characters (if you have them). If you do the angle brackets first, the ampersands inside the new &lt; and &gt; entities would themselves be converted when you go on to replace ampersands, when you only want to convert the real ampersands that stand by themselves. Make sense?)

Change & to &amp;
Then change < to &lt; and > to &gt;
Look for ways to condense multiple blank lines, but only after analyzing your document and determining which ones should be kept as markers of, say, section breaks: We typically look for something like this, hunting for “newline” characters, \n:
\n{3,} or \n\n\n+ in the Find window, and replace with \n\n, or whatever makes sense to you!
While it may make the most sense to save this for last, you will need to (manually) add a root element to surround everything and make an XML file.
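The steps above can be sketched outside of <oXygen/> too. This is only an illustration in Python, under the assumption that its re module handles these simple patterns the same way; the function names and the root element name "root" are placeholders we made up.

```python
import re

def escape_and_condense(text: str) -> str:
    """Escape characters not permitted in XML content, then condense blank lines."""
    # Ampersands first, so the entities created next are not escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;").replace(">", "&gt;")
    # Three or more newlines in a row become exactly two (one blank line).
    return re.sub(r"\n{3,}", "\n\n", text)

def wrap_in_root(text: str, root: str = "root") -> str:
    """Surround everything with a single root element (placeholder name)."""
    return f"<{root}>\n{text}\n</{root}>"

sample = "Salt & pepper\n\n\n\na < b"
print(wrap_in_root(escape_and_condense(sample)))
```

If you reversed the order and escaped the angle brackets first, the text a < b would end up as a &amp;lt; b, which is exactly the double conversion we want to avoid.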

Useful Regex Pattern Symbols:

\n =new line character (in RegEx) Example: replace \n with </item>\n<item>
\t = select tab
\s = selects any white-space character (including tabs and new lines). In the Replace window, use the space-bar to insert spaces.
\d = select digit
\D = select non-digit (note upper-case)
\w = select word (or alphanumeric) character, either a letter or a number
\W = select non-word character (note upper-case)
^ = beginning of line.
$ =end of a line
. = the dot: Matches any character except a new line. Selects any character within a line (as long as you do NOT check “Dot matches all” in Find & Replace. If “Dot matches all” is checked, the dot will select line breaks too.)
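Here is a small sketch of a few of these symbols at work, written in Python instead of the Find/Replace window. We are assuming that Python's re.MULTILINE flag gives ^ and $ the line-by-line meaning described above; the sample text is invented.

```python
import re

# Made-up sample: a tab separates the two labels, a newline ends the first line.
sample = "Act 1\tScene 2\nExit, pursued by a bear."

digits = re.findall(r"\d", sample)                              # each single digit
tabs = re.findall(r"\t", sample)                                # each tab character
first_words = re.findall(r"^\w+", sample, flags=re.MULTILINE)   # word at the start of each line
last_chars = re.findall(r".$", sample, flags=re.MULTILINE)      # last character of each line

# Replacing the newline itself, as in the \n example above.
items = re.sub(r"\n", "</item>\n<item>", "one\ntwo")

print(digits)       # ['1', '2']
print(first_words)  # ['Act', 'Exit']
print(last_chars)   # ['2', '.']
print(items)
```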

Indicating How Many, Either | Or, and Character Sets [ ]:

? = used after a character, picks up zero or 1 of it: so colou?r matches both “color” and “colour”
* =used after a character, picks up zero or more of it: (the character may or may not be there, and maybe there’s more than one of it). So \w\s\d* picks up a word character followed by a space, as well as a word character followed by a space and one or more digits.
+ =used after a character, picks up 1 or more of it: For example, \d+ picks up one or more digits in a row: 2 and 25 and 65746, etc.
| = (the pipe): selects one OR the other: grey|gray or gr(e|a)y are each patterns that will match either grey or gray.
[ ] matches any ONE character enclosed. Example: [0-9] will select the first single digit from 0-9 that it finds. [IVXLC]+ is handy for picking up one or more Roman Numerals, but be careful because this will also pick up "I" when it’s not a Roman Numeral but the first-person pronoun: (I, as in myself). [^IVXLC] will select anything but these characters.
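The following sketch tries out these quantifiers, the pipe, and character sets in Python (an assumption on our part: these basic patterns behave the same there as in <oXygen/>). The helper function all_matches is a name we made up for this illustration.

```python
import re

def all_matches(pattern: str, text: str) -> list[str]:
    """Return the full text of every match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

print(all_matches(r"colou?r", "color colour"))        # ['color', 'colour']
print(all_matches(r"\d+", "2 and 25 and 65746"))      # ['2', '25', '65746']
print(all_matches(r"gr(e|a)y", "grey gray"))          # ['grey', 'gray']

# The Roman numeral set also picks up the "C" in "Chapter" and the pronoun "I".
print(all_matches(r"[IVXLC]+", "Chapter XIV: I went"))  # ['C', 'XIV', 'I']

# A caret inside the brackets negates the set.
print(all_matches(r"[^IVXLC]", "XIV."))               # ['.']
```

The fourth line shows why it pays to check your matches before you run a Replace All with a pattern like [IVXLC]+.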

Escaping Regex's Special Characters (When You Need To Find a Square Bracket, Period, Asterisk, Question Mark, Etc.)

Because characters like square brackets, asterisks, and question marks have special meaning in regular expressions, in order to search for a literal square bracket, asterisk, or question mark, you need to escape the regex character by using a backslash ( \ ). The following characters need to be escaped with a backslash if you need to find the literal character in your text:

the backslash itself: ( \ )
the caret ( ^ )
the dollar sign ( $ )
the pipe ( | )
the dot ( . )
the question mark ( ? )
the asterisk ( * )
the opening and closing parentheses ( ( and ) )
the opening square bracket ( [ ), and the opening curly brace ( { )

So, for example, in order to search for a string of alphanumeric characters followed by a literal period, we would write the following expression:

\w+\.

The "backslash w plus" looks up any one or more alphanumeric characters, and the backslash dot looks for the literal period. This might look a little confusing at first, since we use the backslash to introduce specific kinds of regular expression characters (\d, \w, etc.). It might help to think of using the backslash as an escape character whenever you need to locate a character that means something special on its own in regular expressions.
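To make escaping concrete, here is a short Python sketch (again assuming Python's re module as a stand-in for the Find window). The sample line is made up, and re.escape is a Python convenience, not something <oXygen/> offers.

```python
import re

# Made-up sample line with a bracketed date and a bracketed editorial note.
text = "[1790-01-05] Left port [sic] at dawn. See p. 42."

# Escaped dot: word characters followed by a literal period.
with_period = re.findall(r"\w+\.", text)

# Unescaped dot: matches ANY character after the word characters.
any_char = re.findall(r"\w+.", "end. Next")

# Escaped square brackets: only bracketed dates, not other bracketed material.
dates = re.findall(r"\[\d{4}-\d{2}-\d{2}\]", text)

# re.escape adds the backslashes for you when you want a literal string.
literal_sic = re.findall(re.escape("[sic]"), text)

print(with_period)  # ['dawn.', 'p.', '42.']
print(any_char)     # ['end.', 'Next']
print(dates)        # ['[1790-01-05]']
print(literal_sic)  # ['[sic]']
```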

How to Use Parenthetical Grouping in the Find Window and Select Groups with Backreferences in the Replace Window:

When we group patterns in the Find window with parentheses, we can use backreferences to select parenthetical groupings by number in the Replace window. We apply a set of capturing parentheses to isolate some parts of a pattern we find, if we want to exclude the rest when we go to replace.

( ) matches and captures all text enclosed. Groups a collection of characters together in the “Find” window so you can select it in the “Replace” window. We presume here that you set these parenthetical groups side by side, rather than nest them inside each other, so that the groupings read from left to right.
\1 =under “Replace with”, this represents the first instance of text captured using ( ), above, under “Text to find”.
\2 =second ( ) instance captured, as above
\3 =third ( ) instance captured, as above, etc...
\0 =capture the entire match regardless of parentheses.

Note that you can use backreferences in any order, and repeat them as needed when you are making replacements, so you can thoroughly remix the regex patterns you’ve grouped! For examples of backreferencing, see the Regular-expressions.info page on the subject.

For example, I’ve just gone hunting through our Georg Forster voyage file to see if I can find all the references to days that take this verbal form: the 23rd of April (or the 15th, the 2nd, or the 3rd of whatever month and/or year). Let’s say I wanted to isolate only the numbers and not the letters (as in, simply, 23, 15, 2, 3), and wrap those in an element I’ll call <day>, and then I also want to keep the rest of that text to immediately follow? What I want to do is change this form: 23rd, into this: <day>23</day>rd . That’s a perfect opportunity to use parenthetical grouping in Find and Replace, like this:

Find window: (\d\d*)([a-z]+)
Notice how we’re applying parentheses here to isolate the numerical portion, and then a second set to surround the lower-case character set.
Replace window: <day>\1</day>\2
Here, I indicate that the “day” element is to sit around the first parenthetical grouping I’ve isolated: just the numbers. Then I give the second parenthetical grouping that’s going to sit right outside. This works in my markup to help me hold only the numerical portion of the date inside a handy XML element.
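The same Find and Replace can be tried out in Python as a quick check. This is a sketch under the assumption that Python's re.sub treats \1 and \2 the way the Replace window does; note that Python writes the whole match as \g<0> where this page uses \0. The sample sentence is invented.

```python
import re

# Made-up sample in the "23rd of April" form.
text = "the 23rd of April, and the 2nd of May"

# Group 1 holds the digits, group 2 holds the lower-case letters that follow.
tagged = re.sub(r"(\d\d*)([a-z]+)", r"<day>\1</day>\2", text)
print(tagged)
# the <day>23</day>rd of April, and the <day>2</day>nd of May

# Backreferences can be reordered or repeated in the replacement.
swapped = re.sub(r"(\d+)([a-z]+)", r"\2\1", "23rd")
print(swapped)  # rd23

# The whole match, regardless of parentheses (Python spelling of \0).
whole = re.sub(r"\d+", r"<num>\g<0></num>", "in 1795")
print(whole)  # in <num>1795</num>
```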

Note that you might want to use parentheses for reasons other than capturing and backreferencing. For example, you might group a series of options marked with vertical pipes ( | ) inside a parenthetical group in order to set this group of options apart from the rest of your non-optional regex pattern. Keep in mind that ordinary parentheses still capture, even when you only meant to group: they count when you number your backreferences, and groups are numbered from left to right by their opening parenthesis, including any groups nested inside other groups. If you want to group without capturing, write the group as (?: ) instead. This can get a little complicated, and we refer you to the Regular-Expressions.info page on "Branch Reset Groups" for details and examples.

Thinking Your Way Through an Autotagging Challenge:

There’s no single way to do autotagging on a file: There are always options! Here are some hints:

When you begin, one of the things you do is analyze the structure of the document (do a “document analysis”) to notice what regular patterns you can find. You don’t want to be working on this line by line from the top to the bottom, because the point of autotagging is to collect all the related kinds of things across the whole document. Instead, the big decision you need to make is whether to work from the outside in, or the inside out.
In other words, do you try to capture all the big outer elements first (the ones that hold most of the other elements inside), and then work your way in? Or go the other way, and start from the inside elements (all the items inside the lists, for example)? Either approach can work, and much depends on the patterns you spot as you analyze your text file.
Sometimes you “munge” your file accidentally and need to take steps backward, or start over with a fresh copy of the file--that has happened to us! It can be frustrating--take a break and try it again. (It’s also very rewarding when you get it just right!)
Try a close-open strategy: Quite often, the place where you open a new element is ALSO the place where an old element closes. Can you do two things at once? Look for opportunities to close a tag when you open a new one (or vice versa).
When you work on autotagging, you usually do some work at the top and/or bottom of your file to change or eliminate a few things at the start or toward the end of your process. For example, if you try the close-open strategy to indicate at the start of a <list> element where the previous <list> ended, you’d write the code like this: </list><list>[regex pattern here]. When you’ve made your replacements, you’ll always have an extra closing </list> tag ahead of your first <list> element, but you can easily just manually delete this one rogue tag when you’re cleaning up your file.
When up-converting to XML, think about whether you really need or want to preserve things in your text files that function as pseudo-markup, that is, things that functioned like structural markup to indicate things like quotations, section divisions, separators between paragraphs. XML tags can be used to mark all these things, and you can apply HTML and CSS later to add dividers as you wish when you publish this in electronic form. But keep in mind as you analyze and convert your documents that you don’t really need to preserve formatting for the sake of preserving it. Remember that you want your XML markup (your tags themselves) to hold meaningful information about the structure and content of your document, so you do not really need to include the pseudo-markup in the original text. Systematically removing that pseudo-markup is part of your up-conversion process.
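The close-open strategy described in the hints above can be sketched like this in Python. Everything in the sample is hypothetical: the "LIST" marker lines stand in for whatever pseudo-markup your own text file uses, and the <head> element is just one way you might choose to keep the list label.

```python
import re

# Made-up plain text: each "LIST ..." line is pseudo-markup that starts a new list.
text = "LIST fruits\napple\npear\nLIST tools\nhammer"

# Close the previous list and open the next one in a single replacement.
step1 = re.sub(
    r"^LIST (.+)$",
    r"</list><list><head>\1</head>",
    text,
    flags=re.MULTILINE,
)

# Cleanup at the top and bottom of the file: drop the one rogue </list>
# at the start, and close the final list at the end.
tagged = step1.replace("</list>", "", 1) + "\n</list>"
print(tagged)
```

In <oXygen/> you would do that last cleanup by hand. It is scripted here only so that the sketch runs from start to finish.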

Some useful patterns:

(a|b) a or b
x{2,} two or more x’s
p{3} Exactly 3 p’s
q{3,} 3 or more q’s
B{3,5} 3, 4 or 5 B’s
^(.+)$ Since a caret ( ^ ) indicates the start of a line, and the dollar sign ( $ ) indicates the end of a line, and the .+ indicates the presence of some characters inside, this pattern selects lines that contain text (and ignores any lines that are empty). You could run a Replace to work with the capturing parentheses and wrap that content inside an element that makes sense (like <item>). In the Replace window, we’d write <item>\1</item> to tag the text inside the line.
^[IVX]+\. .+$ =beginning of a line, any Roman numeral less than 40 (the ones written with only I, V, and X), exactly one literal period, exactly one literal space character, then all characters up to the end of the line
\s\s Find any sequence of two white-space characters (space, tab, new-line). If you’re running a Find and Replace, you might replace these with a single space, typed with the spacebar in the Replace window.
Replacing line breaks: Match the \n (or newline character) in order to "consume" and replace a linebreak. It won’t work to try to replace ^ and $, which indicate the start and end of lines, because these are not characters that can be replaced; they are merely anchors or indicators.
Read about how to write a Lookahead and Lookbehind regex, to look for a pattern of something ahead or behind of a character, or something that is NOT ahead or behind a character. Read about it and look at examples on the Regular-Expressions.info guide to "Lookaround."
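Several of the patterns above can be tried together in a short Python sketch. We are assuming here that Python's re module with the re.MULTILINE flag is a fair stand-in for the Find/Replace window; the outline text is invented, and the lookahead at the end is only one small example of what the "Lookaround" guide covers.

```python
import re

# Made-up outline with one empty line and one heading numbered "L".
outline = "I. Departure\n\nII. The Voyage\nXIV. Return\nL. Appendix"

# ^(.+)$ wraps every line that contains text and skips the empty line.
items = re.sub(r"^(.+)$", r"<item>\1</item>", outline, flags=re.MULTILINE)

# ^[IVX]+\. .+$ finds headings numbered with I, V, and X only ("L." is skipped).
numbered = re.findall(r"^[IVX]+\. .+$", outline, flags=re.MULTILINE)

# x{2,} finds runs of two or more x's.
runs = re.findall(r"x{2,}", "x xx xxxx")

# Lookahead: digits only when an ordinal ending follows, without consuming it.
days = re.findall(r"\d+(?=st|nd|rd|th)", "the 23rd and the 2nd, in 1795")

print(items)
print(numbered)  # ['I. Departure', 'II. The Voyage', 'XIV. Return']
print(runs)      # ['xx', 'xxxx']
print(days)      # ['23', '2']
```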

Regular Expressions in XPath and XSLT

There are XPath functions dedicated to matching and transforming text with regular expressions. These include the following:

matches(): This takes two arguments: you designate a first string, and then a second that indicates a particular pattern you’re trying to find inside it. For example, if you were looking in all the paragraphs of a document coded with <p> to find any paragraphs that contain at least a single digit:
//p[matches(., "\d")]

Remember, the dot in the XPath indicates that you’re looking at the string of text inside each paragraph in turn, and that is the first string. Then the second string is the regular expression pattern \d, which indicates a pattern to search for any numerical digit inside the string of text in the paragraph.

Note: There are three other related XPath functions that work like matches(), only these work on literal strings, not regex patterns. We include them here because you may find them useful to think about in connection with matches():
- contains(): Tests whether the first string contains a particular literal string. To adapt our example above, say we are looking for all the paragraphs that contain a mention of the specific year 1995. We’d use contains() much like we’d write matches(), but this time using the literal characters.
  //p[contains(., "1995")]
  
  (Note: You can actually write matches() to look for a literal string as well as a regex pattern, since one kind of regex actually is a literal string. So, of these two, matches() is the more adaptable XPath function, and contains() can only match on literal strings.)
- starts-with(): Tests whether the first string starts with a particular literal string.
- ends-with(): Tests whether the first string ends with a particular literal string.
replace(): This function takes three arguments: replace(string, regex, replacement-string). For example, if we wanted to look in any <author> element for capital letters, and replace them all with literal asterisk characters:
//author/replace(., "[A-Z]", "*")

Here, the regex pattern is described in the middle expression to define the pattern we’re looking for, and it’s a defined character set: This says, look for any single character from the set [A-Z] and replace it with a “splat” or an asterisk. When I ran this XPath expression on our ForsterGeorgComplete.xml file, I converted Forster, Georg in an author tag to *orster, *eorg. (Fortunately this was just a tester XPath, and it didn’t change the string of text in my file, just in the XPath results window.)
tokenize(): This one is extremely handy for fine-tuning XML markup: We use the tokenize() function for a sort of surgical precision in our documents, to break patterns into parts (or “tokens”), by dividing on a particular regex pattern: tokenize(string, regex-pattern), and the output breaks my string into parts that I can grab and work with. For example, I’ll go looking for <author> elements again to grab their text, and tokenize it on white space, defined as a regex pattern by \s+:
//author/tokenize(., "\s+")

When I run this in the XPath window, I return (among other things) a list that separates “Georg” from “Forster”. (When we tokenize on white space, it’s a good idea to allow for one or more white-space characters, in case we have a line break as well as a space character separating two parts of a thing.)
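If you would like to experiment with these ideas outside of XPath, the sketch below shows rough Python counterparts. These are loose analogies we are assuming for illustration, not XPath itself: re.search stands in for matches(), the in operator for contains(), re.sub for replace(), and re.split for tokenize(). The sample strings other than the author name are made up.

```python
import re

author = "Forster, Georg"

# Like matches(., "\d"): is there at least one digit anywhere in the string?
has_digit = re.search(r"\d", "Sailed in 1772") is not None

# Like contains(., "1995"): a literal string test, no regex involved.
has_1995 = "1995" in "Published in 1995"

# Like replace(., "[A-Z]", "*"): every capital letter becomes an asterisk.
starred = re.sub(r"[A-Z]", "*", author)

# Like tokenize(., "\s+"): split on one or more white-space characters.
tokens = re.split(r"\s+", author)

print(has_digit, has_1995)  # True True
print(starred)              # *orster, *eorg
print(tokens)               # ['Forster,', 'Georg']
```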

In XSLT, there is an element, xsl:analyze-string that we use for manipulating regular expressions, and you can read more about it in the Michael Kay book if you have it, or on the Obdurodon site’s tutorial on using analyze-string.