Regular Expression Matching (Regex)

When we need to convert plain text or other digital text files into XML, we look for strategies to convert patterns into markup. For example, there may be clear signals in the text to show us divisions between sections (as in chapter breaks in a book, or act and scene divisions in a play), and we might be able to tell from patterns of line breaks where paragraph divisions fall. To help us identify, match, and locate all of these in a file at once (instead of one at a time), we use regular expressions, which are basically patterns to match strings of text. There are many slightly different varieties of regular expressions used in different coding and programming environments, and we will be using one of these that is standard for our XML editing work and the <oXygen/> editor we are using.

We use regular expression matching in what we call up-conversion from text to XML, and we also use it sometimes when we write XSLT to transform XML-to-XML, when we need to add markup based on particular patterns we can locate in the text. (For example, we might find that all the dates in a document are written in the same format and wrapped in square brackets, and we can quickly use regular expression matching to distinguish dates from other kinds of square-bracketed material by identifying the brackets and a pattern of numbers and hyphens. We locate and alter those dates with regular expressions either while coding an XML file or in up-converting a plain text file.)
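
The bracketed-date case above can be sketched outside the editor, too. This is only an illustration in Python, whose `re` flavor is close to, but not identical with, the one in the <oXygen/> Find/Replace window. The sample sentence, the YYYY-MM-DD date shape, and the `<date>` element name are all assumptions made for the example.

```python
import re

# Hypothetical sample: two bracketed dates and one bracketed editorial note.
sample = "We sailed [1773-03-26] past the cape [illegible] and anchored [1773-04-02]."

# \[ and \] match literal square brackets; the digits-and-hyphens shape
# in between is what separates dates from other bracketed material.
DATE_IN_BRACKETS = re.compile(r"\[(\d{4}-\d{2}-\d{2})\]")


def tag_dates(text: str) -> str:
    """Wrap each bracketed date in a <date> element and drop the brackets."""
    return DATE_IN_BRACKETS.sub(r"<date>\1</date>", text)


tagged = tag_dates(sample)
print(tagged)
```

The note in square brackets is left alone because it does not fit the pattern of numbers and hyphens.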

In <oXygen/>, open the Find/Replace window, select the checkbox next to “Regular Expressions” in the Options menus, and try typing a backslash character ( \ ) into the Find window to bring up a short scrollable list of regular expression patterns. There are many others we can use, and we tend to look these up and deploy them as needed (rather than memorizing a long list). We use this handy Regular Expressions Info Quick Start Guide very frequently, and it’s a great place for you to start learning and looking up regular expression patterns. The regular expressions we list on this page are the ones we use frequently in our projects. There are other convenient listings online that may also be helpful, such as The Regular Expression Library at RegExLib.com, or Wikipedia’s Regular Expression page. In the next section, we’ll discuss some basic starting points and procedures we commonly use in our up-conversion work.

Autotagging: Up-conversion from Plain Text

When we begin converting text files to XML, we start in the <oXygen/> window, and we try to show all the special formatting characters in the document. In <oXygen/>, go to Options -> Preferences -> Editor -> Whitespaces, and mark the options to show TAB and SPACE marks.

We then open the Find/Replace window (CTRL+F on a PC, or from the “Find” dropdown menu) and work through the steps below. We start by changing the special characters that are not permitted in XML content, beginning with ampersands. (The order is important here. The replacements for the angle brackets, &lt; and &gt;, themselves begin with an ampersand, so if you convert the angle brackets first, a later search for & will also catch those new ampersands and change them a second time, when you only want to change the real ampersands that were in the text all along.)

  1. Change & to &amp;
  2. Then change < to &lt; and > to &gt;
  3. Look for ways to condense multiple blank lines, but only after analyzing your document and determining which ones should be kept as markers of, say, section breaks: We typically look for something like this, hunting for “newline” characters, \n:
    \n{3,} or \n\n\n+ in the Find window, and replace with \n\n, or whatever makes sense to you!
  4. While it may make the most sense to save this for last, you will need to (manually) add a root element to surround everything and make an XML file.
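
The four steps above can also be sketched as a small script. This is a minimal illustration in Python rather than the <oXygen/> Find/Replace window. The function names, the sample text, and the root element name `root` are hypothetical choices made for the example.

```python
import re


def escape_xml_text(text: str) -> str:
    """Replace the characters XML reserves, ampersands first."""
    text = text.replace("&", "&amp;")  # step 1: only the real ampersands exist yet
    text = text.replace("<", "&lt;")   # step 2: the new "&" in "&lt;" is never revisited
    text = text.replace(">", "&gt;")
    return text


def condense_blank_lines(text: str) -> str:
    """Step 3: collapse runs of three or more newlines down to two."""
    return re.sub(r"\n{3,}", "\n\n", text)


def wrap_in_root(text: str, root: str = "root") -> str:
    """Step 4: surround everything with a single root element (name is an assumption)."""
    return f"<{root}>\n{text}\n</{root}>"


raw = "Tom & Jerry <sic>\n\n\n\nNext section"
result = wrap_in_root(condense_blank_lines(escape_xml_text(raw)))
print(result)

# Doing the angle brackets first shows why the order matters:
wrong_order = "a < b".replace("<", "&lt;").replace("&", "&amp;")
print(wrong_order)  # the ampersand that belongs to &lt; has been converted again
```

Whether two newlines is the right target for step 3 depends on your document analysis, so treat the `\n\n` replacement as one option among several.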

Useful Regex Pattern Symbols:

Indicating How Many, Either | Or, and Character Sets [ ]:

Escaping Regex's Special Characters (When You Need To Find a Square Bracket, Period, Asterisk, Question Mark, Etc.)

Because characters like square brackets, asterisks, and question marks have special meaning in regular expressions, in order to search for a literal square bracket, asterisk, or question mark, you need to escape the regex character by using a backslash ( \ ). The following characters need to be escaped with a backslash if you need to find the literal character in your text:

So, for example, in order to search for a string of alphanumeric characters followed by a literal period, we would write the following expression:

\w+\.

The "backslash w plus" looks up any one or more alphanumeric characters (letters and digits, along with the underscore), and the backslash dot looks for the literal period. This might look a little confusing at first, since we also use the backslash to introduce specific kinds of regular expression characters (\d, \w, etc.). It might help to think of using the backslash as an escape character whenever you need to locate a character that means something special on its own in regular expressions.
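
To see what the escaping backslash changes, here is a small Python comparison of \w+\. against the unescaped \w+. on a made-up sentence. Python's `re` module stands in for the editor here, and the sample text is an assumption made for the example.

```python
import re

# Hypothetical sample sentence.
sample = "Mr. Forster sailed in 1772. See vol_2."

# Escaped dot: word characters followed by a literal period.
with_escape = re.findall(r"\w+\.", sample)

# Unescaped dot: word characters followed by ANY single character,
# so words followed by a space are matched as well.
without_escape = re.findall(r"\w+.", sample)

print(with_escape)
print(without_escape)

# Python can add the backslashes for you when you need a literal string.
escaped_literal = re.escape("[sic]?")
print(escaped_literal)
```

The escaped version returns only the three strings that really end in a period, while the unescaped version also picks up words that are followed by a space.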

How to Use Parenthetical Grouping in the Find Window and Select Groups with Backreferences in the Replace Window:

When we group patterns in the Find window with parentheses, we can use backreferences to select parenthetical groupings by number in the Replace window. We apply a set of capturing parentheses to isolate some parts of a pattern we find, if we want to exclude the rest when we go to replace.

Note that you can use backreferences in any order, and repeat them as needed when you are making replacements, so you can thoroughly remix the regex patterns you’ve grouped! For examples of backreferencing, see the Regular-expressions.info page on the subject.

For example, I’ve just gone hunting through our Georg Forster voyage file to see if I can find all the references to days that take this verbal form: the 23rd of April (or the 15th, the 2nd, or the 3rd of whatever month and/or year). Let’s say I want to isolate only the numbers and not the letters (as in, simply, 23, 15, 2, 3) and wrap those in an element I’ll call <day>, and I also want to keep the rest of that text so that it immediately follows the new element. What I want to do is change this form: 23rd, into this: <day>23</day>rd . That’s a perfect opportunity to use parenthetical grouping in Find and Replace, like this:
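
One way to sketch this find-and-replace in Python is shown below. The pattern is our guess at a workable grouping, not necessarily the exact expression used on the Forster file, and the sample sentence is made up. Python writes backreferences in the replacement as \1 and \2; the notation in your editor's Replace window may differ.

```python
import re

# Group 1 captures the digits, group 2 captures the ordinal letters.
ORDINAL_DAY = re.compile(r"\b(\d{1,2})(st|nd|rd|th)\b")


def tag_days(text: str) -> str:
    """Wrap the number in <day> and keep the ordinal letters right after it."""
    return ORDINAL_DAY.sub(r"<day>\1</day>\2", text)


# Hypothetical sample sentence.
sample = "On the 23rd of April we anchored; on the 2nd of May we left."
tagged_days = tag_days(sample)
print(tagged_days)
```

Because each group is referenced separately, the replacement can put the tags around group 1 alone and then set group 2 down unchanged.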

Note that you might want to use parentheses for reasons other than capturing and backreferencing. For example, you might group a series of options marked with vertical pipes ( | ) inside a parenthetical group in order to set this group of options apart from the rest of your non-optional regex pattern. Plain parentheses always capture, so if you only want the grouping, you can make the parentheses non-capturing by opening the group with (?: instead of ( . Non-capturing parentheses can still hold capturing parentheses inside, and those capturing groups are numbered as usual, by counting their opening parentheses from left to right and skipping the non-capturing ones. This can get a little complicated, and we refer you to the Regular-Expressions.info page on "Branch Reset Groups" for details and examples.
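
Here is a short Python sketch of how group numbering works when some parentheses are non-capturing. It uses the (?: ) syntax, and the sample sentence and the `<date>` element are assumptions made for the example.

```python
import re

# Every pair of parentheses captures here, so there are four numbered groups.
capturing = re.compile(r"(the|The) (\d{1,2})(st|nd|rd|th) of ([A-Z][a-z]+)")

# (?: ) groups the alternatives without capturing, so only the day
# (group 1) and the month (group 2) receive numbers.
non_capturing = re.compile(r"(?:the|The) (\d{1,2})(?:st|nd|rd|th) of ([A-Z][a-z]+)")

print(capturing.groups, non_capturing.groups)

# Hypothetical sample; the backreferences are used in a new order to remix the match.
sample = "The 15th of June and the 3rd of July."
remixed = non_capturing.sub(r"<date>\2 \1</date>", sample)
print(remixed)
```

The non-capturing version keeps the backreference numbers short and stable, which helps when a pattern holds several sets of alternatives.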

Thinking Your Way Through an Autotagging Challenge:

There’s no single way to do autotagging on a file: There are always options! Here are some hints:

  1. When you begin, one of the things you do is analyze the structure of the document (do a “document analysis”) to notice what regular patterns you can find. You don’t want to be working on this line by line from the top to the bottom, because the point of autotagging is to collect all the related kinds of things across the whole document. Instead, the big decision you need to make is whether to work from the outside in, or the inside out.

    In other words, do you try to capture all the big outer elements first (the ones that hold most of the other elements inside), and then work your way in? Or go the other way, and start from the inside elements (all the items inside the lists, for example)? Either approach can work, and much depends on the patterns you spot as you analyze your text file.

  2. Sometimes you “munge” your file accidentally and need to take steps backward, or start over with a fresh copy of the file. That has happened to us! It can be frustrating, so take a break and try it again. (It’s also very rewarding when you get it just right!)
  3. Try a close-open strategy: Quite often, the place where you open a new element is ALSO the place where an old element closes. Can you do two things at once? Look for opportunities to close a tag when you open a new one (or vice versa).
  4. When you work on autotagging, you usually do some work at the top and/or bottom of your file to change or eliminate a few things at the start or toward the end of your process. For example, if you try the close-open strategy to indicate at the start of a <list> element where the previous <list> ended, you’d write the code like this: </list><list>[regex pattern here]. When you’ve made your replacements, you’ll always have an extra closing </list> tag ahead of your first <list> element, but you can easily just manually delete this one rogue tag when you’re cleaning up your file.
  5. When up-converting to XML, think about whether you really need or want to preserve things in your text files that function as pseudo-markup, that is, things that functioned like structural markup to indicate things like quotations, section divisions, separators between paragraphs. XML tags can be used to mark all these things, and you can apply HTML and CSS later to add dividers as you wish when you publish this in electronic form. But keep in mind as you analyze and convert your documents that you don’t really need to preserve formatting for the sake of preserving it. Remember that you want your XML markup (your tags themselves) to hold meaningful information about the structure and content of your document, so you do not really need to include the pseudo-markup in the original text. Systematically removing that pseudo-markup is part of your up-conversion process.
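
The close-open strategy from hints 3 and 4 can be sketched in Python as follows. The sample text and the rule that a line ending in a colon starts a new list are assumptions made for the example. Appending a final closing tag at the bottom of the file is our reading of the cleanup work described in hint 4.

```python
import re

# Hypothetical plain text: a line ending in a colon starts a new list.
sample = "Provisions:\nbread\nwater\nInstruments:\nsextant\ncompass\n"

# With re.MULTILINE, ^ and $ match at the start and end of every line.
LIST_START = re.compile(r"^(\w+):$", re.MULTILINE)

# Close the previous list and open the next one in a single replacement.
step1 = LIST_START.sub(r"</list><list>\1", sample)

# Cleanup at the top and bottom of the file: delete the one rogue closing
# tag ahead of the first list, and close the last list at the end.
step2 = step1.replace("</list>", "", 1) + "</list>"
print(step2)
```

The same cleanup could be done by hand in the editor, since only one tag at the top and one at the bottom are affected.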

Some useful patterns:

Regular Expressions in XPath and XSLT

XPath has functions dedicated to matching and replacing text with regular expressions. These include matches(), replace(), and tokenize().

In XSLT, there is an element, xsl:analyze-string, that we use for processing strings with regular expressions, and you can read more about it in the Michael Kay book if you have it, or in the Obdurodon site’s tutorial on using analyze-string.