Consult the following resources as you work with Regular Expressions:
The text we’ll be using as input for the first regex homework assignment is a plain-text version of Shakespeare’s sonnets produced by Project Gutenberg, which you can download from our site here: shakeSonnets.txt. (Keep the .txt extension when you save it, and you might rename this as YourName_Regex1_sonnets.txt.)
Your Step File should be a plain text (*.txt) or markdown (*.md) file and not something you write in a word processor (not a Microsoft Word document), so that you do not have to struggle with autocorrections of the regex patterns you are recording.
Your goal is to produce an XML version of the Shakespeare Sonnets file by using the search-and-replace techniques we discussed in class, and to record each step you take in a plain text or markdown file so others can reproduce exactly what you did. (In a real-life project situation, you may need to share the steps you take in up-converting plain text documents to XML on your GitHub repo in GitHub’s markdown (the same that we write on the GitHub Issues board). In that case you would save the file with a .md extension, like the instructions files we pushed into Class Examples for Regex_upConversion on the DHClass-Hub.)
Your up-converted XML output should look something like http://dh.obdurodon.org/shakespeare-sonnets.xml. That is, each sonnet should be its own element, each line should be tagged separately, and the roman numerals should be encoded in a useful way (we’ve used attributes, but you could also put them in a child element).
Your Steps file needs to be detailed enough to indicate each step of your process: what regular expression patterns you attempted to find, and what expressions you used to replace them. You might record the number of finds you get and even how you fine-tuned your steps when you were not finding everything you wanted to at first. Note: we strongly recommend copying and pasting your find and replace expressions into your Steps file instead of retyping them, since it is easy to introduce errors when retyping.
There are several ways to get to the target output, but the starting points are standard:
First of all, for any up-conversion of plain text, you must check for the special reserved characters: the ampersand & and the angle brackets < and >. You need to search for those and, if they turn up, replace them with their corresponding XML entities, so that these will not interfere with well-formed XML markup.
Search for: | Replace with: |
---|---|
& | &amp; |
< | &lt; |
> | &gt; |
Note that you need to process the special XML reserved characters in the correct order. Why is it important that you search and replace the & first?
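One way to explore the ordering question outside <oXygen/> is the following minimal Python sketch. The function names and the sample string are made up for illustration; the only point is to compare the two orders on the same input.

```python
def escape_ampersand_first(text: str) -> str:
    """Replace the XML reserved characters, handling & before < and >."""
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    return text


def escape_ampersand_last(text: str) -> str:
    """Same replacements, but & is handled last (for comparison only)."""
    text = text.replace("<", "&lt;")
    text = text.replace(">", "&gt;")
    # By now the text already contains the & that belongs to &lt; and &gt;
    text = text.replace("&", "&amp;")
    return text


sample = "Fish & chips <cheap>"  # hypothetical input, not from the sonnets
good = escape_ampersand_first(sample)
bad = escape_ampersand_last(sample)
print(good)
print(bad)
```

Comparing the two printed lines shows what happens to entities that were created before the ampersand pass runs.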
Don’t worry about the title and author at the top of the file just yet. You will eventually tag them by hand, and we recommend just doing that at the end of the up-conversion process. You’ll be using <oXygen/>’s global Find-and-Replace tool to tag the sonnets, and if you leave the title and author in place while you do that, you’ll wind up tagging them incorrectly. That isn’t a problem as long as you remember to fix them manually at the end. Or you could remove them now to another file to paste them back in at the end of the regex autotagging process.
To perform regex searching, you need to check the box labeled Regular expression
at the bottom of the <oXygen/> find-and-replace dialog box, which you open with
Control-f (Windows) or Command-f (Mac). If you don’t check this box, <oXygen/>
will just search for what you type literally, and it won’t recognize that some
characters in regex have special meaning. You don’t have to check anything else yet. Be
sure that Dot matches all is unchecked, though; we’ll explain why below.
The non-blank lines all begin with space characters: there are two spaces before most
lines (the Roman numerals and the first twelve lines of each sonnet) and four spaces
before the last two lines of every sonnet. Those spaces are presentational formatting,
and not part of the content of the text, and since we don’t need them in order to tag
the text, we’ll start by deleting them. The regex to match a space character is just a
space character, and you can match one or more space characters by using the plus sign
repetition indicator. To match one or more instances of the letter X, you would
use a regex like X+. To match one or more instances of a space character,
just replace the X with a space.
You don’t want to remove all space characters, though; you just want to remove the ones
at the beginning of a line. You can do that by using the caret metacharacter, which
anchors a match so that it succeeds only at the beginning of a line. For example, if the
regex X+ matches one or more instances of X, the regex ^X+ matches one or more
instances of X only at the beginning of a line. You can use this information to match
one or more space characters at the beginning of a line and replace them with nothing,
that is, delete them.
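The same idea can be sketched in Python for readers who want to test a pattern outside <oXygen/>. This is an illustration under assumptions, not the assignment’s required method: Python’s re module differs from <oXygen/> in some details, and in Python the caret matches at the start of every line only when the re.MULTILINE flag is set. The sample is a three-line excerpt typed in by hand.

```python
import re

# Hand-typed excerpt: two leading spaces on most lines, four on a closing line.
sample = (
    "  I\n"
    "  From fairest creatures we desire increase,\n"
    "    To eat the world's due, by the grave and thee.\n"
)

# ^ anchors the match to the start of a line; " +" is one or more spaces.
# Replacing with the empty string deletes the leading spaces.
stripped = re.sub(r"^ +", "", sample, flags=re.MULTILINE)
print(stripped)
```

Spaces inside a line are untouched because the caret only lets the match begin at the start of a line.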
We can always choose whether to work with blank lines or not. For our purposes in this exercise, we do not need them, so you can delete them if you’d like, or you can leave them in place to enhance the legibility. To delete them, you need to match a blank line, and the easiest way to do that is to match two new line characters in a row and replace them with a single new line character. The regex for a new line character is \n. Try it.
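As a rough Python illustration of the blank-line step (again a sketch with a hand-typed sample, not the <oXygen/> procedure itself):

```python
import re

sample = (
    "I\n"
    "\n"
    "From fairest creatures we desire increase,\n"
    "\n"
    "That thereby beauty's rose might never die,\n"
)

# Two new line characters in a row mean there is an empty line between them;
# replacing the pair with a single new line removes that empty line.
collapsed = re.sub(r"\n\n", "\n", sample)
print(collapsed)

# A run of three new lines (two blank lines in a row) still leaves one blank
# line after a single pass, so the replacement may need to be run again.
leftover = re.sub(r"\n\n", "\n", "a\n\n\nb")
```

If the input has several blank lines in a row, repeating the replacement until nothing more is found is one simple way to remove them all.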
We can create our markup either from the outside in (document, then sonnet, then divide
the sonnet into Roman numeral and lines) or from the inside out (lines and Roman
numeral, then wrap those in a sonnet, then wrap all of the sonnets in a document).
Either strategy can be made to work, but we generally find it easier to work from the
inside out. (When we work from outside in, it’s easy to wind up incorrectly
wrapping <line> tags around the <sonnet> start and end tags, etc.)
We’ll start by tagging every line as a <line>. This will erroneously tag the Roman numerals as if they were lines of poetry, which they aren’t, but since we’re using the inside-out method, we are just planning to correct those Roman numeral lines later.
We don’t want to tag any blank lines (if we left them in), though, so we need a regex that
matches only lines that have characters in them. Check your <oXygen/> Find / Replace setup:
make sure that Dot matches all is unchecked! Only with that box unchecked does the dot (.)
match any character except a new line, which means that we can use the plus sign
repetition indicator to match one or more instances of any character except a new line
(that is, .+). By default regex selects the longest possible match, so even
though just two characters on a line will match the pattern, when we run it it will
always match the entire line. Since the dot matches any character except a new line, the
regex will match each line individually, that is, it won’t run over a new line and
continue the same match. Try it and examine the results. Now check Dot matches all,
run Find all, and look at those results. Notice that the match no longer
stops at the end of the line, and since you want to tag each line individually, you need
to uncheck that box to revert to the normal, default behavior.
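The difference between the two settings can be reproduced in Python, where the re.DOTALL flag plays a role similar to the Dot matches all checkbox. This is only a sketch for comparison; the sample is two hand-typed lines.

```python
import re

sample = (
    "Shall I compare thee to a summer's day?\n"
    "Thou art more lovely and more temperate:\n"
)

# Default behavior: the dot stops at a new line, so each line is its own match.
per_line = re.findall(r".+", sample)

# With DOTALL the dot also matches new lines, so one match swallows everything.
across_lines = re.findall(r".+", sample, flags=re.DOTALL)

print(len(per_line), len(across_lines))
```

Here per_line holds two separate matches, while across_lines holds a single match that spans both lines.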
A human might think of our task as “wrap every line in <line> tags”, but regex has a
find-and-replace view of the world, so a regex way to think about it would be
“match every line, delete it, and replace it with itself wrapped in <line> tags”.
That is, regex doesn’t think about
leaving the line in place and inserting something before and after it; it thinks about
matching the line, deleting it, and then putting it back, but with the addition of the
desired tags. The regex selects and matches each full line, but how do we write what we
selected into the replacement string? The answer is that the sequence \0 in
the replacement pattern means “the entire regex match”, and you can use that to
write the matched line back into the replacement, but wrapped in <line> tags. Try it.
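For comparison, here is a minimal Python sketch of the same match-and-rewrap idea. Note one difference that is specific to Python and not to <oXygen/>: in a Python replacement string the whole match is written \g<0> rather than \0. The sample text is hand-typed and includes a blank line to show that it is left alone.

```python
import re

sample = (
    "I\n"
    "From fairest creatures we desire increase,\n"
    "\n"
    "That thereby beauty's rose might never die,\n"
)

# .+ matches each non-empty line in full; \g<0> writes the whole match back
# into the replacement, now surrounded by <line> start and end tags.
tagged = re.sub(r".+", r"<line>\g<0></line>", sample)
print(tagged)
```

The Roman numeral is wrapped as a line too, which is expected at this stage of the inside-out approach.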
The Roman numerals are now erroneously tagged as if they were lines of poetry, and in our
sample output at http://dh.obdurodon.org/shakespeare-sonnets.xml we want them to be attribute
values. To start that process we need to think about how to distinguish a Roman numeral
line from a real line of poetry. Since there are 154 sonnets, a Roman numeral line is a
line that contains one or more instances of I, V, X, L, and C in any order and nothing
else, and no real line of poetry matches that
pattern. That means that we can match that pattern by using a regex character
class, which you can read about at http://www.regular-expressions.info/charclass.html. This approach will match
sequences that aren’t Roman numerals, like XVX, but those don’t occur, so we
don’t have to worry about them. This illustrates a useful strategy: a simple regex that
overgeneralizes vacuously may be more useful than a complex one that avoids matching
things that won’t occur anyway. You can use the character class (wrapped in square
brackets) followed by a plus sign (meaning one or more) to complete your regex so that
it matches only <line> elements that contain a Roman numeral and
nothing but a Roman numeral. Try it.
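One possible shape for such a pattern, sketched in Python with a hand-typed sample, is shown here. It is one candidate among several, and the sample deliberately includes a line of poetry that begins with one of the numeral letters.

```python
import re

sample = (
    "<line>CXVI</line>\n"
    "<line>Let me not to the marriage of true minds</line>\n"
    "<line>Admit impediments. Love is not love</line>\n"
)

# [IVXLC]+ is one or more characters drawn from the class, in any order.
# Because the literal </line> must follow immediately, a poetry line that
# merely starts with L (or I, V, X, C) does not match.
numeral_lines = re.findall(r"<line>[IVXLC]+</line>", sample)
print(numeral_lines)
```

Only the element holding CXVI is found; the line beginning with "Let" is skipped because other characters come before the end tag.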
In this case you want to write the Roman numeral into the replacement string, but you
want to get rid of the spurious <line> tags and replace them with
other markup. \0 will write the entire match into the replacement, but that
would include the original <line> tags that you want to remove. To
capture part of a regex match, you wrap it in parentheses; this doesn’t match
parenthesis characters, but it does make the part of the regex that’s between the
parentheses available for reuse in the replacement string. For example,
a(b)c would match the sequence abc and capture the b in
the middle, so that it could be written into the replacement. Capturing a single literal
character value isn’t very useful because you could have just written the b into
the replacement literally, but you can also capture wildcard matches. For example,
a(.)c matches a sequence of a literal a character followed by
any single character except a new line followed by a literal c character, and you
can use that information to capture everything between the <line>
tags in the matched string. To write a captured pattern into the replacement, use a
backslash followed by a digit, where \1 means the first capture group,
\2 means the second, etc. (and in this case you’re capturing only one
group). We’d build a replacement string that starts with a </sonnet>
end tag, then a new line, and then a <sonnet> start tag, including
the @number attribute and using the captured string as its value, etc. Try it.
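A hedged Python sketch of the capture-group step follows. The find pattern is one candidate rather than the only solution, the attribute name follows the @number mentioned above, and the sample is hand-typed.

```python
import re

sample = (
    "<line>I</line>\n"
    "<line>From fairest creatures we desire increase,</line>\n"
    "<line>II</line>\n"
    "<line>When forty winters shall besiege thy brow,</line>\n"
)

# The parentheses capture only the numeral, so \1 writes it back without
# the surrounding <line> tags, which are dropped from the output.
converted = re.sub(
    r"<line>([IVXLC]+)</line>",
    r'</sonnet>\n<sonnet number="\1">',
    sample,
)
print(converted)
```

The output begins with a stray </sonnet> end tag and lacks a final one, which is the kind of edge that has to be cleaned up by hand at the beginning and end of the document.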
You may have to clean up the beginning and end of the document manually, including the title and author, and you’ll also need to add a root element.
Although you’ve added XML markup to the document, <oXygen/> remembers that you
opened it as plain text, which means that you can’t check it for well-formedness. To fix
that, save it as XML with File → Save as and give it the extension .xml. Even
that doesn’t tell <oXygen/> that you’ve changed the file type, though; you have to
close the file and reopen it. When you do that, <oXygen/> now knows that it’s XML,
so you can verify that it’s well formed in the usual way: Control+Shift+W on Windows,
Command+Shift+W on Mac, or click on the arrow next to the red check mark in the icon bar
at the top and choose Check well-formedness.
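If you also want a quick check outside <oXygen/>, a well-formedness test can be sketched with Python’s standard library. This is an optional aside under assumptions: the helper name is_well_formed and the root element name sonnets are invented for illustration and are not part of the assignment.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if the string parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Hypothetical root element <sonnets> wrapping one sonnet.
complete = (
    '<sonnets><sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
    "</sonnet></sonnets>"
)
# Raw regex output before manual cleanup: stray end tag, no root element.
uncleaned = (
    '</sonnet>\n<sonnet number="I">'
    "<line>From fairest creatures we desire increase,</line>"
)

print(is_well_formed(complete), is_well_formed(uncleaned))
```

The second string fails because it starts with an end tag and never closes its <sonnet> element, which is why the manual cleanup and the root element matter.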
As we mention above, there are several ways to get to the target output, and whatever works is legitimate, as long as you make meaningful use of computational tools, including regular expressions (where appropriate), and don’t just tag everything manually. As you saw in class, there are ways to build your own regular expressions to match whatever patterns you need to identify, and the regex language is complex and often difficult to read. The way we would approach this task is by figuring out what we need to match and then looking up how to match it. In addition to the mini-tutorial above, there is a more comprehensive description in the regex section of Michael Kay’s book and more detailed tutorial information at http://www.regular-expressions.info/tutorialcnt.html. If you decide to look around for alternative reference sites and find something that seems especially useful, please post the URL on the discussion boards, so that your classmates can also consult it.
We don’t need to see the XML that you produce as the output of your transformation because we’re going to recreate it ourselves anyway, but you do need to upload a step-by-step description of what you did. Your write-up can be brief and concise, but it should provide enough information to enable us to duplicate the procedure you followed.
If you don’t get all the way to a solution, just upload the description of what you did, what the output looked like, and why you were not able to proceed any further. As always, you’re encouraged to post any questions on our class GitHub Issues board!