Monday, June 10: XPath

Introduction to XPath in eXist-db and <oXygen/> (10:15 a.m.–12:00 p.m.)

  1. Getting started with XPath and eXide (15 minutes; 10:15 a.m.–10:30 a.m.)
    1. Open eXide, click “New XQuery”, and erase all content in the editing window. You’ll type your XPath in the editing window and run it with the “Eval” button.
    2. Learning XPath (and other languages, including XSLT, XQuery, Schematron) means learning the …
      1. Vocabulary (e.g., the division operator in XPath is div, not /)
      2. Syntax (e.g., in XPath conditional expressions, the if test must be parenthesized and an else is required: if (condition) then 1 else ())
      3. Function library (e.g., string-length() and count() are functions, but there is no length() or len() or size())
      4. Programming paradigm (e.g., you cannot change the value of a variable in a declarative or functional programming language like XQuery or XSLT)
    3. All XPath expressions return a sequence. Sequences may contain nodes (elements, attributes, etc.), atomic values (strings, numbers, etc.), or both. A sequence of one item is nonetheless a sequence, as is an empty sequence. Nested sequences are automatically flattened.
      1. Type a number and hit Eval. This is a one-item sequence that consists of a single atomic value. Try integers and decimal numbers. Try wrapping the number in parentheses.
      2. Type a string (inside single or double quotes) and hit Eval. This is a one-item sequence that consists of a single atomic value. Try omitting the quotation marks. Try using curly quotation marks. Try wrapping the string in parentheses.
      3. Type empty parentheses and hit Eval. This is an empty sequence.
      4. Type multiple items of different types (numbers, strings), separated by commas. Try wrapping them in parentheses. Try wrapping them in multiple parentheses. Try removing the commas. This is a multi-item sequence.
      5. Try to type a nested sequence, e.g., (1, 2, (3, 4)), and hit Eval. What result do you expect? What do you get?
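    The flattening behavior described above can be checked with a short sketch like the following, which can be pasted into the editing window and run with Eval. The literal values are illustrative only.

    ```xquery
    xquery version "3.1";
    (
      (: A nested sequence is flattened, so both of these counts are 4 :)
      count((1, 2, (3, 4))),
      count((1, 2, 3, 4)),
      (: An empty sequence inside a sequence contributes no items: count is 2 :)
      count((1, (), 2)),
      (: A single item is a sequence of length 1 :)
      count('Hi, Mom!')
    )
    ```

    The result is the four-item sequence 4, 4, 2, 1.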
  2. Simple XPath expressions (25 minutes; 10:30 a.m.–10:55 a.m.)
    1. Review: strings and numbers (atomic values) are XPath expressions
      1. "Hi, Mom!" (Strings are enclosed in single or double quotation marks—straight, not curly)
      2. 1 (Numbers are not enclosed in quotation marks)
      3. 1.0 (What should this return? lexical space and value space)
    2. Arithmetic expressions are XPath expressions
      1. 1 + 1
      2. Practice: +, -, *, div, idiv, mod (/ is not division)
    3. XPath library functions (with no arguments) are XPath expressions
      1. current-date()
      2. current-time()
      3. current-dateTime()
    4. XPath library functions (with arguments) are XPath expressions
      1. upper-case('dhsi') (How many arguments, and of what type?)
      2. concat('Curly', 'Larry', 'Moe') (How many arguments, and of what type?)
      3. count(('Curly', 'Larry', 'Moe')) (Why two sets of parentheses? Hint: How many arguments, and of what type?)
      4. Function signature and cardinality: count($items as item()*) as xs:integer
    5. Nested XPath library functions and operations are XPath expressions. Read them from the inside out
      1. max((1 + 2, 10 div 5, 6 * 0.2)) (Remember those two sets of parentheses?)
      2. translate(upper-case('Hi, Mom!'),'AEIOU','xxxxx') (How is this different from upper-case(translate('Hi, Mom!','AEIOU','xxxxx'))?)
      3. format-dateTime(current-dateTime(),'[h].[m01] [Pn] on [FNn], [D1o] [MNn]')
    6. Nested functions are hard to read. Use the arrow operator (=>) instead
      1. upper-case('Hi, Mom!') => translate('AEIOU','xxxxx')
      2. current-dateTime() => format-dateTime('[h].[m01] [Pn] on [FNn], [D1o] [MNn]')
    7. XPath expressions may span multiple lines (try it with the examples above); that is, a newline and a space have the same meaning
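    As a sketch of how nesting order and the arrow operator interact, the expressions below apply upper-case() and translate() to the same string in both orders, written across several lines. The layout is only one possible style.

    ```xquery
    xquery version "3.1";
    (
      (: Upper-case first, then replace the upper-case vowels: "Hx, MxM!" :)
      'Hi, Mom!'
        => upper-case()
        => translate('AEIOU', 'xxxxx'),

      (: Translate first: the string has no upper-case vowels, so nothing
         is replaced, and the final result is "HI, MOM!" :)
      'Hi, Mom!'
        => translate('AEIOU', 'xxxxx')
        => upper-case()
    )
    ```

    Each arrow passes the value on its left as the first argument of the function on its right, so the chain reads left to right instead of inside out.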
  3. XPath in <oXygen/> (20 minutes; 10:55 a.m.–11:15 a.m.)
    1. Launch the <oXygen/> editor, hit Ctrl+u (Windows) or Cmd+u (macOS), copy and paste the string, and hit OK. (Backup copy at ) This is a copy of Hamlet with TEI markup.
    2. Set the dropdown in the upper left to XPath 3.1. (This widget is called the XPath Toolbar.) Enter some XPath expressions (from above, such as 1 + 1). Limited to one line; hit Enter to run the expression. The XPath Toolbar works only if you have an XML document open in <oXygen/>, even if you aren’t using the document in your XPath expression.
    3. Go to Window → Show View → XPath/XQuery Builder. Set the dropdown in the upper left to XPath 3.1. Enter some XPath expressions. May span multiple lines; Enter for a new line. To run, hit Ctrl+Enter (Windows) or Cmd+Enter (macOS), or click the red right-pointing triangle.
  4. XPath path expressions (20 minutes; 11:15 a.m.–11:35 a.m.)
    1. An XPath path expression is a sequence of steps, each of which proceeds from one node (called the context node) to a sequence of zero (!) or more others. It returns the results in document order (order of start tags). (Details at Kay 1227)
    2. Sample XPath path expression: /TEI/text/body/div: start at the document node, then navigate to a sequence of all its <TEI> children. For each of those, navigate to all of their <text> children, then to their <body> children, and then to their <div> children.
    3. XPath steps are separated by single slashes (/).
    4. An XPath expression that begins with a slash (/) starts at the document node; this is an absolute path. Any other XPath expression starts at the current context; this is a relative path.
    5. It is not an error to ask for something that doesn’t exist; it just returns an empty sequence.
    6. With Hamlet open and selected, go to the XPath Toolbar or XPath Builder and try the following examples. Click on some of the results in the lower panel:
      1. /TEI/teiHeader/fileDesc/titleStmt/title (returns 1 <title> element)
      2. /TEI/text/body/div (returns 5 <div> elements)
      3. /TEI/teiHeader/fileDesc/titleStmt/info (returns no results; this is not an error)
      4. /TEI/teiHeader/fileDesc/title Stmt/title (raises an error; spaces are not allowed inside element names)
    7. Namespaces matter in Real Life (XSLT, XQuery, Schematron), but the <oXygen/> XPath Toolbar and XPath Builder take care of them for you behind the scenes (eXide does not).
  5. XPath path steps (25 minutes; 11:35 a.m.–12:00 p.m.)
    1. Path steps move along axes: child::, parent::, descendant::, ancestor::, preceding-sibling::, following-sibling::, etc.
    2. Axes are specified with a double colon, e.g., descendant::div matches all <div> descendants of the current context node. There are two common shortcut notations:
      1. The default is the child axis, so /TEI/teiHeader is synonymous with /child::TEI/child::teiHeader. Use the shorthand.
      2. // is shorthand for /descendant-or-self::node()/, so /TEI//div finds all of the <div> elements that are descendants of the <TEI> root element, that is, anywhere in the document. The document node has a descendant axis, too: //div. Be careful with this one!
    3. Each path step returns a sequence of zero or more context nodes for the next path step. Only the final path step is permitted to return something other than a node. Why?
    4. The end of a path expression may return nodes or atomic values
      1. //body/div/count(descendant::sp) navigates from the document node to all of the acts in the play and then returns a count of the speeches in each act
      2. What’s wrong with //body/div/count(//sp)? The leading double slash resets the current context to the document node, and selects all <sp> elements in the entire document, instead of just the individual act.
    5. * matches any element
      1. /TEI/teiHeader/* matches all child elements of the <teiHeader>
    6. .. matches the parent node of the current context node. That is, it’s shorthand for parent::*
      1. //stage/.. matches the parent nodes of all <stage> elements
    7. Your turn:
      1. Find the acts (<div> children of <body>) in Hamlet //body/div
      2. Find the stage directions (<stage>) in Hamlet //stage
      3. Find the <stage> children of <div> elements (but not other <stage> elements) in Hamlet //div/stage
      4. Find the parents of the stage directions in Hamlet //stage/.. or //stage/parent::*
      5. Find the <div> parents of the stage directions in Hamlet, but not other parents //stage/parent::div
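    The difference between count(descendant::sp) and count(//sp) at the end of a path can be sketched without Hamlet. The tiny $play document below is a hypothetical stand-in with two acts and no namespace.

    ```xquery
    xquery version "3.1";
    (: Hypothetical miniature play: the first act has 2 speeches, the second has 1 :)
    let $play :=
      document {
        <body>
          <div><sp/><sp/></div>
          <div><sp/></div>
        </body>
      }
    return (
      (: Counted relative to each act: returns 2, 1 :)
      $play//body/div/count(descendant::sp),
      (: The leading // restarts at the document node for every act: returns 3, 3 :)
      $play//body/div/count(//sp)
    )
    ```

    In both cases the final step runs once per act; only the starting point of the counted path differs.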

Exploring document structures and data with XPath (1:30 p.m.–4:00 p.m.)

  1. XPath functions for strings (25 minutes; 1:30 p.m.–1:55 p.m.)
    1. concat()
      1. concat('Curly','Larry','Moe')
      2. concat('Curly is #', 1)
      3. Or use the concatenation operator: 'Curly is #' || 1
      4. What’s wrong with concat(//speaker)? The arguments to concat() must be two or more individual atomic (or atomizable) items, and //speaker is a sequence
    2. string-join()
      1. string-join(( 'Curly', 'Larry', 'Moe'), ', ')
      2. string-join(//speaker, ', ')
      3. string-join(//speaker) Why does this work when concat(//speaker) didn’t? The first argument to string-join() is a sequence. All arguments to concat() must be atomic or atomizable.
    3. string-length()
      1. string-length('Curly, Larry, and Moe')
    4. lower-case(), upper-case()
      1. lower-case('Curly, Larry, and Moe')
    5. normalize-space()
      1. normalize-space(' Curly, Larry, Moe ')
    6. substring-before(), substring-after()
      1. substring-before('Larry', 'r') What if there’s more than one?
      2. substring-after('Larry', 'r') What if there’s more than one?
    7. substring()
      1. substring('Curly', 1, 2) XPath starts counting with 1 (not 0).
    8. contains() Foreshadowing: This returns a Boolean (True or False) value. How might this be useful?
      1. contains('Ophelia', 'ph')
      2. //speaker/contains(., 'ph') (the dot refers to the current context item)
    9. starts-with(), ends-with()
      1. starts-with('Ophelia', 'Op')
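    A few of the string functions above are gathered into one sketch below, using a literal sequence in place of //speaker so that it runs without a document. The variable name $names is an illustrative choice.

    ```xquery
    xquery version "3.1";
    let $names := ('Curly', 'Larry', 'Moe')
    return (
      (: string-join() accepts a whole sequence as its first argument :)
      string-join($names, ', '),          (: "Curly, Larry, Moe" :)
      (: concat() needs separate arguments, so the items are spelled out :)
      concat($names[1], ' & ', $names[2]), (: "Curly & Larry" :)
      (: With more than one match, only the first 'r' counts :)
      substring-before('Larry', 'r'),     (: "La" :)
      substring-after('Larry', 'r'),      (: "ry" :)
      (: Positions start at 1: two characters starting at position 1 :)
      substring('Curly', 1, 2),           (: "Cu" :)
      normalize-space('  Curly,   Larry,  Moe ')  (: "Curly, Larry, Moe" :)
    )
    ```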
  2. XPath functions for numbers (20 minutes; 1:55 p.m.–2:15 p.m.)
    1. max(), min(), sum(), avg()
      1. max((1, 2, 3)), etc.
      2. What happens when these are applied to strings? To a sequence that mixes strings and numbers?
    2. ceiling(), floor()
      1. ceiling(3.141592653)
    3. round()
      1. round(3.141592653, 4)
    4. format-integer(), format-number()
      1. format-integer(154,'w')
      2. format-integer(154,'I')
      3. format-number(1, '#.000')
    5. Find the length in character count of each <speaker> //speaker/string-length() (why doesn’t string-length(//speaker) work?)
    6. Find the length of the longest speaker name max(//speaker/string-length())
  3. XPath functions for sequences (15 minutes; 2:15 p.m.–2:30 p.m.)
    1. distinct-values()
      1. distinct-values(/TEI//speaker)
    2. count()
      1. count(('Curly', 'Larry', 'Moe', 'Curly'))
      2. count(distinct-values(('Curly', 'Larry', 'Moe', 'Curly')))
      3. distinct-values(count(('Curly', 'Larry', 'Moe', 'Curly')))
    3. sort()
      1. sort(//speaker)
      2. sort(//speaker,(), function($item) {string-length($item)})
    4. Your turn:
      1. How many <speaker> elements are there in Hamlet? count(//speaker)
      2. How many distinct <speaker> elements are there in Hamlet? count(distinct-values(//speaker))
      3. How many acts are there in Hamlet? count(//body/div)
      4. How many scenes are there in Hamlet? count(//div/div)
      5. What does count(//div) tell you about Hamlet, and why is it unhelpful? It counts <div> elements of different types together: acts, scenes, cast list.
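    The order in which count() and distinct-values() are nested changes the question being asked, as the following sketch shows. The list of names is illustrative, and the sort key in the last expression is just one possible choice.

    ```xquery
    xquery version "3.1";
    let $names := ('Curly', 'Larry', 'Moe', 'Curly')
    return (
      count($names),                    (: 4 items, duplicates included :)
      count(distinct-values($names)),   (: 3 different names :)
      distinct-values(count($names)),   (: the single number 4 is already distinct: 4 :)
      (: Sort by string length instead of alphabetically; the three sample
         strings have different lengths, so the order is unambiguous :)
      sort(('Larry', 'Moe', 'Joe Besser'), (), function($item) { string-length($item) })
    )
    ```

    The last expression returns "Moe", "Larry", "Joe Besser".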
  4. Looking Stuff Up: XPath function signatures and cardinality (10 minutes; 2:30 p.m.–2:40 p.m.)
    1. The function signature is the number and type of arguments it accepts or requires, and the number and type of items it returns
    2. Type error: string-length(1.2345)
    3. Cardinality error: string-length(/TEI//speaker)
    4. Why is count(/TEI//speaker) okay, while count('Curly', 'Larry', 'Moe') is broken?
    5. The error message is your friend. Read it.
    6. Resources and references:
  5. Break (10 minutes; 2:40 p.m.–2:50 p.m.)
  6. XPath predicates (15 minutes; 2:50 p.m.–3:05 p.m.)
    1. Predicates, in square brackets after a path step, filter the results
    2. Numerical predicates
      1. //body/div[3] matches the third <div> child of each <body> element (same as //body/div[position() eq 3])
      2. //body/div[last()] matches the last <div> child of each <body> element
    3. Predicates with node tests
      1. //stage[parent::div] is equivalent to //div/stage
    4. Predicates with operators and functions
      1. //sp[speaker eq 'Ophelia']
      2. //sp[contains(speaker, 'Rosencrantz')]
      3. //lg[@type eq 'couplet']
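    The three kinds of predicate can be tried on a small inline document. The $play below is a hypothetical two-act fragment without a namespace, not the real Hamlet file.

    ```xquery
    xquery version "3.1";
    let $play :=
      document {
        <body>
          <div><sp><speaker>Hamlet</speaker><l>a</l></sp></div>
          <div>
            <sp><speaker>Ophelia</speaker><l>b</l><l>c</l></sp>
            <stage>Exit</stage>
          </div>
        </body>
      }
    return (
      (: Numerical predicate: the last <div> child of <body> holds the stage direction :)
      $play//body/div[last()]/stage/string(),   (: "Exit" :)
      (: Predicate with a node test: same nodes as $play//div/stage :)
      count($play//stage[parent::div]),         (: 1 :)
      (: Predicate with an operator :)
      count($play//sp[speaker eq 'Ophelia']),   (: 1 :)
      (: Predicate with a function :)
      $play//sp[count(l) gt 1]/speaker/string() (: "Ophelia" :)
    )
    ```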
  7. Comparison (15 minutes; 3:05 p.m.–3:20 p.m.)
    1. Value comparison
      1. eq, ne, lt, le, gt, ge
      2. 1 eq 1
      3. 1 lt 1
      4. "Curly" gt "Larry"
    2. General comparison
      1. =, !=, <, <=, >, >=
      2. 1 = 1
      3. 1 eq 1
      4. 1 = (1, 2, 3)
      5. 1 != (1, 2, 3)
      6. not(1 = (1, 2, 3))
      7. (1, 2, 3) = (1, 4, 5)
      8. 1 eq (1, 2, 3)
      9. "Moe" = ("Curly", "Larry", "Moe")
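    The examples above can be run together as one sequence. This sketch leaves out 1 eq (1, 2, 3), because a value comparison with more than one item on either side raises a type error rather than returning a result.

    ```xquery
    xquery version "3.1";
    (
      1 eq 1,                  (: true :)
      1 lt 1,                  (: false :)
      "Curly" gt "Larry",      (: false: "C" sorts before "L" :)
      1 = (1, 2, 3),           (: true: some item on the right equals 1 :)
      1 != (1, 2, 3),          (: true: some item on the right differs from 1 :)
      not(1 = (1, 2, 3)),      (: false: negates the "some item equals" test :)
      (1, 2, 3) = (1, 4, 5),   (: true: the two sequences share the value 1 :)
      "Moe" = ("Curly", "Larry", "Moe")  (: true :)
    )
    ```

    General comparison asks whether any pair of items satisfies the test, which is why 1 = (1, 2, 3) and 1 != (1, 2, 3) are both true.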
  8. Odds and ends (15 minutes; 3:20 p.m.–3:35 p.m.)
    1. Three ways to apply a function to a sequence
      1. Explicit for
        1. for $speaker in /TEI//speaker return string-length($speaker)
      2. Implicit for
        1. /TEI//speaker/string-length()
      3. Simple map (!)
        1. /TEI//speaker ! string-length(.)
    2. Difference between simple map (!) and arrow (=>)
      1. ('Curly', 'Larry', 'Moe') => count()
      2. ('Curly', 'Larry', 'Moe') ! count(.)
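    A sketch of the contrast, using the same literal sequence: the arrow hands the whole sequence to the function once, while the simple map calls the function once per item.

    ```xquery
    xquery version "3.1";
    let $stooges := ('Curly', 'Larry', 'Moe')
    return (
      (: Arrow: count() receives all three items at once and returns 3 :)
      $stooges => count(),
      (: Simple map: count(.) runs once per item and returns 1, 1, 1 :)
      $stooges ! count(.),
      (: All three ways of applying string-length() to each item return 5, 5, 3 :)
      for $s in $stooges return string-length($s),
      $stooges ! string-length(.)
    )
    ```

    The implicit-for form (a path step such as /TEI//speaker/string-length()) is omitted here because the left side of a path step must be a sequence of nodes, and these items are strings.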
  9. Read and evaluate XML projects with XPath (25 minutes; 3:35 p.m.–4:00 p.m.)
    1. How many speeches (<sp>) does Ophelia have? count(//sp[speaker eq 'Ophelia'])
    2. How many speeches does Ophelia have in Act 2? count(//body/div[2]//sp[speaker eq 'Ophelia'])
    3. What types of elements can have stage directions (<stage>) as children? (Hint: use the name() function.) distinct-values(//stage/../name())
    4. How many speeches don’t contain any metrical line child elements (<l>)? (Hint: use the not() function.) count(//sp[not(l)])
    5. Building on your answer to the last question, who are the speakers of those speeches? distinct-values(//sp[not(l)]/speaker)
    6. Building on your answer to the last two questions, what kinds of elements do they contain instead? distinct-values(//sp[not(l)]/*/name())
    7. What is Hamlet’s first spoken line (<l>)? (//sp[speaker eq 'Hamlet']/l)[1]
    8. What is the last stage direction in the entire document? (//stage)[last()]
    9. How many speeches have more than 8 line children? count(//sp[count(l) gt 8])
    10. Building on your answer to the preceding question, how many line children does each of those speeches have? //sp[count(l) gt 8]/count(l)
    11. Building on your answers to the preceding two questions, who are the speakers of speeches that have more than 8 line children? distinct-values(//sp[count(l) gt 8]/speaker)
    12. How long is the longest speech? max(//sp/string-length()) (or, better: max(//sp/string-length(normalize-space())))
    13. Building on your answer to the last question, who is the speaker of the longest speech? //sp[string-length() eq max(//sp/string-length())]/speaker (or, better: //sp[string-length(normalize-space()) eq max(//sp/string-length(normalize-space()))]/speaker). //sp[not(string-length() < //sp/string-length())] (with or without a normalize-space() operation) also works; how?

Tuesday, June 11: XPath and XQuery

XPath and XQuery in eXist-db (9:00 a.m.–12:00 p.m.)

  1. Housekeeping: documents, collections, and namespaces (10 minutes; 9:00 a.m.–9:10 a.m.)
    1. Open eXide. In the eXide window, click on the New XQuery tab. This brings up a window with xquery version "3.1"; at the top.
    2. Access a document with doc()
      1. doc('/db/apps/shakespeare/data/ham.xml')
    3. Access a collection of documents with collection()
      1. collection('/db/apps/shakespeare/data/')
    4. Namespace declaration
      1. declare namespace tei="";
      2. <stage> elements in Hamlet: doc('/db/apps/shakespeare/data/ham.xml')//tei:stage
      3. Find all the stage directions in the entire Shakespeare collection collection('/db/apps/shakespeare/data/')//tei:stage
  2. The seven types of nodes (30 minutes; 9:10 a.m.–9:40 a.m.)
    1. document-node()
    2. element()
    3. attribute()
      1. collection('/db/apps/shakespeare/data/')//tei:sp/@who
      2. collection('/db/apps/shakespeare/data/')//tei:sp/@who/string()
    4. text() (not a function; not to be confused with string())
      1. doc('/db/mitford/literary/Charles1.xml')//tei:stage (Mary Russell Mitford’s Charles the First)
      2. What does doc('/db/mitford/literary/Charles1.xml')//tei:stage/string() return? The string values of the stage directions, that is, the stage directions with all markup stripped
      3. What does doc('/db/mitford/literary/Charles1.xml')//tei:stage/text() return? The text() nodes in each stage direction
      4. Compare:
        1. (doc('/db/mitford/literary/Charles1.xml')//tei:stage)[1]
        2. (doc('/db/mitford/literary/Charles1.xml')//tei:stage)[1]/string()
        3. (doc('/db/mitford/literary/Charles1.xml')//tei:stage)[1]/text()
    5. Rarely used: comment(), namespace-node(), processing-instruction()
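    The difference between string() and text() shows up as soon as an element has mixed content. The <stage> element below is a made-up example without a namespace, not taken from the Mitford file.

    ```xquery
    xquery version "3.1";
    (: Hypothetical stage direction with a child element inside the text :)
    let $stage := <stage>Enter <name>Hamlet</name>, reading</stage>
    return (
      (: string() gives the full string value with markup stripped :)
      $stage/string(),              (: "Enter Hamlet, reading" :)
      (: text() selects only the text nodes that are children of <stage> :)
      count($stage/text()),         (: 2 :)
      $stage/text() ! string(.),    (: "Enter ", ", reading" :)
      (: The name lives in a text node one level down :)
      $stage/name/text() ! string(.) (: "Hamlet" :)
    )
    ```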
  3. Neglected XPath axes (25 minutes; 9:40 a.m.–10:05 a.m.)
    1. preceding::, following::
      1. Find all the stage directions (<stage>) that precede Act 2, Scene 2 in Charles the First. (First, take a look at the TEI body element and see how acts and scenes are coded. Write an XPath to take you to the second act and the second scene. Then in your next path step, switch to the preceding:: axis to look for the stage directions.) doc('/db/mitford/literary/Charles1.xml')//tei:body/tei:div[@type='act'][2]/tei:div[@type='scene'][2]/preceding::tei:stage
    2. self::
      1. doc('/db/mitford/literary/Charles1.xml')//tei:body/tei:div[@type='act'][2]/tei:div[@type='scene'][2]/tei:head/following-sibling::*[2]
      2. doc('/db/mitford/literary/Charles1.xml')//tei:body/tei:div[@type='act'][2]/tei:div[@type='scene'][2]/tei:head/following-sibling::*[2][self::tei:stage]
      3. doc('/db/mitford/literary/Charles1.xml')//tei:body/tei:div[@type='act'][2]/tei:div[@type='scene'][2]/tei:head/following-sibling::*[2][not(self::tei:stage)]
      4. doc('/db/mitford/literary/Charles1.xml')//tei:speaker[. eq 'Queen.']
      5. doc('/db/mitford/literary/Charles1.xml')//tei:speaker[self::node() eq 'Queen.']
    3. ancestor::, ancestor-or-self::, descendant-or-self::, namespace::
      1. doc('/db/mitford/literary/Charles1.xml')//tei:speaker[. eq 'Queen.']/ancestor::tei:div[@type eq 'act']/@n/string()
  4. Scavenger hunt 1 (40 minutes; 10:05 a.m.–10:45 a.m.)
    1. Work with the Digital Mitford Site Index posted in eXist at /db/mitford/si.xml or the official version at its external location: Can you find out the following?
      1. Use XPath to show the attributes on <div> elements. What are the attribute names, and how can you show their values in the return window? Retrieve the names with doc('')//tei:div/@*/name() => distinct-values(). Once you know that the name is @type, retrieve the values with doc('')//tei:div/@type/string()
      2. Write an XPath expression that shows you the children of <div> elements without your specifying their names. Now write an XPath expression that shows you the grandchildren. Children: doc('')//tei:div/*. Grandchildren: doc('')//tei:div/*/*.
      3. Write an XPath expression to report the name of the only attribute that can appear on child elements of <div> elements without your knowing what the elements are. What values can that attribute have in the document? doc('')//tei:div/*/@*/name() => distinct-values() tells you that the only attribute that can appear there is @sortKey. Get the occurring values with doc('')//tei:div/*/@sortKey/string() or (without knowing the name of the attribute) doc('')//tei:div/*/@*/string()
      4. What kinds of elements can be the parents of @xml:id attributes in this document? Write an XPath that isolates these elements. doc('')//@xml:id/parent::* or doc('')//*[@xml:id]. These elements hold entries describing named entities, and the @xml:id contains a distinct identifier for each one.
      5. The @xml:id for the play Charles the First in the site index is "CharlesI_MRMplay". Write an XPath to retrieve that entry. doc('')//*[@xml:id eq 'CharlesI_MRMplay']
      6. References to the play throughout the site index will be made with various attributes (such as @ref but there may be others) whose values begin with a hash mark, formatted like this: "#CharlesI_MRMplay". Sometimes other entries (such as for persons or fictional characters) make reference to the play in their note or other internal elements (to say, for example, that an actor or a character was involved in this play). How can we find those entries? How many are there? (Use a function to do the counting.) Find the elements with doc('')//tei:div/*/*[@xml:id][descendant::*/@*="#CharlesI_MRMplay"]. Count them with doc('')//tei:div/*/*[@xml:id][descendant::*/@*="#CharlesI_MRMplay"] => count()
  5. Break (10 minutes; 10:45 a.m.–10:55 a.m.)
  6. Regex in XPath (35 minutes; 10:55 a.m.–11:30 a.m.)
    1. contains() vs. matches()
      1. doc('/db/mitford/literary/Charles1.xml')//tei:l[contains(., 'murder')]
      2. doc('/db/mitford/literary/Charles1.xml')//tei:l[contains(., 'unrighteousness')]
      3. doc('/db/mitford/literary/Charles1.xml')//tei:l[matches(., '[a-z]{15,}','i')] (Read about flags.)
      4. doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '\d{4}')]
      5. doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '(^|\D)\d{4}($|\D)')] Why is the number of results smaller than for the previous expression?
        xquery version "3.1";
        declare namespace tei="";
        let $a := doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '(^|\D)\d{4}($|\D)')]
        let $b := doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '\d{4}')]
        return $b except $a (: returns items in $b that are not in $a :)
    2. translate() vs. replace()
      1. Try this expression, doc('/db/mitford/literary/Charles1.xml')//tei:roleDesc//tei:rdg/string(), and notice the pseudomarkup (parentheses) in the cast list. translate() to the rescue! doc('/db/mitford/literary/Charles1.xml')//tei:roleDesc//tei:rdg/translate(., '()', '')
      2. The next examples work with the @xml:id attributes on the <l> elements. How can you get a look at the @xml:id values first? doc('/db/mitford/literary/Charles1.xml')//tei:l/@xml:id/string()
      3. Uh oh! We wanted to write C1 instead of Chas. How does the following XPath expression fix the problem? doc('/db/mitford/literary/Charles1.xml')//tei:l/replace(@xml:id, '^Chas', 'C1')
    3. substring-before() and substring-after() vs. tokenize()
      1. Return only location (e.g. ded, pro, act) and line number information of the @xml:id attributes: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-after(@xml:id, 'Chas_')
      2. Working with the expression we just wrote, how would you use substring-before() to return only the location (ded, pro, act), trimming off the line number information? The old-fashioned way uses nested functions: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-before(substring-after(@xml:id, 'Chas_'), '_'). The more legible way uses the simple map operator: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-after(@xml:id, 'Chas_') ! substring-before(.,'_'), or even doc('/db/mitford/literary/Charles1.xml')//tei:l/@xml:id ! substring-after(., 'Chas_') ! substring-before(.,'_'). Why can’t we use the arrow operator (=>) here?
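    The regex and substring techniques in this unit can be tried on plain strings before running them against the play. In the sketch below, the value 'Chas_act_12' is a hypothetical @xml:id invented for illustration; the real identifiers in Charles1.xml may be shaped differently.

    ```xquery
    xquery version "3.1";
    let $id := 'Chas_act_12'   (: hypothetical @xml:id value :)
    return (
      (: A bare \d{4} also matches inside longer digit runs :)
      matches('12345', '\d{4}'),                  (: true :)
      (: Requiring a non-digit or a string boundary on both sides rules that out :)
      matches('12345', '(^|\D)\d{4}($|\D)'),      (: false :)
      matches('In 1649 the king', '(^|\D)\d{4}($|\D)'),  (: true :)
      (: translate() swaps or deletes single characters; replace() takes a regex :)
      translate('(a queen)', '()', ''),           (: "a queen" :)
      replace($id, '^Chas', 'C1'),                (: "C1_act_12" :)
      (: Two routes to the middle part of the identifier :)
      substring-after($id, 'Chas_') ! substring-before(., '_'),  (: "act" :)
      tokenize($id, '_')[2]                       (: "act" :)
    )
    ```

    With a sequence of many identifiers on the left, the simple map applies substring-before() to each item in turn, whereas the arrow operator would pass the entire sequence as a single first argument, which substring-before() does not accept.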
  7. Introducing variables (10 minutes; 11:30 a.m.–11:40 a.m.)
    1. Global variables and syntax, how to return their values in eXist-db
    2. In eXist-db, keep the TEI namespace declaration line, and copy the following global variables (the trailing semicolon is required):
      1. declare variable $Chas as document-node() := doc('/db/mitford/literary/Charles1.xml');
      2. declare variable $ChasPlay as element() := $Chas/*;
      3. Return the value of the variables by typing their names on the next line $Chas, and $ChasPlay. Notice the difference in the data type declaration and in the results.
      4. The value after as specifies the data type. It is optional, but strongly recommended.
        1. Other common values include xs:string or xs:integer.
        2. May also include repetition indicators, e.g., declare variable $ChasActs as element(tei:div)+ := doc('/db/mitford/literary/Charles1.xml')//tei:body/tei:div;.
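    A minimal sketch of typed global variables that runs without any document follows. The variable names $stooges and $howMany are illustrative and are not part of the course files.

    ```xquery
    xquery version "3.1";
    (: xs:string+ means one or more strings; the trailing semicolon is required :)
    declare variable $stooges as xs:string+ := ('Curly', 'Larry', 'Moe');
    (: A later declaration may refer to an earlier one :)
    declare variable $howMany as xs:integer := count($stooges);

    (: The query body comes after all declarations :)
    concat($howMany, ' Stooges: ', string-join($stooges, ', '))
    ```

    This returns "3 Stooges: Curly, Larry, Moe". Changing the first declaration to as xs:string (without the +) raises a type error, because the value has three items.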
  8. Introducing FLWOR (20 minutes; 11:40 a.m.–12:00 p.m.)
    1. FLWOR keywords: for, let, where, order by, return
    2. The simplest FLWOR: let (or for) followed by return
    3. Retrieve a sequence of whole elements:
      1. let $places := $Chas//tei:placeName return $places
      2. How would you return only their text contents? return $places/string(), but notice the white space issues. Repair these with return $places/normalize-space()

XQuery flow control (1:30 p.m.–4:00 p.m.)

  1. Writing XQuery in stages (30 minutes; 1:30 p.m.–2:00 p.m.)
    1. If our eXist-db on the NewtFire server is misbehaving, we will be writing XQuery in the <oXygen/> XML editor. This involves some initial configuration. You may open the Digital Mitford site index at its external location (as we did yesterday with File > Open URL): Look for four tiny buttons in the top right-hand corner of the interface. The second one from the right, marked XQ, is the XQuery debugger. Click on it to change the interface for XQuery writing. Then, go to File > New (or click on the paper icon on the top left), and go to open a New File: Type XQ in the search window to open a new XQuery document. We can begin writing XQuery here. In the XQuery debugger view, we need to choose a dummy file to transform, and the name of an XQuery file to run. We also need to choose a parsing engine: choose Saxon-PE XQuery....
    2. If working on the NewtFire eXist-db, work in the same file we started before the break and delete only the return line. Use variables and FLWOR statements to define and retrieve the following:
      1. Define a global variable pointing to the Digital Mitford site index document, at its external location: Declare a second global variable for the Charles I play at its GitHub location: declare variable $si as document-node() := doc(''); and declare variable $Chas as document-node() := doc(''); Note that declare statements must appear at the beginning of the XQuery script.
      2. Write a variable (in either declare or let form) that locates all of the <place> elements in the site index document. Use the variable you just defined for the si.xml document in your expression. declare variable $siPlaces := $si//tei:place; or let $siPlaces := $si//tei:place.
      3. For housekeeping purposes, rename the variable $places (which we defined earlier to retrieve place references in the play, using $Chas//tei:placeName); call it $Chasplaces.
      4. Define a new variable to retrieve the values of the @ref attributes on those $Chasplaces. Don’t forget the string() to return the attribute value: let $ChasPlaceRefs := $Chasplaces/@ref/string()
      5. Take a look at the last XPath Scavenger hunt challenge and solution from this morning. (This asked you to find references to "CharlesI_MRMplay" in the site index.) How would you rewrite that solution as a let statement in XQuery? let $siChasRefs := $si//tei:div/*/*[@xml:id][descendant::*/@*="#CharlesI_MRMplay"]
  2. Review XPath for loops; sequence and range variables (in <oXygen/>) (20 minutes; 2:00 p.m.–2:20 p.m.)
    1. In the <oXygen/> XPath Builder View, try this code: for $i in ("Curly", "Larry", "Moe") return concat($i, " is a Stooge!")
    2. Rewrite it as a simple map (with !). ("Curly", "Larry", "Moe") ! concat(., " is a Stooge!")
    3. Open the Digital Mitford site index URL in <oXygen/>. In the <oXygen/> XPath Builder:
      1. Enter //person[@xml:id eq 'Hume_Jos'] to find the entry for Joseph Hume. Notice that the string Scotland appears inside his <birth> child element, which means that he was born in Scotland.
      2. Find all persons listed as born in Scotland in two ways: with a for loop and as a one-line XPath path expression. for $i in //person[contains(birth, "Scotland")] return $i or //person[contains(birth, "Scotland")]. You should return 30 <person> elements.
      3. Now, modify that example to return the @xml:id attributes of the <person> elements. For the @xml:id: for $i in //person[contains(birth, "Scotland")] return $i/@xml:id or //person[contains(birth, "Scotland")]/@xml:id. Why do we see the values even though we didn’t write string() at the end of the path expression?
  3. FLWOR statements in XQuery: how for works: Part 1 (20 minutes; 2:20 p.m.–2:40 p.m.)
    1. We’ll find the places mentioned in Charles the First and dereference them (look them up) in the Digital Mitford site index.
      1. First find the distinct (unique) values of @ref attributes on <placeName> elements in the play and assign them to a variable. let $distChPRs := distinct-values($ChasPlaceRefs)
      2. Next, loop through each of these distinct values: for $i in $distChPRs
      3. How will we find the site index entry that matches each distinct place reference? Each site index entry holds an @xml:id, and each <placeName> element has a @ref attribute whose value is formatted with a leading # followed by the @xml:id value. We can use these to match the references in the play with the relevant information in the site index.
      4. Find the site index entry whose @xml:id matches the value of the range variable in our for loop and assign it to a variable. let $siCPrs := $si//tei:place[@xml:id = substring-after($i, '#')]
  4. Break (10 minutes; 2:40 p.m.–2:50 p.m.)
  5. FLWOR statements in XQuery: how for works: Part 2 (30 minutes; 2:50 p.m.–3:20 p.m.)
    1. Sorting and ordering your sequence in two ways:
      1. Employ the XPath sort() function to sort the unique place references (@ref attributes on <placeName> elements) in the play and assign the resulting sequence to a variable. let $distChPRs := sort(distinct-values($ChasPlaceRefs))
      2. Alternatively, use the XQuery order by clause as part of a FLWOR: order by $someVariable. By default, the ordering is ascending; strings sort alphabetically (a to z) and numbers sort numerically, so convert the sort key to a number (for example, with number()) when you need numeric order. Reverse the order with the optional keyword descending. To sort in reverse alphabetical order by the @xml:id in the site index entry: order by $siCPrs/@xml:id descending
    2. Numbering the results with $pos
      1. Set the $pos variable in the for statement: for $i at $pos in $YourSequenceVariable, but caution: order by happens after $pos is set. How do we ask for sorted, numbered output? Use the sort() function before the for loop begins. Try a return like return concat($pos, '. ', $siCPrs/@xml:id/string())
    3. Adding where in a for-loop to limit the returns
      1. Notice the blank results: A number of entries are not yet in the site index. So we can limit by selecting only those where the variable $siCPrs exists: where $siCPrs
      2. Or use where to return only results in the site index whose string() contains "France": where $siCPrs[contains(string(.), 'France')]
    4. Which is more efficient in XQuery: a predicate or where?
    5. Text returns: combining strings into one result: concat() and string-join()
      1. Retrieving a full canonical place name: let $name := $siCPrs/tei:placeName[1]
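    The pieces of this unit (sorting before the loop, numbering with $pos, looking up entries, and filtering with where) fit together roughly as in the sketch below. Everything in it is a stand-in: $refs imitates the @ref values from the play, $index imitates the site index with made-up entries in no namespace, and the variable names are chosen for illustration.

    ```xquery
    xquery version "3.1";
    (: Hypothetical @ref values, with a duplicate and one reference that has no entry :)
    let $refs := ('#London', '#Paris', '#London', '#Oxford')
    (: Hypothetical miniature site index :)
    let $index :=
      <listPlace>
        <place xml:id="London"><placeName>London</placeName></place>
        <place xml:id="Paris"><placeName>Paris</placeName></place>
      </listPlace>
    (: Sort before the loop so that $pos follows the sorted order :)
    for $ref at $pos in sort(distinct-values($refs))
    (: Dereference: strip the leading # and match against @xml:id :)
    let $entry := $index/place[@xml:id = substring-after($ref, '#')]
    (: Keep only references that have an entry in the index :)
    where $entry
    return concat($pos, '. ', $entry/placeName[1])
    ```

    This returns "1. London" and "3. Paris". Position 2 belongs to #Oxford, which is filtered out by where after $pos has already been assigned, so the numbering has a gap.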
  6. 'To create a little flower is the labour of ages.' –William Blake, The Proverbs of Hell. Putting it all together: writing FLWORs to make new files (40 minutes; 3:20 p.m.–4:00 p.m.)
    1. HTML returns: how to use curly braces to layer and activate XQuery in an HTML file.
      1. HTML table output:
      2. XQuery to make the HTML, in the newtfire eXist-db: /db/DHSI-Queries/Chas-SI-HTMLTable.xql, or on GitHub:
    2. SVG returns: a bar graph from XQuery
      1. SVG bar graph output: or
      2. XQuery to make the SVG, in the newtfire eXist-db: /db/DHSI-Queries/Chas-PersNameGraph-SVG.xql, or on GitHub:

Wednesday, June 12: XPath and XSLT

Introduction to XPath in XSLT (9:00 a.m.–12:00 p.m.)

  1. Preparation for writing XSLT in <oXygen/> (20 minutes; 9:00 a.m.–9:20 a.m.)
    1. Settings: XSLT debugger and Saxon processor
    2. Selecting files to run and save
      1. Open <oXygen/> and open the following URL:
      2. Open this starter XSLT file, too: Save this file locally on your computer.
  2. XSLT overview in <oXygen/> (40 minutes; 9:20 a.m.–10:00 a.m.)
    1. XSLT (eXtensible Stylesheet Language Transformations): an XML document with special namespaced elements designed to process XML documents
    2. Basic structure: <xsl:stylesheet> is the root element, with <xsl:template> children.
    3. Housekeeping: Namespaces and where they matter (It’s always a namespace issue!)
      1. xsl:: distinguishes the XSLT elements
      2. Namespaced input: set @xpath-default-namespace on the xsl:stylesheet, e.g., xpath-default-namespace=""
      3. Namespaced output: set xmlns=""
    4. More housekeeping: <xsl:output> attributes: @method, @indent, @omit-xml-declaration, @doctype-system
      1. <xsl:output method="xml" indent="no"/>
      2. <xsl:output method="xml" indent="yes" doctype-system="about:legacy-compat"/>
    5. XSLT is a declarative programming language: it is not executed in line-by-line order, so templates can be written in any order.
    6. Templates match patterns: <xsl:template match="???">: The value of @match is an XPath pattern, which is not the same as an XPath path expression.
      1. XPath patterns don’t find elements; they match them. <xsl:template match="p"> will match (and process) a <p> element anywhere in the document.
      2. For that reason, it is a mistake to write <xsl:template match="//p">. (It will work, but don’t do it; it isn’t idiomatic.)
    7. In each example below, look at the @match value: What should the XPath pattern be matching in the source XML document? How is this XPath pattern different from the way we write an XPath expression in the XPath Toolbar?
      1. <xsl:template match="div/head">: Matches any <head> child of a <div> at any level of the XML hierarchy. In the XPath Toolbar, we have to start the expression with two leading forward slashes (//) to indicate we are looking down the tree from the document node.
      2. <xsl:template match="div[count(descendant::p) gt 1]">: Matches any <div> element that contains more than one <p> descendant. In the XPath Toolbar, we must add // to the beginning.
    8. Inside template rules, <xsl:apply-templates> specifies what to process, and where.
    9. <xsl:apply-templates> may or may not have a @select attribute.
      1. Without a @select attribute, all child nodes of the current context (typically element nodes and text nodes) are processed.
      2. With a @select attribute, only the nodes specified by the XPath expressions are processed. Those nodes do not have to be children of the current context.
      3. The XPath expression that is the value of the @select attribute may be absolute (e.g., <xsl:apply-templates select="//p"/>; <xsl:apply-templates select="/TEI/teiHeader/titleStmt/title"/>) or relative to the current context (e.g., <xsl:apply-templates select="p"/>).
  3. Group walk-through activity in <oXygen/> (20 minutes; 10:00 a.m.–10:20 a.m.)
    1. Ozymandias transformation
      1. XML:
      2. XSLT:
      3. Output HTML: XSLT developed in class:
  4. Break (10 minutes; 10:20 a.m.–10:30 a.m.)
  5. From identity transformation to revision (50 minutes; 10:30 a.m.–11:20 a.m.)
    1. Change the structure and add line numbers to the Ozymandias XML file
      1. Open our simple identity transformation starter:
      2. Change the <line> elements to self-closing <lb> elements.
      3. Work with attribute value templates to add numbers to the new <lb> elements.
    2. Combining a collection of files into a single XML file
      1. See
    3. Splitting a single XML file into multiple output files
      1. Input XML:, a TEI document with six <div> children of <body>
      2. XSLT to output each <div> as a separate TEI document:
    4. Exercise: Repair our Pacific Voyage file:
      1. Open this file URL in <oXygen/>:
      2. Develop this XSLT file following this exercise:
  6. Comparing XSLT and XQuery (15 minutes; 11:20 a.m.–11:35 a.m.)
    1. Invoking namespaces
    2. Sequential processing
    3. Pull vs. push processing
  7. Preparing XSLT to output HTML from TEI XML (25 minutes; 11:35 a.m.–12:00 p.m.)
    1. Open the URL of the Emily Dickinson Fascicle 16 file:, and study the document.
    2. The output we want:
    3. Open this starter XSLT in <oXygen/>: and save it on your computer to work with it.
    4. <xsl:stylesheet> and <xsl:output>
    5. Template matching on the document node to output HTML
    6. Structure of an HTML document: <head> and <body>

XSLT Activity (1:30 p.m.–4:00 p.m.)

  1. Extracting information from multiple documents with XSLT (15 minutes; 1:30 p.m.–1:45 p.m.)
    1. Task: Output plain text lists of subjects and historical/biographical notes, as well as two-column TSV of subjects (with sources)
    2. Input: corpus of 155 EAD documents. Sample:
    3. XSLT:
    4. Subject list, subject TSV, and historical/biographical notes
  2. TEI XML to HTML transformation (55 minutes; 1:45 p.m.–2:40 p.m.)
    1. Continue to work with the Emily Dickinson Fascicle 16 document to convert it to HTML, working with the XSLT file we started before lunch.
    2. Push processing: <xsl:apply-templates>
    3. Pruning the tree: when to use the @select attribute
    4. When to use <xsl:value-of>
  3. Break (10 minutes; 2:40 p.m.–2:50 p.m.)
  4. XSLT activity: Making a linked table of contents (70 minutes; 2:50 p.m.–4:00 p.m.)
    1. Continue working with the XSLT we are writing on the Emily Dickinson file.
    2. Modal XSLT: Processing the same nodes in multiple ways
      1. The output we want:
      2. Build a table of contents using the @mode attribute to permit XML nodes to be processed in multiple ways.
    3. How internal links work

Thursday, June 13: XPath and Schematron

Using Schematron to constrain your markup (9:00 a.m.–12:00 p.m.)

  1. XSLT odds and ends (30 minutes; 9:00 a.m.–9:30 a.m.)
  2. Schematron overview (10 minutes; 9:30 a.m.–9:40 a.m.)
    1. Schematron is constraint based; Relax NG, XML Schema, DTD are grammar based
    2. Sample constraint-based tasks involve multiple elements
      1. Are start pages (<start>) no larger than end pages (<end>)?
      2. Are birth dates no later than death dates?
      3. Does a list (e.g., of students in a course) contain duplicates?
      4. Do pointers to persons really point to persons (and not places)?
    3. Schematron structure: <pattern><rule><assert> or <report>
  3. Looking at Schematron (20 minutes; 9:40 a.m.–10:00 a.m.)
    1. Document analysis of a mock bibliography:
      1. <start> shouldn’t be greater than <end>
      2. <issue> is optional, but we could omit it by mistake
      3. <initial> should usually be one letter
      4. Apostrophes and quotation marks should usually be curly (“, ”, ‘, ’), not straight (', ")
    2. What Relax NG can constrain:
      1. <volume>, <issue>, <year>, <start>, and <end> must be positive integers
      2. <year> must be exactly four digits
      3. <issue> is optional
      4. No empty elements
    3. Schematron to the rescue:
      1. Anatomy of a Schematron rule
      2. Validating start and end pages
      3. Validating apostrophes and quotation marks (text, not markup)
    4. Associating Schematron with XML
  4. Schematron error reporting (15 minutes; 10:00 a.m.–10:15 a.m.)
    1. Schematron has the best error messages
    2. Enhance Schematron reporting with <sch:value-of>:
    3. Enhance Schematron maintenance with <sch:let>:
    4. Generate warnings as well as errors with @role:
  5. XPath functions practice: Leipzig glossing rules, part 1 (20 minutes; 10:15 a.m.–10:35 a.m.)
    1. Document analysis:
    2. Target output:
    3. Validation challenge: the spaces and hyphens need to be aligned
    4. Best practice
      1. Test the XPath separately first
      2. Develop and test incrementally
    5. Schematron validation
      1. Housekeeping: create the Schematron skeleton in <oXygen/>, save it, link it to XML
      2. Two ways of counting spaces and hyphens
        1. translate(): string-length('one two three') - string-length(translate('one two three', ' ', ''))
        2. tokenize(): count(tokenize('one two three', ' ')) or tokenize('Curly Larry Moe', '\s+') => count() (these count the tokens, which is one more than the number of separators)
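The two counting approaches above can be compared side by side. This is a minimal sketch using the sample string from the outline; the variable name $s is chosen only for illustration.

```xquery
xquery version "3.1";

let $s := 'one two three'
return (
  (: translate() deletes every space; the difference in length is the number of spaces :)
  string-length($s) - string-length(translate($s, ' ', '')),
  (: tokenize() yields three tokens, so subtract 1 to get the number of separators :)
  count(tokenize($s, ' ')) - 1
)
(: returns 2, 2 :)
```

To count hyphens instead, replace ' ' with '-' in both expressions.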
  6. Break (10 minutes; 10:35 a.m.–10:45 a.m.)
  7. XPath functions practice: Leipzig glossing rules, part 2 (50 minutes; 10:45 a.m.–11:35 a.m.)
    1. Comparing three things
      1. Three-way test not available in XPath
        1. $a eq $b eq $c
        2. $a lt $b lt $c
      2. What is available
        1. Composite expression: $a eq $b and $b eq $c
        2. Compare to average value: ($a, $b, $c) != avg(($a, $b, $c)) (true when the values are not all equal)
        3. Count distinct values
          1. count(distinct-values(($a, $b, $c))) eq 1
          2. distinct-values(($a, $b, $c)) => count() eq 1
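The available alternatives can be tried with concrete values. In this minimal sketch the numbers bound to $a, $b, and $c are arbitrary sample data. A chained comparison such as $a eq $b eq $c is a syntax error because XPath allows only one comparison operator per comparison expression.

```xquery
xquery version "3.1";

let $a := 2, $b := 2, $c := 3
return (
  (: composite expression: true only when all three are equal :)
  $a eq $b and $b eq $c,
  (: general comparison against the average: true when at least one value differs :)
  ($a, $b, $c) != avg(($a, $b, $c)),
  (: count of distinct values: exactly 1 only when all three are equal :)
  count(distinct-values(($a, $b, $c))) eq 1
)
(: returns false, true, false :)
```

Note that the average test reports inequality, so it reads naturally in a Schematron <report>, while the other two read naturally in an <assert>.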
    2. Whitespace normalization
      1. Require it in the XML with Relax NG: xsd:string { pattern = "(\S+ )*\S+" }
      2. Require it in the XML with Schematron: test='. eq normalize-space(.)'
      3. Manage it with Schematron inside the tier-comparison test: use normalize-space(.) instead of just .
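The effect of these whitespace checks can be seen on a sample string. This is a sketch under stated assumptions: $s is invented test data, and the matches() call only approximates the Relax NG pattern, with explicit ^ and $ anchors added because XSD patterns are implicitly anchored while XPath regular expressions are not.

```xquery
xquery version "3.1";

(: Hypothetical input with leading, trailing, and doubled spaces :)
let $s := '  one   two '
return (
  (: normalize-space() trims the ends and collapses internal runs of whitespace :)
  normalize-space($s),
  (: the Schematron-style test: false, because $s is not already normalized :)
  $s eq normalize-space($s),
  (: XPath approximation of the Relax NG pattern, applied to the normalized value :)
  matches(normalize-space($s), '^(\S+ )*\S+$')
)
(: returns "one two", false, true :)
```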
    3. Solutions
      1. Simple
      2. Enhanced
      3. Finding which word has a hyphen misalignment
  8. The Three Stooges go to Schematron Summer Camp (25 minutes; 11:35 a.m.–12:00 p.m.)
    1. The Edge Case Saloon
      1. “QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.”
      2. More edge cases at
    2. Best Stooge Ever contest results:
    3. Hands-on validation tasks
      1. All stooges must have percentages (no empty <stooge> elements)
      2. Percentages total 100
      3. Individual votes range from 0 through 100, inclusive
      4. There are exactly three stooges!
      5. No duplicate stooges!
    4. Solution (no peeking!)

Schematron and external files (1:30 p.m.–4:00 p.m.)

  1. One more way of counting spaces and hyphens (15 minutes; 1:30 p.m.–1:45 p.m.)
    1. Explode the string
      1. string-to-codepoints(), codepoints-to-string()
      2. for $c in string-to-codepoints('one two three') return codepoints-to-string($c)
      3. string-to-codepoints('one two three') ! codepoints-to-string(.)
    2. Find the index values of the spaces
      1. index-of()
      2. index-of(('a', 'b', 'c', 'b', 'a'), 'a')
    3. Count them
      1. count(index-of(for $c in string-to-codepoints('one two three') return codepoints-to-string($c), ' '))
    4. Make it legible
      1. string-to-codepoints('one two three') ! codepoints-to-string(.) => index-of(' ') => count()
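The steps above (explode, find the index values, count) can be followed one at a time by binding the intermediate result to a variable. A minimal sketch, where $s and $chars are names chosen only for illustration:

```xquery
xquery version "3.1";

let $s := 'one two three'
(: explode the string into a sequence of one-character strings :)
let $chars := string-to-codepoints($s) ! codepoints-to-string(.)
return (
  (: positions of the spaces within the exploded sequence :)
  index-of($chars, ' '),
  (: the number of spaces :)
  count(index-of($chars, ' '))
)
(: returns 4, 8, 2 :)
```

The one-line version in the outline works without parentheses because the simple map operator (!) binds more tightly than the arrow operator (=>).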
  2. ID/IDREF validation (25 minutes; 1:45 p.m.–2:10 p.m.)
    1. Files
      1. Instance:
      2. Relax NG:
      3. Transformed:
    2. Details
      1. Datatypes xsd:ID, xsd:IDREF, xsd:IDREFS
      2. Value must be unique within the document
      3. Lexical space: NCName (begin with letter or underscore, may contain letters, digits, underscores, hyphens, periods) (simplified)
      4. @xml:id is not of type xsd:ID unless your schema says it is
      5. You don’t have to call it @xml:id, but you should
      6. Validates by exact string matching
    3. Limitations
      1. Validates only within the same file (but XInclude can help)
      2. No subcategory support (e.g., you can’t require person IDREF to match only person ID)
      3. Cannot require mixed content to be non-empty
    4. Desiderata
      1. Validation against external (remote) files
      2. Subcategory support
      3. Require (selected) mixed content to be non-empty
  3. General comparison and value comparison (10 minutes; 2:10 p.m.–2:20 p.m.)
    1. Value comparison
      1. Operators: eq, ne, lt, gt, le, ge
      2. Compares one thing to one thing
      3. Example: count(distinct-values(('Curly', 'Larry', 'Moe'))) eq 1
    2. General comparison
      1. Operators: =, !=, <, >, <=, >= (angle brackets may have to be spelled &lt;, &gt;)
      2. Compares sequences of any length
      3. Example:
        1. 'Curly' = ('Curly', 'Larry', 'Moe')
        2. What does 'Curly' != ('Curly', 'Larry', 'Moe') return? What should we have written instead? not('Curly' = ('Curly', 'Larry', 'Moe'))
      4. substring(@ref, 2) = $ancillary//person/@xml:id
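The difference between the two kinds of comparison shows up as soon as one operand has more than one item. A minimal sketch using the Stooges sequence from the examples above ('Shemp' is added here only as a value that is not in the sequence):

```xquery
xquery version "3.1";

let $stooges := ('Curly', 'Larry', 'Moe')
return (
  (: general comparison: true if at least one item equals 'Curly' :)
  'Curly' = $stooges,
  (: true if at least one item differs from 'Curly', which is rarely what we mean :)
  'Curly' != $stooges,
  (: negating the equality test asks whether 'Curly' is absent from the sequence :)
  not('Curly' = $stooges),
  not('Shemp' = $stooges)
)
(: returns true, true, false, true :)
```

A value comparison such as 'Curly' eq $stooges is not a substitute here: it raises a type error because its right-hand operand contains more than one item.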
  4. Schematron validation (30 minutes; 2:20 p.m.–2:50 p.m.)
    1. Instance:
    2. Relax NG:
    3. Schematron:
    4. External reference file:
  5. Break (10 minutes; 2:50 p.m.–3:00 p.m.)
  6. Exploring Digital Mitford (30 minutes; 3:00 p.m.–3:30 p.m.)
    1. Project site:
    2. Site index
      1. Workshop repo on GitHub:
      2. Mitford project site:
      3. Outline:
  7. Hamilton 1823-04-09 letter (30 minutes; 3:30 p.m.–4:00 p.m.)
    1. Letter
      1. XML:
      2. Read online:
    2. Schematron starter:
    3. Tasks
      1. Save local copy of Schematron
      2. Associate letter with local copy
      3. Test validation of <editor> element
      4. Add and test rules for other element types

Friday, June 14: Taking stock

Putting it all to work (9:00 a.m.–12:00 p.m.)

  1. XPath in up-conversion: Syriaca taxonomy (60 minutes; 9:00 a.m.–10:00 a.m.)
    1. Project context:
    2. Planning ahead
      1. Document analysis
        1. Google spreadsheet:
          1. Title for display (A)
          2. Title for filename (E)
          3. Terms (multiple languages) (O, P, Q, S, U, V, X, Z)
          4. Glosses (multiple languages) (I, J, K, L, M, N, R, T, W, Y, AA)
          5. Relations (AC, AD, AE, AF, AG, AH, AI, AJ, AK, AL, AM, AN, AO, AP, AQ, AR, AS, AT, AU, AV, AW, AX, AY, AZ, BA, BB, BC, BD, BE, BF, BG, BH, BI, BJ, BK, BL, BM, BN, BO, BP, BQ)
          6. Identifiers (idno) (F, G, H)
          7. Note (AB)
        2. TSV export:
      2. Output specification
    3. Pull vs. push processing
    4. Autotagging plain text with XSLT (upconversion)
      1. XSLT:
      2. Features
        1. Documentation (comments; 7)
        2. Variables (<xsl:variable>; 26)
        3. Typing with @as (variables, parameters, functions; 26)
        4. User-defined functions (<xsl:function>; 11)
        5. Apply XSLT to non-XML input (unparsed-text-lines(); 26)
        6. Omit input document specification (<xsl:template name="xsl:initial-template">; 61)
        7. tokenize() (44)
        8. index-of() (48) and the user-defined skos:index-of-starts-with() (11)
        9. <xsl:message> (65)
        10. <xsl:processing-instruction> (75)
        11. Test whether a value exists with <xsl:if> before creating output
          1. <xsl:if test="string-length(normalize-space($values[current()])) ne 0"> (228)
          2. see esp. <listRelation> (251)
        12. <xsl:attribute> (275)
    5. Using the <oXygen/> Outline view with XML and XSLT
    6. Running Saxon from the command line: saxon -it Taxonomy.xsl
  2. Resources and references (25 minutes; 10:00 a.m.–10:25 a.m.)
  3. Break (10 minutes; 10:25 a.m.–10:35 a.m.)
  4. Building our syllabus (60 minutes; 10:35 a.m.–11:35 a.m.)
    1. Author in XML, validating with Relax NG and Schematron
      1. XML:
      2. Relax NG:
      3. Schematron:
    2. Transform to fragment (<section>) with XSLT for inclusion in GitHub pages
      1. XSLT:
      2. Features
        1. @omit-xml-declaration (6)
        2. User-defined function (7)
        3. @id on <button> elements (14)
        4. Construct <h2> (26)
        5. Calculate time range (38)
        6. Datatype wrangling (48)
        7. @class on <button> elements (62)
        8. <xsl:call-template> (75)
        9. Matching multiple element types in a single template (88)
        10. Dynamic element construction with <xsl:element> (125)
    3. XInclude fragment in wrapper and transform to full local schedule with XSLT
      1. XML wrapper:
      2. XSLT:
    4. Use XProc to manage pipeline
      1. XProc:
  5. Retrospective (25 minutes; 11:35 a.m.–12:00 p.m.)