Schedule

All class sessions are in room B-3255 in the main University building at 3200 rue Jean-Brillant.

Monday, May 26: XPath

Introduction to XPath in eXist-db and <oXygen/> (11:00 a.m.–12:00 p.m.)

Getting started with XPath and eXide (20 minutes; 11:00 a.m.–11:20 a.m.)
1. Open http://newtfire.org:8338/exist/apps/eXide/index.html (or eXide within eXist-db on your own laptop, if you’ve installed it), click “New XQuery”, and erase all content in the editing window. You’ll type your XPath in the editing window and run it with the “Eval” button.
2. Learning XPath (and other languages, including XSLT, XQuery, Schematron) means learning the …
  1. Vocabulary (e.g., the division operator in XPath is div, not /: 6 div 2 evaluates to 3, but 6 / 2 raises an error)
  2. Syntax (e.g., in XPath conditional expressions, the if test must be parenthesized and an else is required: if (condition) then 1 else ())
  3. Function library (e.g., string-length() and count() are functions, but there is no length() or len() or size())
3. All XPath expressions return a sequence. Sequences may contain nodes (elements, attributes, etc.), atomic values (strings, numbers, etc.), or both. A sequence of one item is nonetheless a sequence, as is an empty sequence. Nested sequences are automatically flattened.
  1. Type a number and hit Eval. This is a one-item sequence that consists of a single atomic value. Try integers and decimal numbers. Try wrapping the number in parentheses.
  2. Type a string (inside single or double quotes) and hit Eval. This is a one-item sequence that consists of a single atomic value. Try omitting the quotation marks. Try using curly quotation marks. Try wrapping the string in parentheses.
  3. Type empty parentheses and hit Eval. This is an empty sequence.
  4. Type multiple items of different types (numbers, strings), separated by commas. Try wrapping them in parentheses. Try wrapping them in multiple parentheses. Try removing the commas. This is a multi-item sequence.
  5. Try to type a nested sequence, e.g., (1, 2, (3, 4)), and hit Eval. What result do you expect? What do you get?
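The flattening rule described in step 3 can be checked with count(). This is a minimal XQuery 3.1 sketch for the eXide editing window; the variable names $nested and $flat are illustrative only and are not part of the course materials.

```xquery
xquery version "3.1";

(: Illustrative variable names; not part of the course materials. :)
let $nested := (1, 2, (3, 4))   (: written as a nested sequence :)
let $flat   := (1, 2, 3, 4)     (: written flat :)
return (
    count($nested),              (: 4: the inner parentheses disappear :)
    deep-equal($nested, $flat),  (: true: both describe the same sequence :)
    count(()),                   (: 0: the empty sequence has no items :)
    count((42))                  (: 1: a single item is a one-item sequence :)
)
```

The whole result is itself one flat sequence of four items, which is the same rule at work.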
Simple XPath expressions (40 minutes; 11:20 a.m.–12:00 p.m.)
1. Review: strings and numbers (atomic values) are XPath expressions
  1. "Hi, Mom!" (Strings are enclosed in single or double quotation marks—straight, not curly)
  2. 1 (Numbers are not enclosed in quotation marks)
  3. 1.0 (What should this return? lexical space and value space)
2. Arithmetic expressions are XPath expressions
  1. 1 + 1
  2. Practice: +, -, *, div, idiv, mod (/ is not division)
3. XPath library functions (with no arguments) are XPath expressions
  1. current-date()
  2. current-time()
  3. current-dateTime()
4. XPath library functions (with arguments) are XPath expressions
  1. upper-case('dhsi') (How many arguments, and of what type?)
  2. concat('Curly', 'Larry', 'Moe') (How many arguments, and of what type?)
  3. count(('Curly', 'Larry', 'Moe')) (Why two sets of parentheses? Hint: How many arguments, and of what type?)
  4. Function signature and cardinality: count($items as item()*) as xs:integer
5. Nested XPath library functions and operations are XPath expressions. Read them from the inside out
  1. max((1 + 2, 10 div 5, 6 * 0.2)) (Remember those two sets of parentheses?)
  2. translate(upper-case('Hi, Mom!'),'AEIOU','xxxxx') (How is this different from upper-case(translate('Hi, Mom!','AEIOU','xxxxx'))?)
  3. format-dateTime(current-dateTime(),'[h].[m01] [Pn], [FNn], [D1o] [MNn]')
  4. format-dateTime(current-dateTime(),'[h].[m01] [Pn], [FNn], [D1o] [MNn]', 'fr', (), ())
6. Nested functions are hard to read. Use the arrow operator (=>) instead
  1. upper-case('Hi, Mom!') => translate('AEIOU','xxxxx')
  2. current-dateTime() => format-dateTime('[h].[m01][Pn], [FNn], [D1o] [MNn]')
  3. current-dateTime() => format-dateTime('[h].[m01][Pn], [FNn], [D1o] [MNn]', 'fr', (), ())
7. XPath expressions may span multiple lines (try it with the examples above); that is, a newline and a space have the same meaning
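To compare the nested and arrow styles side by side, here is a small XQuery 3.1 sketch that also spans several lines. The variable names are illustrative assumptions; the functions and strings come from the examples above.

```xquery
xquery version "3.1";

(: Same two functions, written nested, chained, and chained in the other order. :)
let $nested  := translate(upper-case('Hi, Mom!'), 'AEIOU', 'xxxxx')
let $arrow   := 'Hi, Mom!' => upper-case() => translate('AEIOU', 'xxxxx')
let $swapped := 'Hi, Mom!' => translate('AEIOU', 'xxxxx') => upper-case()
return (
    $nested,             (: "Hx, MxM!" :)
    $arrow eq $nested,   (: true: the arrow form reads left to right :)
    $swapped             (: "HI, MOM!": no uppercase vowels existed yet when translate() ran :)
)
```

The third result shows why order matters: translate() is case-sensitive, so running it before upper-case() changes nothing in this string.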

Exploring document structures and data with XPath (1:30 p.m.–4:00 p.m.)

XPath in <oXygen/> (15 minutes; 1:30 p.m.–1:45 p.m.)
1. Launch <oXygen/> Editor, hit Ctrl+u (Windows) or Cmd+u (MacOS), copy and paste the string http://newtfire.org:8338/exist/apps/shakespeare/data/ham.xml, and hit OK. (Backup copy at https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/ham.xml.) This is a copy of Hamlet with TEI markup.
2. Set the dropdown in the upper left to XPath 3.1. (This widget is called the XPath Toolbar.) Enter some XPath expressions (from above, such as 1 + 1). Expressions are limited to one line; hit Enter to run the expression. The XPath Toolbar works only if you have an XML document open in <oXygen/>, even if you aren’t using the document in your XPath expression.
3. Go to Window → Show View → XPath/XQuery Builder. Set the dropdown in the upper left to XPath 3.1. Enter some XPath expressions. May span multiple lines; Enter for a new line. To run, hit Ctrl+Enter (Windows) or Cmd+Enter (MacOS), or click the red right-pointed triangle.
XPath path expressions (15 minutes; 1:45 p.m.–2:00 p.m.)
1. An XPath path expression is a sequence of steps, each of which proceeds from one node (called the context node) to a sequence of zero (!) or more others. It returns the results in document order (order of start tags). (Details at Kay 1227)
2. Sample XPath path expression: /TEI/text/body/div: start at the document node, then navigate to a sequence of all its <TEI> children. For each of those, navigate to all of their <text> children, then to their <body> children, and then to their <div> children.
3. XPath steps are separated by single slashes (/).
4. An XPath expression that begins with a slash (/) starts at the document node; this is an absolute path. Any other XPath expression starts at the current context; this is a relative path.
5. It is not an error to ask for something that doesn’t exist; it just returns an empty sequence.
6. With Hamlet open and selected, go to the XPath Toolbar or XPath Builder and try the following examples. Click on some of the results in the lower panel:
  1. /TEI/teiHeader/fileDesc/titleStmt/title (returns 1 <title> element)
  2. /TEI/text/body/div (returns 5 <div> elements)
  3. /TEI/teiHeader/fileDesc/titleStmt/info (returns no results; this is not an error)
  4. /TEI/teiHeader/fileDesc/title Stmt/title (raises an error; spaces are not allowed in path expressions)
XPath path steps (25 minutes; 2:00 p.m.–2:25 p.m.)
1. Path steps move along axes: child::, parent::, descendant::, ancestor::, preceding-sibling::, following-sibling::, etc. See: http://dh.obdurodon.org/introduction-xpath.xhtml#xpath_axes.
2. Axes are specified with a double colon, e.g., descendant::div matches all <div> descendants of the current context node. There are two common shortcut notations
  1. The default is the child axis, so /TEI/teiHeader is synonymous with /child::TEI/child::teiHeader. Use the shorthand.
  2. // is shorthand for /descendant-or-self::node()/, so /TEI//div finds all of the <div> elements that are descendants of the <TEI> root element, that is, anywhere in the document. The document node has a descendant axis, too: //div. Be careful with this one!
3. Each path step returns a sequence of zero or more context nodes for the next path step. Only the final path step is permitted to return something other than a node. Why?
4. The end of a path expression may return nodes or atomic values
  1. //body/div/count(descendant::sp) navigates from the document node to all of the acts in the play and then returns a count of the speeches in each act
  2. What’s wrong with //body/div/count(//sp)? The leading double slash resets the current context to the document node, and selects all <sp> elements in the entire document, instead of just the individual act.
5. * matches any element
  1. /TEI/teiHeader/* matches all child elements of the <teiHeader>
6. .. matches the parent node of the current context node. That is, it’s shorthand for parent::*
  1. //stage/.. matches the parent nodes of all <stage> elements
7. Your turn:
  1. Find the acts (<div> children of <body>) in Hamlet //body/div
  2. Find the stage directions (<stage>) in Hamlet //stage
  3. Find the <stage> children of <div> elements (but not other <stage> elements) in Hamlet //div/stage
  4. Find the parents of the stage directions in Hamlet //stage/.. or //stage/parent::*
  5. Find the <div> parents of the stage directions in Hamlet, but not other parents //stage/parent::div
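The difference between count(descendant::sp) and count(//sp) in the final path step can be seen without opening Hamlet. The sketch below is XQuery 3.1 for eXide and uses a hypothetical two-act miniature play built inline; it is not the real Hamlet file, and the variable name $play is an assumption.

```xquery
xquery version "3.1";

(: Hypothetical miniature play with two acts; NOT the real Hamlet file. :)
let $play := document {
    <TEI>
        <text>
            <body>
                <div type="act">
                    <sp><speaker>Bernardo</speaker><l>Who's there?</l></sp>
                    <sp><speaker>Francisco</speaker><l>Nay, answer me.</l></sp>
                </div>
                <div type="act">
                    <sp><speaker>Hamlet</speaker><l>To be, or not to be</l></sp>
                </div>
            </body>
        </text>
    </TEI>
}
return (
    $play//body/div/count(descendant::sp),   (: 2, 1: one count per act :)
    $play//body/div/count(//sp),             (: 3, 3: // restarts at the document node :)
    count($play/TEI/text/body/stage)         (: 0: a path that finds nothing is not an error :)
)
```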
XPath functions for strings (20 minutes; 2:25 p.m.–2:45 p.m.)
1. concat()
  1. concat('Curly','Larry','Moe')
  2. concat('Curly is #', 1)
  3. Or use the concatenation operator: 'Curly is #' || 1
  4. What’s wrong with concat(//speaker)? The arguments to concat() must be two or more individual atomic (or atomizable) items, and //speaker is a sequence
2. string-join()
  1. string-join(( 'Curly', 'Larry', 'Moe'), ',')
  2. string-join(//speaker, ', ')
  3. string-join(//speaker) Why does this work when concat(//speaker) didn’t? The first argument to string-join() is a sequence. All arguments to concat() must be atomic or atomizable.
3. string-length()
  1. string-length('Curly, Larry, and Moe')
4. lower-case(), upper-case()
  1. lower-case('Curly, Larry, and Moe')
5. normalize-space()
  1. normalize-space(' Curly, Larry, Moe ')
6. substring-before(), substring-after()
  1. substring-before('Larry', 'r') What if there’s more than one?
  2. substring-after('Larry', 'r') What if there’s more than one?
7. substring()
  1. substring('Curly', 1, 2) XPath starts counting with 1 (not 0).
8. contains() Foreshadowing: This returns a Boolean (True or False) value. How might this be useful?
  1. contains('Ophelia', 'ph')
  2. //speaker/contains(., 'ph') (the dot refers to the current context item)
  3. See also contains-token(), which would match "Rosencrantz" but not "Rosencrantzenfeld".
9. starts-with(), ends-with()
  1. starts-with('Ophelia', 'Op')
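Several of the string functions above can be tried in one query. This is a minimal XQuery 3.1 sketch that uses only literal strings, so it needs no document; the variable name $stooges is illustrative.

```xquery
xquery version "3.1";

let $stooges := ('Curly', 'Larry', 'Moe')
return (
    string-join($stooges, ', '),                  (: "Curly, Larry, Moe": takes a sequence :)
    concat($stooges[1], ' and ', $stooges[2]),    (: "Curly and Larry": takes single items :)
    substring-before('Larry', 'r'),               (: "La": splits at the first 'r' :)
    substring-after('Larry', 'r'),                (: "ry": also splits at the first 'r' :)
    substring('Curly', 1, 2),                     (: "Cu": counting starts at 1 :)
    normalize-space('  Curly,   Larry,  Moe  '),  (: "Curly, Larry, Moe" :)
    $stooges ! contains(., 'rl')                  (: true, false, false :)
)
```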
Break (15 minutes; 2:45 p.m.–3:00 p.m.)
XPath functions for numbers and for sequences of numbers (15 minutes; 3:00 p.m.–3:15 p.m.)
1. ceiling(), floor()
  1. ceiling(3.141592653)
2. round()
  1. round(3.141592653, 4)
3. format-integer(), format-number()
  1. format-integer(154,'w')
  2. format-integer(154,'w','fr') (You could also try 'ar', 'de', 'bg', or see https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)
  3. format-integer(154,'I')
  4. format-number(1, '#.000')
4. max(), min(), sum(), avg()
  1. max((1, 2, 3)), etc.
  2. What happens when these are applied to strings? To a sequence that mixes strings and numbers?
5. Find the length in character count of each <speaker> //speaker/string-length() (why doesn’t string-length(//speaker) work?)
6. Find the length of the longest speaker name max(//speaker/string-length())
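The numeric functions above can be collected into a single query for comparison. This is an XQuery 3.1 sketch using only literal values; $pi is an illustrative name. Word output from format-integer(…, 'w') is left out because its exact wording can vary by processor.

```xquery
xquery version "3.1";

let $pi := 3.141592653
return (
    ceiling($pi),                (: 4 :)
    floor($pi),                  (: 3 :)
    round($pi, 4),               (: 3.1416 :)
    format-number(1, '#.000'),   (: "1.000" :)
    format-integer(154, 'I'),    (: "CLIV" :)
    max((1, 2, 3)),              (: 3 :)
    avg((1, 2, 3)),              (: 2 :)
    max(('Curly', 'Larry', 'Moe'))   (: strings are compared as strings, not as numbers :)
)
```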
XPath functions for sequences (15 minutes; 3:15 p.m.–3:30 p.m.)
1. distinct-values()
  1. distinct-values(/TEI//speaker)
2. count()
  1. count(('Curly', 'Larry', 'Moe', 'Curly'))
  2. count(distinct-values(('Curly', 'Larry', 'Moe', 'Curly')))
  3. distinct-values(count(('Curly', 'Larry', 'Moe', 'Curly')))
3. sort()
  1. sort(//speaker)
  2. sort(//speaker,(), function($item){string-length($item)})
4. Your turn:
  1. How many <speaker> elements are there in Hamlet? count(//speaker)
  2. How many distinct <speaker> elements are there in Hamlet? count(distinct-values(//speaker))
  3. How many acts are there in Hamlet? count(//body/div)
  4. How many scenes are there in Hamlet? count(//div/div)
  5. What does count(//div) tell you about Hamlet, and why is it unhelpful? It counts <div> elements of different types together: acts, scenes, cast list.
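The order in which count() and distinct-values() are nested changes the result, which the following XQuery 3.1 sketch shows with the literal sequence used above. The variable name $stooges is illustrative.

```xquery
xquery version "3.1";

let $stooges := ('Curly', 'Larry', 'Moe', 'Curly')
return (
    count($stooges),                    (: 4: every item, duplicates included :)
    count(distinct-values($stooges)),   (: 3: duplicates removed first, then counted :)
    distinct-values(count($stooges)),   (: 4: count() returns one number, so nothing to deduplicate :)
    sort($stooges),                     (: sorted by string value :)
    sort($stooges, (), function($item) { string-length($item) })   (: shortest name first :)
)
```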
Looking Stuff Up: XPath function signatures and cardinality (10 minutes; 3:30 p.m.–3:40 p.m.)
1. The function signature consists of 1) the name of the function, 2) the number and type of arguments it accepts or requires, and 3) the number and type of items it returns
2. Type error: string-length(1.2345)
3. Cardinality error: string-length(/TEI//speaker)
4. Why is count(/TEI//speaker) okay, while count('Curly', 'Larry', 'Moe') is broken? The count() function is receiving three arguments, but it is designed to receive only one argument. To give it one argument, it needs a set of inner parentheses: count(('Curly', 'Larry', 'Moe'))
5. The error message is your friend. Read it.
6. Resources and references: https://ebeshero.github.io/UpTransformation/References.html
XPath predicates (20 minutes; 3:40 p.m.–4:00 p.m.)
1. Predicates, in square brackets after a path step, filter the results
2. Numerical predicates
  1. //body/div[3] matches the third <div> child of each <body> element (same as //body/div[position() eq 3])
  2. //body/div[last()] matches the last <div> child of each <body> element
3. Predicates with node tests
  1. //stage[parent::div] is equivalent to //div/stage
4. Predicates with functions and operators
  1. //sp[speaker eq 'Ophelia']
  2. //sp[contains-token(speaker, 'Rosencrantz')]
  3. //lg[@type eq 'couplet']
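The three kinds of predicate can be tried on a small inline tree. This is an XQuery 3.1 sketch for eXide; the miniature play is hypothetical (its element names follow the Hamlet examples, but its content and the @n attributes are invented for illustration).

```xquery
xquery version "3.1";

(: Hypothetical miniature play; content and @n values are invented. :)
let $play := document {
    <body>
        <div n="1">
            <stage>Enter Ophelia</stage>
            <sp><speaker>Ophelia</speaker><l>My lord</l><stage>Aside</stage></sp>
        </div>
        <div n="2">
            <sp><speaker>Hamlet</speaker><l>Well</l></sp>
        </div>
        <div n="3">
            <sp><speaker>Ophelia</speaker><l>Good night</l></sp>
        </div>
    </body>
}
return (
    $play//body/div[3]/@n/string(),           (: "3": numerical predicate :)
    $play//body/div[last()]/@n/string(),      (: "3": last() is also the third here :)
    count($play//sp[speaker eq 'Ophelia']),   (: 2: predicate with an operator :)
    count($play//stage[parent::div]),         (: 1: predicate with a node test :)
    count($play//div/stage)                   (: 1: the equivalent path without a predicate :)
)
```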

Tuesday, May 27: XPath and XQuery

From XPath to XQuery (9:00 a.m.–12:00 p.m.)

Working with sequences (15 minutes; 9:00 a.m.–9:15 a.m.)
1. Three ways to apply a function to a sequence
  1. Explicit for
    1. for $speaker in /TEI//speaker return string-length($speaker)
  2. Implicit for
    1. /TEI//speaker/string-length()
  3. Simple map (!)
    1. /TEI//speaker ! string-length(.)
2. Difference between simple map (!) and arrow (=>)
  1. ('Curly', 'Larry', 'Moe') => count()
  2. ('Curly', 'Larry', 'Moe') ! count(.)
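The contrast between the two operators is easiest to see when the same sequence is sent through both. This is a minimal XQuery 3.1 sketch using literal strings; $stooges is an illustrative name.

```xquery
xquery version "3.1";

let $stooges := ('Curly', 'Larry', 'Moe')
return (
    $stooges => count(),                             (: 3: the whole sequence is one argument :)
    $stooges ! count(.),                             (: 1, 1, 1: count() runs once per item :)
    (for $s in $stooges return string-length($s)),   (: 5, 5, 3: explicit for :)
    $stooges ! string-length(.),                     (: 5, 5, 3: simple map, same result :)
    $stooges => string-join('-')                     (: "Curly-Larry-Moe" :)
)
```

As a rule of thumb, => suits functions that accept a whole sequence, and ! suits functions that accept one item at a time.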
Read and evaluate XML projects with XPath (30 minutes; 9:15 a.m.–9:45 a.m.)
1. Let’s open Hamlet again in <oXygen/>. Launch <oXygen/> Editor, hit Ctrl+u (Windows) or Cmd+u (MacOS), copy and paste the string http://newtfire.org:8338/exist/apps/shakespeare/data/ham.xml, and hit OK. (Backup copy at https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/ham.xml.) This is a copy of Hamlet with TEI markup.
2. How many speeches (<sp>) does Ophelia have? count(//sp[speaker eq 'Ophelia'])
3. How many speeches does Ophelia have in Act 2? count(//body/div[2]//sp[speaker eq 'Ophelia'])
4. What types of elements can have stage directions (<stage>) as children? (Hint: use the name() function.) distinct-values(//stage/../name())
5. How many speeches don’t contain any metrical line child elements (<l>)? (Hint: use the not() function.) count(//sp[not(l)])
6. Building on your answer to the last question, who are the speakers of those speeches? distinct-values(//sp[not(l)]/speaker)
7. Building on your answers to the last two questions, what kinds of elements do they contain instead? distinct-values(//sp[not(l)]/*/name())
8. What is Hamlet’s first spoken line (<l>)? (//sp[speaker eq 'Hamlet']/l)[1]
9. What is the last stage direction in the entire document? (//stage)[last()]
10. How many speeches have more than 8 line children? count(//sp[count(l) gt 8])
11. Building on your answer to the preceding question, how many line children does each of those speeches have? //sp[count(l) gt 8]/count(l)
12. Building on your answers to the preceding two questions, who are the speakers of speeches that have more than 8 line children? distinct-values(//sp[count(l) gt 8]/speaker)
13. How long is the longest speech? max(//sp/string-length()) (or, better: max(//sp/string-length(normalize-space())))
14. Building on your answer to the last question, who is the speaker of the longest speech? //sp[string-length() eq max(//sp/string-length())]/speaker (or, better: //sp[string-length(normalize-space()) eq max(//sp/string-length(normalize-space()))]/speaker) Try writing it this way: let $max := //sp ! normalize-space() ! string-length() => max() return //sp[normalize-space() ! string-length() = $max]/speaker
Housekeeping: documents, collections, and namespaces (10 minutes; 9:45 a.m.–9:55 a.m.)
1. Open our web server installation of eXist-db at http://exist.newtfire.org/exist/apps/eXide/index.html. In the eXide window, click on the New XQuery tab. This brings up a window with xquery version "3.1"; at the top.
2. Access a document with doc()
  1. doc('/db/apps/shakespeare/data/ham.xml')
3. Access a collection of documents with collection()
  1. collection('/db/apps/shakespeare/data/')
4. Namespace declaration
  1. declare namespace tei="http://www.tei-c.org/ns/1.0";
  2. <stage> elements in Hamlet: doc('/db/apps/shakespeare/data/ham.xml')//tei:stage
  3. Find all the stage directions in the entire Shakespeare collection collection('/db/apps/shakespeare/data/')//tei:stage
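Why the tei: prefix is needed can be shown on a tiny inline fragment. This XQuery 3.1 sketch assumes nothing about the server's collections; the one-line TEI fragment is hypothetical and stands in for a real play.

```xquery
xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";

(: Hypothetical TEI fragment standing in for a real play. :)
let $doc := document {
    <TEI xmlns="http://www.tei-c.org/ns/1.0">
        <text><body><stage>Enter Ghost</stage></body></text>
    </TEI>
}
return (
    count($doc//stage),      (: 0: there is no <stage> in no namespace :)
    count($doc//tei:stage)   (: 1: the prefix selects the TEI namespace :)
)
```

An empty result from a path that looks correct is often a sign that the namespace prefix is missing.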
The seven types of nodes (20 minutes; 9:55 a.m.–10:15 a.m.)
1. Document (document-node())
2. Element (element())
3. Attribute (attribute())
  1. collection('/db/apps/shakespeare/data/')//tei:sp/@who
  2. collection('/db/apps/shakespeare/data/')//tei:sp/@who/string()
4. Text (text(); not a function; not to be confused with string())
  1. doc('/db/mitford/literary/Charles1.xml')//tei:stage (Mary Russell Mitford’s Charles the First)
  2. What does doc('/db/mitford/literary/Charles1.xml')//tei:stage/string() return? The string values of the stage directions, that is, the stage directions with all markup stripped
  3. What does doc('/db/mitford/literary/Charles1.xml')//tei:stage/text() return? The text() nodes in each stage direction
5. Rarely used: comment (comment()), processing instruction (processing-instruction())
6. Deprecated: namespace (namespace-node())
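The difference between string() and text() shows up as soon as an element has mixed content. The following XQuery 3.1 sketch uses a hypothetical stage direction (not taken from Charles the First) with no namespace, to keep it short.

```xquery
xquery version "3.1";

(: Hypothetical stage direction with mixed content. :)
let $stage := <stage>Enter <name>Hamlet</name>, reading</stage>
return (
    $stage/string(),         (: "Enter Hamlet, reading": one string, markup stripped :)
    count($stage/text()),    (: 2: the text nodes before and after <name> :)
    count($stage/node()),    (: 3: text, element, text :)
    $stage/name/text()       (: the text node inside <name> :)
)
```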
Scavenger hunt 1 (20 minutes; 10:15 a.m.–10:35 a.m.)
1. Work with the Digital Mitford Site Index posted in eXist at /db/mitford/si.xml or the official version at its external location: https://digitalmitford.org/si.xml Can you find out the following?
  1. Look at the <div> elements in the site index. What attribute on this element can tell you how the document is organized? Write an XPath that isolates these attribute values. doc('https://digitalmitford.org/si.xml')//tei:div/@type/string()
  2. Look at the element children of the <div> elements (you can do this without knowing what all the elements are). What do you think is the purpose of the @sortKey attributes? What XPath expression would show you those values? doc('https://digitalmitford.org/si.xml')//tei:div/*/@sortKey ! string()
Break (15 minutes; 10:35 a.m.–10:50 a.m.)
Wildcard node testing (15 minutes; 10:50 a.m.–11:05 a.m.)
1. Work with the Digital Mitford Site Index posted in eXist at /db/mitford/si.xml or the official version at its external location: https://digitalmitford.org/si.xml Can you find out the following?
  1. The @xml:id for the play Charles the First in the site index is "CharlesI_MRMplay". References to the play throughout the site index will be made with various attributes that begin with a hashmark #, formatted like this: "#CharlesI_MRMplay". Knowing this, can you locate all the individual entries in any of the site index lists that contain references of any kind to the play? doc('https://digitalmitford.org/si.xml')//tei:div/*/*[descendant::*/@*="#CharlesI_MRMplay"] How can you find out how many these are using a function? doc('https://digitalmitford.org/si.xml')//tei:div/*/*[descendant::*/@*="#CharlesI_MRMplay"] => count()
Regex in XPath (30 minutes; 11:05 a.m.–11:35 a.m.)
1. contains() vs. matches()
  1. doc('/db/mitford/literary/Charles1.xml')//tei:l[contains(., 'murder')]
  2. doc('/db/mitford/literary/Charles1.xml')//tei:l[contains(., 'unrighteousness')]
  3. doc('/db/mitford/literary/Charles1.xml')//tei:l[matches(., '[a-z]{15,}','i')]
  4. doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '\d{4}')]
  5. doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '(^|\D)\d{4}($|\D)')] Why is the number of results smaller than for the previous expression?
    xquery version "3.1"; declare namespace tei="http://www.tei-c.org/ns/1.0"; let $a := doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '(^|\D)\d{4}($|\D)')] let $b := doc('/db/mitford/literary/Charles1.xml')//tei:*/text()[matches(., '\d{4}')] return $b except $a (: returns items in $b that are not in $a:)
2. translate() vs. replace()
  1. Try this expression, doc('/db/mitford/literary/Charles1.xml')//tei:castList, and notice the pseudo-markup in the cast list. translate() to the rescue! doc('/db/mitford/literary/Charles1.xml')//tei:castList//tei:roleDesc/translate(., '()', '')
  2. The next examples work with the @xml:id attributes on the <l> elements. How can you get a look at the @xml:id attributes first? doc('/db/mitford/literary/Charles1.xml')//tei:l/@xml:id/string()
  3. Change the format of the @xml:id attributes on the <l> elements with replace(): doc('/db/mitford/literary/Charles1.xml')//tei:l/replace(@xml:id, 'Chas(_\w+_)', 'C1$1')
3. substring-before() and substring-after() vs. tokenize()
  1. Return only the document location (e.g., "ded", "pro", "act") and line number information in the @xml:id attributes: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-after(@xml:id, 'Chas_')
  2. Working with the expression we just wrote, how would you apply substring-before() to return only the document location ("ded", "pro", "act"), and trim off the line number information? Two ways: old-fashioned: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-before(substring-after(@xml:id, 'Chas_'), '_') and more legible with simple map operator: doc('/db/mitford/literary/Charles1.xml')//tei:l/substring-after(@xml:id, 'Chas_') ! substring-before(.,'_') Why can’t we use the arrow operator (=>) here?
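The regex functions in this session can also be tried on literal strings, which makes the results easy to predict. This XQuery 3.1 sketch uses invented sample strings and a hypothetical ID value ('Chas_act1_12'); the real @xml:id values in Charles1.xml may be formatted differently.

```xquery
xquery version "3.1";

(: Invented sample strings and a hypothetical ID format. :)
let $texts := ('In 1649 the king died', 'Shelfmark 164912', 'no digits here')
let $id := 'Chas_act1_12'
return (
    count($texts[matches(., '\d{4}')]),               (: 2: any run of four digits :)
    count($texts[matches(., '(^|\D)\d{4}($|\D)')]),   (: 1: exactly four digits, not part of a longer number :)
    translate('(aside)', '()', ''),                   (: "aside": characters with no replacement are deleted :)
    replace($id, 'Chas(_\w+_)', 'C1$1'),              (: "C1_act1_12": $1 restores the captured group :)
    substring-before(substring-after($id, 'Chas_'), '_'),   (: "act1" :)
    tokenize($id, '_')[2]                             (: "act1": split on '_' and keep the second piece :)
)
```

tokenize() is often the most legible choice when a value has several delimited parts, because each part can be picked out by position.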
Introducing variables (25 minutes; 11:35 a.m.–12:00 p.m.)
1. Global variables and syntax, how to return their values in eXist-db
2. In eXist-db, keep the TEI namespace declaration line, and copy the following global variables:
  1. declare variable $Chas as document-node() := doc('/db/mitford/literary/Charles1.xml');
  2. declare variable $ChasPlay as element() := $Chas/*;
  3. Return the value of the variables one by one, by typing each of their names on the next line: $Chas and $ChasPlay. Notice the difference in the data type declaration and in the results. Other common values of as include xs:string or xs:integer.
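Global variable declarations and their as types can be practiced without the server's files. In this XQuery 3.1 sketch, the inline document and the names $play, $root, and $title are hypothetical stand-ins for $Chas and $ChasPlay.

```xquery
xquery version "3.1";

(: Hypothetical inline document standing in for doc('/db/mitford/literary/Charles1.xml'). :)
declare variable $play as document-node() :=
    document { <play><title>Charles the First</title></play> };
declare variable $root as element() := $play/*;
declare variable $title as xs:string := $root/title/string();

(
    $play instance of document-node(),   (: true :)
    name($root),                         (: "play" :)
    $title                               (: "Charles the First" :)
)
```

If a value does not fit its declared type (for example, declaring $root as xs:string), the processor raises a type error, which is the point of declaring types.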

XQuery flow control (1:30 p.m.–4:00 p.m.)

Introducing FLWOR (10 minutes; 1:30 p.m.–1:40 p.m.)
1. FLWOR keywords: for, let, where, order by, return
2. The simplest FLWOR: let (or for) followed by return
Introducing FLWOR, continued. (20 minutes; 1:40 p.m.–2:00 p.m.)
1. Retrieve a sequence of whole elements:
  1. let $places := $Chas//tei:placeName
  2. return $places
  3. How would you return only their text contents? $places/string(), but notice the white space issues. Repair these with return $places/normalize-space()
Scavenger hunt 2: in XQuery this time. (30 minutes; 2:00 p.m.–2:30 p.m.)
1. Work in eXist-db in the same file we started before the break and delete only the return line. Let’s keep adding to it. Use variables and FLWOR statements to define and retrieve the following:
  1. Define a global variable pointing to the Digital Mitford site index document, posted in eXist at /db/mitford/si.xml or the official version at its external location: https://digitalmitford.org/si.xml. Hint: Declarations need to come first. declare variable $si as document-node() := doc('/db/mitford/si.xml'); or declare variable $si as document-node() := doc('https://digitalmitford.org/si.xml'); The new global variable must be added before the first let statement.
  2. Write a variable (either global or in let form) that locates all of the <place> elements in the site index document. Use the $si variable you just defined for the site index document in your expression. let $siPlaces as element()+ := $si//tei:place or as a global variable above the first let statement and after the variable defining the si.xml document: declare variable $siPlaces as element()+ := $si//tei:place;.
  3. For housekeeping purposes, rename the variable $places (that we defined earlier to retrieve $Chas//tei:placeName): Call it $Chasplaces.
  4. Define a new variable to retrieve the values of the @ref on those $Chasplaces. Don’t forget the string() to return the attribute value: let $ChasPlaceRefs := $Chasplaces/@ref/string()
  5. How would you rewrite the last XPath scavenger hunt solution as a let statement in this XQuery? (Find references to "CharlesI_MRMplay" in the site index): let $siChasRefs := $si//tei:div/*/*[descendant::*/@*="#CharlesI_MRMplay"]
Break (15 minutes; 2:30 p.m.–2:45 p.m.)
XPath for expressions; sequence and range variables (<oXygen/>) (20 minutes; 2:45 p.m.–3:05 p.m.)
1. In the <oXygen/> XPath Builder View, try this code: for $i in ("Curly", "Larry", "Moe") return concat($i, " is a Stooge!")
2. Can we write it as a simple map (with !)? ("Curly", "Larry", "Moe") ! concat(., " is a Stooge!")
3. Open the Digital Mitford site index URL in <oXygen/> using https://digitalmitford.org/si.xml . Try finding out the following in the <oXygen/> XPath Builder:
  1. Find each person we have listed as born in Scotland in the site index. Notice that sometimes place names are stored inside the <birth> elements. for $i in //person[contains(birth, "Scotland")] return $i. You should return 28 <person> entries.
  2. Now, modify that example to return the @xml:id (or anything else you want to find out about the <person> elements). For the @xml:id: for $i in //person[contains(birth, "Scotland")] return $i/@xml:id. Notice that we don’t need the string() function after the @xml:id in the <oXygen/> XPath Builder view because the <oXygen/> viewer exposes the attribute values and eXide does not.
FLWOR statements in XQuery: how for works: Part 1 (30 minutes; 3:05 p.m.–3:35 p.m.)
1. for in XQuery and iterative returns: for $i in $YourSequenceVariable. Look up the places coded in Charles the First for their entries in the Digital Mitford site index.
  1. Get the unique (distinct) values of @ref attributes on placeName elements. let $distChPRs := distinct-values($ChasPlaceRefs)
  2. Next, loop through each of these distinct values: for $i in $distChPRs
  3. How will we find the site index entry that matches up with each member of our sequence of place references in Charles the First? Each site index entry holds an @xml:id, and each placeName element has a @ref attribute whose value is formatted with a leading # followed by the @xml:id value.
  4. Write the variable that finds the site index entry whose @xml:id matches the value of the range variable in our for expression. let $siCPrs := $si//tei:place[@xml:id = substring-after($i, '#')]
FLWOR statements in XQuery: how for works: Part 2 (25 minutes; 3:35 p.m.–4:00 p.m.)
1. Sort your sequence: two ways:
  1. Apply the XPath sort() function to the variable that defines the sequence (above the for loop): let $distChPRs := $ChasPlaceRefs => distinct-values() => sort()
  2. Or, within the FLWOR, use the XQuery order by clause: order by $siCPrs/@xml:id, followed by nothing (default: ascending alphabetical order) or by a keyword (ascending or descending). For example, to sort in reverse alphabetical order by the @xml:id in the site index entry: order by $siCPrs/@xml:id descending
  3. Do either of these methods really deliver alphabetical order? In no human understanding of alphabetical order does Zebra come before aardvark. This sorting reflects Unicode order.
2. Number the results with $pos
  1. Set the $pos variable in the for statement: for $i at $pos in $YourSequenceVariable. Caution: order by happens after $pos is set, so for sorted, numbered output, use the sort() function on the sequence instead. Try a return like return concat($pos, ': ', $siCPrs/@xml:id)
3. Add where in a FLWOR expression to filter the returns
  1. Notice the blank results: A number of entries are not yet in the site index. We can filter by selecting only those where the variable $siCPrs exists: where $siCPrs
  2. Or use where to return only results in the site index whose string value contains "France": where $siCPrs[contains(., 'France')]
4. Which is more efficient in XQuery: a predicate or where?
5. Text returns: combining strings into one result: concat() and string-join()
  1. Retrieving a full canonical place name: let $name := $siCPrs/tei:placeName[1]
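The pieces of this session (for, at, let, where, order by, return) fit together as in the following XQuery 3.1 sketch. The references and the index entries are hypothetical inline data in no namespace; in the course files they come from Charles1.xml and si.xml, and the variable names here are illustrative.

```xquery
xquery version "3.1";

(: Hypothetical references and index entries, standing in for the course data. :)
let $refs := ('#Paris', '#London', '#Paris', '#Nowhere')
let $index :=
    <listPlace>
        <place xml:id="London"><placeName>London</placeName><country>England</country></place>
        <place xml:id="Paris"><placeName>Paris</placeName><country>France</country></place>
    </listPlace>
(: $pos numbers the sorted, de-duplicated input: #London 1, #Nowhere 2, #Paris 3 :)
for $ref at $pos in sort(distinct-values($refs))
let $entry := $index/place[@xml:id = substring-after($ref, '#')]
where $entry                               (: drops '#Nowhere', which has no index entry :)
order by $entry/@xml:id descending         (: reorders the results but not the numbers :)
return concat($pos, ': ', $entry/placeName[1])
```

Under standard FLWOR semantics this returns "3: Paris" and then "2: London": the numbers reflect the input order, because order by is applied after $pos has been assigned.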

Wednesday, May 28: XPath and XSLT

XQuery to HTML (9:00 a.m.–9:55 a.m.)

Putting it all together, Part I: Writing FLWORs to make new files (HTML output) (35 minutes; 9:00 a.m.–9:35 a.m.)
1. HTML returns: how to use curly braces to layer and activate XQuery in an HTML file.
  1. HTML table output: https://ebeshero.github.io/UpTransformation/Chas1_FrenchPlaces.html
  2. XQuery to make the HTML, as created in the June 2024 edition of this course: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xquery/DHSI-Queries/Chas1-SI-HTMLTable-2024.xquery, or earlier version: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xquery/DHSI-Queries/Chas-SI-HTMLTable.xql
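The curly-brace technique can be sketched in a few lines. This XQuery 3.1 example is not the course's Chas1-SI-HTMLTable query; it uses a hypothetical list of place names to show how literal HTML and enclosed XQuery expressions are layered.

```xquery
xquery version "3.1";

(: Hypothetical place names; the course query draws its data from the site index. :)
let $places := ('Paris', 'Calais', 'Rouen')
return
    <html xmlns="http://www.w3.org/1999/xhtml">
        <head><title>French places</title></head>
        <body>
            <table>
                <tr><th>No.</th><th>Place</th></tr>
                {
                    (: inside the braces, XQuery is evaluated; outside, markup is literal :)
                    for $place at $pos in $places
                    return <tr><td>{$pos}</td><td>{$place}</td></tr>
                }
            </table>
        </body>
    </html>
```

Without the braces, the text `for $place …` would be copied into the output as literal characters instead of being evaluated.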
Putting it all together, Part II: Writing FLWORs to make new files (SVG output) (20 minutes; 9:35 a.m.–9:55 a.m.)
1. SVG returns: a bar graph from XQuery
  1. SVG bar graph output: http://newtfire.org:8338/exist/rest/db/DHSI-Queries/Chas-PersNameGraph-SVG.xql (may require permission) or https://ebeshero.github.io/UpTransformation/Chas-PersNameGraph.svg
  2. XQuery to make the SVG, in the newtfire eXist-db: /db/DHSI-Queries/Chas-PersNameGraph-SVG.xql, or on GitHub: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xquery/DHSI-Queries/Chas-PersNameGraph-SVG.xql

Introduction to XPath in XSLT (9:55 a.m.–12:00 p.m.)

Preparation for writing XSLT in <oXygen/> (20 minutes; 9:55 a.m.–10:15 a.m.)
1. Settings: XSLT debugger and Saxon parser
2. Selecting files to run and save
  1. Open <oXygen/> and open the following url: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/ozymandias.xml
  2. Open this starter XSLT file, too: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/xsltStarter1.xsl. Save this file locally on your computer.
XSLT overview in <oXygen/> (10 minutes; 10:15 a.m.–10:25 a.m.)
1. XSLT (eXtensible Stylesheet Language Transformations) is a programming language expressed as an XML document, where programming instructions are represented by elements in the XSLT namespace
2. XSLT is a declarative programming language: it is not written to be executed in line-by-line order. Template elements (or template rules) do the work, and they can be written in any order.
3. Basic structure: <xsl:stylesheet> is the root element, with <xsl:template> children that do the processing.
Break (15 minutes; 10:25 a.m.–10:40 a.m.)
Housekeeping: up to three namespaces (10 minutes; 10:40 a.m.–10:50 a.m.)
1. Namespace for XSLT elements: the xsl: prefix distinguishes the XSLT elements
2. Namespace for input: if the input is in a namespace, set the @xpath-default-namespace attribute on the <xsl:stylesheet>. For example: <xsl:stylesheet xpath-default-namespace="http://www.tei-c.org/ns/1.0"> says that input will be in the TEI namespace unless specified otherwise.
3. Namespace for output: set the default namespace using the @xmlns attribute. For example, <xsl:stylesheet xpath-default-namespace="http://www.tei-c.org/ns/1.0" xmlns="http://www.w3.org/1999/xhtml"> means that input is in the TEI namespace and output will be HTML (that is, in the HTML namespace)
Housekeeping: <xsl:output> (5 minutes; 10:50 a.m.–10:55 a.m.)
1. Configure the @method, @html-version, @omit-xml-declaration, @include-content-type, and @indent attributes on the <xsl:output> element:
  1. <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no" include-content-type="no" indent="yes"/>
XSLT and templates, part 1 (20 minutes; 10:55 a.m.–11:15 a.m.)
1. Templates match patterns: <xsl:template match="???">: The @match attribute is an XPath pattern that specifies what the template processes. XPath patterns are not the same as XPath expressions because they don’t navigate or find; they just match. For example <xsl:template match="p"> will match and process all <p> elements. It is a mistake to write <xsl:template match="//p"> because @match values don’t have to find <p> elements; they just have to … well … match them.
2. In each example below, look at the @match value: What should the XPath pattern be matching in the source XML document? And how is this XPath different from the way we write XPath expressions (which have to find, and not just match, elements) in the XPath Toolbar?
  1. <xsl:template match="div/head"> Matches any <head> child of any <div> at any level of the XML hierarchy. In the XPath Toolbar, we have to start the expression with two leading forward slashes (//div/head) to indicate we are looking down the tree from the document node.
  2. <xsl:template match="div[count(descendant::p) gt 1]"> Matches any <div> element that contains more than one <p> descendant. In the XPath Toolbar, we must add // to the beginning.
3. Inside a template rule, an <xsl:apply-templates/> element specifies what to process at that location.
4. An <xsl:apply-templates/> element with no @select attribute means process all my child nodes here. What if you want to process only some children, or some non-children?
5. <xsl:apply-templates/> with a @select attribute specifies what to process. The value of @select is an XPath expression (not the shorter XPath pattern) because it has to find the things to process. The path starts from the current context, that is, from the single item you are processing at the moment. Examples:
  1. <xsl:template match="body/div"><xsl:apply-templates/></xsl:template>
  2. <xsl:template match="body/div"><xsl:apply-templates select="div[1]"/></xsl:template>
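The difference between a pattern in @match and an expression in @select can be sketched as follows. This is a minimal illustration that assumes input in no namespace with <body>, <div>, <head>, and <p> elements; the output element names (<section>, <h2>) are arbitrary choices for the example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Pattern: matches any <div> whose parent is <body>; no leading // needed -->
    <xsl:template match="body/div">
        <section>
            <!-- Expression: starting from the current <div>, find its <head> children -->
            <xsl:apply-templates select="head"/>
            <!-- Expression: find all <p> descendants, not only children -->
            <xsl:apply-templates select="descendant::p"/>
        </section>
    </xsl:template>

    <xsl:template match="head">
        <h2><xsl:apply-templates/></h2>
    </xsl:template>

    <xsl:template match="p">
        <p><xsl:apply-templates/></p>
    </xsl:template>
</xsl:stylesheet>
```

Because the two @select values name exactly what to process, anything else inside the <div> is skipped, which would not be the case with a bare <xsl:apply-templates/>.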
Identity transformation for making changes to an XML file (45 minutes; 11:15 a.m.–12:00 p.m.) |
1. Why perform an identity transformation?
2. How to perform an identity transformation: <xsl:mode on-no-match="shallow-copy"/>
3. Change the structure and add line numbers to the Ozymandias XML file
  1. Open the URL of our simple identity transformation starter: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/ID-TransformSimple-Starter.xsl
  2. Use attribute value templates to add numbers to the new <code> elements
4. Optional activity: Combining a collection of files into a single XML file
  1. See https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/coll-IDTransform.xsl.
5. Optional exercise: Repair our Pacific Voyage file:
  1. Open this file URL in <oXygen/>: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/ID-TransformTEI-Starter.xsl
  2. Develop this XSLT file following this exercise: http://dh.newtfire.org/XSLTExercise1.html
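An identity transformation with one change can be sketched as below. This is not the workshop starter file: the element names <line> and <l> and the @n attribute are hypothetical, chosen only to show how <xsl:mode on-no-match="shallow-copy"/> combines with an attribute value template.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Copy every node unchanged unless a more specific template matches it -->
    <xsl:mode on-no-match="shallow-copy"/>

    <!-- Hypothetical change: replace each <line> with a numbered <l> -->
    <xsl:template match="line">
        <!-- The curly braces mark an attribute value template:
             the XPath inside them is evaluated, not copied literally -->
        <l n="{count(preceding-sibling::line) + 1}">
            <xsl:apply-templates/>
        </l>
    </xsl:template>
</xsl:stylesheet>
```

Everything that is not a <line> passes through untouched. Note that this sketch processes only the children of <line>, so any attributes on the original <line> are not carried over to <l>.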

XSLT to HTML (1:30 p.m.–4:00 p.m.)

Preparing XSLT to Output HTML (25 minutes; 1:30 p.m.–1:55 p.m.) |
1. Using default XSLT processing. (This is not an identity transformation! Remove the <xsl:mode on-no-match="shallow-copy"/> element that shallow-copies unmatched nodes.)
2. <xsl:stylesheet> and <xsl:output>
3. Template matching on the document node to output HTML
4. Structure of an HTML document: <head> and <body>
5. Complete basic Ozymandias transformation:
  1. Input: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/ozymandias.xml
  2. Required HTML elements: <html>, <head>, <title>, <body>, <h1>, <h2>, <p>, <cite> (for publication venue), <div> (for poem), <br/> (NB: empty element, after all lines except the last)
  3. The output we want: https://ebeshero.github.io/UpTransformation/ozymandias.html
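The overall shape of such a transformation might look like the sketch below. The <xsl:output> settings are the ones given in the housekeeping session; the source element name <line> and the page title are assumptions for illustration, since the actual markup of ozymandias.xml should be checked first.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">

    <xsl:output method="xhtml" html-version="5" omit-xml-declaration="no"
        include-content-type="no" indent="yes"/>

    <!-- Match the document node once to build the HTML superstructure -->
    <xsl:template match="/">
        <html>
            <head>
                <title>Ozymandias</title>
            </head>
            <body>
                <xsl:apply-templates/>
            </body>
        </html>
    </xsl:template>

    <!-- Hypothetical source element: one <line> per line of verse -->
    <xsl:template match="line">
        <xsl:apply-templates/>
        <!-- Add <br/> after every line except the last -->
        <xsl:if test="following-sibling::line">
            <br/>
        </xsl:if>
    </xsl:template>
</xsl:stylesheet>
```

There is no <xsl:mode> here, so elements without a matching template fall back to the built-in rules, which output their text content without the markup.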
XSLT Activity: TEI XML to HTML transformation (40 minutes; 1:55 p.m.–2:35 p.m.) |
1. <xsl:stylesheet> adaptation for processing TEI to HTML: (It’s always a namespace issue!)
2. Open the URL of the Emily Dickinson Fascicle 16 file: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/dickinsonColl.xml, and study the document.
3. Open this starter XSLT file URL in <oXygen/>: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/TEI-HTML-Starter.xsl
4. The output we want: https://ebeshero.github.io/UpTransformation/dickinson16.html
Break (15 minutes; 2:35 p.m.–2:50 p.m.)
XSLT Activity: TEI XML to HTML transformation (continued) (35 minutes; 2:50 p.m.–3:25 p.m.) |
1. Writing and refining template rules.
2. Numbering lines
3. CSS for styling
XSLT activity: Making a linked table of contents (35 minutes; 3:25 p.m.–4:00 p.m.) |
1. The output we want
  1. https://ebeshero.github.io/UpTransformation/dickinson16-with-toc.html
2. How internal links work in HTML
3. Attribute Value Templates
4. Modal XSLT: Processing the same nodes in multiple ways
  1. Starter file for Modal XSLT to create the table of contents https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/dickinson-dhsi24.xsl
  2. One possibility for modal XSLT with a table of contents (example from DHSI 2024) https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/xslt/modal-dickinson.xsl
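Modal XSLT and attribute value templates can be combined to build internal links roughly as follows. This sketch is not the workshop solution: it assumes each poem is a TEI <div> child of <body> with a <head>, and the id values (poem1, poem2, and so on) are invented for the example.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xpath-default-namespace="http://www.tei-c.org/ns/1.0"
    xmlns="http://www.w3.org/1999/xhtml">

    <xsl:template match="/">
        <html>
            <head>
                <title>Poems with table of contents</title>
            </head>
            <body>
                <!-- First pass over the poems: table of contents entries -->
                <ul>
                    <xsl:apply-templates select="//body/div" mode="toc"/>
                </ul>
                <!-- Second pass over the same poems: full text -->
                <xsl:apply-templates select="//body/div"/>
            </body>
        </html>
    </xsl:template>

    <!-- toc mode: each poem becomes a link whose @href points to an @id -->
    <xsl:template match="div" mode="toc">
        <li>
            <a href="#poem{count(preceding-sibling::div) + 1}">
                <xsl:value-of select="head"/>
            </a>
        </li>
    </xsl:template>

    <!-- default mode: each poem gets the @id that the link points to -->
    <xsl:template match="div">
        <section id="poem{count(preceding-sibling::div) + 1}">
            <xsl:apply-templates/>
        </section>
    </xsl:template>
</xsl:stylesheet>
```

The link works because the @href value (with a leading #) and the @id value (without it) are computed by the same expression in both templates.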

Thursday, May 29: XPath and Schematron

Concluding XSLT (9:00 a.m.–9:30 a.m.)

XSLT activity: Making a linked table of contents (continued) (30 minutes; 9:00 a.m.–9:30 a.m.) |
1. Modal XSLT: Processing the same nodes in multiple ways

Using Schematron to constrain your markup (9:30 a.m.–12:00 p.m.)

Schematron overview (15 minutes; 9:30 a.m.–9:45 a.m.) |
1. Schematron is constraint based; Relax NG, XML Schema, DTD are grammar based
2. Sample constraint-based tasks involve multiple elements
  1. Are start pages (<start>) no larger than end pages (<end>)?
  2. Are birth dates no later than death dates?
  3. Does a list (e.g., of students in a course) contain duplicates?
  4. Do pointers to persons really point to persons (and not places)?
3. Schematron structure: <pattern> → <rule> → <assert> or <report>
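The <pattern> → <rule> → <assert> or <report> structure can be sketched with the start and end page example. The rule context below is an assumption: it matches any element that has both a <start> and an <end> child (one of each), whatever that element is called in the actual data.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Assumed context: any element with <start> and <end> children -->
        <sch:rule context="*[start and end]">
            <!-- assert: the message is shown when the test is false -->
            <sch:assert test="number(start) le number(end)">The start page must not be larger than the end page.</sch:assert>
            <!-- report: the message is shown when the test is true -->
            <sch:report test="number(start) eq number(end)">The start and end pages are the same; is this a one-page item?</sch:report>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

The test compares two different elements, which is the kind of multi-element constraint that a grammar-based schema language is not designed to express.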
Looking at Schematron (25 minutes; 9:45 a.m.–10:10 a.m.) |
1. Document analysis of our XML: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages.xml
  1. <start> shouldn’t be greater than <end>
  2. <issue> is optional, but we could omit it by mistake
  3. <initial> should usually be one letter
  4. Apostrophes and quotation marks should usually be curly (“, ”, ‘, ’), not straight (', ")
2. What Relax NG can constrain: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages.rnc
  1. <volume>, <issue>, <year>, <start>, and <end> must be positive integers
  2. <year> must be exactly four digits
  3. <issue> is optional
  4. No empty elements
3. Schematron to the rescue: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages.sch
  1. Anatomy of a Schematron rule
  2. Validating start and end pages
  3. Validating apostrophes and quotation marks (text, not markup)
4. Associating Schematron with XML
Schematron error reporting (15 minutes; 10:10 a.m.–10:25 a.m.) |
1. Schematron has the best error messages
2. Enhance Schematron reporting with <sch:value-of>: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages_value-of.sch
3. Enhance Schematron maintenance with <sch:let>: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages_variables.sch
4. Generate warnings as well as errors with @role: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/pages_warnings.sch
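The three enhancements can appear together in a single rule, as in this sketch based on the observation that <initial> should usually be one letter. The variable name and message wording are illustrative, and the role value "warning" is an assumption about what the validating tool recognizes, so check the linked files for the values used in class.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <sch:rule context="initial">
            <!-- sch:let: compute the value once and reuse it by name -->
            <sch:let name="length" value="string-length(normalize-space(.))"/>
            <!-- @role: "usually" suggests a warning rather than an error -->
            <sch:assert test="$length eq 1" role="warning">An initial is usually one letter, but "<sch:value-of select="."/>" has <sch:value-of select="$length"/> characters.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Using <sch:value-of> puts the offending value into the message itself, so the person fixing the document does not have to go looking for it.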
XPath functions practice: Leipzig glossing rules, part 1 (15 minutes; 10:25 a.m.–10:40 a.m.) |
1. Document analysis: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/leipzig.xml
2. Target output: http://htmlpreview.github.io/?https://github.com/ebeshero/UpTransformation/blob/master/data/leipzig.html
3. Validation challenge: the spaces and hyphens need to be aligned
4. Best practice
  1. Test the XPath separately first
  2. Develop and test incrementally
5. Schematron validation
  1. Housekeeping: create the Schematron skeleton in <oXygen/>, save it, link it to XML
  2. Two ways of counting spaces and hyphens
    1. translate()^⬀ string-length('one two three') - string-length(translate('one two three', ' ', ''))
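    2. tokenize()^⬀ count(tokenize('one two three', ' ')) or tokenize('Curly Larry Moe', '\s+') => count() (for a non-empty string, both return the number of tokens, which is one more than the number of separators)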
Break (15 minutes; 10:40 a.m.–10:55 a.m.)
XPath functions practice: Leipzig glossing rules, part 2 (35 minutes; 10:55 a.m.–11:30 a.m.) |
1. Comparing three things
  1. Three-way test not available in XPath
    1. $a eq $b eq $c
    2. $a lt $b lt $c
  2. What is available
    1. Composite expression: $a eq $b and $b eq $c
    2. Compare to average value: ($a, $b, $c) != avg(($a, $b, $c)) (for numbers; returns true if the three values are not all equal)
    3. Count distinct values
      1. count(distinct-values(($a, $b, $c))) eq 1
      2. distinct-values(($a, $b, $c)) => count() eq 1
2. Whitespace normalization
  1. Require it in the XML with Relax NG xsd:string { pattern = "(\S+ )*\S+" }
  2. Require it in the XML with Schematron test='. eq normalize-space(.)'
3. Solutions
  1. Simple https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/leipzig-basic.sch
  2. Enhanced https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/leipzig.sch
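Counting with translate() and comparing with distinct-values() can be combined in one rule, sketched below. This is not the linked solution: the element names <example> and <line> are hypothetical stand-ins for whatever leipzig.xml actually uses, and only spaces (not hyphens) are checked.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <sch:pattern>
        <!-- Hypothetical structure: an <example> with several <line> children -->
        <sch:rule context="example">
            <!-- One space count per line: full length minus length without spaces -->
            <sch:let name="space-counts"
                value="for $line in line return string-length($line) - string-length(translate($line, ' ', ''))"/>
            <!-- If every line has the same count, there is one distinct value -->
            <sch:assert test="count(distinct-values($space-counts)) eq 1">All lines of an example must contain the same number of spaces, but the counts are <sch:value-of select="string-join(for $c in $space-counts return string($c), ', ')"/>.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

The distinct-values() approach scales to any number of lines without rewriting the test, unlike the composite expression with and.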
The Three Stooges go to Schematron Summer Camp (30 minutes; 11:30 a.m.–12:00 p.m.) |
1. The Edge Case Saloon
  1. “QA Engineer walks into a bar. Orders a beer. Orders 0 beers. Orders 999999999 beers. Orders a lizard. Orders -1 beers. Orders a sfdeljknesv.”
  2. More edge cases at https://www.sempf.net/post/On-Testing1
2. Best Stooge Ever contest results: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/stooges.xml
3. Hands-on validation tasks
  1. All stooges must have percentages (no empty <stooge> elements)
  2. Percentages total 100
  3. Individual votes range from 0 through 100, inclusive
  4. There are exactly three stooges!
  5. No duplicate stooges!
4. Solution (no peeking!) https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/stooges.sch

Schematron and external files (1:30 p.m.–4:00 p.m.)

ID/IDREF validation (25 minutes; 1:30 p.m.–1:55 p.m.) |
1. Files
2. Details
  1. Datatypes xsd:ID, xsd:IDREF, xsd:IDREFS
  2. Value must be unique within the document
  3. Lexical space: NCName (begin with letter or underscore, may contain letters, digits, underscores, hyphens, periods) (simplified)
  4. @xml:id is not of type xsd:ID unless your schema says it is
  5. You don’t have to call it @xml:id, but you should
  6. Validates by exact string matching
3. Limitations
  1. Validates only within the same file (but XInclude can help)
  2. No subcategory support (e.g., you can’t require person IDREF to match only person ID)
4. Desiderata
  1. Validation against external (remote) files
  2. Subcategory support
General comparison and value comparison (20 minutes; 1:55 p.m.–2:15 p.m.) |
1. Value comparison
  1. Operators: eq, ne, lt, gt, le, ge
  2. Compares one thing to one thing
  3. Example: count(distinct-values(('Curly', 'Larry', 'Moe'))) eq 1
2. General comparison
  1. Operators: =, !=, <, >, <=, >= (angle brackets may have to be spelled &lt;, &gt;)
  2. Compares sequences of any length
  3. Example:
    1. 'Curly' = ('Curly', 'Larry', 'Moe')
    2. What does 'Curly' != ('Curly', 'Larry', 'Moe') return? What should we have written instead? not('Curly' = ('Curly', 'Larry', 'Moe'))
  4. substring(@ref, 2) = $ancillary//person/@xml:id
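The last expression above uses general comparison because the right-hand side is a sequence of many @xml:id values, and = is true if the left-hand value equals any one of them. A Schematron rule built around it might look like this sketch, where the file path 'si.xml', the <persName> context, and the tei prefix are assumptions for illustration.

```xml
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt2">
    <!-- Schematron needs an explicit prefix for elements in a namespace -->
    <sch:ns prefix="tei" uri="http://www.tei-c.org/ns/1.0"/>
    <!-- Assumed location of the external file; how a relative path
         is resolved depends on the processor -->
    <sch:let name="ancillary" value="doc('si.xml')"/>
    <sch:pattern>
        <sch:rule context="tei:persName[@ref]">
            <!-- substring(@ref, 2) drops the leading # from the pointer -->
            <sch:assert test="substring(@ref, 2) = $ancillary//tei:person/@xml:id">The pointer <sch:value-of select="@ref"/> does not match the @xml:id of any person in the ancillary file.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Because the path on the right selects only person elements, this rule provides both things that ID/IDREF validation lacks: checking against an external file and checking against a specific subcategory.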
Schematron validation (25 minutes; 2:15 p.m.–2:40 p.m.) |
Break (15 minutes; 2:40 p.m.–2:55 p.m.)
Exploring Digital Mitford (30 minutes; 2:55 p.m.–3:25 p.m.) |
1. Project site: https://digitalmitford.org
2. Site index
  1. Workshop repo on GitHub: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/si.xml
  2. Mitford project site: https://digitalmitford.org/si.xml
  3. Outline: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/si-outline.xml
Hamilton 1823-04-09 letter (35 minutes; 3:25 p.m.–4:00 p.m.) |
1. Letter
  1. XML: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/1823-04-09-Hamilton.xml
  2. Read online: https://digitalmitford.org/getLetterText.php?uri=1823-04-09-Hamilton.xml
2. Schematron starter: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/mitford.sch
3. Tasks
  1. Save local copy of Schematron
  2. Associate letter with local copy
  3. Test validation of <editor> element
  4. Add and test rules for other element types

Friday, May 30: Taking stock

Schematron Exercise (9:00 a.m.–10:15 a.m.)

Webb 1819-05-16 letter (or participant project data) (60 minutes; 9:00 a.m.–10:00 a.m.) |
1. Letter
  1. XML: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/1819-05-16_MWebb.xml
  2. Read online: https://digitalmitford.org/getLetterText.php?uri=1819-05-16_MWebb.xml
2. Schematron starter: https://raw.githubusercontent.com/ebeshero/UpTransformation/master/data/mitford-back.sch
3. New items for the site index are in the <back>
  1. Some @ref values in the back have also already been added to the site index; report pointers to them as errors
  2. Some @ref values in the back still have to be added to the site index; report them as info
  3. If an element that should have a @ref doesn’t, report an error
Break (15 minutes; 10:00 a.m.–10:15 a.m.)

Putting it all to work (10:15 a.m.–11:30 a.m.)

Hands-on activity with participant data: TBA (60 minutes; 10:15 a.m.–11:15 a.m.) |
1. Watch this space!
Retrospective (15 minutes; 11:15 a.m.–11:30 a.m.)

Processing Your XML/TEI with the XML Family of Languages

DHSI 2025 (Week 1, 26–30 May, 2025)
