What’s an XML database and how does it work?

XQuery is one of the XML family of languages. It builds on what you have learned of XPath, and we use it to work with XML databases. XML databases work by storing XML files and building persistent indexes for them, and this indexing capacity makes it speedy and efficient to search for elements and attribute values and to calculate functions (anything you can locate and process with XPath expressions) across collections of files. Because the database builds an index of each file, the computer doesn’t have to review the entire file every time you run XQuery code. In effect, the database’s index stores the tree structure of the XML in the database and makes it available for quick retrieval through XQuery.

How to Access our eXist XML database:

We are working with a particular XML database called eXist-db, which we have installed on the NewtFire server. Usually when we work on homework exercises and on project development, we will be working on our NewtFire eXist installation. We can also write and run XQuery locally inside <oXygen/> to work with a batch of files we have saved locally: click the little XQuery Debugger button right next to the XSLT debugger button in the top right of the <oXygen/> window. For our projects we tend to prefer working with an eXist installation, either on a server or offline on a local computer, because a) eXist has indexing tools that make it more efficient to run queries over multiple collections stored on a remote server, and b) we can connect the XQuery scripts and XML files we have stored in eXist to our project websites. Here’s how to access our NewtFire eXist database:

  1. In a web browser, go to http://newtfire.org:8338/
  2. Wait for eXist to load, and it will show you a page of several icons/images.
  3. Look for the eXide button in the second row of icons and click on it. An interface will open where you can input XQuery code and work with the collections of files we have stored here.
  4. For our first assignments, a login is not strictly necessary, but as you work on projects or if you wish to save your XQuery code to a directory you create here, you will need to log in with a username and password. If you are an enrolled student in one of my digital humanities courses, I will likely have created a login for you, and for details please see Courseweb.
  5. If you save an XQuery script to the database, give it a file extension to identify it as XQuery: filename.xql or filename.xquery. Note: You can also write XQuery in <oXygen/> to query a document or collection on your local computer: Open a new XQuery file from File -> New, and you will notice that <oXygen/> uses the .xquery extension.

How the database is organized:

eXist holds file directories, or collections, in a hierarchical structure, so that you can access and query a collection of XML files all together. You might think of a collection as one giant XML file with subfiles inside, so you can step up and down the file directory structure much as you step up and down the XML element hierarchy on the parent/ancestor and child/descendant axes within a single XML document. In the eXist database there is a single root directory called db, with subfolders containing folders (or collections), which may in turn contain their own subfolders (more collections), and finally files. I’ve installed a copy of our Georg Forster XML file here, in a collection called voyages, inside a directory called pacific, and that means that its address in our database is /db/pacific/voyages/ForsterGeorgComplete.xml, starting from the root db directory.

As we work on project development, you may find that you want to upload your own collection of XML files into eXist, and we’ll walk you through how to do that. This is different from uploading files to publish on your web space, which makes them publicly viewable but doesn’t build index files or let you collect, extract, and remix your coding using XQuery.

XQuery for a Single Document vs. a Collection:

XQuery uses XPath expressions to find its way through its index of files. It can work on one file, with the doc() function, or on a whole collection, with the collection() function.

Actually, both doc() and collection() are XPath expressions (doc() reaches for an XML document node and the collection() function retrieves a collection of document nodes). We’ll be adding more XPath once you’ve designated the document or collection: You can write XPath expressions, use predicates and functions, and walk up and down axes. Your XPath expressions will locate results from all the files in a collection as long as those files are coded (at least structurally) in the same or similar ways.
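
As a minimal sketch of the difference, the query below binds one variable with doc() and another with collection(), and then counts the document nodes each one holds. The paths are the ones described above for our Forster file and its collection; the variable names are only for illustration.

```xquery
(: doc() reaches for a single document node :)
let $forster := doc('/db/pacific/voyages/ForsterGeorgComplete.xml')
(: collection() retrieves a sequence of document nodes, one per file :)
let $voyages := collection('/db/pacific/voyages')
(: return the two counts side by side to compare them :)
return (count($forster), count($voyages))
```

The first number counts the single document, and the second counts however many files are stored in the voyages collection.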

The TEI and XQuery: Declaring the TEI Namespace

Speaking of coding in the same or similar ways, we need to introduce you to the Text Encoding Initiative, or TEI. This is a language of XML with designated rules and tag sets for coding digital versions of literary, linguistic, historical, and cultural texts, and it represents an international standard for coding work consistently for long-term, sustainable archives. TEI is also a community, and people (like me) serve on its Technical Council to make judgment calls on best practices and coding guidelines. We use TEI to build digital archives that can "talk to" each other around the world, and follow recognizable, standard patterns. We could make up our own XML tag sets, but when coding cultural resources, it’s a good idea to make your work accessible, so it is easy for others to access and, say, load into databases to run XQuery for analyzing it, or studying it, or connecting it with other comparable texts in other archives! We’ll talk more about TEI structure and coding, and give you some experience with it. (To read more, here’s the TEI’s home site.) For now, you can quickly tell if one of our files is coded in TEI from its root element: <TEI> .

XQuery requires a namespace declaration when we query TEI files because every TEI element belongs to the TEI namespace. Without the declaration, an element name in an XPath expression is read as being in no namespace at all, so a step like //title matches nothing in a TEI file. (The schema rules for TEI, which determine if your file is valid as a TEI document, are defined for elements in this namespace.) Similarly, HTML has its own namespace, with certain rules governing the relationship of tags, their organization, etc. When we query our TEI files, we’ll need to include the following namespace declaration as the first statement of our XQuery:

declare default element namespace "http://www.tei-c.org/ns/1.0";
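
To see why the declaration matters, here is a small self-contained sketch you can run without any files in the database. The $sample variable holds a made-up, minimal TEI fragment (not one of our project files), and the tei: prefix is declared only for this illustration. The same <title> element is counted twice: once with no namespace on the name, and once with the TEI namespace.

```xquery
(: Bind the tei: prefix so we can name TEI elements explicitly :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: A made-up, minimal TEI fragment for illustration only :)
let $sample :=
  <TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
      <fileDesc>
        <titleStmt><title>Sample</title></titleStmt>
      </fileDesc>
    </teiHeader>
  </TEI>
return (
  count($sample//title),      (: 0: looks for a title in no namespace :)
  count($sample//tei:title)   (: 1: looks for a title in the TEI namespace :)
)
```

Declaring the TEI namespace as the default element namespace, as in the line above, lets us write //title without a prefix and still match the TEI element.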

Following are examples of some XQuery expressions on collections of TEI files in our eXist database. Try copying them into the eXide window and running them by clicking on the Eval button. Notice the results you return with each.

  1. declare default element namespace "http://www.tei-c.org/ns/1.0";
                collection('/db/pacific/literary')//titleStmt/title

    The above expression accesses a collection of files, the literary texts associated with our Pacific voyages project. It starts at the root of the eXist directory, always named db, and steps down into a collection named pacific, and into a collection-inside-the-collection called literary. (There are a couple of other collections inside the pacific collection, named voyages and mapping, and you can access these collections by inputting their names in the appropriate directory path step inside the collection() function.) Notice that after the collection() function, we are stepping down the XPath descendant axis with // and peering into two standard TEI elements that sit in a nested relation to one another. In the TEI header there must always be a <titleStmt> element, and it must contain a <title> element that is understood to be the title of the XML document. You can use the <title> element elsewhere in a TEI document to mark titles of anything (say, references in the document to other books, works of art, etc.), but the <title> inside the <titleStmt> has the special function of identifying the title of the XML file itself. So, looking at your output you should see a list of those special <title> elements, which helps us to view at a glance the contents of a TEI collection like this. Stepping down the tree helps us to isolate just the piece of it that we want to return when we run (or eval) the XQuery code.

  2. declare default element namespace "http://www.tei-c.org/ns/1.0";
                collection('/db/pacific/literary')/distinct-values(descendant::body//persName)

    This XQuery illustrates the use of an XPath function, distinct-values(), to return a list of the distinctly different names of people (indicated by the TEI <persName> element) referenced within the <body> portion of each document in the collection. (In the TEI, much like in HTML, the <body> is a major top-level structure in the text's hierarchy and typically contains a text: in this case, the full text of a poem, novel, or play. Other parts of a TEI file include a header holding metadata, or information about the document, such as its title (up in the <titleStmt>) and publication data.) Here, we want you to notice how we positioned the distinct-values() function. We keep the collection() function outside of distinct-values(), and once we are inside the collection, we take distinct values starting from the current context, which we indicate here with the descendant:: axis. We could equally have written .//body//persName, using the dot (.), which means (as it always does in XPath) the current context node. If you try running this query with neither the dot nor the axis, as body//persName, the function will lack the starting point it needs: it will look for a <body> child directly under the document node and find nothing. Note that the context here is each document in the collection in turn, not the collection as a whole, so a name that appears in more than one file is listed once for each file. To remove duplicates across the entire collection, wrap the whole path inside the function instead: distinct-values(collection('/db/pacific/literary')//body//persName).
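
The placement of distinct-values() changes what counts as a duplicate. Here is a small sketch to test that, using two made-up documents held in a variable instead of a real collection. The element names are left in no namespace to keep the example short, and the names in the documents are invented for illustration.

```xquery
(: Two made-up documents standing in for a collection :)
let $docs := (
  document { <text><body><persName>Cook</persName><persName>Omai</persName></body></text> },
  document { <text><body><persName>Cook</persName></body></text> }
)
return (
  (: function as a path step: evaluated once per document, so Cook is kept twice :)
  count($docs/distinct-values(descendant::body//persName)),   (: 3 :)
  (: function wrapped around the whole path: duplicates removed across all documents :)
  count(distinct-values($docs//body//persName))               (: 2 :)
)
```

Which form you want depends on the question you are asking: names per file, or names across the whole collection.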

FLWOR Expressions in XQuery

Flower or FLWOR expressions are a powerful tool in XQuery, letting us work in more complex ways with querying and remixing information in files and collections, sometimes both in the same expression! The letters stand for the statements For, Let, Where, Order by, and Return. Here's a primer on FLWOR (or really, LFWOR!):

A really, really simple little FLWOR

 let $hamlet := doc('/db/shakespeare/plays/hamlet.xml')
            return $hamlet   

Another simple FLWOR: processing a collection of files to return information from a single document

Here is an example to demonstrate how we can start with a variable defining a collection of files, and reach into it to retrieve information from a particular special file inside. Note: this particular collection, our Pacific voyage collection, is in the TEI namespace, so we require a special namespace declaration line.

      declare default element namespace "http://www.tei-c.org/ns/1.0";
      let $pacific := collection('/db/pacific/voyages')/*
      let $GeorgFile := $pacific[descendant::author[contains(., 'Georg')]]//titleStmt/title
      return $GeorgFile
      

This returns just one result in the eXide output window:

       1
      <title xmlns="http://www.tei-c.org/ns/1.0">A Voyage Round the World in His Majesty's Sloop, Resolution, commanded by Capt. James Cook, 
      during the Years 1772, 3, 4, 5.</title>
      

Notice how we referenced the descendant:: axis in our XQuery FLWOR. We could also have used .// to indicate the self:: axis, but we must NOT use //. We require the dot or the indication of the descendant:: axis in the variable $GeorgFile to set a starting point, to indicate that we are stepping down from the position defined by the $pacific variable. (If we do not use the dot, we return zero results because the starting position of the XPath in the predicate is unclear to the computer parser! Try it yourself and see what happens.)

Examples of two related FLWOR Expressions, to demonstrate Where and For statements

  1. No For statement here:
    declare default element namespace "http://www.tei-c.org/ns/1.0";
                let $cook := doc('/db/pacific/voyages/cookVoy2Pnum.xml')
                let $p := $cook//p[geo]
                let $geo := $cook//p/geo
                let $countlat := count ($geo[@select="lat"])
                let $countlon := count ($geo[@select="lon"])
                where $countlat gt $countlon
                return $p
  2. Using a For statement, with an XQuery comment.
    Note: An XQuery comment is formatted inside smiley faces like this: (: your comment here :)
    declare default element namespace "http://www.tei-c.org/ns/1.0";
                let $cook := doc('/db/pacific/voyages/cookVoy2Pnum.xml') 
                let $Paras := $cook//p[geo]
                let $geo := $cook//p/geo
                let $countlat := count ($geo[@select="lat"])
                let $countlon := count ($geo[@select="lon"])
                for $p in $Paras 
                where $countlat gt $countlon
                return string-join(('paragraph',$p/@n),': ') 
                 
             (: Note use of the string-join function, with its separator. Also notice which parts
             of it take the single-quotes ' ', and which parts do not! The single quotes,  ' ' , allow you to indicate 
    that you want some literal text to be returned here. Without it, the computer thinks you are referring to an XPath expression. :) 
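
In both examples above, the Where statement compares two counts taken over the whole document, so it either keeps every result or none. As a contrasting sketch, the query below moves the counts inside the For loop so that Where tests each paragraph separately. It runs on two made-up <p> elements (not the Cook file), left in no namespace for brevity.

```xquery
(: Two made-up paragraphs for illustration only :)
let $paras := (
  <p n="1"><geo select="lat"/><geo select="lat"/><geo select="lon"/></p>,
  <p n="2"><geo select="lon"/></p>
)
for $p in $paras
(: these counts are recalculated for each paragraph in turn :)
let $countlat := count($p/geo[@select = "lat"])
let $countlon := count($p/geo[@select = "lon"])
(: keep only the paragraphs with more lat than lon values :)
where $countlat gt $countlon
return string-join(('paragraph', $p/@n), ': ')
```

Only the first paragraph passes the test, so the query returns the single string paragraph: 1.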

The O in the FLWOR: Order

The Order statement in the FLWOR is optional, but when you use it, it must come after any Where statement and before the Return. One of the standard, default uses of Order is to sort a list of results in alphabetical order, so, for example:

order by $a

organizes results in alphabetical order, sorted by whatever is indicated in the variable $a.

There are more complex ways to set up an Order statement to organize results. For example, you can order by descending to get reverse alphabetical order:

order by $a descending

Or you can order a set of results according to their numerical position or count, in ascending or descending order.
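
Here is a short sketch that combines the two kinds of ordering, using a hand-typed list of names in place of a query over a file. It sorts first by the length of each name, longest first, and then alphabetically to settle ties.

```xquery
(: A hand-typed sequence of names for illustration :)
let $names := ('Horatio', 'Bernardo', 'Francisco', 'Ophelia')
for $a in $names
(: first sort key: name length, longest first; second sort key: alphabetical :)
order by string-length($a) descending, $a
return $a
```

This returns Francisco, Bernardo, Horatio, Ophelia: the two seven-letter names tie on length, so the second sort key puts Horatio before Ophelia.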

Building New HTML or XML with XQuery: Using Curly Braces: { }

To add HTML or XML markup to the XQuery output, add the elements where needed to produce conformant code. However, these elements are passive, or non-functional when executing XQuery commands. So we use curly-braces { } to enclose any XPath or XQuery statements that we want to execute in XQuery, to separate them from the HTML or XML markup elements. Inside html elements, when we need to do some calculation or refer to a variable we defined in XQuery, we use the curly-braces again. We’ll work on some examples in class. Here is one example that may be helpful as a reference point, showing how to make an HTML page with a table of two columns, making a list of two related variable results side by side. The resulting html file is coded to display a table of the distinct characters (<speaker> elements) in Hamlet from our Shakespeare collection, next to a count of their speeches (<sp>) in the play. Speeches in the play are coded in TEI like this, with speaker names entered as a child element. (Speaker identifiers are also coded as an attribute on the sp element. In the code below, we will simply work with the contents of the speaker element, but you could practice and see if you can adapt our example by changing it to work with the @who attribute instead.)

<sp xmlns="http://www.tei-c.org/ns/1.0" who="Francisco">
  <speaker>Francisco</speaker>
  <l xml:id="sha-ham101002" n="2">Nay, answer me: stand, and 
     unfold yourself.</l>
</sp>

Notice the position of the curly braces in the example:

      xquery version "3.1";
declare default element namespace "http://www.tei-c.org/ns/1.0";  
         <html>
         <head><title>Speakers and counts of their 
         speeches in Hamlet</title></head>
         <body>
         <table>
         
         {
         let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
         let $speeches := $hamlet//sp
         let $speakers := $hamlet//speaker
         let $distinctsp := distinct-values($speakers)
         for $sp in $distinctsp
         let $count := count($speeches[speaker = $sp])
         order by $count descending
         return
         
         <tr>
         <td>{$sp}</td>
         <td>{$count}</td>
         </tr>
         }
         </table>
         
         </body>
         </html>

Here’s what’s happening when we apply the curly braces { }. These wrap the portion of our code in which XQuery must be processed. We write the basic structural HTML tags (the html, head, and body elements) to encircle our FLWOR statement, since these do not require any special XQuery processing and just need to be output to create a well-formed and valid HTML document. We then encircle the whole FLWOR statement inside curly braces. When you write this in the eXide window, you will notice that if you remove those curly braces and hit the Eval button, the XQuery code is simply output as text (and appears all the same color as the HTML markup). When you apply the curly braces, eXide applies color to show you the XQuery code is active. So, why do we need a second set of curly braces inside our return statement, where we output the <td> elements? Try removing them and look at your output! The answer has to do with the use of HTML (or other non-XQuery markup code, such as XML or KML, etc.) in our output: The computer parser requires the curly braces any time you are representing the contents of an angle-bracketed element, so that it can tell whether a string of text inside the angle-bracketed tags is a literal text string (no curly braces) or XQuery code to be processed (nested within curly braces).
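
To see the difference for yourself, try this tiny sketch, which needs no files from the database. The variable $count is set by hand, and the same table cell is written three ways.

```xquery
let $count := 3
return
  <table>
    <tr><td>$count</td></tr>       {(: no braces: the literal text $count is output :)}
    <tr><td>{$count}</td></tr>     {(: braces: the variable is evaluated, giving 3 :)}
    <tr><td>{$count + 1}</td></tr> {(: braces: a calculation is evaluated, giving 4 :)}
  </table>
```

Notice that even the XQuery comments have to sit inside curly braces here, because without them they too would be treated as literal text in the output.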

Applying XPath string functions to control output in XQuery

Our model for the next two examples is adapted from Obdurodon’s Generating a list of characters from a collection of Shakespeare plays in alphabetical order. Try testing and exploring the XQuery scripts below with our Shakespeare collection on the newtFire eXist-db.

1. Returning a concatenated string of results in plain text:

This example returns the characters in Hamlet whose names end with the letter “o”, and outputs the number of characters in their names. To follow this example, you should review the string functions in XPath, so see part III on Strings in Obdurodon’s The XPath functions we use the most.

xquery version "3.1";
     declare default element namespace "http://www.tei-c.org/ns/1.0";
     let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
     let $speakers := distinct-values($hamlet//speaker) 
     for $speaker in $speakers
     let $NameLength := string-length($speaker)
      where ends-with($speaker,'o')
       (:order by string-length($speaker):)  
          (:commenting out! :)
     order by $NameLength
         
       (:return $speaker:) (:commenting out! :)
     return concat ($speaker, ' has ', $NameLength , ' characters.')

2. Insert HTML formatting around the FLWOR statement to turn the results into a web page

Notice the positioning of two pairs of curly braces { } in this XQuery code:

  declare default element namespace "http://www.tei-c.org/ns/1.0";
  <html>
         <head><title>Title</title></head>
         <body> {
            let $hamlet := doc('/db/apps/shakespeare/data/ham.xml')
            let $speakers := distinct-values($hamlet//speaker)
            for $speaker at $pos in $speakers                              
         (: The above line creates a special variable named $pos that
         identifies the position number of each speaker in the sequence of all the distinct speakers. We can use that position number in our output. :) 
        
            let $speakerLength := string-length($speaker)
            where ends-with($speaker,'o')
            order by $speakerLength
            return 
            <p>{concat ($speaker, '#', $pos, ' has ', $speakerLength , ' characters')}</p>
           }
          </body>  
          </html>

Namespaces and XQuery output formats

While we frequently write XQuery to output plain text or HTML, we can also write it to produce output code in a namespace, such as specialized forms of XML like TEI or KML. Above, when we were processing XQuery on a TEI file for the Pacific project, we used a convenient line of code at the top of the file:

declare default element namespace "http://www.tei-c.org/ns/1.0";

Using this means that the default format of all elements being processed and output will be in TEI, and that was fine for our processing above. It may not be okay, though, when you need to process the special Wordhoard TEI Shakespeare collection to convert its TEI elements into HTML elements. Here we need to declare two namespaces, and we have to decide which one should be the default. The one that isn't marked as the default will have to be distinguished using a namespace prefix, like this: tei:text (for the TEI element <text>). When transforming from TEI to HTML, we recommend setting the output HTML as the default namespace and treating TEI elements with prefixes. (Generally speaking, we suggest setting the namespace of the output file as the default namespace in your XQuery code.) Here is how to set a default namespace line and a namespace line that requires prefixes:

         declare default element namespace "http://www.w3.org/1999/xhtml";
         declare namespace tei="http://www.tei-c.org/ns/1.0";
         (:Continue writing XQuery here... :)

The top line of our example above is a default element namespace line, which we're setting for our output format, the HTML namespace. (We found it by opening an HTML file in oXygen, and just pasted it in here.) The default element namespace won't require us to set prefixes, but if we want to be processing code from a different namespace, we need to declare it too. The TEI elements being processed will all require the tei: prefix in front for the code to properly distinguish these elements. Note: Attributes are in no namespace at all, but their parent (hosting) elements are what is namespaced. That means you only need to use the namespace prefix on the element names, not the attributes.
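
As a minimal sketch of the two declarations at work, the query below converts one TEI speech into an HTML paragraph. The speech is typed directly into the query (modeled on the Francisco speech shown earlier) so that it runs without the Shakespeare collection, and the wording of the output paragraph is only an illustration.

```xquery
declare default element namespace "http://www.w3.org/1999/xhtml";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: A TEI speech typed in by hand: every TEI element name takes the tei: prefix :)
let $sp :=
  <tei:sp who="Francisco">
    <tei:speaker>Francisco</tei:speaker>
    <tei:l n="2">Nay, answer me: stand, and unfold yourself.</tei:l>
  </tei:sp>
return
  (: the unprefixed <p> lands in the default (HTML) namespace;
     the @who attribute needs no prefix :)
  <p>{string($sp/@who)} says: {string($sp/tei:l)}</p>
```

The <p> element in the output carries the HTML namespace, while the tei: prefix is needed only where the query reaches into the TEI input.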

Links to Some Excellent XQuery Resources: