Saturday 28 January 2012

CMT3315 Lab 11 - XML Schemas


Today's post answers questions related to XML Schemas.

You can download the complete lab questions from here.

Quick Questions

1. The following passage is to be found in the middle of a particular XML document:

  The heavily-used<service xlink:type = "simple" xlink:href ="http://www.thetrams.co.uk/croydon"> Croydon Tramlink </service> provides a cross link to nearby <location>Wimbledon</location>, <location>Addington</location> and <location>Beckenham</location>.
 

What can you say about how the text Croydon Tramlink will be treated by a browser such as Mozilla Firefox?

A browser that supports XLink simple links would treat the text "Croydon Tramlink" as a hyperlink. Since the "show" XLink attribute was not specified, clicking the link would load the ending resource in the same browser window, entirely replacing the starting resource. The XLink specification does not itself fix a default for "show", but applications that support simple links typically behave as if "replace" had been specified.


2. It’s possible to provide validation for a class of XML document using a Document Type Definition (.dtd) file, or using an XML schema. The DTD approach is easier. Why might you want to use the XML schema approach?

Just like a DTD, an XML schema defines the basic structure of an XML document. XML Schemas, however, offer developers much more control over what the document can and cannot contain, and do so using an object-oriented approach. In contrast to the DTD's basic CDATA and PCDATA data types, XML Schemas offer data types such as integers, strings, dates and decimal numbers. Furthermore, a developer can use these basic types to define more complex types.

XML Schemas become essential for validating data contained in XML documents, especially where that data is destined to be stored in a database.


Longer Questions
The answers to the longer questions are given below. Please refer to the Lab 10a handout for the questions themselves.

1a. Here is an XML document:

  <?xml version="1.0" encoding="UTF-8"?> 
  <book isbn="0836217462"> 
   <title> Being a Dog Is a Full-Time Job </title> 
   <author>Charles M. Schulz</author> 
   <character>
    <name>Snoopy</name>
    <friend-of>Peppermint Patty</friend-of> 
    <since>1950-10-04</since>
    <qualification> extroverted beagle </qualification>
   </character>
   <character>
    <name>Peppermint Patty</name>
    <since>1966-08-22</since>
    <qualification>bold, brash and tomboyish</qualification>
   </character>
  </book>
 

An XML schema is to be constructed, which will validate this document and other similar documents. Make notes on the elements etc that this document contains, and record any significant factors about them.

The first step in designing an XML Schema for an XML document is to analyse the contents of the document itself to identify suitable data types. In addition, the structure of the XML document dictates the complex data types that will need to be constructed.

For instance, in the example above, most elements contain strings. Two noticeable exceptions are the "<since>" element, which holds a date, and the "<book>" element's "isbn" attribute, which holds numerical data (ten digits, including a leading zero). Furthermore, the "<book>" and "<character>" elements are complex types made up of other elements of simple types, and "<character>" can occur more than once. We can also notice that the "<friend-of>" element is optional and refers to another "<character>" element by name.
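
These observations can be checked mechanically before the schema itself is written. The following is a minimal Python sketch using only the standard library; the variable names, and the idea of testing the notes this way, are illustrative assumptions and not part of the lab handout.

```python
import xml.etree.ElementTree as ET
from datetime import date

# The sample document from the question (XML declaration omitted).
BOOK_XML = """<book isbn="0836217462">
 <title> Being a Dog Is a Full-Time Job </title>
 <author>Charles M. Schulz</author>
 <character>
  <name>Snoopy</name>
  <friend-of>Peppermint Patty</friend-of>
  <since>1950-10-04</since>
  <qualification> extroverted beagle </qualification>
 </character>
 <character>
  <name>Peppermint Patty</name>
  <since>1966-08-22</since>
  <qualification>bold, brash and tomboyish</qualification>
 </character>
</book>"""

book = ET.fromstring(BOOK_XML)
characters = book.findall("character")
names = {c.findtext("name") for c in characters}

# <since> holds ISO 8601 dates, a candidate for the xs:date type.
since_dates = [date.fromisoformat(c.findtext("since")) for c in characters]

# The isbn attribute is all digits, but its leading zero is significant,
# so a pattern-restricted string may suit it better than an integer type.
isbn = book.get("isbn")
isbn_is_digits = isbn.isdigit()

# <friend-of> is optional and, where present, names another character.
friends = [f.text for c in characters for f in c.findall("friend-of")]
friends_resolve = all(f in names for f in friends)
```

If any of these checks failed on further sample documents, the corresponding note (and the data type chosen for the schema) would need revisiting.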



CMT3315 Lab 10 - XLink


Today's post answers questions related to XLink.

Wherever possible, lab questions were reproduced before providing the answer. The lab handout may be downloaded from here.

Quick Questions

1. One of the advantages claimed for the "extended links", that the W3C consortium intended to be part of the XLink language, was that the definition of a particular hyperlink could be located, not in the local resource (the document where the link starts), or the remote resource (the document where the link ends), but in a quite different "third party" document. Why might this be an advantage?

The ability to store extended links in a "third party" document allows relationships between resources to be created independently of the resources themselves. In other words, by storing links in a separate file (a linkbase) you eliminate the need to change the source or target resources, which could be expensive or even impossible if you are not the owner of the resources. In fact, the ability to link resources over which you have no control is one of the main advantages offered by XLink in general.


2. The XLink language provides an attribute for a hyperlink called show – it has several possible values. What is the effect of providing such a link with each of the following attribute values?

  • show="replace"
  • show="new"
  • show="embed"

The "show" XLink attribute defines where the link should be opened. The value "replace" instructs the application to open the ending resource in the same window in which the starting resource was loaded. The value "new" instructs the application to open the ending resource in a new window. The value "embed" instructs the application to load the presentation of the ending resource in place of the presentation of the starting resource. In practice, the ending resource is merged with the starting resource, similar to how the HTML "<img>" tag works.

Which of these three attribute values is the default?

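The XLink specification does not define a default for the "show" attribute. Applications that support simple links typically behave as if "replace" had been specified, which makes "replace" the default in practice.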


Longer Questions

The answers to the longer questions are given below. Please refer to the Lab 10a handout for the questions themselves.

1a. Using XLink-specific attributes, any element in an XML document can be set to act as a hyperlink. To use these attributes we must first declare the XLink namespace - "http://www.w3.org/1999/xlink" - in the XML document. XLink allows for two types of hyperlinks: simple links and extended links. Simple links are one-directional and very similar to hyperlinks in HTML, while extended links provide more functionality.

In our example, a specific piece of text - "this website" - found within the "<message>" element needs to be changed into a link pointing to "http://engineering.suite101.com/article.cfm/wind_power". If we were to add XLink attributes to the "<message>" element, the entire text within it would be changed into a hyperlink, which is not what we want. We must therefore create a new element to wrap the words "this website" and add the XLink attributes to this new element. We must also change the DTD to support this new element as well as the XLink attributes which we are about to add.

Listing 1 below shows the modified XML containing the hyperlink. The XLink namespace is declared at line 4 in the root element. Line 11 shows the new element called "<link>" wrapping the words "this website". The xlink:type="simple" attribute specifies that this element is a simple link, while the xlink:href attribute specifies the URL of the target document, or ending resource in XLink terminology.

 
  <?xml version="1.0"?>
  <!DOCTYPE memo SYSTEM "memo.dtd">
  <?xml-stylesheet href="stylesheet02.css" type="text/css"?>
  <memo xmlns:xlink="http://www.w3.org/1999/xlink">
   <heading>memo 1334</heading>
   <date>date: 11 November 09</date>
   <time>time: 09:30</time>
   <sender>from: The Managing Director</sender>
   <addressee>to: Heads of all Departments</addressee>
   <message>I think we should be making wind-turbines. Have a look at 
    <link xlink:type="simple" xlink:href="http://engineering.suite101.com/article.cfm/wind_power">this website</link>. 
    Tell me what you think. 
   </message>
  </memo>
 
 

Since we added new elements and attributes to our XML document, it will no longer validate against the original DTD. Listing 2 below shows the modified DTD that supports the changes made in the XML above. Line 3 defines the "xmlns:xlink" attribute of the memo element as fixed-value (the XLink namespace). Line 9 changes the "message" element definition to accept 0 or more parsed character data (#PCDATA) values and/or "link" elements. Line 10 defines the new "link" element and specifies that it should contain #PCDATA values. Lines 11 to 13 specify that every "link" element must have an "href" attribute and a "type" attribute, where the latter must be either set to "simple" or "extended". These changes will ensure that the XML document above successfully validates against the DTD.

  <?xml version= "1.0" encoding="UTF-8"?>
  <!ELEMENT memo (heading, date, time, sender, addressee, message)>
  <!ATTLIST memo xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
  <!ELEMENT heading (#PCDATA)>
  <!ELEMENT date (#PCDATA)>
  <!ELEMENT time (#PCDATA)>
  <!ELEMENT sender (#PCDATA)>
  <!ELEMENT addressee (#PCDATA)>
  <!ELEMENT message (#PCDATA|link)*>
  <!ELEMENT link (#PCDATA)>
  <!ATTLIST link 
  xlink:href CDATA #REQUIRED
  xlink:type (simple|extended)  #REQUIRED>
 

In this example, only two XLink attributes were used: "type" and "href", which are both required in order to have a working hyperlink. Other optional XLink attributes include "show", which defines where to open the link, and "actuate", which defines when to show the linked resource.

1b. Suppose that the heading of one of the sections in the target website is <A NAME="WE Elec Facts">Wind Energy Electricity Facts</A>, including the tags as shown. What changes would you have to make to the link in the managing director’s memo, to make the hyperlink finish at that point rather than at the wind_power document as a whole?

In order to make the link point directly to the specified point, all we need to do is add a "#" followed by the name of the anchor to the end of our target URL as follows:

  <link xlink:type="simple" xlink:href="http://engineering.suite101.com/article.cfm/wind_power#WE Elec Facts">this website</link>. 
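
Because the anchor name contains spaces, it may need to be percent-encoded before it is used as a URL fragment, since a literal space is not a legal character in a URL. A minimal Python sketch of that encoding, assuming the standard library's urllib.parse.quote (the variable names are illustrative only):

```python
from urllib.parse import quote

base = "http://engineering.suite101.com/article.cfm/wind_power"
anchor = "WE Elec Facts"

# quote() percent-encodes each space in the anchor name as %20.
href = base + "#" + quote(anchor)
print(href)
```

Whether the unencoded form also works depends on how forgiving the application following the link is, so the encoded form is the safer value for xlink:href.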
 

2. Here is another XML document:

  <?xml version="1.0"?>
  <!DOCTYPE memo SYSTEM "memo.dtd">
  <?xml-stylesheet href="stylesheet02.css" type="text/css"?>
  <memo xmlns:xlink="http://www.w3.org/1999/xlink">
   <heading>memo 1335</heading>
   <date>date: 11 November 09</date>
   <time>time: 09:45</time>
   <sender>from: The Managing Director</sender>
   <addressee>to: Heads of all Departments</addressee>
   <message>
      I think we should be making solar panels. Have a look at this website.  Tell me what you think. 
   </message>
  </memo>

 

At the point where the document says this website, there is supposed to be a hyperlink that takes the reader to a suitable website. Find one, and amend the document, so that the link is in fact there. Is it necessary to make any changes to the .dtd file, or can we use the file as you amended it before?

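Listing 3 below shows the modified XML document. The DTD as amended before already allows the "link" element and its XLink attributes, and the only thing that differs here is the link's target URL, so we do not need to make any further changes to the DTD.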

 
  <?xml version="1.0"?>
  <!DOCTYPE memo SYSTEM "memo.dtd">
  <?xml-stylesheet href="stylesheet02.css" type="text/css"?>
  <memo xmlns:xlink="http://www.w3.org/1999/xlink">
   <heading>memo 1335</heading>
   <date>date: 11 November 09</date>
   <time>time: 09:45</time>
   <sender>from: The Managing Director</sender>
   <addressee>to: Heads of all Departments</addressee>
   <message>I think we should be making solar panels. Have a look at 
    <link xlink:type="simple" xlink:href="http://vincent-lui.suite101.com/solar-paneling-for-the-home-a226122">this website</link>. 
    Tell me what you think. 
   </message>
  </memo>
 
 
Sunday 22 January 2012

CMT3315 Lab 09 - CSS & XPath

Today's post answers questions related to Cascading Style Sheets (CSS) and XPath.
You can download the complete lab questions from here.
Quick Questions

1. Suppose that a CSS file is used to determine how an XML document will appear when viewed in a browser. Suppose that the CSS file contains two rules, one dictating that a particular piece of text will appear in bold type, the other dictating that it will not. What will happen?

If two conflicting rules in a CSS file have selectors of equal specificity, the one that comes later in the CSS document takes precedence and is applied. For instance, consider the following XML and CSS files:

  <?xml version="1.0"?>
  <?xml-stylesheet type="text/css" href="lab09-css1.css"?>
  <book>
   <title>Rainbow Six</title>
   <author>Tom Clancy</author>
  </book> 
 
  /* Lab09-css1.css */
  book   {margin:10px; font-weight:normal;}
  title  {display:block; font-weight:bold; color: red;}
  author {display:block;}
  title  {font-weight:normal;}
 

Figure 1 shows the result of applying the stylesheet to the XML document. As specified in the first rule (line 3), the book title is red but is not bold because the second rule (line 5) which sets the font-weight to normal takes precedence.

Figure 1

2. An XML document contains the sentence "The grand old Duke of York, he had 10000 men." Would XPath be able to extract the piece of data "10000" from such a document?

Assuming that the text in question is in the following XML document:

  <?xml version="1.0"?>
  <book>
   <title>The Duke of York</title>
   <summary>The grand old Duke of York, he had 10000 men.</summary>
  </book> 
 
The following XPath query would extract "10000" from the "summary" element's text:
  substring(/book/summary/text(),36,5)
   
 

Longer Questions
The answers to the longer questions are given below. Please refer to the Lab 09 handout for the questions themselves.

1a. Adding the following line to the prolog of our XML document will associate "stylesheet01.css" to it:

  <?xml-stylesheet type="text/css" href="stylesheet01.css"?> 
 

1b. The listing for stylesheet01.css is given below. Figure 2 shows how the XML document appears in a browser after applying this stylesheet.

 /*stylesheet01.css*/

  chemElements {display: table;
                margin: 10px;
                font-family: Calibri, Arial, Helvetica, Sans-Serif;
                font-size: 10pt;
                border-collapse: collapse;
                color: #666666;}

  tableHead, element {display: table-row;}

  anum, name, symbol, mp, bp,
  anumHead, nameHead, symbolHead, mptHead, bptHead {
   display: table-cell; 
   border:solid 1px #494949; 
   padding:0px 4px 0px 4px;}

  tableHead {display: table-row;
             background-color: #555555;
             color: #B0AFAF;
             height: 30px;
             line-height: 30px;}

  element {display: table-row;} 

  element:nth-child(odd) {background-color:#F7F7F7;}

  name   {font-weight: bold; color: #d2580d}

  symbol {text-align:center}

  mp, bp {text-align:right}
 

Using the "display" CSS property we can render the XML document as an HTML-style table. The root element will be our table, so its display property is set to "table". The "tableHead" and "element" elements are our rows, so their display property is set to "table-row". The display property of the rest of the elements is set to "table-cell". The rest of the CSS is fairly straightforward, mainly setting rules for typeface, colours, borders and spacing. Of note is the "element:nth-child(odd)" rule, where the CSS3 ":nth-child" pseudo-class is used to specify that every odd "element" row should have a light grey background. This produces the alternating row effect seen in Figure 2 below.

Figure 2 - Stylesheet applied to chemElements2.xml

2. Provide XPath expressions which will do the following:

a) Select all the elements subordinate to the root node.

  /musicList/*
  
 

b) Select all track elements that have a total attribute with a value of 5.

  //tracks[@total = 5]  
  
 

c) Select all elements that contain the word "Penderecki" in their title.

  <!-- cd elements whose title contains "Penderecki"-->
  //cd[contains(title,"Penderecki")]
  
  <!-- title elements that contain "Penderecki" in their text -->
  //title[contains(.,"Penderecki")]
  
 

d) Select any elements that have titles with greater than 11 characters.

  <!-- cd elements whose title is longer than 11 characters-->
  //cd[string-length(title) > 11]
  
  <!-- title elements whose text is longer than 11 characters -->
  //title[string-length(.) > 11]
  
 

e) Select all the siblings of the first cd element

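  <!-- //cd[1]/* would select the children of the first cd element; its siblings are: -->
  //cd[1]/preceding-sibling::* | //cd[1]/following-sibling::*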
  
 
Saturday 21 January 2012

CMT3315 Lab 08 - XML & CSS

Today's post answers questions related to Cascading Style Sheets (CSS) and XML.
You can download the complete lab questions from here.
Quick Questions
Q1. You have a set of legal documents. Each has four sections: the title, the case, the background, and the judgement, in that order. Each has been made into an XML document by inserting a prolog and suitable tags. You want to write a CSS file that will display these documents using a suitable browser.
  1. Can you write the CSS file in such a way that it will display the title, then the judgement, then the background, then the case?
  2. Can you write the CSS file in such a way that it will display just the title, and the judgement?
  3. If the CSS file is called legalWrit.css, what processing instruction should you put in the prolog of the XML document(s)?

1a) Ideally, elements in an XML document should be listed in the order in which they are to be displayed since web browsers render the elements in sequence.  However, using CSS absolute positioning one could set the position of the elements as required.  For example, suppose the following are our XML and CSS files:

   <?xml version="1.0"?>
   <?xml-stylesheet type="text/css" href="legalWrit.css"?>
   <legaldoc>
      <title>Case Title</title>
      <case>Case Text</case>
      <background>Case Background</background>
      <judgement>Case Judgement</judgement>
   </legaldoc>

   title {color:red; display:block;}
   case  {color:black; display:block;}
   background {color:black; display:block;}
   judgement  {color:black; display:block;}

Figure 1 shows how the XML document would be displayed.

Figure 1
CSS cannot be used to change the order of elements within an XML document.  However there are ways to play around with how and where elements can be displayed.  For instance,  to change the display position of an element, we set the "position" CSS property to "absolute" and specify its top left coordinate as follows:
   
title      {color:red;   position: absolute; top: 0px;  left: 10px; display: block;}
case       {color:black; position: absolute; top: 60px; left: 10px; display: block;}
background {color:black; position: absolute; top: 40px; left: 10px; display: block;}
judgement  {color:black; position: absolute; top: 20px; left: 10px; display: block;}

Figure 2 shows the effect of using absolute positioning to re-arrange the display order of the XML elements.

Figure 2 - Using absolute positioning

It must be said however that although absolute positioning can be used in simple cases such as this,  it is nothing more than a work-around.   It is far more desirable to structure the document in the required sequence in the first place.

1b) To hide unwanted elements we set the value of their "display" property to "none":

title      {color:red;   position: absolute; top: 0px;  left: 10px; display: block;}
case       {color:black; position: absolute; top: 60px; left: 10px; display: none;}
background {color:black; position: absolute; top: 40px; left: 10px; display: none;}
judgement  {color:black; position: absolute; top: 20px; left: 10px; display: block;}

1c)  Adding the following line to the prolog of the XML document specifies that it should be styled using "legalWrit.css":

   <?xml-stylesheet type="text/css" href="legalWrit.css"?>


Q2. What's the difference between a URI and a URL?

The difference between URIs (Uniform Resource Identifiers) and URLs (Uniform Resource Locators) is very subtle, so much so that the terms are sometimes used interchangeably.  A URI is a string of characters that identifies an abstract or physical resource on the network.  URIs provide a method for identifying resources that is both simple and extensible.  In fact a URI can be a locator, a name or both.  URLs are the "locator" subset of URIs: they identify the location of a physical resource on the network and also define how that resource can be accessed.  This means that all URLs are URIs but not all URIs are URLs.  I remember coming across the following analogy on the Internet:  "All humans are mammals but not all mammals are human".  Similarly, URIs that are names are called URNs (Uniform Resource Names) and are designed to give a resource a universally unique name that is meant to persist even after that resource no longer exists.
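
The distinction can be seen by taking two URIs apart. The following is a small Python sketch using the standard library's urllib.parse.urlparse; the two example URIs are borrowed from elsewhere in these lab answers (the tram link from Lab 11 and the book's ISBN written as a URN), which is an illustrative choice rather than anything the lab prescribes.

```python
from urllib.parse import urlparse

url = urlparse("http://www.thetrams.co.uk/croydon")
urn = urlparse("urn:isbn:0836217462")

# A URL names an access scheme and a network location to contact.
print(url.scheme, url.netloc, url.path)

# A URN only names the resource; there is no host to contact.
print(urn.scheme, repr(urn.netloc), urn.path)
```

Both strings are valid URIs, but only the first says where the resource lives and how to fetch it.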


Q3. Why does the XML language allow namespaces?

Given that in XML you need to make up your own element names, chances are that element names might clash some time down the line.  In other words, you may end up using the same element name for different purposes or come across the same element name in an XML document written by someone else.  This is especially true for large complex documents, where the chances of having clashing element names become even greater.  Duplicate element names that mean different things will obviously create confusion when processing the XML document.

Namespaces were introduced specifically to resolve this issue.  They are collections of unique element names that logically belong together.  In XML, a namespace is identified by nothing more than a URI.  Choosing one that belongs to you ensures that the namespace is unique to you and, being a URI, it doesn't even have to point anywhere!  Let's say we wanted to add a namespace to the XML document specified in Question 1 above and choose "http://www.wzlawyers.com/legal" as our namespace URI.  We add the namespace declaration in the start tag of our root element:

   <?xml version="1.0"?>
   <?xml-stylesheet type="text/css" href="legalWrit.css"?>
   <lg:legaldoc xmlns:lg="http://www.wzlawyers.com/legal">
      <lg:title>Case Title</lg:title>
      <lg:case>Case Text</lg:case>
      <lg:background>Case Background</lg:background>
      <lg:judgement>Case Judgement</lg:judgement>
   </lg:legaldoc>

Line 3 specifies that the XML namespace prefix "lg" refers to our namespace URI.  This prefix is then used in each element name to create what is known as a qualified name (e.g. "lg:title").  Qualified names ensure that element names in our namespace will not clash with elements of the same name in other namespaces.
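
To see that it is the URI, and not the prefix, that makes a name unique, the document can be loaded with a namespace-aware parser. The following is a minimal Python sketch using the standard library's xml.etree.ElementTree on a shortened copy of the document; the "law" prefix used for the lookup is a made-up name chosen to show that the prefix itself is arbitrary.

```python
import xml.etree.ElementTree as ET

LEGAL_NS = "http://www.wzlawyers.com/legal"

doc = ET.fromstring(
    '<lg:legaldoc xmlns:lg="http://www.wzlawyers.com/legal">'
    "<lg:title>Case Title</lg:title>"
    "<lg:judgement>Case Judgement</lg:judgement>"
    "</lg:legaldoc>"
)

# ElementTree expands each qualified name to "{namespace URI}local name".
print(doc.tag)

# The prefix used for the lookup is local to this call; only the URI must match.
title = doc.find("law:title", {"law": LEGAL_NS})
print(title.text)
```

A "title" element from any other namespace would expand to a different "{...}title" name and so would not be matched by this lookup.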
  
Longer Questions
Please refer to the Lab handout to see the questions' full text.

Answer to Question 1

Using "display:block;" ensures that each element is rendered on a different line when the XML is viewed through the browser.


Answer to Question 2

2a) The following CSS will style the document as required. A "display:block" property was added to the chapter title and poem lines to make the text more readable.

/*stylesheet4.css*/

chapter {font-family: Palatino, "Times New Roman", Sans-Serif;
         font-size: 12pt; 
         background-color: #FCFBC4; 
         margin-left: 1em;}

chapterHead {font-size: 24pt; 
             font-weight: bold; 
             font-style: italic; 
             color: blue; 
             display: block;}

poem {font-style: italic;
      display:block;
      margin-left: 1em;}

line {display: block;}

The requirements specified that the chapter title and text should be indented from the left margin by 1 em, while the poem lines should be indented by 2 ems.  However, looking closely at the CSS above, you will notice that neither the "chapterHead" nor the "line" elements have a left margin specified. On the other hand, the "chapter" and "poem" elements each have a 1 em left margin.  The reason for this is simple.  Since the "chapter" element contains the rest of the elements, indenting the "chapter" element by 1 em automatically indents the rest.  Furthermore, since the "poem" element is already indented by 1 em through its parent, we only need to add a further 1 em to make it 2 ems indented from the left margin.

2b) Here's how the document looks when styled using the above stylesheet:

Figure 3 - XML Styled with stylesheet4.css

2c) The following stylesheet is a revised version of the one in 2a above.  Here, the "chapter" element is given a margin of 1 em on all sides and the text has been justified.  This will give the document a cleaner look, removing the jaggedness seen on the right hand side of the document (see Figure 3).  The chapter title has been given a bottom margin of 10 pixels and the poem has been centered on the page using "text-align: center" (its "auto" left and right margins would only have a visible effect if the element were also given a width).  Finally, "indexRef" elements have been given a green color to make them stand out from the rest of the document.

/* stylesheet4revised.css */
chapter {font-family: Palatino, "Times New Roman", Sans-Serif;
         font-size: 12pt; 
         background-color: #FCFBC4; 
         margin: 1em;
         text-align: justify;}

chapterHead {font-size: 24pt; 
             font-weight: bold; 
             font-style: italic; 
             color: blue; 
             display: block;
             margin-bottom:10px;}

poem {font-style: italic;
      display: block;
      margin-left:auto;
      margin-right: auto;
      margin-top:20px; 
      margin-bottom: 20px;
      text-align: center;}

line {display: block;}

indexRef {color: green}

Here's how the document looks when styled using "stylesheet4revised.css"

Figure 4 - XML styled with stylesheet4revised.css
Thursday 8 December 2011

CMT3315 Lab 07 - DTD 3

Quick Questions
Q1. People who prepare XML documents sometimes put part of the document in a CDATA section.

  • Why would they do that?
  • How is the CDATA section indicated?
  • If CDATA sections hadn't been invented, would there be any other way to achieve the same effect?

Sometimes, the contents of an XML document might have characters which have a special meaning in XML, such as "<", ">" and "&".  When an XML document is being parsed, text between XML tags is also parsed, so including such characters 'as is' will break your XML document.  The XML parser will interpret them as XML syntax when in fact they are only part of the text and should be left alone.  The CDATA section solves this problem by marking a section of text as unparsed character data, which the parser passes through without interpreting it as markup.

A CDATA section is indicated by placing text between the "<![CDATA[" start delimiter and the "]]>" end delimiter.

Instead of using CDATA, one could also replace the "<", ">" and "&" characters with "&lt;", "&gt;" and "&amp;" respectively to achieve the same effect, a technique known as "escaping".  This option however is more laborious and makes the XML harder to read (by humans).
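
The equivalence of the two approaches can be demonstrated with a short Python sketch that uses only the standard library. The sample text and the "code" element name are made up for illustration; escape() is the helper from xml.sax.saxutils, which replaces "&", "<" and ">" with their entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Text containing all three special characters.
raw = "if (a < b && b > c) return;"

# Option 1: wrap the text in a CDATA section.
with_cdata = "<code><![CDATA[" + raw + "]]></code>"

# Option 2: escape the special characters individually.
with_escapes = "<code>" + escape(raw) + "</code>"
print(with_escapes)

# After parsing, both documents carry exactly the same text.
text_from_cdata = ET.fromstring(with_cdata).text
text_from_escapes = ET.fromstring(with_escapes).text
```

The printed, escaped form also shows why CDATA is usually easier on human readers when a passage contains many such characters.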


Q2. What is a parser and what does it have to do with validity?
A parser is the software component that reads an XML document and makes its contents available to an application.  XML parsers can be classified as either "validating" or "non-validating" depending on the checks they perform on an XML document.  Non-validating parsers simply check the XML syntax to determine whether or not the XML document is well formed.  Validating parsers on the other hand go one step further by checking the validity of the XML document against a schema (such as a DTD).  Validating parsers ensure that the XML document is both well formed and valid.


Q3. You write a .dtd file to accompany a class of XML documents.  You want one of the elements, with the tag <trinity>, to appear exactly three times within the document element of every document in this class.  Is it possible for the .dtd file to specify this?
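Not directly.  The occurrence indicators available in a DTD can only specify that an element appears
  • exactly once
  • zero or one times
  • zero or many times
  • one or many times
so there is no way to write "exactly n times" as a count.  The same effect can however be achieved by listing the element three times in the content model of the document element, for example "(trinity, trinity, trinity)", provided the three occurrences sit at a fixed position in the sequence of child elements.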

Longer Question
This question is a continuation of last post's "long" question number 2, where we are given the contents of chapter 2 of the book "Toba: The worst volcanic eruption ever".  We were required to:
  • Write a suitable prolog for the document (chapter 2);
  • Modify the book's .dtd file to cater for the new tags introduced in this document;
  • Put suitable tags into the document, to identify special pieces of text such as a poem or words that would feature in the book's index. 
  • Add pictures to the document at appropriate places.

A suitable prolog for this document would be the following:
        <?xml version="1.0" encoding="UTF-8"?>
        <!-- Chapter 2: Volcanic Winter-->
        <!DOCTYPE text SYSTEM "Lab07_book.dtd">
    

The DOCTYPE declaration specifies that the .dtd file used to validate this document is "Lab07_book.dtd" (the same used by the entire book), identifying the "text" element as the document (root) element.
The "text" element from last post's .dtd file is modified to include the new elements being introduced this week.  Furthermore, new entities pointing to the images that will be added to the document are also declared.  The resulting dtd - Lab07_book.dtd - is as follows:
        <?xml version="1.0" encoding="UTF-8"?>
        <!NOTATION jpg PUBLIC "image/jpeg">
        <!ENTITY Sumbawa SYSTEM "sumbawa.jpg" NDATA jpg>
        <!ENTITY LakeGeneva SYSTEM "Geneva1816.jpg" NDATA jpg>
        <!ENTITY MaryShelley SYSTEM "MaryShelley.jpg" NDATA jpg>
        <!ENTITY pub "STC Press, Malta">
        <!ENTITY chap1 SYSTEM "chap1.xml">
        <!ENTITY chap2 SYSTEM "chap2.xml">
        <!ENTITY chap3 SYSTEM "chap3.xml">
        <!ELEMENT book (titlePage, titlePageVerso, contents, chapter+)>
        <!ELEMENT titlePage (bookTitle, author+, publisher)>
        <!ELEMENT bookTitle (#PCDATA)>
        <!ELEMENT author (#PCDATA)>
        <!ELEMENT publisher (#PCDATA)>
        <!ELEMENT titlePageVerso (copyright, publishedBy, ISBN)>
        <!ELEMENT copyright (#PCDATA)>
        <!ELEMENT publishedBy (#PCDATA)>
        <!ELEMENT ISBN (#PCDATA)>
        <!ELEMENT contents (chapterName+)>
        <!ELEMENT chapterName (#PCDATA)>
        <!ATTLIST chapterName number CDATA #REQUIRED>
        <!ELEMENT chapter (text)>
        <!ATTLIST chapter number CDATA #REQUIRED name CDATA #REQUIRED>
        <!ELEMENT text (paragraph+)>
        <!ELEMENT paragraph (#PCDATA|image|indexEntry|poem)*>
        <!ELEMENT image EMPTY>
        <!ATTLIST image source ENTITY #REQUIRED caption CDATA #REQUIRED>
        <!ELEMENT indexEntry (#PCDATA)>
        <!ELEMENT poem (verse+)>
        <!ELEMENT verse (#PCDATA)>
    

Line 24 specifies that the "text" element must contain one or more "paragraph" elements.  A "paragraph" element can contain a mix of (0 or many of each) parsed character data, "image", "indexEntry" and "poem" elements.
The "image" element is an empty element having two attributes, one for the name of the entity pointing to the corresponding picture and another for the picture's caption (Lines 26 & 27).

The "poem" element must contain one or more "verse" elements (Line 29) containing parsed character data.  
The resultant XML file for chapter 2 of the book (chap2.xml) is the following:
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE text SYSTEM "Lab07_book.dtd">
        <text>
           <paragraph>
              A volcanic winter is very bad news.  The worst eruption in recorded history happened at <indexEntry>Mount Tambora</indexEntry> in 1815. It killed about 71 000 people locally, mainly because the <indexEntry>pyroclastic flows</indexEntry> killed everyone on the island of <indexEntry>Sumbawa</indexEntry> and the tsunamis drowned the neighbouring islands, but also because the ash blanketed many other islands and killed the vegetation.<image source="Sumbawa" caption="Sumbawa, after the volcanic eruption"/> It also put about 160 cubic kilometres of dust and ash, and about 150 million tons of sulphuric acid mist, into the sky, which started a volcanic winter throughout the northern hemisphere. The next year was <indexEntry>the year without a summer</indexEntry>. No spring, no summer – it stayed dark and cold all the year round. This had its upside. In due course, all that ash and mist in the upper atmosphere made for some lovely sunsets, and Turner was inspired to paint this.
           </paragraph>
           <paragraph>
              <image source="LakeGeneva" caption="Lake Geneva, during the summer of 1816"/>
              <indexEntry>The Lakeland poets took a holiday at Lake Geneva</indexEntry>, and the weather was so horrible that Lord Byron was inspired to write this.
              <poem>
                 <verse>The bright sun was extinguish'd, and the stars</verse>
                 <verse>Did wander darkling in the eternal space,</verse>
                 <verse>Rayless, and pathless, and the icy earth</verse>
                 <verse>Swung blind and blackening in the moonless air; </verse>
                 <verse>Morn came and went – and came, and brought no day.</verse>
              </poem>
           </paragraph>
           <paragraph>
              <image source="MaryShelley" caption="Mary Shelley, author of Frankenstein"/>
              Mary Shelley was inspired to write Frankenstein. The downside was that there were <indexEntry>famines</indexEntry> throughout Europe, India, China and North America, and perhaps 200 000 people died of starvation in Europe alone.
           </paragraph>
        </text>
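
As a rough illustration of what the mixed content model means for software reading the file, here is a minimal Python sketch using the standard library's ElementTree. The SAMPLE string is a shortened, hypothetical excerpt rather than the full chap2.xml, and the helper names index_entries and plain_text are made up for this example. Note that ElementTree does not validate against the DTD; it only shows how text and child elements interleave.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical excerpt of a "paragraph" with mixed content.
SAMPLE = (
    "<paragraph>The worst eruption happened at "
    "<indexEntry>Mount Tambora</indexEntry> in 1815."
    '<image source="Sumbawa" caption="Sumbawa, after the volcanic eruption"/>'
    "</paragraph>"
)


def index_entries(paragraph_xml):
    """Return the text of every indexEntry child, in document order."""
    paragraph = ET.fromstring(paragraph_xml)
    return [entry.text for entry in paragraph.findall("indexEntry")]


def plain_text(paragraph_xml):
    """Return the paragraph's character data with all markup stripped."""
    paragraph = ET.fromstring(paragraph_xml)
    return "".join(paragraph.itertext())


if __name__ == "__main__":
    print(index_entries(SAMPLE))
    print(plain_text(SAMPLE))
```

The character data before, between and after the child elements is all part of the paragraph, which is why the DTD has to declare "paragraph" with the (#PCDATA|...)* form.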
    

Saturday 26 November 2011

CMT3315 Lab 06 - Character Encoding

 

Today's post deals with character encoding and how it can be specified in XML documents.

Character Encoding

Character encoding is the process of converting characters into another form that facilitates their storage or their transmission over a telecommunications network. Early examples of character encoding are Morse code, which converts characters into a series of long and short presses of a telegraph key, and Baudot code, a precursor to ASCII. The ASCII (American Standard Code for Information Interchange) character set was developed in the 1960s and uses 7 bits to represent each of its 128 characters, which cover the unaccented English alphabet, digits, punctuation and control codes. Various 8-bit "extended ASCII" encodings later used the spare eighth bit of each byte to add a further 128 characters, bringing the total to 256 and making room for the accented letters of many other European languages as well as simple mathematical symbols. ASCII remained the most widely used character encoding until the late 2000s and hence, much of the software in use today is designed to process ASCII documents.

Although popular, ASCII has its limitations. For starters, it cannot encode documents written in non-European alphabets, and it lacks many technical, scientific, cultural and artistic symbols. ISO 8859 provides one solution to this problem by defining a number of different 8-bit character sets, allowing software to switch among the sets according to what is needed. An even better solution is to have one, much larger character set that includes as many characters and symbols from as many writing systems as possible. One such character set is Unicode, which currently includes more than 107,000 characters and symbols from over 90 scripts.
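
To make the difference between these character sets concrete, here is a small Python sketch. It assumes nothing beyond Python's built-in codecs; the helper name encode_char and the choice of the Arabic letter meem as the test character are mine, purely for illustration.

```python
def encode_char(char, encoding):
    """Return the bytes for char in the given encoding, or None if it cannot be encoded."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


MEEM = "\u0645"  # ARABIC LETTER MEEM, chosen as an arbitrary non-Latin character

if __name__ == "__main__":
    # ASCII and ISO-8859-1 (Western European) have no slot for the letter;
    # ISO-8859-6 (Arabic) uses one byte; UTF-8 uses two bytes.
    for encoding in ("ascii", "iso-8859-1", "iso-8859-6", "utf-8"):
        print(encoding, encode_char(MEEM, encoding))

    # The same byte value stands for different characters in different
    # ISO 8859 parts, which is why software has to know which set is in use.
    print(b"\xe5".decode("iso-8859-1"), b"\xe5".decode("iso-8859-6"))
```

This is the switching problem in miniature: the byte 0xE5 is only meaningful once you know which ISO 8859 part applies, whereas a Unicode encoding gives every character its own unambiguous representation.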

Character Sets in XML

In XML, you can specify the character encoding by modifying the XML declaration in the document prolog:


<?xml version="1.0" encoding="UTF-8"?>

Here, the XML declaration specifies that the XML document uses the Unicode UTF-8 character encoding. In fact, UTF-8 is the default: if no encoding is specified in the XML declaration and the file does not begin with a UTF-16 byte order mark, the parser will assume UTF-8.

The following are this week's questions.

Quick Questions
Q1. What exactly does a DTD do in XML?
A DTD (Document Type Definition) defines the structure that a particular type of XML document should take.  It dictates the elements that may be present in the XML document, their attributes, and the order in which the elements should appear.


Q2. You've written an XML document, with the XML declaration '<?xml version="1.0"?>' at the start.  You realise that the text contains some Arabic characters.  Which of the following should you do:
  • change the XML declaration to '<?xml version="1.0" encoding="ISO-8859-6"?>'
  • change the XML declaration to '<?xml version="1.0" encoding="UTF-8"?>'
  • do nothing: the declaration is fine as it is.

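Although either of the first two options would also work, provided the file is actually saved in the declared encoding, the current declaration is fine as it is.  This is because XML parsers assume that the document encoding is UTF-8 if none is specified, and UTF-8 can represent all the Arabic characters.  The one condition is that the file really is saved as UTF-8.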
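
This behaviour can be checked with a short Python sketch using the standard library's ElementTree parser. The element name "greeting", the sample Arabic word and the helper names are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

ARABIC = "\u0645\u0631\u062d\u0628\u0627"  # the word "marhaban" (hello)

# Note: the declaration deliberately carries no encoding attribute.
DOCUMENT = '<?xml version="1.0"?><greeting>' + ARABIC + "</greeting>"


def parse_text(raw_bytes):
    """Parse raw document bytes and return the root element's text."""
    return ET.fromstring(raw_bytes).text


def is_parseable(raw_bytes):
    """Return True if the parser accepts the bytes as they stand."""
    try:
        ET.fromstring(raw_bytes)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    # Saved as UTF-8: the parser's default assumption is correct.
    print(parse_text(DOCUMENT.encode("utf-8")) == ARABIC)
    # Saved as ISO-8859-6 but still undeclared: the bytes are not valid UTF-8.
    print(is_parseable(DOCUMENT.encode("iso-8859-6")))
```

The second case is the one to watch out for: leaving the declaration alone is only safe when the editor really writes the file as UTF-8.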


Q3. Can you use a binary graphics file in an XML document?
Yes.  To do so you must declare the image as an external entity, mark it as unparsed data and declare a notation for the image format.  You can then assign this entity to an attribute of an empty element.  Here's an example:
<?xml version="1.0" encoding="utf-8"?>
        <!DOCTYPE koala [
        <!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
        <!NOTATION gif PUBLIC "image/gif">
        <!ELEMENT koala (image)>
        <!ELEMENT image EMPTY>
        <!ATTLIST image source ENTITY #REQUIRED>
        ]>
        <koala>
        <image source="koalaimage"/>
        </koala>
    

The entity declaration at line 3 defines an entity called "koalaimage" that points to an external file "koala.gif".  The entity is also marked as unparsed data using the keyword 'NDATA', which is followed by the "gif" notation name; that notation is declared at line 4 using the "NOTATION" keyword.  Finally, the "source" attribute of the empty "image" element is set to the name of our image entity, i.e. "koalaimage".
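
To see what a parser actually reports for these declarations, here is a hedged Python sketch using the standard library's xml.parsers.expat module. The parser never opens "koala.gif"; it only hands the entity name, file name and notation to the application, which then decides what to do with the image. The function name collect_declarations and its return shape are my own choices for this illustration.

```python
import xml.parsers.expat

DOCUMENT = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE koala [
<!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
<!NOTATION gif PUBLIC "image/gif">
<!ELEMENT koala (image)>
<!ELEMENT image EMPTY>
<!ATTLIST image source ENTITY #REQUIRED>
]>
<koala>
<image source="koalaimage"/>
</koala>
"""


def collect_declarations(document):
    """Return (entities, notations, image sources) reported while parsing."""
    entities, notations, sources = {}, {}, []
    parser = xml.parsers.expat.ParserCreate()

    def on_entity(name, is_parameter_entity, value, base,
                  system_id, public_id, notation_name):
        # For an unparsed entity, value is None and notation_name is set.
        entities[name] = (system_id, notation_name)

    def on_notation(name, base, system_id, public_id):
        notations[name] = public_id

    def on_start(name, attrs):
        if name == "image":
            sources.append(attrs["source"])

    parser.EntityDeclHandler = on_entity
    parser.NotationDeclHandler = on_notation
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return entities, notations, sources


if __name__ == "__main__":
    print(collect_declarations(DOCUMENT))
```

An application would typically look up the attribute value in the entity table to find the file, then use the notation to decide how to render it.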

Longer Questions
Q1. For this question we were required to produce an XML document and accompanying DTD file for a book entitled "Toba: the worst volcanic eruption of all".  The first three chapters of the book are written as separate XML files, where the text of each is placed between "<text>" and "</text>" tags.  
From the specification given, the structure of this book's XML document can be described with the following diagram:

Given this information a suitable DTD for this specification would be as follows:

<?xml version="1.0" encoding="UTF-8"?>
        <!ENTITY pub "STC Press, Malta">
        <!ENTITY chap1 SYSTEM "chap1.xml">
        <!ENTITY chap2 SYSTEM "chap2.xml">
        <!ENTITY chap3 SYSTEM "chap3.xml">
        <!ELEMENT book (titlePage, titlePageVerso, contents, chapter+)>
        <!ELEMENT titlePage (bookTitle, author+, publisher)>
        <!ELEMENT bookTitle (#PCDATA)>
        <!ELEMENT author (#PCDATA)>
        <!ELEMENT publisher (#PCDATA)>
        <!ELEMENT titlePageVerso (copyright, publishedBy, ISBN)>
        <!ELEMENT copyright (#PCDATA)>
        <!ELEMENT publishedBy (#PCDATA)>
        <!ELEMENT ISBN (#PCDATA)>
        <!ELEMENT contents (chapterName+)>
        <!ELEMENT chapterName (#PCDATA)>
        <!ATTLIST chapterName number CDATA #REQUIRED>
        <!ELEMENT chapter (text)>
        <!ATTLIST chapter number CDATA #REQUIRED name CDATA #REQUIRED>
        <!ELEMENT text (#PCDATA)>
    

From the details given, note that the publisher name appears more than once in the XML document: in the title page and also in the title page verso.  This makes the publisher name an ideal candidate for an entity, as declared in line 2 of the DTD above.  Furthermore, since the chapters of the book are each stored as a separate XML file, an entity was declared for each chapter (lines 3 to 5), each pointing to the corresponding external file.  The rest of the DTD is quite straightforward, declaring the remaining elements and attributes.

Here's what the book's XML document looks like:
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE book SYSTEM "Lab06_book.dtd">
        <book>
            <titlePage>
                <bookTitle>Toba: the worst volcanic eruption of all</bookTitle>
                <author>John</author>
                <author>Jack</author>
                <author>Jill</author>
                <author>Joe</author>
                <publisher>&pub;</publisher>
            </titlePage>
            <titlePageVerso>
                <copyright>Copyright 2010 STC Press</copyright>
                <publishedBy>&pub;</publishedBy>
                <ISBN>978-0-596-52722-0</ISBN>
            </titlePageVerso>
            <contents>
                <chapterName number="1">The Mystery of Lake Toba's origins</chapterName>
                <chapterName number="2">Volcanic Winter</chapterName>
                <chapterName number="3">What Toba did to the human race</chapterName>
            </contents>
            <chapter number="1" name="The Mystery of Lake Toba's origins">&chap1;</chapter>
            <chapter number="2" name="Volcanic Winter">&chap2;</chapter>
            <chapter number="3" name="What Toba did to the human race">&chap3;</chapter>
        </book>
    

The XML document references the DTD file explained earlier and makes use of the entities declared within that DTD to refer to the publisher name (lines 10 and 14) and the three book chapters (lines 22 to 24).
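
The effect of the "pub" entity can be sketched in Python with the standard library's ElementTree. This is an illustration under assumptions: the entity is moved into an internal DTD subset so the snippet is self-contained, the document is cut down to the two elements that use it, and the helper name publisher_fields is made up. ElementTree does not fetch external entities such as chap1.xml, so the chapter entities are left out.

```python
import xml.etree.ElementTree as ET

# Cut-down document with the entity declared in an internal subset
# (an assumption for this sketch; the lab solution keeps it in the .dtd file).
DOCUMENT = """<!DOCTYPE book [
  <!ENTITY pub "STC Press, Malta">
]>
<book>
  <titlePage><publisher>&pub;</publisher></titlePage>
  <titlePageVerso><publishedBy>&pub;</publishedBy></titlePageVerso>
</book>"""


def publisher_fields(document):
    """Return the expanded text of the two elements that reference &pub;."""
    root = ET.fromstring(document)
    return (
        root.findtext("titlePage/publisher"),
        root.findtext("titlePageVerso/publishedBy"),
    )


if __name__ == "__main__":
    print(publisher_fields(DOCUMENT))
```

Both elements receive the same replacement text, so changing the publisher name means editing one declaration instead of every occurrence.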

Saturday 12 November 2011

CMT3315 Lab 05 - XML Well-formedness & DTDs

 

The last post covered the basics of XML syntax and document type definitions. This week's post is a continuation, answering some more questions related to XML well-formedness and DTDs. Where possible, lab questions were reproduced before providing the answer.

Quick Questions
Q1. <:-/> This is a smiley.  Is it also a well-formed XML document?  Say why.

From a structural point of view, a well-formed XML document must contain exactly one root element, also known as the document element.  So in this case, the XML document is structurally well formed, as the smiley is our root element.  It is also a properly closed empty element, denoted by the "/>" at the end, and the name of our element is therefore ":-".  According to the W3C Recommendation "Extensible Markup Language (XML) 1.0 (Fifth Edition)", element names can start with a colon ":" and can contain hyphens "-", so technically the element name is also well formed.  However, it is generally considered good practice to avoid using the colon character, as it is reserved for use with namespaces; a namespace-aware parser would reject this name.
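
This can be tried out with Python's xml.parsers.expat module. The sketch below assumes the parser is created without a namespace separator, which switches off namespace processing; that choice matters here, because the colon is only an ordinary name character when namespaces are not in play. The helper name element_names is made up for the example.

```python
import xml.parsers.expat


def element_names(document):
    """Return the names of all start tags, parsed without namespace processing."""
    names = []
    # No namespace_separator argument, so ":" is treated as a plain name character.
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(document, True)
    return names


if __name__ == "__main__":
    print(element_names("<:-/>"))
```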


Q2. What is the difference between well-formed and valid XML?

A well formed XML document is one which is syntactically correct, i.e. it follows proper XML syntax as defined in the XML 1.0 Fifth Edition W3C Recommendation.  On the other hand, a well formed XML document is not necessarily valid.  In addition to being well formed, an XML document must also follow rules set out in a Document Type Definition (DTD) or XML Schema for it to be valid.
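
The distinction can be illustrated with a minimal Python sketch. It uses the standard library's ElementTree, which checks well-formedness only; it does not validate against a DTD or schema, so a validating parser would be needed for the second half of the definition. The helper name is_well_formed and the sample strings are assumptions for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the document obeys XML syntax rules (no validity check)."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed("<a><b/></a>"))   # properly nested
    print(is_well_formed("<a><b></a>"))    # <b> is never closed
    print(is_well_formed("<a/><b/>"))      # two root elements
    # Well formed, but a DTD requiring engPhrase and translitPhrase inside
    # phraseGroup would still reject it as invalid.
    print(is_well_formed("<phraseGroup><targLangPhrase/></phraseGroup>"))
```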

Longer Questions
Q1. For this question, we were required to write a Document Type Definition (DTD) for an XML specification to store information about college textbooks.  The given specification can be described with the following diagram:

Additionally, a chapter is identified by a chapter number and a chapter title and a section is identified by a section number and a section title.  Finally the publisher name will always be "Excellent Books Ltd" and their address will always be "21, Cemetery Lane, SE1 1AA, UK".

Given this information, a suitable DTD  for this specification would be as follows:

<?xml version="1.0" encoding="utf-8"?>
      <!ENTITY pubName "Excellent Books Ltd">
      <!ENTITY pubAddress "21, Cemetery Lane, SE1 1AA, UK">
      <!ELEMENT textbook (titlePage, titlePageVerso, chapter+)>
      <!ELEMENT titlePage (title, author, publisher, aphorism?)>
      <!ELEMENT titlePageVerso (publisherAddress, copyrightNotice, ISBN, dedication*)>
      <!ELEMENT chapter (section+)>
      <!ELEMENT section (bodyText+)>
      <!ATTLIST chapter chapterNo CDATA #REQUIRED chapterTitle CDATA #REQUIRED>
      <!ATTLIST section sectionNo CDATA #REQUIRED sectionTitle CDATA #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
      <!ELEMENT aphorism (#PCDATA)>
      <!ELEMENT publisherAddress (#PCDATA)>
      <!ELEMENT copyrightNotice (#PCDATA)>
      <!ELEMENT ISBN (#PCDATA)>
      <!ELEMENT dedication (#PCDATA)>
      <!ELEMENT bodyText (#PCDATA)>
   


Q2. Write an XML document that contains the following information: the name of a London tourist attraction. The name of the district it is in. The type of attraction it is (official building, art gallery, park etc). Whether it is in-doors or out-doors. The year it was built or founded [Feel free to make this up if you don’t know]. Choose appropriate tags. Use attributes for the type of attraction and in-doors or out-doors status.


<?xml version="1.0" encoding="utf-8"?>
<attraction type="Park" indoors="N">
  <name>Hyde Park</name>
  <district>West London</district>
  <yearFounded>1600</yearFounded>
</attraction>
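
For what it's worth, here is a small Python sketch of how an application might read this document back with the standard library's ElementTree. The summarise function is hypothetical, and so is the assumption that indoors="Y" would mark an indoor attraction, since the answer above only shows the "N" case.

```python
import xml.etree.ElementTree as ET

ATTRACTION = """<attraction type="Park" indoors="N">
  <name>Hyde Park</name>
  <district>West London</district>
  <yearFounded>1600</yearFounded>
</attraction>"""


def summarise(document):
    """Collect the attraction's attributes and child elements into a dict."""
    root = ET.fromstring(document)
    return {
        "name": root.findtext("name"),
        "district": root.findtext("district"),
        "type": root.get("type"),               # attribute, as the question asks
        "indoors": root.get("indoors") == "Y",  # assumed Y/N convention
        "year": int(root.findtext("yearFounded")),
    }


if __name__ == "__main__":
    print(summarise(ATTRACTION))
```

Attributes suit the type and in-doors status because they are short, fixed-vocabulary properties of the attraction rather than content in their own right.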


Q3.  This multi-part question is based on an XML document which can be described with the following diagram:


Here's a snippet taken from this XML document:

<?xml version="1.0" encoding="utf-8"?>
<phraseBook targLang="Russian">
  <section>
    <sectionTitle>Greetings</sectionTitle>
    <phraseGroup>
      <engPhrase>Hi! </engPhrase>
      <translitPhrase>privEt </translitPhrase>
      <targLangPhrase>Привет!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
      <engPhrase>Good morning!</engPhrase>
      <translitPhrase>dObraye Utra</translitPhrase>
      <targLangPhrase>Доброе утро!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
...

a) It’s clear that the XML document is concerned with English phrases and their Russian translations. One of the start tags is <targLangPhrase> with </targLangPhrase> as its end tag. Why do you suppose this isn’t <russianPhrase> with </russianPhrase> ?

The structure of the document suggests that it could very well be used for translating English phrases into other languages and not just Russian.  The target language is already recorded in the "targLang" attribute of the root element, and it would not make much sense to name the element "<russianPhrase>" rather than "<targLangPhrase>" if the document was in fact translating English phrases into, say, Italian.

b) Write a suitable prolog for this document
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE phraseBook SYSTEM "phraseBook.dtd">

c) Write a .dtd file to act as the Document Type Definition for this document

A suitable DTD would be as follows:
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT phraseBook (section+)>
<!ATTLIST phraseBook targLang CDATA #REQUIRED> 
<!ELEMENT section (sectionTitle, phraseGroup+)>
<!ELEMENT sectionTitle (#PCDATA)>
<!ELEMENT phraseGroup (engPhrase, translitPhrase, targLangPhrase)>
<!ELEMENT engPhrase (#PCDATA|gloss)*>
<!ELEMENT translitPhrase (#PCDATA|gloss)*>
<!ELEMENT targLangPhrase (#PCDATA)>
<!ELEMENT gloss (#PCDATA)>

d) The application that is to use this document runs on a Unix system, and was written some years ago.  Is that likely to make any difference to the XML declaration?

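Character encoding is the aspect most likely to need consideration.  UTF-8 is backward compatible with ASCII, so the markup and the English phrases would look the same to an older system that only supports the ASCII character set.  The Russian phrases, however, are encoded as multi-byte sequences, so the application itself must be able to handle UTF-8; if it cannot, the document would have to be saved in an encoding the application does support and the XML declaration changed to match.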
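
The compatibility point can be checked with a few lines of Python. The helper name utf8_matches_ascii is made up for this sketch; the sample strings are taken from the phrase book snippet above.

```python
def utf8_matches_ascii(text):
    """Return True if the text has identical bytes in ASCII and UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains characters that ASCII cannot represent at all.
        return False


if __name__ == "__main__":
    print(utf8_matches_ascii("Hi! "))                # English phrase
    print(utf8_matches_ascii("<engPhrase>"))         # markup
    print(utf8_matches_ascii("Привет!"))             # Russian phrase
    print(len("Привет!"), len("Привет!".encode("utf-8")))  # characters vs bytes
```

Each Cyrillic letter takes two bytes in UTF-8, which is exactly the part an ASCII-only application would misread.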