
Saturday 12 November 2011

CMT3315 Lab 05 - XML Well-formedness & DTDs

 

The last post covered the basics of XML syntax and document type definitions. This week's post is a continuation, answering some more questions related to XML well-formedness and DTDs. Where possible, the lab questions are reproduced before the answer.

Quick Questions
Q1. <:-/> This is a smiley.  Is it also a well-formed XML document?  Say why.

From a structural point of view, an XML document must contain exactly one root (or document) element to be well-formed.  In this case the smiley is the root element, so the document is structurally well-formed.  The "/>" at the end marks it as a properly closed empty element, which makes the name of our element ":-".  According to the W3C Recommendation "Extensible Markup Language (XML) 1.0 (Fifth Edition)", a name can start with a colon ":" and can contain hyphens "-", so the element name is legal too.  However, it is generally considered good practice to avoid the colon character in names, as it is reserved for use with namespaces.
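To make this concrete, here is the smiley written out as a complete file. The XML declaration is my own addition (it is optional in XML 1.0), and the comment is only there for explanation. Note that this only holds for plain XML 1.0: under the separate "Namespaces in XML" recommendation a colon has to be preceded by a non-empty prefix, so a namespace-aware parser may well reject the name.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- The root element is named ":-" and is written as an empty-element tag. -->
<:-/>
```

Writing the element as a start tag immediately followed by an end tag, `<:-></:->`, would be equivalent.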


Q2. What is the difference between well-formed and valid XML?

A well-formed XML document is one which is syntactically correct, i.e. it follows the XML syntax rules defined in the XML 1.0 Fifth Edition W3C Recommendation.  A well-formed document is not necessarily valid, though.  To be valid, an XML document must be well-formed and must also follow the rules set out in a Document Type Definition (DTD) or XML Schema.
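A small example may help to show the difference. The document below is a sketch of my own (the element names note, to and body are made up for illustration) with the DTD placed in the internal subset. It is well-formed, but it is not valid against its own DTD.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<!-- Well-formed: a single root element, every tag closed and properly nested. -->
<!-- Not valid: the DTD requires a body element after to, and it is missing. -->
<note>
  <to>Alice</to>
</note>
```

A non-validating parser would accept this file without complaint, while a validating parser should report the missing body element.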

Longer Questions
Q1. For this question, we were required to write a Document Type Definition (DTD) for an XML specification to store information about college textbooks.  The given specification can be described with the following diagram:

Additionally, a chapter is identified by a chapter number and a chapter title, and a section is identified by a section number and a section title.  Finally, the publisher name will always be "Excellent Books Ltd" and their address will always be "21, Cemetery Lane, SE1 1AA, UK".

Given this information, a suitable DTD  for this specification would be as follows:

<?xml version="1.0" encoding="utf-8"?>
<!ENTITY pubName "Excellent Books Ltd">
<!ENTITY pubAddress "21, Cemetery Lane, SE1 1AA, UK">
<!ELEMENT textbook (titlePage, titlePageVerso, chapter+)>
<!ELEMENT titlePage (title, author, publisher, aphorism?)>
<!ELEMENT titlePageVerso (publisherAddress, copyrightNotice, ISBN, dedication*)>
<!ELEMENT chapter (section+)>
<!ELEMENT section (bodyText+)>
<!ATTLIST chapter chapterNo CDATA #REQUIRED chapterTitle CDATA #REQUIRED>
<!ATTLIST section sectionNo CDATA #REQUIRED sectionTitle CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT aphorism (#PCDATA)>
<!ELEMENT publisherAddress (#PCDATA)>
<!ELEMENT copyrightNotice (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT dedication (#PCDATA)>
<!ELEMENT bodyText (#PCDATA)>
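To see how the two entities are meant to be used, here is a sketch of a document that should validate against this DTD. It assumes the DTD is saved next to the document as "textbook.dtd" (a file name I picked for the example), and the book title, author, ISBN and body text are all placeholders.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE textbook SYSTEM "textbook.dtd">
<textbook>
  <titlePage>
    <title>Introduction to XML</title>
    <author>A. N. Author</author>
    <!-- Expands to "Excellent Books Ltd" -->
    <publisher>&pubName;</publisher>
  </titlePage>
  <titlePageVerso>
    <!-- Expands to "21, Cemetery Lane, SE1 1AA, UK" -->
    <publisherAddress>&pubAddress;</publisherAddress>
    <copyrightNotice>Copyright 2011 Excellent Books Ltd</copyrightNotice>
    <ISBN>000-0-00-000000-0</ISBN>
  </titlePageVerso>
  <chapter chapterNo="1" chapterTitle="Getting Started">
    <section sectionNo="1.1" sectionTitle="What is XML?">
      <bodyText>XML is a markup language for structured data.</bodyText>
    </section>
  </chapter>
</textbook>
```

The optional aphorism and dedication elements are left out here, which the "?" and "*" in the content models allow. Because the entities are declared in the external DTD, a parser that does not read the external subset will not be able to expand them.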
   


Q2. Write an XML document that contains the following information: the name of a London tourist attraction. The name of the district it is in. The type of attraction it is (official building, art gallery, park etc). Whether it is in-doors or out-doors. The year it was built or founded [Feel free to make this up if you don’t know]. Choose appropriate tags. Use attributes for the type of attraction and in-doors or out-doors status.


<?xml version="1.0" encoding="utf-8"?>
<attraction type="Park" indoors="N">
  <name>Hyde Park</name>
  <district>West London</district>
  <yearFounded>1600</yearFounded>
</attraction>
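The question did not ask for one, but since this post is about DTDs, a possible DTD for this document could look like the sketch below. Restricting indoors to the values Y and N is my own assumption, while type is left as free text because the list of attraction types is open-ended.

```xml
<!ELEMENT attraction (name, district, yearFounded)>
<!-- type is free text; indoors is limited to the two values Y and N -->
<!ATTLIST attraction
  type    CDATA   #REQUIRED
  indoors (Y | N) #REQUIRED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT district (#PCDATA)>
<!ELEMENT yearFounded (#PCDATA)>
```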


Q3.  This multi-part question is based on an XML document which can be described with the following diagram:


Here's a snippet taken from this XML document:

<?xml version="1.0" encoding="utf-8"?>
<phraseBook targLang="Russian">
  <section>
    <sectionTitle>Greetings</sectionTitle>
    <phraseGroup>
      <engPhrase>Hi! </engPhrase>
      <translitPhrase>privEt </translitPhrase>
      <targLangPhrase>Привет!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
      <engPhrase>Good morning!</engPhrase>
      <translitPhrase>dObraye Utra</translitPhrase>
      <targLangPhrase>Доброе утро!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
...

a) It’s clear that the XML document is concerned with English phrases and their Russian translations. One of the start tags is <targLangPhrase> with </targLangPhrase> as its end tag. Why do you suppose this isn’t <russianPhrase> with </russianPhrase> ?

The structure of the document suggests that it could just as well be used for translating English phrases into other languages, not only Russian.  The target language is recorded once in the targLang attribute of the root element, so the element names can stay language-neutral.  It would not make much sense to call the element "<russianPhrase>" instead of "<targLangPhrase>" if the document was in fact translating English phrases into, say, Italian.
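As an illustration, the same structure can be reused unchanged for a different target language. The sketch below is my own and uses Greek, mainly because (like Russian) it needs a transliteration; only the targLang attribute and the phrase content change, and the capital letter marking the stressed vowel follows the convention of the Russian snippet.

```xml
<?xml version="1.0" encoding="utf-8"?>
<phraseBook targLang="Greek">
  <section>
    <sectionTitle>Greetings</sectionTitle>
    <phraseGroup>
      <engPhrase>Good morning!</engPhrase>
      <translitPhrase>kalimEra</translitPhrase>
      <targLangPhrase>Καλημέρα!</targLangPhrase>
    </phraseGroup>
  </section>
</phraseBook>
```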

b) Write a suitable prolog for this document
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE phraseBook SYSTEM "phraseBook.dtd">

c) Write a .dtd file to act as the Document Type Definition for this document

A suitable DTD would be as follows:
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT phraseBook (section+)>
<!ATTLIST phraseBook targLang CDATA #REQUIRED> 
<!ELEMENT section (sectionTitle, phraseGroup+)>
<!ELEMENT sectionTitle (#PCDATA)>
<!ELEMENT phraseGroup (engPhrase, translitPhrase, targLangPhrase)>
<!ELEMENT engPhrase (#PCDATA|gloss)*>
<!ELEMENT translitPhrase (#PCDATA|gloss)*>
<!ELEMENT targLangPhrase (#PCDATA)>
<!ELEMENT gloss (#PCDATA)>
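The "(#PCDATA|gloss)*" content model is what allows gloss elements to be mixed in with the text of a phrase. The fragment below is only a sketch of how that might look; the wording of the gloss is invented for the example and is not taken from the lab document.

```xml
<phraseGroup>
  <!-- Mixed content: plain text and gloss elements in any order -->
  <engPhrase>Good morning! <gloss>a greeting used early in the day</gloss></engPhrase>
  <translitPhrase>dObraye Utra</translitPhrase>
  <!-- targLangPhrase is declared as (#PCDATA), so no gloss is allowed here -->
  <targLangPhrase>Доброе утро!</targLangPhrase>
</phraseGroup>
```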

d) The application that is to use this document runs on a Unix system, and was written some years ago.  Is that likely to make any difference to the XML declaration?

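Character encoding is the aspect most likely to matter.  The encoding named in the XML declaration must match the encoding the file is actually saved in.  UTF-8 is backward compatible with ASCII, so the markup and the English phrases would look the same to an older ASCII-based system, but the Russian phrases are stored as multi-byte sequences that such a system may not handle.  If the older application expects a legacy Cyrillic encoding instead, the file would have to be saved in that encoding and the declaration updated to match.  Line endings are less of a concern, since XML parsers normalise them to a single line feed, which is already the Unix convention.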
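For example, if the older application turned out to expect KOI8-R (purely an assumption on my part, as the question does not say which encoding it uses), the prolog might look like the sketch below. The file itself would then have to be saved as KOI8-R as well.

```xml
<?xml version="1.0" encoding="KOI8-R"?>
<!DOCTYPE phraseBook SYSTEM "phraseBook.dtd">
```

Keep in mind that XML processors are only required to support UTF-8 and UTF-16, so whether a given parser can read a KOI8-R file depends on that parser.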
