Saturday, 26 November 2011

CMT3315 Lab 06 - Character Encoding


Today's post deals with character encoding and how it can be specified in XML documents.

Character Encoding

Character encoding is the process of converting characters into another form that facilitates their storage or their transmission over a telecommunications network. Early examples of character encoding are Morse code, which converts characters into a series of long and short presses of a telegraph key, and Baudot code, a precursor to ASCII. ASCII (American Standard Code for Information Interchange) was developed in the 1960s and uses 7 bits to represent each character, giving a set of 128 characters. Because each character is normally stored in an 8-bit byte, various extended variants later used the eighth bit to add a further 128 characters, bringing the total to 256. Plain ASCII covers the unaccented English letters, digits, punctuation and a few simple mathematical symbols; the accented letters needed by many other European languages appear only in the 8-bit extensions. ASCII and its extensions remained the most widely used character encodings well into the 2000s and hence, much of the software in use today is designed to process ASCII documents.

Although popular, ASCII has its limitations. For starters, it cannot encode documents written in non-European alphabets, and it lacks many technical, scientific, cultural and artistic symbols. ISO 8859 provides one solution to this problem by defining a number of different 8-bit character sets, allowing software to switch among the sets according to what is needed. An even better solution is to have one much larger character set that includes as many characters and symbols from as many alphabets as possible. One such character set is Unicode, which currently includes more than 107,000 characters and symbols from over 90 alphabets.
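
To make the difference between these character sets a little more concrete, here is a small Python sketch (Python is simply my choice for illustration, and the function name "encoded_bytes" is made up for this example). It asks for the bytes of a character in ASCII, ISO 8859-6 and UTF-8, and returns nothing when the character set does not contain that character.

```python
def encoded_bytes(char, encoding):
    """Return the bytes for char in the given encoding, or None if the
    character set behind that encoding does not contain the character."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


# An English letter, an accented Latin letter and an Arabic letter.
for char in ("A", "é", "ب"):
    for encoding in ("ascii", "iso-8859-6", "utf-8"):
        print(repr(char), encoding, encoded_bytes(char, encoding))
```

Only UTF-8 produces bytes for all three characters; the Arabic letter takes one byte in ISO 8859-6 and two bytes in UTF-8.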

Character Sets in XML

In XML, you can specify the character encoding by modifying the XML declaration in the document prolog:


<?xml version="1.0" encoding="UTF-8"?>

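Here, the XML declaration specifies that the XML document uses the Unicode UTF-8 character encoding. In fact, if no character encoding is specified in the XML declaration, the parser will assume UTF-8 unless the file starts with a byte order mark indicating UTF-16. Note that the declaration only tells the parser how to read the file; the file itself must actually be saved in the declared encoding.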
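
The effect of the declaration can be checked with a short Python sketch using the standard library's ElementTree parser. The sample element name "p" and the variable names are just assumptions for the example. The same UTF-8 bytes are parsed twice: once with no encoding in the declaration, and once with a declaration that wrongly claims ISO-8859-1.

```python
import xml.etree.ElementTree as ET

text = "ب"                  # a single Arabic letter as sample content
raw = text.encode("utf-8")  # the bytes as they would be saved in a UTF-8 file

# No encoding in the declaration: the parser falls back to UTF-8.
no_decl = b'<?xml version="1.0"?><p>' + raw + b"</p>"

# Declaration that wrongly claims ISO-8859-1 for the same UTF-8 bytes.
wrong_decl = b'<?xml version="1.0" encoding="ISO-8859-1"?><p>' + raw + b"</p>"

print(ET.fromstring(no_decl).text == text)     # the letter is read back intact
print(ET.fromstring(wrong_decl).text == text)  # each byte is misread as a Latin-1 character
```

The second document still parses, but its text comes back as two unrelated Latin-1 characters, which is why the declared encoding has to match the way the file was saved.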

The following are this week's questions.

Quick Questions
Q1. What exactly does a DTD do in XML?
A DTD (Document Type Definition) - defines the structure a particular type of XML document should take.  It dictates the elements that should be present in the XML document, their attributes as well as the order in which they should appear in the document.


Q2. You've written an XML document, with the XML declaration '<?xml version="1.0"?>' at the start.  You realise that the text contains some arabic characters.  Which of the following should you do:
  • change the XML declaration to '<?xml version="1.0" encoding="ISO 8859-6"?>'
  • change the XML declaration to '<?xml version="1.0" encoding="UTF-8"?>'
  • do nothing: the declaration is fine as it is.

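The current declaration is fine as it is, provided the file is actually saved as UTF-8. This is because XML parsers assume that the document encoding is UTF-8 if it is not specified, and UTF-8 can represent all the Arabic characters. The second option simply states this default explicitly. The first option would only work if the encoding name were written with a hyphen ("ISO-8859-6") and the file were saved in that encoding.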


Q3. Can you use a binary graphics file in an XML document?
Yes.  To do so you must define the image as an external entity, mark it as non-parsable data and define the image format.  You can then assign this entity to an attribute of an empty element.  Here's an example:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE koala [
  <!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
  <!NOTATION gif PUBLIC "image/gif">
  <!ELEMENT koala (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image source ENTITY #REQUIRED>
]>
<koala>
  <image source="koalaimage"/>
</koala>
    

The entity declaration at line 3 defines an entity called "koalaimage" that points to an external file "koala.gif".  The entity is also marked as unparsed data using the keyword "NDATA", which is followed by the notation name "gif"; this notation is declared at line 4 using the "NOTATION" keyword.  Finally, the "source" attribute of the empty "image" element is set to the name of our image entity, i.e. "koalaimage".
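
An XML parser does not load the image itself; it only reports the entity and its notation to the application, which then decides what to do with the file. As a rough sketch of that hand-over, the Python snippet below uses the standard library's expat bindings to collect unparsed entity declarations. The function name "unparsed_entities" is my own, and the document is the koala example from above.

```python
import xml.parsers.expat

DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE koala [
  <!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
  <!NOTATION gif PUBLIC "image/gif">
  <!ELEMENT koala (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image source ENTITY #REQUIRED>
]>
<koala>
  <image source="koalaimage"/>
</koala>"""


def unparsed_entities(document):
    """Map each unparsed entity name to its (system id, notation name)."""
    found = {}

    def on_unparsed(name, base, system_id, public_id, notation):
        found[name] = (system_id, notation)

    parser = xml.parsers.expat.ParserCreate()
    parser.UnparsedEntityDeclHandler = on_unparsed
    parser.Parse(document, True)
    return found


print(unparsed_entities(DOC))
```

The application would look up the value of the "source" attribute in this mapping to find the file name and its format.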

Longer Questions
Q1. For this question we were required to produce an XML document and accompanying DTD file for a book entitled "Toba: the worst volcanic eruption of all".  The first three chapters of the book are written as separate XML files, where the text of each is placed between "<text>" and "</text>" tags.  
From the specification given, the structure of this book's XML document can be described with the following diagram:

Given this information a suitable DTD for this specification would be as follows:

<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY pub "STC Press, Malta">
<!ENTITY chap1 SYSTEM "chap1.xml">
<!ENTITY chap2 SYSTEM "chap2.xml">
<!ENTITY chap3 SYSTEM "chap3.xml">
<!ELEMENT book (titlePage, titlePageVerso, contents, chapter+)>
<!ELEMENT titlePage (bookTitle, author+, publisher)>
<!ELEMENT bookTitle (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT titlePageVerso (copyright, publishedBy, ISBN)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT publishedBy (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT contents (chapterName+)>
<!ELEMENT chapterName (#PCDATA)>
<!ATTLIST chapterName number CDATA #REQUIRED>
<!ELEMENT chapter (text)>
<!ATTLIST chapter number CDATA #REQUIRED name CDATA #REQUIRED>
<!ELEMENT text (#PCDATA)>
    

From the details given, the publisher name appears more than once in the XML document: in the title page and also in the title page verso.  This makes the publisher name an ideal candidate for an entity, as declared in line 2 of the DTD above.  Furthermore, since the chapters of the book are each stored as a separate XML file, an external entity was declared for each chapter (lines 3 to 5), each pointing to the corresponding file.  The rest of the DTD is quite straightforward, declaring the remaining elements and attributes.

Here's what the book's XML document looks like:
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE book SYSTEM "Lab06_book.dtd">
        <book>
            <titlePage>
                <bookTitle>Toba: the worst volcanic eruption of all</bookTitle>
                <author>John</author>
                <author>Jack</author>
                <author>Jill</author>
                <author>Joe</author>
                <publisher>&pub;</publisher>
            </titlePage>
            <titlePageVerso>
                <copyright>Copyright 2010 STC Press</copyright>
                <publishedBy>&pub;</publishedBy>
                <ISBN>978-0-596-52722-0</ISBN>
            </titlePageVerso>
            <contents>
                <chapterName number="1">The Mystery of Lake Toba's origins</chapterName>
                <chapterName number="2">Volcanic Winter</chapterName>
                <chapterName number="3">What Toba did to the human race</chapterName>
            </contents>
            <chapter number="1" name="The Mystery of Lake Toba's origins">&chap1;</chapter>
            <chapter number="2" name="Volcanic Winter">&chap2;</chapter>
            <chapter number="3" name="What Toba did to the human race">&chap3;</chapter>
        </book>
    

The XML document references the DTD file explained earlier and makes use of the entities declared within that DTD to refer to the publisher name (lines 10 and 14) and the three book chapters (lines 22 to 24).
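
To see the entity replacement happen, here is a minimal Python sketch using the standard library's ElementTree parser. One assumption to flag: this parser does not fetch external DTD files, so for the sake of the example the "pub" entity is declared in an internal subset of a cut-down document rather than in "Lab06_book.dtd".

```python
import xml.etree.ElementTree as ET

# Cut-down title page with the entity declared inline (internal subset).
DOC = """<!DOCTYPE titlePage [
  <!ENTITY pub "STC Press, Malta">
]>
<titlePage>
  <bookTitle>Toba: the worst volcanic eruption of all</bookTitle>
  <publisher>&pub;</publisher>
</titlePage>"""

root = ET.fromstring(DOC)

# The parser has already replaced &pub; with the entity's text.
print(root.findtext("publisher"))
```

Changing the publisher name in the entity declaration would change it everywhere the entity is referenced.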

Saturday, 12 November 2011

CMT3315 Lab 05 - XML Well-formedness & DTDs


The last post covered the basics of XML syntax and document type definitions. This week's post is a continuation, answering some more questions related to XML well-formedness and DTDs. Where possible, lab questions were reproduced before providing the answer.

Quick Questions
Q1. <:-/> This is a smiley.  Is it also a well-formed XML document?  Say why.

From a structural point of view, an XML document must contain exactly one top-level element, known as the root or document element, for it to be well formed.  In this case, the XML document is structurally well formed as the smiley is our root element.  It is a properly closed empty element, denoted by the "/>" at the end, so the name of our element is ":-".  According to the W3C Recommendation "Extensible Markup Language (XML) 1.0 (Fifth Edition)", element names can start with a colon ":" and can contain hyphens "-", so technically the element name is also well formed.  However, it is generally considered good practice to avoid using the colon character as it is reserved for use with namespaces, and a namespace-aware parser may well reject this name.


Q2. What is the difference between well-formed and valid XML?

A well formed XML document is one which is syntactically correct, i.e. it follows proper XML syntax as defined in the XML 1.0 Fifth Edition W3C Recommendation.  On the other hand, a well formed XML document is not necessarily valid.  In addition to being well formed, an XML document must also follow rules set out in a Document Type Definition (DTD) or XML Schema for it to be valid.

Longer Questions
Q1. For this question, we were required to write a Document Type Definition (DTD) for an XML specification to store information about college textbooks.  The given specification can be described with the following diagram:

Additionally, a chapter is identified by a chapter number and a chapter title and a section is identified by a section number and a section title.  Finally the publisher name will always be "Excellent Books Ltd" and their address will always be "21, Cemetery Lane, SE1 1AA, UK".

Given this information, a suitable DTD  for this specification would be as follows:

<?xml version="1.0" encoding="utf-8"?>
      <!ENTITY pubName "Excellent Books Ltd">
      <!ENTITY pubAddress "21, Cemetery Lane, SE1 1AA, UK">
      <!ELEMENT textbook (titlePage, titlePageVerso, chapter+)>
      <!ELEMENT titlePage (title, author, publisher, aphorism?)>
      <!ELEMENT titlePageVerso (publisherAddress, copyrightNotice, ISBN, dedication*)>
      <!ELEMENT chapter (section+)>
      <!ELEMENT section (bodyText+)>
      <!ATTLIST chapter chapterNo CDATA #REQUIRED chapterTitle CDATA #REQUIRED>
      <!ATTLIST section sectionNo CDATA #REQUIRED sectionTitle CDATA #REQUIRED>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT author (#PCDATA)>
      <!ELEMENT publisher (#PCDATA)>
      <!ELEMENT aphorism (#PCDATA)>
      <!ELEMENT publisherAddress (#PCDATA)>
      <!ELEMENT copyrightNotice (#PCDATA)>
      <!ELEMENT ISBN (#PCDATA)>
      <!ELEMENT dedication (#PCDATA)>
      <!ELEMENT bodyText (#PCDATA)>
   


Q2. Write an XML document that contains the following information: the name of a London tourist attraction. The name of the district it is in. The type of attraction it is (official building, art gallery, park etc). Whether it is in-doors or out-doors. The year it was built or founded [Feel free to make this up if you don’t know]. Choose appropriate tags. Use attributes for the type of attraction and in-doors or out-doors status.


<?xml version="1.0" encoding="utf-8"?>
<attraction type="Park" indoors="N">
  <name>Hyde Park</name>
  <district>West London</district>
  <yearFounded>1600</yearFounded>
</attraction>


Q3.  This multi-part question is based on an XML document which can be described with the following diagram:


Here's a snippet taken from this XML document:

<?xml version="1.0" encoding="utf-8"?>
<phraseBook targLang="Russian">
  <section>
    <sectionTitle>Greetings</sectionTitle>
    <phraseGroup>
      <engPhrase>Hi! </engPhrase>
      <translitPhrase>privEt </translitPhrase>
      <targLangPhrase>Привет!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
      <engPhrase>Good morning!</engPhrase>
      <translitPhrase>dObraye Utra</translitPhrase>
      <targLangPhrase>Доброе утро!</targLangPhrase>
    </phraseGroup>
    <phraseGroup>
...

a) It’s clear that the XML document is concerned with English phrases and their Russian translations. One of the start tags is <targLangPhrase> with </targLangPhrase> as its end tag. Why do you suppose this isn’t <russianPhrase> with </russianPhrase> ?

The structure of the document suggests that it could very well be used for translating English phrases into other languages and not just Russian.  It would not make much sense to name the element "<russianPhrase>" instead of "<targLangPhrase>" if the document was in fact translating English phrases into, say, Italian.

b) Write a suitable prolog for this document
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE phraseBook SYSTEM "phraseBook.dtd">

c) Write a .dtd file to act as the Document Type Description for this document

A suitable DTD would be as follows:
<?xml version="1.0" encoding="utf-8"?>
<!ELEMENT phraseBook (section+)>
<!ATTLIST phraseBook targLang CDATA #REQUIRED> 
<!ELEMENT section (sectionTitle, phraseGroup+)>
<!ELEMENT sectionTitle (#PCDATA)>
<!ELEMENT phraseGroup (engPhrase, translitPhrase, targLangPhrase)>
<!ELEMENT engPhrase (#PCDATA|gloss)*>
<!ELEMENT translitPhrase (#PCDATA|gloss)*>
<!ELEMENT targLangPhrase (#PCDATA)>
<!ELEMENT gloss (#PCDATA)>

d) The application that is to use this document runs on a Unix system, and was written some years ago.  Is that likely to make any difference to the XML declaration?

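Character encoding is the aspect that would need to be considered.  UTF-8 is backward compatible with ASCII in the sense that every ASCII character keeps the same single-byte value, but the Russian phrases in this document are stored as multi-byte sequences that an older application supporting only ASCII or a legacy 8-bit character set would not read correctly.  The encoding named in the XML declaration should therefore be one that the application actually supports, and the file must be saved in that encoding.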
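
One possible approach for an ASCII-only consumer, assuming it still reads the file through a real XML parser, is to keep the file pure ASCII and write the Cyrillic letters as numeric character references. The Python sketch below illustrates the idea with the standard library's ElementTree; it is only an illustration, not something taken from the lab material.

```python
import xml.etree.ElementTree as ET

phrase = ET.Element("targLangPhrase")
phrase.text = "Привет!"

# Serialising as US-ASCII replaces every non-ASCII character with a
# numeric character reference such as &#1055; (the letter П).
ascii_bytes = ET.tostring(phrase, encoding="us-ascii")
print(ascii_bytes)

# Any XML parser turns the references back into the original text.
print(ET.fromstring(ascii_bytes).text == phrase.text)
```

The file then contains nothing but ASCII bytes, yet no information about the Russian text is lost.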
Tuesday, 8 November 2011

CMT3315 LAB04 - XML Syntax 2


Today's post introduces Document Type Definition (DTD) syntax and answers a number of questions relating to XML syntax and DTDs.

Document Type Definition

In last week's post we created the following XML document containing information about a small music collection:

   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE musicCollection>
   <!--Prolog ends here. -->
   <musicCollection>
      <cd index="1">
         <title>Innuendo</title>
         <artist>Queen</artist>
         <tracks>
            <track id="1">Innuendo</track>
            <track id="2">I'm Going Slightly Mad</track>
            <track id="3">Headlong</track>
            <track id="4">I Can't Live With You</track>
            <track id="5">Don't Try So Hard</track>
            <track id="6">Ride The Wild Wind</track>
            <track id="7">All God's People</track>
            <track id="8">These Are The Days Of Our Lives</track>
            <track id="9">Delilah</track>
            <track id="10">The Hitman</track>
            <track id="11">Bijou</track>
            <track id="12">The Show Must Go On</track>
          </tracks>
      </cd>
      <cd index="2">
         <title>(What's The Story) Morning Glory?</title>
         <artist>Oasis</artist>
         <tracks>
            <track id="1">Hello</track>
            <track id="2">Roll With It</track>
            <track id="3">Wonderwall</track>
            <track id="4">Don't Look Back In Anger</track>
            <track id="5">Hey Now!</track>
            <track id="6">Untitled</track>
            <track id="7">Some Might Say</track>
            <track id="8">Cast No Shadow</track>
            <track id="9">She's Electric</track>
            <track id="10">Morning Glory</track>
            <track id="11">Untitled</track>
            <track id="12">Champagne Supernova</track>
          </tracks>
      </cd>
   </musicCollection>

XML uses documents known as schemas to define the structure a particular class (type) of XML document should follow. In our example, line 2 specifies that this document is of type musicCollection. We can use a Document Type Definition (DTD) to define the legal structure of the musicCollection document type. Looking at our example we can see that every "cd" element is expected to have a "title", an "artist" and a "tracks" element. In turn, the "tracks" element contains multiple "track" elements. The number of occurrences of an element is known as its cardinality. In a DTD, we can define which elements are expected as well as their attributes, sequence and cardinality.

Using this information we can write our DTD, starting from the root element:

<!--musicCollection.dtd-->
<!ELEMENT musicCollection (cd+)>

This is the first line in our DTD document. It specifies that the "musicCollection" document element is made up of one or more "cd" elements. The "+" sign after "cd" specifies the "one or more" cardinality of the "cd" element. Other Cardinality specifiers include:

  • ? - means 0 or 1
  • * - means 0 or more
  • no specifier means "exactly 1"

Let's define the "cd" element next:

<!--musicCollection.dtd-->
<!ELEMENT musicCollection (cd+)>
<!ELEMENT cd (title, artist, tracks)>
<!ATTLIST cd index CDATA #REQUIRED>

The "cd" element is made up of the "title", "artist" and "tracks" elements. Given that each of these must appear only once in a "cd" element, no cardinality specifier was necessary. Additionally, the "cd" element has an "index" attribute which is defined at line 4. We also specify that the "index" attribute is made up of character data (CDATA) and is required (#REQUIRED). For an attribute, CDATA means that the value is a plain string of characters: it cannot contain markup such as child elements, although entity references in it are still expanded. Let's add the rest of our elements:

<!--musicCollection.dtd-->
<!ELEMENT musicCollection (cd+)>
<!ELEMENT cd (title, artist, tracks)>
<!ATTLIST cd index CDATA #REQUIRED>
<!ELEMENT title (#PCDATA)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT tracks (track+)>
<!ELEMENT track (#PCDATA)>
<!ATTLIST track id CDATA #REQUIRED>

Our DTD is now complete. Each of the "title", "artist" and "track" elements is defined as containing parsed character data (#PCDATA), that is, text which the parser scans for markup and entity references. We also specify that the "tracks" element should contain one or more "track" elements and that every "track" element must have an "id" attribute. All we need to do now is reference our DTD from the XML document. Assuming that our DTD file is called "musicCollection.dtd" and resides in the same directory as the XML file, the following modification to the XML document's document type declaration will reference the DTD:

<!DOCTYPE musicCollection SYSTEM "musicCollection.dtd">

Summary

DTDs can be declared inside of the XML document itself or as external files and are a simple yet effective way of defining and validating the structure of XML documents.
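
The cardinality specifiers "?", "*" and "+" behave much like the quantifiers of regular expressions, and that analogy can be turned into a toy checker. The Python sketch below is only an illustration under my own assumptions: the content models of musicCollection.dtd are translated by hand into regular expressions over child element names, attributes are ignored, and any element without an entry is treated as (#PCDATA). It is not a real DTD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hand translation of the content models in musicCollection.dtd.
# Each child name is followed by one space so names cannot run together.
CONTENT_MODELS = {
    "musicCollection": r"(cd )+",   # <!ELEMENT musicCollection (cd+)>
    "cd": r"title artist tracks ",  # <!ELEMENT cd (title, artist, tracks)>
    "tracks": r"(track )+",         # <!ELEMENT tracks (track+)>
}


def follows_content_model(element):
    """Check an element and all its descendants against CONTENT_MODELS."""
    model = CONTENT_MODELS.get(element.tag)
    if model is None:
        # Treated as (#PCDATA): text only, no child elements.
        return len(element) == 0
    child_names = "".join(child.tag + " " for child in element)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(follows_content_model(child) for child in element)


good = ET.fromstring(
    "<musicCollection><cd><title>Innuendo</title><artist>Queen</artist>"
    "<tracks><track>Innuendo</track></tracks></cd></musicCollection>")
bad = ET.fromstring(
    "<musicCollection><cd><artist>Queen</artist><title>Innuendo</title>"
    "<tracks/></cd></musicCollection>")

print(follows_content_model(good), follows_content_model(bad))
```

The second document fails because "artist" comes before "title" and the "tracks" element is empty, both of which break the sequence and cardinality rules in the DTD.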

The following are the questions for this week's lab session

Quick Questions
Q1. What does XML stand for?  And CSS?

XML stands for Extensible Markup Language and CSS stands for Cascading Style Sheets.

Q2. Is this XML line well-formed? Say Why.

<b><i>This text is bold and italic </i></b>

Yes this line is well formed.  Both start tags match their end tags and are properly nested.

Q3. Is this XML document well-formed? Say why.

<?xml version= "1.0" ?>
<greeting>
Hello, world!
</greeting>
<greeting>
Hello Mars too!
</greeting>
No. This XML document is not well formed because it has two top-level "greeting" elements; an XML document must have exactly one root element.
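
Answers like the two above are easy to double-check by handing the text to a parser. The helper below is a small Python sketch (the name "is_well_formed" is my own) that relies on the standard library's ElementTree parser raising an error for anything that is not well formed.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True


nested = "<b><i>This text is bold and italic </i></b>"
two_roots = """<?xml version= "1.0" ?>
<greeting>
Hello, world!
</greeting>
<greeting>
Hello Mars too!
</greeting>"""

print(is_well_formed(nested))     # properly nested tags
print(is_well_formed(two_roots))  # second top-level element is rejected
```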

Longer Questions
Q1. Write an XML document that contains the following information:
  • The name of this course;
  • The name of this building;
  • The name of this room;
  • The start and end times of this session.
Choose appropriate tags.  Use attributes for the start and end times.
<?xml version= "1.0" ?>
<course>
 <name>CMT 3315 Advanced Web Technologies</name>
 <buildingName>STC Training</buildingName>
 <roomName>Room 5</roomName>
 <session startTime="18:00" endTime="21:00"/>
</course>



Q2. Identify all the syntax errors in the following XML document:

<?xml version= "1.0" ?>
<!DOCTYPE bookStock SYSTEM "bookstock.dtd">
<bookstore>
  <book category="Cooking">
    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <1stEdition>2005</1stEdition >
    <2ndEdition>2007</2ndEdition >
    <price>19.99</price currency="pounds sterling">
  </book>
  <book category="Children’>
    <title lang="en">Harry Potter and the enormous pile of money</title>
  <!—best selling children’s book of the year --2009 -->
    <author>J K. Rowling</author>
   <1stEdition>2005</1stEdition>
    <price>29.99</Price>
  </book>
  <book category="Web">
    <title lang="en">Learning XML</title>
    <author>Erik T. Ray</author>
   <1stEdition>2003</1stEdition>
   <2ndEdition >2008</2ndEdition >
    <price>29.95</discount>
    <discount>15%</price>
  </book>
  <book category="Computing">
    <title lang=en>Insanely great – the life and times of Macintosh, the computer that changed everything </title>
    <author <!—other authors not listed -->>Steven Levy</author>
   <1stEdition>1994</1stEdition>
    <price>9.95</discount>
    <discount>15%</price>
  </book>

The XML document contains various syntax errors including:
  • The root element is called "bookstore", which does not match the "bookStock" name given in the DOCTYPE declaration (strictly speaking a validity error rather than a syntax error). There is also no closing tag for the root element;
  • The "1stEdition" and "2ndEdition" element names are invalid as names cannot start with a digit (lines 7, 8, 15, 21, 22, 29);
  • Attribute placed in the end tag at line 9. This should be placed in the start tag;
  • Mismatching quote at line 11;
  • Incorrect comment start tag and extra "--" in comment on line 13;
  • Mismatching start and end tags at lines 16, 23, 24, 30 and 31;
  • Missing quotes for attribute value at line 27;
  • Comment incorrectly placed within the start tag at line 28.  Comment start tag is also incorrect.


Q3. You are asked to produce a Document Type Declaration for a class of XML documents called “memo”. You come up with this .dtd file:

<!DOCTYPE memo
[
<!ELEMENT memo (to,from,heading,body)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
]>


Your client says “That’s all very well, but every memo has to have a date. And some of them have to have a security classification too (you might want to write “Secret” at the top). And a memo has a serial number – I think that’s what you’d call an attribute, isn’t it?” How would you amend this .dtd file so that it did what the client wanted?

A suitable DTD to fulfill these requirements would be as follows:

<!DOCTYPE memo
[
<!ELEMENT memo (date,to,from,heading,body,classification?)>
<!ELEMENT to (#PCDATA)>
<!ELEMENT from (#PCDATA)>
<!ELEMENT heading (#PCDATA)>
<!ELEMENT body (#PCDATA)>
<!ELEMENT date (#PCDATA)>
<!ELEMENT classification (#PCDATA)>
<!ATTLIST memo serialNo ID #REQUIRED>
]>

Given that not all memos will have a security classification, the DTD uses the "?" cardinality specifier to indicate that the "memo" element can contain zero or one "classification" element.

Other specifiers include:

  • "*" - means that zero or more of the element is allowed
  • "+" - means that one or more of the element is allowed
Finally, attaching none of the cardinality specifiers to an element name means that the element must appear exactly once.
Thursday, 3 November 2011

CMT3315 Lab 03 - XML Syntax 1


Today's post introduces the basics of XML syntax and answers a number of questions related to this topic.

XML document structure & syntax

The basic idea behind XML is to produce documents whose structure can be understood by software applications. In an XML document, pieces of text that have special meaning are marked up using tags. A tag is simply a word between angle brackets such as "<name>". An XML document is made up of 3 parts:

  • A prolog (optional);
  • The document or root element;
  • Other miscellaneous content following the root element end tag (optional).

The prolog is an optional component of an XML document which, if included, must appear before the root element. The prolog consists of an XML declaration, which defines the XML version and character encoding being used, and a Document Type Declaration (DOCTYPE). The prolog may also contain comments. Here's an example of an XML prolog:

   <?xml version="1.0" encoding="UTF-8"?>
   <!--This is a comment-->
   <!DOCTYPE musicCollection>

Lines 1 and 2 are the XML declaration and a comment respectively. Line 3 is the Document Type Declaration which provides the name of the document element "musicCollection". The document or root element follows the prolog. An XML document must have only one root element which in many cases contains a hierarchy of other elements. For instance, let's assume that our music collection is made up of a number of CDs where each CD has a title, an artist and a number of tracks. A suitable XML document to store this information would be:

   <?xml version="1.0" encoding="UTF-8"?>
   <!DOCTYPE musicCollection>
   <!--Prolog ends here. -->
   <musicCollection>
      <cd index="1">
         <title>Innuendo</title>
         <artist>Queen</artist>
         <tracks>
            <track id="1">Innuendo</track>
            <track id="2">I'm Going Slightly Mad</track>
            <track id="3">Headlong</track>
            <track id="4">I Can't Live With You</track>
            <track id="5">Don't Try So Hard</track>
            <track id="6">Ride The Wild Wind</track>
            <track id="7">All God's People</track>
            <track id="8">These Are The Days Of Our Lives</track>
            <track id="9">Delilah</track>
            <track id="10">The Hitman</track>
            <track id="11">Bijou</track>
            <track id="12">The Show Must Go On</track>
          </tracks>
      </cd>
      <cd index="2">
         <title>(What's The Story) Morning Glory?</title>
         <artist>Oasis</artist>
         <tracks>
            <track id="1">Hello</track>
            <track id="2">Roll With It</track>
            <track id="3">Wonderwall</track>
            <track id="4">Don't Look Back In Anger</track>
            <track id="5">Hey Now!</track>
            <track id="6">Untitled</track>
            <track id="7">Some Might Say</track>
            <track id="8">Cast No Shadow</track>
            <track id="9">She's Electric</track>
            <track id="10">Morning Glory</track>
            <track id="11">Untitled</track>
            <track id="12">Champagne Supernova</track>
          </tracks>
      </cd>
   </musicCollection>

Above we can see that our "musicCollection" root element contains two "cd" elements which in turn contain further elements describing the CD. We can also see that every start tag has a matching end tag (e.g. "<cd>" and "</cd>"). XML elements can have zero, one or more child elements and all (except for the root element) must have a parent element. Furthermore, XML elements must be correctly nested for the document to be well formed. Elements in XML can also have attributes and, when present, these must be placed within the element's start tag. In our example, every "cd" element has an attribute called "index" and every "track" element has an attribute called "id". XML attributes can be thought of as data describing data, generally (but not necessarily) used for storing IDs. An attribute is made up of a name, followed by an "=" sign and a value within quotes, which may be single or double as long as they match. This example also serves as a basis for discussing the next section.
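
As a quick aside, this parent, child and attribute structure is exactly what a program sees once the document has been parsed. The Python sketch below uses the standard library's ElementTree on a shortened copy of the music collection (only two tracks of the first CD, to keep it brief) and walks from the root element down to the attributes.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the music collection document.
DOC = """<musicCollection>
  <cd index="1">
    <title>Innuendo</title>
    <artist>Queen</artist>
    <tracks>
      <track id="1">Innuendo</track>
      <track id="2">I'm Going Slightly Mad</track>
    </tracks>
  </cd>
</musicCollection>"""

root = ET.fromstring(DOC)
for cd in root.findall("cd"):
    # Attributes come from the start tag, text comes from the child elements.
    print(cd.get("index"), cd.findtext("title"), "by", cd.findtext("artist"))
    for track in cd.iter("track"):
        print("  track", track.get("id"), track.text)
```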

Well Formedness

A well formed XML document is one that follows proper XML syntax. Unlike most HTML parsers, XML parsers expect the document to be well formed and will stop processing the document if any syntax errors are found. An XML document must be structured as discussed in the previous section and all element and attribute names must be valid. The first character in a name must be either a letter (A-Z or a-z), a colon ":" or an underscore "_", and the name cannot start with the letters "xml", which are reserved. The rest of the characters can also include digits (0-9), hyphens "-" and full stops ".".

XML is case-sensitive so using our example above, the element "<track>" is not the same as the element <Track>. Furthermore, end tags must match their start tags. In some cases, XML elements will not contain any information. In such cases the end tag may be replaced by a "/" at the end of a start tag, for example:

 <emptyElement index="0" />
 <!-- is equivalent to -->
 <emptyElement index="0"></emptyElement>

These are the basic rules to follow to create a well formed XML document.
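
The naming rules described above can be written down as a regular expression. The Python sketch below is a simplified, ASCII-only approximation (the real XML specification also allows many non-ASCII letters), and the function name "is_valid_name" is made up for this example.

```python
import re

# First character: letter, underscore or colon.
# Remaining characters: the same, plus digits, full stops and hyphens.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.-]*")


def is_valid_name(name):
    """Apply the simplified naming rules to an element or attribute name."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" are reserved
    return NAME.fullmatch(name) is not None


for name in ("track", "_private", "cd-index.1", "2ndCity", "xmlData"):
    print(name, is_valid_name(name))
```

The last two names are rejected: one starts with a digit and the other starts with the reserved letters "xml".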

Lab Questions
Q1. Write an XML document that contains the following information:
  • Your name;
  • Your email address;
  • Your student number;
  • Your home town;
  • Your date of birth.
Choose appropriate tags. Use attributes for the date of birth.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE student>
<student dateOfBirth="01/01/1990">
 <name>Wayne Zammit</name>
 <email>wayne@somedomain.com</email>
 <studentNo>NN123</studentNo>
 <homeTown>St. Julians</homeTown>
</student>


Q2. Identify all the syntax errors in the XML document below:

<?xml version= "1.0" ?>
<!DOCTYPE countryCollection SYSTEM "countryList.dtd">
<CountryList>
<Nations TotalNations ="3"/>
<!--Data from CIA --Year Book -->
 <Country CountryCode="1"> 
  <OfficialName>United States of America</officialName>
  <Label>Common Names:</label>  
  <CommonName>United States</commonName>
  <CommonName>U.S.</commonName>
  <Label>Capital:</capital>  
  <Capital cityNum="1">Washington, D.C. </label>
  <2ndCity cityNum="2">New York </2ndCity> 
  <Label>Major Cities:</label> 
  <MajorCity cityNum="3">Los Angeles </majorCity>
  <MajorCity cityNum="4">Chicago </majorCity>  
  <MajorCity cityNum="5’>Dallas </majorCity>  
  <Label>Bordering Bodies of Water:</label>    
  <BorderingBodyOfWater> Atlantic Ocean </borderingBodyOfWater>
  <BorderingBodyOfWater> Pacific Ocean </borderingBodyOfWater>  
  <BorderingBodyOfWater> Gulf of Mexico </borderingBodyOfWater> 
  <Label>Bordering Countries:</label>   
  <BorderingCountry CountryCode="1"> Canada </borderingCountry>    
  <BorderingCountry CountryCode ="52"> Mexico </borderingCountry>
</country>
 <Country CountryCode="81">
  <OfficialName> Japan </officialName>
  <Label>Common Names:</label>    
  <CommonName> Japan </commonName>
  <Label>Capital:</label>  
  <Capital>Tokyo</capital cityNum="1">
  <2ndCity cityNum="2">Osaka </2ndCity>
  <Label>Major Cities:</label>    
  <MajorCity cityNum="3">Nagoya </majorCity>
  <MajorCity cityNum="4">Osaka </majorCity>  
  <MajorCity cityNum="5’>Kobe </majorCity>  
  <Label>Bordering Bodies of Water:</label>
  <BorderingBodyOfWater>Sea of Japan </borderingBodyOfWater>
  <BorderingBodyOfWater>Pacific Ocean </borderingBodyOfWater>  
 </country>
 <Country CountryCode="254">
  <OfficialName> Republic of Kenya </officialName>
  <Label>Common Names:</label>    
  <CommonName> Kenya </commonName>
  <Label>Capital:</label>  
  <Capital cityNum=’1’>Nairobi </capital>
  <2ndCity cityNum=’2’>Mombasa</2ndCity>
  <Label>Major Cities:</label>    
  <MajorCity cityNum=’3’>Mombasa </majorCity>
  <MajorCity cityNum=’4’>Lamu </majorCity>
  <MajorCity cityNum=’5’>Malindi </majorCity>  
  <MajorCity cityNum=’6’ cityNum=’7’>Kisumu-Kericho </majorCity> 
  <Label>Bordering Bodies of Water:</label>
  <BorderingBodyOfWater <!--Also Lake Victoria --> > Indian Ocean </borderingBodyOfWater>
 </country> 
The XML document contains various syntax errors including:

  • The root node should be called "<countryCollection>" as specified in the DOCTYPE declaration.  There is also no matching closing tag;
  • The comment at line 5 contains an extra "--" between its start and end tags.  This is not allowed;
  • Start tags do not match end tags (different letter-casing) and in some instances (eg lines 11 and 12) end tags have been swapped;
  • The "2ndCity" element name is invalid as names cannot start with a number (lines 13, 32, 47);
  • Mismatching quote at lines 17 and 36;
  • Attribute placed in the end tag at line 31.  This should be placed in the start tag;
  • Typographic (curly) single quotes used to enclose attribute values at lines 46 to 52.  Only straight single or double quotes are allowed;
  • Duplicate attribute "cityNum" at line 52;
  • Comment placed within the start tag at line 54.
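
Several of the errors above can be reproduced in isolation. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser; the one-line fragments are hypothetical stand-ins modelled on the listing, not the lab document itself. It only checks well-formedness, so the DOCTYPE-related root element error is not covered.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-line fragments, each reproducing one class of error
# from the list above. \u2019 is the typographic (curly) single quote.
SAMPLES = {
    "case mismatch": '<Capital>Tokyo</capital>',
    "name starts with a digit": '<2ndCity>Osaka</2ndCity>',
    "mismatched quotes": '<MajorCity cityNum="5\u2019>Kobe</MajorCity>',
    "attribute in end tag": '<Capital>Tokyo</Capital cityNum="1">',
    "curly quotes": '<Capital cityNum=\u20191\u2019>Nairobi</Capital>',
    "duplicate attribute": '<MajorCity cityNum="6" cityNum="7">Kisumu</MajorCity>',
    "comment inside start tag": '<Water <!-- Lake Victoria --> >Indian Ocean</Water>',
    "straight single quotes": "<Capital cityNum='1'>Nairobi</Capital>",
}


def is_well_formed(fragment):
    """Return True if the parser accepts the fragment, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


results = {label: is_well_formed(xml) for label, xml in SAMPLES.items()}
for label, ok in results.items():
    print(f"{label}: {'well-formed' if ok else 'rejected'}")
```

Only the last fragment should be accepted: straight single quotes are legal attribute delimiters in XML, whereas the curly quotes in the listing are not.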



CMT3315 LAB 02 - XML vs HTML


In this post we will look at the similarities and differences between XML and HTML. As discussed in my previous post, XML is a subset of SGML designed to make the knowledge structure of a document known to a software package. In essence, it enables a software package to "understand" the structure of a document. HTML was also derived from SGML but appeared before XML.

HTML

HTML is made up of a standard set of tags specifically designed to create web pages meant to be understood and displayed by web browsers. HTML became popular very quickly after its original release, and web pages were soon being used to accomplish things they were never designed to do. Initially, the way an HTML document was interpreted and displayed was left entirely up to the browser. This created problems for web designers who wanted their pages to render the same across browsers, so HTML was extended with tags that defined presentation, such as "<font>". These types of tags go directly against the original concept of SGML, which was to separate content from layout. The browser wars made the situation even worse: fierce competition gave rise to proprietary tags, and browsers became tolerant of badly written HTML documents, which in the end badly hinders programmatic interpretation of web pages.

XML

A major advantage of HTML is its simplicity, but this is also one of its biggest weaknesses. XML was born out of the need for something simpler than SGML but also stricter than HTML. XML tags are user-defined, which means that when creating an XML document you are defining your own standard for structuring that particular type of document. That is why XML is extensible. Every time you define a new set of tags you are effectively defining a new markup language! In fact, just like SGML, XML is itself a framework for defining markup languages. Its main objective is providing the ability to structure information in such a way as to enable any system on any platform to process that information.
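
To make the idea of user-defined tags concrete, here is a small sketch, assuming Python's standard xml.etree.ElementTree module. The "recipe" vocabulary is invented purely for this example; no schema or standard defines these tags, yet a generic parser can still recover the structure.

```python
import xml.etree.ElementTree as ET

# A hypothetical vocabulary made up for this example.
RECIPE = """<recipe name="Tea">
  <ingredient quantity="1">tea bag</ingredient>
  <ingredient quantity="250ml">boiling water</ingredient>
  <step>Steep for three minutes.</step>
</recipe>"""

root = ET.fromstring(RECIPE)

# The parser knows nothing about recipes; it simply exposes the tree.
ingredients = [(i.get("quantity"), i.text) for i in root.findall("ingredient")]
steps = [s.text for s in root.findall("step")]

print(root.tag, root.get("name"))
print(ingredients)
print(steps)
```

The parser needs no prior knowledge of the tag names, which is what allows any system to process a document whose structure you defined yourself.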

XML gave rise to XHTML, a stricter version of HTML based on XML rules. Differences between HTML and XHTML include:

  • Unlike HTML, XHTML documents must be well-formed XML documents;
  • XHTML is case-sensitive for element and attribute names, whereas HTML is not;
  • Attribute minimisation (omitting the "=" sign and value) is not allowed in XHTML.

There are many more differences, most of which would require a separate blog post to explain, but the main idea is that XHTML addresses the limits and weaknesses in the original HTML specification because it is based on XML.
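
The three differences listed above can be observed with any XML parser. Below is a rough sketch, assuming Python's standard xml.etree.ElementTree module; the markup snippets are hypothetical. It tests well-formedness only and does not validate against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical HTML habits: minimised attribute and an unclosed <br>.
html_style = '<p><input type="checkbox" checked><br></p>'
# The XHTML equivalent: explicit attribute value and self-closed empty elements.
xhtml_style = '<p><input type="checkbox" checked="checked" /><br /></p>'
# Tag names differing only in case are different names to an XML parser.
case_mismatch = '<P>text</p>'

checks = {
    "html_style": is_well_formed(html_style),
    "xhtml_style": is_well_formed(xhtml_style),
    "case_mismatch": is_well_formed(case_mismatch),
}
print(checks)
```

A browser would happily render all three snippets as HTML, but only the second one survives an XML parser.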

Summary

HTML (1990) was derived from SGML (1986) as a simpler markup language for creating hyperlinked web documents. XML (1998), itself derived from SGML, is a simpler specification for defining markup languages. Its main objective is to facilitate portability of data across multi-platform systems. In 2000 XML gave rise to XHTML, a reformulation of HTML aimed at addressing limits in the latter's original specification.


CMT3315 Intro - XML, What it is and where it came from

Given the summer recess, it has been a while since my last post but it's time to pull the proverbial socks up and get back to work.  During this semester, the blog will be focusing mostly on the eXtensible Markup Language - XML for short - and there's a lot to cover, so let's get right to it:

First Things First
So what is XML? Before answering that question one must understand where XML comes from and, more importantly, its purpose.  To do that we need to take a look at the early days of computing, back in the 1960s, when the idea of storing documents in computers in such a way as to make them "understandable" to software was still fresh in the minds of researchers.

GML
Charles Goldfarb, Edward Mosher and Raymond Lorie, three researchers working for IBM, came up with GML, a set of macros that implemented mark-up tags to describe the logical structure of a document.  Interestingly, GML is known today as "Generalized Markup Language" but originally the acronym stood for the researchers' surnames!

SGML
GML was eventually extended by Goldfarb in the mid-1980s into the Standard Generalized Markup Language (SGML).  SGML was designed to make it possible for large entities (such as governments) to share machine-readable electronic documents.  The main idea was to embed a set of tags within a document which a software package could use to derive information about that document.  For this to work, the set of tags had to be standardised, made publicly available and, most importantly, be platform independent.  SGML is a very large and complex markup language, suitable for storing equally large and complex documents but rather cumbersome for use with smaller, simpler documents.  The beauty of SGML, however, is that it can be used to derive smaller, simpler mark-up languages better suited for smaller documents.  Such languages include HTML (Hyper-Text Markup Language) and of course XML.

XML
Going back to the original question: "What is XML?", XML is an extensible markup language used to encode documents in such a way as to make the knowledge structure of a document known to a software package.  It separates the actual content of a document from its structure and presentation.  Being a subset of SGML it is much simpler to use and is ideally suited to store, and more importantly transport, electronic documents.  Its most common use today is to transmit data between applications, irrespective of the platform.  The beauty of XML is that you define your own tags, which means that any document structure can be described in XML.
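
As a rough illustration of transmitting data between applications, the sketch below builds a small document, serialises it to text, and parses it back, assuming Python's standard xml.etree.ElementTree module. The "order" and "item" tags and their attributes are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET

# Sender side: build a document using tags invented for this example.
order = ET.Element("order", id="42")
ET.SubElement(order, "item", sku="A-100").text = "Notebook"
ET.SubElement(order, "item", sku="B-200").text = "Pencil"

# Serialise to plain text, the form in which it would be stored or sent.
payload = ET.tostring(order, encoding="unicode")
print(payload)

# Receiver side: any XML-aware application can rebuild the same structure.
received = ET.fromstring(payload)
items = {item.get("sku"): item.text for item in received.findall("item")}
print(received.get("id"), items)
```

Because the payload is just text with a well-defined structure, the receiving side could equally be written in another language on another platform.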

XML became a W3C Recommendation on February 10, 1998 (w3schools.com) and the first web browser to support XML was Microsoft's Internet Explorer 4.0.

Next time, I will be comparing XML to HTML.  In the meantime, have a look at this post from my blog for more on XML.
