Today's post deals with character encoding and how it can be specified in XML documents.
Character Encoding
Character encoding is the process of converting characters into another form that facilitates their transmission over a telecommunications network or their storage. Early examples of character encoding are Morse code, which converts characters into a series of long and short presses of a telegraph key, and Baudot code, a precursor to ASCII. The ASCII (American Standard Code for Information Interchange) character set was developed in the 1960s and uses 7 bits to represent each character, giving 128 characters that are usually stored one per 8-bit byte. Later 8-bit extensions, often loosely called extended ASCII, used the spare bit to add a further 128 characters, bringing the total to 256. ASCII itself covers the unaccented Latin letters used in English, digits, punctuation and control codes; the extended sets add accented letters for many other European languages as well as simple mathematical symbols. ASCII remained the most widely used character encoding right through the late 2000s and hence much of the software in use today is designed to process ASCII documents.
Although popular, ASCII has its limitations. For starters, it cannot encode documents written in non-European alphabets and it lacks many technical, scientific, cultural and artistic symbols. ISO 8859 provides one solution to this problem by defining a number of different character sets, allowing software to switch among the sets according to what is needed. An even better solution is to have one much larger character set that includes as many characters and symbols from as many alphabets as possible. One such character set is Unicode, which at the time of writing includes more than 107,000 characters and symbols from over 90 alphabets.
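To make the difference between these character sets concrete, here is a minimal Python sketch that counts how many bytes a character needs in each encoding. The helper name `encoded_length` and the three sample characters are my own choices for illustration, not part of the original material.

```python
# Compare how the same characters are stored under different encodings.
# "encoded_length" is a hypothetical helper written for this illustration.

def encoded_length(text, encoding):
    """Return the number of bytes needed for text, or None if it cannot be encoded."""
    try:
        return len(text.encode(encoding))
    except UnicodeEncodeError:
        return None

# A plain Latin letter, an accented letter, and the Arabic letter ain (U+0639).
samples = ["A", "\u00e9", "\u0639"]
encodings = ["ascii", "iso-8859-1", "iso-8859-6", "utf-8"]

for ch in samples:
    print(repr(ch), {enc: encoded_length(ch, enc) for enc in encodings})
```

Only UTF-8 can represent all three characters; each ISO 8859 part handles its own alphabet and nothing beyond it, which is why software has to switch between the sets.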
Character Sets in XML
In XML, you can specify the character encoding by modifying the XML declaration in the document prolog:
<?xml version="1.0" encoding="UTF-8"?>
Here, the XML declaration specifies that the XML document uses the Unicode UTF-8 character encoding. In fact, UTF-8 is the default: if no encoding is specified in the XML declaration (and no byte order mark indicates UTF-16), the parser will assume that UTF-8 is being used.
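The effect of the declaration can be seen with a small Python sketch using the standard library's `xml.etree.ElementTree`. The sample element `word` and the helper `try_parse` are assumptions made for this illustration. The document is saved as ISO-8859-1 bytes, so a parser that falls back to UTF-8 should reject it unless the declaration names the real encoding.

```python
import xml.etree.ElementTree as ET

# A tiny document stored as ISO-8859-1 bytes (the accented e becomes the single byte 0xE9).
latin1_bytes = "<word>caf\u00e9</word>".encode("iso-8859-1")

def try_parse(data):
    """Return the root element's text, or None if the bytes are not well-formed XML."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# No declaration: the parser assumes UTF-8, and 0xE9 is not valid UTF-8 here.
without_decl = try_parse(latin1_bytes)

# With a declaration naming the actual encoding, the same bytes parse correctly.
with_decl = try_parse(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes)

print(without_decl, with_decl)
```

The declaration does not convert anything; it only tells the parser how the bytes were already saved, so it has to match the file's real encoding.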
The following are this week's questions.
Quick Questions
Q1. What exactly does a DTD do in XML?
- change the XML declaration to '<?xml version="1.0" encoding="ISO-8859-6"?>'
- change the XML declaration to '<?xml version="1.0" encoding="UTF-8"?>'
- do nothing: the declaration is fine as it is.
Although both of the first two options would work (the first only if the file is actually saved as ISO-8859-6), the current declaration is fine as it is. This is because XML parsers assume that the document encoding is UTF-8 if it is not specified, and UTF-8 can encode all the Arabic characters.
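As a quick check of this answer, the following Python sketch parses Arabic text saved as UTF-8 with no encoding declaration at all. The element name `greeting` and the sample word are assumptions chosen for the illustration; the letters are written as escapes only to avoid right-to-left rendering problems in the listing.

```python
import xml.etree.ElementTree as ET

# The Arabic word "marhaba" (hello), written with Unicode escapes.
arabic = "\u0645\u0631\u062d\u0628\u0627"

# Save the document as UTF-8 bytes with no XML declaration.
document = f"<greeting>{arabic}</greeting>".encode("utf-8")

# The parser falls back to UTF-8 and recovers the original text.
parsed = ET.fromstring(document).text

print(parsed == arabic, len(arabic), len(arabic.encode("utf-8")))
```

Each of the five Arabic letters takes two bytes in UTF-8, whereas ISO-8859-6 would store each in one byte but could not hold characters from most other alphabets.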
Yes. To do so you must declare the image as an external entity, mark it as unparsed (non-XML) data and declare the image format as a notation. You can then assign this entity to an attribute of an empty element. Here's an example:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE koala [
  <!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
  <!NOTATION gif PUBLIC "image/gif">
  <!ELEMENT koala (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image source ENTITY #REQUIRED>
]>
<koala>
  <image source="koalaimage"/>
</koala>
The entity declaration at line 3 declares an entity called "koalaimage" that points to an external file "koala.gif". The entity is also marked as unparsed data using the keyword 'NDATA', which is followed by the "gif" notation name that is declared at line 4 using the "NOTATION" keyword. Finally, the "source" attribute of the empty "image" element is set to the name of our image entity, i.e. "koalaimage".
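An XML parser does not load the image itself; it only reports the declarations so that the application can decide what to do with the file. A minimal Python sketch of this, using the handler hooks of the standard `xml.parsers.expat` module, might look as follows. The function name `collect_declarations` and the two dictionaries are assumptions made for this illustration.

```python
import xml.parsers.expat

# The koala document from the example above.
DOC = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE koala [
  <!ENTITY koalaimage SYSTEM "koala.gif" NDATA gif>
  <!NOTATION gif PUBLIC "image/gif">
  <!ELEMENT koala (image)>
  <!ELEMENT image EMPTY>
  <!ATTLIST image source ENTITY #REQUIRED>
]>
<koala>
  <image source="koalaimage"/>
</koala>
"""

def collect_declarations(data):
    """Return (entities, notations) as reported by the parser for the DTD in data."""
    entities = {}
    notations = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_unparsed_entity(name, base, system_id, public_id, notation_name):
        # Record which file the entity points to and which notation describes it.
        entities[name] = (system_id, notation_name)

    def on_notation(name, base, system_id, public_id):
        # Record the public identifier given for the notation.
        notations[name] = public_id

    parser.UnparsedEntityDeclHandler = on_unparsed_entity
    parser.NotationDeclHandler = on_notation
    parser.Parse(data, True)
    return entities, notations

entities, notations = collect_declarations(DOC)
print(entities, notations)
```

The application would then look up the value of the "source" attribute in `entities` to find the file name and its format.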
Longer Questions
From the specification given, the structure of this book's XML document can be described with the following diagram:
Given this information, a suitable DTD for this specification would be as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!ENTITY pub "STC Press, Malta">
<!ENTITY chap1 SYSTEM "chap1.xml">
<!ENTITY chap2 SYSTEM "chap2.xml">
<!ENTITY chap3 SYSTEM "chap3.xml">
<!ELEMENT book (titlePage, titlePageVerso, contents, chapter+)>
<!ELEMENT titlePage (bookTitle, author+, publisher)>
<!ELEMENT bookTitle (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT titlePageVerso (copyright, publishedBy, ISBN)>
<!ELEMENT copyright (#PCDATA)>
<!ELEMENT publishedBy (#PCDATA)>
<!ELEMENT ISBN (#PCDATA)>
<!ELEMENT contents (chapterName+)>
<!ELEMENT chapterName (#PCDATA)>
<!ATTLIST chapterName number CDATA #REQUIRED>
<!ELEMENT chapter (text)>
<!ATTLIST chapter number CDATA #REQUIRED name CDATA #REQUIRED>
<!ELEMENT text (#PCDATA)>
From the details given, one can see that the publisher name appears more than once in the XML document: in the title page and also in the title page verso. This makes the publisher name an ideal candidate for an entity, as declared in line 2 of the DTD above. Furthermore, since the chapters of the book are each stored as a separate XML file, an entity for each chapter was declared (lines 3 to 5), each pointing to the corresponding external file. The rest of the DTD is quite straightforward, declaring the remaining elements and attributes.
Here's what the book's XML document looks like:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book SYSTEM "Lab06_book.dtd">
<book>
  <titlePage>
    <bookTitle>Toba: the worst volcanic eruption of all</bookTitle>
    <author>John</author>
    <author>Jack</author>
    <author>Jill</author>
    <author>Joe</author>
    <publisher>&pub;</publisher>
  </titlePage>
  <titlePageVerso>
    <copyright>Copyright 2010 STC Press</copyright>
    <publishedBy>&pub;</publishedBy>
    <ISBN>978-0-596-52722-0</ISBN>
  </titlePageVerso>
  <contents>
    <chapterName number="1">The Mystery of Lake Toba's origins</chapterName>
    <chapterName number="2">Volcanic Winter</chapterName>
    <chapterName number="3">What Toba did to the human race</chapterName>
  </contents>
  <chapter number="1" name="The Mystery of Lake Toba's origins">&chap1;</chapter>
  <chapter number="2" name="Volcanic Winter">&chap2;</chapter>
  <chapter number="3" name="What Toba did to the human race">&chap3;</chapter>
</book>
The XML document references the DTD file explained earlier and makes use of the entities declared within that DTD to refer to the publisher name (lines 10 and 14) and the three book chapters (lines 22 to 24).
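To see the publisher entity being expanded, here is a small Python sketch. It rests on two assumptions of mine: the "pub" entity is moved into an internal DTD subset, because the standard library parser does not load an external DTD file such as "Lab06_book.dtd", and the document is trimmed to the two elements that use the entity.

```python
import xml.etree.ElementTree as ET

# A trimmed version of the book document. The entity is declared in an internal
# subset here (an assumption for this sketch) instead of the external DTD file.
DOC = """<!DOCTYPE book [
  <!ENTITY pub "STC Press, Malta">
]>
<book>
  <titlePage>
    <publisher>&pub;</publisher>
  </titlePage>
  <titlePageVerso>
    <publishedBy>&pub;</publishedBy>
  </titlePageVerso>
</book>"""

root = ET.fromstring(DOC)

# Both references are replaced by the entity's text during parsing.
publisher = root.findtext("titlePage/publisher")
published_by = root.findtext("titlePageVerso/publishedBy")

print(publisher, "|", published_by)
```

Changing the entity's value in one place updates every element that references it, which is the reason for declaring the publisher name as an entity in the first place.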