CharacterEncoding

Character Encoding Issues

Summary of Best Practices

  • Specify the UTF-8 character encoding within the XML declaration.
  • Ensure all encodings within an OAI record are valid UTF-8.
  • Use hexadecimal or decimal numerical entities instead of named entities (except for & < > " ').
  • Escape required characters within URLs.

Proper character encoding is essential in XML. Because XML is meant as an interchange format, a given XML document (such as an OAI metadata record) may be opened in many different environments and platforms, each of which may have its own default character encoding. If an XML instance does not specify its character encoding, the intended encoding may be lost when the instance is used in other environments, often making the instance impossible to parse or validate. In less extreme situations, diacritics or non-Latin characters may be corrupted.

Common Pitfalls

XML Declaration

Perhaps the most common pitfall with character encoding in XML is failing to specify the encoding correctly in the XML declaration. The XML declaration should be the first line of any XML instance document, and it has the same <? ... ?> form as an XML processing instruction. Its encoding attribute can name many different encodings or code pages, whether Unicode, standard, or proprietary. Examples include:

  • <?xml version="1.0" encoding="UTF-8"?> (for a variable-length 8-bit Unicode encoding)
  • <?xml version="1.0" encoding="ISO-8859-1"?> (for the Latin-1 single-byte 8-bit encoding)
  • <?xml version="1.0" encoding="windows-1252"?> (for a MS Windows-specific encoding)

The OAI protocol requires UTF-8 encoding for all instances, so all OAI responses must start with the UTF-8 declaration above. UTF-8 is the default encoding for XML, and most modern XML authoring tools automatically encode new XML instances in UTF-8. However, this is not always the case: it is not uncommon for diacritics, symbols, and the like to be entered in a different encoding because the software in use defaults to one. So ALWAYS be sure that the XML encoding is set to UTF-8.
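Both requirements can be checked against the bytes a data provider actually serves. The following is a minimal sketch, not a full validator: the function name check_utf8_response is hypothetical, and it assumes the raw response body is already available as bytes.

```python
import re

# Matches the encoding attribute of an XML declaration at the very start of the document.
_DECL = re.compile(rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')

def check_utf8_response(raw: bytes) -> list:
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    # Skip an optional UTF-8 byte order mark before looking for the declaration.
    body = raw[3:] if raw.startswith(b"\xef\xbb\xbf") else raw
    match = _DECL.match(body)
    if match is None:
        problems.append("no encoding given in the XML declaration")
    elif match.group(1).lower() != b"utf-8":
        problems.append("declared encoding is " + match.group(1).decode("ascii"))
    try:
        body.decode("utf-8")  # strict decoding rejects invalid byte sequences
    except UnicodeDecodeError as err:
        problems.append("invalid UTF-8 at byte offset %d" % err.start)
    return problems
```

A response that declares UTF-8 but was written out as Latin-1 passes the first check and fails the second, which is the situation described above.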

There are good links to further documentation for UTF-8 and the Unicode standard at http://www.utf-8.com/ .

Use of character entities

Characters (or code points in Unicode terminology) outside the simple ASCII range 32-127 (&#x20; to &#x7F;) must be encoded either as multi-byte UTF-8 sequences or as numerical entities. In environments that do not natively support UTF-8 it is often easier to use numerical entities. This choice should make no difference to harvesters and service providers because XML parsers accept the two forms interchangeably.
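The equivalence of the two forms can be confirmed with any XML parser. This short sketch uses Python's standard xml.etree.ElementTree module; the element name t is a placeholder chosen for illustration.

```python
import xml.etree.ElementTree as ET

# The same one-character element content written three ways.
forms = {
    "hexadecimal": b"<t>&#xE1;</t>",
    "decimal": b"<t>&#225;</t>",
    "literal UTF-8": "<t>\u00e1</t>".encode("utf-8"),
}

parsed = {name: ET.fromstring(doc).text for name, doc in forms.items()}

# All three documents yield the same single code point, U+00E1.
assert set(parsed.values()) == {"\u00e1"}
```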

Another common pitfall is for encoders to insert characters such as diacritics and symbols into XML instances using named entities. For example, to use an a with an acute accent (á) in an instance, XML allows the use of:

  • hexadecimal numerical entities (&#xE1; for a-acute)
  • decimal numerical entities (&#225; for a-acute)
  • named entities (&aacute; for a-acute) -- not recommended

The use of named entities will not work for the XML Schema validated instances used within OAI. This is because XML Schema does not allow for DTD-style entity declarations, whether internal or external to an XML instance. The exceptions are the five named entities that are predefined in XML and recognized by all XML authoring tools and parsers:

  • &amp; = & (ampersand)
  • &lt; = < (left angle bracket, less-than sign)
  • &gt; = > (right angle bracket, greater-than sign)
  • &quot; = " (quotation mark)
  • &apos; = ' (apostrophe)
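Where existing metadata already contains named entities, they can be rewritten mechanically. The sketch below is one possible approach, assuming the names in use are the HTML 4 entity names listed in Python's html.entities module; the function name named_to_numeric is hypothetical. The five predefined entities and any unrecognized names are left untouched.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser knows; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def named_to_numeric(text: str) -> str:
    """Rewrite HTML-style named entities as hexadecimal numerical entities."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep predefined or unknown names as they are
        return "&#x{:X};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)

print(named_to_numeric("<t>&aacute; &amp; &lt;</t>"))  # <t>&#xE1; &amp; &lt;</t>
```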

There are some subtleties as to when the characters > " and ' need to be encoded as entities. However, outside of XML markup itself it is always safe to encode all five characters using the named entities, and this practice is recommended. For gruesome details see: http://www.w3.org/TR/2004/REC-xml-20040204/#syntax .
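Escaping all five characters is a one-line operation in most languages. As an illustration, Python's standard xml.sax.saxutils.escape function handles & < and > by default and accepts a mapping for additional characters; the wrapper name escape_all_five is hypothetical.

```python
from xml.sax.saxutils import escape

# escape() handles & < > by default; this mapping adds the two quote characters.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def escape_all_five(text: str) -> str:
    """Replace & < > " ' in character data with the five predefined entities."""
    return escape(text, QUOTES)
```

The function is meant for text content and attribute values only, never for the markup around them.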

Hexadecimal numerical character entities are advisable only when the authoring software cannot represent the necessary characters directly in UTF-8. Most operating systems and file systems support UTF-8 encoding, and modern systems and software can handle most characters and common scientific symbols from any language or discipline. Therefore, in reference to the above example using the a-acute, it is recommended to simply enter a "What You See Is What You Get" UTF-8 encoded a-acute directly into the instance using an editor that is UTF-8 aware. However, beware of problems when transferring data across operating systems and programs, which may corrupt the encoding. We recommend checking the actual XML responses served by your data provider for proper encoding of non-ASCII characters.
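For the fallback case, where some tool in the chain cannot carry UTF-8 reliably, non-ASCII characters can be converted to numerical entities automatically. This sketch relies on Python's built-in xmlcharrefreplace error handler, which produces decimal entities; the function name is hypothetical, and the input is assumed to be already XML-escaped text.

```python
def to_ascii_with_entities(text: str) -> bytes:
    """Encode to ASCII, writing each non-ASCII character as a decimal numerical entity."""
    # Only characters above U+007F are rewritten; & < > " ' pass through unchanged.
    return text.encode("ascii", errors="xmlcharrefreplace")
```

Because pure ASCII is also valid UTF-8, the output can still be served under a UTF-8 declaration.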

Encoding URLs inside XML instances

HTTP (Hypertext Transfer Protocol) URLs make heavy use of characters that often prove problematic for inclusion in XML without escaping. Characters such as ampersands, which HTTP GET URLs use to separate parameter values sent to a web server, should be escaped in URLs referenced by XML metadata instances. For instance, consider the following URL used inside a Dublin Core identifier element:

<dc:identifier> http://ideals.uiuc.edu/dspace-oai/request?verb=ListRecords&metadataPrefix=oai_dc </dc:identifier>

Running this fragment through an XML parser would result in an error, because the parser would see:

&metadataPrefix=oai_dc

as a malformed entity reference (notice the beginning ampersand and the lack of a trailing semicolon). Remember, XML entity references have the syntax &{variousCharacters}; . The above fragment begins with an ampersand, thus the parser expects an entity. Because the parser thinks the entity reference is lacking the closing semicolon, it returns an error. To get around this, the ampersand must be represented as the entity &amp; :

<dc:identifier> http://ideals.uiuc.edu/dspace-oai/request?verb=ListRecords&amp;metadataPrefix=oai_dc </dc:identifier>

When the XML parser receives the &amp; entity it translates it back into an ampersand (&) in the text of the DC identifier, so the instance parses successfully and the resulting URL can be used to issue the HTTP request. Note that it is not correct to URL-encode the ampersand as %26: although this will pass through XML parsers correctly, it will invalidate the HTTP GET request (the verb parameter in the example above would be interpreted to have the value ListRecords&metadataPrefix=oai_dc instead of there being two parameters, verb and metadataPrefix).
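The difference between the two escapes can be seen by parsing the element and then splitting the query string. This is a small sketch using Python's standard library; the helper name query_params is hypothetical, and the namespace declaration is added so that the fragment parses on its own.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

DC = "http://purl.org/dc/elements/1.1/"

def query_params(identifier_xml: str) -> dict:
    """Parse a dc:identifier element and return the query parameters of its URL."""
    url = ET.fromstring(identifier_xml).text.strip()
    return parse_qs(urlsplit(url).query)

correct = ('<dc:identifier xmlns:dc="%s"> http://ideals.uiuc.edu/dspace-oai/request'
           '?verb=ListRecords&amp;metadataPrefix=oai_dc </dc:identifier>' % DC)
# The same element with the separator URL-encoded instead of XML-escaped.
wrong = correct.replace("&amp;", "%26")

print(query_params(correct))  # two parameters: verb and metadataPrefix
print(query_params(wrong))    # one parameter: verb, with the rest folded into its value
```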

Escaping the characters & < > " and ' must be performed in addition to any URL-escaping necessary to build a valid URL (e.g. replacing spaces with %20, or ampersands that do not separate parameters with %26). This is sometimes referred to as "double escaping" but, in fact, URL-escape sequences (which have the form %XX, where each X is a hexadecimal digit) are unaffected by subsequent XML escaping, because they contain none of the five characters.
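The two steps can be kept separate in code: build a valid URL first, then XML-escape the finished string. The sketch below assumes Python's standard library; the function name identifier_text, the base URL, and the parameter values used with it are hypothetical.

```python
from urllib.parse import quote, urlencode
from xml.sax.saxutils import escape

def identifier_text(base_url: str, params: dict) -> str:
    """Return a URL, with query parameters, that is safe to place inside an XML element."""
    # Step 1: URL-escape each parameter name and value (spaces become %20, & becomes %26).
    url = base_url + "?" + urlencode(params, quote_via=quote)
    # Step 2: XML-escape the finished URL; the %XX sequences pass through unchanged.
    return escape(url, {'"': "&quot;", "'": "&apos;"})

text = identifier_text("http://example.org/oai",
                       {"verb": "ListRecords", "set": "maps & atlases"})
print(text)  # http://example.org/oai?verb=ListRecords&amp;set=maps%20%26%20atlases
```

Only the ampersand that separates the two parameters is written as &amp;, while the ampersand inside the value stays as %26.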

Avoiding Common Pitfalls

Based on the discussion above, there are several things that data providers can do to avoid character encoding problems:

  • Use an editor that supports UTF-8 encoding.
  • Confirm that the editor is actually saving your files as UTF-8.
  • Specify UTF-8 in the encoding attribute of the XML declaration in your XML documents.
  • Use UTF-8 character representations rather than character entities when possible.
  • Escape problematic characters in complex URLs.

Additional Sources of Information