Working with Unicode

From DLXS Documentation

Revision as of 18:11, 19 July 2007 by Cboulay

This document describes in some detail the issues involved in Unicode data preparation and indexing, middleware configuration, template issues and user input. In its data preparation and indexing aspect, it is mainly applicable to TextClass, BibClass and FindaidClass. With respect to the remaining issues, it relates to all the classes.

For information on data preparation for individual classes that is not specific to Unicode, see the following:

  • Preparing Text Class Data for Index Building / Converting Collections to Text Class
  • Image Class (where is this section?) / Image Class Data Loading: MySQL
  • Transforming Bibliographic Class Files
  • Mounting a Finding Aids Collection: Preparing Data and Directories

About Unicode

The authoritative source for information about Unicode is the Unicode Consortium. There you will find the complete standard and many helpful links to other sources of information on Unicode. Even if you are familiar with Unicode, it may help to review some basic definitions:

Character Repertoire: a collection of abstract characters independent of how they look when printed.

Coded Character Set: an assignment of a unique number to each character in a Character Repertoire.

Code Point: the unique number the ISO/IEC 10646 Coded Character Set assigns to virtually every character in all the world's alphabets.

Character Encoding Scheme: Unicode is a standard built on top of ISO/IEC 10646 that, in addition to specifying the assignment of number to character, deals with things like collation, bi-directionality, normalization and, most importantly, encoding. A Character Encoding Scheme (encoding) specifies how the number that stands for a character is stored in a file or in computer memory.

The Unicode Standard defines several Character Encoding Schemes, but the one of interest to us is UTF-8. The UTF-8 encoding of the Unicode Coded Character Set is the preferred encoding for Unicode on the Web. It is a multi-byte encoding: it uses from 1 to 4 bytes to encode the Unicode Code Point (number) of a given character. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current definition stops at 4.) For the code points 0-7F hex, UTF-8 and US-ASCII are identical, so each of those characters takes a single byte. Above 7F, 2 or more bytes are required to encode the number assigned to a Unicode character. With Unicode it is possible for one document to contain characters from many different alphabets and to treat them uniformly for search purposes.
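The byte counts described above can be observed directly. The following is a minimal sketch, assuming a POSIX shell; the helper name utf8_len is made up for this illustration, and the characters are given as octal escapes so the script itself stays pure ASCII.

```shell
# Count the bytes UTF-8 uses for one character.
# The argument is a printf format string, so \ddd octal escapes are expanded.
utf8_len() {
    printf "$1" | wc -c | tr -d ' '
}

utf8_len 'A'              # U+0041, ASCII range: prints 1
utf8_len '\303\251'       # U+00E9, e with acute: prints 2
utf8_len '\316\261'       # U+03B1, Greek small alpha: prints 2
utf8_len '\344\270\255'   # U+4E2D, a Han character: prints 3
```

Only the first character falls in the 0-7F range, which is why it alone is byte-for-byte identical to its US-ASCII form.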

Reasons to Use Unicode

  • Can represent more than one alphabet in a single document or web page
  • Searchable
  • Programming is simpler
  • User can easily enter Latin characters via XPAT mapping functionality
  • Users can enter non-ASCII characters via national keyboards, virtual keyboards, IMEs, or copy/paste
  • Can be collated
  • Fundamental to XML
  • Better font support than for character entity references

DLXS Background

Prior to release 12, DLXS depended on a variety of mechanisms to handle non-ASCII character data. These included:

  • The use of SGML character entity references (CERs) such as &Acirc; in the data. These were mapped to single-character GIF images to display certain characters unavailable in typical browser fonts. The problem with this mechanism was that unless the user was knowledgeable enough to type the actual 7 character sequence "&Acirc;" instead of A, for example, the search failed.
  • The replacement of CERs with the corresponding ISO-8859-1 encoded character. By mapping this (typically) accented character to its unaccented ASCII equivalent, DLXS could and still can find words that contain either the accented or unaccented form of the character. This works fine but, as noted in the introduction, limits the document to a single encoding such as Latin1. In a single document one can cover German+Polish with Latin2 or German+Turkish with Latin5 but there is no single-byte encoding to properly mix German+Russian, for instance.
  • Making certain uppercase letters in the user's input stand for certain characters like Thorn or Eth and "stealing" unused 8bit values to replace these CERs in the data during conversion. This was a very cumbersome process involving custom programming and involved use of mapping in XPAT indexing and searching.

These mechanisms are not required if the data is in Unicode, especially now that Unicode fonts are widely available in the current generation of web browsers.

Platform Requirements

It is necessary to use the latest software versions recommended in DLXS System Requirements. There are a few terminal emulators that handle UTF-8 encoded Unicode reasonably well:

  • xterm run as
     xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
     If running under Windows you need Version 8 of Hummingbird Exceed X Server, at least.
  • Natively, under Windows PuTTY is good. Under PuTTY Preferences->Translation select UTF-8.

Terminal emulators

  • Do you see Ã© instead of é? Or: What You See Is Not (Always) What You Have (WYSINAWYH).
  • Linux
    • GNOME terminal
    • xterm -u8 -fn '-misc-fixed-medium-r-semicondensed--13-120-75-75-c-60-iso10646-1'
    • Bitstream Cyberbit and MS Arial Unicode fonts
  • Windows
    • PuTTY with Hummingbird Exceed X Server version 8 or higher on Windows
    • MS Arial Unicode
    • XMLSpy

Tools

The goal is to get your data into UTF-8 encoded XML. You need to know how characters in your data have been encoded in order to transform to another encoding.

  • iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
  • DLXSROOT/bin/t/text/ncr2utf8
  • xpatutf8check
  • jhove -c /l/local/jhove/conf/jhove.conf -m utf8-hul
  • DLXSROOT/bin/t/text/utf8chars
  • OpenSP osx
  • XMLSpy
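Independently of the DLXS-specific checkers listed above, a quick first test of whether data is already well-formed UTF-8 is to ask iconv to read it as UTF-8 and see whether it complains. This is only a sketch: the function name is_utf8 is an assumption of this example, and a pass does not prove the data is meaningful, since pure ASCII and some Latin-1 byte sequences also pass.

```shell
# Succeeds (exit status 0) only if standard input is well-formed UTF-8.
# iconv exits non-zero when it meets an invalid or incomplete sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# "cafe" with e-acute, first as UTF-8 bytes, then as a single Latin-1 byte.
printf 'caf\303\251' | is_utf8 && echo "valid UTF-8"
printf 'caf\351'     | is_utf8 || echo "not UTF-8 (perhaps Latin-1)"
```

For a file, the same check can be run as is_utf8 < infile.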

DLPS Unicode Examples

  • OAIster is 100% UTF-8 encoded XML indexed by xpatbldu and multirgn and searched using xpatu.
  • Supports Latin, Greek, Cyrillic, Han, Hiragana, Katakana and Hangul characters.
  • Highlighting based on .dd file character mappings.
  • OAIster data dictionary
  • Workshop example is 100% UTF-8 encoded XML containing English, French and Greek, indexed by xpatbldu and xmlrgn and searched using xpatu. Wordwheel?

Middleware Configuration, Requirements and Behavior for Unicode

XPAT version 5.3.2

  • 5.3 XPAT can read 5.2 indexes, i.e. 5.3 is backward compatible
  • 5.2 XPAT cannot read 5.3 indexes

Perl 5.8.3 or higher is required. 5.8.8 is better. Avoid 5.8.6 (debugger problem).

Configuration and Behavior

To make legacy Latin-1 encoded SGML data work:

  • The collection manager (collmgr) locale field should be set to en_US to use xpat instead of xpatu to read the index.
  • If there are character entity references like "&eacute;", declare them in the DLXSROOT/web/(c)/(collection)/entitiesdoctype.chnk file (copied from DLXSROOT/misc/sgml), if not already present in that file.

The basic assumption is that ANY input (user typed or search results from XPAT) is UTF-8 encoded XML. Why? How? From what encoding?

  • User input that is not valid UTF-8 is transcoded into UTF-8, and reserved characters are turned into character entity references like &amp;. Why? What effect on searching for tags? debug=qmap
  • Search results from XPAT are processed through the DlpsUtils::Sgml2XmlFilter to transcode them into UTF-8 and to change SGML-style singletons (e.g. <LB>) to XML-style singletons (e.g. <LB/>).

Downside: Searches for accented characters will fail in Latin-1 collections because the user's search term will be converted to UTF-8 but the collection data will be Latin-1. Unaccented searches will still work.

All XML templates have <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> elements to ensure user input is UTF-8 and to tell the browser to use UTF-8 encoding when rendering the page content.


Data Conversion

If your data does not come to you in Unicode UTF-8 encoded XML, conversion is necessary. A typical conversion might be as follows. Depending on what form your data takes, you may need to perform only one of (A) or (B): non-ASCII characters in your data may be represented by entities or encoded directly in, for instance, ISO-8859-1. It is also possible that both steps (A) and (B) are required.

A useful reference to Unicode characters is the file UnicodeData.txt available from the Unicode Consortium and delivered with Perl 5.8 under, for example, PERLROOT/perl/lib/5.8.3/unicore/.

(A) Convert the data to the Unicode UTF-8 encoding

Use the iconv program. The following example on Linux assumes your data is initially encoded in ISO-8859-1/Latin1:

   iconv -c -f ISO-8859-1 -t UTF-8 -o outfile infile
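One way to check that the iconv conversion behaves as expected is to feed it a single known Latin-1 byte and look at the bytes that come out. This is a minimal sketch; the helper name latin1_to_utf8_hex is hypothetical and exists only for this illustration.

```shell
# Convert Latin-1 input (given as a printf format with octal escapes)
# to UTF-8 and print the resulting bytes as one hexadecimal string.
latin1_to_utf8_hex() {
    printf "$1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

# 0xE9 (octal 351) is e with acute in Latin-1; in UTF-8 it becomes C3 A9.
latin1_to_utf8_hex '\351'   # prints c3a9
```

A single byte above 7F turning into two bytes is the expected result; ASCII bytes pass through unchanged.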

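Alternatively, use the Perl Unicode::MapUTF8 module in a script like the following:

   #!/l/local/bin/perl -w
   # Read Latin-1 text from STDIN or from named files and print it as UTF-8.
   use strict;
   use Unicode::MapUTF8 qw(to_utf8);
   while (<>) {
       # Transcode one line at a time from ISO-8859-1 to UTF-8.
       print to_utf8({ -string => $_, -charset => 'ISO-8859-1' });
   }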

Alternatively, use a program like XMLSpy to read in your file and write it out UTF-8 encoded.

(B) Convert numeric character references (NCRs) and SGML character entity references (CERs) to Unicode UTF-8 encoded characters

Since your ultimate goal is to have UTF-8 encoded XML, recall that XML has 5 predefined CERs which you do not need to convert and which the utilities described below do not touch. They are &amp;, &lt;, &gt;, &apos; and &quot;.

Programs such as XMLSpy or osx may do the needed conversions for you but vary in their handling of SGML SDATA and NDATA entities. In some cases you may benefit from use of the following two utilities in addition.

For NCRs, i.e. references of the form &#DDDD; where D is a decimal digit or &#xXXXX; where X is a hexadecimal digit, you can use the DLXS utility program DLXSROOT/bin/t/text/ncr2utf8 run as:

   ncr2utf8 inputfile > outputfile

For CERs, e.g. references like &Aring;, you may need to analyze the references present in your data. The program DLXSROOT/bin/t/text/findEntities.pl will generate a list of CERs in your data.

It is likely that most or even all CERs in your data will come from one of the ISO Character Entity Sets: ISOamsa, ISOamsb, ISOamsc, ISOamsn, ISOamso, ISOamsr, ISOcyr1, ISOcyr2, ISOgrk1, ISOgrk2, ISOgrk3, ISOgrk4, ISOlat1, ISOlat2, ISOmfrk, ISOnum, ISOpub, ISOtech, MMLalias or MMLextra. You can use DLXSROOT/bin/t/text/isocer2utf8 run as:

   isocer2utf8 inputfile > outputfile

to translate these CERs directly to UTF-8. Running findEntities.pl after this will identify any CERs outside these ISO sets.

Another option is to use an SGML parser like onsgmls together with Character Entity Declarations that substitute the Unicode NCR for the CER in the parsed output followed by a run of ncr2utf8 to complete the conversion.

Note that if you started with SGML, you may need to touch up the SGML to make it (and its DTD) XML compliant if you rely solely on the small utility programs supplied with the DLXS release. This process is outside the scope of this document (but see DLXSROOT/misc/sgml/textclass.stripped.xml.dtd for an example of the XML version of textclass.dtd). At this point you should have UTF-8 encoded XML data ready to index.

Indexing

Refer to files in DLXSROOT/prep/s/sampletc_utf8 and DLXSROOT/bin/s/sampletc_utf8 for the following discussion.

DLXS delivers a Makefile to take you through the process of building the main XPAT index and the fabricated region indexes. The process is very similar for Latin1 encoded SGML data and UTF-8 encoded XML data. This process is outlined in TextClass Indexing. The main difference between the non-Unicode Makefile and the Unicode Makefile is that xpatbldu, xpatu and xmlrgn are used instead of xpatbld, xpat and sgmlrgn.

Be sure your XML data file begins with the XML declaration:

   <?xml version="1.0" encoding="UTF-8"?>

Without this declaration, xmlrgn will not build correct region indexes.
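Before running the Makefile it can be worth confirming that the declaration is really the first line of the data. The sketch below assumes a POSIX shell; the function name has_utf8_decl is made up for this example, and the check is deliberately strict: it accepts only the exact declaration shown above, which may be stricter than xmlrgn itself requires.

```shell
# Succeeds only if the first line of standard input starts with the
# XML declaration exactly as shown above.
has_utf8_decl() {
    head -n 1 | grep -q '^<?xml version="1\.0" encoding="UTF-8"?>'
}

printf '<?xml version="1.0" encoding="UTF-8"?>\n<TEXT/>\n' | has_utf8_decl \
    && echo "declaration present"
printf '<TEXT/>\n' | has_utf8_decl || echo "declaration missing"
```

For a data file, run it as has_utf8_decl < yourfile.xml.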

The most important input to the indexing process is the XPAT Data Dictionary. If your data spans several languages, especially those languages with non-Latin alphabets, you will need to configure a Data Dictionary that takes this into account. The sampletc_utf8.blank.dd can be used as a starting point and with some editing is sufficient for Latin based languages. There are two sections in the Data Dictionary that need attention: the Index Points and the Mappings.

Once these sections in the Data Dictionary have been configured the indexing process can proceed via the Makefile. Note that if you have XML element or attribute names that contain non-ASCII characters in your document you should use multirgn to generate the region indexes due to a limitation in xmlrgn. It is expected that this case is rare.

Index Point specification

This specification tells XPAT what points in the data to index. Typically, XPAT is directed to index and search beginning at an alphabetic character following a blank space, i.e. a word. Here is the Index Point specification section of the sampletc_utf8.blank.dd in prep:

     <IndexPoints>
       <IndexPt> &printable.</IndexPt>
       <IndexPt>&printable.-</IndexPt>
       <IndexPt>-&printable.</IndexPt>
       <IndexPt>&printable.&lt.</IndexPt>
       <IndexPt>&printable.&amp.</IndexPt>
       <IndexPt> &Latin.</IndexPt>
       <IndexPt>&Latin.-</IndexPt>
       <IndexPt>-&Latin.</IndexPt>
       <IndexPt>&Latin.&lt.</IndexPt>
       <IndexPt>&Latin.&amp.</IndexPt>
       <IndexPt> &Greek.</IndexPt>
       <IndexPt>&Greek.-</IndexPt>
       <IndexPt>-&Greek.</IndexPt>
       <IndexPt>&Greek.&lt.</IndexPt>
       <IndexPt>&Greek.&amp.</IndexPt>
     </IndexPoints>

The sampletc_utf8.xml data file contains characters from the Latin and Greek alphabets. Index points are defined for the characters from each of those alphabets using XPAT Unicode metacharacters like "&Latin." and "&Greek.". These metacharacters group Unicode characters into "blocks" which correspond roughly to alphabets. The document The XPAT Data Dictionary has a list of these Unicode metacharacters together with the characters that belong to each block (about midway through the section). If your character data is Latin-based it will probably suffice to simply remove the Greek elements from sampletc_utf8.blank.dd.

It is not advisable to specify all the blocks in order to create a "universal" Data Dictionary. This would impose a performance and memory penalty on XPAT at runtime.

Not all languages have a concept of upper and lower case.

Languages such as Chinese do not separate "words" with spaces. This presents a problem for XPAT. A partial solution is to specify every character to be an index point, shown here for the Hangul block:

<IndexPt>&Hangul.&Hangul.</IndexPt>

This would result in an index 4 times the size of the data and a large runtime memory requirement for the XPAT index point table and as of this writing should be considered experimental. There is a probability of false hits but that should decrease as the length of the query increases.

Mappings specification

Case insensitivity makes it easier for users to enter query terms. This is implemented in the Mappings section by mapping uppercase characters to their lowercase equivalent. Keyboards in the United States usually do not have keys for the accented characters used in European languages. These accented characters are mapped to their unaccented forms in the Mappings section. This allows search and retrieval whether the character appears accented or unaccented in the data. Apropos of Unicode, here is a part of the Mappings section devoted to mapping uppercase Greek to lowercase:

       ...
       <Map><From>U+0391</From><To>U+03B1</To></Map>
       <Map><From>U+0392</From><To>U+03B2</To></Map>
       <Map><From>U+0393</From><To>U+03B3</To></Map>
       <Map><From>U+0394</From><To>U+03B4</To></Map>
       <Map><From>U+0395</From><To>U+03B5</To></Map>
       ...

Note that the Greek characters are specified using the "U+" Unicode notation. The number following the "U+" is the Unicode Code Point for the character expressed in hexadecimal notation. From this one can see that the Data Dictionary can be built entirely from ASCII characters. It is not necessary to have a UTF-8 enabled editor. The XPAT Unicode implementation currently accepts values up to U+FFFF (65535). This covers all the characters defined in Unicode Plane 0, also referred to as the Basic Multilingual Plane.

While there are characters in higher planes they are relatively rare and this XPAT limitation is not expected to present an obstacle to indexing your Unicode-based texts. Should the need arise XPAT can be extended to use a full 32 bit word internally. As there is little need for this currently it is more memory efficient to use a 16 bit word to store characters in memory.
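Because the Mappings entries are plain ASCII and the Greek capitals sit at a fixed offset from their lowercase forms, entries like those shown above can be generated rather than typed. The following is a sketch under stated assumptions: the function name greek_case_maps is hypothetical, it covers only the basic capitals U+0391 to U+03A9, and it is not how the delivered sampletc_utf8.blank.dd was produced.

```shell
# Emit <Map> lines for the Greek capitals U+0391..U+03A9. Each lowercase
# letter is 0x20 (32) code points higher. U+03A2 is unassigned, so skip it.
greek_case_maps() {
    cp=913                                   # 0x0391, capital alpha
    while [ "$cp" -le 937 ]; do              # 0x03A9, capital omega
        if [ "$cp" -ne 930 ]; then           # 0x03A2, unassigned
            printf '<Map><From>U+%04X</From><To>U+%04X</To></Map>\n' \
                "$cp" "$((cp + 32))"
        fi
        cp=$((cp + 1))
    done
}

greek_case_maps
```

The first line printed maps U+0391 to U+03B1, matching the excerpt above. Accented Greek letters and the final sigma (U+03C2) need separate decisions and are not handled here.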

You will need to analyze your texts to decide what sort of mapping may be useful to your target audiences. There are many issues to consider. Input mechanisms dominate these considerations.

  • Do your users have Western European keyboards? If so, it is not necessary to map accented to unaccented characters, though the mapping is harmless and helps users who do not have such keyboards. The accented characters are indexed and accepted as input and can be retrieved from the text.
  • Do your target users have Input Method Editors readily available and know how to use them to enter non-Latin characters?
  • Do your users have antiquated browsers with poor font support for Unicode?

DLXS is exploring the addition of a configurable JavaScript popup virtual keyboard to allow users to enter characters from alphabets for which they lack a physical keyboard.

Collmgr Fields / Configuration

To put your data online you will need to define a collection in the collection database using Collmgr. There are two differences between a non-Unicode collection and a Unicode collection. First, as of Release 12, the Wordwheel is based solely on Unicode, so it is not possible to have a Wordwheel for a non-Unicode collection; for such a collection, leave the wwappmodule and wwdd fields blank. Second, to configure a Unicode collection, set the locale field to a UTF-8 locale value such as en_US.UTF-8. You can get a list of locale values recognized by your Unix system by typing locale -a at the shell prompt. A UTF-8 locale setting affects several areas of functionality in the middleware:

  • The middleware will use the xpatu search engine to search the collection data. This implies that the data was indexed by xpatbldu and xmlrgn/multirgn. This does not apply to ImageClass, which is migrating to MySQL searching. DLXS release 11 was the first release offering xpatu and xpatbldu.
  • The middleware will expect user input to be UTF-8 encoded. More on this below.
  • The middleware will send charset=UTF-8 to the browser when outputting processed HTML templates. This will cause the browser to interpret the output from the middleware as UTF-8 and select a Unicode font for display purposes. Browsers lacking a Unicode font will display characters in a garbled manner that includes the hollow rectangular box for some characters.
  • Perl's internal UTF-8 flag is set on string data in the middleware to handle multi-byte characters.
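To pick a value for the locale field, the output of locale -a can be narrowed to the UTF-8 locales. This is a small sketch; the function name utf8_locales is an assumption of this example, and the spelling of the suffix (for instance utf8 versus UTF-8) varies between systems, which is why the match is case-insensitive with an optional hyphen.

```shell
# Keep only locale names that mention UTF-8, in either common spelling.
utf8_locales() {
    grep -i -E 'utf-?8'
}

locale -a 2>/dev/null | utf8_locales || echo "no UTF-8 locale found"
```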

Templates

As of Release 12, with the exception of BibClass, all XML templates are transformed into HTML having a <META> tag with charset=utf-8. BibClass continues to use HTML templates. To allow the XML templates to continue to work for data from non-Unicode collections while at the same time supporting Unicode data, the non-Unicode data is converted at output time from iso-8859-1 to UTF-8 encoding. In the case of BibClass, the META tag charset value is set on output to either UTF-8 or iso-8859-1, depending on the encoding of the collection.

Unicode, User Input and Form Submission

The encoding of user input to HTML forms is a complex area not made any easier by browser bugs and standards that do not address the problem fully. The best discussion of this topic is by A. J. Flavell. Basically, the problem is that there is no reliable way for the browser to convey to the middleware what encoding is in effect for the data entered into a form by the user. Quoting Mr. Flavell:

"In practice, browsers normally display the contents of text fields according to the character coding (charset) that applies for the HTML page as a whole; and when it submits the text fields they are effectively in this same coding. Thus if the server sent out the (page containing the) form with a definite charset specification, it could normally assume that the submitted data can be interpreted in accordance with the same charset. There are however anomalies of various kinds, some of which have been seen and understood by the author of this note, some of which have been seen and not understood, and some of which are only anecdotal at the moment.
In addition to these considerations, some users may be typing-in or pasting-in text from an application that uses their local character coding (practical examples being macRoman on a Mac; or MS-DOS CP850 being copied out of a DOS window on an MS Windows PC), into a text field of a document that used the author's - different - character coding (let's say for the simplest example, iso-8859-1): the user might then submit the form, disregarding that what they are seeing in the text area is not what they intended to send. [...]"
Given this state of affairs we can see that user data entry is not 100% reliable. Nonetheless, it is reasonable to assume the following in a page sent by the middleware with charset=UTF-8:
  • Users typing at a plain old US keyboard are generating ASCII codes, which are valid UTF-8 as they stand. So if a text contains a mixture of non-Latin or accented Latin characters and character data from the ASCII character set (UTF-8 single-byte-encoded Unicode characters), it has the potential to be searched effectively from an ASCII keyboard.
  • Users copying from DLXS results in their browser window and pasting back into a DLXS search form are generating the UTF-8 encoded data expected by the middleware.
  • Users typing input via an Input Method Editor (IME) will generate UTF-8 data as expected by the middleware.
  • Users entering search strings via a JavaScript virtual keyboard will generate UTF-8 encoded data.
  • Users typing from national keyboards may enter UTF-8 if their system is properly configured.

Beyond these assertions it is impossible to generalize about how copying and pasting characters from arbitrary sources into an input field might be expected to behave.

Current Limitations in DLXS Middleware

The middleware supports collections with different character encodings in single collection mode and in cross-collection mode. However, the encodings are limited to Unicode UTF-8 and ISO-8859-1 (Latin1). Any user input that is not valid UTF-8 is assumed to be Latin1 encoded and is transcoded to UTF-8 under this assumption. Because ASCII is already valid UTF-8, ASCII input is not changed, and queries against Latin1-based XPAT collections will proceed successfully if the data dictionary maps accented characters to their unaccented base characters. A Latin1 XPAT search result is converted to UTF-8 to enable the data to pass through the XML/XSLT parsers on output and to display correctly in the web template, which is set to charset=UTF-8. This creates a minor deficiency if the user copies a string of accented characters from the results back into the search form: the characters are now UTF-8 encoded and will not be found in a Latin1 encoded collection.
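The input rule described above (keep valid UTF-8 as it is, otherwise assume Latin1 and transcode) can be sketched with iconv. This is an illustration of the rule only, not the middleware's Perl code; the function name normalize_input is hypothetical, and the sketch assumes mktemp is available.

```shell
# Read input on stdin and write UTF-8 on stdout: well-formed UTF-8 passes
# through untouched, anything else is treated as ISO-8859-1 and transcoded.
normalize_input() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    if iconv -f UTF-8 -t UTF-8 "$tmp" >/dev/null 2>&1; then
        cat "$tmp"                             # already valid UTF-8 (or ASCII)
    else
        iconv -f ISO-8859-1 -t UTF-8 "$tmp"    # assume Latin1
    fi
    rm -f "$tmp"
}

# "cafe" with e-acute, typed as Latin-1 and as UTF-8: both come out as
# the bytes 63 61 66 c3 a9.
printf 'caf\351'     | normalize_input | od -An -tx1
printf 'caf\303\251' | normalize_input | od -An -tx1
```

The sketch also shows the source of the deficiency noted above: after normalization the query is always UTF-8, so accented characters in it cannot match the bytes of a Latin1 encoded collection unless the data dictionary maps them to unaccented forms.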
