Character Encoding issues

From DLXS Documentation

(Difference between revisions)
Current revision (20:31, 15 August 2007)
 
(One intermediate revision not shown.)
Line 1: Line 1:
-
=== The use of IDs as attributes on elements. ===
 
-
I want to make it clear that we are NOT talking about the <span class="unixcommand">eadid</span> here, but refer to IDs used to identify the element so that it can be referred to, or referenced from, somewhere else. You no doubt all know that each ID within a document must be unique (and the DTD enforces this). However, you may not have thought about the consequences of joining all your finding aids into one collection. Your IDs will need to be unique across the entire collection. One way to ensure this is to prefix ID values with the eadid for a given document. At this time, there is no functionality in DLXS that requires you to have IDs on any elements, but you may have used them for your own internal purposes. We have run into this ourselves and I just wanted to give everyone a heads-up, on the theory that our problems are fairly typical.
 
-
The data preparation script preparedocs.pl appends a unique number to id attributes. This may or may not be appropriate for your collection. If you use id attributes in your EADs you should look at the code for preparedocs.pl and determine if you need to make appropriate changes. For more discussion of these issues and of preparedocs.pl see [#DataPrepStep2 concatenating the files into one larger XML file]
+
=== Character Encoding Issues ===
 +
 
 +
One issue that you might run into, especially if you are migrating finding aids from SGML EAD 1.0 to XML EAD 2002, is that of handling special characters. If you are authoring finding aids in multiple languages in XML using some XML authoring tool, this is unlikely to be a problem for you: you are aware of the issues, UTF-8 is the default encoding for XML, and you will have no problems. You'll just want to make sure to index with the UTF-8-enabled version of XPAT, as was discussed earlier. If you have finding aids with multiple languages and/or special characters, you've probably thought this through already. However, if you have the occasional e acute (é) in your SGML finding aid, you'll need to think about what you want to do with these characters. A straight conversion from SGML to XML will probably convert the character entities in your files (for example, &eacute;) to numeric entities (for example, &#233;). While this is valid, it will present a problem with regard to searching. XPAT will treat the numeric entity as a literal string of characters, so in order to search for blesséd, you would need to key in bless&#233;d. If all your special characters are ISO Latin 1, you can convert them to their 8-bit equivalents and index as usual. If you have a mixture, UTF-8 is the way to go. Again, this is merely a heads-up that will have no bearing on the sample finding aids, which were chosen for their size and linking behaviors, and which are sadly conventional in their use of character entities (ampersand only, in fact).
 +
 
 +
[../conversion/index.html More information on character encoding issues ]

Current revision


Character Encoding Issues

One issue that you might run into, especially if you are migrating finding aids from SGML EAD 1.0 to XML EAD 2002, is that of handling special characters. If you are authoring finding aids in multiple languages in XML using some XML authoring tool, this is unlikely to be a problem for you: you are aware of the issues, UTF-8 is the default encoding for XML, and you will have no problems. You'll just want to make sure to index with the UTF-8-enabled version of XPAT, as was discussed earlier. If you have finding aids with multiple languages and/or special characters, you've probably thought this through already. However, if you have the occasional e acute (é) in your SGML finding aid, you'll need to think about what you want to do with these characters. A straight conversion from SGML to XML will probably convert the character entities in your files (for example, &eacute;) to numeric entities (for example, &#233;). While this is valid, it will present a problem with regard to searching. XPAT will treat the numeric entity as a literal string of characters, so in order to search for blesséd, you would need to key in bless&#233;d. If all your special characters are ISO Latin 1, you can convert them to their 8-bit equivalents and index as usual. If you have a mixture, UTF-8 is the way to go. Again, this is merely a heads-up that will have no bearing on the sample finding aids, which were chosen for their size and linking behaviors, and which are sadly conventional in their use of character entities (ampersand only, in fact).
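As a rough illustration of the conversion described above, the following Python sketch replaces numeric character references with the characters themselves, so that the text can be saved as UTF-8 and indexed with the UTF-8-enabled XPAT. It is not part of DLXS; the function name `decode_numeric_refs` and the sample string are hypothetical, and the sketch assumes that references to markup-significant characters (such as the ampersand) should stay escaped so the XML remains well-formed.

```python
import re

# Code points that must stay escaped to keep the XML well-formed:
# " (34), & (38), ' (39), < (60), > (62).
KEEP_ESCAPED = {34, 38, 39, 60, 62}

# Matches decimal (&#233;) and hexadecimal (&#xE9;) character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_refs(text: str) -> str:
    """Replace numeric character references with the actual characters.

    Hypothetical helper, not part of DLXS. Named entities such as
    &eacute; are left untouched, as are references to characters
    that are significant in markup.
    """
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        is_surrogate = 0xD800 <= code_point <= 0xDFFF
        if code_point in KEEP_ESCAPED or is_surrogate or code_point > 0x10FFFF:
            # Leave the reference exactly as it was written.
            return match.group(0)
        return chr(code_point)

    return NUMERIC_REF.sub(replace, text)


if __name__ == "__main__":
    sample = "<p>bless&#233;d &#38; caf&#xE9;</p>"
    print(decode_numeric_refs(sample))
```

If you try something along these lines, write the result out explicitly as UTF-8 (for example with `open(path, "w", encoding="utf-8")`) and make sure the encoding named in the XML declaration matches. If all your characters are ISO Latin 1 and you prefer the 8-bit route, the same approach should work with `encoding="latin-1"` instead.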

[../conversion/index.html More information on character encoding issues ]
