Character Encoding issues
From DLXS Documentation
Character Encoding Issues
One issue that you might run into, especially if you are migrating finding aids from SGML EAD 1.0 to XML EAD 2002, is the handling of special characters. If you are authoring finding aids in multiple languages in XML with an XML authoring tool, this is unlikely to be a problem for you: you are aware of the issues, and UTF-8 is the default encoding for XML. You'll just want to make sure to index with the UTF-8 enabled version of XPAT, as was discussed earlier. If you have finding aids with multiple languages and/or special characters, you've probably thought this through already. However, if you have the occasional e acute (é) in your SGML finding aid, you'll need to think about what you want to do with these characters. A straight conversion from SGML to XML will probably convert the character entities in your files (for example, &eacute;) to numeric character references (for example, &#233;). While this is valid, it presents a problem for searching. XPAT will treat the reference as a literal string of characters, so in order to search for blesséd, you would need to key in bless&#233;d. If all your special characters are ISO Latin 1, you can convert them to their 8-bit equivalents and index as usual. If you have a mixture, UTF-8 is the way to go. Again, this is merely a heads up that will have no bearing on the sample finding aids, which were chosen for their size and linking behaviors, and which are sadly conventional in their use of character entities (ampersand only, in fact).
[../conversion/index.html More information on character encoding issues ]
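The conversion described above, replacing character references with the characters themselves before indexing, can be sketched in Python. This is a minimal illustration and not part of the DLXS tooling: the function name decode_character_references is hypothetical, and the named-entity lookup relies on Python's built-in HTML entity table, which is assumed here to cover the ISO Latin 1 names (such as eacute) used in your finding aids. Entities declared only in your own DTD are left untouched, as are the five entities XML itself predefines, since those must stay escaped for the file to remain well-formed.

```python
import re
from html.entities import name2codepoint

# Entities predefined by XML; these must stay escaped in the output.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
# Code points of the characters those entities stand for: & < > " '
XML_SPECIAL_CODEPOINTS = {38, 60, 62, 34, 39}

# Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
REFERENCE_RE = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def _replace(match):
    body = match.group(1)
    if body.startswith("#"):
        # Numeric reference: hexadecimal if it starts with #x, else decimal.
        if body[1] in "xX":
            codepoint = int(body[2:], 16)
        else:
            codepoint = int(body[1:])
        # Leave markup-significant characters and invalid code points alone.
        if codepoint in XML_SPECIAL_CODEPOINTS:
            return match.group(0)
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            return match.group(0)
        return chr(codepoint)
    # Named reference: convert only names found in the lookup table.
    if body in XML_PREDEFINED or body not in name2codepoint:
        return match.group(0)
    return chr(name2codepoint[body])


def decode_character_references(text):
    """Replace character references in text with the characters they name."""
    return REFERENCE_RE.sub(_replace, text)
```

To produce a file for the UTF-8 enabled version of XPAT, the decoded text would be written out with open(path, "w", encoding="utf-8"). If every special character is ISO Latin 1 and you prefer the 8-bit route, writing with encoding="latin-1" instead yields the 8-bit equivalents; that call raises an error if any character falls outside Latin 1, which is a useful signal that the collection is in fact a mixture.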