Working with XML & XSLT
From DLXS Documentation
General Principles and Rules of Thumb
In re-architecting the DLXS system to use XML and XSL, UM established certain principles. These guide the writing of code, the creation of XML, and the writing of XSL in most cases. We have only broken the rules when there are good reasons to do so. Here is a list of some of the rules we created and their respective rationales:
Principle | Rationale |
---|---|
PIs are wrapped in the XML files with an XML element | The XSL templates, at least at the highest levels of the XML tree, can expect tags that are explicit in the XML file. |
XSL files to be used are listed in the XML file rather than using the more conventional <?xml-stylesheet ?> PI | Building a virtual stylesheet from the listed XSL files allows those XSL files to be located through fallback |
All URLs to be used by the CGI are built by the CGI | The XSL stylesheets should not have to "know" anything about what URL parameters are needed for the CGI to work |
Cookie rather than URL parameter holds session IDs | This allows for dynamic browsing (.tpl files no longer need to be built for browsing); allows one session even if a user switches from one class to another; the cookie can be deleted when the browser is quit; etc. |
Filtering of XML data is done by the XSL stylesheets | Separation of content and display, of perl code and user interface |
Most of the principles are really about division of labor between the different subsystems of DLXS.
There are rare exceptions to some of these principles. For example, data that comes from XPat is not always well-formed and needs to be "massaged" into well-formed-ness. Here are several instances where this occurs:
- Query "hits", aka keywords in context (KWICs), are snippets of text that come from the data. In earlier releases of DLXS, KWICs had all tags removed before being displayed. In Release 12a, while we still strip off beginning and ending "half tags" (where an open or close tag is incomplete because of where the snippet starts or ends), we now run the snippet through DlpsUtils::Twigify, which balances all existing tags and therefore produces a well-formed "twig" of XML. This means that collection-specific markup, left intact as much as possible, can be handled downstream by a collection-specific XSL stylesheet.
- In Text Class, when nested DIVs are collected from XPat, divheads are retrieved. In many cases, divheads need to be twigified as well. Moreover, they and their child DIVs' heads and KWICs need to be properly wrapped with closing DIV tags so that the result is well-formed XML and the XSL templates can match them all properly. The proper nesting of DIVs, divheads and so forth is done in the CGI. Other than nesting, the CGI no longer does any filtering on DIVs or divheads.
- When Text Class data retrieved from XPat contains PB or FIGURE tags, or when daos are retrieved in Findaid Class, some CGI manipulation is necessary. For example, an href for a Finding Aid's dao may need to be resolved by the CGI. That href then needs to be communicated to the XSL for proper link-building. In Findaid Class, the href attribute of the dao element is modified so that the XSL can build a proper link.
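The "half tag" stripping and tag balancing described in the first item above can be illustrated with a minimal sketch. This is not the DlpsUtils::Twigify code (which is Perl and is not reproduced here); the function names, the regular expressions, and the sample snippet are all assumptions made for illustration only.

```python
import re
import xml.etree.ElementTree as ET

# Matches a complete tag: optional leading slash, name, attributes, optional trailing slash.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)([^<>]*?)(/?)>")


def twigify(snippet):
    """Hypothetical sketch: turn a truncated snippet into a balanced fragment."""
    # Drop a trailing half tag such as '<B' left where the snippet was cut off.
    snippet = re.sub(r"<[^<>]*$", "", snippet)
    # Drop a leading half tag such as 'END="i">' left where the snippet begins.
    snippet = re.sub(r"^[^<>]*>", "", snippet)

    open_stack = []     # tags opened inside the snippet and not yet closed
    missing_opens = []  # close tags whose open tag lies before the snippet
    for match in TAG.finditer(snippet):
        closing, name, _attrs, self_closing = match.groups()
        if self_closing:
            continue  # <PB/> style tags need no balancing
        if not closing:
            open_stack.append(name)
        elif open_stack and open_stack[-1] == name:
            open_stack.pop()
        else:
            missing_opens.append(name)

    # Open the closed tags (outermost first) and close the open ones (innermost first).
    prefix = "".join(f"<{name}>" for name in reversed(missing_opens))
    suffix = "".join(f"</{name}>" for name in reversed(open_stack))
    return prefix + snippet + suffix


def is_well_formed(fragment):
    """Check well-formedness by parsing the fragment inside a dummy root element."""
    try:
        ET.fromstring(f"<twig>{fragment}</twig>")
        return True
    except ET.ParseError:
        return False
```

Under these assumptions, a snippet that begins inside one element and ends in the middle of a tag comes back as a fragment that an XML parser accepts, with the original inline markup still in place for a stylesheet to match.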
Some notes about internal changes
Quite a lot of changes in the middleware and supporting systems were necessary in the move to XML. Some examples:
- the XPat.pm module now does some conversion, if necessary, to results retrieved from XPat-indexed data in case the data stored is SGML and not XML. In order for the rest of the middleware to properly handle the data, it must be XML. So, SGML singleton tags are converted to XML singleton tags (with a trailing /).
- the dlxs database in MySQL has been changed so that data is stored UTF-8-encoded and retrieved as UTF-8
- a file in $DLXSROOT/misc/sgml called entitiesdoctype.chnk is delivered with Release 12a. It is a DOCTYPE declaration that can optionally be "chunked in" via a CHUNK processing instruction, as an internal subset if need be. It can be used if data is not yet converted to XML and still has a number of character entity references. This allows the XML parser to proceed as if the data were XML with all non-XML character entity references converted.
- The DlpsUtils::Twigify subroutine was written to convert snippets of possibly broken XML into well-formed XML by removing tags that have been truncated (as in "KWICs" returned in XPat queries) and by closing open tags and opening closed ones, in order to arrive at a well-formed XML "twig".
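As a rough illustration of the singleton conversion mentioned in the first item of this list, the sketch below rewrites SGML-style empty tags as XML empty-element tags. It does not reflect how XPat.pm is actually written. The set of element names treated as empty (PB and LB) is an assumption; in practice it depends on the DTD in use.

```python
import re

# Assumed set of empty (singleton) elements; the real list depends on the DTD.
EMPTY_ELEMENTS = {"PB", "LB"}

# An open tag: name, optional attributes, optional trailing slash.
_OPEN_TAG = re.compile(r"<([A-Za-z][\w.-]*)((?:\s[^<>]*?)?)\s*(/?)>")


def close_singletons(text, empty_elements=EMPTY_ELEMENTS):
    """Hypothetical sketch: add the trailing slash XML requires on empty elements."""
    def fix(match):
        name, attrs, slash = match.groups()
        if name.upper() in empty_elements and not slash:
            return f"<{name}{attrs}/>"
        return match.group(0)  # leave every other tag untouched

    return _OPEN_TAG.sub(fix, text)


converted = close_singletons('<P>one<PB N="3">two<LB>three</P>')
```

Tags that already end in a slash are left alone, so running the function over data that is already XML should be harmless.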
Working with XML and XSL in DLXS
Working with an XML file whose dynamic content has not yet been filled in, or with a virtual XSL stylesheet, can be difficult. That is where three new debug values come in handy. To see the XML file, with all its dynamic content filled in, that the CGI will send to the XSLT transformation engine, add debug=xml to the URL. The CGI will fill in all the PIs and then send the untransformed block of XML to the browser. If you use a browser that can display untransformed XML, you will see the content and the form of the XML data. You can also use the browser's view source to get the untransformed XML and view it or save it to a file for debugging purposes.
Since the virtual stylesheet is created only at run time by importing a number of XSLT files, each of which is likely to have been gotten via fallback, the best way to see the full paths of the files which are being imported and used is to add debug=xslt to the URL. The contents of the virtual stylesheet will be delivered to the browser as XML. If need be, you can copy the source and paste it into an XSLT processor or debugger.
Description of xsltwrite command goes here: add debug=xsltwrite to the URL.
Here at DLPS, we use a variety of tools for editing and debugging XML and XSL, everything from Oxygen to Dreamweaver to xemacs to Saxon to XMLSpy. The debug switches allow us to get at what's inside the CGI code as it runs.
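When checking many pages, it can be convenient to add the debug switch programmatically. The helper below is a small sketch that sets the debug parameter on a URL. The host, CGI path, and other parameters in the example are hypothetical, and the sketch assumes the query string uses & as its separator.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_debug(url, value):
    """Return url with its debug parameter set to value (for example "xml" or "xslt")."""
    parts = urlsplit(url)
    # Keep every existing parameter except a previous debug setting.
    query = [
        (key, val)
        for key, val in parse_qsl(parts.query, keep_blank_values=True)
        if key != "debug"
    ]
    query.append(("debug", value))
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical search URL, used only to show the shape of the result.
example = with_debug("https://example.org/cgi/t/text/text-idx?c=demo&type=simple", "xml")
```

The resulting URL can be opened in a browser, or fetched and saved to a file for use in an XML editor or XSLT debugger.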
Data migration issues
Data conversion
See Data Conversion and Preparation: Unicode, XML and Normalization.
Writing your own XSL to customize your collections' look and feel
There is no easy way to convert the HTML templates from previous releases of DLXS into XSL. Your best course of action will likely be to start with the XSL files that are delivered with Release 12a. Then, for any collection that needs a specific look, behavior, or filtering into HTML, copy only those XSL files or templates that you need to change, place the copies in the collection subdirectory, and modify them there.
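The idea behind placing modified copies in a collection subdirectory is fallback: a collection-specific file is preferred when it exists, and the delivered file is used otherwise. The sketch below shows that lookup order in general terms only. The function name and the directory names in the usage comment are hypothetical; the actual DLXS search path is not described here.

```python
from pathlib import Path


def resolve_with_fallback(filename, search_dirs):
    """Return the first existing copy of filename, searching search_dirs in order.

    search_dirs is assumed to list the most specific directory first
    (for example a collection subdirectory) and the default directory last.
    """
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None  # not found anywhere on the search path


# Hypothetical usage: a collection directory overrides a class-level default.
# resolve_with_fallback("results.xsl", ["web/t/text/mycoll", "web/t/text"])
```

With this ordering, only the files that have been copied into the more specific directory are overridden, and every other file continues to come from the default location.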
collmgr entries
The dlxs database has been changed to store and retrieve UTF-8 encoded strings in its tables. Therefore, when entering data into collmgr fields, be sure to enter proper UTF-8. This can include the actual character, if you have the ability to enter things directly through an alternate keyboard mapping; a hexadecimal entity; or a decimal entity (though hex may be the best choice).
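For a character that cannot be typed directly, the hexadecimal and decimal forms can be derived from its Unicode code point. The helper names below are illustrative and are not part of DLXS.

```python
import html


def hex_entity(ch):
    """Hexadecimal numeric character reference for a single character."""
    return f"&#x{ord(ch):X};"


def decimal_entity(ch):
    """Decimal numeric character reference for a single character."""
    return f"&#{ord(ch)};"


e_acute = "\u00e9"                   # LATIN SMALL LETTER E WITH ACUTE
hex_form = hex_entity(e_acute)       # "&#xE9;"
dec_form = decimal_entity(e_acute)   # "&#233;"
utf8_bytes = e_acute.encode("utf-8")  # the two bytes stored when the character is typed directly

# Both references decode back to the same character.
round_trip = (html.unescape(hex_form), html.unescape(dec_form))
```

One practical argument for the hexadecimal form is that Unicode code charts list code points in hexadecimal, so the reference can be checked against them at a glance.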
Implications for users
You may receive a variety of questions from users, some of which may be difficult to track down. We are still learning ourselves about the vagaries of how different platforms, browsers, etc. handle UTF-8. For example, if a user copies Latin 1 data from a web page or application and pastes it into a UTF-8 web form on a DLXS search page, will their applications and operating system properly transcode the pasted data into UTF-8? Will the data that is received by the CGI be UTF-8 or will it be improperly encoded? Will older browsers or operating systems have the fonts available to display the languages you can now index and include in your collections? We've also learned that different browsers handle character interpretation issues differently, with everything from the classic Mac empty box to a triangle with a question mark to offering up what "seems" to be the right choice (inevitably Chinese). The move to cookies for tracking sessions also means that users who refuse all cookies are now an issue for DLXS, as they have been for other products in the library world for some time.
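When tracking down an encoding report, one quick test is whether the bytes in question decode as UTF-8 at all. The sketch below is a generic check and is not taken from the DLXS middleware. Note that pure ASCII passes under either encoding, and that a few Latin 1 byte sequences can happen to be valid UTF-8, so a passing result is a hint and not proof.

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "caf\u00e9"                    # "café"
as_utf8 = word.encode("utf-8")        # what a UTF-8 form should submit
as_latin1 = word.encode("latin-1")    # what an untranscoded paste might submit

utf8_ok = is_valid_utf8(as_utf8)
latin1_ok = is_valid_utf8(as_latin1)
```

In this example the Latin 1 bytes fail the check because the single byte for the accented letter is not a complete UTF-8 sequence.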