Building the Text Class Index
From DLXS Documentation
After you have followed all the steps to set up your directories and prepare your files, as found in the Text Class preparation documentation, indexing the collection is fairly straightforward. To create an index for use with the Text Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" structures based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element). The following commands can be used to make the index, alone or in combination.
- Ensure that your collection XML is valid by running the make validate command, using the Makefile stored at $DLXSROOT/bin/c/collid/Makefile.
- make singledd indexes words for texts that have been concatenated into one large file for a collection. This is the recommended process, as a data dictionary built from a single concatenated file is faster for searching and more reliable than one built using multi-file system indexing. Use the make singledd command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile.
- make xml indexes the XML structure by reading the DTD, and validates as it indexes. Because it validates, it is slower than multiregion indexing (see XPAT documentation for more information). However, it is necessary for collections that have nested elements of the same name (even when separated by an intervening element, such as a <P> within <NOTE1> that is itself within a <P>). Use the make xml command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile.
- make post builds and indexes fabricated regions based on the XPAT queries stored in the $DLXSROOT/prep/c/collid/collid.extra.srch file. Because every collection is different, this file will need to be adapted after you have determined what you want to use as a "poem" for text (e.g., perhaps every DIV1 TYPE="sonnet" and DIV2 TYPE="poem" in the collection) and how many levels of division heads you have in your collection (e.g., at least one text is nested to DIV4, so you'll need to fabricate up to div4head). If the extra.srch file references elements not used in your text collection, you will see errors like Error found: <Error>syntax error before: ")</Error> when you use the make post command in the Makefile stored at $DLXSROOT/bin/c/collid/Makefile. Remove unnecessary lines.
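The sequence of targets listed above can be sketched as a small driver script. This is only an illustration, not part of DLXS: the function names (makefile_dir, plan, run) are hypothetical, the fallback root /l1/dlxs is a placeholder, and the directory layout (first letter of the collection ID, then the collection ID) is inferred from the paths shown on this page. It defaults to a dry run that prints the commands instead of executing them.

```python
import os
import posixpath
import subprocess

# Order described in the list above: validate, index words, index XML,
# then fabricate regions.
TARGETS = ["validate", "singledd", "xml", "post"]


def makefile_dir(dlxsroot, collid):
    """Return $DLXSROOT/bin/<first letter of collid>/<collid> (inferred layout)."""
    return posixpath.join(dlxsroot, "bin", collid[0], collid)


def plan(dlxsroot, collid, targets=TARGETS):
    """Return (working_directory, list_of_commands) without running anything."""
    return makefile_dir(dlxsroot, collid), [["make", t] for t in targets]


def run(dlxsroot, collid, targets=TARGETS, dry_run=True):
    """Run each target in order; check=True stops at the first failing target."""
    cwd, commands = plan(dlxsroot, collid, targets)
    for cmd in commands:
        print("(in %s) %s" % (cwd, " ".join(cmd)))
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    # "/l1/dlxs" is a placeholder used only when DLXSROOT is not set.
    run(os.environ.get("DLXSROOT", "/l1/dlxs"), "workshoptc")
```

Stopping at the first failure matters here because each step builds on the data dictionary left by the previous one.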
You have now built indexes and region files for your collection. You can test that things are properly indexed by issuing the command xpatu $DLXSROOT/idx/c/collid/collid.dd and running searches, such as for a common word like "the" or for an element that should appear, such as region "main" or region "HEADER". It is good practice to test this from a directory other than the one you indexed in, to ensure that relative and absolute paths resolve correctly.
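A scripted version of this check might look like the sketch below. It only assembles the pieces (a working directory outside the index directory, the xpatu command line, and the query text); it does not run anything. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that the word search is written as a quoted string, and that each query fed on standard input ends with a semicolon, as the statements in extra.srch do. The -q flag is the one that appears in the make post output on this page.

```python
import posixpath
import tempfile


def dd_path(dlxsroot, collid):
    """Return $DLXSROOT/idx/<first letter>/<collid>/<collid>.dd (inferred layout)."""
    return posixpath.join(dlxsroot, "idx", collid[0], collid, collid + ".dd")


def smoke_test(dlxsroot, collid,
               queries=('"the"', 'region "main"', 'region "HEADER"')):
    """Return (cwd, argv, stdin_text) for a non-interactive xpatu check.

    cwd is the system temp directory, so the check does not run from the
    directory the index was built in.
    """
    cwd = tempfile.gettempdir()
    argv = ["xpatu", "-q", dd_path(dlxsroot, collid)]
    # Assumption: one query per line, each terminated with ";".
    stdin_text = "".join(q + ";\n" for q in queries)
    return cwd, argv, stdin_text
```

The three values can be passed to subprocess.run(argv, input=stdin_text, text=True, cwd=cwd); a path-resolution problem in the data dictionary should then show up as an error from xpatu rather than as results.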
Everything is now set up to build the XPATu index. The Makefile in the bin directory contains the commands necessary to build the index, and can be executed easily.
make dd
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.blank.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xpatbldu -m 256m -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd
make xml
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.presgml.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
/l/local/bin/xmlrgn -D /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/misc/sgml/xml.dcl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.inp /l1/workshop/sooty/dlxs/obj/w/workshoptc/workshoptc.xml
/l/local/bin/xmlrgn:/l1/workshop/sooty/dlxs/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
cp /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd
make post
cp /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.prepost.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
touch /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.init
/l/local/bin/xpatu -q /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd < /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.srch | /l1/workshop/sooty/dlxs/bin/t/text/output.dd.frag.pl /l1/workshop/sooty/dlxs/idx/w/workshoptc/ > /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd
/l1/workshop/sooty/dlxs/bin/t/text/inc.extra.dd.pl /l1/workshop/sooty/dlxs/prep/w/workshoptc/workshoptc.extra.dd /l1/workshop/sooty/dlxs/idx/w/workshoptc/workshoptc.dd
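The cp commands in the output above suggest a snapshot pattern: each stage first restores the data dictionary saved by the previous stage, and the dd and xml stages save a new snapshot when they finish. The sketch below restates that pattern as data so that it is easier to see which file a stage starts from when it is rerun. The table is read off the sample output only and the function name is hypothetical; the Makefile itself remains the authority.

```python
import posixpath

# stage -> (snapshot restored before the stage, snapshot saved after it),
# as read from the sample output; None means no snapshot is saved.
SNAPSHOTS = {
    "dd": ("blank", "presgml"),
    "xml": ("presgml", "prepost"),
    "post": ("prepost", None),
}


def snapshot_copies(dlxsroot, collid, stage):
    """Return the (source, destination) pairs that the stage's cp commands use."""
    prep = posixpath.join(dlxsroot, "prep", collid[0], collid)
    live = posixpath.join(dlxsroot, "idx", collid[0], collid, collid + ".dd")
    restore, save = SNAPSHOTS[stage]
    copies = [(posixpath.join(prep, "%s.%s.dd" % (collid, restore)), live)]
    if save is not None:
        copies.append((live, posixpath.join(prep, "%s.%s.dd" % (collid, save))))
    return copies


if __name__ == "__main__":
    for stage in ("dd", "xml", "post"):
        for src, dst in snapshot_copies("/l1/workshop/sooty/dlxs", "workshoptc", stage):
            print("%-4s cp %s %s" % (stage, src, dst))
```

One consequence of this pattern is that make post can be rerun after editing extra.srch without repeating the word and XML indexing, because it starts again from the prepost snapshot.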
Sorting and browse building require that you have only one maintitle, mainauthor and maindate per text, so that there is a single value on which to sort. If a text has more than one, some sort operations will fail with a sortkey assertion failure. To prevent this, you may need to make the queries in your extra.srch file more specific.
Some examples of more specific searches in your extra.srch are provided below. The first relies on identifying metadata that has been specified through the use of attributes; the second merely chooses the first occurrence as an indication that it is the "main" value.
(((region TITLE incl "type=main") within region TITLESTMT) within region SOURCEDESC); {exportfile "/l1/idx/e/example/maintitle.rgn"}; export; ~sync "maintitle";
(((region AUTHOR within (region "<TITLESTMT".."</AUTHOR>")) within (region TITLESTMT within region FILEDESC)) not within (region SOURCEDESC)); {exportfile "/l1/idx/e/example/mainauthor.rgn"}; export; ~sync "mainauthor";
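Both examples follow the same shape: a query, an exportfile directive naming a .rgn file, an export, and a ~sync that names the fabricated region. A small helper, sketched here with a hypothetical function name, can assemble such lines so that the region name and the file name stay in step. It does no validation of the query itself.

```python
def fabricate(query, rgn_dir, name):
    """Build one extra.srch line in the shape of the examples above.

    query   -- XPAT query text, including its outer parentheses
    rgn_dir -- directory in which the exported .rgn file is written
    name    -- fabricated region name, also used as the file name
    """
    return '%s; {exportfile "%s/%s.rgn"}; export; ~sync "%s";' % (
        query, rgn_dir, name, name)


if __name__ == "__main__":
    print(fabricate(
        '(((region TITLE incl "type=main") within region TITLESTMT)'
        ' within region SOURCEDESC)',
        "/l1/idx/e/example",
        "maintitle",
    ))
```

Run as a script, this prints the maintitle line shown above.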