Building the Index
From DLXS Documentation
Revision as of 18:12, 14 September 2007
Indexing Overview
After you have followed all the steps to set up your directories and prepare your files, as described in Validating and Normalizing Your Data, indexing the collection is fairly straightforward. To create an index for use with the Findaid Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" regions based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element).
The main work in the indexing step is making sure that the fabricated regions in the workshopfa.extra.srch file match the characteristics of your collection.
Note: If the final "make validate" step in Data Preparation (Step 5: Validating the normalized file against the dlxsead2002 DTD) produced errors, you will need to fix them before running the indexing steps. Attempting to index an invalid document will lead to indexing problems and/or corrupt indexes.
The Makefile in the $DLXSROOT/bin/c/collection directory contains the commands necessary to build the index. For finding aids, a typical fabricated region is the "main author" of a finding aid, which is defined without adding a <mainauthor> tag around the appropriate <author> in the eadheader element. Change to the Makefile's directory before running any of the commands:
cd $DLXSROOT/bin/c/collection
The following commands can be used to make the index:
- make singledd indexes the words in texts that have been concatenated into one large file for a collection.
- make xml indexes the XML structure by reading the DTD. It validates as it indexes.
- make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file. Because every collection is different, the *.extra.srch file will probably need to be adapted for your collection.
If you try to build fabricated regions from elements not used in your finding aids collection, make post will report errors like:
Error found: <Error>syntax error before: ")</Error>
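The three targets are meant to be run in the order listed. A minimal sketch of a wrapper that does so is shown here. The function name run_index_steps and the MAKE override are hypothetical and not part of DLXS, and the sketch assumes that make exits with a non-zero status when a step fails, which may not cover every error xpat reports.

```shell
#!/bin/sh
# Run the three indexing targets in order and stop at the first one
# whose make invocation exits with a non-zero status.
# MAKE is a hypothetical override (not a DLXS setting): for example,
# MAKE=echo previews the sequence without touching any index.
run_index_steps() {
    for target in singledd xml post; do
        "${MAKE:-make}" "$target" || {
            echo "make $target failed" >&2
            return 1
        }
    done
}

# Typical use (assumes the collection Makefile is in place):
#   cd $DLXSROOT/bin/w/workshopfa && run_index_steps
```

Even when every step exits cleanly, it is still worth reading the output of make post for "Error found" messages.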
Step by Step Instructions for Indexing
Step 1: Indexing the text
cd $DLXSROOT/bin/w/workshopfa
make singledd
The makefile runs the following commands:
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.blank.dd /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xpatbld -m 256m -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd
Step 2: Indexing the XML
make xml
The makefile runs the following commands:
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
/l/local/xpat/bin/xmlrgn -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd /l1/workshop/test02/dlxs/misc/sgml/xml.dcl /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.inp /l1/workshop/test02/dlxs/obj/w/workshopfa/workshopfa.xml
cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.prepost.dd
Step 3: Configuring fabricated regions
Fabricated regions are set up in the $DLXSROOT/prep/c/collection/collection.extra.srch file. The sample file $DLXSROOT/prep/s/samplefa/samplefa.extra.srch was designed for use with the Bentley's encoding practices. If your encoding practices differ from the Bentley's, or if your collection does not have all the elements that the samplefa.extra.srch xpat queries expect, you will need to edit your *.extra.srch file.
We recommend a combination of the following:
- Iterative work to ensure make post does not report errors
- Iterative work to ensure that searching and rendering work properly with your encoding practices
- Up-front analysis
Run "make post" and iterate until no errors are reported
Run the "make post" step and look at the errors reported. Then modify *.extra.srch and rerun "make post". Repeat this until "make post" does not report any errors.
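One way to keep track of progress between iterations is to save the output of each run and count the reported errors. The sketch below assumes that errors appear as lines containing the text "Error found:", as in the sample message shown in the overview. The function name post_error_count and the log file name post.log are hypothetical.

```shell
#!/bin/sh
# Count the lines in a saved "make post" log that contain "Error found:".
# Prints 0 when no such lines exist.
post_error_count() {
    # grep -c prints the count but exits 1 when it is zero; "|| true"
    # keeps the function's exit status at 0 in that case.
    grep -c 'Error found:' "$1" || true
}

# Typical use (post.log is a hypothetical file name):
#   make post > post.log 2>&1
#   post_error_count post.log
#   grep -n 'Error found:' post.log    # show where the errors are
```

When the count reaches 0, the log should still be skimmed for other messages, such as "invalid endpoints".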
The most common cause of "make post" errors related to fabricated regions is a fabricated region definition that includes an element that is not in your collection.
For example, if you do not have any <corpname> elements in any of the EADs in your collection and you are using the out-of-the-box samplefa.extra.srch, you will see an error message when xpat tries to index the mainauthor region using this rule:
((region "persname" + region "corpname" + region "famname" + region "name") within (region "origination" within (region "did" within (region "archdesc")))); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainauthor.rgn"}; export; ~sync "mainauthor";
If you don't expect to ever use an element, you can eliminate it from the fabricated region definitions. Alternatively, you can add a "dummy" EAD to your collection. This is useful if you have only a small sample of the EADs you will be mounting and you expect that some of the EADs you will get later might have the element that is currently missing from your collection. The "dummy" EAD should contain all the elements you ever expect to use (or that are required by the *.extra.srch file), and all of its elements except the <eadid> should be empty.
Exercise the web user interface
Once make post does not report errors, you can follow the rest of the steps to put your collection on the web. Then carefully exercise the web user interface looking for the following symptoms:
- Searches that don't work properly because they depend on fabricated regions that don't match your encoding practices.
- Rendering that does not work properly. For example, the name/title of the finding aid may not show up if your <unittitle> element precedes your <origination> element in the top level <did>. See also Troubleshooting: Title of finding aid does not show up
For more information on regions used for searching and rendering, see Fabricated regions commonly found in Findaid Class.
Analysis of your collection
You may be able to analyze your collection prior to running make post and determine what changes you want to make in the fabricated regions. If your analysis misses any changes, you can find this out by using the two previous techniques.
- Once you have run "make xml", but before you run "make post", start up xpatu running against the newly created indexes:
xpatu $DLXSROOT/idx/s/samplefa/samplefa.dd
then run the command
>> {ddinfo regionnames}
This will give you a list of all the XML elements and attributes.
Alternatively you can create a file called xpatregions and insert the following text:
{ddinfo regionnames}
Then run this command
$ xpatu /l1/dev/tburtonw/idx/s/samplefa/samplefa.dd < xpatregions > regions.out
Then use the "regions.out" file you just created to sort and examine the list of regions that occur in your finding aids, and compare them to the regions used in the fabricated region queries in your copy of samplefa.extra.srch.
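The comparison can be partly automated. The following is a sketch only. It assumes you have already reduced the region list to a plain file with one region name per line (the exact output format of {ddinfo regionnames} is not shown here, so that step is left to you), and the function name missing_regions is hypothetical. It relies on grep -o, which is available in GNU and BSD grep.

```shell
#!/bin/sh
# Print the region names referenced in an *.extra.srch file that do not
# appear in a list of known region names (one name per line).
#   $1: path to the *.extra.srch file
#   $2: path to the file of known region names
missing_regions() {
    grep -o 'region "[^"]*"' "$1" \
        | sed 's/^region "//; s/"$//' \
        | sort -u \
        | grep -v -x -F -f "$2" || true
}

# Typical use (names.txt is a hypothetical file prepared from regions.out):
#   missing_regions samplefa.extra.srch names.txt
```

Any name it prints is a candidate for removal from the fabricated region definitions or for inclusion in a "dummy" EAD. Queries that refer to regions fabricated earlier in the same *.extra.srch file may also be listed, since those regions do not exist until make post has run.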
Step 4: Indexing fabricated regions
make post
The makefile runs the following commands:
cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.prepost.dd /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
touch /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.init
/l/local/xpat/bin/xpat -q /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd < /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.srch | /l1/workshop/test02/dlxs/bin/t/text/output.dd.frag.pl /l1/workshop/test02/dlxs/idx/w/workshopfa/ > /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
/l1/workshop/test02/dlxs/bin/t/text/inc.extra.dd.pl /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
If you get an "invalid endpoints" message from "make post", the most likely cause is XML processing instructions or some other corruption. The second "make validate" step should have caught these. Other possible causes of errors during the "make post" step include syntax errors in workshopfa.extra.srch, or the absence of a particular region that is listed in the *.extra.srch file but not present in your collection. For example, if you do not have any <corpname> elements in any of the EADs in your collection and you are using the out-of-the-box samplefa.extra.srch, you will see an error message when xpat tries to index the mainauthor region using this rule:
((region "persname" + region "corpname" + region "famname" + region "name") within (region "origination" within ( region "did" within (region "archdesc")))); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainauthor.rgn"}; export; ~sync "mainauthor";
The easiest solution is to modify *.extra.srch to match the characteristics of your collection. An alternative is to include a "dummy" EAD that contains all the elements that you expect in your collection, with no content.
Warning! If "make post" produces errors, you need to fix them. Otherwise searching and display of your finding aids may produce inconsistent results and crashes of the cgi script. See also Working with fabricated regions in findaid class
Testing the index
At this point it is a good idea to do some testing of the newly created index. It is best to test from a directory other than the one you indexed in, to ensure that relative or absolute paths are resolving appropriately. Invoke xpat with the following command:
xpatu $DLXSROOT/idx/w/workshopfa/workshopfa.dd
For more information about searching, see the XPAT manual.
Try searching for some likely regions. It's a good idea to test some of the fabricated regions. Here are a few sample queries:
>> region "ead"
  1: 3 matches
>> region "eadheader"
  2: 3 matches
>> region "mainauthor"
  3: 3 matches
>> region "maintitle"
  4: 3 matches
>> region "admininfo"
  5: 3 matches
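The same checks can be run non-interactively by feeding the queries to xpatu on standard input, in the same way the xpatregions file is used earlier on this page. The sketch below only generates the query text. The function name region_queries is hypothetical, and the region names are the ones used in the sample session above.

```shell
#!/bin/sh
# Emit one xpat region query per region name given as an argument.
region_queries() {
    for name in "$@"; do
        printf 'region "%s"\n' "$name"
    done
}

# Typical use, run from a directory other than the indexing directory:
#   cd /tmp
#   region_queries ead eadheader mainauthor maintitle admininfo \
#       | xpatu $DLXSROOT/idx/w/workshopfa/workshopfa.dd
```

A region that reports no matches, or produces an error, is worth checking against your *.extra.srch file.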