Building the Index

From DLXS Documentation

Revision as of 15:56, 30 September 2007 by Cboulay (Talk | contribs)

Indexing Overview

After you have followed all the steps to set up your directories and prepare your files, as described in Validating and Normalizing Your Data, indexing the collection is fairly straightforward. To create an index for use with the Findaid Class interface, you will need to index the words in the collection, then index the XML (the structural metadata, if you will), and then finally "fabricate" regions based on a combination of elements (for example, defining what the "main entry" is, without adding a <MAINENTRY> tag around the appropriate <AUTHOR> or <TITLE> element).

The main work in the indexing step is making sure that the fabricated regions in the workshopfa.extra.srch file match the characteristics of your collection.

Note: If the final "make validate" step in Validating the normalized file against the dlxsead2002 DTD produced errors, you will need to fix the problem before running the indexing steps. Attempting to index an invalid document will lead to indexing problems and/or corrupt indexes.

The Makefile in the $DLXSROOT/bin/c/collection directory contains the commands necessary to build the index. The three stages run in order: index the words, index the XML structure, and then fabricate regions from combinations of elements (for example, defining who the "main author" of a finding aid is without adding a <mainauthor> tag around the appropriate <author> in the eadheader element). Change to the Makefile's directory first:

cd $DLXSROOT/bin/c/collection

The following commands can be used to make the index:


make singledd indexes words for texts that have been concatenated into one large file for a collection.

make xml indexes the XML structure by reading the DTD. It validates as it indexes.

make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file. Because every collection is different, the *.extra.srch file will probably need to be adapted for your collection. If you try to build fabricated regions from elements not used in your finding aids collection, make post will report errors like:

Error found: <Error>syntax error before: ")</Error>
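
Because each target builds on the output of the previous one, it can help to run them in order and stop at the first failure. The following is a minimal sketch, not part of the DLXS distribution: build_index and the MAKE override are hypothetical names, and the sketch assumes make exits non-zero when a target fails. XPAT errors printed during make post may not change the exit status, so the output still needs to be read.

```shell
# Run the three indexing targets in order, stopping at the first failure.
# build_index is a hypothetical helper, not part of DLXS.
# Usage: build_index "$DLXSROOT/bin/c/collection"
build_index() {
    coll_dir=$1
    make_cmd=${MAKE:-make}   # MAKE override is an assumption, handy for dry runs
    for target in singledd xml post; do
        echo "== make $target"
        if ! ( cd "$coll_dir" && "$make_cmd" "$target" ); then
            echo "make $target failed; fix the problem before continuing" >&2
            return 1
        fi
    done
}
```

Running the targets inside a subshell keeps the caller's working directory unchanged.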

Step by Step Instructions for Indexing

Step 1: Indexing the text

 cd $DLXSROOT/bin/w/workshopfa
 make singledd

The makefile runs the following commands:

 cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.blank.dd \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
 /l/local/xpat/bin/xpatbld -m 256m -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
 cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd \
 	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd

Step 2: Indexing the XML

 make xml

The makefile runs the following commands:

 cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.presgml.dd \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
 /l/local/xpat/bin/xmlrgn -D /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd \
 	/l1/workshop/test02/dlxs/misc/sgml/xml.dcl \
 	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.inp \
 	/l1/workshop/test02/dlxs/obj/w/workshopfa/workshopfa.xml
 
 cp /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.prepost.dd
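
Each step copies the data dictionary forward for the next one, so a quick check that the expected files exist and are non-empty can catch a failed step early. This is a minimal sketch: require_files is a hypothetical helper, and the path in the usage comment simply follows the workshopfa example.

```shell
# Report any expected index file that is missing or empty.
# require_files is a hypothetical helper, not part of DLXS.
# Usage (path follows the workshopfa example; adjust for your collection):
#   require_files "$DLXSROOT/idx/w/workshopfa/workshopfa.dd"
require_files() {
    status=0
    for f in "$@"; do
        # -s is true only when the file exists and has a size greater than zero
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
        fi
    done
    return $status
}
```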


Step 3: Configuring fabricated regions

Fabricated regions are set up in the $DLXSROOT/prep/c/collection/collection.extra.srch file. The sample file $DLXSROOT/prep/s/samplefa/samplefa.extra.srch was designed for use with the Bentley's encoding practices. If your encoding practices differ from the Bentley's, or if your collection does not have all the elements that the samplefa.extra.srch xpat queries expect, you will need to edit your *.extra.srch file.

We recommend a combination of the following:

  1. Iterative work to ensure that make post does not report errors.
  2. Iterative work to ensure that searching and rendering work properly with your encoding practices.
  3. Up-front analysis.

Run "make post" and iterate until no errors are reported

Run the "make post" step and look at the errors reported. Then modify *.extra.srch and rerun "make post". Repeat this until "make post" does not report any errors.
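
While iterating, it can be convenient to capture the output of make post and look only at the lines XPAT flags as errors. This is a minimal sketch, assuming error lines contain the <Error> marker shown in the sample message on this page; xpat_errors and post.log are hypothetical names.

```shell
# Print the lines of a captured "make post" log that contain an XPAT error
# marker. Returns 1 if any were found, 0 if none, 2 if the log is unreadable.
# Usage:
#   make post > post.log 2>&1
#   xpat_errors post.log
xpat_errors() {
    [ -r "$1" ] || { echo "cannot read log: $1" >&2; return 2; }
    # -n prefixes each matching line with its line number in the log
    if grep -n '<Error>' "$1"; then
        return 1
    fi
    return 0
}
```

An empty result only means that no line matched the marker; other kinds of messages, such as "invalid endpoints", still need to be read in the full log.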

The most common cause of "make post" errors is a fabricated region definition that includes an element that does not occur in your collection.

For example, if you do not have any <corpname> elements in any of the EADs in your collection and you are using the out-of-the-box samplefa.extra.srch, you will see an error message when XPAT tries to index the mainauthor region using this rule:

(
     (region "persname" + region "corpname" + region "famname" + region "name")
      within 
       (region "origination" within 
          ( region "did" within 
               (region "archdesc")
          )
       )
      ); 
{exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainauthor.rgn"}; export; ~sync "mainauthor";


If you never expect to use an element, you can remove it from the fabricated region definitions. Alternatively, if you have only a small sample of the EADs you will be mounting and expect later EADs to contain an element that is currently missing, you can add a "dummy" EAD to your collection. The dummy EAD should contain every element you expect to use (or that the *.extra.srch file requires), and all of its elements except <eadid> should be empty.

Exercise the web user interface

Once make post does not report errors, you can follow the rest of the steps to put your collection on the web. Then carefully exercise the web user interface looking for the following symptoms:

  • Searches that don't work properly because they depend on fabricated regions that don't match your encoding practices.
  • Rendering that does not work properly. For example, the name/title of the finding aid may not show up if your <unittitle> element precedes your <origination> element in the top-level <did>. See also Troubleshooting: Title of finding aid does not show up.

For more information on regions used for searching and rendering see

Analysis of your collection

You may be able to analyze your collection prior to running make post and determine what changes you want to make in the fabricated regions. If your analysis misses any changes, you can find this out by using the two previous techniques.

  • Once you have run "make xml", but before you run "make post", start up xpatu running against the newly created indexes:
 xpatu $DLXSROOT/idx/s/samplefa/samplefa.dd

then run the command

 >> {ddinfo regionnames}

This will give you a list of all the XML elements and attributes.

Alternatively, you can create a file called xpatregions and insert the following text:

{ddinfo regionnames}

Then run this command:

$ xpatu /l1/dev/tburtonw/idx/s/samplefa/samplefa.dd < xpatregions > regions.out

Then use the "regions.out" file you just created to sort and examine the list of regions that occur in your finding aids, and compare them to the regions referenced by the fabricated region queries in your copy of samplefa.extra.srch.
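
The comparison can be partly automated. The sketch below lists region names that a *.extra.srch file queries but that are not in the index. It rests on several assumptions: missing_regions and regions.txt are hypothetical names, the region list has been cleaned to one name per line (the raw {ddinfo regionnames} output may need editing first, since its exact format is not described here), and fabricated regions are named with ~sync "name" as in the mainauthor rule.

```shell
# List region names that a .extra.srch file queries but that are neither in
# the index nor fabricated by the file itself.
# missing_regions is a hypothetical helper, not part of DLXS.
# Usage: missing_regions collection.extra.srch regions.txt
missing_regions() {
    srch_file=$1
    region_list=$2          # assumed format: one region name per line
    known=$(mktemp) || return 2
    # Known names: regions in the index plus those the file fabricates (~sync).
    cat "$region_list" > "$known"
    grep -o '~sync "[^"]*"' "$srch_file" |
        sed 's/^~sync "\(.*\)"$/\1/' >> "$known" || true
    # Names referenced as region "name" that are not known, one per line.
    grep -o 'region "[^"]*"' "$srch_file" |
        sed 's/^region "\(.*\)"$/\1/' |
        sort -u |
        grep -vxF -f "$known" || true
    rm -f "$known"
    return 0
}
```

Any name it prints is a candidate for removal from the fabricated region queries, or for inclusion in a "dummy" EAD.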

Step 4: Indexing fabricated regions

 make post

The makefile runs the following commands:

 cp /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.prepost.dd \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd
 touch /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.init
 /l/local/xpat/bin/xpat -q /l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd \
 	< /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.srch \
 	| /l1/workshop/test02/dlxs/bin/t/text/output.dd.frag.pl \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/ \
 	> /l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd
 /l1/workshop/test02/dlxs/bin/t/text/inc.extra.dd.pl \
 	/l1/workshop/test02/dlxs/prep/w/workshopfa/workshopfa.extra.dd \
 	/l1/workshop/test02/dlxs/idx/w/workshopfa/workshopfa.dd


If you get an "invalid endpoints" message from "make post", the most likely cause is XML processing instructions or some other corruption in the data. The second "make validate" step should have caught these. Other possible causes of errors during the "make post" step include syntax errors in workshopfa.extra.srch and the absence of a region that is listed in the *.extra.srch file but not present in your collection. For example, if you do not have any <corpname> elements in any of the EADs in your collection and you are using the out-of-the-box samplefa.extra.srch, you will see an error message when XPAT tries to index the mainauthor region using this rule:

((region "persname" + region "corpname" + region "famname" + region "name") within (region "origination" within ( region "did" within (region "archdesc")))); {exportfile "/l1/workshop/user11/dlxs/idx/s/samplefa/mainauthor.rgn"}; export; ~sync "mainauthor";

The easiest solution is to modify *.extra.srch to match the characteristics of your collection. An alternative is to include a "dummy" EAD that contains, with no content, all the elements that you expect in your collection.

Warning! If "make post" produces errors, you need to fix them. Otherwise, searching and display of your finding aids may produce inconsistent results and crashes of the CGI script. See also Working with Fabricated Regions in Findaid Class.


Testing the index

At this point it is a good idea to test the newly created index. It is best to do this from a directory other than the one you indexed in, to ensure that relative and absolute paths resolve correctly. Invoke XPAT with the following command:

xpatu $DLXSROOT/idx/w/workshopfa/workshopfa.dd

For more information about searching, see the XPAT manual.

Try searching for some likely regions. It's a good idea to test some of the fabricated regions. Here are a few sample queries:

>> region "ead"
  1: 3 matches

>> region "eadheader"
  2: 3 matches

>> region "mainauthor"
  3: 3 matches

>> region "maintitle"
  4: 3 matches

>> region "admininfo"
  5: 3 matches
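
The same checks can be scripted by writing the queries to a file and redirecting it into xpatu, as was done for {ddinfo regionnames} earlier. This is a minimal sketch: region_queries, checks.xpat, and checks.out are hypothetical names, and it assumes xpatu accepts one query per line on standard input.

```shell
# Print one XPAT region query per line for each region name given.
# region_queries is a hypothetical helper, not part of DLXS.
# Usage:
#   region_queries ead eadheader mainauthor maintitle admininfo > checks.xpat
#   xpatu $DLXSROOT/idx/w/workshopfa/workshopfa.dd < checks.xpat > checks.out
region_queries() {
    for name in "$@"; do
        printf 'region "%s"\n' "$name"
    done
}
```

A fabricated region that reports no matches is worth checking against its rule in the *.extra.srch file.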
