Preparing Data and Directories

From DLXS Documentation

Revision as of 18:49, 14 September 2007


Overview of Data Preparation and Indexing Steps

Data Preparation

  1. Validate the files individually against the EAD 2002 DTD
    make validateeach
  2. Concatenate the files into one larger XML file
    make prepdocs
  3. Validate the concatenated file against the dlxsead2002 DTD:
    make validate
  4. Normalize the concatenated file.
    make norm
  5. Validate the normalized concatenated file against the dlxsead2002 DTD
    make validate

The end result of these steps is a file containing the concatenated EADs, wrapped in a <COLL> element, which validates against the dlxsead2002 DTD and is ready for indexing:

<COLL>
<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>

WARNING! If there are extra characters in your source files, or some other problem with the part of the program that strips out the XML declaration and the doctype declaration, the file will end up like:


<COLL>
baddata<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>

In this case you will get "character data not allowed" or similar errors during the make validate step. You can troubleshoot by looking at the concatenated file and/or checking your original EADs.
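One way to locate the stray text is to search the concatenated file for lines where something other than whitespace precedes an opening <ead> tag. The following is only a sketch: it builds a small hypothetical sample file (sample.xml in a temporary directory, not a file DLXS creates) so that it can be run anywhere. In practice you would point grep at your own concatenated file.

```shell
# Build a tiny stand-in for a badly concatenated file (hypothetical sample data).
tmpdir=$(mktemp -d)
cat > "$tmpdir/sample.xml" <<'EOF'
<COLL>
<ead><eadheader><eadid>1</eadid></eadheader></ead>
baddata<ead><eadheader><eadid>2</eadid></eadheader></ead>
</COLL>
EOF

# Print line numbers where non-whitespace text precedes an opening <ead> tag.
# "<ead[ >]" matches <ead> or <ead ...> but not <eadheader> or <eadid>.
stray=$(grep -n '[^[:space:]].*<ead[ >]' "$tmpdir/sample.xml" || true)
echo "$stray"
```

Each reported line number points at an EAD whose original file probably has unexpected characters before its root element.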

Indexing

  1. make singledd indexes words for texts that have been concatenated into one large file for a collection.
  2. make xml indexes the XML structure by reading the DTD. Validates as it indexes.
  3. make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file.

Set Up Directories and Files for Data Preparation

You will need to set up a directory structure to store your EAD 2002 XML source files, your object files (used by XPAT for indexing), your index files (including region index files), other information such as data dictionaries, and the files you use to prepare your data.

The DLXS convention is to use subdirectories named with the first letter of the collection ID followed by the collection ID itself: $DLXSROOT/xxx/{c}/{coll}/, where $DLXSROOT is the "tree" where you install all DLXS components, xxx is one of the top-level directories listed below (bin, prep, obj, or idx), {c} is the first letter of the collection ID, and {coll} is the collection ID of the collection you are indexing. For example, if your collection ID is "bhlead" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/b/bhlead/, e.g., /l1/bin/b/bhlead/Makefile. See the DLPS Directory Conventions section and the Workshop discussion of Directory Conventions for more information.

When deciding on your collection id, consider that it needs to be unique across all classes to enable cross-collection searching. You don't want both a text class collection with a collection id of "my_coll" and a finding aid class collection with a collection id of "my_coll". You will also probably want to keep your collection ids short and make sure they don't contain any special characters, since they are also used as subdirectory names.

Note that the Makefile we provide, along with most of the data preparation scripts supplied with DLXS, assumes this directory structure.

We recommend you use the following directory structure:

  • Store the Makefile and any specialized scripts for preparing and/or preprocessing your collection in $DLXSROOT/bin/{c}/{coll}/, where {c} is the first letter of the collection ID and {coll} is the collection ID of the collection you are indexing. For example, if your collection ID is "bhlead" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/b/bhlead/, e.g., /l1/bin/b/bhlead/Makefile.
  • Store your source finding aids in $DLXSROOT/prep/{c}/{coll}/data/.
  • Store any DTDs, doctype, and files for preparing your data in $DLXSROOT/prep/{c}/{coll}/. Unlike the contents of other directories, everything in prep should be expendable when actually running the indexes.
  • After running all the targets in the Makefile, the finalized, concatenated XML file for your finding aids collection will be created in $DLXSROOT/obj/{c}/{coll}/ , e.g., /l1/obj/b/bhlead/bhlead.xml.
  • Store index, region, data dictionary, and init files in $DLXSROOT/idx/{c}/{coll}/ , e.g., /l1/idx/b/bhlead/bhlead.idx. These will be updated as the index related targets in the Makefile are run. See the XPAT documentation for more on these types of files.
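The layout above can be derived mechanically from the collection ID. The following is a minimal sketch using the example values from the text ("/l1" and "bhlead"); the variable names root, coll, and c are illustrative only and are not used by DLXS itself.

```shell
# Example values taken from the text above; substitute your own.
root=/l1
coll=bhlead

# {c} is the first letter of the collection ID.
c=$(printf '%s' "$coll" | cut -c1)

# Print the per-collection directory under each top-level tree.
for tree in bin prep obj idx; do
    echo "$root/$tree/$c/$coll"
done

# The source finding aids live in a data subdirectory of prep.
echo "$root/prep/$c/$coll/data"
```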

Fixing paths

The installation script should have changed all instances of /l1/ to your $DLXSROOT and all shebang lines ("#!/l/local/bin/perl") to the location of perl on your system. However, you may wish to check the following scripts:

  • $DLXSROOT/bin/f/findaid/output.dd.frag.pl
  • $DLXSROOT/bin/f/findaid/inc.extra.dd.pl
  • $DLXSROOT/bin/s/samplefa/preparedocs.pl

If you use the Makefile in $DLXSROOT/bin/s/samplefa, you should check that the paths in the Makefile are correct for the locations of xpat, osx, and osgmlnorm as installed on your system. These are the Make variables that should be checked:

  • XPATBINDIR
  • OSX
  • OSGMLNORM
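A quick way to review these three variables is to print their assignments and test whether each path exists. The sketch below runs against a temporary stand-in Makefile that contains only the default values quoted on this page; to check a real installation you would point grep at your own copy of the Makefile instead.

```shell
# Stand-in Makefile holding the default values quoted on this page (temporary copy).
tmpdir=$(mktemp -d)
cat > "$tmpdir/Makefile" <<'EOF'
XPATBINDIR = /l/local/bin/
OSX = /l/local/bin/osx
OSGMLNORM = /l/local/bin/osgmlnorm
EOF

# Show the three assignments so they can be compared with the local installation.
settings=$(grep -E '^(XPATBINDIR|OSX|OSGMLNORM)[[:space:]]*=' "$tmpdir/Makefile")
echo "$settings"

# Warn about any path that does not exist on this machine.
echo "$settings" | while IFS= read -r line; do
    path=${line#*= }
    [ -e "$path" ] || echo "not found here: $path"
done
```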


Step by step instructions for setting up Directories for Data Preparation

You can use the scripts and files from the sample finding aids collection "samplefa" as a basis for creating a new collection. In the instructions that follow, use /{c}/{coll} instead of /w/workshopfa, where {c} is the first letter of your collection id and {coll} is your collection id. For example, if your collection id were mycoll, then instead of

cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch

you would do

cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/m/mycoll/mycoll.extra.srch


This documentation will make use of the concept of the $DLXSROOT, which is the place at which your DLXS directory structure starts. We generally use /l1/.

To check your $DLXSROOT, type the following command at the command prompt:

echo $DLXSROOT
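If you want a slightly stricter check than echo, a small shell function can confirm that the variable is set and names an existing directory. This is only a sketch; check_root is a hypothetical helper name, not part of DLXS.

```shell
# check_root VALUE: report whether VALUE is usable as a DLXSROOT.
check_root() {
    if [ -z "${1:-}" ]; then
        echo "DLXSROOT is not set"
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "DLXSROOT=$1 is not a directory"
        return 1
    fi
    echo "DLXSROOT=$1 looks usable"
}

# Check the current environment without aborting the shell on failure.
check_root "${DLXSROOT:-}" || true
```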

The prep directory under $DLXSROOT is the space for you to take your encoded finding aids and "package them up" for use with the DLXS middleware. Create your basic directory $DLXSROOT/prep/w/workshopfa and its data subdirectory with the following command:

mkdir -p $DLXSROOT/prep/w/workshopfa/data

Move into the prep directory with the following command:

cd $DLXSROOT/prep/w/workshopfa

This will be your staging area for all the things you will be doing to your EADs, and ultimately to your collection. At present, all it contains is the data subdirectory you created a moment ago. Unlike the contents of other collection-specific directories, everything in prep should be ultimately expendable in the production environment.

Copy the sample finding aids into your data directory with the following command:

cp $DLXSROOT/prep/s/samplefa/data/*.xml $DLXSROOT/prep/w/workshopfa/data/.

We'll also need a few files to get us started working. They will need to be copied over as well, and also have paths adapted and collection identifiers changed. Follow these commands:


cp $DLXSROOT/prep/s/samplefa/validateeach.csh $DLXSROOT/prep/w/workshopfa/.
cp $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp
cp $DLXSROOT/prep/s/samplefa/samplefa.text.inp $DLXSROOT/prep/w/workshopfa/workshopfa.text.inp
mkdir -p $DLXSROOT/obj/w/workshopfa
mkdir -p $DLXSROOT/bin/w/workshopfa
cp $DLXSROOT/bin/s/samplefa/preparedocs.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/Makefile $DLXSROOT/bin/w/workshopfa/Makefile

(If you have installed the Release 13 August 24th patch substitute these instructions)

Now you'll need to edit these files to ensure that the paths match your $DLXSROOT and that the collection name is workshopfa instead of samplefa.

STOP!! Make sure you edit the files before going to the next steps!!

Make sure you change these files:

  • $DLXSROOT/prep/w/workshopfa/validateeach.csh
  • $DLXSROOT/bin/w/workshopfa/Makefile (see below for details)

You can run this command to check to see if you forgot to change samplefa to workshopfa:

grep "samplefa" $DLXSROOT/bin/w/workshopfa/* $DLXSROOT/prep/w/workshopfa/* |grep -v "#"

With the ready-to-go EAD 2002 encoded finding aid files in the data directory, we are ready to begin the preparation process. This will include:

  1. Validating the files individually against the EAD 2002 DTD
  2. Concatenating the files into one larger XML file
  3. Validating the concatenated file against the dlxsead2002 DTD
  4. "Normalizing" the concatenated file
  5. Validating the normalized concatenated file against the dlxsead2002 DTD

These steps are generally handled via the Makefile in $DLXSROOT/bin/s/samplefa which we have copied to $DLXSROOT/bin/w/workshopfa. Example Makefile.

Make sure you changed your copy of the Makefile to reflect /w/workshopfa instead of /s/samplefa. You will want to change lines 2 and 3 accordingly:


   1  
   2  NAMEPREFIX = samplefa
   3  FIRSTLETTERSUBDIR = s
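If you prefer to make this change non-interactively, something like the following sed invocation should work. It is a sketch that operates on a temporary two-line stand-in rather than the real Makefile, and it assumes the two assignments are written exactly as shown above.

```shell
# Temporary stand-in for the copied Makefile (only the two relevant lines).
tmpdir=$(mktemp -d)
printf 'NAMEPREFIX = samplefa\nFIRSTLETTERSUBDIR = s\n' > "$tmpdir/Makefile"

# Rewrite both assignments and drop any trailing whitespace on those lines.
# Writing to a new file avoids relying on sed -i, whose syntax differs
# between implementations.
sed -e 's/^NAMEPREFIX = samplefa[[:space:]]*$/NAMEPREFIX = workshopfa/' \
    -e 's/^FIRSTLETTERSUBDIR = s[[:space:]]*$/FIRSTLETTERSUBDIR = w/' \
    "$tmpdir/Makefile" > "$tmpdir/Makefile.new"
mv "$tmpdir/Makefile.new" "$tmpdir/Makefile"

cat "$tmpdir/Makefile"
```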

Tip: Be sure not to add any space after the workshopfa or the w. The Makefile ignores space immediately before and after the equals sign but treats all other space as part of the string. If you accidentally put a space after FIRSTLETTERSUBDIR = w, you will get an error like "[validateeach] Error 127". If you look closely at the first line of what the Makefile reported to standard output (see below), you will see that instead of running the command:

/l1/workshop/tburtonw/dlxs/prep/w/workshopfa/validateeach.csh

which just calls the validateeach C-shell script, it tried to run a directory name, "/l1/workshop/tburtonw/dlxs/prep/w", with the argument "/workshopfa/validateeach.csh", which does not make sense:

 % make validateeach
/l1/workshop/tburtonw/dlxs/prep/w /workshopfa/validateeach.csh
make: execvp: /l1/workshop/tburtonw/dlxs/prep/w: Permission denied
make: [validateeach] Error 127 (ignored)

Further note on editing the Makefile: If you modify or write your own Make targets, you need to make sure that each command line starts with a real tab rather than spaces. The easiest way to check for these kinds of errors is to use "cat -vet Makefile", which displays tabs as ^I and marks the end of each line with $, so that trailing spaces and missing tabs become visible.
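Both kinds of whitespace mistake can also be found with grep. The sketch below runs against a temporary stand-in Makefile that deliberately contains one trailing space and one command line indented with spaces; the file contents are illustrative only.

```shell
tmpdir=$(mktemp -d)
# Line 2 ends with a space; line 4 starts with spaces instead of a tab.
printf 'NAMEPREFIX = workshopfa\nFIRSTLETTERSUBDIR = w \nvalidateeach:\n    echo validating\n' > "$tmpdir/Makefile"

# Lines with trailing whitespace.
trailing=$(grep -n '[[:space:]]$' "$tmpdir/Makefile" || true)

# Lines that begin with a space (a command line must begin with a tab).
leading=$(grep -n '^ ' "$tmpdir/Makefile" || true)

echo "trailing whitespace: $trailing"
echo "leading spaces: $leading"
```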

You should also make sure that $DLXSROOT and the locations of the various binaries have been changed to match your installation:


  • Change /l1/ to your $DLXSROOT on every line that uses it
  • Change XPATBINDIR = /l/local/bin/ to the location of the xpat binary in your installation
  • Change the location of the osx binary from OSX = /l/local/bin/osx to the location in your installation
  • Change the location of the osgmlnorm binary from OSGMLNORM = /l/local/bin/osgmlnorm to the location in your installation

Tip: osx and osgmlnorm are installed as part of the OpenSP package. If you are using Linux, make sure that the OpenSP package for your distribution is installed and that the paths above are changed to match your installation. If you are using Solaris, you will have to install (and possibly compile) OpenSP. You may also need to make sure the $LD_LIBRARY_PATH environment variable is set so that the OpenSP programs can find the required libraries. For troubleshooting such problems, the Unix ldd utility is invaluable. See also the links to the OpenSP package on the tools page: Useful Tools
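As a rough illustration of the ldd approach, the following sketch looks up a program on the PATH and prints any shared libraries that ldd reports as "not found". The function name check_bin is hypothetical, and the sketch assumes that osx and osgmlnorm are expected to be on your PATH; if they are installed elsewhere, pass the full path instead.

```shell
# check_bin NAME: report where NAME is installed and list unresolved libraries.
check_bin() {
    path=$(command -v "$1") || { echo "$1: not found on PATH"; return 1; }
    echo "$1: $path"
    # ldd is not available on every system, so only use it when present.
    if command -v ldd >/dev/null 2>&1; then
        ldd "$path" 2>/dev/null | grep 'not found' || true
    fi
    return 0
}

check_bin osx || true
check_bin osgmlnorm || true
```

If a library is listed as "not found", adding its directory to $LD_LIBRARY_PATH is usually the first thing to try.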


Set Up Directories and Files for XPAT Indexing

Please substitute /{c}/{coll}, where {c} is the first letter of your collection id and {coll} is your collection id, for any instance of /w/workshopfa, and substitute {coll} wherever you see "workshopfa" in the following instructions.

First, we need to create the remaining directory in the workshopfa environment with the following command:

mkdir -p $DLXSROOT/idx/w/workshopfa

The bin directory we created when we prepared directories for data preparation holds any scripts or tools used specifically for the collection; obj (created earlier) holds the "object" or XML file for the collection; and idx holds the XPAT indexes. Now we need to finish populating the directories.

cp $DLXSROOT/prep/s/samplefa/samplefa.blank.dd $DLXSROOT/prep/w/workshopfa/workshopfa.blank.dd
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch


Each of these files needs to be edited to reflect the new collection name and the paths to your particular directories. Failure to change even one file can result in puzzling errors, because the scripts are working, just not necessarily in the directories you are looking at.

grep -l "samplefa" $DLXSROOT/prep/w/workshopfa/*

will list any files that still contain the string samplefa. You also need to check that "/l1/" has been replaced by whatever $DLXSROOT is on your server. If you don't have an /l1 directory on your server (which is very likely if you are not using a DLPS machine), you can check with:

grep -l "l1" $DLXSROOT/prep/w/workshopfa/*
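The two checks can be combined into one loop. The sketch below runs against two hypothetical files in a temporary directory (edited.dd and unedited.dd, invented for illustration) and searches for "/l1/" with the slashes, which is a slightly narrower pattern than "l1" and should produce fewer false matches.

```shell
# Two stand-in files: one fully edited, one still holding the sample values.
tmpdir=$(mktemp -d)
printf 'collection = workshopfa\n' > "$tmpdir/edited.dd"
printf 'path = /l1/idx/s/samplefa/samplefa.idx\n' > "$tmpdir/unedited.dd"

# Report every file that still contains either leftover string.
for pattern in 'samplefa' '/l1/'; do
    hits=$(grep -l -- "$pattern" "$tmpdir"/* || true)
    if [ -n "$hits" ]; then
        echo "still contains $pattern:"
        echo "$hits"
    fi
done
```

On a real installation you would replace "$tmpdir"/* with $DLXSROOT/prep/w/workshopfa/*.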
