Preparing Data and Directories
From DLXS Documentation
Overview of Data Preparation and Indexing Steps
Data Preparation
- Validate the files individually against the EAD 2002 DTD
make validateeach
- Concatenate the files into one larger XML file
make prepdocs
- Validate the concatenated file against the dlxsead2002 DTD:
make validate
- Normalize the concatenated file.
make norm
- Validate the normalized concatenated file against the dlxsead2002 DTD
make validate
The end result of these steps is a file containing the concatenated EADs wrapped in a <COLL> element, which validates against the dlxsead2002 DTD and is ready for indexing:
<COLL>
<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>
WARNING! If there are extra characters, or some other problem with the part of the program that strips out the XML declaration and the doctype declaration, the file will end up like:
<COLL>
baddata<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>
In this case you will get "character data not allowed" or similar errors during the make validate step. You can troubleshoot by looking at the concatenated file and/or checking your original EADs.
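To look for this kind of problem in the concatenated file, one option is to list every line where some text precedes an opening <ead> tag. This is only a minimal sketch: it assumes each finding aid starts on its own line, as in the examples above, and the function name find_stray_data is hypothetical, not part of DLXS.

```shell
#!/bin/sh
# find_stray_data FILE
# Print, with line numbers, every line in which non-whitespace text appears
# before an opening <ead> tag. Assumes one <ead> start tag per line.
# Returns non-zero (grep's "no match" status) when nothing suspicious is found.
find_stray_data() {
    grep -n -E '^[[:space:]]*[^[:space:]].*<ead[ >]' "$1"
}
```

For example, find_stray_data $DLXSROOT/obj/w/workshopfa/workshopfa.xml would print the offending lines, if any, so you can trace them back to the original EADs.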
Indexing
- make singledd indexes all the words in the concatenated file.
- make xml indexes the XML structure by reading the DTD. Validates as it indexes.
- make post builds and indexes fabricated regions based on the XPAT queries stored in the workshopfa.extra.srch file.
Set Up Directories and Files for Data Preparation
You will need to set up a directory structure where you plan to store your EAD2002 XML source files, your object files (used by xpat for indexing), index files (including region index files), and other information such as data dictionaries and files you use to prepare your data.
The convention used by DLXS is to use subdirectories named with the first letter of the collection ID and the collection ID itself: $DLXSROOT/xxx/{c}/{coll}/, where $DLXSROOT is the "tree" where you install all DLXS components, {c} is the first letter of the name of the collection you are indexing, and {coll} is the collection ID of the collection you are indexing. For example, if your collection ID is "bhlead" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/b/bhlead/, e.g., /l1/bin/b/bhlead/Makefile. See the DLPS Directory Conventions section and the Workshop discussion of Directory Conventions for more information.
When deciding on your collection id consider that it needs to be unique across all classes to enable cross-collection searching. So you don't want both a text class collection with a collid of "my_coll" and a finding aid class collection with a collection id of "my_coll". You will also probably want to make your collection ids rather short and make sure they don't contain any special characters, since they will also be used for sub-directory names.
Note that the Makefile we provide, along with most of the data preparation scripts supplied with DLXS, assumes this directory structure.
We recommend you use the following directory structure:
- Store specialized scripts for preparing and/or preprocessing your collection and its Makefile in $DLXSROOT/bin/{c}/{coll}/ where $DLXSROOT is the "tree" where you install all DLXS components, {c} is the first letter of the name of the collection you are indexing, and {coll} is the collection ID of the collection you are indexing. For example, if your collection ID is "bhlead" and your DLXSROOT is "/l1", you will place the Makefile in /l1/bin/b/bhlead/ , e.g., /l1/bin/b/bhlead/Makefile. See the DLPS Directory Conventions section for more information.
- Store your source finding aids in $DLXSROOT/prep/{c}/{coll}/data/.
- Store any DTDs, doctype, and files for preparing your data in $DLXSROOT/prep/{c}/{coll}/. Unlike the contents of other directories, everything in prep should be expendable when actually running the indexes.
- After running all the targets in the Makefile, the finalized, concatenated XML file for your finding aids collection will be created in $DLXSROOT/obj/{c}/{coll}/ , e.g., /l1/obj/b/bhlead/bhlead.xml.
- Store index, region, data dictionary, and init files in $DLXSROOT/idx/{c}/{coll}/ , e.g., /l1/idx/b/bhlead/bhlead.idx. These will be updated as the index related targets in the Makefile are run. See the XPAT documentation for more on these types of files.
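The layout above can be created in one step from the collection ID. The following is a minimal sketch rather than a DLXS-supplied tool: the function name make_coll_dirs is hypothetical, and it assumes $DLXSROOT is already set in your environment.

```shell
#!/bin/sh
# make_coll_dirs COLLID
# Create the bin, prep (with its data subdirectory), obj and idx directories
# for one collection under $DLXSROOT, following the {c}/{coll} convention.
make_coll_dirs() {
    # Stop with an error message if DLXSROOT is unset or empty
    : "${DLXSROOT:?DLXSROOT is not set}"
    coll=$1
    # {c} is the first character of the collection ID
    c=$(printf '%s' "$coll" | cut -c1)
    mkdir -p "$DLXSROOT/bin/$c/$coll" \
             "$DLXSROOT/prep/$c/$coll/data" \
             "$DLXSROOT/obj/$c/$coll" \
             "$DLXSROOT/idx/$c/$coll"
}
```

With DLXSROOT set to /l1, running make_coll_dirs bhlead would create /l1/bin/b/bhlead, /l1/prep/b/bhlead/data, /l1/obj/b/bhlead and /l1/idx/b/bhlead.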
Fixing paths
The installation script should have changed all instances of /l1/ to your $DLXSROOT and all bang prompts (the "#!/l/local/bin/perl" first line of each script) to the location of perl on your system. However, you may wish to check the following scripts:
- $DLXSROOT/bin/f/findaid/output.dd.frag.pl
- $DLXSROOT/bin/f/findaid/inc.extra.dd.pl
- $DLXSROOT/bin/s/samplefa/preparedocs.pl
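One way to check the bang prompts in these scripts is to read the first line of each file and test whether the interpreter it names exists. This is a sketch under that assumption only; check_shebang is a hypothetical helper name, not a DLXS script.

```shell
#!/bin/sh
# check_shebang FILE...
# For each script, report the interpreter named on its first line and
# whether that path is an executable on this machine.
check_shebang() {
    for f in "$@"; do
        first=$(head -n 1 "$f")
        case $first in
            '#!'*)
                # Drop the leading "#!" and any interpreter arguments
                interp=$(printf '%s\n' "$first" |
                    sed -e 's/^#![[:space:]]*//' -e 's/[[:space:]].*$//')
                if [ -x "$interp" ]; then
                    echo "ok: $f uses $interp"
                else
                    echo "MISSING: $f uses $interp"
                fi
                ;;
            *)
                echo "no shebang line: $f"
                ;;
        esac
    done
}
```

For example, check_shebang $DLXSROOT/bin/f/findaid/*.pl $DLXSROOT/bin/s/samplefa/*.pl prints one line per script; any line starting with MISSING points at a bang prompt that still needs editing.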
If you use the Makefile in $DLXSROOT/bin/s/samplefa, you should check that the paths in the Makefile are correct for the locations of xpat, osx, and osgmlnorm as installed on your system. These are the Make variables that should be checked:
- XPATBINDIR
- OSX
- OSGMLNORM
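The three variables can be checked mechanically by reading their values out of the Makefile and testing the paths. The sketch below assumes plain "NAME = value" assignments that do not refer to other Make variables; make_var and check_binaries are hypothetical helper names.

```shell
#!/bin/sh
# make_var MAKEFILE NAME
# Print the value of a plain "NAME = value" assignment, with surrounding
# whitespace trimmed. References to other variables are not expanded.
make_var() {
    sed -n "s/^$2[[:space:]]*=[[:space:]]*\(.*[^[:space:]]\)[[:space:]]*\$/\1/p" "$1" |
        head -n 1
}

# check_binaries MAKEFILE
# Report whether XPATBINDIR is an existing directory and whether OSX and
# OSGMLNORM point at executable files.
check_binaries() {
    mk=$1
    dir=$(make_var "$mk" XPATBINDIR)
    if [ -d "$dir" ]; then
        echo "ok: XPATBINDIR = $dir"
    else
        echo "CHECK: XPATBINDIR = $dir"
    fi
    for name in OSX OSGMLNORM; do
        bin=$(make_var "$mk" "$name")
        if [ -f "$bin" ] && [ -x "$bin" ]; then
            echo "ok: $name = $bin"
        else
            echo "CHECK: $name = $bin"
        fi
    done
}
```

Running check_binaries $DLXSROOT/bin/s/samplefa/Makefile prints one line per variable; lines starting with CHECK name a path that does not exist on your system.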
Step by step instructions for setting up Directories for Data Preparation
You can use the scripts and files from the sample finding aids collection "samplefa" as a basis for creating a new collection. In the instructions that follow you would use /{c}/{coll} instead of /w/workshopfa where {c} is the first letter of your collection id and {coll} is your collection id. So for example if your collection id was mycoll instead of
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch
you would do
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/m/mycoll/mycoll.extra.srch
This documentation will make use of the concept of the $DLXSROOT, which is the place at which your DLXS directory structure starts. We generally use /l1/.
To check your $DLXSROOT, type the following command at the command prompt:
echo $DLXSROOT
The prep directory under $DLXSROOT is the space for you to take your encoded finding aids and "package them up" for use with the DLXS middleware. Create your basic directory $DLXSROOT/prep/w/workshopfa and its data subdirectory with the following command:
mkdir -p $DLXSROOT/prep/w/workshopfa/data
Move into the prep directory with the following command:
cd $DLXSROOT/prep/w/workshopfa
This will be your staging area for all the things you will be doing to your EADs, and ultimately to your collection. At present, all it contains is the data subdirectory you created a moment ago. Unlike the contents of other collection-specific directories, everything in prep should be ultimately expendable in the production environment.
Copy the necessary files into your data directory with the following commands:
cp $DLXSROOT/prep/s/samplefa/data/*.xml $DLXSROOT/prep/w/workshopfa/data/.
We'll also need a few files to get us started working. They will need to be copied over as well, and also have paths adapted and collection identifiers changed. Follow these commands:
cp $DLXSROOT/prep/s/samplefa/validateeach.csh $DLXSROOT/prep/w/workshopfa/.
cp $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp
cp $DLXSROOT/prep/s/samplefa/samplefa.text.inp $DLXSROOT/prep/w/workshopfa/workshopfa.text.inp
mkdir -p $DLXSROOT/obj/w/workshopfa
mkdir -p $DLXSROOT/bin/w/workshopfa
cp $DLXSROOT/bin/s/samplefa/preparedocs.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/Makefile $DLXSROOT/bin/w/workshopfa/Makefile
If you have installed the Release 13 August 24th patch, use the following commands instead of the ones above. For Release 13, make sure you install the patch files as listed in the README.
cp $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp
cp $DLXSROOT/prep/s/samplefa/samplefa.text.inp $DLXSROOT/prep/w/workshopfa/workshopfa.text.inp
mkdir -p $DLXSROOT/obj/w/workshopfa
mkdir -p $DLXSROOT/bin/w/workshopfa
cp $DLXSROOT/bin/s/samplefa/preparedocs.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/fixdoctype.pl $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/validateeach.sh $DLXSROOT/bin/w/workshopfa/.
cp $DLXSROOT/bin/s/samplefa/Makefile $DLXSROOT/bin/w/workshopfa/Makefile
Make sure you have edited the perl bang prompt and the paths to your shell and directories in these files:
- $DLXSROOT/bin/f/findaid/stripdoctype.pl
- $DLXSROOT/bin/s/samplefa/fixdoctype.pl
- $DLXSROOT/bin/s/samplefa/preparedocs.pl
- $DLXSROOT/bin/w/workshopfa/validateeach.sh
- $DLXSROOT/bin/w/workshopfa/Makefile
Now you'll need to edit these files to ensure that the paths match your $DLXSROOT and that the collection name is workshopfa instead of samplefa.
STOP!! Make sure you edit the files before going to the next steps!!
Make sure you change these files:
- $DLXSROOT/prep/w/workshopfa/validateeach.csh
- $DLXSROOT/prep/w/workshopfa/validateeach.sh
- $DLXSROOT/bin/w/workshopfa/Makefile (see below for details)
You can run this command to check to see if you forgot to change samplefa to workshopfa:
grep "samplefa" $DLXSROOT/bin/w/workshopfa/* $DLXSROOT/prep/w/workshopfa/* |grep -v "#"
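If the grep turns up leftovers, the substitution can be scripted rather than done by hand. The following is a rough sketch, not a DLXS tool: retarget_copy is a hypothetical name, it keeps a .bak copy of each file, and it only rewrites the collection ID and the /s/samplefa style paths. Settings such as FIRSTLETTERSUBDIR = s in the Makefile still have to be edited by hand, and you should review the result before running any make targets.

```shell
#!/bin/sh
# retarget_copy FILE OLD NEW
# Rewrite FILE in place so that paths like /o/OLD become /n/NEW and any
# remaining occurrence of the collection ID OLD becomes NEW.
# The original contents are kept in FILE.bak.
retarget_copy() {
    file=$1
    old=$2
    new=$3
    # First letters of the old and new collection IDs
    oldc=$(printf '%s' "$old" | cut -c1)
    newc=$(printf '%s' "$new" | cut -c1)
    cp "$file" "$file.bak"
    sed -e "s|/$oldc/$old|/$newc/$new|g" -e "s|$old|$new|g" "$file.bak" > "$file"
}
```

For example, retarget_copy $DLXSROOT/bin/w/workshopfa/Makefile samplefa workshopfa rewrites the copied Makefile; rerunning the grep command afterwards should then report nothing for that file.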
With the ready-to-go ead2002 encoded finding aids files in the data directory, we are ready to begin the preparation process. This will include:
- Validating the files individually against the EAD 2002 DTD
- Concatenating the files into one larger XML file
- Validating the concatenated file against the dlxsead2002 DTD
- "Normalizing" the concatenated file.
- Validating the normalized concatenated file against the dlxsead2002 DTD
These steps are generally handled via the Makefile in $DLXSROOT/bin/s/samplefa, which we have copied to $DLXSROOT/bin/w/workshopfa. See the Example Makefile.
Make sure you changed your copy of the Makefile to reflect /w/workshopfa instead of /s/samplefa. You will want to change lines 2 and 3 accordingly, from:
1 DLXSROOT = /l1
2 NAMEPREFIX = samplefa
3 FIRSTLETTERSUBDIR = s
to:
1 DLXSROOT = /l1
2 NAMEPREFIX = workshopfa
3 FIRSTLETTERSUBDIR = w
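In this example installation, make validateeach should run the command
/l1/workshop/tburtonw/dlxs/prep/w/workshopfa/validateeach.csh
which just calls the validateeach C-shell script.
If a stray space ends up inside that path (most likely from a trailing space after a value such as FIRSTLETTERSUBDIR = w, since make keeps trailing whitespace in variable values), make tries to run a directory name, "/l1/workshop/tburtonw/dlxs/prep/w", with the argument "/workshopfa/validateeach.csh", which does not make sense:
% make validateeach
/l1/workshop/tburtonw/dlxs/prep/w /workshopfa/validateeach.csh
make: execvp: /l1/workshop/tburtonw/dlxs/prep/w: Permission denied
make: [validateeach] Error 127 (ignored)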
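An error like the one above, where a path is split at a space, usually points to trailing whitespace after a variable value, because make keeps trailing whitespace as part of the value. A quick way to look for it is sketched below; find_trailing_space is a hypothetical helper name, and the pattern only covers simple "NAME = value" lines.

```shell
#!/bin/sh
# find_trailing_space MAKEFILE
# Print, with line numbers, every simple variable assignment whose value
# ends in spaces or tabs.
find_trailing_space() {
    grep -n -E '^[A-Za-z_][A-Za-z0-9_]*[[:space:]]*=.*[[:space:]]+$' "$1"
}
```

For example, find_trailing_space $DLXSROOT/bin/w/workshopfa/Makefile lists the assignments to clean up; no output means none were found.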
Further note on editing the Makefile: If you modify or write your own Make targets, you need to make sure that a real "tab" starts each command line rather than spaces. The easiest way to check for these kinds of errors is to use "cat -vet Makefile", which displays each tab as ^I and marks the end of each line with $.
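As a complement to reading the cat -vet output, a grep can list the suspect lines directly. This is only a sketch: find_space_indented is a hypothetical name, and the pattern flags every line that begins with spaces, so space-indented continuation lines of variable definitions will also be reported.

```shell
#!/bin/sh
# find_space_indented MAKEFILE
# Print, with line numbers, every non-comment line that starts with one or
# more spaces. In a Makefile these are often command lines that should
# start with a tab instead.
find_space_indented() {
    grep -n -E '^ +[^ #]' "$1"
}
```

For example, find_space_indented $DLXSROOT/bin/w/workshopfa/Makefile prints the lines to inspect; replace their leading spaces with a single tab if they are command lines.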
You should make sure that $DLXSROOT and the locations of the various binaries have been changed to match your installation.
- Change $DLXSROOT /l1/ to your $DLXSROOT on every line that uses it
- Change XPATBINDIR = /l/local/bin/ to the location of the xpat binary in your installation
- Change the location of the osx binary from OSX = /l/local/bin/osx to the location in your installation
- Change the location of the osgmlnorm binary from OSGMLNORM = /l/local/bin/osgmlnorm to the location in your installation
Set Up Directories and Files for XPAT Indexing
In the following instructions, please substitute /{c}/{coll}, where {c} is the first letter of your collection id and {coll} is your collection id, for any instance of /w/workshopfa, and substitute {coll} wherever you see "workshopfa".
First, we need to create the rest of the directories in the workshopfa environment with the following commands:
mkdir -p $DLXSROOT/idx/w/workshopfa
The bin directory we created when we prepared directories for data preparation holds any scripts or tools used specifically for the collection; obj (created earlier) holds the "object" or XML file for the collection; and idx holds the XPAT indexes. Now we need to finish populating the directories.
cp $DLXSROOT/prep/s/samplefa/samplefa.blank.dd $DLXSROOT/prep/w/workshopfa/workshopfa.blank.dd
cp $DLXSROOT/prep/s/samplefa/samplefa.extra.srch $DLXSROOT/prep/w/workshopfa/workshopfa.extra.srch
Each of these files needs to be edited to reflect the new collection name and the paths to your particular directories. Failure to change even one file can result in puzzling errors, because the scripts are working, just not necessarily in the directories you are looking at.
grep -l "samplefa" $DLXSROOT/prep/w/workshopfa/*
will check for any leftover unchanged instances of the string samplefa. You also need to check that "/l1/" has been replaced by whatever $DLXSROOT is on your server. If you don't have an /l1 directory on your server (which is very likely if you are not using a DLPS machine), you can check with:
grep -l "l1" $DLXSROOT/prep/w/workshopfa/*