Finding Aids Data Preparation

From DLXS Documentation

Revision as of 13:04, 14 September 2007 by Cboulay (Talk | contribs)

Preprocessing

Validating and Normalizing Your Data

Step 1: Validating the files individually against the EAD 2002 DTD

cd $DLXSROOT/bin/w/workshopfa
make validateeach


The Makefile runs the following command:

% $DLXSROOT/prep/w/workshopfa/validateeach.csh


What's happening: The Makefile runs the C shell script validateeach.csh in the prep directory. For each XML file in the data subdirectory, the script creates a temporary copy without the public DOCTYPE declaration and then runs onsgmls on that copy to check that it conforms to the EAD 2002 DTD. If a file fails validation, the errors are written to a file in the data subdirectory with the same name as the finding aid file plus the extension .err. If there are validation errors, fix the problems in the source XML files and re-run make validateeach.

Check for error files by running the following command:

 ls -l $DLXSROOT/prep/w/workshopfa/data/*err

If there are any *err files, you can look at them with the following command:

 less $DLXSROOT/prep/w/workshopfa/data/*err

Common error messages and solutions:

onsgmls: Command not found
The path to your installation of the onsgmls binary is incorrect in $DLXSROOT/prep/s/samplefa/validateeach.csh (or your customized version).

onsgmls:/l1/dev/tburtonw/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
This is a warning (note the :W:), not an error, and can be ignored. To silence it, change line 6 of $DLXSROOT/prep/s/samplefa/validateeach.csh (or your customized version)

from:

onsgmls -s -f $file.err $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/samplefa/samplefa.text.inp $file.tmp

to:

onsgmls -wxml -w no-explicit-sgml-decl -s -f $file.err $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/samplefa/samplefa.text.inp $file.tmp

Entity reference errors such as "general entity 'foobar' not defined"
If you use entity references in your EADs, you may see errors about entities that cannot be resolved. The solution is to add the entity declarations to the DOCTYPE declaration in these two files:

  • $DLXSROOT/prep/s/samplefa/samplefa.text.inp
    This is the DOCTYPE declaration used by the validateeach.csh script; it points to the EAD 2002 DTD.
  • $DLXSROOT/prep/s/samplefa/samplefa.xml.inp
    This is the DOCTYPE declaration that points to the dlxsead2002 DTD, which the "make validate" target of the Makefile uses to validate the concatenated file containing all of your EADs.
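
The triage described above (lines marked :W: are warnings, anything else needs attention) can be sketched in Python. This is a minimal sketch and not part of DLXS: the function names are hypothetical, and it assumes that every non-blank line in an .err file is one onsgmls message.

```python
from pathlib import Path


def split_messages(text):
    """Split onsgmls output into (warnings, others).

    Assumption: each non-blank line is one message, and warnings are the
    lines that contain the ':W:' marker.
    """
    warnings, others = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        (warnings if ":W:" in line else others).append(line)
    return warnings, others


def files_needing_fixes(data_dir):
    """Map each *.err file name in data_dir to its non-warning messages."""
    report = {}
    for err_file in sorted(Path(data_dir).glob("*.err")):
        text = err_file.read_text(encoding="utf-8", errors="replace")
        _, others = split_messages(text)
        if others:
            report[err_file.name] = others
    return report
```

Calling files_needing_fixes on $DLXSROOT/prep/w/workshopfa/data would then list only the finding aids whose .err files contain something other than warnings.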

Step 2: Concatenating the files into one larger XML file (and running some preprocessing commands)

cd $DLXSROOT/bin/w/workshopfa
make prepdocs

The Makefile runs the following command:

$DLXSROOT/bin/w/workshopfa/preparedocs.pl $DLXSROOT/prep/w/workshopfa/data $DLXSROOT/obj/w/workshopfa/workshopfa.xml $DLXSROOT/prep/w/workshopfa/logfile.txt

This runs the preparedocs.pl script on all the files in the specified data directory and writes the output to the workshopfa.xml file in the appropriate /obj subdirectory. It also writes a log file to the /prep directory.

The Perl script does two sets of things:

  1. Concatenates all the files
  2. Runs a number of preprocessing steps on all the files

Concatenating the files

The script finds all XML files in the data subdirectory and strips the XML declaration and DOCTYPE declaration from each file before concatenating them. It also wraps the concatenated EADs in a <COLL> element. The end result looks like:


<COLL>
<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>

WARNING! If there are extra characters in a file, or some other problem trips up the part of the program that strips out the XML declaration and the DOCTYPE declaration, the file will end up like:


<COLL>
baddata<ead><eadheader><eadid>1</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>2</eadid>...</eadheader>... content</ead>
baddata<ead><eadheader><eadid>3</eadid>...</eadheader>... content</ead>
</COLL>

This makes the document invalid, since dlxsead2002.dtd does not allow anything between the closing tag of one finding aid (</ead>) and the opening tag of the next (<ead>).

Some of the possible causes of such a problem are:

  • UTF-8 Byte Order Marks at the beginning of the file
  • DOCTYPE declaration on more than one line
  • XML processing instructions
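
A more defensive version of the stripping and wrapping step, one that copes with all three causes listed above, might look like the following Python sketch. It is not the code in preparedocs.pl or in any DLXS patch. The function names are hypothetical, and it assumes each file has already been read into a string and that no DOCTYPE-like text appears inside comments or CDATA sections.

```python
import re

BOM = "\ufeff"
# Any processing instruction, including the XML declaration itself.
PI_RE = re.compile(r"<\?.*?\?>", re.DOTALL)
# A DOCTYPE declaration, possibly spanning several lines and possibly
# carrying an internal subset in square brackets.
DOCTYPE_RE = re.compile(r"<!DOCTYPE[^\[>]*(?:\[.*?\])?\s*>", re.DOTALL)


def strip_prolog(text):
    """Remove the BOM, XML declaration, other PIs and the DOCTYPE."""
    text = text.replace(BOM, "")
    text = PI_RE.sub("", text)
    text = DOCTYPE_RE.sub("", text)
    return text.strip()


def concatenate(documents):
    """Wrap the cleaned finding aids in a single <COLL> element."""
    body = "\n".join(strip_prolog(doc) for doc in documents)
    return "<COLL>\n" + body + "\n</COLL>\n"
```

Note that removing an internal subset also removes any entity declarations it contains, so this sketch only suits files whose entities are declared elsewhere.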

Preprocessing steps

The Perl program also does some preprocessing on all the files. These steps are customized to the needs of the Bentley. You should look at the Perl code and modify it so that it is appropriate for your encoding practices.

The preprocessing steps are:

  • finds all id attributes and prepends a number to them
  • adds a prefix string "dao-bhl" to all DAO links (You probably will want to change this)
  • removes empty persname, corpname, and famname elements

The output of the combined concatenation and preprocessing steps is a single XML file, named for the collection, which is deposited in the obj subdirectory.

If your collections need to be transformed in any other way, or if you do not want some of the transformations to take place (the DAO changes, for example), edit the preparedocs.pl file accordingly. Some changes you may want to make include:

  • Changing the algorithm used to make id attributes unique. For example, if your encoding practices use id attributes and targets, the out-of-the-box algorithm will break the relationship between the attributes and targets. One possible modification is to prepend the eadid or filename to all id and target attributes.
  • Modifying the program to read a list of files or eadids so that the files are concatenated in a particular order. The default sort order for search results is occurrence order, which is the order in which the EADs are concatenated. If you write a script that examines each EAD for the element you want to sort by and outputs a list of filenames in that order, you can pass that list to a modified preparedocs.pl so that it concatenates the files in the order listed.
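
Both modifications can be prototyped outside preparedocs.pl before porting them to Perl. The Python sketch below is illustrative only: the function names are hypothetical, it uses regular expressions rather than an XML parser, and it assumes that id and target are the only attributes that need the prefix. Because XML IDs must begin with a letter or underscore, a purely numeric eadid would need an extra leading letter.

```python
import re

# An id or target attribute: (name and '='), (quote), (value).
ATTR_RE = re.compile(r'(\s(?:id|target)\s*=\s*)(["\'])(.*?)\2', re.DOTALL)
EADID_RE = re.compile(r"<eadid[^>]*>(.*?)</eadid>", re.DOTALL)


def get_eadid(text):
    """Return the content of the first <eadid> element, or None."""
    match = EADID_RE.search(text)
    return match.group(1).strip() if match else None


def prefix_ids(text, prefix):
    """Prepend prefix to every id and target value, keeping them linked."""
    def add_prefix(match):
        start, quote, value = match.groups()
        return f"{start}{quote}{prefix}-{value}{quote}"
    return ATTR_RE.sub(add_prefix, text)


def order_files(docs, key_func):
    """Return the file names in docs (name -> text) sorted by key_func(text)."""
    return sorted(docs, key=lambda name: key_func(docs[name]))
```

The list returned by order_files could be written to a file, one name per line, and read by a modified preparedocs.pl.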

Step 3: Validating the concatenated file against the dlxsead2002 DTD

make validate

The Makefile runs the following command:

onsgmls -wxml -s -f $DLXSROOT/prep/w/workshopfa/workshopfa.errors $DLXSROOT/misc/sgml/xml.dcl   $DLXSROOT/prep/w/workshopfa/workshopfa.xml.inp $DLXSROOT/obj/w/workshopfa/workshopfa.xml

This runs the onsgmls command against the concatenated file using the dlxsead2002 DTD, and writes any errors to the workshopfa.errors file in $DLXSROOT/prep/w/workshopfa.

Note that we are running this using workshopfa.xml.inp, not workshopfa.text.inp. The workshopfa.xml.inp file points to $DLXSROOT/misc/sgml/dlxsead2002.ead, which is the dlxsead2002 DTD. The dlxsead2002 DTD is exactly the same as the EAD 2002 DTD, except that it adds a wrapping element, <COLL>, so that more than one ead element (that is, more than one finding aid) can be combined in one file. It is a good idea to validate the file now before going further.
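
If you want to run the same validation from your own tooling rather than through make, the command line can be assembled as in the Python sketch below. The function names are hypothetical. The sketch assumes the directory layout shown on this page (area, then first letter of the collection id, then the collection id) and uses only the onsgmls flags that appear on this page.

```python
import posixpath


def collection_dir(dlxsroot, area, collection):
    """Build <root>/<area>/<first letter>/<collection>, e.g. prep/w/workshopfa."""
    return posixpath.join(dlxsroot, area, collection[0], collection)


def validate_command(dlxsroot, collection, quiet_decl_warning=False):
    """Return the onsgmls argument list for validating the concatenated file."""
    prep = collection_dir(dlxsroot, "prep", collection)
    obj = collection_dir(dlxsroot, "obj", collection)
    cmd = ["onsgmls", "-wxml"]
    if quiet_decl_warning:
        # Silences "SGML declaration was not implied".
        cmd += ["-w", "no-explicit-sgml-decl"]
    cmd += [
        "-s",
        "-f", posixpath.join(prep, collection + ".errors"),
        posixpath.join(dlxsroot, "misc", "sgml", "xml.dcl"),
        posixpath.join(prep, collection + ".xml.inp"),
        posixpath.join(obj, collection + ".xml"),
    ]
    return cmd
```

The resulting list can be passed to subprocess.run, with the value of the DLXSROOT environment variable as the first argument.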


Run the following command:

 ls -l $DLXSROOT/prep/w/workshopfa/workshopfa.errors

If there is a workshopfa.errors file, run the following command to look at the errors reported:

 less $DLXSROOT/prep/w/workshopfa/workshopfa.errors


Common causes of error messages and solutions:

make: onsgmls: Command not found
The OSGMLNORM variable in the Makefile does not point to the correct location of onsgmls for your installation, or OpenSP is not installed.

No errors from "make validateeach", but errors from "make validate"
There was very likely a problem with the preparedocs.pl processing:
  • The DOCTYPE declaration did not get completely removed. (The current scripts don't always remove multiline DOCTYPE declarations.)
  • There was a UTF-8 Byte Order Mark at the beginning of one or more of the concatenated files.

onsgmls:/l1/dev/tburtonw/misc/sgml/xml.dcl:1:W: SGML declaration was not implied
This warning can be ignored, but if you see any other errors, STOP! You need to determine the cause of the problem, fix it, and rerun the steps until there are no errors from make validate. If you continue to the next steps with an invalid XML document, the errors will compound and it will be very difficult to trace the cause of the problem. To avoid seeing this warning, add the "-w no-explicit-sgml-decl" flag to the onsgmls command on line 83 of the Makefile. Change line 83 of the Makefile

from:

onsgmls -wxml -s -f $(PREPDIR)$(NAMEPREFIX).errors $(XMLDECL) $(XMLDOCTYPE) $(XMLFILE)

to:

onsgmls -wxml -w no-explicit-sgml-decl -s -f $(PREPDIR)$(NAMEPREFIX).errors $(XMLDECL) $(XMLDOCTYPE) $(XMLFILE)

A patch is available that strips Byte Order Marks, removes XML processing instructions and multiline DOCTYPE declarations, and implements the change to the onsgmls warning flag noted above: DLXS13 August 24 Findaid Class Patch.

Step 4: Normalizing the concatenated file

make norm

The Makefile runs a series of copy statements and two main commands:

  1. /l/local/bin/osgmlnorm -f $DLXSROOT/prep/s/samplefa/samplefa.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/obj/s/samplefa/samplefa.xml.prenorm > $DLXSROOT/obj/s/samplefa/samplefa.xml.postnorm
  2. /l/local/bin/osx -bUTF-8 -xlower -xempty -xno-nl-in-tag -f $DLXSROOT/prep/s/samplefa/samplefa.errors $DLXSROOT/misc/sgml/xml.dcl $DLXSROOT/prep/s/samplefa/samplefa.xml.inp $DLXSROOT/obj/s/samplefa/samplefa.xml.postnorm > $DLXSROOT/obj/s/samplefa/samplefa.xml.postnorm.osx


These commands ensure that your collection data is normalized, meaning that the attributes of each element are put in the order in which they are declared in the DTD. Your collection data is XML, so attribute order should be irrelevant (according to the XML specification), but because of a bug in one of the supporting libraries used by xmlrgn (part of the indexing software), attributes must appear in the order in which they are defined in the DTD. If you have "out-of-order" attributes and don't run make norm, you will get "invalid endpoints" errors during the make post step.
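
To make "normalized attribute order" concrete, here is a small Python sketch of the reordering itself. It only illustrates the idea; it is not how osgmlnorm is implemented. The function names are hypothetical, and the declared order in the usage note is invented for the example rather than taken from the EAD DTD.

```python
def order_attributes(attrs, declared_order):
    """Sort (name, value) pairs into the order the DTD declares them.

    Attributes missing from declared_order keep their relative order and
    are placed last (sorted() is stable).
    """
    rank = {name: position for position, name in enumerate(declared_order)}
    return sorted(attrs, key=lambda pair: rank.get(pair[0], len(rank)))


def render_start_tag(name, attrs):
    """Serialize an element name and its (name, value) pairs as a start tag."""
    rendered = "".join(f' {key}="{value}"' for key, value in attrs)
    return f"<{name}{rendered}>"
```

For example, with a hypothetical declared order of level then id, the pairs id="c1" and level="series" would be serialized with level first, whatever order they had in the source file.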

The first command, which normalizes the document, writes its errors to $DLXSROOT/prep/s/samplefa/samplefa.errors. Be sure to check this file.

The second command, which runs osx to convert the normalized document back into XML, produces many error messages, which are written to standard output. These occur because osx is validating the SGML document created by the osgmlnorm step against an XML DTD (the EAD 2002 DTD). These are the only errors that may generally be ignored. However, if the next recommended step, running "make validate" again, reveals an invalid document, you may want to rerun osx and look at its errors for clues. (Only do this if you are sure that the problem is not caused by XML processing instructions in the documents, as explained below.)

Step 5: Validating the normalized file against the dlxsead2002 DTD

make validate

We run this step again to make sure that the normalization process did not produce an invalid document. This is necessary because under some circumstances the "make norm" step can produce invalid XML. One known cause is the presence of XML processing instructions, for example "<?Pub Caret1?>". Although XML processing instructions are supposed to be ignored by any XML application that does not understand them, osgmlnorm and osx are SGML tools, and they end up munging the output XML. The recommended workaround is to add a preprocessing step that removes any XML processing instructions from your EADs before you run "make prepdocs", or to add code to preparedocs.pl that strips out XML processing instructions before concatenating the EADs.
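
The standalone preprocessing step suggested above could be as small as the following Python sketch, run over the data directory before "make prepdocs". It is an assumption-laden illustration, not a DLXS tool: the function names are hypothetical, the files are assumed to be UTF-8, and the XML declaration is deliberately kept because validateeach still operates on the individual files.

```python
import re
from pathlib import Path

# Any processing instruction except the XML declaration (<?xml ...?>).
PI_RE = re.compile(r"<\?(?!xml[\s?]).*?\?>", re.DOTALL)


def remove_processing_instructions(text):
    """Return text with all processing instructions removed."""
    return PI_RE.sub("", text)


def clean_directory(data_dir, pattern="*.xml"):
    """Rewrite matching files in place; return the names that changed."""
    changed = []
    for path in sorted(Path(data_dir).glob(pattern)):
        # newline="" keeps the file's original line endings intact.
        with open(path, encoding="utf-8", newline="") as handle:
            original = handle.read()
        cleaned = remove_processing_instructions(original)
        if cleaned != original:
            with open(path, "w", encoding="utf-8", newline="") as handle:
                handle.write(cleaned)
            changed.append(path.name)
    return changed
```

Because clean_directory rewrites files in place, it is safest to run it on a copy of the data directory first.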
