Index building

From DLXS Documentation


Revision as of 12:19, 9 September 2007

You will need to identify a directory or directories where you plan to store your SGML or XML source file, your index file (approximately 75% of the size of your bibliographic information), your "region" files, and other information such as data dictionaries. We recommend you use the following structure:

  • Store SGML or XML files in /{DLXSROOT}/obj/{s}/{sample}/ where {DLXSROOT} is the "tree" where you install all DLXS components, {s} is the first letter of the name of the collection you are indexing, and {sample} is the collection ID of the collection you are indexing. For example, if your collection ID is "nyt" and your DLXSROOT is "/l1", you will place the nyt.xml file in /l1/obj/n/nyt/, i.e., /l1/obj/n/nyt/nyt.xml. See the directory conventions (http://docs.umdl.umich.edu/arch/general/directory-conventions.html) for more information.
  • Store index, region, data dictionary, and init files in /{DLXSROOT}/idx/{s}/{sample}/, e.g., /l1/idx/n/nyt/nyt.idx. See the XPAT documentation (../../xpat/index.html) for more on these types of files.
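
The directory convention above can be sketched in shell. This is only an illustration: dlxs_dirs is a hypothetical helper name, not part of DLXS, and it assumes the collection ID starts with the letter used for {s}.

```shell
# Hypothetical helper (not part of DLXS): print the obj and idx directories
# for a collection, following the /{DLXSROOT}/{obj,idx}/{s}/{sample}/ layout.
dlxs_dirs() {
    root=$1    # DLXSROOT, e.g. /l1
    coll=$2    # collection ID, e.g. nyt
    # {s} is taken here as the first letter of the collection ID.
    first=$(printf '%s' "$coll" | cut -c1)
    printf '%s/obj/%s/%s\n' "$root" "$first" "$coll"
    printf '%s/idx/%s/%s\n' "$root" "$first" "$coll"
}

dlxs_dirs /l1 nyt
```

For the sample collection this prints /l1/obj/n/nyt and /l1/idx/n/nyt, one per line.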

The instructions below assume a sample collection named "nyt" and a DLXSROOT of "/l1", as in the above examples. Please replace these sample names with your local filenames.

  1. Ensure that your SGML is fully validated or normalized, or that your XML is fully validated. Use a validating parser such as nsgmls to accomplish this. NB: Building indexes without validation can cause problems such as unreliable results; data that will not validate should not be put online.
  2. Ensure that your data is Unicode (see DLXS Unicode Data Preparation and Online Presentation Issues, ../unicode.html).
  3. Assuming XML, put the file nyt.xml in /l1/obj/n/nyt/nyt.xml
  4. Copy the sample data dictionary file (bib-sample.dd) to /l1/idx/n/nyt/ and rename as nyt.dd
  5. Edit the nyt.dd file to replace
    1. b/bib-sample/bib-sample.xml with n/nyt/nyt.xml
    2. b/bib-sample/bib-sample.idx with n/nyt/nyt.idx
    3. and b/bib-sample/bib-sample.init with n/nyt/nyt.init
  6. Copy the sample init file (bib-sample.init) to /l1/idx/n/nyt/ and rename as nyt.init
  7. Index your collection using the following command, replacing the value 10m with an appropriate amount of memory. Please see the XPAT documentation (../../xpat/index.html) to determine how much memory to allocate.
    xpatbldu -m 10m -D /l1/idx/n/nyt/nyt.dd
  8. Create your region files by issuing the following command.
    multirgn -f -D /l1/idx/n/nyt/nyt.dd -t bib-regions.tags
    The file bib-regions.tags can be located in any directory and can be deleted after the regions have been indexed. DLPS keeps a copy of this file in /l1/obj/lib/sgml/bib-regions.tags
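
Steps 4 through 6 can be scripted. The following is a minimal sketch, not a DLXS tool: prepare_dd is a hypothetical function name, and the location of the sample files (sample_dir) is an assumption, since it depends on your installation. It also assumes the sample data dictionary contains the three paths exactly as listed in step 5.

```shell
# Hypothetical helper (not part of DLXS) covering steps 4-6.
# sample_dir is wherever bib-sample.dd and bib-sample.init live locally.
prepare_dd() {
    sample_dir=$1
    idx_dir=$2     # e.g. /l1/idx/n/nyt
    coll=$3        # e.g. nyt
    first=$(printf '%s' "$coll" | cut -c1)

    mkdir -p "$idx_dir" || return 1

    # Steps 4 and 5: copy the sample data dictionary under the collection's
    # name, rewriting the three sample paths on the way.
    sed -e "s|b/bib-sample/bib-sample\.xml|$first/$coll/$coll.xml|g" \
        -e "s|b/bib-sample/bib-sample\.idx|$first/$coll/$coll.idx|g" \
        -e "s|b/bib-sample/bib-sample\.init|$first/$coll/$coll.init|g" \
        "$sample_dir/bib-sample.dd" > "$idx_dir/$coll.dd" || return 1

    # Step 6: copy the sample init file under the collection's name.
    cp "$sample_dir/bib-sample.init" "$idx_dir/$coll.init"
}
```

For the sample collection, the call would look like prepare_dd /path/to/samples /l1/idx/n/nyt nyt, with /path/to/samples replaced by the real location. Review the resulting nyt.dd by hand before running xpatbldu.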

You have now built indexes and region files for your collection. You can test that things are properly indexed by issuing the command

    xpatu /l1/idx/n/nyt/nyt.dd

and then searching for a common word (e.g., "the") and for a region:

    region A

It is a good idea to run this test from a directory other than the one you indexed in, to confirm that relative or absolute paths resolve correctly.
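
Before opening xpatu, a quick existence check from another directory can catch path mistakes early. This is a sketch only: check_collection_files is a hypothetical helper, and it checks just the four files named in the steps above (region files are not checked, since their names are not given here).

```shell
# Hypothetical helper (not part of DLXS): confirm the collection's files
# exist, using full paths so the result does not depend on the current
# working directory.
check_collection_files() {
    root=$1    # DLXSROOT, e.g. /l1
    coll=$2    # collection ID, e.g. nyt
    first=$(printf '%s' "$coll" | cut -c1)
    status=0
    for f in \
        "$root/obj/$first/$coll/$coll.xml" \
        "$root/idx/$first/$coll/$coll.dd" \
        "$root/idx/$first/$coll/$coll.idx" \
        "$root/idx/$first/$coll/$coll.init"
    do
        if [ -f "$f" ]; then
            echo "ok       $f"
        else
            echo "missing  $f"
            status=1
        fi
    done
    return $status
}
```

Running it as (cd /tmp && check_collection_files /l1 nyt) lists each file as ok or missing and returns a non-zero status if any file is absent. It does not verify the paths written inside nyt.dd; the xpatu test above remains the real check.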
