XPAT Index Building Overview
From DLXS Documentation
Overview
Index building is an intrinsic part of setting up an XPAT database. Indices make it possible to search text rapidly. Index building consists of a sequence of individual steps, each performed by a specialized utility with its own command-line requirements. Each utility has default behavior that can be customized or optimized with command-line parameters, which give greater control over the index-building process and can improve it.
Before discussing the details of the index-building process, we first review the function of the individual utilities that together build a complete set of indices. The sequence of utilities used depends greatly on the type of database in use. Chapter 3, Detailed Index Building, provides full details for each of the individual index-building utilities.
Components of the Index-building Process
The index-building process consists of two primary, required parts. The first is a string index, generated by xpatbld. When using MFS databases, other XPAT utilities help to prepare the data so that xpatbld can build a string index; these additional steps are not required in consolidated databases. The xpatbld string index is used to find occurrences of words, phrases, and ranges of values. The second part is the region index. The concept of region indices is covered in Chapter 2, section 2, XPAT Regions. Briefly, region indices make it possible to find information within specified (and demarcated) segments of the text, including segments nested within each other. These two indices, the string index and the region index, work together to give users and system builders the ability to find information in document collections using XPAT. The following table provides a brief inventory of the utilities used, their relevance to a consolidated index build, an MFS index build, or both, and the files read, modified, and created.
Index builder | Function | MFS or Consolidated | Files read | Files modified | Files created |
mfsbld | Builds the FileMap for an MFS database. This utility reads in the FilterChain segments defined in the DD ("patents.dd") and creates a list of all the files that those segments represent. It then creates one entry in the FileMap file for each file in the list. Additional information (date, time, etc.) about each file is added to the file. | MFS | DAT, DD | FMP, VIR | |
mfsmeta | Produces a metadata structure file | MFS | DD, FMP | STR | |
multirgn | Produces Region Sub-Indices for the metadata regions created in the FileMap by the mfsbld utility. | Both | DD, TAG | DD | RGN |
sgmlrgn/xmlrgn | Builds SGML or XML region indices based on the structure indicated in the Document Type Definition (e.g., textclass.dtd) file. | Both | DD, INP, SOC (XML only), DTD, RGN | DD, RGN | |
xpatbld | Produces the Main Index by using information in the DD to determine indexing parameters (character mappings, index point specifications, stopwords). Command-line options play an important role, especially the memory option, "-m". See below for a discussion of memory allocation and consult Chapter 3, Detailed Index Building, for full details on the "-m" option. | Both | DD, FMP | DD | IDX |
patffw, patffi | Produce the Fast Find Word List and the Fast Find Indices, respectively. These two utilities must be used in sequence for the Fast Find indices to be built properly. | DD | FF |
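The table above can be read as a lookup from database type to the utilities that apply to it. The following Python sketch illustrates that reading only; it is not part of XPAT. It assumes the Main Index row belongs to xpatbld (the text names xpatbld as the builder of the string index), it leaves patffw and patffi unclassified because their row does not state a database type, and it follows table order, which is not necessarily the order in which the utilities must be run. The names Utility, UTILITIES, and utilities_for are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Utility(NamedTuple):
    name: str
    # "MFS", "Both", or None when the table does not state a database type.
    applies_to: Optional[str]


# Rows transcribed from the table above, in table order (not a confirmed
# execution order). The xpatbld entry is an assumption based on the text.
UTILITIES = [
    Utility("mfsbld", "MFS"),
    Utility("mfsmeta", "MFS"),
    Utility("multirgn", "Both"),
    Utility("sgmlrgn/xmlrgn", "Both"),
    Utility("xpatbld", "Both"),
    Utility("patffw", None),
    Utility("patffi", None),
]


def utilities_for(database_type: str) -> List[str]:
    """Return the utilities the table marks as relevant to a database type."""
    if database_type not in ("MFS", "consolidated"):
        raise ValueError("database_type must be 'MFS' or 'consolidated'")
    selected = []
    for utility in UTILITIES:
        if utility.applies_to == "Both":
            selected.append(utility.name)
        elif utility.applies_to == "MFS" and database_type == "MFS":
            selected.append(utility.name)
        # Utilities with applies_to None are skipped: the table is silent.
    return selected


if __name__ == "__main__":
    print("MFS:", utilities_for("MFS"))
    print("Consolidated:", utilities_for("consolidated"))
```

Consult Chapter 3, Detailed Index Building, for the actual sequence and command lines for each database type.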
Memory Allocation for Index Building
Memory allocation is an important part of the index-building process. It is not required; however, its proper use will increase the speed with which the indices are built. There are two simple rules for deciding how much memory to allocate:
- Allocate as much physical RAM as possible.
- Do not allocate more RAM than you have available.
The second rule may sound obvious; however, attempting to allocate more RAM than is available will cause the system to "thrash." Your machine will be so busy swapping between physical and virtual memory that it will not be able to get on with building the indices. For example, you may have 200MB of data and 256MB of RAM. The operating system takes up, say, 6MB of RAM, leaving 250MB for index building. If you allocate the default, which is 512KB, the index-building process will take far longer than if you allocate, say, 50MB or even 100MB. In this case, you can allocate up to 250MB. However, if you allocate 300MB, you have exceeded the available RAM, and the machine will thrash as described. Consult your system administrator for information on the amount of physical RAM your machine has available.
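The arithmetic in this example can be written out as a short Python sketch. It only restates the two rules with the figures used above (256 megabytes of RAM, 6 megabytes taken by the operating system); the function names are hypothetical, and the operating-system overhead is an estimate you must supply for your own machine.

```python
def available_for_indexing(physical_ram_mb: int, os_overhead_mb: int) -> int:
    """Physical RAM left for index building, in megabytes."""
    return physical_ram_mb - os_overhead_mb


def allocation_is_safe(requested_mb: int, physical_ram_mb: int,
                       os_overhead_mb: int) -> bool:
    """True if the requested allocation fits in the available physical RAM."""
    return requested_mb <= available_for_indexing(physical_ram_mb, os_overhead_mb)


if __name__ == "__main__":
    ram_mb, os_mb = 256, 6  # figures from the example in the text
    print("Available:", available_for_indexing(ram_mb, os_mb), "megabytes")
    for requested in (50, 100, 250, 300):
        verdict = "fits" if allocation_is_safe(requested, ram_mb, os_mb) else "exceeds RAM"
        print(requested, "megabytes:", verdict)
```

With these figures, any request up to 250 megabytes fits, and a request of 300 megabytes does not.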
Summary
The type of database in use determines the sequence of individual index builds. Different index-building utilities with different user-selected parameters control the indexing process. Further details of each index-building utility are available in Chapter 3, Detailed Index Building.
Miscellaneous
The table below lists the files that may be read, created, or modified by an individual index-building utility. In addition to the filename suffix, there is a short-form notation column. These short-form notations are used in various chapters of the XPAT documentation.
File or Suffix | Notation | Description |
Various | DAT | Source data files of any type |
".dd" | DD | Data Dictionary (main control file) |
".dtd" | DTD | Document Type Definition (SGML or XML element hierarchy) |
".ff?" | FF | Fast Find Word List and Fast Find Indices |
".fmp" | FMP | File Map Files |
".idx" | IDX | Main Index File |
".inp" | INP | Document Type Declaration File (SGML or XML) |
".log" | LOG | Message Log File |
".rgn" | RGN | Region Sub-Indices File |
".tag" | TAG | Region tags file |
"tempfile.mfs" | VIR | Virtual data file consolidated from source data |
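As a reading aid, the suffix-to-notation mapping in the table above can be expressed as a small Python lookup. This is a sketch, not an XPAT tool. It assumes that the "?" in ".ff?" stands for exactly one character, and that any file not matching a listed suffix counts as source data (DAT), since the table describes DAT files as being of any type. The names SUFFIX_NOTATION and notation_for, and the example filename "patents.ffw", are hypothetical.

```python
from fnmatch import fnmatchcase

# Glob patterns derived from the "File or Suffix" column of the table above.
# Assumption: the "?" in ".ff?" matches exactly one character.
SUFFIX_NOTATION = [
    ("*.dd", "DD"),
    ("*.dtd", "DTD"),
    ("*.ff?", "FF"),
    ("*.fmp", "FMP"),
    ("*.idx", "IDX"),
    ("*.inp", "INP"),
    ("*.log", "LOG"),
    ("*.rgn", "RGN"),
    ("*.tag", "TAG"),
    ("tempfile.mfs", "VIR"),
]


def notation_for(filename: str) -> str:
    """Return the short-form notation for a filename, defaulting to DAT."""
    lowered = filename.lower()
    for pattern, notation in SUFFIX_NOTATION:
        if fnmatchcase(lowered, pattern):
            return notation
    # Assumption: anything unrecognized is treated as a source data file.
    return "DAT"


if __name__ == "__main__":
    for name in ("patents.dd", "textclass.dtd", "patents.ffw", "tempfile.mfs"):
        print(name, "->", notation_for(name))
```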