NAME

Nextrieve - Perl access to Nextrieve Search Engine


SYNOPSIS

 use Nextrieve;
 use Liz qw(Contents WriteContents);

 $index = new Nextrieve( 'token' );
 $index->Tree( '/export/home/local/www/customer/root' );
 $resource = $index->Index;
 WriteContents( '/export/home/resourcefile',$resource );

 $search = new Nextrieve( Contents('/export/home/resourcefile') );
 $hits = $search->Search( $string );
 print "There were $hits hits:<P>\n";
 while( ($title,$mappedname,$preview) = $search->FetchRow ) {
   $preview =~ s#\\b#<B>#sg; $preview =~ s#\\r#</B>#sg;
   print "<A HREF=\"$mappedname\">$title</A><BR>$preview<P>\n";
 }


EXAMPLES

to be expanded

 Nextrieve->ProgramVersion( '1.1.0' );
 Nextrieve->ProgramDirectory( '/opt/nextrieve' );
 Nextrieve->IndexProgram( 'newntvindex' );
 Nextrieve->SearchProgram( 'newnextriev' );
 Nextrieve->TempDirectory( '/export/home/local/tmp' );

 Nextrieve->DontIndex( 'bak include' );
 $index->DontIndex( ' text data' );

 Nextrieve->FuzzyFactor( 5 );
 Nextrieve->PreviewHighlight( 5 );
 Nextrieve->DisplayedHits( 100 );
 Nextrieve->TotalHits( 500 );

 $hits = $search->Hits;
 $attributes = $search->Attributes;
 $search->Attributes( 'Y1997 M199801' );
 $search->Fields( 'title mappedname preview score' );


DESCRIPTION

This module provides basic access to the functionality of the Nextrieve fuzzy search engine from Perl.

It provides an object-oriented model for indexing information as specified by a so-called ``resource-file'' (in which Nextrieve is told where resources reside) and a so-called ``command-file'' (a list of commands and filenames that need to be indexed). It also provides an object-oriented model for fuzzy searching of indexes that were previously created.

Nextrieve is a product of Nexial Systems ( http://www.nexial.nl ).


BASIC METHODS

This section describes the basic methods for indexing a file tree, searching the resulting index, and displaying the results to the user.


new

Create a new Nextrieve object. When called with a resource specification, it can be used to perform searches with Search. Otherwise it can only be used to Index data, which returns the resource specification.

Input Parameters

 1 token to be used for index / resources to be used for search
   (default: token '_DEFAULT')
Output Parameters

 1 instantiated object


Tree

Create list of HTML-files to be indexed from an initial directory and all directories below it. Inserts proper attribute statements and ignores files and directories when so indicated.

.noindex

If this file exists but is empty, no files are indexed from this directory or from any directory below it. If the file is not empty, it is taken to be a list of files that should not be indexed; the directories below are then still scanned for files to be indexed.

.attribute

Specify the attributes that should be assigned to all files in the current directory and all files below it. Attributes are specified as words with whitespace between them (either on a single line or on more than one line).

If there is only one attribute and that attribute is ``*directory'', then no attributes are assigned to the current directory, but each directory below it is automatically assigned an attribute with the same name as that directory. This is especially useful on sites with a rigid structure in which each directory below the root directory is a separate section of the site that people are likely to want to search separately from the rest of the site.
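
As a rough illustration of the two control files described above, the following sketch interprets their contents. It is not taken from the module: the helper names noindex_policy and attribute_policy are hypothetical, each takes the file contents as a string (undef when the file does not exist), and the one-name-per-line layout of a non-empty .noindex file is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: interpret the contents of a .noindex file.
# Assumption: a non-empty file lists one file name per line, and a file
# holding only whitespace counts as empty.
sub noindex_policy {
    my ($contents) = @_;
    return { skip_tree => 0, skip_files => [] } unless defined $contents;

    my @files;
    for my $line ( split /\n/, $contents ) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @files, $line if length $line;
    }

    # Empty file: skip this directory and everything below it.
    return { skip_tree => 1, skip_files => [] } unless @files;

    # Otherwise only the listed files are skipped.
    return { skip_tree => 0, skip_files => \@files };
}

# Hypothetical helper: interpret the contents of a .attribute file.
# Attributes are whitespace-separated words on one or more lines.
sub attribute_policy {
    my ($contents) = @_;
    my @words = split ' ', ( defined $contents ? $contents : '' );
    if ( @words == 1 and $words[0] eq '*directory' ) {
        return { per_directory => 1, attributes => [] };
    }
    return { per_directory => 0, attributes => \@words };
}

my $policy = attribute_policy("Y1997 M199801\nnews\n");
print join( ',', @{ $policy->{attributes} } ), "\n";    # prints "Y1997,M199801,news"
```

Returning a small hash keeps the "skip the whole tree" and "name each subdirectory" cases explicit, which is easier to test than overloading an empty list.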

Input Parameters

 1 initial directory to be indexed
Note

Works best with relative directory specification: first chdir to the directory ``above'' the tree to be indexed, then use a relative spec to create a list of files in that tree.
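
The note above can be sketched without the module itself. The example below builds a throwaway tree, changes to the directory above it and collects non-empty HTML files using a relative specification. The helper name relative_html_files, the file names and the ".htm/.html" filter are assumptions made for illustration; the size check mirrors the -s test mentioned in the history of version 0.69.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Find qw(find);
use File::Path qw(mkpath);
use File::Temp qw(tempdir);

# Build a small throwaway tree so the sketch is self-contained.
my $parent = tempdir( CLEANUP => 1 );
mkpath("$parent/site/news");
for my $file ( 'site/index.html', 'site/news/item.html' ) {
    open my $fh, '>', "$parent/$file" or die "Cannot write $file: $!";
    print {$fh} "<html><title>demo</title></html>\n";
    close $fh;
}

# An empty file, which a size check (-s) skips.
open my $empty, '>', "$parent/site/empty.html" or die "Cannot write: $!";
close $empty;

# Hypothetical helper: list non-empty HTML files below a relative directory.
sub relative_html_files {
    my ( $parent_dir, $tree ) = @_;
    my $old = getcwd();
    chdir $parent_dir or die "Cannot chdir to $parent_dir: $!";

    my @found;
    find(
        {
            no_chdir => 1,    # keep $_ as a path relative to $parent_dir
            wanted   => sub { push @found, $_ if /\.html?$/i and -f $_ and -s _ },
        },
        $tree
    );

    chdir $old or die "Cannot chdir back to $old: $!";
    @found = sort @found;
    return @found;
}

print "$_\n" for relative_html_files( $parent, 'site' );
```

Because the paths start at the tree name rather than at the file system root, they can be moved to another machine or mapped to URLs without rewriting a prefix.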


Index

Index the information indicated by the object.

If it is indicated that Nextrieve should run as a server (by calling methods Server and Port previously), then the Nextrieve server will be (re-)started after the indexing is complete.

Input Parameters

 1 directory in which to place resulting database
   (default: 'indexes')
 2 type of information to be indexed
   (default: 'html', other values: ascii, mail and auto)
Output Parameters

 1 resource specification
   (to be used later to create a search object with "new")


StartServer

Start a Nextrieve server with the resource information that is available in the Nextrieve object.

Input Parameters

 1 index directory to be used
   (default: field INDEXDIR)
Output Parameters

 1 process ID (pid) of the server
   (undef if server start failed)


StopServer

Stop a Nextrieve server with the resource information that is available in the Nextrieve object.

Output Parameters

 1 process ID (pid) of the server that was stopped
   (undef if server stop failed)


Search

Search the Nextrieve database of this object and store the result list in the object. Returns the number of hits for which information was collected and the total number of hits that were found.

Attributes and other search properties should have been set by the appropriate methods before calling this method.

Input Parameters

 1 string to search for
 2 resource information to be used
   (default: what was specified with the new() )
Output Parameters

 1 number of hits returned
 2 number of hits found


FetchRow

Fetch a row of the search result as specified with the Fields method. Automatically fetches next row of the search result if no specific entry is specified.

Input Parameters

 1 entry in result list of which to obtain row
   (default: next, starts at 1)
 2 fields to obtain
   (default: as specified with "Fields")
Output Parameters

 1 array with values in order of fields specification
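
To make the "values in order of fields specification" rule concrete, here is a minimal sketch that does not call the module. The %hit record and the row_for helper are hypothetical stand-ins for one entry of a result list.

```perl
use strict;
use warnings;

# Hypothetical record for a single hit, keyed by field name.
my %hit = (
    title      => 'Welcome',
    mappedname => '/index.html',
    preview    => 'A \bwelcome\r page',
    score      => 42,
);

# Return the values of one hit in the order given by a
# space-separated fields specification.
sub row_for {
    my ( $record, $fields ) = @_;
    return map { $record->{$_} } split ' ', $fields;
}

my ( $score, $title ) = row_for( \%hit, 'score title' );
print "$title ($score)\n";    # prints "Welcome (42)"
```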


HighlightText

Place highlight codes in the specified text using the specified query and highlight values.

Input Parameters

 1 text to place highlight codes in
 2 search query to use to highlight
   (default: field 'SEARCHQUERY')
 3 highlight length to be used
   (default: field 'HIGHLIGHT' or auto)
 4 string to be used for start highlight
   (default: '<B>')
 5 string to be used for end highlight
   (default: '</B>')
Output Parameters

 1 text with the highlight codes placed in it
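
One way to picture what placing highlight codes could mean is sketched below. It is an assumption, not the module's algorithm: a word in the text is wrapped when its first N characters equal the first N characters of any query word, ignoring case, where N is the highlight length. The helper name highlight_sketch is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap words of $text whose first $length characters
# match the first $length characters of any word in $query.
sub highlight_sketch {
    my ( $text, $query, $length, $start, $end ) = @_;
    $start = '<B>'  unless defined $start;
    $end   = '</B>' unless defined $end;

    # Collect the lower-cased prefixes of all query words.
    my %prefix = map { ( lc( substr( $_, 0, $length ) ) => 1 ) }
        grep { length } split /\W+/, $query;

    # Wrap every word of the text that shares one of those prefixes.
    $text =~ s{(\w+)}{
        $prefix{ lc( substr( $1, 0, $length ) ) } ? "$start$1$end" : $1
    }gex;
    return $text;
}

print highlight_sketch( 'Fuzzy searching of indexes', 'search index', 5 ), "\n";
# prints "Fuzzy <B>searching</B> of <B>indexes</B>"
```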


ADVANCED METHODS

This section describes the advanced methods, which let you specify the attributes to be selected, the version of Nextrieve to be used, and more.


DontIndex

Specify or return the files and directories that will be skipped when creating the list of files to be indexed. File and directory names are specified as a list of words separated by spaces.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 'bak cgi cgi-bin include old'.

Input Parameters

 1 new list of files to be skipped (e.g. 'text data' )
   (default: no change, if starts with space: append to previous value)
Output Parameters

 1 current/old list of files and directories to be skipped
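
The class/object behaviour and the "leading space appends" rule can be illustrated with a small stand-alone class. Sketch::Settings is hypothetical and is not part of Nextrieve. Returning the value that was in effect before the call is one reading of "current/old" and is an assumption here.

```perl
package Sketch::Settings;    # hypothetical class, not part of Nextrieve
use strict;
use warnings;

my $default = 'bak cgi cgi-bin include old';

sub new {
    my ($class) = @_;
    return bless { DONTINDEX => $default }, $class;
}

# Class call: read or change the default used for new objects.
# Object call: read or change the value of that object only.
# A new value starting with a space is appended to the previous value.
sub DontIndex {
    my ( $self, $new ) = @_;
    my $slot = ref $self ? \$self->{DONTINDEX} : \$default;
    my $old  = $$slot;
    if ( defined $new ) {
        $$slot = $new =~ /^ / ? $old . $new : $new;
    }
    return $old;    # assumption: the value before any change
}

package main;

Sketch::Settings->DontIndex('bak include');    # change the class default
my $object = Sketch::Settings->new;
$object->DontIndex(' text data');              # append for this object only

my $current = $object->DontIndex;
print "$current\n";                            # prints "bak include text data"
```

Keeping the default in one lexical variable means an object created later picks up the changed default, while existing objects keep their own copy.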


ProgramVersion

Specify or return the version of Nextrieve that will be used.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: '1.1.1'.

Input Parameters

 1 new version of Nextrieve (e.g. '1.2.0')
   (default: no change)
Output Parameters

 1 current/old version of Nextrieve


ProgramDirectory

Specify or return the directory where Nextrieve is installed.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: '/usr/local/nextrieve'.

Input Parameters

 1 new install directory of Nextrieve (e.g. '/opt/nextrieve')
   (default: no change)
Output Parameters

 1 current/old install directory of Nextrieve


IndexProgram

Specify or return the name of the program that Nextrieve uses to index files.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 'ntvindex'.

Input Parameters

 1 new name of Nextrieve's indexing program (e.g. 'newntvindex')
   (default: no change)
Output Parameters

 1 current/old name of Nextrieve's indexing program


Port

Specify the port on which the Nextrieve server is supposed to run. Should be set before calling method Index. The server will be (re-)started automatically after indexing is complete.

Call method Server to specify the IP-number or name of the server on which Nextrieve should run.

Input Parameters

 1 port number to be used
   (default: 2034)
Output Parameters

 1 current/old port on which Nextrieve runs


SearchProgram

Specify or return the name of the Nextrieve program that will be used to perform searches.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 'nextriev'.

Input Parameters

 1 new name of Nextrieve's search program (e.g. 'newnextriev')
   (default: no change)
Output Parameters

 1 current/old name of Nextrieve's search program


Server

Specify the name or IP-number of the machine on which Nextrieve is supposed to run as a server. Should be set before calling method Index. The server will be (re-)started automatically after indexing is complete.

Call method Port to specify the IP-port on which the server should run.

Input Parameters

 1 name or IP-number of server to be used
   (default: localhost)
Output Parameters

 1 current/old server on which Nextrieve runs


TempDirectory

Specify or return the directory in which temporary files are stored.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: '/tmp'.

Input Parameters

 1 new directory to be used to store temporary files
   (default: no change)
Output Parameters

 1 current/old directory setting


FuzzyFactor

Specify or return the fuzziness factor for subsequent searches.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

Currently suggested values are:

   0 exact
   1 minimal
   3 normal (default)
  10 moderate
  50 very
 100 extremely
Input Parameters

 1 new fuzzy factor to be applied to searches
   (default: no change)
Output Parameters

 1 current/old fuzzy factor
Note

Current versions of the Nextrieve engine do not implement exact searching yet. Currently, this module simulates it by doing a search with the lowest possible fuzzy value (1) and then checking the preview of each entry for each separate word in the original query string. An entry is included in the final list only if every word of the original query string is found.
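
The post-processing step described in this note might look roughly like the sketch below. It is not the module's code: the helper name exact_matches is hypothetical, and matching whole words case-insensitively after removing the \b and \r highlight codes is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the previews that contain every word
# of the query, ignoring case and the \b / \r highlight codes.
sub exact_matches {
    my ( $query, @previews ) = @_;
    my @words = grep { length } split /\W+/, $query;
    my @kept;
  PREVIEW: for my $preview (@previews) {
        # Work on a copy with the highlight codes removed.
        ( my $plain = $preview ) =~ s/\\[br]//g;
        for my $word (@words) {
            next PREVIEW unless $plain =~ /\b\Q$word\E\b/i;
        }
        push @kept, $preview;
    }
    return @kept;
}

my @kept = exact_matches(
    'search engine',
    'a fuzzy \bsearch\r \bengine\r for Perl',
    'a \bsearching\r \bengine\r',
);
print scalar(@kept), "\n";    # prints "1"
```

The second preview is dropped because "searching" is only a fuzzy match for "search".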


PreviewHighlight

Specify or return the number of characters that must match in a preview in order for a word to be highlighted.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 0 (adaptive).

Possible values:

   0 adaptive highlight
     (length of shortest word in query minus one, maximum 5)
  >0 highlight to be used
Input Parameters

 1 new number of characters for highlighting words
   (default: no change)
Output Parameters

 1 current/old number of characters
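
The adaptive rule quoted above (length of the shortest word in the query minus one, maximum 5) can be written down as a short sketch. The helper name adaptive_highlight_sketch is hypothetical, and clamping the result to at least 1 is an assumption made so that one-letter words do not yield 0.

```perl
use strict;
use warnings;

# Hypothetical helper: shortest query word minus one, capped at 5.
# Assumption: the result is never allowed to drop below 1.
sub adaptive_highlight_sketch {
    my ($query) = @_;
    my @lengths = sort { $a <=> $b } map { length } grep { length } split /\W+/, $query;
    return 1 unless @lengths;

    my $value = $lengths[0] - 1;
    $value = 5 if $value > 5;
    $value = 1 if $value < 1;
    return $value;
}

print adaptive_highlight_sketch('fuzzy search'), "\n";     # prints "4"
print adaptive_highlight_sketch('international'), "\n";    # prints "5"
```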


DisplayedHits

Specify or return the maximum number of hits that will be returned from a search.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 200.

Input Parameters

 1 new maximum number of hits
   (default: no change)
Output Parameters

 1 current/old maximum number of hits


TotalHits

Specify or return the maximum number of hits that are investigated.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 1000.

Input Parameters

 1 new maximum number of hits to investigate
   (default: no change)
Output Parameters

 1 current/old maximum number of hits to investigate


Distinct

Specify or return whether only distinct filenames should be returned.

The default is: all occurrences of filenames.

Input Parameters

 1 new setting of distinct flag
   (default: no change)
Output Parameters

 1 current/old setting of distinct flag
Example

 $search->Distinct( 1 );
 $distinct = $search->Distinct;
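
What the distinct flag implies for a result list can be pictured with the sketch below. It is only an illustration: the helper name distinct_hits and the hash-based hit records are hypothetical, and keeping the first occurrence of each filename is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the first hit for each filename.
sub distinct_hits {
    my (@hits) = @_;
    my %seen;
    return grep { !$seen{ $_->{filename} }++ } @hits;
}

my @hits = (
    { filename => 'a.html', pagenum => 1 },
    { filename => 'b.html', pagenum => 1 },
    { filename => 'a.html', pagenum => 2 },
);
print join( ' ', map { $_->{filename} . ':' . $_->{pagenum} } distinct_hits(@hits) ), "\n";
# prints "a.html:1 b.html:1"
```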


Attributes

Specify or return the current attributes for a search query with Search.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: '' (do not perform attribute matching).

Input Parameters

 1 new attributes specification
   (default: no change)
Output Parameters

 1 current/old attributes specification


Fields

Specify or return the current setting of the fields returned by FetchRow. Fields must be specified separated by spaces.

If called as a class method, sets or returns the defaults with which new objects will be created. If called as an object method, only changes or returns the value for that object.

The default is: 'title mappedname preview attributes'.

Input Parameters

 1 new fields specification
   (default: no change)
Output Parameters

 1 current/old fields specification
Note

The following fields may be specified:

filename

This is the name of the file relative to the indexdir variable defined in the resource file. This may well be an absolute filename of course.

title

This is the title of the document or the subject in the case of E-mail.

filetype

This is the type of the file and currently can have the values ascii, html or mail.

score

The score of the hit. Arbitrary value that only indicates a relative ordering, not an absolute quality of the hit.

percent

An approximate percentage rating. Similar to score.

pagenum

The page number of this hit. The page number is the basic unit that is indexed.

offset

The absolute offset within the file to this page.

length

The length of this page, counted from the offset.

preview

The preview text. In the style of nroff, it uses \b to start bold text and \r to return to normal text, with \ itself represented as \\ for easy parsing.

mappedname

This is the name of the file after being ``mapped''. This is usually the relative URL of the hit.

attributes

This is a string that contains all (if any) attributes that have been tagged to this document.

document

This is the document number. Usually needed for internal purposes only.
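
The escape convention of the preview field described above can be decoded in a single pass, as in the sketch below. The helper name preview_to_html is hypothetical. Handling all three codes in one substitution is a design choice of this sketch, so that an escaped backslash followed by "b" or "r" is not mistaken for a highlight code.

```perl
use strict;
use warnings;

# Hypothetical helper: convert the \b, \r and \\ codes of a preview
# into HTML bold tags and a literal backslash, in one pass.
sub preview_to_html {
    my ($preview) = @_;
    my %code = ( 'b' => '<B>', 'r' => '</B>', '\\' => '\\' );
    $preview =~ s/\\([br\\])/$code{$1}/g;
    return $preview;
}

# In single quotes, \\\\ is two backslashes: one escaped backslash
# in the preview convention.
print preview_to_html('the \bfuzzy\r engine, path C:\\\\bin'), "\n";
# prints "the <B>fuzzy</B> engine, path C:\bin"
```

The sketch leaves the rest of the text untouched; escaping characters such as "<" and "&" for HTML output would be a separate step.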


SPECIAL METHODS

This section provides the special methods that allow you access to the ``innards'' of the module. It is intended for debugging and for allowing access from subclasses.


ResultList

Return the raw result list of the search query.

Output Parameters

 1 raw result list from Nextrieve


INTERNAL SUBROUTINES

This section provides documentation for internal subroutines (not methods).


AdaptiveHighlight

Return the adaptive highlight value for a specific query. The adaptive highlight value is the length of the shortest word in the query minus one, with a maximum of 5.

Input Parameters

 1 query to obtain adaptive highlight of
Output Parameters

 1 adaptive highlight value


AUTHOR

Elizabeth Mattijsen ( lizperl@INC.nl )


COPYRIGHT

(C) 1997-2000 International Network Consultants


HISTORY

Version 0.69, 15 May 2000

Changed check for valid HTML-file from -T to -s in method Tree, which should be faster and skip empty files.

Version 0.68, 30 September 1999

Method Search now waits for the child process to exit and checks the return code in case of a failure.

Now no longer puts Exporter in ISA: it was not needed.

Version 0.67, 28 January 1999

Added new method Distinct: allows one to specify whether only distinct hits should be returned. Adapted method Search to honor distinct flag setting.

Adapted Tree and Search so that attributes are internally always prefixed with an ``a'', and that no invalid characters can occur in attribute names.

Version 0.66, 26 January 1999

Adapted Tree so that file and directory names with spaces are acceptable to Nextrieve, by replacing spaces in file and directory names with carets. This change is marked with KLUDGE!

Version 0.65, 25 January 1999

Fixed problem in FuzzyFactor which would cause the default fuzzy factor to be returned if it was 0.

Version 0.64, 6 October 1998

Reduced memory footprint by using only fully qualified global variables and external subroutines.

Version 0.63, 27 August 1998

Method Search now sets the field ERROR in the object with any error message returned by Nextrieve.

Version 0.62, 26 August 1998

Fixed problem in Search caused by the fact that a global variable @parameters was being used, which caused problems in a ModPerl environment.

New method ShellCmd: returns the commands to be executed in a shell to reproduce the search.

Version 0.61, 22 June 1998

Methods FuzzyFactor, PreviewHighlight, DisplayedHits, TotalHits and Fields now return the appropriate default value if the associated field is not set yet.

Method FetchRow now allows the fields to be obtained to be overridden.

Method Search now allows the resource information to be specified directly.

Version 0.6, 19 June 1998

New method HighlightText: highlights text according to a query and a preview highlight length specification.

New internal subroutine AdaptiveHighlight: returns the best highlight for a given query.

Version 0.5, 4 June 1998

New method StartServer: allows Nextrieve to run as a server for the resource in the object.

New method StopServer: stops the Nextrieve server that is running for the resource information in the object.

Methods new, Index and Search upgraded to use the Nextrieve server feature.

Upgraded to use new Liz module.

Version 0.4, 29 March 1998

``new'' now allows the filename of a resource-file to be directly specified.

Version 0.3, 9 February 1998

Adapted Search so that a fuzzy factor of 0 is interpreted as 1 with postprocessing to filter out entries in the resultlist that do not have exact matches with the words in the query.

Version 0.2, 7 February 1998

Changed to use a static Nextrieve resource-file in the index directory. This seems to fix the problem with attributes being mixed up.

Version 0.1, 2 February 1998

First version of this true Perl module, based on routines developed for the NextrieveMsql library.