ht://Dig: How it works

 How it works



ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for license information.




The system performs three major tasks, which should be performed in the following order:



1. Digging



Before you can search, a database of all the documents that need to be searched has to be created.



2. Merging



Merging consists of two processes:
  1. Converting the databases of all documents to specialized databases for simple, fast searching.
  2. Merging changed information into previously existing databases.

Even though this task could be performed at the same time as the Digging, it is a separate process for efficiency reasons. This also allows for more control over the processes implemented when merging.



3. Searching



Finally, the databases that were created in the previous steps can be used for actual searches. Normally, searches will be invoked by a CGI (Common Gateway Interface) program running on the web server, which gets input from the user through an HTML form.




Digging



Digging is the first step in creating a search database. This system uses the word digging, while other systems call it harvesting or gathering. In the ht://Dig system, the program htdig performs the information gathering stage. In this process, the program acts as a regular web user, except that it follows the hyperlinks it comes across. (Actually, it does not follow all of them, just those that are within the domain it needs to gather information on.)
Each document it visits is examined, and all the unique words in the document are extracted and stored, except those specified as too short, too long, or to be excluded by the configuration.



The digging process will create at least two files. The first is the list of all the words, and the second is a database of URLs and information about the URLs. Other files may be created for a list of all URLs seen, all images seen (see the create_image_list attribute), ASCII versions of the databases, etc.
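The dig stage described above can be sketched in a few lines of Python. This is only an illustration, not how htdig is implemented: the in-memory PAGES table stands in for real HTTP fetches, and the names dig, extract_words, min_len, max_len and excluded are hypothetical. It shows the two outputs named in the text (a word list and a URL database), the restriction to one domain, and the filtering of words that are too short, too long, or excluded.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/": (
        "Welcome to the example site",
        ["http://example.org/a", "http://other.net/x"],
    ),
    "http://example.org/a": (
        "A page about digging and the merging of databases",
        ["http://example.org/"],
    ),
    "http://other.net/x": ("Off-site content", []),
}


def extract_words(text, min_len=3, max_len=12, excluded=frozenset({"the", "and"})):
    """Return the unique words of a document, minus filtered ones."""
    words = set()
    for raw in text.lower().split():
        word = "".join(ch for ch in raw if ch.isalnum())
        if min_len <= len(word) <= max_len and word not in excluded:
            words.add(word)
    return words


def dig(start_url, pages):
    """Follow links breadth-first, staying on the start URL's host."""
    host = urlparse(start_url).netloc
    word_list = {}  # word -> set of URLs containing it
    url_db = {}     # URL -> information about the URL
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in url_db or url not in pages:
            continue  # already visited, or cannot be fetched
        if urlparse(url).netloc != host:
            continue  # outside the domain being indexed
        text, links = pages[url]
        url_db[url] = {"size": len(text), "links": list(links)}
        for word in extract_words(text):
            word_list.setdefault(word, set()).add(url)
        queue.extend(links)
    return word_list, url_db
```

Calling dig("http://example.org/", PAGES) visits the two example.org pages and skips the link to other.net.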




Merging



Once the digging process is complete, the data must be converted into something the search engine can actually use. The htmerge program does this.

The term "merge" is used because data from several databases is gathered together and merged into several other databases. The source databases include the databases created by the latest "dig" and also any previously merged databases. The latest dig produces a database that provides information on new pages and on changes to previously existing pages; this information is merged with the unchanged information to create up-to-date databases.
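As a rough sketch of the two merge processes, assume each database is a plain Python dictionary mapping a URL to the set of words found on that page. The function names merge_documents and build_word_index are hypothetical, and the real htmerge works on its own database files rather than dictionaries.

```python
def merge_documents(previous, latest_dig):
    """Combine a previously merged database with the latest dig.

    Unchanged pages are carried over, new pages are added, and pages
    that changed are replaced by their latest version.
    """
    merged = dict(previous)
    merged.update(latest_dig)
    return merged


def build_word_index(documents):
    """Convert the document database into a word -> URLs lookup table."""
    index = {}
    for url, words in documents.items():
        for word in words:
            index.setdefault(word, set()).add(url)
    return index
```

Rebuilding the word index from the merged documents, rather than patching it in place, keeps the sketch short; it also ensures that words removed from a changed page disappear from the index.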



There are other, optional tasks that are categorized under the merge phase:



 Expiration notification:

The ht://Dig system includes a handy reminder service, "htnotify." This allows HTML authors to add some ht://Dig-specific meta information to HTML documents. This meta information is used to email authors after a specified date. It is very useful for maintaining lists that contain those annoying 'new' graphics next to new items. (Hint: things really aren't all that 'new' anymore after 6 months!)
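The selection step of such a reminder service might look like the sketch below. The record fields url, author_email and notify_after are hypothetical stand-ins for the meta information; the actual meta tag names and the mail delivery used by htnotify are not covered here.

```python
from datetime import date


def due_notifications(documents, today):
    """Return (email, url) pairs for documents whose date has passed."""
    due = []
    for doc in documents:
        # "After a specified date": notify only once that date is in the past.
        if doc["notify_after"] < today:
            due.append((doc["author_email"], doc["url"]))
    return due
```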


 Fuzzy word index creation:

Allows searches using "fuzzy" algorithms to match words. The htfuzzy program can create indexes for several different algorithms.
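To illustrate what a fuzzy index is, the sketch below groups words by their American Soundex code, so that similar-sounding words share one key. Soundex is chosen here purely as an example of a fuzzy algorithm; this text does not say which algorithms htfuzzy provides, and the function names are hypothetical.

```python
# Letter groups of the standard American Soundex coding.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (result + "000")[:4]


def build_fuzzy_index(words):
    """Map each Soundex code to the words that share it."""
    index = {}
    for word in words:
        index.setdefault(soundex(word), []).append(word)
    return index
```

A search for "Rupert" could then look up soundex("Rupert") in the index and also retrieve "Robert".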




Searching



Searching is where all the information gathered and organized during the dig and merge stages gets put to use. The htsearch program performs the actual searches. This CGI program takes its input from the HTML "search form" on the website, performs the search, and produces the HTML output (or the "failed search" page) that is seen by users.
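The search step can be pictured as a lookup in a word index followed by HTML generation. The sketch below assumes an index shaped as a dictionary from word to a set of URLs, and a match method of either "and" or "or"; the function names and the output markup are hypothetical and much simpler than what htsearch produces.

```python
from html import escape


def search(index, words, method="and"):
    """Return the URLs matching all ("and") or any ("or") of the words."""
    sets = [index.get(word.lower(), set()) for word in words]
    if not sets:
        return set()
    if method == "and":
        return set.intersection(*sets)
    if method == "or":
        return set.union(*sets)
    raise ValueError(f"unknown match method: {method}")


def render(matches):
    """Produce the result list, or a message for a failed search."""
    if not matches:
        return "<p>No matches were found.</p>"
    items = "".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in sorted(matches)
    )
    return f"<ul>{items}</ul>"
```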


Last modified: $Date: 2002/01/28 03:56:10 $
LASTDISPLAYED

The index of the last match on this page.

LOGICAL_WORDS

A string of the search words with either "and" or "or" between the words, depending on the type of search.
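For example, the value could be built as follows. This is a minimal sketch; the function name and the match_all flag are hypothetical.

```python
def logical_words(words, match_all):
    """Join the search words with "and" or "or"."""
    joiner = " and " if match_all else " or "
    return joiner.join(words)
```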

MATCH_MESSAGE

This is either all or some, depending on the match method used.

 MATCHES

The total number of matches that were found.

MATCHES_PER_PAGE

The configured maximum number of matches on this page.
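The paging variables in this list (MATCHES, MATCHES_PER_PAGE, PAGE, PAGES and LASTDISPLAYED) are related by simple arithmetic. The sketch below shows one plausible way to derive them; the exact computation inside htsearch is not given here, so treat the function and its clamping behaviour as assumptions.

```python
import math


def paging(matches, matches_per_page, page):
    """Derive the paging values for one page of results."""
    pages = max(1, math.ceil(matches / matches_per_page))
    page = min(max(page, 1), pages)  # keep the page number in range
    last_displayed = min(page * matches_per_page, matches)
    return {
        "MATCHES": matches,
        "MATCHES_PER_PAGE": matches_per_page,
        "PAGE": page,
        "PAGES": pages,
        "LASTDISPLAYED": last_displayed,
    }
```

With 25 matches and 10 matches per page, page 3 is the last page and shows matches up to number 25.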

 MAX_STARS

The configured maximum number of stars to display in matches.

METADESCRIPTION

The meta description text (if any) for the matched document.

 METHOD

Expands to an HTML menu of all the available matching methods. The current method will be the default one. The menu is composed of choices itemized in the method_names attribute. The expansion of this template variable is described in more detail in the select list documentation.

 MODIFIED

The date and time the document was last modified.

NEXTPAGE

This expands to the value of the next_page_text or no_next_page_text attribute, depending on whether there is a next page or not.

NSTARS

The number of stars calculated for this document as an integer, up to a maximum specified by the max_stars attribute.

 PAGE

 The current page number.

 PAGEHEADER

This expands to either the value of the page_list_header or no_page_list_header attribute, depending on how many pages there are.

 PAGELIST

This expands to a list of hyperlinks using the page_number_text and no_page_number_text attributes.

 PAGES

 The total number of pages.

 PERCENT

The match score as a percentage. Its range is 1 to 100, without a percent sign. The minimum is always 1 so the variable can be used as the value for an HTML WIDTH attribute.
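The 1 to 100 range can be obtained by clamping. The sketch below assumes the percentage is taken relative to the best score in the result set, which is an assumption rather than documented htsearch behaviour; the function name is hypothetical.

```python
def percent(score, max_score):
    """Scale a score to an integer from 1 to 100 (never 0)."""
    if max_score <= 0:
        return 1
    value = int(100 * score / max_score)
    return min(max(value, 1), 100)
```

Because the result is never 0, it is always usable as a width, for instance in f'<img src="bar.gif" width="{percent(3.0, 12.0)}">' (the image name is made up).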

PLURAL_MATCHES

If the MATCHES variable is other than 1, this will be a single 's'.

PREVPAGE

This expands to the value of the prev_page_text or no_prev_page_text attribute, depending on whether there is a previous page or not.

 SCORE

The score of the current match.

SELECTED_FORMAT

The currently selected format.

SELECTED_METHOD

The currently selected matching method.

SELECTED_SORT

The currently selected sorting method.

 SIZE

The size of the document for the current match.

 SIZEK

The size in kilobytes of the document for the current match.

 SORT

Expands to an HTML menu of all the available sorting methods. The current method will be the default one. The menu is composed of choices itemized in the sort_names attribute. The expansion of this template variable is described in more detail in the select list documentation.

 STARSLEFT

A set of HTML <img> tags with the stars aligned on the left.

 STARSRIGHT

A set of HTML <img> tags with the stars aligned on the right.
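One way to picture the two star variables is as the same row of images padded on opposite sides. In the sketch below, the image file names and the use of a blank padding image are assumptions made for illustration; nstars and max_stars correspond to the NSTARS and MAX_STARS values.

```python
def stars_html(nstars, max_stars, align_left=True,
               star="star.gif", blank="blank.gif"):
    """Build a row of <img> tags with the stars on the left or right."""
    nstars = min(max(nstars, 0), max_stars)
    filled = [f'<img src="{star}" alt="*">'] * nstars
    empty = [f'<img src="{blank}" alt=" ">'] * (max_stars - nstars)
    tags = filled + empty if align_left else empty + filled
    return "".join(tags)
```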

STARTYEAR, STARTMONTH, STARTDAY, ENDYEAR, ENDMONTH, ENDDAY

The currently specified date range for restricting search results.

 SYNTAXERROR

The text of the boolean expression syntax error.

 TITLE

The title of the document for the current match.

 URL

The URL to the document for the current match.

 VERSION

 The ht://Dig version number

 WORDS

A string of the search words with spaces in between.
 
Last modified: $Date: 2002/01/27 05:33:20 $