ht://Dig: Configuration file attributes

Configuration file format -- Attributes

ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for license information.

See the sample htdig.conf file for some examples of usage.


Alphabetical list of attributes




accents_db

type:

string

used by:

htfuzzy and htsearch

default:

${database_base}.accents.db

description:

The database file used for the fuzzy "accents" search algorithm. This database is created by htfuzzy and used by htsearch.

example:

accents_db: ${database_base}.uml.db






add_anchors_to_excerpt

type:

boolean

used by:

htsearch

default:

true

description:

If set to true, the first occurrence of each matched word in the excerpt will be linked to the closest anchor in the document. This only has an effect if the EXCERPT variable is used in the output template and the excerpt is actually going to be displayed.

example:

add_anchors_to_excerpt: no



allow_in_form

type:

string list

used by:

htsearch

default:

<empty>

description:

Allows the specified config file attributes to be specified in search forms as separate fields. This could be used to allow form writers to design their own headers and footers and specify them in the search form. Another example would be to offer a menu of search_algorithms in the form.

  <SELECT NAME="search_algorithm">
  <OPTION VALUE="exact:1 prefix:0.6 synonyms:0.5 endings:0.1" SELECTED>fuzzy
  <OPTION VALUE="exact:1">exact
  </SELECT>

The general idea behind this is to make an input parameter out of any configuration attribute that's not already automatically handled by an input parameter. You can even make up your own configuration attribute names, for purposes of passing data from the search form to the results output. You're not restricted to the existing attribute names. The attributes listed in the allow_in_form list will be settable in the search form using input parameters of the same name, and will be propagated to the follow-up search form in the results template using template variables of the same name in upper-case. You can also make select lists out of any of these input parameters, in the follow-up search form, using the build_select_lists configuration attribute.

example:

allow_in_form: search_algorithm search_results_header



allow_numbers

type:

boolean

used by:

htdig

default:

false

description:

If set to true, numbers are considered words. This means that searches can be done on numbers as well as regular words. All the same rules apply to numbers as to words. See the description of valid_punctuation for the rules used to determine what a word is.

example:

allow_numbers: true






allow_virtual_hosts

type:

boolean

used by:

htdig

default:

true

description:

If set to true, htdig will index virtual web sites as expected. If false, all URL host names will be normalized into whatever the DNS server claims the IP address maps to. If this option is set to false, there is no way to index either "soft" or "hard" virtual web sites.

example:

allow_virtual_hosts: false



anchor_target

type:

string

used by:

htsearch

default:

<empty>

description:

When the first matched word in the excerpt is linked to the closest anchor in the document, this string can be set to specify a target in the link so the resulting page is displayed in the desired frame. This value will only be used if the add_anchors_to_excerpt attribute is set to true, the EXCERPT variable is used in the output template and the excerpt is actually displayed with a link.

example:

anchor_target: body






any_keywords

type:

boolean

used by:

htsearch

default:

false

description:

If set to true, the words in the keywords input parameter in the search form will be joined with logical ORs rather than ANDs, so that any of the words provided will do. Note that this has nothing to do with limiting the search to words in META keywords tags. See the search form documentation for details on this.

example:

any_keywords: yes
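
The joining behaviour described for any_keywords can be illustrated with a minimal Python sketch. The function name join_keywords is hypothetical and not part of htsearch, and the literal operators "and" and "or" are assumed from the default boolean_keywords value.

```python
def join_keywords(keywords: str, any_keywords: bool) -> str:
    """Join whitespace-separated keywords with OR (any) or AND (all)."""
    operator = " or " if any_keywords else " and "
    return operator.join(keywords.split())


print(join_keywords("apple pear", any_keywords=True))
print(join_keywords("apple pear", any_keywords=False))
```

With any_keywords enabled the two words are alternatives; with it disabled both must match.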


authorization

type:

string

used by:

htdig

default:

<empty>

description:

This tells htdig to send the supplied username:password with each HTTP request. The credentials will be encoded using the "Basic" authentication scheme. There must be a colon (:) between the username and password.
This attribute can also be specified on htdig's command line using the -u option, and will be blotted out so it won't show up in a process listing. If you use it directly in a configuration file, be sure to protect it so it is readable only by you, and do not use that same configuration file for htsearch.

example:

authorization: myusername:mypassword
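
As a rough illustration of what "Basic" encoding means, the following Python sketch builds the header value from a username:password string. It is not htdig's code; the function name basic_auth_header is an assumption made for this example.

```python
import base64


def basic_auth_header(credentials: str) -> str:
    """Build an HTTP Basic Authorization header value from 'user:password'."""
    if ":" not in credentials:
        raise ValueError("expected username:password separated by a colon")
    # Basic authentication is only base64 encoding, not encryption.
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return "Basic " + token


print("Authorization: " + basic_auth_header("myusername:mypassword"))
```

Because the encoding is trivially reversible, the advice above about protecting the configuration file matters.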


backlink_factor

type:

number

used by:

htsearch

default:

1000

description:

This is a weight of "how important" a page is, based on the number of URLs pointing to it. It's actually multiplied by the ratio of the incoming URLs (backlinks) to outgoing URLs (links on the page), to balance out pages with lots of links to pages that link back to them. The ratio gives a lower weight to "link farms", which often have many links to them. This factor can be changed without changing the database in any way. However, setting this value to something other than 0 incurs a slowdown on search results.

example:

backlink_factor: 501.1
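
The ratio described above can be sketched as follows. This is only an illustration of the arithmetic, not htsearch's actual scoring code; the function name and the handling of pages without outgoing links are assumptions.

```python
def backlink_weight(backlink_factor: float, backlinks: int, outgoing: int) -> float:
    """Scale backlink_factor by the ratio of incoming to outgoing links."""
    if outgoing == 0:
        # Assumption: treat a page with no outgoing links as having one,
        # purely to avoid dividing by zero in this sketch.
        outgoing = 1
    return backlink_factor * backlinks / outgoing


# A page with 3 backlinks and 6 outgoing links gets half the factor.
print(backlink_weight(1000, 3, 6))
```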






bad_extensions

type:

string list

used by:

htdig

default:

.wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

description:

This is a list of extensions on URLs which are considered non-parsable. This list is used mainly to supplement the MIME-types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME-types and so can advertise certain documents as text while they are some binary format. See also valid_extensions.

example:

bad_extensions: .foo .bar .bad
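
A minimal Python sketch of an extension check of this kind is shown below. The function name is hypothetical, and two details are assumptions rather than documented htdig behaviour: the comparison is case-insensitive, and only the path part of the URL is examined.

```python
from urllib.parse import urlsplit


def has_bad_extension(url: str, bad_extensions: str) -> bool:
    """Return True if the URL path ends with one of the listed extensions."""
    path = urlsplit(url).path.lower()
    return any(path.endswith(ext.lower()) for ext in bad_extensions.split())


print(has_bad_extension("http://www.example.com/files/archive.ZIP", ".zip .tar"))
print(has_bad_extension("http://www.example.com/index.html", ".zip .tar"))
```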


bad_querystr

type:

string list

used by:

htdig

default:

<empty>

description:

This is a list of CGI query strings to be excluded from indexing. This can be used in conjunction with CGI-generated portions of a website to control which pages are indexed.

example:

bad_querystr: forum=private section=topsecret&passwd=required



bad_word_list

type:

string

used by:

htdig and htsearch

default:

${common_dir}/bad_words

description:

This specifies a file which contains words which should be excluded when digging or searching. This list should include the most common words or other words that you don't want to be able to search on (things like sex or smut are examples of these).
The file should contain one word per line. A sample bad words file is located in the contrib/examples directory.

example:

bad_word_list: ${common_dir}/badwords.txt


bin_dir

type:

string

used by:

htdig, htnotify, htfuzzy, htmerge and htsearch

default:

BIN_DIR

description:

This is the directory in which the executables related to ht://Dig are installed. It is never used directly by any of the programs, but other attributes can be defined in terms of this one.

The default value of this attribute is determined at compile time.

example:

bin_dir: /usr/local/bin


boolean_keywords

type:

string list

used by:

htsearch

default:

and or not

description:

These 3 strings are used as the keywords used in constructing the LOGICAL_WORDS template variable, and in parsing the words input parameter when the method parameter or match_method attribute is set to boolean.

example:

boolean_keywords: et ou non



boolean_syntax_errors

type:

quoted string list

used by:

htsearch

default:

Expected 'a search word' 'at the end' 'instead of' 'end of expression'

description:

These 5 strings are used to construct various syntax error messages for errors encountered in parsing the words input parameter, when the method parameter or match_method attribute is set to boolean. They are used in conjunction with the words in the boolean_keywords attribute, and comprise all the English-specific parts of these error messages. The order in which the strings are put together may not be ideal, or even grammatically correct, for all languages, but they can be used to make fairly intelligible messages in many languages.

example:

boolean_syntax_errors: Attendait "un mot" "à la fin" "au lieu de" "fin d'expression"


build_select_lists

type:

quoted string list

used by:

htsearch

default:

<empty>

description:

This list allows you to define any htsearch input parameter as a select list for use in templates, provided you also define the corresponding name list attribute which enumerates all the choices to put in the list. It can be used for existing input parameters, as well as any you define using the allow_in_form attribute. The entries in this list each consist of an octuple, a set of eight strings defining the variables and how they are to be used to build a select list. The attribute can contain many of these octuples. The strings in the string list are merely taken eight at a time. For each octuple of strings specified in build_select_lists, the elements have the following meaning:

  1. the name of the template variable to be defined as a list, optionally followed by a comma and the type of list
  2. the input parameter name that the select list will set
  3. the name of the user-defined attribute containing the name list
  4. the tuple size used in the name list above
  5. the index into a name list tuple for the value
  6. the index for the corresponding label on the selector
  7. the configuration attribute where the default value for this input parameter is defined
  8. the default label, if not an empty string, which will be used as the label for an additional list item for the current input parameter value if it doesn't match any value in the given list

See the select list documentation for more information on this attribute.

example:

build_select_lists: MATCH_LIST matchesperpage matches_per_page_list \
    1 1 1 matches_per_page "Previous Amount" \
    RESTRICT_LIST,multiple restrict restrict_names 2 1 2 restrict "" \
    FORMAT_LIST,radio format template_map 3 2 1 template_name ""

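
The "taken eight at a time" rule for build_select_lists can be illustrated with a short Python sketch. The field names in FIELDS are labels invented for this example, and Python's shlex is assumed to be a close enough stand-in for how htsearch splits a quoted string list; continuation lines are assumed to have been joined already.

```python
import shlex

# Hypothetical labels for the eight positions described in the list above.
FIELDS = ("variable", "parameter", "name_list", "tuple_size",
          "value_index", "label_index", "default_attribute", "default_label")


def parse_octuples(value: str) -> list:
    """Split a quoted string list and group the strings eight at a time."""
    tokens = shlex.split(value)
    if len(tokens) % 8 != 0:
        raise ValueError("expected a multiple of eight strings")
    return [dict(zip(FIELDS, tokens[i:i + 8])) for i in range(0, len(tokens), 8)]


value = ('MATCH_LIST matchesperpage matches_per_page_list 1 1 1 '
         'matches_per_page "Previous Amount" '
         'RESTRICT_LIST,multiple restrict restrict_names 2 1 2 restrict ""')
for octuple in parse_octuples(value):
    print(octuple["variable"], "->", octuple["parameter"])
```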


case_sensitive

type:

boolean

used by:

htdig

default:

true

description:

This specifies whether ht://Dig should consider URLs case-sensitive or not. If your server is case-insensitive, you should probably set this to false.

example:

case_sensitive: false



common_dir

type:

string

used by:

htdig, htnotify, htfuzzy, htmerge and htsearch

default:

COMMON_DIR

description:

Specifies the directory for files that will or can be shared among different search databases. The default value for this attribute is defined at compile time.

example:

common_dir: /tmp



common_url_parts

type:

string list

used by:

htdig, htdump, htload, htnotify, htmerge and htsearch

default:

http:// http://www. ftp:// ftp://ftp. /pub/ .html .gif .jpg .jpeg /index.html /index.htm .com/ .com mailto:

description:

Sub-strings often found in URLs stored in the database. These are replaced in the database by an internal space-saving encoding. If a string specified in url_part_aliases overlaps any string in common_url_parts, the common_url_parts string is ignored.
Note that when this attribute is changed, the database should be rebuilt, unless the effect of "changing" the affected URLs in the database is wanted.

example:

common_url_parts: http://www.htdig.org/ml/ \
    .html \
    http://www.htdig.org/


compression_level

type:

number

used by:

htdig

default:

0

description:

If specified and the zlib compression library was available when compiled, this attribute controls the amount of compression used in the doc_db file. Defaults to zero to provide backward compatibility with old databases.

example:

compression_level: 6


config_dir

type:

string

used by:

htdig, htnotify, htfuzzy, htmerge and htsearch

default:

CONFIG_DIR

description:

This is the directory which contains all configuration files related to ht://Dig. It is never used directly by any of the programs, but other attributes or the include directive can be defined in terms of this one.

The default value of this attribute is determined at compile time.

example:

config_dir: /var/htdig/conf


create_image_list

type:

boolean

used by:

htdig

default:

false

description:

If set to true, a file with all the image URLs that were seen will be created, one URL per line. This list will not be in any order and there will be lots of duplicates, so after htdig has completed, it should be piped through sort -u to get a unique list.

example:

create_image_list: yes


create_url_list

type:

boolean

used by:

htdig

default:

false

description:

If set to true, a file with all the URLs that were seen will be created, one URL per line. This list will not be in any order and there will be lots of duplicates, so after htdig has completed, it should be piped through sort -u to get a unique list.

example:

create_url_list: yes
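
Where sort -u is not available, the same clean-up of an unordered list with duplicates can be approximated in Python. This sketch is only roughly equivalent: it ignores locale-specific collation and drops empty lines, and the function name is hypothetical.

```python
def unique_sorted(lines):
    """Return the distinct non-empty lines in sorted order."""
    stripped = (line.strip() for line in lines)
    return sorted({line for line in stripped if line})


urls = ["http://b.example.com/\n", "http://a.example.com/\n", "http://b.example.com/\n"]
for url in unique_sorted(urls):
    print(url)
```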






database_base

type:

string

used by:

htdig, htdump, htload, htnotify, htfuzzy, htmerge and htsearch

default:

${database_dir}/db

description:

This is the common prefix for files that are specific to a search database. Many different attributes use this prefix to specify filenames. Several search databases can share the same directory by just changing this value for each of the databases.

example:

database_base: ${database_dir}/sales
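
Attributes such as database_base are defined in terms of other attributes through ${name} references. The Python sketch below shows one way such references could be resolved; it is not ht://Dig's parser, and the function name, the depth limit, and the sample attribute values are assumptions for illustration.

```python
import re

_REFERENCE = re.compile(r"\$\{(\w+)\}")


def expand(value: str, attributes: dict, max_depth: int = 10) -> str:
    """Repeatedly substitute ${name} references until none remain."""
    for _ in range(max_depth):
        expanded = _REFERENCE.sub(lambda m: attributes[m.group(1)], value)
        if expanded == value:
            return expanded
        value = expanded
    raise ValueError("attribute references nested too deeply")


attributes = {
    "database_dir": "/var/htdig",
    "database_base": "${database_dir}/sales",
}
print(expand("${database_base}.docdb", attributes))
```

Changing only database_base therefore moves every file name that is derived from it.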






database_dir

type:

string

used by:

htdig, htdump, htload, htnotify, htfuzzy, htmerge and htsearch

default:

DATABASE_DIR

description:

This is the directory which contains all database and other files related to ht://Dig. It is never used directly by any of the programs, but other attributes are defined in terms of this one.

The default value of this attribute is determined at compile time.

example:

database_dir: /var/htdig


date_factor

type:

number

used by:

htsearch

default:

0

description:

This factor, like backlink_factor, can be changed without modifying the database. It gives higher rankings to newer documents and lower rankings to older documents. Before setting this factor, it's advised to make sure your servers are returning accurate dates (check the dates returned in the long format). Additionally, setting this to a nonzero value incurs a performance hit on searching.

example:

date_factor: 0.35



date_format

type:

string

used by:

htsearch

default:

<empty>

description:

This format string determines the output format for modification dates of documents in the search results. It is interpreted by your system's strftime function. Please refer to your system's manual page for this function, for a description of available format codes. If this format string is empty, as it is by default, htsearch will pick a format itself. In this case, the iso_8601 attribute can be used to modify the appearance of the date.

example:

date_format: %Y-%m-%d
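
Python's time.strftime accepts the same family of format codes as the C library function, so it can be used to preview a format string before putting it in the configuration. The helper name is hypothetical, and UTC is used here only to make the output reproducible; htsearch may well use local time.

```python
import time


def format_date(timestamp: float, date_format: str) -> str:
    """Format a Unix timestamp (interpreted as UTC) with strftime codes."""
    return time.strftime(date_format, time.gmtime(timestamp))


print(format_date(0, "%Y-%m-%d"))
```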


description_factor

type:

number

used by:

htdig

default:

150

description:

Plain old "descriptions" are the text of a link pointing to a document. This factor gives weight to the words of these descriptions of the document. Not surprisingly, these can be pretty accurate summaries of a document's content. See also title_factor or text_factor. Changing this factor will require updating your database.

example:

description_factor: 350


description_meta_tag_names

type:

string list

used by:

htdig

default:

description

description:

The words in this list are used to search for descriptions in HTML META tags. This list can contain any number of strings that each will be seen as the name for whatever description convention is used. While words in any of the specified description contents will be indexed, only the last meta tag containing a description will be kept as the meta description field for the document, for use in search results. The order in which the names are specified in this configuration attribute is irrelevant, as it is the order in which the tags appear in the documents that matters.
The META tags have the following format:

  <META name="somename" content="somevalue">

example:

description_meta_tag_names: htdig-description description


doc_db

type:

string

used by:

htdig, htdump, htload, htmerge and htsearch

default:

${database_base}.docdb

description:

This file will contain a Berkeley database of documents indexed by URL. It contains all the information gathered for each document, so this file can become rather large if max_head_length is set to a large value.

example:

doc_db: ${database_base}documents.db


doc_index

type:

string

used by:

htmerge and htsearch

default:

${database_base}.docs.index

description:

This file will contain a Berkeley database which maps document numbers to document URLs. It is basically an intermediate database from the word database to the document database.

example:

doc_index: documents.index.db


doc_list

type:

string

used by:

htdig

default:

${database_base}.docs

description:

This file is basically a text version of the file specified in doc_db. Its only use is to have a human readable database of all documents. The file is easy to parse with tools like perl or tcl.

example:

doc_list: /tmp/documents.text


endday

type:

integer

used by:

htsearch

default:

<empty>

description:

This specifies the day of the cutoff end date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The endday can be specified from within the configuration file, and can be overridden with the "endday" input parameter in the search form. If a negative number is given, it is taken as relative to the current date. Relative days can span several months or even years if desired (e.g. -90 to specify 90 days from today).

example:

endday: 31
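
A relative endday can be pictured with a small date calculation. This Python sketch assumes that a negative value counts days back from the current date; it does not model how htsearch combines endday with endmonth and endyear, and the function name is hypothetical.

```python
from datetime import date, timedelta


def relative_cutoff(days: int, today: date) -> date:
    """Resolve a negative, relative day count against a reference date."""
    if days >= 0:
        raise ValueError("only negative (relative) values are handled here")
    return today + timedelta(days=days)


# 90 days before 30 June 2002, crossing several month boundaries.
print(relative_cutoff(-90, date(2002, 6, 30)))
```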


end_ellipses

type:

string

used by:

htsearch

default:

<b><tt> ...</tt></b>

description:

When excerpts are displayed in the search output, this string will be appended to the excerpt if there is text following the text displayed. This is just a visual reminder to the user that the excerpt is only part of the complete document.

example:

end_ellipses: ...


end_highlight

type:

string

used by:

htsearch

default:

</strong>

description:

When excerpts are displayed in the search output, matched words will be highlighted using start_highlight and this string. You should ensure that highlighting tags are balanced, that is, this string should close any formatting tag opened by start_highlight.

example:

end_highlight: </font>






endings_affix_file

type:

string

used by:

htfuzzy

default:

${common_dir}/english.aff

description:

Specifies the location of the file which contains the affix rules used to create the endings search algorithm databases. Consult the documentation on htfuzzy for more information on the format of this file.

example:

endings_affix_file: /var/htdig/affix_rules


endings_dictionary

type:

string

used by:

htfuzzy

default:

${common_dir}/english.0

description:

Specifies the location of the file which contains the dictionary used to create the endings search algorithm databases. Consult the documentation on htfuzzy for more information on the format of this file.

example:

endings_dictionary: /var/htdig/dictionary


endings_root2word_db

type:

string

used by:

htfuzzy and htsearch

default:

${common_dir}/root2word.db

description:

This attribute specifies the database filename to be used in the 'endings' fuzzy search algorithm. The database maps word roots to all legal words with that root. For more information about this and other fuzzy search algorithms, consult the htfuzzy documentation.
Note that the default value uses the common_dir attribute instead of the database_dir attribute. This is because this database can be shared with different search databases.

example:

endings_root2word_db: /var/htdig/r2w.db


endings_word2root_db

type:

string

used by:

htfuzzy and htsearch

default:

${common_dir}/word2root.db

description:

This attribute specifies the database filename to be used in the 'endings' fuzzy search algorithm. The database maps words to their root. For more information about this and other fuzzy search algorithms, consult the htfuzzy documentation.
Note that the default value uses the common_dir attribute instead of the database_dir attribute. This is because this database can be shared with different search databases.

example:

endings_word2root_db: /var/htdig/w2r.bm


endmonth

type:

integer

used by:

htsearch

default:

<empty>

description:

This specifies the month of the cutoff end date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The endmonth can be specified from within the configuration file, and can be overridden with the "endmonth" input parameter in the search form. If a negative number is given, it is taken as relative to the current month. Relative months can span several years if desired.

example:

endmonth: 11



endyear

type:

integer

used by:

htsearch

default:

<empty>

description:

This specifies the year of the cutoff end date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The endyear can be specified from within the configuration file, and can be overridden with the "endyear" input parameter in the search form. If a negative number is given, it is taken as relative to the current year.

example:

endyear: 1999






excerpt_length

type:

number

used by:

htsearch

default:

300

description:

This is the maximum number of characters the displayed excerpt will be limited to. The first matched word will be highlighted in the middle of the excerpt so that there is some surrounding context.
The start_ellipses and end_ellipses are used to indicate that the document contains text before and after the displayed excerpt respectively. The start_highlight and end_highlight are used to specify what formatting tags are used to highlight matched words.

example:

excerpt_length: 500



excerpt_show_top

type:

boolean

used by:

htsearch

default:

false

description:

If set to true, the excerpt of a match will always show the top of the matching document. If it is false (the default), the excerpt will attempt to show the part of the document that actually contains one of the words.

example:

excerpt_show_top: yes


exclude

type:

string list

used by:

htsearch

default:

<empty>

description:

If a URL contains any of the space separated patterns, it will be discarded in the searching phase. This is used to exclude certain URLs from search results. The list can be specified from within the configuration file, and can be overridden with the "exclude" input parameter in the search form.

example:

exclude: cgi-bin


exclude_urls

type:

string list

used by:

htdig

default:

/cgi-bin/ .cgi

description:

If a URL contains any of the space separated patterns, it will be rejected. This is used to exclude such common things as an infinite virtual web-tree which starts with cgi-bin.

example:

exclude_urls: students.html cgi-bin
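
The "contains any of the patterns" rule is a plain substring test, which the following Python sketch illustrates. The function name is hypothetical and the example URLs are made up; the sketch makes no claim about case handling in htdig itself.

```python
def is_excluded(url: str, exclude_urls: str) -> bool:
    """Return True if the URL contains any space-separated pattern."""
    return any(pattern in url for pattern in exclude_urls.split())


patterns = "/cgi-bin/ .cgi"
print(is_excluded("http://www.example.com/cgi-bin/search", patterns))
print(is_excluded("http://www.example.com/index.html", patterns))
```

Because the test is a substring match, a short pattern such as cgi-bin also rejects any URL that merely mentions it anywhere in its path.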


external_parsers

type:

quoted string list

used by:

htdig

default:

<empty>

description:

This attribute is used to specify a list of content-type/parsers that are to be used to parse documents that cannot be parsed by any of the internal parsers. The list of external parsers is examined before the builtin parsers are checked, so this can be used to override the internal behavior without recompiling htdig.

The external parsers are specified as pairs of strings. The first string of each pair is the content-type that the parser can handle while the second string of each pair is the path to the external parsing program. If quoted, it may contain parameters, separated by spaces.

External parsing can also be done with external converters, which convert one content-type to another. To do this, instead of just specifying a single content-type as the first string of a pair, you specify two types, in the form type1->type2, as a single string with no spaces. The second string will define an external converter rather than an external parser, to convert the first type to the second. If the second type is user-defined, then it's up to the converter script to put out a "Content-Type: type" header followed by a blank line, to indicate to htdig what type it should expect for the output, much like what a CGI script would do. The resulting content-type must be one that htdig can parse, either internally, or with another external parser or converter.

Only one external parser or converter can be specified for any given content-type. However, an external converter for one content-type can be chained to the internal parser for the same type, by appending -internal to the second type string (e.g. text/html->text/html-internal) to perform external preprocessing on documents of this type before internal parsing.

The two main internal parsers are for text/html and text/plain. There is also a simple parser for application/pdf, described under pdf_parser, which is quite limited and is typically overridden with an external one.

The parser program takes four command-line parameters, not counting any parameters already given in the command string:

  infile content-type URL configuration-file

  Parameter            Description                                        Example
  infile               A temporary file with the contents to be parsed.   /var/tmp/htdext.14242
  content-type         The MIME-type of the contents.                     text/html
  URL                  The URL of the contents.                           http://www.htdig.org/attrs.html
  configuration-file   The configuration-file in effect.                  /etc/htdig/htdig.conf

The external parser is to write information for htdig on its standard output. Unless it is an external converter, which will output a document of a different content-type, its output must follow the format described here.
The output consists of records, each record terminated with a newline. Each record is a series of (unless expressly allowed to be empty) non-empty tab-separated fields. The first field is a single character that specifies the record type. The rest of the fields are determined by the record type.

  Record type   Fields and descriptions

  w   word: A word that was found in the document.
      location: A number indicating the normalized location of the word within the document. The number has to fall in the range 0-1000 where 0 means the top of the document.
      heading level: A heading level that is used to compute the weight of the word depending on its context in the document itself. The level is in the range of 0-10 and is defined as follows:
          0    Normal text
          1    Title text
          2    Heading 1 text
          3    Heading 2 text
          4    Heading 3 text
          5    Heading 4 text
          6    Heading 5 text
          7    Heading 6 text
          8    unused
          9    unused
          10   Keywords

  u   document URL: A hyperlink to another document that is referenced by the current document. It must be complete and non-relative, using the URL parameter to resolve any relative references found in the document.
      hyperlink description: For HTML documents, this would be the text between the <a href...> and </a> tags.

  t   title: The title of the document.

  h   head: The top of the document itself. This is used to build the excerpt. This should only contain normal ASCII text.

  a   anchor: The label that identifies an anchor that can be used as a target in an URL. This really only makes sense for HTML documents.

  i   image URL: An URL that points at an image that is part of the document.

  m   http-equiv: The HTTP-EQUIV attribute of a META tag. May be empty.
      name: The NAME attribute of this META tag. May be empty.
      contents: The CONTENTS attribute of this META tag. May be empty.

See also FAQ questions 4.8 and 4.9 for more examples.


example:

external_parsers: text/html /usr/local/bin/htmlparser \
    application/pdf /usr/local/bin/parse_doc.pl \
    application/msword->text/plain "/usr/local/bin/mswordtotxt -w" \
    application/x-gunzip->user-defined /usr/local/bin/ungzipper
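
To make the record format more concrete, here is a minimal Python sketch of an external parser for plain text. It is an illustration under stated assumptions, not a parser shipped with ht://Dig: the function name build_records is hypothetical, the URL is reused as the document title, every word gets heading level 0, and the order in which records are written is a guess.

```python
import sys


def build_records(title: str, text: str) -> list:
    """Build tab-separated t, w and h records for a plain-text document."""
    words = text.split()
    records = ["t\t" + title]
    for index, word in enumerate(words):
        # Normalized location in the range 0-1000, where 0 is the top.
        location = 1000 * index // len(words)
        # Heading level 0 means normal text.
        records.append("w\t%s\t%d\t0" % (word, location))
    if words:
        # The head record is used by htsearch to build the excerpt.
        records.append("h\t" + " ".join(words))
    return records


def main(argv):
    # htdig passes: infile content-type URL configuration-file
    infile, content_type, url, config_file = argv[1:5]
    with open(infile, encoding="latin-1") as handle:
        text = handle.read()
    for record in build_records(url, text):
        print(record)


if __name__ == "__main__" and len(sys.argv) >= 5:
    main(sys.argv)
```

A script like this would be listed as the second string of a pair in external_parsers, next to the content-type it handles.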


extra_word_characters

type:

string

used by:

htdig and htsearch

default:

<empty>

description:

These characters are considered part of a word. In contrast to the characters in the valid_punctuation attribute, they are treated just like letter characters. Note that the locale attribute is normally used to configure which characters constitute letter characters.

example:

extra_word_characters: _


heading_factor_1 - heading_factor_6

type:

number

used by:

htdig

default:

heading_factor_1: 5
heading_factor_2: 4
heading_factor_3: 3
heading_factor_4: 1
heading_factor_5: 1
heading_factor_6: 0

description:

This is a factor which will be used to multiply the weight of words between <h1> and </h1> tags (and likewise <h2> through <h6> for heading_factor_2 through heading_factor_6). It is used to assign the level of importance to certain headers. Setting a factor to 0 will cause words in this heading to be ignored. The number may be a floating point number. See also the title_factor and text_factor attributes.

example:

heading_factor_1: 7.75
heading_factor_2: 5.3
heading_factor_3: 2
heading_factor_4: 0
heading_factor_5: 0
heading_factor_6: 0


htnotify_prefix_file

type:

string

used by:

htnotify

default:

<empty>

description:

Specifies the file containing text to be inserted in each mail message sent by htnotify before the list of expired webpages. If omitted, nothing is inserted.

example:

htnotify_prefix_file: ${common_dir}/notify_prefix.txt


htnotify_replyto

type:

string

used by:

htnotify

default:

<empty>

description:

This specifies the email address that htnotify email messages include in the Reply-to: field.

example:

htnotify_replyto: design-group@foo.com
n
h


r

$  htnotify_sender

e

 type:

string

 used by:

 htnotify

 default:

 webmaster@www

 description:

 This specifies the email address that htnotify email messages get sent out from. The address is forged using /usr/lib/sendmail. Check htnotify/htnotify.cc for details on how this is done.

 example:

 htnotify_sender: bigboss@yourcompany.com






htnotify_suffix_file



 type:

string

 used by:

 htnotify

 default:

 <empty>

 description:

 Specifies the file containing text to be inserted in each mail message sent by htnotify after the list of expired webpages. If omitted, htnotify will insert a standard message.

 example:

 htnotify_suffix_file: ${common_dir}/notify_suffix.txt






htnotify_webmaster



 type:

string

 used by:

 htnotify

 default:

ht://Dig Notification Service

 description:

 This provides a name for the From field, in addition to the email address, for the email messages sent out by htnotify.

 example:

 htnotify_webmaster: Notification Service






http_proxy



 type:

string

 used by:

 htdig

 default:

 <empty>

 description:

 When this attribute is set, all HTTP document retrievals will be done using the HTTP-PROXY protocol. The URL specified in this attribute points to the host and port where the proxy server resides.
 The use of a proxy server can greatly improve the performance of the indexing process.

 example:

 http_proxy: http://proxy.bigbucks.com:3128






http_proxy_exclude



 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

 When this is set, URLs matching one of these patterns will not use the proxy. This is useful when you have a mixture of sites, some near to the digging server and some far away.

 example:

 http_proxy_exclude: http://intranet.foo.com/






ignore_alt_text



 type:

boolean

 used by:

 htdig

 default:

 false

 description:

 If set to true, htdig will not index text in the ALT attribute of IMG tags, nor include this text in excerpts.

 example:

 ignore_alt_text: true






ignore_dead_servers



 type:

boolean

 used by:

 htdig

 default:

 true

 description:

 Determines whether htdig will continue to index URLs from a server after an attempted connection to the server fails as "no host found" or "no server running." If set to true, htdig will skip the remaining URLs from that server. If set to false, htdig will try every URL from that server.

 example:

 ignore_dead_servers: false






image_list



 type:

string

 used by:

 htdig

 default:

 ${database_base}.images

 description:

 This is the file that a list of image URLs gets written to by htdig when the create_image_list attribute is set to true. As image URLs are seen, they are just appended to this file, so after htdig finishes it is probably a good idea to run sort -u on the file to eliminate duplicates.

 example:

 image_list: allimages






image_url_prefix



 type:

string

 used by:

 htsearch

 default:

 IMAGE_URL_PREFIX

 description:

 This specifies the directory portion of the URL used to display star images. This attribute isn't directly used by htsearch, but is used in the default URL for the star_image and star_blank attributes, and other attributes may be defined in terms of this one.

 The default value of this attribute is determined at compile time.

 example:

 image_url_prefix: /images/htdig






include



 type:

string

 used by:

 htdig, htdump, htload, htnotify, htfuzzy, htmerge and htsearch

 description:

 This is not quite a configuration attribute, but rather a directive. It can be used within one configuration file to include the definitions of another file. The last definition of an attribute is the one that applies, so after including a file, any of its definitions can be overridden with subsequent definitions. This can be useful when setting up many configurations that are mostly the same, so all the common attributes can be maintained in a single configuration file. The include directives can be nested, but watch out for nesting loops.

 example:

 include: ${config_dir}/htdig.conf
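
The "last definition wins" rule and the danger of nesting loops can be illustrated with a small reader. This is a simplified sketch, not ht://Dig's parser: load_config is a hypothetical name, '#' comment lines are assumed, continuation lines are not handled, and ${name} is expanded only from attributes already defined in the same file.

```python
import os
import re


def load_config(path, _active=None):
    """Read 'name: value' lines; later definitions override earlier ones."""
    _active = _active or []
    real = os.path.realpath(path)
    if real in _active:
        # A file that (directly or indirectly) includes itself.
        raise ValueError("include loop detected at " + path)
    attrs = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            name, _, value = line.partition(":")
            name, value = name.strip(), value.strip()
            # Expand ${name} from attributes defined so far (simplification).
            value = re.sub(r"\$\{(\w+)\}",
                           lambda m: attrs.get(m.group(1), ""), value)
            if name == "include":
                # Definitions from the included file land here, so anything
                # defined after this line overrides them.
                attrs.update(load_config(value, _active + [real]))
            else:
                attrs[name] = value
    return attrs
```

A site-specific file can then start with an include of the shared file and follow it with only the attributes that differ.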






iso_8601



 type:

boolean

 used by:

 htsearch and htnotify

 default:

 false

 description:

 This sets whether dates should be output in ISO 8601 format. For example, this was written on: 1998-10-31 11:28:13 EST. See also the date_format attribute, which can override any date format that htsearch picks by default.
 This attribute also affects the format of the date htnotify expects to find in a htdig-notification-date field.

 example:

 iso_8601: true






keywords_factor



 type:

number

 used by:

 htdig

 default:

 100

 description:

 This is a factor which will be used to multiply the weight of words in the list of keywords of a document. The number may be a floating point number. See also the title_factor and text_factor attributes.

 example:

 keywords_factor: 12






keywords_meta_tag_names



 type:

 string list

 used by:

 htdig

 default:

 keywords htdig-keywords

 description:

 The words in this list are used to search for keywords in HTML META tags. This list can contain any number of strings, each of which will be treated as a META tag name that carries keywords, whatever keyword convention is used.
 The META tags have the following format:
  <META name="somename" content="somevalue">

 example:

 keywords_meta_tag_names: keywords description






limit_normalized



 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

 This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Unlike the limit_urls_to attribute, this is done after the URL is normalized and the server_aliases attribute is applied. This allows filtering after any hostnames and DNS aliases are resolved. Otherwise, this attribute is the same as the limit_urls_to attribute.

 example:

 limit_normalized: http://www.mydomain.com






limit_urls_to



 type:

 string list

 used by:

 htdig

 default:

 ${start_url}

 description:

 This specifies a set of patterns that all URLs have to match against in order for them to be included in the search. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of the patterns has to match the URL.
 Matching is a case-insensitive string match on the URL to be used. The match will be performed after the relative references have been converted to a valid URL. This means that the URL will always start with http://.
 Granted, this is not the perfect way of doing this, but it is simple enough and it covers most cases.

 example:

limit_urls_to: .sdsu.edu kpbs
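
The matching rule can be pictured with a few lines of Python. This is only a sketch: it assumes that "string match" means a plain substring test, and url_allowed is a hypothetical helper name, not an htdig function.

```python
def url_allowed(url, limit_urls_to):
    """True if at least one space-separated pattern occurs in the URL.

    The comparison ignores case, as the description states; treating
    each pattern as a substring is an assumption of this sketch.
    """
    lowered = url.lower()
    return any(pattern.lower() in lowered for pattern in limit_urls_to.split())
```

With the example value, both http://www.sdsu.edu/ and any URL containing "kpbs" pass, while unrelated hosts are rejected.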






local_default_doc



 type:

 string list

 used by:

 htdig

 default:

index.html

 description:

 Set this to the default documents in a directory used by the server. This is used for local filesystem access to translate URLs like http://foo.com/ into something like /home/foo.com/index.html.
 The list should only contain names that the local server recognizes as default documents for directory URLs, as defined by the DirectoryIndex setting in Apache's srm.conf, for example. As of version 3.1.5, this can be a string list rather than a single name, and htdig will use the first name that works. Since this requires a loop, setting the most common name first will improve performance. Special characters can be embedded in these names using %xx hex encoding.

 example:

 local_default_doc: default.html default.htm index.html index.htm






local_urls



 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

 Set this to tell ht://Dig to access certain URLs through local filesystems. At first ht://Dig will try to access pages with URLs matching the patterns through the filesystems specified. If it cannot find the file, or if it doesn't recognize the file name extension, it will try the URL through HTTP instead. Note the example: the equal sign and the final slashes in both the URL and the directory path are critical.
 The fallback to HTTP can be disabled by setting the local_urls_only attribute to true. To access user directory URLs through the local filesystem, set local_user_urls. The only file name extensions currently recognized for local filesystem access are .html, .htm, .txt, .asc, .ps, .eps and .pdf. For anything else, htdig must ask the HTTP server for the file, so it can determine the MIME content-type of it. As of version 3.1.5, you can provide multiple mappings of a given URL to different directories, and htdig will use the first mapping that works. Special characters can be embedded in these names using %xx hex encoding. For example, you can use %3D to embed an "=" sign in a URL pattern.

 example:

 local_urls: http://www.foo.com/=/usr/www/htdocs/
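
The lookup order described above (try each mapping in turn, fall back to HTTP when nothing works) might look like this in outline. It is a sketch under assumptions: the function names are invented for the example, URL prefixes are compared exactly, and directory URLs (which would involve local_default_doc) are not handled.

```python
import os
from urllib.parse import unquote

# Extensions the description lists as recognized for local access.
RECOGNIZED = (".html", ".htm", ".txt", ".asc", ".ps", ".eps", ".pdf")


def local_path_candidates(url, local_urls):
    """Yield filesystem paths to try, in the order the mappings are listed."""
    for mapping in local_urls.split():
        prefix, sep, directory = mapping.partition("=")
        if not sep:
            continue  # malformed entry without an equal sign
        # %xx escapes (such as %3D for "=") are decoded after splitting.
        prefix, directory = unquote(prefix), unquote(directory)
        if url.startswith(prefix):
            yield directory + url[len(prefix):]


def resolve_local(url, local_urls, exists=os.path.isfile):
    """Return the first existing local path, or None to signal HTTP fallback."""
    if not url.lower().endswith(RECOGNIZED):
        return None
    for path in local_path_candidates(url, local_urls):
        if exists(path):
            return path
    return None
```

A return value of None corresponds to the point where htdig would try HTTP instead, unless local_urls_only is set.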






local_urls_only



 type:

boolean

 used by:

 htdig

 default:

 false

 description:

 Set this to tell ht://Dig to access files only through the local filesystem, for URLs matching the patterns in the local_urls or local_user_urls attribute. If it cannot find the file, it will give up rather than trying HTTP. This will not affect files outside of the scope of local_urls and local_user_urls, which will still be fetched by HTTP. To disable all non-local fetching of files, you'll need to set the start_url and limit_urls_to attributes to allow only URLs covered by the local filesystem.

 example:

 local_urls_only: true






local_user_urls



 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

 Set this to access user directory URLs through the local filesystem. If you leave the "path" portion out, it will look up the user's home directory in /etc/passwd (or NIS or whatever). As with local_urls, if the files are not found, ht://Dig will try with HTTP. Again, note the example's format. To map http://www.my.org/~joe/foo/bar.html to /home/joe/www/foo/bar.html, try the example below.
 The fallback to HTTP can be disabled by setting the local_urls_only attribute to true. As of version 3.1.5, you can provide multiple mappings of a given URL to different directories, and htdig will use the first mapping that works. Special characters can be embedded in these names using %xx hex encoding. For example, you can use %3D to embed an "=" sign in a URL pattern.

 example:

 local_user_urls: http://www.my.org/=/home/,/www/
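
The example mapping can be read as URL prefix, "=", path, ",", subdirectory. The sketch below shows how one such entry would translate the ~joe URL from the description. The function name is hypothetical, and the case where the path portion is left out (home directory lookup) is deliberately not covered.

```python
def user_url_to_path(url, mapping):
    """Translate a ~user URL with one 'prefix=path,subdir' mapping."""
    url_prefix, _, rest = mapping.partition("=")
    path, _, subdir = rest.partition(",")
    user_prefix = url_prefix + "~"
    if not url.startswith(user_prefix):
        return None
    # Split "joe/foo/bar.html" into the user name and the remainder.
    user, slash, tail = url[len(user_prefix):].partition("/")
    if not user or not slash:
        return None
    return path + user + subdir + tail
```

For the example value, http://www.my.org/~joe/foo/bar.html becomes /home/joe/www/foo/bar.html, as the description states.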






locale



 type:

string

 used by:

 htdig

 default:

 C

 description:

 Set this to whatever locale you want your search database to cover. It affects the way international characters are dealt with. On most systems a list of legal locales can be found in /usr/lib/locale. Also check the setlocale(3C) man page. Note that depending on the locale you choose, and whether your system's locale implementation affects floating point input, you may need to specify the decimal point as a comma rather than a period. This will affect settings of search_algorithm and any of the scoring factors.

 example:

 locale: en_US






logging



 type:

boolean

 used by:

 htsearch

 default:

 false

 description:

 This sets whether htsearch should use syslog() to log search requests. If set, this will log requests with a default level of LOG_INFO and a facility of LOG_LOCAL5. For details on redirecting the log into a separate file or other actions, see the syslog.conf(5) man page. To set the level and facility used in logging, change LOG_LEVEL and LOG_FACILITY in the include/htconfig.h file before compiling.
 Each line logged by htsearch contains the following:
 REMOTE_ADDR [config] (match_method) [words] [logicalWords] (matches/matches_per_page) - page, HTTP_REFERER

 Where any of the above is null or empty, htsearch puts in '-', or 'default' for config.

 example:

 logging: true






maintainer



 type:

string

 used by:

 htdig

 default:

bogus@unconfigured.htdig.user

 description:

 This should be the email address of the person in charge of the digging operation. This string is added to the user-agent: field when the digger sends a request to a server.

 example:

 maintainer: ben.dover@uptight.com






match_method



 type:

string

 used by:

 htsearch

 default:

 and

 description:

 This is the default method for matching that htsearch uses. The valid choices are:

  • or
  • and
  • boolean

 This attribute will only be used if the HTML form that calls htsearch didn't have the method value set.

 example:

 match_method: boolean






matches_per_page



 type:

number

 used by:

 htsearch

 default:

 10

 description:

 If this is set to a relatively small number, the matches will be shown in pages instead of all at once.

 example:

 matches_per_page: 999






max_description_length



 type:

number

 used by:

 htdig

 default:

 60

 description:

 While gathering descriptions of URLs, htdig will only record those descriptions which are shorter than this length. This is used mostly to deal with broken HTML. (If a hyperlink is not terminated with a </a>, the description will go on until the end of the document.)

 example:

 max_description_length: 40






max_doc_size



 type:

number

 used by:

 htdig

 default:

100000

 description:

 This is the upper limit to the amount of data retrieved for documents. This is mainly used to prevent unreasonable memory consumption since each document will be read into memory by htdig.

 example:

 max_doc_size: 5000000






max_excerpts



 type:

number

 used by:

 htsearch

 default:

 1

 description:

 This value determines the maximum number of excerpts that can be displayed for one matching document in the search results.

 example:

 max_excerpts: 10






max_head_length



 type:

number

 used by:

 htdig

 default:

 512

 description:

 For each document retrieved, the top of the document is stored. This attribute determines the size of this block. The text that will be stored is only the text; no markup is stored.
 We found that storing 50,000 bytes will store about 95% of all the documents completely. The right value really depends on how much storage is available and how much you want to show.

 example:

 max_head_length: 50000






max_hop_count



 type:

number

 used by:

 htdig

 default:

999999

 description:

 Instead of limiting the indexing process by URL pattern, it can also be limited by the number of hops or clicks a document is removed from the starting URL. Unfortunately, this only works reliably when a complete index is created, not an update.
 The starting page will have hop count 0.

 example:

 max_hop_count: 4
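
Hop counting amounts to a breadth-first walk from the starting URL. The sketch below illustrates the idea on an in-memory link graph; it is not htdig's crawler, and the names crawl_order and links are assumptions made for the example.

```python
from collections import deque


def crawl_order(start_url, links, max_hop_count):
    """Return {url: hop_count} for every page reachable within the limit.

    links is a hypothetical mapping from a URL to the URLs it links to.
    """
    hops = {start_url: 0}  # the starting page has hop count 0
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if hops[url] >= max_hop_count:
            continue  # do not follow links beyond the limit
        for target in links.get(url, ()):
            if target not in hops:
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops
```

With a chain a -> b -> c -> d and a limit of 2, pages a, b and c are visited and d is left out.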






max_keywords



 type:

number

 used by:

 htdig

 default:

 -1 (no limit)

 description:

 This attribute can be used to limit the number of keywords per document that htdig will accept from meta keywords tags. A value of -1 or less means no limit. This can help combat meta keyword spamming, by limiting the number of keywords that will be indexed, but it will not completely prevent irrelevant matches in a search if the first few keywords in an offending document are not relevant to its contents.

 example:

 max_keywords: 10






max_meta_description_length



 type:

number

 used by:

 htdig

 default:

 512

 description:

 While gathering descriptions from meta description tags, htdig will truncate descriptions which are longer than this length.

 example:

 max_meta_description_length: 1000






max_prefix_matches



 type:

integer

 used by:

 htsearch

 default:

 1000

 description:

 The Prefix fuzzy algorithm could potentially match a very large number of words. This value limits the number of words each prefix can match. Note that this does not limit the number of documents that are matched in any way.

 example:

 max_prefix_matches: 100






max_stars



 type:

number

 used by:

 htsearch

 default:

 4

 description:

 When stars are used to display the score of a match, this value determines the maximum number of stars that can be displayed.

 example:

 max_stars: 6






maximum_page_buttons



 type:

integer

 used by:

 htsearch

 default:

 ${maximum_pages}

 description:

 This value limits the number of page links that will be included in the page list at the bottom of the search results page. By default, it takes on the value of the maximum_pages attribute, but you can set it to something lower to allow more pages than buttons. In this case, pages above this number will have no corresponding button.

 example:

 maximum_page_buttons: 20






maximum_pages



 type:

integer

 used by:

 htsearch

 default:

 10

 description:

 This value limits the number of page links that will be included in the page list at the bottom of the search results page. As of version 3.1.4, this will limit the total number of matching documents that are shown. You can make the number of page buttons smaller than the number of allowed pages by setting the maximum_page_buttons attribute.

 example:

 maximum_pages: 20






maximum_word_length



 type:

number

 used by:

 htdig and htsearch

 default:

 12

 description:

 This sets the maximum length of words that will be indexed. Words longer than this value will be silently truncated when put into the index, or when searched for in the index.

 example:

 maximum_word_length: 15






meta_description_factor



 type:

number

 used by:

 htdig

 default:

 50

 description:

 This is a factor which will be used to multiply the weight of words in any META description tags in a document. The number may be a floating point number. See also the title_factor and text_factor attributes.

 example:

 meta_description_factor: 20






metaphone_db



 type:

string

 used by:

 htfuzzy and htsearch

 default:

${database_base}.metaphone.db

 description:

 The database file used for the fuzzy "metaphone" search algorithm. This database is created by htfuzzy and used by htsearch.

 example:

 metaphone_db: ${database_base}.mp.db






method_names



 type:

 quoted string list

 used by:

 htsearch

 default:

 and All or Any boolean Boolean

 description:

 These values are used to create the method menu. It consists of pairs. The first element of each pair is one of the known methods, the second element is the text that will be shown in the menu for that method. This text needs to be quoted if it contains spaces. See the select list documentation for more information on how this attribute is used.

 example:

 method_names: or Or and And
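
The pairing of method and menu text can be made concrete with a short parser. This is an illustration only: it borrows Python's shlex quoting rules as a stand-in for ht://Dig's quoted string list syntax, which may differ in detail, and parse_method_names is a hypothetical name.

```python
import shlex


def parse_method_names(value):
    """Split a quoted string list into (method, menu_text) pairs."""
    tokens = shlex.split(value)
    if len(tokens) % 2:
        raise ValueError("method_names must contain pairs")
    return list(zip(tokens[0::2], tokens[1::2]))
```

For the default value this gives the pairs (and, All), (or, Any) and (boolean, Boolean); a menu text such as "All words" has to be quoted to stay in one piece.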






minimum_prefix_length



 type:

number

 used by:

 htsearch

 default:

 1

 description:

 This sets the minimum length of prefix matches used by the "prefix" fuzzy matching algorithm. Words shorter than this will not be used in prefix matching.

 example:

 minimum_prefix_length: 2






minimum_word_length



 type:

number

 used by:

 htdig and htsearch

 default:

 3

 description:

 This sets the minimum length of words that will be indexed. Words shorter than this value will be silently ignored but still put into the excerpt.
 Note that by making this value less than 3, a lot more words that are very frequent will be indexed. It might be advisable to add some of these to the bad_words list.

 example:

 minimum_word_length: 2






modification_time_is_now



 type:

boolean

 used by:

 htdig

 default:

 true

 description:

 This sets ht://Dig's response to a server that does not return a modification date. If false, it stores nothing. If modification_time_is_now is set to true, it will store the current time if the server does not return a date. Though this will return incorrect dates in search results, it may cut down on reindexing from such servers when doing updates, provided they still honor the If-Modified-Since header. Caching servers such as WWWoffle and Squid seem to do this.

 example:

 modification_time_is_now: false






multimatch_factor



 type:

number

 used by:

 htsearch

 default:

 1

 description:

 This factor, like backlink_factor, can be changed without modifying the database. It gives higher rankings to documents that have more than one matching search word when the or method is used. The matching words' combined scores are multiplied by this factor for each additional matching word.

 example:

 multimatch_factor: 1000






next_page_text



 type:

string

 used by:

 htsearch

 default:

[next]

 description:

 The text displayed in the hyperlink to go to the next page of matches.

 example:

 next_page_text: <img src="/htdig/buttonr.gif">






no_excerpt_show_top



 type:

boolean

 used by:

 htsearch

 default:

false

 description:

 If no excerpt is available, this option will act the same as excerpt_show_top, that is, it will show the top of the document.

 example:

 no_excerpt_show_top: yes






no_excerpt_text



 type:

string

 used by:

 htsearch

 default:

 <em>(None of the search words were found in the top of this document.)</em>

 description:

 This text will be displayed in place of the excerpt if there is no excerpt available. If this attribute is set to nothing (blank), the excerpt label will not be displayed in this case.

 example:

 no_excerpt_text:






noindex_start, noindex_end



 type:

string

 used by:

 htdig

 default:

 <!--htdig_noindex--> <!--/htdig_noindex-->

 description:

 The text encompassing a section of an HTML file that should be completely ignored when indexing. As in the defaults, this can be SGML comment declarations that can be inserted anywhere in the documents to exclude different sections from being indexed. However, existing tags can also be used; this is especially useful to exclude some sections from being indexed where the files to be indexed can not be edited. The example shows how SCRIPT sections in 'uneditable' documents can be skipped. Note how noindex_start does not contain an ending >: this allows for all SCRIPT tags to be matched regardless of attributes defined (different types or languages). Note that the match for this string is case insensitive.

 example:

 noindex_start: <SCRIPT
 noindex_end: </SCRIPT>
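
The effect of the two markers can be shown with a small filter. This is a sketch, not htdig's parser: strip_noindex is a hypothetical name, the input is assumed to be ASCII-compatible so that lowercasing keeps character positions, and dropping the rest of the document when the end marker is missing is an assumption.

```python
def strip_noindex(text, start="<!--htdig_noindex-->", end="<!--/htdig_noindex-->"):
    """Remove every section between start and end, ignoring case."""
    lowered, s, e = text.lower(), start.lower(), end.lower()
    out, pos = [], 0
    while True:
        i = lowered.find(s, pos)
        if i < 0:
            out.append(text[pos:])  # no further start marker
            break
        out.append(text[pos:i])
        j = lowered.find(e, i + len(s))
        if j < 0:
            break  # unterminated section: skip the rest (assumption)
        pos = j + len(e)
    return "".join(out)
```

Because the start marker "<SCRIPT" has no closing ">", a tag such as <script type="text/javascript"> is matched as well.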






no_next_page_text



 type:

string

 used by:

 htsearch

 default:

[next]

 description:

 The text displayed where there would normally be a hyperlink to go to the next page of matches.

 example:

 no_next_page_text:






no_page_list_header



 type:

string

 used by:

 htsearch

 default:

 <empty>

 description:

 This text will be used as the value of the PAGEHEADER variable, for use in templates or the search_results_footer file, when all search results fit on a single page.

 example:

 no_page_list_header: <hr noshade size=2>All results on this page.<br>






no_page_number_text



 type:

 quoted string list

 used by:

 htsearch

 default:

 <empty>

 description:

 The text strings in this list will be used when putting together the PAGELIST variable, for use in templates or the search_results_footer file, when search results fit on more than one page. The PAGELIST is the list of links at the bottom of the search results page. There should be as many strings in the list as there are pages allowed by the maximum_page_buttons attribute. If there are not enough, or the list is empty, the page numbers alone will be used as the text for the links. An entry from this list is used for the current page, as the current page is shown in the page list without a hypertext link, while entries from the page_number_text list are used for the links to other pages. The text strings can contain HTML tags to highlight page numbers or embed images. The strings need to be quoted if they contain spaces.

 example:

 no_page_number_text: <strong>1</strong> <strong>2</strong> \
     <strong>3</strong> <strong>4</strong> \
     <strong>5</strong> <strong>6</strong> \
     <strong>7</strong> <strong>8</strong> \
     <strong>9</strong> <strong>10</strong>






no_prev_page_text



 type:

string

 used by:

 htsearch

 default:

[prev]

 description:

 The text displayed where there would normally be a hyperlink to go to the previous page of matches.

 example:

 no_prev_page_text:






nothing_found_file



 type:

string

 used by:

 htsearch

 default:

 ${common_dir}/nomatch.html

 description:

 This specifies the file which contains the HTML text to display when no matches were found. The file should contain a complete HTML document.
 Note that this attribute could also be defined in terms of database_base to make it specific to the current search database.

 example:

 nothing_found_file: /www/searching/nothing.html






no_title_text



 type:

string

 used by:

 htsearch

 default:

filename

 description:

 This specifies the text to use in search results when no title is found in the document itself. If it is set to filename, htsearch will use the name of the file itself, enclosed in brackets (e.g. [index.html]).

 example:

 no_title_text: "No Title Found"






nph



 type:

boolean

 used by:

 htsearch

 default:

 false

 description:

 This attribute determines whether htsearch sends out full HTTP headers as required for an NPH (non-parsed header) CGI.

 example:

nph: true






page_list_header



 type:

string

 used by:

 htsearch

 default:

 <hr noshade size=2>Pages:<br>

 description:

 This text will be used as the value of the PAGEHEADER variable, for use in templates or the search_results_footer file, when search results fit on more than one page.

 example:

 page_list_header:






page_number_separator



 type:

 quoted string list

 used by:

 htsearch

 default:

 " "

 description:

 The text strings in this list will be used when putting together the PAGELIST variable, for use in templates or the search_results_footer file, when search results fit on more than one page. The PAGELIST is the list of links at the bottom of the search results page. The strings in the list will be used in rotation, and will separate individual entries taken from page_number_text and no_page_number_text. There can be as many or as few strings in the list as you like. If there are not enough for the number of pages listed, it goes back to the start of the list. If the list is empty, a space is used. The text strings can contain HTML tags. The strings need to be quoted if they contain spaces, or to specify an empty string.

 example:

 page_number_separator: "</td> <td>"
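
How the three lists might combine into a PAGELIST is sketched below. This is an illustration, not htsearch's code: the function name, the ?page=N link format and the per-entry fallback to the plain page number are all assumptions.

```python
from itertools import cycle


def build_page_list(current, total, separators=(" ",), page_text=(), no_page_text=()):
    """Join page entries, taking separators in rotation."""
    seps = cycle(separators or (" ",))  # an empty list means a single space
    parts = []
    for page in range(1, total + 1):
        if page == current:
            # Current page: text from no_page_number_text, no link.
            entry = no_page_text[page - 1] if page <= len(no_page_text) else str(page)
        else:
            # Other pages: text from page_number_text, wrapped in a link.
            label = page_text[page - 1] if page <= len(page_text) else str(page)
            entry = '<a href="?page=%d">%s</a>' % (page, label)
        if parts:
            parts.append(next(seps))
        parts.append(entry)
    return "".join(parts)
```

With two separators and three pages, the first separator goes between pages 1 and 2 and the second between pages 2 and 3; a fourth page would reuse the first separator.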






page_number_text



 type:

 quoted string list

 used by:

 htsearch

 default:

 <empty>

 description:

 The text strings in this list will be used when putting together the PAGELIST variable, for use in templates or the search_results_footer file, when search results fit on more than one page. The PAGELIST is the list of links at the bottom of the search results page. There should be as many strings in the list as there are pages allowed by the maximum_page_buttons attribute. If there are not enough, or the list is empty, the page numbers alone will be used as the text for the links. Entries from this list are used for the links to other pages, while an entry from the no_page_number_text list is used for the current page, as the current page is shown in the page list without a hypertext link. The text strings can contain HTML tags to highlight page numbers or embed images. The strings need to be quoted if they contain spaces.

 example:

 page_number_text: <em>1</em> <em>2</em> \
     <em>3</em> <em>4</em> \
     <em>5</em> <em>6</em> \
     <em>7</em> <em>8</em> \
     <em>9</em> <em>10</em>






pdf_parser



 type:

string

 used by:

 htdig

 default:

 path/acroread -toPostScript

 description:

 Set this to the path of the program used to parse PDF files, including all command-line options. The program will be called with the parameters:
 infile outfile,
 where infile is a file to parse and outfile is the PostScript output of the parser.

 The program is supposed to convert to a variant of PostScript, which is then parsed internally. Currently, only Adobe's acroread program has been tested as a pdf_parser. The default value of path is determined at compile time, to include the path to the acroread executable. This defaults to /usr/local/bin if the configuration program can't find acroread.

 To successfully index PDF files, be sure to set the max_doc_size attribute to a value larger than the size of your largest PDF file. PDF documents can not be parsed if they are truncated.

 Note: There is a bug in Acrobat 4's acroread command, which causes it to fail when -pairs is used. Ht://Dig version 3.1.3 and later include a work-around for this bug such that when acroread is the parser, and the -pairs option is not given, the second parameter will be the output directory rather than the output file name.

 The pdftops program that is part of the xpdf package is not suitable as a pdf_parser, because its variant of PostScript is slightly different. However, an alternative is to use xpdf's pdftotext program as a component of an external parser with the xpdf 0.90 package installed on your system, as described in FAQ question 4.9.

 example:

 pdf_parser: /usr/local/Acrobat3/bin/acroread -toPostScript -pairs






plural_suffix



 type:

string

 used by:

 htsearch

 default:

 s

 description:

 Specifies the value of the PLURAL_MATCHES template variable used in the header, footer and template files. This can be used for localization for non-English languages where 's' is not the appropriate suffix.

 example:

 plural_suffix: en






prefix_match_character



 type:

string

 used by:

 htsearch

 default:

 *

 description:

 A null prefix character means that prefix matching should be applied to every search word. Otherwise, prefix matching is done on any search word ending with the characters specified in this string, with the string being stripped off before looking for matches. The "prefix" algorithm must be enabled in search_algorithm for this to work. You may also want to set the max_prefix_matches and minimum_prefix_length attributes to get it working as you want.
 As a special case, in version 3.1.6 and later, if this string is non-null and is entered alone as a search word, it is taken as a wildcard that matches all documents in the database. If this string is null, the wildcard for this special case will be *. This wildcard doesn't require the prefix algorithm to be enabled.

 example:

 prefix_match_character: ing
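
The interplay of prefix_match_character, minimum_prefix_length and max_prefix_matches can be sketched over a plain word list. This is a hypothetical illustration, not htsearch's algorithm: prefix_matches and vocabulary are invented names, the alphabetical ordering of hits is an assumption, and the "match everything" wildcard special case is not modelled.

```python
def prefix_matches(search_word, vocabulary, prefix_match_character="*",
                   minimum_prefix_length=1, max_prefix_matches=1000):
    """Return indexed words that start with the search word's prefix."""
    if prefix_match_character:
        if not search_word.endswith(prefix_match_character):
            return []  # word is not marked for prefix matching
        # Strip the marker before looking for matches.
        stem = search_word[:-len(prefix_match_character)]
    else:
        stem = search_word  # null marker: every word is a prefix
    if len(stem) < minimum_prefix_length:
        return []
    hits = [word for word in sorted(vocabulary) if word.startswith(stem)]
    return hits[:max_prefix_matches]
```

With the default marker, "sea*" expands to every indexed word beginning with "sea", while a bare "sea" is left to the other search algorithms.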





prev_page_text



 type:u
/
string
f
 used by:
h
5 htsearch
t
 default:
i
[prev]i
s
 description:
g
3 The text displayed in the hyperlink to go to theg previous page of matches.
n
 example:
l
 m
e: prev_page_text: <img src="/htdig/buttonl.gif">
t
n





remove_bad_urls

 type:

 boolean

 used by:

 htmerge

 default:

 true

 description:

If TRUE, htmerge will remove from the database any URLs which htdig marked as unreachable. If FALSE, it will not do this. When htdig is run in initial mode, documents which were referred to but could not be accessed should probably be removed, so this option should then be set to TRUE. However, if htdig is run to update the database, this may cause documents on a server which is temporarily unavailable to be removed. This is probably NOT what was intended, so this option should be set to FALSE in that case.

 example:

 remove_bad_urls: true


remove_default_doc

 type:

 string list

 used by:

 htdig

 default:

 index.html

 description:

Set this to the default documents in a directory used by the servers you are indexing. These document names will be stripped off of URLs when they are normalized, if one of these names appears after the final slash, to translate URLs like http://foo.com/index.html into http://foo.com/

Note that you can disable stripping of these names during normalization by setting the list to an empty string. The list should only contain names that all servers you index recognize as default documents for directory URLs, as defined by the DirectoryIndex setting in Apache's srm.conf, for example.

 example:

 remove_default_doc: default.html default.htm index.html index.htm

or

 remove_default_doc:
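As a rough sketch of the normalization step described for remove_default_doc, the Python below strips a listed name when it appears after the final slash. The function name is hypothetical, and the sketch ignores query strings and the other normalization htdig performs.

```python
def strip_default_doc(url, default_docs=("index.html",)):
    """Drop a default document name that follows the final slash.

    Hypothetical helper for illustration only; an empty default_docs
    disables stripping, as the description states.
    """
    head, slash, tail = url.rpartition("/")
    if slash and tail in default_docs:
        return head + "/"
    return url
```

For example, strip_default_doc("http://foo.com/index.html") gives "http://foo.com/", while a URL ending in a name that is not in the list is returned unchanged.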



restrict

 type:

 string list

 used by:

 htsearch

 default:

 <empty>

 description:

This specifies a set of patterns that all URLs have to match against in order for them to be included in the search results. Any number of strings can be specified, separated by spaces. If multiple patterns are given, at least one of the patterns has to match the URL. The list can be specified from within the configuration file, and can be overridden with the "restrict" input parameter in the search form. Note that the restrict list does not take precedence over the exclude list - if a URL matches patterns in both lists it is still excluded from the search results.

 example:

 restrict: http://www.vh1.com/
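The interaction between the restrict and exclude lists can be sketched as follows. This assumes patterns are plain substrings of the URL, which the text above does not spell out; the function name is hypothetical.

```python
def url_allowed(url, restrict=(), exclude=()):
    """Decide whether a URL may appear in the search results.

    Assumption: each pattern is matched as a plain substring.
    """
    # exclude wins even when a restrict pattern also matches.
    if any(pattern in url for pattern in exclude):
        return False
    # With a non-empty restrict list, at least one pattern must match.
    if restrict and not any(pattern in url for pattern in restrict):
        return False
    return True
```

An empty restrict list places no restriction, so only the exclude list is consulted in that case.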



robotstxt_name

 type:

 string

 used by:

 htdig

 default:

 htdig

 description:

Sets the name that htdig will look for when parsing robots.txt files. This can be used to make htdig appear as a different spider than ht://Dig. Useful to distinguish between a private and a global index.

 example:

 robotstxt_name: myhtdig




script_name

 type:

 string

 used by:

 htsearch

 default:

 <empty>

 description:

Overrides the value of the SCRIPT_NAME environment variable. This is useful if htsearch is not being called directly as a CGI program, but indirectly from within a dynamic .shtml page using SSI directives. Previously, you needed a wrapper script to do this, but this configuration attribute makes wrapper scripts obsolete for SSI and possibly for other server scripting languages, as well. (You still need a wrapper script when using PHP, though.)

Check out the contrib/scriptname directory for a small example. Note that this attribute also affects the value of the CGI variables used in htsearch templates.

 example:

 script_name: /search/results.shtml


search_algorithm

 type:

 string list

 used by:

 htsearch

 default:

 exact:1

 description:

Specifies the search algorithms and their weight to use when searching. Each entry in the list consists of the algorithm name, followed by a colon (:) followed by a weight multiplier. The multiplier is a floating point number between 0 and 1. Note that depending on your locale setting, and whether your system's locale implementation affects floating point input, you may need to specify the decimal point as a comma rather than a period. Current algorithms supported are:

exact
    The default exact word matching algorithm. This will find only exactly matched words.
soundex
    Uses a slightly modified soundex algorithm to match words. This requires that the soundex database be present. It is generated with the htfuzzy program.
metaphone
    Uses the metaphone algorithm for matching words. This algorithm is more specific to the English language than soundex. Its database is generated with the htfuzzy program.
accents
    Uses the accents algorithm for matching words. This algorithm will treat all accented letters as equivalent to their unaccented counterparts. It requires the accents database, which is generated with the htfuzzy program.
endings
    This algorithm uses language specific word endings to find matches. Each word is first reduced to its word root and then all known legal endings are used for the matching. This algorithm uses two databases which are generated with htfuzzy.
synonyms
    Performs a dictionary lookup on all the words. This algorithm uses a database generated with the htfuzzy program.
substring
    Matches all words containing the queries as substrings. Since this requires checking every word in the database, this can slow down searches considerably.
prefix
    Matches all words beginning with the query strings. Uses the option prefix_match_character to decide whether a query requires prefix matching. For example "abc*" would perform prefix matching on "abc" since * is the default prefix_match_character.

 example:

 search_algorithm: exact:1 soundex:0.3
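To make the name:weight format concrete, here is a small Python sketch that splits a search_algorithm value into (name, weight) pairs. It is not htsearch's parser; the function name is hypothetical, and accepting a comma as the decimal point regardless of locale is a simplification of the locale note above.

```python
def parse_search_algorithm(value):
    """Split 'exact:1 soundex:0.3' into [(name, weight), ...].

    Illustrative only. A comma is accepted as a decimal point no matter
    what the locale is; an entry without a weight raises ValueError.
    """
    result = []
    for entry in value.split():
        name, _, weight = entry.partition(":")
        result.append((name, float(weight.replace(",", "."))))
    return result
```

The example value above parses to the pairs ("exact", 1.0) and ("soundex", 0.3).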


search_results_contenttype

 type:

 string

 used by:

 htsearch

 default:

 text/html

 description:

This specifies a Content-type to be output as an HTTP header at the start of search results. If set to an empty string, the Content-type header will be omitted altogether.

 example:

 search_results_contenttype: text/xml


search_results_footer

 type:

 string

 used by:

 htsearch

 default:

 ${common_dir}/footer.html

 description:

This specifies a filename to be output at the end of search results. While outputting the footer, some variables will be expanded. Variables use the same syntax as the Bourne shell. If there is a variable VAR, the following will all be recognized:
  •  $VAR
  •  $(VAR)
  •  ${VAR}

The following variables are available:

MATCHES
    The number of documents that were matched.
PLURAL_MATCHES
    If MATCHES is not 1, this will be the string "s", else it is an empty string. This can be used to say something like "$(MATCHES) document$(PLURAL_MATCHES) were found"
MAX_STARS
    The value of the max_stars attribute.
LOGICAL_WORDS
    A string of the search words with either "and" or "or" between the words, depending on the type of search.
WORDS
    A string of the search words with spaces in between.
PAGEHEADER
    This expands to either the value of the page_list_header or no_page_list_header attribute depending on how many pages there are.

Note that this file will NOT be output if no matches were found. In this case the nothing_found_file attribute is used instead. Also, this file will not be output if it is overridden by defining the search_results_wrapper attribute.

 example:

 search_results_footer: /usr/local/etc/ht/end-stuff.html
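The three accepted variable spellings can be illustrated with a short Python sketch. It is a stand-in for htsearch's template expansion, not a copy of it: the function name is hypothetical, and replacing unknown variables with an empty string is an assumption.

```python
import re

# Matches $(VAR), ${VAR} or $VAR; exactly one of the three groups is set.
_VARIABLE = re.compile(r"\$(?:\((\w+)\)|\{(\w+)\}|(\w+))")


def expand(template, variables):
    """Expand template variables from a dict (illustrative sketch).

    Assumption: a variable missing from the dict expands to "".
    """
    def substitute(match):
        name = match.group(1) or match.group(2) or match.group(3)
        return str(variables.get(name, ""))

    return _VARIABLE.sub(substitute, template)
```

Calling expand("$(MATCHES) document$(PLURAL_MATCHES) were found", {"MATCHES": 3, "PLURAL_MATCHES": "s"}) produces "3 documents were found".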


search_results_header

 type:

 string

 used by:

 htsearch

 default:

 ${common_dir}/header.html

 description:

This specifies a filename to be output at the start of search results. While outputting the header, some variables will be expanded. Variables use the same syntax as the Bourne shell. If there is a variable VAR, the following will all be recognized:
  •  $VAR
  •  $(VAR)
  •  ${VAR}

The following variables are available:

MATCHES
    The number of documents that were matched.
PLURAL_MATCHES
    If MATCHES is not 1, this will be the string "s", else it is an empty string. This can be used to say something like "$(MATCHES) document$(PLURAL_MATCHES) were found"
MAX_STARS
    The value of the max_stars attribute.
LOGICAL_WORDS
    A string of the search words with either "and" or "or" between the words, depending on the type of search.
WORDS
    A string of the search words with spaces in between.

Note that this file will NOT be output if no matches were found. In this case the nothing_found_file attribute is used instead. Also, this file will not be output if it is overridden by defining the search_results_wrapper attribute.

 example:

 search_results_header: /usr/local/etc/ht/start-stuff.html


search_results_wrapper

 type:

 string

 used by:

 htsearch

 default:

 <empty>

 description:

This specifies a filename to be output at the start and end of search results. This file replaces the search_results_header and search_results_footer files, with the contents of both in one file, and uses the pseudo-variable $(HTSEARCH_RESULTS) as a separator for the header and footer sections. If the filename is not specified, the file is unreadable, or the pseudo-variable above is not found, htsearch reverts to the separate header and footer files instead. While outputting the wrapper, some variables will be expanded, just as for the search_results_header and search_results_footer files.

Note that this file will NOT be output if no matches were found. In this case the nothing_found_file attribute is used instead.

 example:

 search_results_wrapper: ${common_dir}/wrapper.html



search_rewrite_rules

 type:

 string list

 used by:

 htsearch

 default:

 <empty>

 description:

This is a list of pairs, regex replacement, used to rewrite URLs in the search results. The left hand string is a regex; the right hand string is a literal string with embedded placeholders for fragments that matched inside brackets in the regex. \0 is the whole matched string, \1 to \9 are bracketed substrings. The backslash must be doubled-up in the attribute setting to get past the variable expansion parsing. Rewrite rules are applied sequentially to each URL before it is displayed or checked against the restrict or exclude lists. Rewriting does not stop once a match has been made, so multiple rules may affect a given URL. See also url_part_aliases which allows URLs to be of one form during indexing and translated for results, and url_rewrite_rules which allows URLs to be rewritten while indexing.

 example:

 search_rewrite_rules: http://(.*)\\.mydomain\\.org/([^/]*) http://\\2.\\1.com \
     http://www\\.myschool\\.edu/myorgs/([^/]*) http://\\1.org
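The sequential application of rewrite rules can be sketched in Python as below. Several things here are assumptions rather than documented behavior: Python's re module stands in for the regex library htsearch uses, only the matched portion of the URL is replaced, and the patterns are written with single backslashes because the doubling is only needed inside the configuration file. The function name is hypothetical.

```python
import re


def apply_rewrite_rules(url, rules):
    """Apply (regex, replacement) pairs in order; later rules see the
    output of earlier ones. Illustrative sketch, not htsearch's code."""
    for pattern, replacement in rules:
        match = re.search(pattern, url)
        if match is None:
            continue

        def group_text(placeholder, match=match):
            # \0 is the whole match, \1..\9 are bracketed substrings.
            index = int(placeholder.group(1))
            if index > match.re.groups:
                return ""
            return match.group(index) or ""

        new_text = re.sub(r"\\(\d)", group_text, replacement)
        # Assumption: only the matched span of the URL is replaced.
        url = url[: match.start()] + new_text + url[match.end():]
    return url


# The two rules from the example above, with backslashes un-doubled.
EXAMPLE_RULES = [
    (r"http://(.*)\.mydomain\.org/([^/]*)", r"http://\2.\1.com"),
    (r"http://www\.myschool\.edu/myorgs/([^/]*)", r"http://\1.org"),
]
```

Under these assumptions, "http://foo.mydomain.org/bar" becomes "http://bar.foo.com" and "http://www.myschool.edu/myorgs/chess" becomes "http://chess.org".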



server_aliases

 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

This attribute tells the indexer that servers have several DNS aliases, which all point to the same machine and are NOT virtual hosts. This allows you to ensure pages are indexed only once on a given machine, despite the alias used in a URL. As shown in the example, the mapping goes from left to right, so the server name on the right hand side is the one that is used.

 example:

 server_aliases: foo.mydomain.com:80=www.mydomain.com:80 \
     bar.mydomain.com:80=www.mydomain.com:80




server_max_docs

 type:

 integer

 used by:

 htdig

 default:

 -1 (no limit)

 description:

This attribute tells htdig to limit the dig to retrieve a maximum number of documents from each server. This can cause unusual behavior on update digs since the old URLs are stored alphabetically. Therefore, update digs will add additional URLs in pseudo-alphabetical order, up to the limit of the attribute. However, it is most useful to partially index a server as the URLs of additional documents are entered into the database, marked as never retrieved.

 example:

 server_max_docs: 50



server_wait_time

 type:

 integer

 used by:

 htdig

 default:

 0

 description:

This attribute tells htdig to ensure a server has had a delay (in seconds) from the beginning of the last connection. This can be used to prevent "server abuse" by digging without delay. It's recommended to set this to 10-30 (seconds) when indexing servers that you don't monitor yourself. Additionally, this attribute can slow down local indexing if set, which may or may not be what you intended.

 example:

 server_wait_time: 20




sort

 type:

 string

 used by:

 htsearch

 default:

 score

 description:

This is the default sorting method that htsearch uses to determine the order in which matches are displayed. The valid choices are:
  •  score
  •  time
  •  title
  •  revscore
  •  revtime
  •  revtitle

This attribute will only be used if the HTML form that calls htsearch didn't have the sort value set. The words date and revdate can be used instead of time and revtime, as both will sort by the time that the document was last modified, if this information is given by the server. The default is to sort by the score, which ranks documents by best match. The sort methods that begin with "rev" simply reverse the order of the sort. Note that setting this to something other than "score" will incur a slowdown in searches.

 example:

 sort: revtime
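The sort methods can be modelled with the Python sketch below. The natural orders (best score first, newest first, titles A to Z) are an assumption inferred from the menu labels used in the sort_names example, and the match records with "score", "time" and "title" keys are hypothetical.

```python
def sort_matches(matches, method="score"):
    """Order match dicts by one of the documented sort methods.

    Illustrative only. Assumed natural orders: highest score first,
    newest time first, titles ascending; "rev" methods reverse them.
    """
    # date and revdate are documented synonyms for time and revtime.
    method = {"date": "time", "revdate": "revtime"}.get(method, method)
    reverse = method.startswith("rev")
    field = method[3:] if reverse else method
    if field not in ("score", "time", "title"):
        raise ValueError("unknown sort method: " + method)
    descending = field in ("score", "time")
    ordered = sorted(matches, key=lambda m: m[field], reverse=descending)
    return ordered[::-1] if reverse else ordered
```

Given three matches, sort_matches(matches, "revtime") returns them oldest first, and "date" gives the same order as "time".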


sort_names

 type:

 quoted string list

 used by:

 htsearch

 default:

 score Score time Time title Title revscore 'Reverse Score' revtime 'Reverse Time' revtitle 'Reverse Title'

 description:

These values are used to create the sort menu. It consists of pairs. The first element of each pair is one of the known sort methods, the second element is the text that will be shown in the menu for that sort method. This text needs to be quoted if it contains spaces. See the select list documentation for more information on how this attribute is used.

 example:

 sort_names: score 'Best Match' time Newest title A-Z \
     revscore 'Worst Match' revtime Oldest revtitle Z-A
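A quoted string list of pairs, such as the sort_names value, can be split as in this Python sketch. It borrows the quoting rules of the standard shlex module, which is an assumption about how closely they match the configuration parser; the function name is hypothetical.

```python
import shlex


def parse_pairs(value):
    """Split a quoted string list into (method, label) pairs.

    Illustrative only: shlex's POSIX quoting is assumed to be close
    enough to the configuration file's quoting for simple values.
    """
    tokens = shlex.split(value)
    if len(tokens) % 2:
        raise ValueError("expected an even number of items")
    return list(zip(tokens[0::2], tokens[1::2]))
```

For instance, "score 'Best Match' time Newest" yields the pairs ("score", "Best Match") and ("time", "Newest").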



soundex_db

 type:

 string

 used by:

 htfuzzy and htsearch

 default:

 ${database_base}.soundex.db

 description:

The database file used for the fuzzy "soundex" search algorithm. This database is created by htfuzzy and used by htsearch.

 example:

 soundex_db: ${database_base}.snd.db


star_blank

 type:

 string

 used by:

 htsearch

 default:

 ${image_url_prefix}/star_blank.gif

 description:

This specifies the URL to use to display a blank image of the same size as the star defined in the star_image attribute or in the star_patterns attribute.

 example:

 star_blank: http://www.somewhere.org/icons/elephant.gif




star_image

 type:

 string

 used by:

 htsearch

 default:

 ${image_url_prefix}/star.gif

 description:

This specifies the URL to use to display a star. This allows you to use some other icon instead of a star. (We like the star...)

The display of stars can be turned on or off with the use_star_image attribute and the maximum number of stars that can be displayed is determined by the max_stars attribute.

Even though the image can be changed, the ALT value for the image will always be a '*'.

 example:

 star_image: http://www.somewhere.org/icons/elephant.gif




star_patterns

 type:

 string list

 used by:

 htsearch

 default:

 <empty>

 description:

This attribute allows the star image to be changed depending on the URL or the match it is used for. This is mainly to make a visual distinction between matches on different web sites. The star image could be replaced with the logo of the company the match refers to.

It is advisable to keep all the images the same size in order to line things up properly in a short result listing.

The format is simple. It is a list of pairs. The first element of each pair is a pattern, the second element is a URL to the image for that pattern.

 example:

 star_patterns: http://www.sdsu.edu /sdsu.gif \
     http://www.ucsd.edu /ucsd.gif



startday

 type:

 integer

 used by:

 htsearch

 default:

 <empty>

 description:

This specifies the day of the cutoff start date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The startday can be specified from within the configuration file, and can be overridden with the "startday" input parameter in the search form. If a negative number is given, it is taken as relative to the current date. Relative days can span several months or even years if desired. A startday of -90 will select matching documents modified within the last 90 days.

 example:

 startday: 1
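The relative form of startday amounts to simple date arithmetic, sketched here in Python. The function name is hypothetical, and the sketch covers only a negative startday on its own; how htsearch combines it with startmonth and startyear is not described above and is not modelled.

```python
from datetime import date, timedelta


def cutoff_from_relative_day(startday, today):
    """Return the cutoff start date for a negative (relative) startday.

    Illustrative only: a startday of -90 means 90 days before `today`.
    """
    if startday >= 0:
        raise ValueError("this sketch only handles relative (negative) days")
    return today + timedelta(days=startday)
```

With today set to 30 June 2002, a startday of -90 gives a cutoff of 1 April 2002, so the range spans several months as the description allows.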




start_ellipses

 type:

 string

 used by:

 htsearch

 default:

 <b><tt>... </tt></b>

 description:

When excerpts are displayed in the search output, this string will be prepended to the excerpt if there is text before the text displayed. This is just a visual reminder to the user that the excerpt is only part of the complete document.

 example:

 start_ellipses: ...


start_highlight

 type:

 string

 used by:

 htsearch

 default:

 <strong>

 description:

When excerpts are displayed in the search output, matched words will be highlighted using this string and end_highlight. You should ensure that highlighting tags are balanced, that is, any formatting tags that this string opens should be closed by end_highlight.

 example:

 start_highlight: <font color="#FF0000">




startmonth

 type:

 integer

 used by:

 htsearch

 default:

 <empty>

 description:

This specifies the month of the cutoff start date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The startmonth can be specified from within the configuration file, and can be overridden with the "startmonth" input parameter in the search form. If a negative number is given, it is taken as relative to the current month. Relative months can span several years if desired.

 example:

 startmonth: 2



start_url

 type:

 string list

 used by:

 htdig

 default:

 http://www.htdig.org/

 description:

This is the list of URLs that will be used to start a dig when there was no existing database. Note that multiple URLs can be given here.

Note also that the value of start_url will be the default value for limit_urls_to, so if you set start_url to the URLs for specific files, rather than a site or subdirectory URL, you may need to set limit_urls_to to something less restrictive so htdig doesn't reject links in the documents.

 example:

 start_url: http://www.somewhere.org/alldata/index.html


startyear

 type:

 integer

 used by:

 htsearch

 default:

 <empty>

 description:

This specifies the year of the cutoff start date for search results. If the start or end date are specified, only results with a last modified date within this range are shown. The startyear can be specified from within the configuration file, and can be overridden with the "startyear" input parameter in the search form. If a negative number is given, it is taken as relative to the current year.

 example:

 startyear: 1995



substring_max_words

 type:

 integer

 used by:

 htsearch

 default:

 25

 description:

The Substring fuzzy algorithm could potentially match a very large number of words. This value limits the number of words each substring pattern can match. Note that this does not limit the number of documents that are matched in any way.

 example:

 substring_max_words: 100




synonym_dictionary

 type:

 string

 used by:

 htfuzzy

 default:

 ${common_dir}/synonyms

 description:

This points to a text file containing the synonym dictionary used for the synonyms search algorithm.

Each line of this file has at least two words. The first word is the word to replace, the rest of the words are synonyms for that word.

 example:

 synonym_dictionary: /usr/dict/synonyms




synonym_db

 type:

 string

 used by:

 htsearch and htfuzzy

 default:

 ${common_dir}/synonyms.db

 description:

Points to the database that htfuzzy creates when the synonyms algorithm is used. htsearch uses this to perform synonym dictionary lookups.

 example:

 synonym_db: ${database_base}.syn.db


syntax_error_file

 type:

 string

 used by:

 htsearch

 default:

 ${common_dir}/syntax.html

 description:

This points to the file which will be displayed if a boolean expression syntax error was found.

 example:

 syntax_error_file: ${common_dir}/synerror.html



template_map

 type:

 quoted string list

 used by:

 htsearch

 default:

 Long builtin-long builtin-long Short builtin-short builtin-short

 description:

This maps match template names to internal names and template file names. It is a list of triplets. The first element in each triplet is the name that will be displayed in the FORMAT menu. The second element is the name used internally and the third element is a filename of the template to use.

There are two predefined templates, namely builtin-long and builtin-short. If the filename is one of those, they will be used instead.

More information about templates can be found in the htsearch documentation.

 example:

 template_map: Short short ${common_dir}/short.html \
     Normal normal builtin-long \
     Detailed detail ${common_dir}/detail.html




template_name

 type:

 string

 used by:

 htsearch

 default:

 builtin-long

 description:

Specifies the default template if none is given by the search form. This name needs to map to an entry in the template_map.

 example:

 template_name: long



template_patterns

 type:

 string list

 used by:

 htsearch

 default:

 <empty>

 description:

This attribute allows the results template to be changed depending on the URL or the match it is used for. This is mainly to make a visual distinction between matches on different web sites. The results for each site could thus be shown in a style matching that site.

The format is simply a list of pairs. The first element of each pair is a pattern, the second element is the name of the template file for that pattern.

More information about templates can be found in the htsearch documentation.

Normally, when using this template selection method, you would disable user selection of templates via the format input parameter in search forms, as the two methods were not really designed to interact. Templates selected by URL patterns would override any user selection made in the form. If you want to use the two methods together, see the notes on combining them for an example of how to do this.

 example:

 template_patterns: http://www.sdsu.edu ${common_dir}/sdsu.html \
     http://www.ucsd.edu ${common_dir}/ucsd.html



text_factor

 type:

 number

 used by:

 htdig

 default:

 1

 description:

This is a factor which will be used to multiply the weight of words that are not in any special part of a document. Setting a factor to 0 will cause normal words to be ignored. The number may be a floating point number. See also the heading_factor_[1-6], title_factor, and keywords_factor attributes.

 example:

 text_factor: 0


timeout

 type:

 number

 used by:

 htdig

 default:

 30

 description:

Specifies the time the digger will wait to complete a network read. This is just a safeguard against unforeseen things like the all too common transformation from a network to a notwork.

The timeout is specified in seconds.

 example:

 timeout: 42



title_factor

 type:

 number

 used by:

 htdig

 default:

 100

 description:

This is a factor which will be used to multiply the weight of words in the title of a document. Setting a factor to 0 will cause words in the title to be ignored. The number may be a floating point number. See also the heading_factor_[1-6] attribute.

 example:

 title_factor: 12


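translate_amp

 type:

 boolean

 used by:

 htdig

 default:

 true

 description:

If set to false, the entity &amp; (or &#38;) will not be translated into its ASCII equivalent &. If translation were taking place, an excerpt containing a & might be misinterpreted by the browser and look unrecognizable to the user. This isn't a problem with versions 3.1.5 and later of htsearch, which convert the translated character back into the &amp; entity. For this reason, translating this entity is now the default behavior.

 example:

 translate_amp: false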


translate_latin1

 type:

 boolean

 used by:

 htdig

 default:

 true

 description:

If set to false, the SGML entities for ISO-8859-1 (or Latin 1) characters above &nbsp; (or &#160;) will not be translated into their 8-bit equivalents. This attribute should be set to false when using a locale that doesn't use the ISO-8859-1 character set, to avoid these entities being mapped to inappropriate 8-bit characters shown in a different character set in search results.

 example:

 translate_latin1: false



translate_lt_gt

 type:

 boolean

 used by:

 htdig

 default:

 true

 description:

If set to false, the entities &lt; (or &#60;) and &gt; (or &#62;) will not be translated into their ASCII equivalents < and >. If translation were taking place, an excerpt containing < and > might be misinterpreted by the browser and look unrecognizable to the user. This isn't a problem with versions 3.1.5 and later of htsearch, which convert the translated characters back into the proper entities. For this reason, translating these entities is now the default behavior.

 example:

 translate_lt_gt: false


translate_quot

 type:

 boolean

 used by:

 htdig

 default:

 true

 description:

If set to false, the entity &quot; (or &#34;) will not be translated into its ASCII equivalent ". If translation were taking place, an excerpt containing a " might be misinterpreted by the browser and look unrecognizable to the user. This isn't a problem with versions 3.1.5 and later of htsearch, which convert the translated character back into the &quot; entity. For this reason, translating this entity is now the default behavior.

 example:

 translate_quot: false




uncoded_db_compatible

 type:

 boolean

 used by:

 htdig, htdump, htload, htnotify, htmerge and htsearch

 default:

 true

 description:

If set to true, databases in which some or all URLs were not encoded through the common_url_parts and url_part_aliases options can still be read, at the cost of time for extra database accesses and of not getting the full effect of those options.

Note that the database still needs to be rebuilt if either or both of common_url_parts and url_part_aliases were non-empty when it was built or modified, or if they were set to anything other than the current values.

If a to-string in url_part_aliases can occur in normal URLs, this option should be set to false to eliminate surprises.

 example:

 uncoded_db_compatible: false



url_list

 type:

 string

 used by:

 htdig

 default:

 ${database_base}.urls

 description:

This file is only created if create_url_list is set to true. It will contain a list of all URLs that were seen.

 example:

 url_list: /tmp/urls




url_log

 type:

 string

 used by:

 htdig

 default:

 ${database_base}.log

 description:

If htdig is run with the -l option and interrupted, it will write out its progress to this file. Note that if it has a large number of URLs to write, it may take some time to exit. This can especially happen when running update digs and the run is interrupted soon after beginning.

 example:

 url_log: /tmp/htdig.progress


url_part_aliases

 type:

 string list

 used by:

 htdig, htdump, htload, htnotify, htmerge and htsearch

 default:

 <empty>

 description:

A list of translation pairs, from and to, used when accessing the database. If a part of a URL matches the from-string of a pair, it will be translated into the to-string just before writing the URL to the database, and translated back just after reading it from the database.

This is primarily used to provide an easy way to rename parts of URLs, e.g. changing www.example.com/~htdig to www.htdig.org. Two different configuration files for digging and searching are then used, with url_part_aliases having different from-strings, but identical to-strings.

See also common_url_parts.

Strings that are normally incorrect in URLs, or very seldom used, should be used as to-strings, since extra storage will be used each time one is found as a normal part of a URL. Translations will be performed with priority for the leftmost longest match. Each to-string must be unique and not be a part of any other to-string. It also helps to keep the to-strings short to save space in the database. Other than that, the choice of to-strings is pretty arbitrary, as they just provide a temporary, internal encoding in the databases, and none of the characters in these strings have any special meaning.

Note that when this attribute is changed, the database should be rebuilt, unless the effect of "moving" the affected URLs in the database is wanted, as described above.

Please note: Don't just copy the example below into a single configuration file. There are two separate settings of url_part_aliases below; the first one is for the configuration file to be used by htdig, htmerge, and htnotify, and the second one is for the configuration file to be used by htsearch. In this example, htdig will encode the URL "http://search.example.com/~htdig/contrib/stuff.html" as "*sitecontrib/stuff*2" in the databases, and htsearch will decode it as "http://www.htdig.org/contrib/stuff.htm".

As of version 3.1.6, you can also do more complex rewriting of URLs using url_rewrite_rules and search_rewrite_rules.

 example:

 url_part_aliases: http://search.example.com/~htdig/ *site \
     http://www.htdig.org/this/ *1 \
     .html *2

 url_part_aliases: http://www.htdig.org/ *site \
     http://www.htdig.org/that/ *1 \
     .htm *2
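The encode and decode round trip from the example can be reproduced with the Python sketch below. It only models the leftmost-longest substitution described above; the function and variable names are hypothetical, and the real programs may combine this with other encodings such as common_url_parts.

```python
def translate(text, pairs):
    """Replace each leftmost, longest key of `pairs` found in `text`."""
    keys = sorted(pairs, key=len, reverse=True)  # longest candidates first
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(pairs[key])
                i += len(key)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


# from-string -> to-string, as in the digging configuration above.
DIG_ALIASES = {
    "http://search.example.com/~htdig/": "*site",
    "http://www.htdig.org/this/": "*1",
    ".html": "*2",
}
# to-string -> from-string, inverted from the searching configuration.
SEARCH_DECODE = {
    "*site": "http://www.htdig.org/",
    "*1": "http://www.htdig.org/that/",
    "*2": ".htm",
}

encoded = translate("http://search.example.com/~htdig/contrib/stuff.html", DIG_ALIASES)
decoded = translate(encoded, SEARCH_DECODE)
```

Here encoded is "*sitecontrib/stuff*2" and decoded is "http://www.htdig.org/contrib/stuff.htm", matching the values quoted in the description.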


o
i
& url_rewrite_rules

>

 type:
t
 string list

 used by:

! htdiga
<
 default:

 <empty>t

 description:
y
* This is a list of pairs, regex+ replacement used to permanentlyu2 rewrite URLs as they are indexed. The left hand0 string is a regex; the right hand string is a0 literal string with embedded placeholders for0 fragments that matched inside brackets in the/ regex. \0 is the whole matched string, \1 toa3 \9 are bracketted substrings. The backslash mustb5 be doubled-up in the attribute setting to get pastz0 the variable expansion parsing. Rewrite rules0 are applied sequentially to each incoming URL2 before normalization occurs. Rewriting does not/ stop once a match has been made, so multiplee) rules may affect a given URL. See alsod3 url_part_aliasesd- which allows URLs to be of one form duringl+ indexing and translated for results, andv; search_rewrite_rulesm7 which allows URLs to be rewritten in search results.>
p
 example:

   h
< url_rewrite_rules: "- (.*)\\?JServSessionIdroot=.* \\1 \
a1 (.*)\\&JServSessionIdroot=.* \\1 \
 (.*)&context=.* \\1
t
e



<
u
!  use_doc_dater

/

 type:e
d
boolean
i
 used by:
r
! htdign
r
 default:
,
 false
o
 description:
r
0 If set to true, htdig will use META date tags1 in documents, overriding the modification dateu1 returned by the server. Any documents that doe/ not have META date tags will retain the lasts0 modified date returned by the server or found on the local file system.0 As of version 3.1.6, in addition to META date+ tags, htdig will also recognize dc.date,b( dc.date.created and dc.date.modified.

 example:

 use_doc_date: true
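
 The fallback rule in the description can be written out as a small decision function. This Python sketch is illustrative only: effective_date is a hypothetical helper, and it assumes the META (or dc.date) value has already been parsed into a datetime, or is None when the document has no such tag.

```python
from datetime import datetime


def effective_date(server_date, meta_date=None, use_doc_date=False):
    """Pick the date to record for a document (hypothetical helper).

    meta_date is None when the document carries no META date tag.
    """
    if use_doc_date and meta_date is not None:
        return meta_date
    return server_date


server = datetime(2002, 1, 27)
meta = datetime(2001, 6, 1)
print(effective_date(server, meta, use_doc_date=True))   # 2001-06-01 00:00:00
print(effective_date(server, None, use_doc_date=True))   # 2002-01-27 00:00:00
print(effective_date(server, meta, use_doc_date=False))  # 2002-01-27 00:00:00
```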





use_meta_description

 type:

boolean

 used by:

 htsearch

 default:

 false

 description:

 If set to true, any META description tags will be used as excerpts by htsearch. Any documents that do not have META descriptions will retain their normal excerpts.

 example:

 use_meta_description: true




use_star_image

 type:

boolean

 used by:

 htsearch

 default:

 true

 description:

 If set to true, the star_image attribute is used to display up to max_stars images for each match.

 example:

 use_star_image: no





user_agent

 type:

string

 used by:

 htdig

 default:

 htdig

 description:

 This allows customization of the user_agent: field sent when the digger requests a file from a server.

 example:

 user_agent: htdig-digger




valid_extensions

 type:

 string list

 used by:

 htdig

 default:

 <empty>

 description:

 This is a list of extensions on URLs which are the only ones considered acceptable. This list is used to supplement the MIME-types that the HTTP server provides with documents. Some HTTP servers do not have a correct list of MIME-types and so can advertise certain documents as text when they are actually in some binary format. If the list is empty, then all extensions are acceptable, provided they pass other criteria for acceptance or rejection. If the list is not empty, only documents with one of the extensions in the list are parsed. See also bad_extensions.

 example:

 valid_extensions: .html .htm .shtml
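
 The empty-list and non-empty-list cases can be sketched as a simple filter. This Python sketch is not ht://Dig's code: has_valid_extension is a hypothetical helper, and two behaviours are assumptions that the description does not state, namely that extensions are compared case-insensitively and that URLs without any extension are not rejected by this check.

```python
import os
from urllib.parse import urlparse


def has_valid_extension(url, valid_extensions):
    """Return True if the URL passes the valid_extensions check (sketch)."""
    if not valid_extensions:
        return True  # empty list: all extensions are acceptable
    # Assumption: the comparison ignores case.
    extension = os.path.splitext(urlparse(url).path)[1].lower()
    if not extension:
        return True  # assumption: URLs without an extension are not rejected here
    return extension in valid_extensions


valid = [".html", ".htm", ".shtml"]
print(has_valid_extension("http://www.example.com/index.html", valid))  # True
print(has_valid_extension("http://www.example.com/logo.gif", valid))    # False
print(has_valid_extension("http://www.example.com/logo.gif", []))       # True
```

 A URL that passes this check would still have to pass the other acceptance criteria, such as bad_extensions.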




valid_punctuation

 type:

string

 used by:

 htdig and htsearch

 default:

 .-_/!#$%^&'

 description:

 This is the set of characters which will be deleted from the document before determining what a word is. This means that if a document contains something like "Andrew's" the digger will see this as "Andrews".
 The same transformation is performed on the keywords the search engine gets.
 See also the extra_word_characters attribute.

 example:

 valid_punctuation: -'
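
 The deletion step can be illustrated with a few lines of Python. This is a sketch of the transformation as described, not ht://Dig's word parser; strip_valid_punctuation is a hypothetical helper name.

```python
DEFAULT_VALID_PUNCTUATION = ".-_/!#$%^&'"


def strip_valid_punctuation(text, valid_punctuation=DEFAULT_VALID_PUNCTUATION):
    """Delete every valid_punctuation character from text (hypothetical helper)."""
    return text.translate({ord(ch): None for ch in valid_punctuation})


print(strip_valid_punctuation("Andrew's"))           # Andrews
print(strip_valid_punctuation("post-mortem", "-'"))  # postmortem
```

 Applying the same function to the search keywords mirrors the statement that the search engine performs the same transformation.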




version

 type:

string

 used by:

 htsearch

 default:

VERSION

 description:

 This specifies the value of the VERSION variable which can be used in search templates. The default value of this attribute is determined at compile time, and will not normally be set in configuration files.

 example:

 version: 3.1.2PL1






word_db

 type:

string

 used by:

 htdig, htmerge and htsearch

 default:

 ${database_base}.words.db

 description:

 This is the main word database. It is an index that maps each word to the list of documents that contain it. This database can grow large pretty quickly.

 example:

 word_db: ${database_base}.allwords.db




word_list

 type:

string

 used by:

 htdig and htmerge

 default:

 ${database_base}.wordlist

 description:

 This is the input file that htmerge uses to create the main words database specified by word_db. This file gets about as large as the main words database. If this file exists when htdig is running, htdig will append data to it. htmerge will then use the existing data and the appended data to create a completely new main word database.

 example:

 word_list: ${database_base}.allwords.text



Last modified: $Date: 2002/01/27 05:33:19 $