… the PostScript that acroread generated was very difficult to parse into indexable text. Also, the built-in PDF support expected PDF documents to use the same character encoding as is defined in your current locale, which isn't always the case. The external converters, which use pdftotext, were developed to overcome these problems. xpdf 0.90 is free software, and its pdftotext utility works very well as an indexing tool. It also converts various PDF encodings to the Latin 1 set. It is the opinion of the developers that this is the preferred method. However, some users still prefer to stick with acroread, as it works well for them, and is a little easier to set up if you've already installed Acrobat.

Also, pdftotext still has some difficulty handling text in landscape orientation, even with its new -raw option in 0.90, so if you need to index such text in PDFs, you may still get better results with acroread. The pdf_parser attribute has been removed from the 3.2 beta releases of htdig, so to use acroread with htdig 3.2.0b3 or other 3.2 betas, use the acroconv.pl external converter script from our web site.
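
If you go the pdftotext route, the hookup is normally done through the external_parsers attribute, using a converter script such as conv_doc.pl or doc2html.pl from the contrib area of the distribution. The path below is only illustrative; check the external_parsers attribute documentation for the exact syntax your version supports:

external_parsers: application/pdf->text/html /usr/local/htdig/bin/conv_doc.pl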

See also question 5.2 below and question 1.13 above.

4.10. How do I index documents in other languages?

The first and most important thing you must do, to allow ht://Dig to properly support international characters, is to define the correct locale for the language and country you wish to support. This is done by setting the locale attribute (see question 5.8). The next step is to configure ht://Dig to use dictionary and affix files for the language of your choice. These can be the same dictionary and affix files as are used by the ispell software. A collection of these is available from Geoff Kuenning's International Ispell Dictionaries page, and we're slowly building a collection of word lists on our web site.

For example, if you install German dictionaries in common/german, you could use these lines in your configuration file:

locale:               de_DE
lang_dir:             ${common_dir}/german
bad_word_list:        ${lang_dir}/bad_words
endings_affix_file:   ${lang_dir}/german.aff
endings_dictionary:   ${lang_dir}/german.0
endings_root2word_db: ${lang_dir}/root2word.db
endings_word2root_db: ${lang_dir}/word2root.db

You can build the endings database with htfuzzy endings. (For releases older than 3.1.2, this command may take days to complete; current releases use faster regular expression matching, which speeds this up by a few orders of magnitude.) Note that the "*.0" files are not part of the ispell dictionary distributions, but are easily made by concatenating the partial dictionaries and sorting to remove duplicates (e.g. "cat * | sort | uniq > lang.0" in most cases). You will also need to redefine the synonyms file if you wish to use the synonyms search algorithm. This file is not included with most of the dictionaries, nor is the bad_words file.
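
As a rough sketch, assuming the configuration above is saved as german.conf and the ispell dictionary was unpacked into a scratch directory (all paths here are illustrative):

cd /tmp/ispell-german                                   # wherever you unpacked the dictionary
cat * | sort | uniq > german.0                          # build the combined "*.0" word list
cp german.0 german.aff /usr/local/htdig/common/german   # put the files where lang_dir points
htfuzzy -c /usr/local/htdig/conf/german.conf endings    # builds root2word.db and word2root.db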

If you put all the language-specific dictionaries and configuration files in separate directories, and set all the attribute definitions accordingly in each search config file to access the appropriate files, you can have a multilingual setup where the user selects the language by selecting the "config" input parameter value. In addition to the attributes given in the example above, you may also want custom settings for these language-specific attributes: date_format, iso_8601, method_names, no_excerpt_text, no_next_page_text, no_prev_page_text, nothing_found_file, page_list_header, prev_page_text, search_results_wrapper (or search_results_header and search_results_footer), sort_names, synonym_db, synonym_dictionary, syntax_error_file, template_map, and of course database_dir or database_base if you maintain multiple databases for sites of different languages. You could also change the definition of common_dir, rather than making up a lang_dir attribute as above, as many language-specific files are defined relative to the common_dir setting.
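
As a sketch of such a setup (the config file names and languages here are just examples), each language gets its own configuration file in htsearch's config directory, and the search form lets the user pick one through the config input parameter:

<select name="config">
  <option value="htdig">English</option>
  <option value="german">Deutsch</option>
  <option value="french">Français</option>
</select>

htsearch then reads german.conf or french.conf instead of the default htdig.conf, and picks up all the language-specific attribute settings from there.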

Current versions of ht://Dig only support 8-bit characters, so languages such as Chinese and Japanese, which require 16-bit characters, are not currently supported.

Didier Lebrun has written a guide for configuring htdig to support French, entitled Comment installer et configurer HtDig pour la langue française (How to install and configure HtDig for the French language). His "kit de francisation" (French localization kit) is also available on our web site.

See also question 4.2 for tips on customizing htsearch, and question 4.6 for tips on where to find bad_words files.

4.11. How do I get rotating banner ads in search results?

While htsearch doesn't currently provide a means of processing server-side includes (SSI) in its output, or of calling other CGI scripts, it does have the capability of using environment variables in templates.

The easiest way to get rotating banners in htsearch is to replace htsearch with a wrapper script that sets an environment variable to the banner content, or whatever dynamically generated content you want. Your script can then call the real htsearch to do the work. The wrapper script can be written as a shell script, or in Perl, C, C++, or whatever you like. You'd then need to reference that environment variable in header.html (or wrapper.html if that's what you're using), to indicate where the dynamic content should be placed.
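
A minimal sketch of such a wrapper, written as a shell script, assuming the real binary has been renamed to htsearch.real and that header.html refers to the variable with the usual template variable syntax, e.g. $(BANNER_AD) (the names and paths here are made up):

#!/bin/sh
# Pick one of several pre-made banner snippets; the rotation scheme is up to you.
N=`date +%S`
N=`expr $N % 3`
BANNER_AD=`cat /usr/local/htdig/banners/banner$N.html`
export BANNER_AD
# Hand off to the real htsearch; stdin (POST data) and QUERY_STRING pass through untouched.
exec /usr/local/htdig/cgi-bin/htsearch.real "$@"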

If the dynamic content is generated by a CGI script, your new wrapper script which calls this CGI would then have to strip out the parts that you don't want embedded in the output (headers, some tags) so that only the relevant content gets put into the environment variable you want. You'd also have to make sure this CGI script doesn't grab the POST data or get confused by the QUERY_STRING contents intended for htsearch. Your script should not take anything out of, or add anything to, the QUERY_STRING environment variable.

An alternative approach is to have a cron job that periodically regenerates a different header.html or wrapper.html with the new banner ad, or changes a link to a different pre-generated header.html or wrapper.html file. For other alternatives, see question 4.7.

4.12. How do I index numbers in documents?

By default, htdig doesn't treat numbers without letters as words, so it doesn't index them. To change this behavior, you must set the allow_numbers attribute to true, and rebuild your index from scratch using rundig or htdig with the -i option, so that bare numbers get added to the index.
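
In the configuration file, that is simply:

# index bare numbers too; takes effect only after a full re-index (rundig or htdig -i)
allow_numbers: true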

4.13. How can I call htsearch from a hypertext link, rather than from a search form?

If you change the search.html form to use the GET method rather than POST, you can see the URLs complete with all the arguments that htsearch needs for a query. Here is an example:

http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&restrict=&exclude=&method=and&format=builtin-long&words=grapple+grommets

which, with the current defaults, can be simplified to:

http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets

The "&" character acts as a separator for the input parameters, while the "+" character acts as a space character within an input parameter. In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon character ";" as a parameter separator, rather than "&", for HTML 4.0 compliance. Most non-alphanumeric characters should be hex-encoded following the convention for URL encoding (e.g. "%" becomes "%25", "+" becomes "%2B", etc.). Any htsearch input parameter that you'd use in a search form can be added to the URL in this way. Such a URL can be embedded into an <a href="..."> tag.
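
For example, such a link might look like this (the host and words are taken from the example above; the ";" separator assumes version 3.1.5 or 3.2.0b2 or later):

<a href="http://www.grommetsRus.com/cgi-bin/htsearch?method=and;words=grapple+grommets">Search for grapple grommets</a>
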
See also question 5.21.

4.14. How do I restrict a search to only meta keywords entries in documents?

First of all, you do not do this by using the "keywords" field in the search form. This seems to be a frequent cause of confusion. The "keywords" input parameter to htsearch has absolutely nothing to do with searching meta keywords fields. It actually predates the addition of meta keyword support in 3.1.x. A better choice of name for the parameter would have been "requiredwords", because that's what it really means - a list of words that are all required to be found somewhere in the document, in addition to the words the user specifies in the search form.
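
For example, a search form that should require the word "grommets" in every match, on top of whatever the user types, might include a hidden field like this (purely illustrative):

<input type="hidden" name="keywords" value="grommets">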

To restrict a search to meta keywords only, you must set all factors other than keywords_factor to 0, and for 3.1.x, you must then reindex your documents. In the 3.2 betas, you can change factors at search time without needing to reindex. Future 3.2 releases will also offer the ability to restrict the search in the query itself. Note that changing the scoring factors in this way will only alter the scoring of search results, and shift the low or zero scores to the end of the results when sorting by score (as is done by default). The results with scores of zero aren't actually removed from the search results, although this will be done in future 3.2 releases.
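
A sketch of the relevant settings for 3.1.x (the set of *_factor attributes differs between versions, so check the attribute documentation for your release; the list below is not necessarily exhaustive):

text_factor:              0
title_factor:             0
heading_factor:           0
description_factor:       0
meta_description_factor:  0
keywords_factor:          10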

4.15. Can I use meta tags to prevent htdig from indexing certain files?

Yes, in each HTML file you want to exclude, add the following between the <HEAD> and </HEAD> tags:
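
For example, the standard robots META tag, which htdig recognizes:

<META NAME="robots" CONTENT="noindex">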