ht://Dig Frequently Asked Questions

Frequently Asked Questions



ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for license information.



This FAQ is compiled by the ht://Dig developers, and the most recent version is available at <http://www.htdig.org/FAQ.html>. Questions (and answers!) are greatly appreciated. Please send questions and/or answers to the ht://Dig user mailing list at <htdig-general@lists.sourceforge.net>.



Questions



1. General

1.1. Can I search the internet with ht://Dig?
1.2. Can I index the internet with ht://Dig?
1.3. What's the difference between htdig and ht://Dig?
1.4. I sent mail to Andrew or Geoff or Gilles, but I never got a response!
1.5. I sent a question to the mailing list but I never got a response!
1.6. I have a great idea/patch for ht://Dig!
1.7. Is ht://Dig Y2K compliant?
1.8. I think I found a bug. What should I do?
1.9. Does ht://Dig support phrase or near matching?
1.10. What are the practical and/or theoretical limits of ht://Dig?
1.11. Do any ISPs offer ht://Dig as part of their web hosting services?
1.12. Can I use ht://Dig on a commercial website?
1.13. Why do you use a non-free product to index PDF files?
1.14. Why do you have all those SourceForge logos on your website?
1.15. My question isn't answered here. Where should I go for help?
1.16. Why do the developers get annoyed when I e-mail questions directly to them rather than the mailing list?



2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?
2.2. Are there binary distributions of ht://Dig?
2.3. Are there mirror sites for ht://Dig?
2.4. Is ht://Dig available by ftp?
2.5. Are patches around to upgrade between versions?
2.6. Is there a Windows 95/98/2000/NT version of ht://Dig?
2.7. Where can I find the documentation for my version of ht://Dig?



3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a.
3.2. I get an error about -lg
3.3. I'm compiling on Digital Unix and I get messages about "unresolved" and "db_open."
3.4. I'm compiling on FreeBSD and I get lots of messages about '___error' being unresolved.
3.5. I'm compiling on HP/UX and I get a complaint about "Large Files not supported."
3.6. I'm compiling on Solaris and when I run the programs I get complaints about not finding libstdc++.
3.7. I'm compiling on IRIX and I'm having database problems when I run the program.



4. Configuration

4.1. How come I can't index my site?
4.2. How can I change the output format of htsearch?
4.3. How do I index pages that start with '~'?
4.4. Can I use multiple databases?
4.5. OK, I can use multiple databases. Can I merge them into one?
4.6. Wow, ht://Dig eats up a lot of disk space. How can I cut down?
4.7. Can I use SSI or other CGIs in my htsearch results?
4.8. How do I index Word, Excel, PowerPoint or PostScript documents?
4.9. How do I index PDF files?
4.10. How do I index documents in other languages?
4.11. How do I get rotating banner ads in search results?
4.12. How do I index numbers in documents?
4.13. How can I call htsearch from a hypertext link, rather than from a search form?
4.14. How do I restrict a search to only meta keywords entries in documents?
4.15. Can I use meta tags to prevent htdig from indexing certain files?
4.16. How do I get htsearch to use the star image in a different directory than the default /htdig?
4.17. How do I get htdig or htsearch to rewrite URLs in the search results?
4.18. What are all the options in htdig.conf, and are there others?
4.19. How do I get more than 10 pages of 10 search results from htsearch?
4.20. How do I restrict a search to only certain subdirectories or documents?
4.21. How can I allow people to search while the index is updating?
4.22. How can I get htdig to ignore the robots.txt file or meta robots tags?
4.23. How can I get htdig not to index some directories, but still follow links?
4.24. How can I get rid of duplicates in search results?



5. Troubleshooting

5.1. I can't seem to index more than X documents in a directory.
5.2. I can't index PDF files.
5.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.
5.4. When I run htmerge, it stops with an "out of diskspace" message.
5.5. I have problems running rundig from cron under Linux.
5.6. When I run htmerge, it stops with an "Unexpected file type" message.
5.7. When I run htsearch, I get lots of Internal Server Errors (#500).
5.8. I'm having problems with indexing words with accented characters.
5.9. When I run htmerge, it stops with a "Word sort failed" message.
5.10. When htsearch has a lot of matches, it runs extremely slowly.
5.11. When I run htsearch, it gives me a count of matches, but doesn't list the matching documents.
5.12. I can't seem to index documents with names like left_index.html with htdig.
5.13. I get Premature End of Script Headers errors when running htsearch.
5.14. I get Segmentation faults when running htdig, htsearch or htfuzzy.
5.15. Why does htdig 3.1.3 mangle URL parameters that contain bare "&" characters?
5.16. When I run htmerge, it stops with an "Unable to open word list file '.../db.wordlist'" message.
5.17. When using Netscape, htsearch always returns the "No match" page.
5.18. Why doesn't htdig follow links to other pages in JavaScript code?
5.19. When I run htsearch from the web server, it returns a bunch of binary data.
5.20. Why are the betas of 3.2 so slow at indexing?
5.21. Why does htsearch use ";" instead of "&" to separate URL parameters for the page buttons?
5.22. Why does htsearch show the "&" character as "&amp;" in search results?
5.23. I get Internal Server or Unrecognized character errors when running htsearch.
5.24. I took some settings out of my htdig.conf but they're still set.
5.25. When I run htdig on my site, it misses entire directories.
5.26. What do all the numbers and symbols in the htdig -v output mean?
5.27. Why is htdig rejecting some of the links in my documents?
5.28. When I run htdig or htmerge, I get a "DB2 problem...: missing or empty key value specified" message.
5.29. When I run htdig on my site, it seems to go on and on without ending.



Answers



1. General

1.1. Can I search the internet with ht://Dig?

No, ht://Dig is a system for indexing and searching a finite (not necessarily small) set of sites or an intranet. It is not meant to replace any of the many internet-wide search engines.

1.2. Can I index the internet with ht://Dig?

No, as above, ht://Dig is not meant as an internet-wide search engine. While there is theoretically nothing to stop you from indexing as much as you wish, practical considerations (e.g. time, disk space, memory, etc.) will limit this.

1.3. What's the difference between htdig and ht://Dig?

The complete ht://Dig package consists of several programs, one of which is called "htdig." This program performs the "digging" or indexing of the web pages. Of course an index doesn't do you much good without a program to sort it, search through it, etc.

1.4. I sent mail to Andrew or Geoff or Gilles, but I never got a response!

Andrew no longer does much work on ht://Dig. He has started a company, called Contigo Software, and is quite busy with that. To contact any of the current developers, send mail to <htdig-dev>. This list is intended primarily for the discussion of current and future development of the software.

Geoff and Gilles are currently the maintainers of ht://Dig, but they are both volunteers. So while they do read all the e-mail they receive, they may not respond immediately. Questions about ht://Dig in general, and especially questions or requests for help in configuring the software, should be posted to the <htdig-general> mailing list. When posting a followup to a message on the list, you should use the "reply to all" or "group reply" feature of your mail program, to make sure the mailing list address is included in the reply, rather than replying only to the author of the message. See also question 1.16 and the mailing list page.

1.5. I sent a question to the mailing list but I never got a response!

Development of ht://Dig is done by volunteers. Since we all have other jobs, it may take a while before someone gets back to you. Please be patient and don't hound the volunteers with direct or repeated requests. If you don't get a response after 3 or 4 days, then a reminder may help. See also question 1.16.

1.6. I have a great idea/patch for ht://Dig!

Great! Development of ht://Dig continues through suggestions and improvements from users. If you have an idea (or even better, a patch), please send it to the ht://Dig mailing list so others can use it. For suggestions on how to submit patches, please check the Guidelines for Patch Submissions. If you'd like to make a feature request, you can do so through the ht://Dig bug database.

1.7. Is ht://Dig Y2K compliant?

ht://Dig should be Y2K compliant since it never stores dates as two-digit years. Under ht://Dig's license (GPL), there is no warranty whatsoever, as permitted by law. If you would like an iron-clad, legally-binding guarantee, feel free to check the source code itself. Versions prior to 3.1.2 did have a problem with the parsing of the Last-Modified header returned by the HTTP server, which caused incorrect dates to be stored for documents modified after February 28, 2000 (yes, it didn't recognize 2000 as a leap year). Versions prior to 3.1.5 didn't correctly handle servers that return two-digit years in the Last-Modified header, for years after 99. These problems are fixed in the current release. If you discover something else, please let us know!
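The leap-year remark above can be made concrete with a short Python sketch. This is not ht://Dig's date-parsing code, which this FAQ does not show; the function names are hypothetical. It contrasts the full Gregorian rule with an incomplete rule that forgets the 400-year exception, which is one plausible way a parser could fail to treat 2000 as a leap year.

```python
def is_leap_year(year):
    """Full Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def is_leap_year_incomplete(year):
    """Hypothetical buggy rule that omits the 400-year exception."""
    return year % 4 == 0 and year % 100 != 0


def days_in_february(year, leap_rule=is_leap_year):
    """Number of days in February under the given leap-year rule."""
    return 29 if leap_rule(year) else 28


if __name__ == "__main__":
    # Compare both rules for a few years around the century boundary.
    for year in (1900, 1996, 2000, 2001):
        print(year, is_leap_year(year), is_leap_year_incomplete(year))
```

Under the incomplete rule, February 2000 has only 28 days, so any day count computed for a date after February 28, 2000 would be off by one.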

1.8. I think I found a bug. What should I do?

Well, there are probably bugs out there. You have two options for bug-reporting. You can either mail the ht://Dig mailing list at <htdig-general@lists.sourceforge.net> or, better yet, report it to the bug database, which ensures it won't become lost amongst all of the other mail on the list. Please try to include as much information as possible, including the version of ht://Dig, the OS, and anything else that might be helpful. Often, running the programs with one "-v" or more (e.g. "-vvv") gives useful debugging information. If you are unsure whether the problem is a bug or a configuration problem, you should discuss the problem on <htdig-general> (after carefully reading the FAQ and searching the mail archive and patch archive, of course) to sort out what it is. The mailing list has a wider audience, so you're more likely to get help with configuration problems there than by reporting them to the bug database.

Whether reporting problems to the bug database or mailing list, we cannot stress enough the importance of always indicating which version of ht://Dig you are running. There are still a lot of users, ISPs and software distributors using older versions, and there have been a lot of bug fixes and new features added in recent versions. Knowing which version you're running is absolutely essential in helping to find a solution. If you're unsure if your version is current, or what fixes and features have been added in more recent versions, please see the release notes. See also question 2.1.

1.9. Does ht://Dig support phrase or near matching?

Phrase searching has been added for the 3.2 release, which is currently in the beta phase (3.2.0b3 as of this writing). Near or proximity matching will probably be added in a future beta.

1.10. What are the practical and/or theoretical limits of ht://Dig?

The code itself doesn't put any real limit on the number of pages. There are several sites in the hundreds of thousands of pages. As for practical limits, it depends a lot on how many pages you plan on indexing. Some operating systems limit files to 2 GB in size, which can become a problem with a large database. There are also slightly different limits to each of the programs. Right now htmerge performs a sort on the words indexed. Most sort programs use a fair amount of RAM and temporary disk space as they assemble the sorted list. The htdig program stores a fair amount of information about the URLs it visits, in part to only index a page once. This takes a fair amount of RAM. With cheap RAM, it never hurts to throw more memory at indexing larger sites. In a pinch, swap will work, but it obviously really slows things down.

The 3.2 development code helps with many of these limitations. In particular, it generates the databases on the fly, which means you don't have to sort them before searching. Additionally, the new databases are compressed significantly, making them usually around 50% the size of those in previous versions.

1.11. Do any ISPs offer ht://Dig as part of their web hosting services?

Yes. A list of such ISPs is available on the ht://Dig web site.

1.12. Can I use ht://Dig on a commercial website?

Sure! The GNU GPL license has no restrictions on use. So you are free to use ht://Dig however you want on your website, personal files, etc. The license only restricts distribution. So if you're planning on a commercial software product that includes ht://Dig, you will have to provide source code including any modifications upon request.

1.13. Why do you use a non-free product to index PDF files?

We don't. You can use the "acroread" program to index PDF files, but this is no longer recommended. Initially this program was the only reliable way to extract data from PDF files. However, the xpdf package is a reliable, free software package for indexing and viewing PDF files. See question 4.9 for details on using xpdf to index PDF files. We do not advocate using acroread any longer because it is a proprietary product. Additionally, it is no longer reliable at extracting data.

1.14. Why do you have all those SourceForge logos on your website?

SourceForge is a new service for open source software. You can host your project on SourceForge servers and use many of their services like bug-tracking and the like. The ht://Dig project currently uses SourceForge for a mirror of the main website at htdig.sourceforge.net as well as a mirror of ht://Dig releases and contributed work.

1.15. My question isn't answered here. Where should I go for help?

Before you go anywhere else, think of other ways of phrasing your question. Many times people have questions that are very similar to other FAQ entries, and while we try to phrase the queries in the FAQ closely to the most common questions, we obviously can't get them all! The next place to check is the documentation itself. In particular, take a look at the list of configuration attributes, particularly the list by name and by program. There are a lot of them, but chances are there's something that might fit your needs. You should also take a close look at all of htsearch's documentation, especially the section "HTML form" which describes all the CGI input parameters available for controlling the search, including limiting the search to certain subdirectories. You can find the answer yourself to almost all "how can I..." questions by exploring what the various configuration attributes and search form input parameters can do.

Finally, if you've exhausted all the online documentation, there's the htdig-general mailing list. There are hundreds of users subscribed and chances are good that someone has had a similar problem before or can suggest a solution.

1.16. Why do the developers get annoyed when I e-mail questions directly to them rather than the mailing list?

The htdig-general mailing list exists for dealing with questions about the software, its installation, configuration, and problems with it. E-mailing the developers directly circumvents this forum and its benefits. Most annoyingly, it puts the onus on an individual to answer, even if that individual is not the best or most qualified person to answer. This is not a one-man show. It also circumvents the archiving mechanism of the mailing list, so not only do subscribers not see these private messages and replies, but future users who may run into the exact same problems won't see them. Remember that the developers are all volunteers, and they don't work for free for your benefit alone. They volunteer for the benefit of the whole ht://Dig user community, so don't expect extra support from them outside of that community. See also questions 1.4 and 1.5.


2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?

The latest version is 3.1.6 as of this writing. A beta version of the 3.2 code, 3.2.0b3, is also available for those who wish to test it. You can find out about the latest version by reading the release notes.

Note that if you're running any version older than 3.1.5 (including 3.2.0b1) on a public web site, you should upgrade immediately, as older versions have a rather serious security hole, which is explained in detail in an advisory that was sent to the Bugtraq mailing list. Another slightly less serious, but still troubling security hole exists in 3.1.5 and older (including 3.2.0b3 and older), so you should upgrade if you're running one of these. You can view details on this vulnerability from the Bugtraq mailing list.

2.2. Are there binary distributions of ht://Dig?

We're trying to get consistent binary distributions for popular platforms. Contributed binary releases will go in the contributed binaries section, and contributions should be mentioned to the htdig-general mailing list.

Anyone who would like to make consistent binary distributions of ht://Dig should at least sign up to the htdig-announce mailing list.

2.3. Are there mirror sites for ht://Dig?

Yes, see our mirrors listing. If you'd like to mirror the site, please see the mirroring guide.

2.4. Is ht://Dig available by ftp?

Yes. You can find the current versions and several older versions at various mirror sites as well as the other locations mentioned in the download page.

2.5. Are patches around to upgrade between versions?

Most versions are also distributed as a patch to the previous version's source code. The most recent exception to this was version 3.1.0b1. Since this version switched from the GDBM database to DB2, the new database package needed to be shipped with the distribution. This made the potential patch almost as large as the regular distribution. Update patches resumed with version 3.1.0b2. You can also find archives of patches submitted to the htdig mailing lists, to fix specific bugs or add features, at Joe Jah's htdig-patches ftp site.

2.6. Is there a Windows 95/98/2000/NT version of ht://Dig?

The ht://Dig package can be built on the Win32 platform when using the Cygwin package. For details, see the contributed guide, Idiot's Guide to Installing ht://Dig on Win32.

2.7. Where can I find the documentation for my version of ht://Dig?

The documentation for the most recent stable release is always posted at www.htdig.org. The documentation for the latest beta release can be found at http://www.htdig.org/dev/htdig-3.2/. In all releases, the documentation is included in the htdoc subdirectory of the source distribution, so you always have access to the documentation for your current version.


3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a

This usually indicates that either libstdc++ is not installed or is installed incorrectly. To get libstdc++ or any other GNU tool, check ftp://ftp.gnu.org/gnu/. Note that the most recent versions of gcc come with libstdc++ included and are available from http://gcc.gnu.org/

3.2. I get an error about -lg

This is due to a bug in the Makefile.config.in of version 3.1.0b1. Remove all flags "-ggdb" in Makefile.config.in. Then type "./config.status" to rebuild the Makefiles and recompile. This bug is fixed in version 3.1.0b2.

oF 3.3. I'm compiling on Digital Unix and I get= mesages about "unresolved" and "db_open."
t(

Answer contributed by George Adams- <learningapache@my-dejanews.com>

r@

What you're seeing are problems related to the Berkeley DB@ library. htdig needs a fairly modern version of db, which isB why it ships with one that works. (see that -L../db-2.4.14/dist2 line? That's where htdig's db library is).
< The solution is to modify the c++ command so it explicityA references the correct libdb.a . You can do this by replacingr/ the "-ldb" directive in the c++ command withpC "../db-2.4.14/dist/libdb.a" This problem has been resolved as ofs version 3.1.0.

3.4. I'm compiling on FreeBSD and I get lots of messages about '___error' being unresolved.

Answer contributed by Laura Wingerd <laura@perforce.com>

I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in db/dist/configure.

3.5. I'm compiling on HP/UX and I get a complaint about "Large Files not supported."

The db/ package included with ht://Dig seems to be unable to complete on HP/UX 10.20 in particular. After running the top-level configure script, cd into db/dist and type:

    ./configure --disable-bigfile

Then continue with the normal compilation.

3.6. I'm compiling on Solaris and when I run the programs I get complaints about not finding libstdc++.

Answer contributed by Adam Rice <adam@newsquest.co.uk>

The problem is that the Solaris loader can't find the library. The best thing to do is set the LD_RUN_PATH environment variable during compile to the directory where libstdc++.so.2.8.1.1 lives. This tells the linker to search that directory at runtime.

Note that LD_RUN_PATH is not to be confused with LD_LIBRARY_PATH. The latter is parsed at run-time, while LD_RUN_PATH essentially compiles a library path into the executable, so that it doesn't need a LD_LIBRARY_PATH setting to find its libraries. This allows you to avoid all the complexities of setting an environment variable for a CGI program run from the server.

3.7. I'm compiling on IRIX and I'm having database problems when I run the program.

It is not entirely clear why these problems occur, though they seem to only happen when older compilers are used. Several people have reported that the problems go away when using the latest version of gcc.


4. Configuration

4.1. How come I can't index my site?

There are a variety of reasons ht://Dig won't index a site. To get to the bottom of things, it's advisable to turn on some debugging output from the htdig program. When running from the command-line, try "-vvv" in addition to any other flags. This will add debugging output, including the responses from the server.

See also questions 5.25, 5.27, 5.16 and 5.18.

4.2. How can I change the output format of htsearch?

Answer contributed by: Malki Cymbalista <Malki.Cymbalista@weizmann.ac.il>

You can change the output format of htsearch by creating different header, footer and result files that specify how you want the output to look. You then create a configuration file that specifies which files to use. In the html document that links to the search, you specify which configuration file to use.

So the configuration file would have the lines:

    search_results_header: ${common_dir}/ccheader.html
    search_results_footer: ${common_dir}/ccfooter.html
    template_map:  Long long builtin-long \
                   Short short builtin-short \
                   Default default ${common_dir}/ccresult.html
    template_name: Default

You would also put into the configuration file any other lines from the default configuration file that apply to htsearch.

The files ${common_dir}/ccheader.html, ${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be tailored to give the output in the desired format.

Assuming your configuration file is called cc.conf, the html file that links to the search has to set the config parameter equal to cc. The following line would do it:

    <input type="hidden" name="config" value="cc">

Note: Don't just add the line above to your search form without checking whether there is already a similar line giving the config attribute a different value. The sample search.html form that comes with the package includes a line like this already, giving "config" the default value of "htdig". If it's there, modify it instead of adding another definition. The config input parameter doesn't need to be hidden either, and you may want to define it as a pull-down list to select different databases (see question 4.4).
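The same config parameter can also be passed in a plain GET URL instead of a form field. The following Python sketch shows one way to build such a URL. It assumes htsearch is installed at /cgi-bin/htsearch (your server may use a different path); build_search_url is a hypothetical helper name. The "config" and "words" names are htsearch input parameters.

```python
from urllib.parse import urlencode


def build_search_url(base_url, words, config="htdig"):
    """Return a GET URL that passes the query words and config name to htsearch."""
    # The config value names the configuration file without its .conf
    # extension, e.g. "cc" selects cc.conf.
    query = urlencode({"config": config, "words": words})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # "/cgi-bin/htsearch" is an assumed install location, not a fixed path.
    print(build_search_url("/cgi-bin/htsearch", "external parser", config="cc"))
    # prints /cgi-bin/htsearch?config=cc&words=external+parser
```

urlencode takes care of escaping spaces and other special characters in the search words, which is easy to get wrong when assembling such links by hand.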

4.3. How do I index pages that start with '~'?

ht://Dig should index pages starting with '~' as if it were another web browser. If you are having problems with this, check your server log files to see what file the server is attempting to return.

4.4. Can I use multiple databases?

Yes, though you may find it easier to have one larger database and use restrict or exclude fields on searches. To use multiple databases, you will need a config file for each database. Then each file will set the database_dir or database_base attribute to change the name of the databases. The config file is selected by the config input field in the search form.

See also questions 4.2 and 4.20.

4.5. OK, I can use multiple databases. Can I merge them into one?

As of version 3.1.0, you can do this with the -m option to htmerge.

4.6. Wow, ht://Dig eats up a lot of disk space. How can I cut down?

There are several ways to cut down on disk space. One is not to use the "-a" option, which creates work copies of the databases. Naturally this essentially doubles the disk usage. If you don't need to index and search at the same time, you can ignore this flag. Changing configuration variables can also help cut down on disk usage. Decreasing max_head_length and max_meta_description_length will cut down on the size of the excerpts stored (in fact, if you don't have use_meta_description set, you can set max_meta_description_length to 0!). Other techniques include removing the db.wordlist file and adding more words to the bad_words file.

The University of Leipzig has published word lists containing the 100, 1000 and 10000 most often used words in English, German, French and Dutch. No copyrights or restrictions seem to be applied to the downloadable files. These can be very handy when putting together a bad_words file. Thanks to Peter Asemann for this tip.

4.7. Can I use SSI or other CGIs in my htsearch results?

Not really. Apache will not parse CGI output for SSI statements (see the Apache FAQ). Thus, the htsearch CGI does not understand SSI markup and cannot include other CGIs. However, it is possible to do it the other way round: you can have the htsearch results included in your dynamic page.

The Apache project has mentioned that this will be a feature added to the Apache 2.0 version, currently in development.

The easiest approach in the meantime is using SSI with the help of the script_name configuration file attribute. See the contrib/scriptname/ directory for a small example using SSI.

For CGI and PHP, you need a "wrapper" script to do that. For perl script examples, see the files in contrib/ewswrap. The PHP guide (see contributed guides) not only describes a wrapper script for PHP, but also offers a step by step tutorial to the basics of ht://Dig and is well worth reading. For other alternatives, see question 4.11.

4.8. How do I index Word, Excel, PowerPoint or PostScript documents?

This must be done with an external parser or converter. A sample of such an external converter is the contrib/doc2html/doc2html.pl Perl script. It will parse Word, PostScript, PDF and other documents, when used with the appropriate document to text converters. It uses catdoc to parse Word documents, and ps2ascii to parse PostScript files. The comments in the Perl script and accompanying documentation indicate where you can obtain these converters.

Versions of htdig before 3.1.4 don't support external converters, so you have to use an external parser script such as contrib/parse_doc.pl (or better yet, upgrade htdig if you can). External converter scripts are simpler to write and maintain than a full external parser, as they just convert input documents to text/plain or text/html, and pass that back to htdig to be parsed. Parsing is more consistent across document types with external converters, because the final work is done by htdig's internal parsers. External parser scripts tend to be hacks that don't recognize a lot of the parsing attributes in your htdig.conf, so they have to be hacked some more when you change your attributes.

The most recent versions of parse_doc.pl, conv_doc.pl and the doc2html package are available on our web site. See question 4.9 below for an example using doc2html.pl.

4.9. How do I index PDF files?

This too can be done with an external parser or converter, in combination with the pdftotext program that is part of the xpdf 0.90 package. A sample of such a converter is the doc2html.pl Perl script, available on our web site.

For example, you could put this in your configuration file:

    external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
                      application/postscript->text/html /usr/local/bin/doc2html.pl \
                      application/pdf->text/html /usr/local/bin/doc2html.pl

You would also need to configure the script to indicate where all of the document to text converters are installed. See the DETAILS file that comes with doc2html for more information.

Versions of htdig before 3.1.4 don't support external converters, so you have to use an external parser script such as contrib/parse_doc.pl (or better yet, upgrade htdig if you can). See question 4.8 above.

Whether you use this external parser or converter, or acroread with the pdf_parser attribute, to successfully index PDF files be sure to set the max_doc_size attribute to a value larger than the size of your largest PDF file. PDF documents cannot be parsed if they are truncated.
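To pick a max_doc_size value, it helps to know how big your largest PDF actually is. The following Python sketch walks a local directory tree and reports the largest matching file. It assumes the documents being indexed are also reachable on the local filesystem (htdig itself fetches them over HTTP); the function name and the 10% headroom are illustrative choices, not part of ht://Dig.

```python
import os
import sys


def largest_file_size(root, suffix=".pdf"):
    """Return the size in bytes of the largest file under root whose name
    ends with suffix (case-insensitive), or 0 if there is none."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                size = os.path.getsize(os.path.join(dirpath, name))
                largest = max(largest, size)
    return largest


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    size = largest_file_size(root)
    # Suggest a value strictly larger than the largest PDF; the 10%
    # margin is an arbitrary assumption to allow for some growth.
    print(f"max_doc_size: {size + size // 10 + 1}")
```

Run it against your document root and copy the printed line into your configuration file, adjusting the margin as you see fit.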

This also raises the questions of why two different methods of indexing PDFs are supported, and which method is preferred. The built-in PDF support, which uses acroread to convert the PDF to PostScript, was the first method which was provided. It had a few problems with it: acroread is not open source, it is not supported on all systems on which ht://Dig can run, and for some PDFs, the PostScript that acroread generated was very difficult to parse into indexable text. Also, the built-in PDF support expected PDF documents to use the same character encoding as is defined in your current locale, which isn't always the case. The external converters, which use pdftotext, were developed to overcome these problems. xpdf 0.90 is free software, and its pdftotext utility works very well as an indexing tool. It also converts various PDF encodings to the Latin 1 set. It is the opinion of the developers that this is the preferred method. However, some users still prefer to stick with acroread, as it works well for them, and is a little easier to set up if you've already installed Acrobat.

Also, pdftotext still has some difficulty handling text in landscape orientation, even with its new -raw option in 0.90, so if you need to index such text in PDFs, you may still get better results with acroread. The pdf_parser attribute has been removed from the 3.2 beta releases of htdig, so to use acroread with htdig 3.2.0b3 or other 3.2 betas, use the acroconv.pl external converter script from our web site.

See also question 5.2 below and question 1.13 above.

4.10. How do I index documents in other languages?

The first and most important thing you must do, to allow ht://Dig to properly support international characters, is to define the correct locale for the language and country you wish to support. This is done by setting the locale attribute (see question 5.8). The next step is to configure ht://Dig to use dictionary and affix files for the language of your choice. These can be the same dictionary and affix files as are used by the ispell software. A collection of these is available from Geoff Kuenning's International Ispell Dictionaries page, and we're slowly building a collection of word lists on our web site.

For example, if you install German dictionaries in common/german, you could use these lines in your configuration file:

locale:               de_DE
lang_dir:             ${common_dir}/german
bad_word_list:        ${lang_dir}/bad_words
endings_affix_file:   ${lang_dir}/german.aff
endings_dictionary:   ${lang_dir}/german.0
endings_root2word_db: ${lang_dir}/root2word.db
endings_word2root_db: ${lang_dir}/word2root.db

You can build the endings database with htfuzzy endings. (This command may actually take days to complete, for releases older than 3.1.2. Current releases use faster regular expression matching, which will speed this up by a few orders of magnitude.) Note that the "*.0" files are not part of the ispell dictionary distributions, but are easily made by concatenating the partial dictionaries and sorting to remove duplicates (e.g.: "cat * | sort | uniq > lang.0" in most cases). You will also need to redefine the synonyms file if you wish to use the synonyms search algorithm. This file is not included with most of the dictionaries, nor is the bad_words file.
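For systems where the shell pipeline above is inconvenient, the same "concatenate, sort, remove duplicates" step can be sketched in Python. This is only an illustration, not part of ht://Dig: the function name build_word_list and its arguments are made up for the example, it assumes every regular file in the directory is a partial dictionary, it sorts in plain byte order (like sort under the C locale), and it drops empty lines.

```python
from pathlib import Path


def build_word_list(dict_dir, output_path):
    """Merge all partial dictionaries in dict_dir into one sorted,
    duplicate-free word list (roughly "cat * | sort | uniq").

    Files are handled as raw bytes, so 8-bit dictionary encodings
    pass through unchanged. Returns the number of unique entries.
    """
    words = set()
    for path in sorted(Path(dict_dir).iterdir()):
        if path.is_file():
            # Collect every non-empty line; the set removes duplicates.
            words.update(line for line in path.read_bytes().splitlines() if line)
    # Byte-order sort, one entry per line.
    Path(output_path).write_bytes(b"".join(word + b"\n" for word in sorted(words)))
    return len(words)
```

Write the output file outside the dictionary directory (or remove an old copy first), otherwise a previous "*.0" file is read back in as if it were a partial dictionary.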

If you put all the language-specific dictionaries and configuration files in separate directories, and set all the attribute definitions accordingly in each search config file to access the appropriate files, you can have a multilingual setup where the user selects the language by selecting the "config" input parameter value. In addition to the attributes given in the example above, you may also want custom settings for these language-specific attributes: date_format, iso_8601, method_names, no_excerpt_text, no_next_page_text, no_prev_page_text, nothing_found_file, page_list_header, prev_page_text, search_results_wrapper (or search_results_header and search_results_footer), sort_names, synonym_db, synonym_dictionary, syntax_error_file, template_map, and of course database_dir or database_base if you maintain multiple databases for sites of different languages. You could also change the definition of common_dir, rather than making up a lang_dir attribute as above, as many language-specific files are defined relative to the common_dir setting.

Current versions of ht://Dig only support 8-bit characters, so languages such as Chinese and Japanese, which require 16-bit characters, are not currently supported.

Didier Lebrun has written a guide for configuring htdig to support French, entitled Comment installer et configurer HtDig pour la langue française. His "kit de francisation" is also available on our web site.

See also question 4.2 for tips on customizing htsearch, and question 4.6 for tips on where to find bad_words files.

4.11. How do I get rotating banner ads in search results?

While htsearch doesn't currently provide a means of doing SSI on its output, or calling other CGI scripts, it does have the capability of using environment variables in templates.

The easiest way to get rotating banners in htsearch is to replace htsearch with a wrapper script that sets an environment variable to the banner content, or whatever dynamically generated content you want. Your script can then call the real htsearch to do the work. The wrapper script can be written as a shell script, or in Perl, C, C++, or whatever you like. You'd then need to reference that environment variable in header.html (or wrapper.html if that's what you're using), to indicate where the dynamic content should be placed.

If the dynamic content is generated by a CGI script, your new wrapper script which calls this CGI would then have to strip out the parts that you don't want embedded in the output (headers, some tags) so that only the relevant content gets put into the environment variable you want. You'd also have to make sure this CGI script doesn't grab the POST data or get confused by the QUERY_STRING contents intended for htsearch. Your script should not take anything out of, or add anything to, the QUERY_STRING environment variable.
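As a rough illustration of the wrapper approach described above, here is a minimal sketch in Python. Everything named in it is an assumption made for the example: the BANNER_AD variable name, the banner markup, the once-a-minute rotation, and the REAL_HTSEARCH path all need to be adapted to your installation. The wrapper leaves QUERY_STRING and standard input (the POST data) untouched, so htsearch sees the request exactly as the web server delivered it.

```python
import os
import time

# Hypothetical location of the real htsearch binary, ideally outside cgi-bin.
REAL_HTSEARCH = "/path/to/htsearch.real"

# Hypothetical banner snippets; any HTML fragment would do.
BANNERS = [
    '<a href="/ads/1"><img src="/ads/1.gif" alt="Ad 1"></a>',
    '<a href="/ads/2"><img src="/ads/2.gif" alt="Ad 2"></a>',
    '<a href="/ads/3"><img src="/ads/3.gif" alt="Ad 3"></a>',
]


def pick_banner(banners, tick):
    """Rotate through the banners, one per tick."""
    return banners[tick % len(banners)]


def main():
    # Make the banner available to the htsearch templates through an
    # environment variable (BANNER_AD is a name chosen for this sketch).
    os.environ["BANNER_AD"] = pick_banner(BANNERS, int(time.time()) // 60)
    # Replace this process with the real htsearch. The environment and
    # stdin are inherited unchanged apart from the variable set above.
    os.execv(REAL_HTSEARCH, [REAL_HTSEARCH])


# Web servers set GATEWAY_INTERFACE for CGI programs, so the wrapper
# only hands off to htsearch when it is actually run as a CGI.
if os.environ.get("GATEWAY_INTERFACE"):
    main()
```

The template then needs a reference to the chosen variable in header.html or wrapper.html; see the template documentation for the exact reference syntax.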

An alternative approach is to have a cron job that periodically regenerates a different header.html or wrapper.html with the new banner ad, or changes a link to a different pre-generated header.html or wrapper.html file. For other alternatives, see question 4.7.

4.12. How do I index numbers in documents?

By default, htdig doesn't treat numbers without letters as words, so it doesn't index them. To change this behavior, you must set the allow_numbers attribute to true, and rebuild your index from scratch using rundig or htdig with the -i option, so that bare numbers get added to the index.

4.13. How can I call htsearch from a hypertext link, rather than from a search form?

If you change the search.html form to use the GET method rather than POST, you can see the URLs complete with all the arguments that htsearch needs for a query. Here is an example:

http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&restrict=&exclude=&method=and&format=builtin-long&words=grapple+grommets

which can actually be simplified to:

http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets

with the current defaults. The "&" character acts as a separator for the input parameters, while the "+" character acts as a space character within an input parameter. In versions 3.1.5 or 3.2.0b2, or later, you can use a semicolon character ";" as a parameter separator, rather than "&", for HTML 4.0 compliance. Most non-alphanumeric characters should be hex-encoded following the convention for URL encoding (e.g. "%" becomes "%25", "+" becomes "%2B", etc). Any htsearch input parameter that you'd use in a search form can be added to the URL in this way. This can be embedded into an <a href="..."> tag.

See also question 5.21.
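If you generate such links from a script, the encoding rules above can be left to a library. The following Python sketch is only an illustration: the helper name htsearch_url is made up, and it relies on the standard urllib.parse.quote_plus function, which turns spaces into "+" and hex-encodes characters such as "%" and "+".

```python
from urllib.parse import quote_plus


def htsearch_url(base, params, separator="&"):
    """Build an htsearch query URL from a mapping of input parameters.

    Each name and value is URL-encoded; separator may be set to ";"
    for htsearch versions that accept it.
    """
    query = separator.join(
        f"{quote_plus(str(name))}={quote_plus(str(value))}"
        for name, value in params.items()
    )
    return f"{base}?{query}"


url = htsearch_url(
    "http://www.grommetsRus.com/cgi-bin/htsearch",
    {"method": "and", "words": "grapple grommets"},
)
# url == "http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets"
```

When the result is placed inside an href attribute with "&" as the separator, remember that HTML expects each "&" to be written as "&amp;", which is one reason the ";" separator is convenient.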

4.14. How do I restrict a search to only meta keywords entries in documents?

First of all, you do not do this by using the "keywords" field in the search form. This seems to be a frequent cause of confusion. The "keywords" input parameter to htsearch has absolutely nothing to do with searching meta keywords fields. It actually predates the addition of meta keyword support in 3.1.x. A better choice of name for the parameter would have been "requiredwords", because that's what it really means - a list of words that are all required to be found somewhere in the document, in addition to the words the user specifies in the search form.

To restrict a search to meta keywords only, you must set all factors other than keywords_factor to 0, and for 3.1.x, you must then reindex your documents. In the 3.2 betas, you can change factors at search time without needing to reindex. Future 3.2 releases will also offer the ability to restrict the search in the query itself. Note that changing the scoring factors in this way will only alter the scoring of search results, and shift the low or zero scores to the end of the results when sorting by score (as is done by default). The results with scores of zero aren't actually removed from the search results, although this will be done in future 3.2 releases.

4.15. Can I use meta tags to prevent htdig from indexing certain files?

Yes, in each HTML file you want to exclude, add the following between the <HEAD> and </HEAD> tags:

<META NAME="robots" CONTENT="noindex, follow">

Doing so will allow htdig to still follow links to other documents, but will prevent this document from being put into the index itself. You can also use "nofollow" to prevent following of links. See the section on Recognized META information for more details. For documents produced automatically by MhonArc, you can have that line inserted automatically by putting it in the MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.

You can also use the noindex_start and noindex_end attributes to define one set of tags which will mark sections to be stripped out of documents, so they don't get indexed, or you can mark sections with the non-DTD <noindex> and </noindex> tags.

4.16. How do I get htsearch to use the star image in a different directory than the default /htdig?

You must set either the image_url_prefix attribute, or both star_blank and star_image in your htdig.conf, to refer to the URL path for these files. You should also set this URL path similarly in common/header.html and common/wrapper.html, as they also refer to the star.gif file. If you want to relocate other graphics, such as the buttons or the ht://Dig logo, you should change all references to these in htdig.conf and common/*.html.

4.17. How do I get htdig or htsearch to rewrite URLs in the search results?

This can be done by using the url_part_aliases configuration file attribute. You have to set up different configuration files for htdig and htsearch, to define a different setting of this attribute for each one.

A large number of users insist on ignoring that last point and try to make do with just one definition, either for htdig or htsearch, or sometimes for both. This seems to stem from a fundamental misunderstanding of how this attribute works, so perhaps a clarification is needed. The url_part_aliases attribute uses a two stage process. In the first stage, htdig encodes the URLs as they go into the database, by using the pairs in url_part_aliases going from left to right. In the second stage, htsearch decodes the encoded URLs taken from the database, by using the pairs in url_part_aliases going from right to left. If you have the same value for url_part_aliases in htdig and htsearch, you end up with the same URLs in the end. If you modify the first string (the from string) in the pairs listed in url_part_aliases for htsearch, then when htsearch decodes the URLs it ends up rewriting part of them.
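The two stage process can be pictured with a small simulation. The Python sketch below is not how ht://Dig implements the attribute; it is a deliberately simplified model using plain string replacement, and the host names and the "*site" code are made up for the example. It only shows why the from string must differ between the htdig and htsearch configurations for any rewriting to happen.

```python
def encode_url(url, aliases):
    """Stage 1 (htdig): replace each 'from' string with its 'to' code."""
    for from_part, to_code in aliases:
        url = url.replace(from_part, to_code)
    return url


def decode_url(stored, aliases):
    """Stage 2 (htsearch): replace each 'to' code with its 'from' string."""
    for from_part, to_code in aliases:
        stored = stored.replace(to_code, from_part)
    return stored


# Hypothetical pairs: same code on the right, different 'from' strings.
htdig_aliases = [("http://internal.example.com/", "*site")]
htsearch_aliases = [("http://www.example.com/", "*site")]

stored = encode_url("http://internal.example.com/docs/a.html", htdig_aliases)
same = decode_url(stored, htdig_aliases)          # identical settings: URL unchanged
rewritten = decode_url(stored, htsearch_aliases)  # different 'from': URL rewritten
```

With identical pairs on both sides the decoded URL is the one that was indexed; with a different from string in the htsearch pairs, the stored code expands to the new prefix.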

While you might think that if you don't use url_part_aliases in htdig, then you can use it in htsearch to alter unencoded URLs, the reality is that if you don't encode parts of URLs using url_part_aliases, they still get encoded automatically by the common_url_parts attribute. This helps to reduce the size of your databases. So, trying to use url_part_aliases only in htsearch doesn't work because there are no unencoded URLs in the database, so the right hand strings in the pairs you define won't match anything.

You also can't put two different definitions of the url_part_aliases attribute in a single configuration file, as some users have attempted. When you define an attribute twice, the second definition merely overrides the first. Pay close attention to the description and examples for url_part_aliases. You must put one definition of this attribute in your configuration file for htdig, htmerge (or htpurge) and htnotify, and a different definition of it in your configuration file for htsearch.

4.18. What are all the options in htdig.conf, and are there others?

In ht://Dig's terminology, the settings in its configuration files are called configuration attributes, to distinguish them from command line options, CGI input parameters and template variables. There are many, many attributes that can be set to control almost all aspects of indexing, searching, customization of output and internationalization. All attributes have a built-in default setting, and only a subset of these appear in the sample htdig.conf file. See the documentation for all default values for attributes not overridden in the configuration file, and for help on using any of them. See also question 1.15.

4.19. How do I get more than 10 pages of 10 search results from htsearch?

There are two attributes that control the number of matches per page and the total number of pages. The number of matches per page can be set in your configuration file, using the matches_per_page attribute, or in your search form, using the matchesperpage input parameter.

The number of pages is controlled by the maximum_pages attribute in your search configuration file. The current default for maximum_pages is 10 because the ht://Dig package comes with 10 images, with numbers 1 through 10, for use as page list buttons. If we increased the limit, we'd have to field a whole lot more questions from users irate because only the first 10 buttons are graphics, and the rest are text. If you want more than 10 pages of results, change maximum_pages, but you may also want to set the page_number_text and no_page_number_text attributes in your search configuration file to nothing, or remove them, to use text rather than images for the links to other pages.

In versions of htsearch before 3.1.4, maximum_pages limited only the number of page list buttons, and not the actual number of pages. This was changed because there was no means of limiting the total number of pages, but this ended up frustrating users who wanted the ability to have more pages than buttons. In 3.2.0b3 we will introduce a maximum_page_buttons attribute for this purpose.

4.20. How do I restrict a search to only certain subdirectories or documents?

That depends on whether you want to protect certain parts of your site from prying eyes, or just limit the scope of search results to certain relevant areas. For the latter, you just need to set the restrict or exclude input parameter in the search form. This can be done using hidden input fields containing preset values, text input fields, select lists, radio buttons or checkboxes, as you see fit. If you use select lists, you can propagate the choices to select lists in the follow-up search forms using the build_select_lists configuration attribute. See also question 4.4.

If you wish to keep secure and non-secure areas on your site separate, and avoid having unauthorized users seeing documents from secure areas in their search results, that takes a bit more effort. You certainly can't rely on the restrict and exclude parameters, or even the config parameter, as any parameter in a search form can also be overridden by the user in a URL with CGI parameters. The safest option would be to host the secure and non-secure areas on separate servers with independent installations of htsearch, each with its own ht://Dig database, but that is often too costly or impractical an option. The next best thing is to host them on the same site, but make sure that everything is very clearly separated to prevent any leakage of secure data. You should maintain separate databases for the secure and public areas of your site, by setting up different htdig configuration files for each area. Use different settings of the start_url, limit_urls_to and database_dir configuration attributes, and possibly even different common_dir settings as well. Make sure your database_dir, and even your common_dir, are not in any directories accessible from the web server. Run htdig and htmerge (or rundig) with each separate configuration file, to build your two databases.

The tricky part is to make sure your htsearch program is secure. You don't want to use the same htsearch for the secure and public sites, because otherwise the public site could access the configuration for the secure database, making its data publicly accessible. You must either compile two separate versions of htsearch, with different settings of the CONFIG_DIR make variable, or you must make a simple wrapper script for htsearch that overrides the compiled-in CONFIG_DIR setting by a different setting of the CONFIG_DIR environment variable. Make sure the CONFIG_DIR for the secure area is not a subdirectory of the CONFIG_DIR for the public area. In this way, you can maintain separate directories of config files for the public and secure sites, so that the secure config files are not accessible from the public htsearch.
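The wrapper option mentioned above could look something like the following Python sketch. It is an illustration only: both paths are placeholders, and the helper name secure_environment is made up for the example. The one thing taken from the description above is that the wrapper sets the CONFIG_DIR environment variable before handing control to htsearch.

```python
import os

# Placeholder paths; substitute the locations used on your system.
SECURE_CONFIG_DIR = "/path/to/secure/conf"
REAL_HTSEARCH = "/path/to/htsearch"


def secure_environment(environ, config_dir):
    """Return a copy of environ with CONFIG_DIR pointing at config_dir.

    Everything else (QUERY_STRING, REQUEST_METHOD, ...) is passed
    through unchanged so htsearch still sees the original request.
    """
    env = dict(environ)
    env["CONFIG_DIR"] = config_dir
    return env


def main():
    # Replace this process with htsearch, using the adjusted environment.
    os.execve(
        REAL_HTSEARCH,
        [REAL_HTSEARCH],
        secure_environment(os.environ, SECURE_CONFIG_DIR),
    )


# Only hand off to htsearch when actually invoked by a web server as a CGI.
if os.environ.get("GATEWAY_INTERFACE"):
    main()
```

Because the wrapper always overwrites CONFIG_DIR, a value inherited from the web server's environment cannot redirect the secure htsearch to another configuration directory.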

Put the htsearch binary or wrapper script for the secure site in a different ScriptAlias'ed cgi-bin directory than the public one, and protect the secure cgi-bin with a .htaccess file or in your server configuration. Alternatively, you can put the secure program, let's call it htssearch, in the same cgi-bin, but protect that one CGI program in your server configuration, e.g.:

<Location /cgi-bin/htssearch>
AuthType Basic
AuthName ....
AuthUserFile ...
AuthGroupFile ...
<Limit GET POST>
require group foo
</Limit>
</Location>

This describes the setup for an Apache server. You'd need to work out an equivalent configuration for your server if you're not running Apache.

4.21. How can I allow people to search while the index is updating?

Answer contributed by Avi Rappoport <avirr@searchtools.com>

If you have enough disk space for two copies of the index database, use -a with the htdig and htmerge processes. This will make use of a copy of the index database with the extension ".work", and update the copy instead of the originals. This way, htsearch can use those originals while the update is going on. When it's done, you can move the .work versions to replace the originals, and htsearch will use them. The current rundig script will do this for you if you supply the -a flag to it. However, rundig builds the database from scratch each time you run it. If you want to update an alternate copy of the database, see the contributed rundig.sh script.
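If you run htdig and htmerge with -a yourself rather than through rundig, the final "move the .work versions into place" step might be scripted along these lines. This Python sketch is an assumption-laden illustration: the function name promote_work_files is invented, and it simply assumes that every file in the database directory ending in ".work" should replace the file of the same name without that extension.

```python
from pathlib import Path


def promote_work_files(database_dir):
    """Rename every "<name>.work" file in database_dir to "<name>",
    replacing the live copy. Returns the names that were promoted."""
    promoted = []
    for work in sorted(Path(database_dir).glob("*.work")):
        live = work.with_suffix("")  # drops only the trailing ".work"
        work.replace(live)           # overwrites the old live file
        promoted.append(live.name)
    return promoted
```

Run it only after both htdig and htmerge have finished successfully, so that a failed update never replaces a working database.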

4.22. How can I get htdig to ignore the robots.txt file or meta robots tags?

You can't, and you shouldn't. The Standard for Robot Exclusion exists for a very good reason, and any well behaved indexing engine or spider should conform to it. If you have a problem with a robots.txt file, you really should take it up with the site's webmaster. If they don't have a problem with you indexing their site, they shouldn't mind setting up a User-agent entry in their robots.txt file with a name you both agree on. The user agent setting that htdig uses for matching entries in robots.txt can be changed via the robotstxt_name attribute in your config file.

If you have a problem with a robots meta tag in a document (see question 4.15) you should take it up with the author or maintainer of that page. These tags are an all or nothing deal, as they can't be set up to allow some engines and disallow others. If htdig encounters them, it has to give the page's creator the benefit of the doubt and honour them. If exceptions to the rule are wanted, this should be done with a robots.txt file rather than a meta tag.

4.23. How can I get htdig not to index some directories, but still follow links?

You can simply add the directory name to your robots.txt file or to the exclude_urls attribute [...] (see question 4.15) to prevent indexing, and will contain links to all your files in this directory. The drawback of this is that you must maintain the index.html file yourself, as it won't be automatically updated as new files are added to the directory.

The other technique you can use, if you want the directory index to be made by the web server, is to get the server to insert the robots meta tag into the index page it generates. In Apache, this is done using the HeaderName and IndexOptions directives in the directory's .htaccess file. For example:

   HeaderName .htrobots
   IndexOptions FancyIndexing SuppressHTMLPreamble

and in the .htrobots file:

<HTML><head>
<META NAME="robots" CONTENT="noindex, follow">
<title>Index of /this/dir</title>
</head>

If you don't mind getting just one copy of each directory, but want to suppress the multiple copies generated by Apache's FancyIndexing option, you can either turn off FancyIndexing or you can add "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to the bad_querystr attribute (without the quotes) to suppress the alternately sorted views of the directory.

4.24. How can I get rid of duplicates in search results?

This depends on the cause of the duplicate documents. htdig does keep track of the URLs it visits, so it never puts the same URL more than once in the database. So, if you have duplicate documents in your search results, it's because the same document appears under different URLs. Sometimes the URLs vary only slightly, and in subtle ways, so you may have to look hard to find out what the variation is. Here are some common reasons, each requiring a different solution.


5. Troubleshooting

5.1. I can't seem to index more than X documents in a directory.

This usually has to do with the default document size limit. If you set max_doc_size in your config file to something large enough to read in the directory index (try 100000 for 100K) this should fix this problem. Of course this will require more memory to read the larger file. Don't set it to a value larger than the amount of memory you have, and never more than about 2 billion, the maximum value of a 32-bit integer. If htdig is missing entire directories, see question 5.25.

5.2. I can't index PDF files.

As above, this usually has to do with the default document size. What happens is ht://Dig will read in part of a PDF file and try to index it. This usually fails. Try setting max_doc_size in your config file to a larger value than the size of your largest PDF file. Don't go overboard, though, as you don't want to overflow a 32-bit integer (about 2 billion), and you don't want to allocate much more memory than you need to store the largest document.
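To pick a sensible value, it helps to know how big your largest PDF actually is. The Python sketch below is one way to find out, under the assumption that the documents htdig fetches are also reachable as files on a local disk; the function names, the 25% headroom and the default ceiling are choices made for this example, not ht://Dig settings.

```python
import os


def largest_file_size(root, suffix=".pdf"):
    """Return the size in bytes of the largest file under root whose
    name ends with suffix (case-insensitive), or 0 if there is none."""
    largest = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                largest = max(largest, os.path.getsize(os.path.join(dirpath, name)))
    return largest


def suggest_max_doc_size(root, headroom=1.25, ceiling=2**31 - 1):
    """Suggest a max_doc_size: the largest PDF plus some headroom,
    never exceeding the 32-bit limit mentioned above."""
    return min(int(largest_file_size(root) * headroom), ceiling)
```

The result is only a starting point; check it against the memory available on the indexing machine before putting it in your config file.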

There is a bug in Adobe Acrobat Reader version 4, in its handling of the -pairs option, which causes a segmentation violation when using it with htdig 3.1.2 or earlier. There is a workaround for this as of version 3.1.3 - you must remove the -pairs option from your pdf_parser definition, if it's there. However, acroread version 4 is still very unstable (on Linux, anyway) so it is not recommended as a PDF parser. An alternative is to use an external converter with the xpdf 0.90 package installed on your system, as described in question 4.9 above.

5.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.

This is due to a bug in the Makefile.in file in version 3.1.0b1. The easiest fix is to edit the rundig file and change the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory with a large amount of temporary disk space for htmerge. This bug is fixed in version 3.1.0b2.

5.4. When I run htmerge, it stops with an "out of diskspace" message.

This means that htmerge has run out of temporary disk space for sorting. Either in your "rundig" script (if you run htmerge through that) or before you run htmerge, set the variable TMPDIR to a temp directory with lots of space.

5.5. I have problems running rundig from cron under Linux.

This problem commonly occurs on Red Hat Linux 5.0 and 5.1, because of a bug in vixie-cron. It causes htmerge to fail with a "Word sort failed" error. It's fixed in Red Hat 5.2. You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2 distribution to fix the problem on 5.0 or 5.1. A quick fix for the problem is to change the first line of rundig to "#!/bin/ash" which will run the script through the ash shell, but this doesn't solve the underlying problem.

5.6. When I run htmerge, it stops with an "Unexpected file type" message.

Often this is because the databases are corrupt. Try removing them and rebuilding. If this doesn't work, some have found that the solution for question 3.2 works for this as well. This should be fixed in version 3.1.0b2.

5.7. When I run htsearch, I get lots of Internal Server Errors (#500).

If you are running under Solaris, see question 3.6. See also question 5.13.

5.8. I'm having problems with indexing words with accented characters.

Most of the time, this is caused by either not setting or incorrectly setting the locale attribute. The default locale for most systems is the "portable" locale, which strips everything down to standard ASCII. Most systems expect something like locale: en_US or locale: fr_FR. Locale files are often found in /usr/share/locale or the $LANGUAGE environment variable. See also question 4.10.

5.9. When I run htmerge, it stops with a "Word sort failed" message.

There are three common causes of this. First of all, the sort program may be running out of temporary file space. Fix this by freeing up some space where sort puts its temporary files, or change the setting of the TMPDIR environment variable to a directory on a volume with more space. A second common problem is on systems with a BSD version of the sort program (such as FreeBSD or NetBSD). This program uses the -T option as a record separator rather than an alternate temporary directory. On these systems, you must remove the TMPDIR environment variable from rundig, or change the code in htmerge/words.cc not to use the -T option. A third cause is the cron program on Red Hat Linux 5.0 or 5.1. (See question 5.5 above.)

5.10. When htsearch has a lot of matches, it runs extremely slowly.

When you run htsearch with no customization, on a large database, and it gets a lot of hits, it tends to take a long time to process those hits. Some users with large databases have reported much higher performance, for searches that yield lots of hits, by setting the backlink_factor attribute in htdig.conf to 0, and sorting by score. The scores calculated this way aren't quite as good, but htsearch can process hits much faster when it doesn't need to look up the db.docdb record for each hit, just to get the backlink count, date or title, either for scoring or for sorting. This affects versions 3.1.0b3 and up. In version 3.2, currently under development, the databases will be structured differently, so it should perform searches more quickly.

5.11. When I run htsearch, it gives me a count of matches, but doesn't list the matching documents.

This most commonly happens when you run htsearch while the database is currently being rebuilt or updated by htdig. If htdig and htmerge have run to completion, and the problem still occurs, this is usually an indication of a corrupted database. If it's finding matches, it's because it found the matching words in db.words.db. However, it isn't finding the document records themselves in db.docdb, which would suggest that either db.docdb, or db.docs.index (which maps document IDs used in db.words.db to URLs used to look up records in db.docdb), is incomplete or messed up. You'll likely need to rebuild your database from scratch if it's corrupted. Older versions of ht://Dig were susceptible to database corruption of this sort. Versions 3.1.2 and later are much more stable.

Another possible cause of this problem is unreadable result template files. If you define external template files via the template_map attribute, rather than using the builtin-short or builtin-long templates, and the file names are incorrect or the files do not have read permission for the user ID under which htsearch runs, then htsearch won't be able to display the results. Also, all directories leading up to these template files must be searchable (i.e. executable) by htsearch, or it won't be able to open the files.

5.12. I can't seem to index documents with names like left_index.html with htdig.

There is a bug in the implementation of the remove_default_doc attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes it to match more than it should. The default value for this attribute is "index.html", so any URL in which the filename ends with this string (rather than matches it entirely) will have the filename stripped off. This is fixed in version 3.1.3.

5.13. I get Premature End of Script Headers errors when running htsearch.

This happens when htsearch dies before putting out a "Content-Type" header. If you are running Apache under Solaris, first try the solution described in question 3.6. If that doesn't work, or you're running on another system, try running "htsearch -vvv" directly from the command line to see where and why it's failing. It should prompt you for the search words, as well as the format. If it works from the command line, but not from the web server, it's almost certainly a web server configuration problem. Check your web server's error log for any information related to htsearch's failure. One increasingly common problem is Apache configurations which expect all CGI scripts to be Perl, rather than binary executables or other scripts, so they use "perl-handler" rather than "cgi-handler". See also questions 5.7, 5.14 and 5.23.

5.14. I get Segmentation faults when running htdig, htsearch or htfuzzy.

Despite a great deal of debugging of these programs, we haven't been able to completely eliminate all such problems on all platforms. If you're running htsearch or htfuzzy on a BSDI system, a common cause of core dumps is a conflict between the GNU regex code bundled in htdig 3.1.2 and later, and the BSD C or C++ library. The solution is to use the BSD library's own rx code instead, using version 3.1.6 or newer as summarized by Joe Jah:

This solution may work on some other platforms as well (we haven't heard one way or the other), but will definitely not work on some platforms. For instance, on libc5-based Linux systems, the bundled regex code works fine by default, but using libc5's regex code causes core dumps.

Users of Cobalt Raq or Qube servers have complained of segmentation faults in htdig. Apparently this is due to problems in their C++ libraries, which are fixed in their experimental compiler and libraries. The following commands should install the packages you need:

rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-2.8.0-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm
rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm

You may have to remove the libg++ package, if you have it installed, before installing libstdc++, because of conflicts in these packages. Be sure to do a "make clean" before a "make", to remove any object files compiled with the old compiler and headers.

For other causes of segmentation faults, or in other programs, getting a stack backtrace after the fault can be useful in narrowing down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core", then enter the command "bt". You can also try running the program directly under the debugger, rather than attempting a post-mortem analysis of the core dump. Options to the program can be given on gdb's "run" command, and after the program is suspended on fault, you can use the "bt" command. This may give you enough information to find and fix the problem yourself, or at least it may help others on the htdig mailing list to point out what to do next.

5.15. Why does htdig 3.1.3 mangle URL parameters that contain bare "&" characters?

This is a known bug in 3.1.3, and is fixed with this patch. You can apply the patch by changing into the main source directory for htdig-3.1.3, and using the command "patch -p0 < /path/to/HTML.cc.0". This is also fixed as of version 3.1.4.

5.16. When I run htmerge, it stops with an "Unable to open word list file '.../db.wordlist'" message.

The most common cause of this error is that htdig did not manage to index any documents, and so it did not create a word list. You should repeat the htdig or rundig command with the -vvv option to see where and why it is failing. See question 4.1.

5.17. When using Netscape, htsearch always returns the "No match" page.

Check your search form. Chances are there is a hidden input field with no value defined. For example, one user had
<input type=hidden name=restrict>
in his search form, instead of
<input type=hidden name=restrict value="">
The problem is that Netscape sets the missing value to a default of "  " (two spaces), rather than an empty string. For the restrict parameter, this is a problem, because htsearch is unlikely to find any URLs with two spaces in them. Other input parameters may similarly pose a problem.


Another possibility, if you're running 3.2.0b1 or 3.2.0b2, is that you need to make the db.words.db_weakcmpr file writeable by the user ID under which the web server runs. This is a bug, and will be fixed in the next beta.

5.18. Why doesn't htdig follow links to other pages in JavaScript code?

There probably isn't any indexing tool in existence that follows JavaScript links, because such tools don't know how to initiate JavaScript events. Realistically, it would take a full JavaScript parser to figure out all the possible URLs that the code could generate, something that's way beyond the means of any search engine. You have a few options:

5.19. When I run htsearch from the web server, it returns a bunch of binary data.

Your server is returning the contents of the htsearch binary. Common causes of this are:


By default, Apache is usually configured with one cgi-bin directory as ScriptAlias, so all your CGI programs must go in there, or have a .cgi suffix on them. Your configuration may differ, however.

5.20. Why are the betas of 3.2 so slow at indexing?

As the release notes for these versions suggest, they are somewhat unoptimized and are made available for testing. Since the 3.2 code indexes all locations of words to support phrase searching and other advanced methods, this additional data slows down the indexer. To compensate, the code has a cache configured by the wordlist_cache_size attribute. As of this writing, the word database code will slow down considerably when the cache fills up. Setting the cache as large as possible provides a considerable performance improvement. Development is in progress to improve cache performance.

5.21. Why does htsearch use ";" instead of "&" to separate URL parameters for the page buttons?

In versions 3.1.5 and 3.2.0b2, and later, htsearch was changed to use a semicolon character ";" as a parameter separator for page button URLs, rather than "&", for HTML 4.0 compliance. It now allows both the "&" and the ";" as separators for input parameters, because the CGI specification still uses the "&". This change may cause some PHP or CGI wrapper scripts to stop working, but these scripts should be similarly changed to recognize both separator characters. For the definitive reference on this issue, please refer to section B.2.2 of W3C's HTML 4.0 Specification, Ampersands in URI attribute values. We're all a little tired of arguing about it. If you don't like the standard, you can change the Display::createURL() code yourself to ignore it.
See also question 4.13.
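
For wrapper scripts that need to recognize both separators, as described in the answer to 5.21, the splitting step might look like the following minimal Python sketch. The function name split_query is hypothetical and this is not htsearch's own parsing code; it only illustrates treating "&" and ";" alike.

```python
import re
from urllib.parse import unquote_plus


def split_query(query_string):
    """Split a CGI query string on either '&' or ';' into (name, value) pairs."""
    pairs = []
    for field in re.split(r"[&;]", query_string):
        if not field:
            continue  # skip empty fields, e.g. from a trailing separator
        name, _, value = field.partition("=")
        pairs.append((unquote_plus(name), unquote_plus(value)))
    return pairs


# A query string mixing both separators, as a page-button URL might after the change.
print(split_query("words=foo+bar;page=2&config=htdig"))
```

A script that already splits only on "&" usually needs nothing more than this change to the separator pattern.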

5.22. Why does htsearch show the "&" character as "&amp;" in search results?

In version 3.1.5, htsearch was fixed to properly re-encode the characters &, <, >, and " into SGML entities. However, the default value for the translate_amp, translate_lt_gt and translate_quot attributes is still false, so these entities don't get converted by htdig. If you set these three attributes to true in your htdig.conf and reindex, the problem will go away.


In the 3.2 betas there was a bug in the HTML parser that caused it to fail when attempting to translate the "&amp;" entity. This has been fixed in 3.2.0b3. The translate_* attributes are gone as of 3.2.0b2.

5.23. I get Internal Server or Unrecognized character errors when running htsearch.

An increasingly common problem is Apache configurations which expect all CGI scripts to be Perl, rather than binary executables or other scripts, so they use "perl-handler" rather than "cgi-handler". The fix is to create a separate directory for non-Perl CGI scripts, and define it as such in your httpd.conf file. You should define it the same way as your existing cgi-bin directory, but use "cgi-handler" instead of "perl-handler". In any case, you should check your web server's error log for any information related to htsearch's failure.
See also questions 5.7, 5.14 and 5.13.

5.24. I took some settings out of my htdig.conf but they're still set.

All configuration file attributes have compiled-in default values. Taking an attribute out of the file is not the same thing as setting it to an empty string, a 0, or a value of false. See question 4.18.
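
The difference between removing an attribute and emptying it can be modelled in a few lines of Python. This is only an illustrative sketch, not ht://Dig's configuration code: the attribute names example_list and example_flag and their default values are made up for the example.

```python
from collections import ChainMap

# Hypothetical compiled-in defaults; the real ones live in the htdig source.
COMPILED_DEFAULTS = {"example_list": "/cgi-bin/ .cgi", "example_flag": "true"}


def effective_config(file_settings):
    """Overlay settings read from the config file on the compiled-in defaults."""
    return ChainMap(file_settings, COMPILED_DEFAULTS)


# Attribute removed from the file: the compiled-in default applies again.
removed = effective_config({})
# Attribute explicitly set to an empty string: the default is overridden.
emptied = effective_config({"example_list": ""})

print(repr(removed["example_list"]))
print(repr(emptied["example_list"]))
```

Under this model, the only way to switch off a default is to keep the attribute in the file with an explicit empty, 0 or false value.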

5.25. When I run htdig on my site, it misses entire directories.

First of all, htdig doesn't look at directories itself. It is a spider, and it follows hypertext links in HTML documents. If htdig seems to be missing some documents or entire directory sub-trees of your site, it is most likely because there are no HTML links to these documents or directories. (See also question 5.18.) If htdig does not come across at least one hypertext link to a document or directory, and it's not explicitly listed in the start_url attribute, then this document or directory is essentially hidden from view to htdig, or to any web browser or spider for that matter. You can only get htdig to index directories, without providing your own files with links to the contents of these directories, by using your web server's automatic index generation feature. In Apache, this is done with the mod_autoindex module, which is usually compiled-in by default, and is enabled with the "Indexes" option for a given directory hierarchy. For example, you can put these directives in your Apache configuration:

<Directory "/path/to/your/document/root">
    Options Indexes FollowSymLinks Includes ExecCGI
</Directory>

This will cause Apache to automatically generate an index for any directory that does not have an index.html or other "DirectoryIndex" file in it. Other web servers will have similar features, which you should look for in your server documentation.


As an alternative to relying on the web server's autoindex feature, you can compose a list of all the unreachable documents, or write a program to do so, and feed that list as part of htdig's start_url attribute. Here is an example of a simple shell script to make a file of URLs you can use with a configuration entry like start_url: `/path/to/your/file`:


find /path/to/your/document/root -type f -name \*.html -print | \
    sed -e 's|/path/to/your/document/root/|http://www.yourdomain.com/|' > \
        /path/to/your/file

Other reasons why htdig might be missing portions of your site are that they fall outside the bounds specified by the limit_urls_to attribute (which takes on the value of start_url by default), they are explicitly excluded using the exclude_urls attribute, or they are disallowed by a robots.txt file (see the htdig documentation for notes about robot exclusion) or by a robots meta tag (see question 4.15). If htdig seems to be missing the last part of a large directory or document, see question 5.1. For reasons why htdig may be rejecting some links to parts of your site, see question 5.27.

5.26. What do all the numbers and symbols in the htdig -v output mean?

Output from htdig -v typically looks like this:


The first number is the number of documents parsed so far, the second is the DocID for this document, and the third is the hop count of the document (number of hops from one of the start_url documents). After the URL, it shows a "*" for a link in the document that it already visited (or at least queued for retrieval), a "+" for a new link it just queued, and a "-" for a link it rejected for any of a number of reasons. To find out what those reasons are, you need to run htdig with at least 3 "v" options, i.e. -vvv. If there are no "*", "+" or "-" symbols after the URL, it doesn't mean the document was not parsed or was empty, but only that no links to other documents were found within it.
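
To summarize the link symbols described in the answer to 5.26, a small Python helper along these lines could be used. It is a sketch that takes only the run of symbols following the URL, so it makes no assumption about the rest of the line layout; the function name tally_links is hypothetical.

```python
from collections import Counter

# Meaning of each symbol, as described in the answer to 5.26.
SYMBOL_MEANINGS = {
    "*": "already visited or queued",
    "+": "newly queued",
    "-": "rejected",
}


def tally_links(symbols):
    """Count the link symbols that follow the URL on one htdig -v output line."""
    counts = Counter(ch for ch in symbols if ch in SYMBOL_MEANINGS)
    return {SYMBOL_MEANINGS[ch]: counts[ch] for ch in SYMBOL_MEANINGS}


# A made-up run of symbols, only to exercise the function.
print(tally_links("**+-*+"))
```

A document with many "-" symbols is a good candidate for a second run with -vvv to see why its links were rejected.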

5.27. Why is htdig rejecting some of the links in my documents?

When htdig parses documents and finds hypertext links to other documents (hrefs), it may reject them for any of several reasons. To find out what those reasons are, you need to run htdig with at least 3 "v" options, i.e. -vvv. Here are the meanings of some of the messages you might see at this verbosity level.


Not an http or relative link!
In versions 3.1.5 and earlier, only "http://" URLs, or URLs relative to those, are allowed.

Item in the exclude list: item # n
A substring of the URL matches one of the items in the exclude_urls attribute. The given item number will indicate which pattern matched, starting at 1. The 3.2.0 betas do not give the item number.

Extension is invalid!
The file name extension or suffix matches one of those listed in the bad_extensions attribute.

Extension is not valid!
The file name extension or suffix does not match one of those listed in the valid_extensions attribute, if any are specified.

Invalid Querystring! or item in bad query list
The URL contains a query string which matches one of those listed in the bad_querystr attribute.

URL not in the limits!
No substring of the URL entirely matches one of the items in the limit_urls_to attribute. The purpose of this attribute is to keep htdig from attempting to index the entire World Wide Web.

forbidden by server robots.txt!
A substring of the URL matches one of the items disallowed in the server's robots.txt file. See A Standard for Robot Exclusion. This message exists only in the 3.2.0 betas. In 3.1.5 and earlier, this condition is only caught later, resulting in the message "robots.txt: discarding 'URL'" from htdig, and a later "Deleted: no excerpt" message from htmerge.

url rejected: (level 2)
No substring of the URL entirely matches one of the items in the limit_normalized attribute. All the other rejections above will be indicated as level 1. The 3.2.0 betas give the much more meaningful message 'not in "limit_normalized" list!'

See also question 5.25.
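
When the -vvv output is long, it can help to count how often each rejection reason from the answer to 5.27 occurs. The following Python sketch does this by looking for the message text anywhere in a line. The exact layout of the log lines is an assumption, the sample lines are made up, and the function name tally_rejections is hypothetical.

```python
from collections import Counter

# Message fragments taken from the answer to 5.27; exact log formatting may differ.
REJECTION_MESSAGES = [
    "Not an http or relative link!",
    "Item in the exclude list",
    "Extension is invalid!",
    "Extension is not valid!",
    "Invalid Querystring!",
    "item in bad query list",
    "URL not in the limits!",
    "forbidden by server robots.txt!",
    "url rejected: (level 2)",
]


def tally_rejections(lines):
    """Count how often each known rejection message appears in the given lines."""
    counts = Counter()
    for line in lines:
        for message in REJECTION_MESSAGES:
            if message in line:
                counts[message] += 1
                break  # one reason per line is enough for a summary
    return counts


# Made-up sample lines, only to exercise the function.
sample = [
    "   Rejected: URL not in the limits!",
    "   Rejected: Extension is invalid!",
    "   Rejected: URL not in the limits!",
]
print(tally_rejections(sample).most_common())
```

The most frequent message usually points at the attribute to review first, for example limit_urls_to or exclude_urls.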

5.28. When I run htdig or htmerge, I get a "DB2 problem...: missing or empty key value specified" message.

The most common cause of this error is that htdig or htmerge rejected any documents that had been put in the database, leaving an empty database. You need to find out the reasons for the rejection of these documents. See questions 4.1, 5.25 and 5.27.

5.29. When I run htdig on my site, it seems to go on and on without ending.

There are some things that can cause htdig to run on without ending, especially when indexing dynamic content (ASP, PHP, SSI or CGI pages). This usually involves htdig getting caught in an infinite virtual hierarchy. A sure sign of this is if the current size of your database is much larger than the total size of the site you are indexing, or if in the verbose output of htdig (see question 4.1) you see the same URLs come up again and again with only subtle variations. In any case, you must figure out the reason htdig keeps revisiting the same documents using different URLs, as explained in question 4.24, and set your exclude_urls and bad_querystr attributes appropriately to stop htdig from going down those paths.
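
One way to spot URLs that recur with only subtle variations, as described in the answer to 5.29, is to group the URLs from the verbose output by everything except the query string. This Python sketch assumes you have already extracted the URLs into a list. The function name find_repeating_urls, the threshold and the www.example.com URLs are hypothetical, and the sketch only catches variation in the query string, not in the path.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def find_repeating_urls(urls, threshold=3):
    """Group URLs by scheme, host and path, ignoring the query string, and
    return the groups seen with at least `threshold` distinct query strings."""
    variants = defaultdict(set)
    for url in urls:
        parts = urlsplit(url)
        key = f"{parts.scheme}://{parts.netloc}{parts.path}"
        variants[key].add(parts.query)
    return {key: len(queries) for key, queries in variants.items()
            if len(queries) >= threshold}


# Made-up URLs, only to exercise the function.
seen = [
    "http://www.example.com/cal.php?month=1",
    "http://www.example.com/cal.php?month=2",
    "http://www.example.com/cal.php?month=3",
    "http://www.example.com/index.html",
]
print(find_repeating_urls(seen))
```

Pages flagged this way are candidates for a bad_querystr or exclude_urls entry.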

Last modified: $Date: 2002/01/31 17:45:36 $