
# Features and System requirements



ht://Dig Copyright © 1995-2002 The ht://Dig Group
Please see the file COPYING for license information.




## Features



Here are some of the major features of ht://Dig. They are in no particular order.



* Intranet searching

ht://Dig has the ability to search through many servers on a network by acting as a WWW browser.

* It is free

The whole system is released under the GNU General Public License.

* Robot exclusion is supported

The Standard for Robot Exclusion is supported by ht://Dig.

* Boolean expression searching

Searches can be arbitrarily complex using Boolean expressions.

* Configurable search results

The output of a search can easily be tailored to your needs by means of providing HTML templates.

* Fuzzy searching

Searches can be performed using various configurable algorithms. Currently the following algorithms are supported (in any combination):
    
  • exact
  • soundex
  • metaphone
  • common word endings (stemming)
  • synonyms
  • accent stripping
  • substring and prefix


* Searching of HTML and text files

Both HTML documents and plain text files can be searched. Searching of other file types will be supported in future versions.

* Keywords can be added to HTML documents

Any number of keywords can be added to HTML documents which will not show up when the document is viewed. This is used to make a document more likely to be found and also to make it appear higher in the list of matches.

* Email notification of expired documents

Special meta information can be added to HTML documents which can be used to notify the maintainer of those documents at a certain time. It is handy to get reminded when to remove the "New" images from a certain page, for example.

* A protected server can be indexed

ht://Dig can be told to use a specific username and password when it retrieves documents. This can be used to index a server or parts of a server that are protected by a username and password.

* Searches on subsections of the database

It is easy to set up a search which only returns documents whose URL matches a certain pattern. This becomes very useful for people who want to make their own data searchable without having to use a separate search engine or database.

* Full source code included

The search engine comes with full source code. The whole system is released under the terms and conditions of the GNU General Public License version 2.

* The depth of the search can be limited

Instead of limiting the search to a set of machines, it can also be restricted to documents that are a certain number of "mouse-clicks" away from the start document.

* Full support for the ISO-Latin-1 character set

Both SGML entities like '&agrave;' and ISO-Latin-1 characters can be indexed and searched.
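
The depth limit described under "The depth of the search can be limited" can be pictured as a breadth-first walk over the link graph that stops following links once a page is a given number of "mouse-clicks" from the start document. The sketch below is only an illustration in Python, not ht://Dig's own code (ht://Dig is written in C++). The function name `urls_within_depth`, the `links` dictionary and the example URLs are all hypothetical.

```python
from collections import deque


def urls_within_depth(links, start_url, max_hops):
    """Return {url: hops} for every page reachable within max_hops clicks.

    `links` is a hypothetical in-memory link graph: it maps each URL to
    the list of URLs that page links to. A real indexer would fetch and
    parse each page instead of looking it up in a dictionary.
    """
    if max_hops < 0:
        raise ValueError("max_hops must be zero or greater")

    hops = {start_url: 0}          # start document is zero clicks away
    queue = deque([start_url])

    while queue:
        url = queue.popleft()
        if hops[url] == max_hops:
            continue               # at the limit: do not follow its links
        for target in links.get(url, []):
            if target not in hops:  # first visit is the shortest path in BFS
                hops[target] = hops[url] + 1
                queue.append(target)
    return hops


if __name__ == "__main__":
    # Hypothetical site layout, used only for illustration.
    example_links = {
        "http://www.example.com/": ["http://www.example.com/a.html",
                                    "http://www.example.com/b.html"],
        "http://www.example.com/a.html": ["http://www.example.com/deep.html"],
        "http://www.example.com/deep.html": ["http://www.example.com/deeper.html"],
    }
    for url, distance in urls_within_depth(
            example_links, "http://www.example.com/", 2).items():
        print(distance, url)
```

Breadth-first order is used here because the first time a page is reached is then also the smallest number of clicks needed to reach it, which is the quantity the limit applies to.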





## Requirements to build ht://Dig



ht://Dig was developed under Unix using C++.

For this reason, you will need a Unix machine, a C compiler and a C++ compiler. (The C compiler is needed to compile some of the GNU libraries.)

Unfortunately the developers only have access to a couple of different Unix machines. Most development is done on Linux systems with gcc/g++, but ht://Dig has been tested on these machines (and compilers):

There are reports of ht://Dig working on a number of other platforms. If you've compiled ht://Dig on a platform that is not listed here, please let the developers know via the htdig-dev mailing list.

### libstdc++



If you plan on using g++ to compile ht://Dig, you have to make sure that libstdc++ has been installed. For older versions of gcc/g++ (2.8.X and prior), libstdc++ is a separate package from gcc/g++. You can get libstdc++ from the GNU software archive.



### GNU-style (Berkeley) 'make'



The build relies heavily on the make program. The problem with this is that not all make programs are the same. The requirement for the make program is that it understands the 'include' statement as in

include somefile


The Berkeley 4.4 make program supplied in many BSD-style systems doesn't use this syntax; instead it wants

.include "somefile"

and hence it cannot be used to build ht://Dig. Many systems include a "gmake" command that will work instead.



If your make program doesn't understand the right 'include' syntax, it is best to get and install GNU make before you try to compile everything. The alternative is to change all the Makefiles.
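
Before starting a build it can help to check which make command on the system is actually GNU make. The following is a rough Python sketch, not part of the ht://Dig distribution; the helper names `looks_like_gnu_make` and `find_gnu_make` are hypothetical. It relies on the fact that GNU make prints a banner beginning with "GNU Make" when run with `--version`, and it assumes that other make programs either reject that option or print something else.

```python
import shutil
import subprocess


def looks_like_gnu_make(version_output):
    """Return True if the text looks like GNU make's --version banner."""
    return version_output.lstrip().startswith("GNU Make")


def find_gnu_make(candidates=("gmake", "make")):
    """Return the path of the first candidate that is GNU make, or None."""
    for name in candidates:
        path = shutil.which(name)      # None if the command is not on PATH
        if path is None:
            continue
        try:
            result = subprocess.run(
                [path, "--version"],
                capture_output=True,
                text=True,
                timeout=10,
            )
        except (OSError, subprocess.TimeoutExpired):
            continue                   # could not run it; try the next one
        if result.returncode == 0 and looks_like_gnu_make(result.stdout):
            return path
    return None
```

Calling `find_gnu_make()` returns a path to use in place of plain `make`, or `None` when GNU make still has to be installed. "gmake" is tried first because, as noted above, many systems install GNU make under that name alongside their native make.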




## Disk space requirements



The search engine will require lots of disk space to store its databases. Unfortunately, there is no exact formula to compute the space requirements. It depends on the number of documents you are going to index but also on the various options you use. To give you an idea of the space requirements, here is what I have deduced from our own database size at San Diego State University.

If you keep around the wordlist database (for update digging instead of initial digging) I found that multiplying the number of documents covered by 12,000 will come pretty close to the space required.

We have about 13,000 documents:

         13,000
         12,000 x
    -----------
    156,000,000

or about 150 MB.

Without the wordlist database, the factor drops down to about 7,500:

        13,000
         7,500 x
    ----------
    97,500,000

or about 93 MB.
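
The two rules of thumb above can be wrapped in a small calculator for other collection sizes. This is a minimal Python sketch that uses only the per-document factors quoted in the text (12,000 bytes with the wordlist database, 7,500 without). The function and constant names are hypothetical, and the result is only as accurate as those rough factors.

```python
# Rough per-document factors taken from the text above (bytes per document).
BYTES_PER_DOC_WITH_WORDLIST = 12_000
BYTES_PER_DOC_WITHOUT_WORDLIST = 7_500


def estimate_db_bytes(num_documents, keep_wordlist=True):
    """Estimate database size in bytes from the number of documents."""
    if num_documents < 0:
        raise ValueError("num_documents must be zero or greater")
    if keep_wordlist:
        factor = BYTES_PER_DOC_WITH_WORDLIST
    else:
        factor = BYTES_PER_DOC_WITHOUT_WORDLIST
    return num_documents * factor


def to_megabytes(num_bytes):
    """Convert bytes to megabytes, using 1 MB = 1024 * 1024 bytes."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    for keep in (True, False):
        size = estimate_db_bytes(13_000, keep_wordlist=keep)
        print(f"keep_wordlist={keep}: {size:,} bytes, "
              f"about {to_megabytes(size):.0f} MB")
```

Both factors were deduced from a single site, so it is worth re-measuring them against your own databases after a first dig.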

Keep in mind that we keep at most 50,000 bytes of each document. This may seem a lot, but most documents aren't very big and it gives us a big enough chunk to almost always show an excerpt of the matches.

You may find that if you store most of each document, the databases are almost the same size, or even larger than the documents themselves! Remember that if you're storing a significant portion of each document (say 50,000 bytes as above), you have that requirement, plus the size of the word database and all the additional information about each document (size, URL, date, etc.) required for searching.

Last modified: $Date: 2002/01/28 03:56:10 $