Problems with indexing
I'd really appreciate a hand with this script.
After working through the instructions, all the correspondence in the mail
list archives, and other sources, I am still getting this when starting the
indexer:
Checking for old temp files...
Building string of special characters...
Loading 'no index' regular expressions:
 - frontpage2.html
 - frontpage.html
 [etc.]
Loading stopwords...371 stopwords loaded.
Starting crawler...
Note: I will not visit more than $HTTP_MAX_PAGES=150 pages.
Loading http://www.quinacrine.com/robots.txt...
Error: Couldn't get 'http://www.quinacrine.com/robots.txt': response code 500
Not using any robots.txt.
Error: Couldn't get 'http://www.quinacrine.com/index.html': response code 500
Crawler finished: indexed 0 files, 0 terms (0 different terms).
Ignored 0 files because of conf/no_index.txt
Ignored 0 files because of robots.txt
I thought the structure of the site might be the problem. The pages are not
in the root but in the 'web' directory, like so:
root
.config
.sessions
cgi-bin
logs
web
In cgi-bin I have these:
searchsite
conf
data
Perlfect
temp
templates
I installed manually by necessity and need to index through http for the
same reason. All syntax, permissions, and other rules that I can find
check out. It is a Unix server.
The main sections of config.pl look like this now:
$DOCUMENT_ROOT = 'http://www.quinacrine.com/';

# The base url of your site (normally that's the URL which
# corresponds to $DOCUMENT_ROOT).
$BASE_URL = 'http://www.quinacrine.com';

# The url in which Perlfect Search is located (usually somewhere in cgi-bin/).
$CGIBIN = "/cgi-bin/searchsite/";

# The full-path of the directory where Perlfect Search is installed.
$INSTALL_DIR = '/nfs/cust/5/80/46/564085/cgi-bin/searchsite/';

# Only files with these extensions should be indexed (case-sensitive).
# This is only relevant for file system indexing, when you index files via
# http you need to set @HTTP_CONTENT_TYPES instead. [re-index]
@EXT = ("html", "htm", "shtml", "txt");

[Password section]

###########################################################################
### http configuration
### You only need this if you want to index your pages via http

# Where you want the indexer to start via http. Leave empty if
# you want to index the files in the filesystem ($DOCUMENT_ROOT).
# ** WARNING **: Do not use for foreign servers! It might use too many
# resources on other people's servers. [re-index]
# example: $HTTP_START_URL = 'http://localhost/';
$HTTP_START_URL = 'http://www.quinacrine.com/index.html';

Thinking that the file structure could be the issue, I put a copy of
robots.txt in the root, but still got the 500 response. I've also tried
leaving $HTTP_START_URL blank and setting it to 'http://www.quinacrine.com/',
along with anything else I could think of to break this jam.
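
One check that might narrow this down is fetching the same two URLs outside the indexer. What follows is only a sketch: check_urls.pl and summarize() are made-up names, not part of Perlfect Search, and it assumes Perl 5.14 or later, where HTTP::Tiny ships with the core. HTTP::Tiny reports its own failures (DNS, connection, timeout) as status 599, so a 500 printed here would have been sent by the web server rather than produced on the client side.

```perl
#!/usr/bin/perl
# check_urls.pl: hypothetical helper, NOT part of Perlfect Search.
# Assumes Perl 5.14+ (HTTP::Tiny is a core module from that version on).
use strict;
use warnings;
use HTTP::Tiny;

# Turn an HTTP::Tiny response hashref into a one-line summary.
# HTTP::Tiny uses status 599 for failures it detects itself, so any
# other status code was actually returned by the web server.
sub summarize {
    my ($url, $res) = @_;
    my $origin = $res->{status} == 599 ? 'client-side failure' : 'sent by server';
    return sprintf '%s: %s %s (%s)',
        $url, $res->{status}, $res->{reason} // '', $origin;
}

# Fetch every URL given on the command line and print its summary.
my $http = HTTP::Tiny->new(timeout => 15);
for my $url (@ARGV) {
    print summarize($url, $http->get($url)), "\n";
}
```

It would be run from the server's shell as: perl check_urls.pl http://www.quinacrine.com/robots.txt http://www.quinacrine.com/index.html
The indexer may use a different HTTP client and send different request headers, so a clean result here would not by itself rule out a problem on the server side.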
Thanks for your help in advance,
Roger Growe