Documentation
SUPER SPIDER: V1.0
Preliminary Material for Review
System Requirements
Preliminaries
- Determine the path to Perl 5 on your web
server host. Note that some web hosting companies run both Perl 4 and Perl 5.
Make ABSOLUTELY sure you are not setting this up under Perl 4. Ask your
administrator if you are not sure.
- Download the tar archive for this program and
save it to your desktop.
- Unpack the tar archive on your desktop using a
program that handles UNIX tar archives. If you don't have such a program, download
WinZip FREE from SHAREWARE.COM.
- After you have unpacked the tar archive you
will have a collection of folders and files on your desktop. Now you have to do some
basic editing of these files (or at least some of them). Use a plain-text editor
such as WordPad, Notepad, BBEdit, SimpleText, or TeachText to edit the files. These
are NOT WORD PROCESSOR DOCUMENTS, they are just simple TEXT files, so don't save them as
word processor documents and don't add extensions such as .txt or they will NOT WORK.
Note that some files inside the folders may be "blank" (empty).
This is normal.
Preparing the CGI scripts
Define Path To PERL 5
The first step is to open up each and every
file that has a .cgi extension and edit line number one of each script. Each of the
CGI scripts is written in Perl 5. For your scripts to run they must know where Perl 5 is
installed on your web server. The path to Perl 5 is given to a CGI script in the first
line of the file. In each of the CGI scripts the first line of code looks something like
this:
#!/usr/bin/perl
If the path to Perl 5 on your web server is
different from /usr/bin/perl, you must edit the first line of each CGI script to reflect
the correct path. If the path is the same, no changes are necessary. If you do not know
the path to Perl 5, running "which perl" in a shell on the server will usually reveal it,
or ask the webmaster or system administrator at your server site.
Configure the .cgi files
configure.cgi
Edit the variables inside configure.cgi:
- $url = "http://www.startingdomain.com/startpage.html";
- $spiderdepth = 10;
- $initialtarget = 0;
- $mysqldatabase = "mysql database name";
- $mysqlusername = "mysql user name";
- $mysqlpassword = "mysql password";
$url is the initial page you will point the
spider to.
$spiderdepth is the number of recursive
probes that the software will perform. Try it out with 10 before you set it to some
insane value like 1,000,000, to get an idea of how this works. The number of pages
fetched grows exponentially with depth, and you will run out of hard drive space if
you are not prepared.
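To see why the depth value blows up so fast, here is a rough worst-case estimate. This is a sketch only: the 20-links-per-page figure is an illustrative assumption, not something the spider enforces or measures.

```perl
#!/usr/bin/perl
# Rough worst-case crawl size: if every page links to $links_per_page
# new pages, a crawl of depth $depth can reach $links_per_page ** $depth
# pages. The 20-links-per-page figure is an assumption for illustration.
use strict;
use warnings;

my $links_per_page = 20;
for my $depth (1, 3, 5, 10) {
    my $pages = $links_per_page ** $depth;
    printf "depth %2d -> up to %.4g pages\n", $depth, $pages;
}
```

At depth 10 that is already over ten trillion pages in the worst case, which is why starting with a small $spiderdepth is strongly advised.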
$initialtarget is the LINK NUMBER you are
starting from. This value is only used if you stop and restart the spider, or if the spider
connects to a dead URL and dumps out a restart message. You will probably never use
this variable unless you are doing hardcore spidering.
$mysqldatabase, $mysqlusername, and $mysqlpassword are the name of your MySQL
database and the user name and password used to connect to it.
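Before uploading, it can be worth sanity-checking the values you put in configure.cgi. The snippet below is a minimal sketch: the variable names and sample values come from configure.cgi above, but the checks themselves are an illustration, not part of the shipped scripts.

```perl
#!/usr/bin/perl
# Minimal sanity checks for configure.cgi settings. Variable names match
# configure.cgi; the validation itself is illustrative, not shipped code.
use strict;
use warnings;

my $url           = "http://www.startingdomain.com/startpage.html";
my $spiderdepth   = 10;
my $initialtarget = 0;

die "\$url must start with http://\n"
    unless $url =~ m{^http://};
die "\$spiderdepth must be a positive integer\n"
    unless $spiderdepth =~ /^\d+$/ && $spiderdepth > 0;
die "\$initialtarget must be a non-negative integer\n"
    unless $initialtarget =~ /^\d+$/;

print "configure.cgi values look sane\n";
```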
Upload files and set permissions
Upload the spider.cgi and configure.cgi files
into a directory on your web server that you can execute CGI scripts from. Set permissions
for both files to 755 (for example, with "chmod 755 spider.cgi configure.cgi" from a shell).
Create database and upload your .sql file
Make sure you have set up your MySQL database
and loaded the domains.sql file into it before starting (typically something like
"mysql -u username -p databasename < domains.sql"). Back up the results if you want to
preserve your data. If you have no clue how MySQL works, you need to read up on it first.
Results
There are two tables of results: one is
the "domains" table, the other is the "targets" table. The domains
table is a list of every unique domain name the spider finds. The targets table is a
list of every unique URL the spider finds. Pretty simple.
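The split between the two tables can be illustrated with a few lines of Perl. This is a sketch of the idea only, not the spider's actual code: targets are keyed on the full URL, domains on the host name, so several targets can map back to a single domain.

```perl
#!/usr/bin/perl
# Illustration of the domains/targets split: unique full URLs versus
# unique host names. A sketch of the idea, not the spider's own code.
use strict;
use warnings;

my @found = (
    "http://www.example.com/a.html",
    "http://www.example.com/b.html",
    "http://www.other.com/",
);

my (%targets, %domains);
for my $u (@found) {
    $targets{$u} = 1;                       # every unique URL
    if ($u =~ m{^https?://([^/]+)}) {
        $domains{$1} = 1;                   # every unique domain name
    }
}

printf "targets: %d, domains: %d\n",
    scalar keys %targets, scalar keys %domains;
```

With the three hypothetical URLs above you would end up with three rows in targets but only two in domains.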