Documentation
REGEXFILTER: V1.1
System Requirements
- Perl 5
- Ability to run cgi scripts outside of your
cgi-bin
- Telnet is nice but not required
Preliminaries
- Determine the path to Perl 5 on your web
server host. Note that some web hosting companies run both Perl 4 and Perl 5.
Make ABSOLUTELY sure you are not setting this up under Perl 4. Ask your
administrator if you are not sure.
- Unpack the tar archive on your desktop using a
program that unpacks UNIX TAR ARCHIVES. If you don't have such a program then download
WINZIP FREE from SHAREWARE.COM.
- After you have unpacked the TAR archive you
will have a collection of folders and files on your desktop. Now you have to do some
basic editing of some of these files. Use a text editor such as wordpad, notepad,
BBEdit, simpletext, or teachtext to edit them. These are NOT WORD PROCESSOR DOCUMENTS.
They are just simple TEXT files, so don't save them as word processor documents and
don't let the editor add an extension such as .txt to the file name, or they will NOT WORK.
Note that there may be some files inside of folders which are "blank".
This is normal.
Preparing the CGI scripts
Define Path To PERL 5
The first step is to open each and every
file that has a .cgi extension and edit the first line of each script. Each of the
cgi scripts is written in perl 5. For your scripts to run they must know where perl 5 is
installed on your web server. The path to perl 5 is given to a cgi script in the first
line of the file. In each of the cgi scripts the first line of code looks something like
this:
#!/usr/bin/perl
If the path to perl 5 on your web server is
different from /usr/bin/perl you must edit the first line of each cgi script to reflect
the correct path. If the path to perl 5 is the same no changes are necessary. If you do
not know the path to perl 5 ask the webmaster or system administrator at your server site.
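If you want to confirm what a given path actually points at, a tiny test script can report it. The sketch below is not part of the REGEXFILTER archive; the helper name is_perl5 is made up for illustration. Put the path you want to test on the first line, upload the script, chmod it to 755, and open it in the browser or run it from telnet.

```perl
#!/usr/bin/perl
# Hypothetical test script (not part of the archive): reports which
# interpreter the first line resolved to and what version it is.
use strict;

# True when the version number passed in belongs to the Perl 5 series.
sub is_perl5 {
    my ($version) = @_;
    return int($version) == 5 ? 1 : 0;
}

# A CGI script must send a header before any other output.
print "Content-type: text/plain\n\n";

# $^X is the name the interpreter was started as.
# $] is its version as a number, e.g. 5.008008 for Perl 5.8.8.
print "Interpreter: $^X\n";
print "Version: $]\n";
print is_perl5($]) ? "OK: this path runs Perl 5\n"
                   : "WARNING: this path does not run Perl 5\n";
```

A version number starting with 5 means the path on the first line is the one to use in the other cgi scripts.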
Configure the .cgi files
filter.cgi
This is the only file you need to edit.
@validsuffix = (htm,html,shtml);
$rooturl = "http://www.yourdomain.com/path/to/directory/to/spider/from/";
$pathroot = "/full/path/to/root/html/directory/of/domain";
$urltopath = "http://www.yourdomain.com";
$spiderfile = "/full/path/to/files.txt";
$urllistfile = "/full/path/to/urls.txt";
@validsuffix is a comma separated
list of the file types you want the script to filter. In this configuration
(htm,html,shtml) page1.htm, page2.html, and page3.shtml would all be targeted by the cgi
script. Another example, (htm,html,shtml,txt), would also filter anything.txt.
$rooturl is the URL of the directory
that contains filter.cgi, which is the directory you want to spider from.
$pathroot is the FULL PATH to the
ROOT HTML directory of your domain. Note that it is NOT the FULL PATH to the
INSTALLATION directory.
$urltopath is just your root URL - DO
NOT END THIS WITH A SLASH (/)!
$spiderfile is the full path to
files.txt. After the script runs files.txt will contain a list of the paths to each
file spidered.
$urllistfile is the full path to
urls.txt. After the script runs urls.txt will contain a list of all the urls hit by
the spider.
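This document does not show how filter.cgi uses these settings internally, so the following is only a guess at the idea, not the script's actual code. It assumes that a file path is turned into a URL by swapping the $pathroot prefix for $urltopath, and that a file is filtered when its name ends in one of the @validsuffix entries. The helper names has_valid_suffix and path_to_url are hypothetical.

```perl
use strict;
use warnings;

# Placeholder values copied from the configuration above.
my @validsuffix = ('htm', 'html', 'shtml');
my $pathroot    = "/full/path/to/root/html/directory/of/domain";
my $urltopath   = "http://www.yourdomain.com";

# Hypothetical helper: true if the file name ends in one of the listed suffixes.
sub has_valid_suffix {
    my ($file) = @_;
    foreach my $suffix (@validsuffix) {
        return 1 if $file =~ /\.\Q$suffix\E$/;
    }
    return 0;
}

# Hypothetical helper: replace the leading $pathroot with $urltopath.
sub path_to_url {
    my ($path) = @_;
    (my $url = $path) =~ s/^\Q$pathroot\E/$urltopath/;
    return $url;
}

print path_to_url("$pathroot/products/page1.htm"), "\n";
# prints: http://www.yourdomain.com/products/page1.htm
```

Under this assumption, a trailing slash on $urltopath would produce a doubled slash (http://www.yourdomain.com//products/page1.htm), which would explain why the setting must not end with one.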
Upload Your Edited Files and Setup for Execution
- Upload all of the files into the directory you
want to spider from. The spider will descend into every subdirectory below this point
until it has hit every file.
- Chmod filter.cgi to 755, and chmod urls.txt and
files.txt to 666 or 777
- If you want to run this from the browser you
may have to set the permissions of your html files or spider targets to 666 or 777.
This is not really convenient from ftp but is extremely simple to do by telnet. If
you don't have telnet access try using our virtual telnet
program. To chmod all files from a directory and all those in its included
subdirectories use the "recursive" chmod command. Examples follow below:
- chmod -R -f 777 *htm*
- chmod -R -f 666 *htm*
- chmod -R -f 777 *
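- Note that the * is a wildcard, and a pattern
such as *htm* keeps you from accidentally frying your cgi script permissions, because
filter.cgi does not match it. The shell expands the pattern in the current directory
only, so -R reaches files in a subdirectory only when the subdirectory's own name also
matches the pattern.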
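Because the shell expands *htm* only in the current directory, you may prefer a small perl script that walks every subdirectory and changes only the matching files. The sketch below uses the standard File::Find module; the helper name chmod_html_files is made up for illustration and is not part of the archive.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: set $mode on every regular file under $dir whose name
# contains "htm" (page.htm, page.html, page.shtml). Returns how many files
# were changed. Files such as filter.cgi are left alone.
sub chmod_html_files {
    my ($dir, $mode) = @_;
    my $changed = 0;
    find(sub {
        # Inside the callback, $_ is the file name within the directory
        # currently being visited.
        return unless (-f $_) && ($_ =~ /htm/);
        $changed += chmod($mode, $_);
    }, $dir);
    return $changed;
}

# Example call, commented out so nothing changes until you edit the path:
# my $count = chmod_html_files("/full/path/to/directory/to/spider/from", 0666);
# print "Changed $count files\n";
```

Note that the mode is written with a leading zero (0666, not 666) so that perl reads it as an octal number.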
Execute the CGI Script
- To execute from the browser simply type in the
URL to filter.cgi and hit "enter"
- To execute from virtual telnet or real telnet
the command would be
- cd /path/to/filter.cgi;perl filter.cgi
Modifying the CGI Script for Other Applications
- Before executing this script it is HIGHLY
ADVISED that you back up your site just in case you do something stupid accidentally!
- The default setup searches out and
deletes all the useless tags Microsoft FrontPage generates in your HTML. To perform a
different search and replace, or search and delete, edit the filter.cgi subroutine called
"nukemicrosoft". Find the line inside of filter.cgi that looks like this:
- $lines=~ s/<!--msnavigation-->//g;
- These are the lines that do the actual search
& replace functions. You can add more lines if you want. There are 2 lines
in the default version.
- The format is simply $lines =~ s/SEEK/REPLACEWITH/g;
- Each $lines command you add will substitute
every occurrence of SEEK with REPLACEWITH
- If you want to simply DELETE phrases just make REPLACEWITH blank
like so $lines =~ s/SEEK//g;
- Here are 2 additional examples. The
first one replaces all instances of an old phone number (111-111-1111) with a new one
(222-222-2222). The second example deletes all instances of the phrase "brand
new".
- $lines=~ s/111-111-1111/222-222-2222/g;
- $lines=~ s/brand new//g;
- One final note: you can run this from a
crontab file if you publish on a regular basis
- Another application is rotating the
keywords on your site, like so:
- Set up, for example, 3 installs of the script
(filter1.cgi, filter2.cgi, filter3.cgi), each with its own files list (files1.txt,
files2.txt, files3.txt, etc.)
- Set up the crontab file to run each version
at a unique time (say filter1.cgi on Monday, filter2.cgi on Wednesday, etc.)
- That way each time those search engine
spiders hit your site you will get different results and weightings
- These are just a few ideas of how you can
use this powerful script. The rest is up to your imagination and creativity!
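To try out your own SEEK and REPLACEWITH pairs before running them against your site, you can test them on a sample string first. The sketch below is a stand-in for the body of "nukemicrosoft", not a copy of it; the name apply_filters and the sample page text are made up for illustration. It also shows one thing to watch for: SEEK is a regular expression, so characters such as . $ ( ) * ? have special meanings unless you escape them.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the body of "nukemicrosoft": takes the text of a
# page, applies each substitution in turn, and returns the result.
sub apply_filters {
    my ($lines) = @_;

    # Plain text with no special characters can be written as-is.
    $lines =~ s/<!--msnavigation-->//g;
    $lines =~ s/111-111-1111/222-222-2222/g;

    # To match special characters literally, put the phrase in a variable
    # and wrap it in \Q...\E. In REPLACEWITH, write \$ for a literal dollar sign.
    my $seek = 'Price: $5.00 (each)';
    $lines =~ s/\Q$seek\E/Price: \$6.00 (each)/g;

    return $lines;
}

my $page = '<!--msnavigation--><p>Call 111-111-1111. Price: $5.00 (each)</p>';
print apply_filters($page), "\n";
# prints: <p>Call 222-222-2222. Price: $6.00 (each)</p>
```

Once the output looks right, copy the $lines lines into "nukemicrosoft" in filter.cgi.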