Documentation
REGEXFILTER:  V1.1


System Requirements

  • Perl 5
  • Ability to run cgi scripts outside of your cgi-bin
  • Telnet is nice but not required

Preliminaries

  • Determine the path to Perl 5 on your web server host.  Note that some web hosting companies run both Perl 4 and Perl 5.  Make ABSOLUTELY sure you are not setting this up under Perl 4.  Ask your administrator if you are not sure.
  • Unpack the tar archive on your desktop using a program that unpacks UNIX tar archives.  If you don't have such a program, download WINZIP FREE from SHAREWARE.COM.
  • After you have unpacked the tar archive you will have a collection of folders and files on your desktop.  You now have to do some basic editing of some of these files.  Use a text editor such as WordPad, Notepad, BBEdit, SimpleText, or TeachText.  These are NOT WORD PROCESSOR DOCUMENTS; they are plain TEXT files.  Do not save them as word processor documents and do not add an extension such as .txt to their names, or they will NOT WORK.  Note that some files inside the folders may be "blank".  This is normal.

Preparing the CGI scripts

Define Path To PERL 5

The first step is to open every file that has a .cgi extension and edit line one of each script.  Each of the CGI scripts is written in Perl 5.  For your scripts to run, they must know where Perl 5 is installed on your web server.  The path to Perl 5 is given to a CGI script in the first line of the file.  In each of the CGI scripts the first line of code looks something like this:

#!/usr/bin/perl

If the path to Perl 5 on your web server is different from /usr/bin/perl, you must edit the first line of each CGI script to reflect the correct path.  If the path is the same, no changes are necessary.  If you do not know the path to Perl 5, ask the webmaster or system administrator at your server site.
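
If you are unsure which interpreter a given first line actually starts, a small test script can report it.  The following is a minimal sketch and is not part of the REGEXFILTER distribution; the sub name perl_info is made up for illustration.  It relies only on the built-in Perl variables $^X (path of the running interpreter) and $] (numeric version).  Under Perl 4 a script like this would most likely fail to compile at all, which is itself a useful signal.

```perl
#!/usr/bin/perl
# Hypothetical helper script: reports the interpreter path and version.
use strict;
use warnings;

# Returns the path of the running interpreter and its numeric version.
sub perl_info {
    return ($^X, $]);
}

my ($path, $version) = perl_info();

# The header line lets the script also be run from a browser as a CGI.
print "Content-type: text/plain\n\n";
print "Interpreter: $path\n";
print "Version:     $version\n";
print $version >= 5 ? "OK: Perl 5 or later\n" : "Problem: this is not Perl 5\n";
```

Upload it next to filter.cgi with the same first line you intend to use, then run it the same way you plan to run filter.cgi.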

Configure the .cgi files

filter.cgi

This is the only file whose settings you need to configure.  Set the following variables:

  • @validsuffix = (htm,html,shtml);
  • $rooturl="http://www.yourdomain.com/path/to/directory/to/spider/from/";
  • $pathroot = "/full/path/to/root/html/directory/of/domain";
  • $urltopath = "http://www.yourdomain.com";
  • $spiderfile = "/full/path/to/files.txt";
  • $urllistfile = "/full/path/to/urls.txt";
  • @validsuffix is a comma-separated list of the file types you want the script to filter.  In this configuration (htm,html,shtml), page1.htm, page2.html, and page2.shtml would all be targeted by the CGI script.  Another example, (htm,html,shtml,txt), would also filter anything.txt.
  • $rooturl is the URL of the directory that contains filter.cgi, which is also the directory the spider starts from.
  • $pathroot is the FULL PATH to the ROOT HTML directory of your domain.  Note that it is NOT the FULL PATH to the INSTALLATION directory.
  • $urltopath is just your root URL.  DO NOT END THIS WITH A SLASH (/)!
  • $spiderfile is the full path to files.txt.  After the script runs files.txt will contain a list of the paths to each file spidered.
  • $urllistfile is the full path to urls.txt.  After the script runs urls.txt will contain a list of all the urls hit by the spider.

Upload Your Edited Files and Setup for Execution

  • Upload all of the files into the directory you want to spider from.  The spider will descend into every subdirectory below this point until it has visited every file.
  • Chmod filter.cgi to 755.  Chmod urls.txt and files.txt to 666 or 777.
  • If you want to run this from the browser you may have to set the permissions of your HTML files or other spider targets to 666 or 777.  This is not really convenient from FTP but is extremely simple to do by telnet.  If you don't have telnet access, try using our virtual telnet program.  To chmod the files in a directory and in its subdirectories, use the "recursive" chmod command.  Examples follow below:
    • chmod -R -f 777 *htm*
    • chmod -R -f 666 *htm*
    • chmod -R -f 777 *
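  • Note that * is a shell wildcard.  Using a pattern such as *htm* instead of a bare * keeps you from accidentally changing the permissions of your CGI scripts.  The shell expands the pattern in the current directory only, so with *htm* the -R option descends only into directories whose own names match the pattern.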
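
If you need to reach matching files in every subdirectory regardless of directory names, a short Perl script can do the walk.  The following is a minimal sketch, not part of the distribution: the sub name chmod_by_suffix is hypothetical, the mode 0666 and the suffix list simply mirror the values used in this document, and it uses the standard File::Find module.

```perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: walks $top and sets $mode on every plain file
# whose name ends in one of @suffixes.  Returns the paths it changed.
sub chmod_by_suffix {
    my ($top, $mode, @suffixes) = @_;
    my @changed;
    find(sub {
        return unless -f $_;              # skip directories and the like
        for my $suffix (@suffixes) {
            if (/\.\Q$suffix\E$/) {
                # chmod returns the number of files it changed.
                push @changed, $File::Find::name if chmod $mode, $_;
                last;
            }
        }
    }, $top);
    return @changed;
}

# Runs only when a starting directory is given on the command line.
if (@ARGV) {
    my @changed = chmod_by_suffix($ARGV[0], 0666, qw(htm html shtml));
    print "$_\n" for @changed;
}
```

Saved under a name of your choice, it would be run from telnet with the starting directory as its only argument.  Files with other suffixes, including .cgi, are left untouched.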

Execute the CGI Script

  • To execute from the browser, simply type in the URL to filter.cgi and press "enter".
  • To execute from virtual telnet or real telnet, change to the directory that contains filter.cgi and run it with perl:
    • cd /path/to/directory/of/filter.cgi; perl filter.cgi

Modifying the CGI Script for Other Applications

  • Before executing this script it is HIGHLY ADVISED that you back up your site, just in case you do something stupid accidentally!
  • The default setup searches out and deletes all the useless tags Microsoft FrontPage generates in your HTML.  To perform a different search and replace, or search and delete, edit the filter.cgi subroutine called "nukemicrosoft".  Find the lines inside filter.cgi that look like this:
    • $lines=~ s/<!--msnavigation-->//g;
  • These are the lines that do the actual search & replace.  There are 2 of them in the default version, and you can add more if you want.
  • The format is simply $lines =~ s/SEEK/REPLACEWITH/g;
  • Each $lines command you add will substitute every occurrence of SEEK with REPLACEWITH.  SEEK is a Perl regular expression, so characters such as . * ? ( ) [ ] and / must be preceded by a backslash to match literally.
  • If you want to simply DELETE phrases, just make REPLACEWITH blank, like so: $lines =~ s/SEEK//g;
  • Here are 2 additional examples.  The first one replaces all instances of an old phone number (111-111-1111) with a new one (222-222-2222).  The second example deletes all instances of the phrase "brand new".
    • $lines=~ s/111-111-1111/222-222-2222/g;
    • $lines=~ s/brand new//g;
  • One final note: you can run this from a crontab file if you publish on a regular basis.
  • Another application is to rotate keywords on your site, like so:
    • Set up, for example, 3 installs of the script (filter1.cgi, filter2.cgi, filter3.cgi), each with its own list files (files1.txt, files2.txt, files3.txt, etc.).
    • Set up the crontab file to run each version at a different time (say filter1.cgi on Monday, filter2.cgi on Wednesday, etc.).
    • That way each time those search engine spiders hit your site you will get different results and weightings.
  • These are just a few ideas of how you can use this powerful script.  The rest is up to your imagination and creativity!
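
Before editing filter.cgi, the s/SEEK/REPLACEWITH/g format described in this section can be tried out on a sample string.  The following is a minimal sketch under stated assumptions: the sub apply_filters is a hypothetical stand-in, not the "nukemicrosoft" subroutine itself, and the copyright example with \Q...\E is an added illustration of matching text that contains regular expression characters.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the substitution lines in filter.cgi.
sub apply_filters {
    my ($lines) = @_;

    # Delete a FrontPage tag (pattern taken from this document).
    $lines =~ s/<!--msnavigation-->//g;

    # Replace an old phone number with a new one.
    $lines =~ s/111-111-1111/222-222-2222/g;

    # Delete every instance of a phrase.
    $lines =~ s/brand new//g;

    # Added illustration: \Q...\E makes the parentheses match literally.
    my $old = '(c) 1999';
    my $new = '(c) 2000';
    $lines =~ s/\Q$old\E/$new/g;

    return $lines;
}

my $sample = '<!--msnavigation--><p>Call 111-111-1111 (c) 1999</p>';
print apply_filters($sample), "\n";
```

Note that deleting a phrase leaves the surrounding spaces in place, so "a brand new car" becomes "a  car" with two spaces.  Include a leading or trailing space in SEEK if that matters to you.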