Web Applications

sitemap.pl Site Mapper

The previous application, »Analyse Page, fetched the specified page and parsed its HTML source into a data structure suitable for further processing or analysis.

Whereas the previous program only displayed the processed data as a proof of principle, this one uses that data structure to find the links within each page and then fetches and processes the linked pages too: a true spider.

The ultimate goal of this utility is twofold. Firstly, it generates a map of the site, a handy navigation feature for many websites. Secondly, it can generate a list of all possible URLs or pages; this data can be very useful to other website-management utilities and to error handling.

The basic program function is discussed below, refer to the links in the sidebar for the source code or click here to »run the program...

Program Function

Initially the program presents a form and invites the user to enter a website URL, ideally the top-level index page for the site, and to select a display option. The program then uses the subroutines developed in the previous example (»Page Analyser) to locate all of the links within the page and add them to an array, then goes on to fetch and parse each of the pages linked from within the specified site. Once this process has completed, the resulting site map or list of URLs is displayed according to the chosen option.
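The loop at the heart of such a spider can be sketched as a simple work queue. This is a minimal sketch only: the crawl() subroutine and the in-memory %site hash are illustrative stand-ins for the real fetch-and-parse subroutines in dwd.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A tiny in-memory "site" stands in for real HTTP fetches: each URL
# maps to the list of links found on that page.
my %site = (
    '/index.shtml' => [ '/about.shtml', '/links.shtml' ],
    '/about.shtml' => [ '/index.shtml' ],
    '/links.shtml' => [ '/about.shtml', '/missing.shtml' ],
);

sub crawl {
    my ($start) = @_;
    my @queue = ($start);
    my %seen  = ( $start => 1 );
    my @found;
    while ( my $url = shift @queue ) {
        push @found, $url;
        my $links = $site{$url} or next;   # a real spider fetches and parses here
        for my $link (@$links) {
            next if $seen{$link}++;        # each URL is queued only once
            push @queue, $link;
        }
    }
    return @found;
}

my @pages = crawl('/index.shtml');
print join( "\n", @pages ), "\n";
```

The %seen hash is what keeps the scan finite: a page is queued the first time it is linked to and never again, however many pages link back to it.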

The program relies upon the subroutines demonstrated previously; most of the remaining program function is straightforward Perl, with a few additional subroutines added to the dwd.pm module as required.

The program ignores anchors within a page: where a link points to a position within a page, the anchor reference is dropped and just the plain page URL is retained. This avoids recording multiple links to different parts of the same page.
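Dropping the anchor reference amounts to a single substitution; the strip_anchor() helper below is an illustrative sketch, not necessarily the name used in dwd.pm:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Remove the '#anchor' part of a URL so that links to different points
# within the same page all collapse to one entry.
sub strip_anchor {
    my ($url) = @_;
    $url =~ s/#.*\z//s;    # delete '#' and everything after it
    return $url;
}

print strip_anchor('/webapps/sitemap.shtml#options'), "\n";   # /webapps/sitemap.shtml
```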

The scan only covers the URLs within the specified website; there is no attempt to fetch or scan external pages. Without this critical behaviour the program would, in principle at least, scan the entire Internet if left to run indefinitely!
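The internal/external test can be sketched as below, assuming the site is identified by a base URL prefix such as http://web.dandylife.org; the is_internal() name is illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Treat relative URLs as internal; treat absolute URLs as internal only
# if they begin with the site's base URL.
sub is_internal {
    my ($url, $base) = @_;
    return 1 if $url !~ m{^[a-z]+://}i;            # relative URL: internal
    return $url =~ m{^\Q$base\E(?:/|$)} ? 1 : 0;   # same site?
}

my $base = 'http://web.dandylife.org';
print is_internal('/index.shtml',        $base), "\n";   # 1
print is_internal('http://example.com/', $base), "\n";   # 0
```

The `(?:/|$)` at the end of the match prevents a lookalike host such as `http://web.dandylife.org.example.com/` from passing as internal.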

If the page does not appear to contain any links then it is re-parsed, this time looking at the meta tags to see if there is a redirection. Following meta redirections is something that search engines rarely do; if the program needs to do this, the chances are that the pages it finds will not have been found by many search engines! It should also be noted that, apart from that special function, the program otherwise sees your site in much the same way that any search-engine robot will: if this program does not find your pages (JavaScript-generated links, perhaps) then neither will the search engines. If the number of pages listed is less than expected, this may be your problem.

The program will display each URL as a link itself unless that link could not be fetched. Bad links are displayed as plain text with a prepended ! to indicate the failure.

Indirectly this program also serves as a link-checker, another very handy tool for any website!

There are a number of safeguards, and the behaviour differs depending on whether or not the program is run against this website or an external one. External calls are limited to a set number of pages, mainly so as not to waste this website's bandwidth scanning someone else's website!

Program Options

The program options are as follows:
Scan this website... This checkbox will, if ticked, override any specified URL and automatically scan the index page for this website: http://web.dandylife.org/index.shtml

Don't scan CGI programs... If this checkbox is ticked any resources identified as CGI will not be scanned thus reducing the load on the server. CGI resources are identified according to the following criteria:
1: If the URL contains /cgi-bin/
2: If the URL contains any of the characters: & ? ;
3: If the filename has any of the following extensions: pl pm cgi asp php vb aspx exe bat cmd sh tcl
This is all handled within the program subroutine is_cgi(); if you decide to implement this program then you may need to modify it if your website setup is significantly different to this one.
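The three criteria translate directly into pattern matches. The sketch below is illustrative; the version in sitemap.pl may differ in detail:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decide whether a URL looks like a CGI resource, following the three
# criteria above.
sub is_cgi {
    my ($url) = @_;
    return 1 if $url =~ m{/cgi-bin/};   # 1: path contains /cgi-bin/
    return 1 if $url =~ /[&?;]/;        # 2: contains an argument character
    return 1 if $url =~ /\.(?:pl|pm|cgi|asp|php|vb|aspx|exe|bat|cmd|sh|tcl)$/i;
                                        # 3: has a CGI-like file extension
    return 0;
}

print is_cgi('/cgi-bin/sitemap.pl'), "\n";   # 1
print is_cgi('/index.shtml'),        "\n";   # 0
```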

Ignore CGI program arguments... If ticked, this will truncate all CGI URL arguments, making different calls to the same program effectively the same. This can reduce the load on the server; however, if the website relies heavily on CGI programs for its main content this may lead to an incomplete scan. Nevertheless this option is set by default; untick it if you want a thorough scan.
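Truncating the arguments is again a single substitution; the strip_args() name is illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Cut a CGI URL back to the bare program name so that different calls
# to the same program compare as equal.
sub strip_args {
    my ($url) = @_;
    $url =~ s/[?;&].*\z//s;   # truncate at the first argument separator
    return $url;
}

print strip_args('/cgi-bin/sitemap.pl?url=index.shtml'), "\n";   # /cgi-bin/sitemap.pl
```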

Generate site map for display... This is the default display mode, a full map is generated of the links and displayed in a hierarchical fashion, their indentation gives a clue as to how the pages relate to their parents.

Generate site map as SSI (and display)... This is the same as the above option in terms of its functionality; however, the information is also written to an SSI file so that it can be preserved and included within the site map page for this website. Safeguards and checks within the program ensure that the file is only written if all pages (including any CGI programs with their arguments) were scanned, and of course that it is for this website!
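The guarded write can be sketched as follows. The write_ssi() name and the $complete flag are assumptions for illustration, and a temporary file stands in for the real SSI path:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Only produce the SSI file when the scan covered every page; a partial
# site map is worse than keeping the previous one.
sub write_ssi {
    my ($file, $complete, @lines) = @_;
    return 0 unless $complete;               # refuse a partial site map
    open my $fh, '>', $file or return 0;
    print {$fh} "$_\n" for @lines;
    close $fh;
    return 1;
}

my ($tfh, $tmp) = tempfile();                # stand-in for the real SSI path
close $tfh;
write_ssi($tmp, 1, '<ul>', '<li><a href="/index.shtml">Home</a></li>', '</ul>');
```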

Generate list of all links from specified site... This display option generates a plain list of all links found within the specified website and includes links to external sites.

Generate list of all pages within specified site... This generates a plain list of the URLs of all pages within the specified site, effectively a list of all pages.

Program Availability

You are welcome to take a copy of this program and use it within your own website provided that such use is entirely non-commercial. The design and structure should allow you to easily modify the output to match your own website, see the CGI Scripting pages for more details on how to do this.

You will need to copy three files: sitemap.pl, dwd.pm and dwd.conf. Open each of these and copy-paste directly to a text file.

You can modify the program to run in any way that you want, or pick apart the key routines to write your own version. In addition, the core subroutines used by this program form the basis of many other useful website utilities; in particular, a means to customise your 404 (page not found) errors, which is the subject of the next article and CGI application: dandy404.pl.
