Dandy Web Design : Techdocs: Web Applications: getpage.pl...

Source Code: getpage.pl source; analpage.pl source; sitemap.pl source; dandy404.pl source; dwd.pm source; dwd.conf source

Web Applications

getpage.pl

This very simple application initially displays a form into which you are invited to enter a URL. On submission, the program will GET the page from the requested URL and display its source code. If it cannot do this, it returns an error message.

Once the source code of a page has been fetched, it can be parsed in many ways. The most obvious use is as a spider: all of the links to other pages are located and added to a list of pages to be scanned.
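As a rough illustration of the spidering idea, the sketch below pulls the link targets out of a fetched page and collects them into a list. The function name extract_links() is hypothetical and is not part of getpage.pl or dwd.pm. A regular expression is used only to keep the example free of dependencies; a real spider would more likely use a proper HTML parser and would also resolve relative links against the page URL.

```perl
use strict;
use warnings;

# Hypothetical helper: collect href values from anchor tags in a page.
# The regex only handles quoted attributes and may also match links
# that appear inside comments or scripts.
sub extract_links {
    my ($html) = @_;
    my %seen;
    my @links;
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        my $href = $2;
        next if $href eq '' || $href =~ /^#/;      # skip empty and in-page links
        push @links, $href unless $seen{$href}++;  # keep each link once
    }
    return @links;
}

my $page = '<a href="/about.html">About</a> <a href="#top">Top</a> '
         . q{<A HREF='/contact.html'>Contact</A> <a href="/about.html">Again</a>};
my @queue = extract_links($page);
print "$_\n" for @queue;    # prints /about.html then /contact.html
```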

Strictly speaking, this is not a useful application as it stands. However, it is a precursor to many other programs that need to fetch pages across the 'net. The page source code can also be passed to an analyser, perhaps to check the code or to build keyword lists; the possibilities are endless.

The basic program function is discussed below; refer to the source code links for the full listing.

Program Function

The program has a single output page, which contains HTML for a form, an error message and the requested page's source code. Each of these parts is populated or left blank depending on how the program was called and on the result of the fetch.

Functionally the program is very simple. It creates a 'user agent', uses it to fetch the requested page, and then performs a few checks on the result. If any of the following conditions is not met, an error message is generated:

That the page can be fetched at all...
That the return URL matches the request URL
That the content type is text/html or text/xhtml
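The three checks above can be sketched as a small routine that works on the outcome of a fetch. This is only an illustration: the name check_fetch_result() and the hash keys it expects are assumptions made for this example, and the real checks in dwd.pm may be written quite differently.

```perl
use strict;
use warnings;

# Hypothetical check routine: returns an error message, or '' if all
# three conditions hold. The hash keys are assumptions for this sketch.
sub check_fetch_result {
    my (%r) = @_;

    # 1. The page could be fetched at all.
    return "Could not fetch $r{request_url}" unless $r{success};

    # 2. The URL that answered is the URL that was asked for.
    return "Request was redirected to $r{final_url}"
        unless $r{final_url} eq $r{request_url};

    # 3. The media type is acceptable; any "; charset=..." part is ignored.
    my ($type) = lc($r{content_type} // '') =~ m{^\s*([^;\s]+)};
    return 'Content type is ' . ($type // 'missing') . ', not HTML'
        unless defined $type
            && ($type eq 'text/html' || $type eq 'text/xhtml');

    return '';
}

my $error = check_fetch_result(
    success      => 1,
    request_url  => 'http://www.example.com/',
    final_url    => 'http://www.example.com/',
    content_type => 'text/html; charset=UTF-8',
);
print $error eq '' ? "OK\n" : "Error: $error\n";
```

The URL comparison here is a plain string match, so a request for a URL without its trailing slash would count as a mismatch; whether that strictness is wanted depends on the application.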

The user agent also requires a few pre-set values for its 'name', 'return email' and 'timeout'. These are set in the dwd.conf file and read in as external configuration. Once the user agent has completed its task, it returns the HTML for the requested page to the main program. It also returns true or false according to the outcome, so that the calling routine can react accordingly.

All of the user agent functionality has been written as a single subroutine, useragent_fetchpage(), which is in the shared module dwd.pm (see the source code links).
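To make the calling convention concrete, here is a minimal sketch of a fetch routine that returns a true/false flag together with either the page or an error message. It is not the code from dwd.pm. It uses HTTP::Tiny simply because that module ships with Perl; the original may well use a different user agent library. The %conf hash and its key names are hypothetical stand-ins for whatever dwd.conf actually provides.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical stand-in for the values read from dwd.conf.
my %conf = (
    agent_name => 'ExampleFetcher/0.1',
    from_email => 'webmaster@example.com',
    timeout    => 15,
);

# Sketch of a fetch routine. Returns (1, $html) on success or
# (0, $error_message) on failure.
sub useragent_fetchpage {
    my ($url, $conf) = @_;

    my $ua = HTTP::Tiny->new(
        agent           => $conf->{agent_name},
        timeout         => $conf->{timeout},
        default_headers => { From => $conf->{from_email} },
    );

    # HTTP::Tiny reports transport problems (bad URL, timeout, refused
    # connection) as a response with status 599 rather than dying.
    my $res = $ua->get($url);
    return (0, "Fetch failed: $res->{status} $res->{reason}")
        unless $res->{success};

    return (1, $res->{content});
}
```

A caller would write something like `my ($ok, $text) = useragent_fetchpage($url, \%conf);` and then treat `$text` as page source when `$ok` is true and as an error message otherwise.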

The page content is then processed by a simple function, dehtml_string() (also in dwd.pm), which converts all HTML reserved characters (& < > ") into their corresponding entities (&amp; &lt; &gt; &quot;) and replaces all the line feeds with <br /> so that the source can be displayed cleanly.
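A conversion of this kind might look like the sketch below. It reuses the name dehtml_string() for readability, but it is written from the description above, not copied from dwd.pm, so details may differ. In particular, keeping a line feed after each <br /> is an assumption made so that the generated markup stays readable.

```perl
use strict;
use warnings;

# Sketch of an escaping routine; the real dehtml_string() may differ.
sub dehtml_string {
    my ($text) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );

    # One pass over the string, so the '&' in a freshly made entity
    # is never escaped a second time.
    $text =~ s/([&<>"])/$entity{$1}/g;

    # Normalise CRLF and bare CR to LF, then put a break tag before each
    # LF (the LF itself is kept: an assumption of this sketch).
    $text =~ s/\r\n?/\n/g;
    $text =~ s/\n/<br \/>\n/g;

    return $text;
}

print dehtml_string(qq{<p class="x">Tom & Jerry</p>\n});
# prints: &lt;p class=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/p&gt;<br />
```

Doing the substitution in a single pass matters: escaping & after the other characters, as a separate step, would turn &lt; into &amp;lt;.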

As previously mentioned, this program is a precursor to more sophisticated and useful applications, the most obvious of which is some form of spider.

In the next application we will get it to 'pull' the page apart and examine the HTML code to extract the values which we may be interested in.
