Web Applications

analpage.pl Page Analyser

This builds directly on the previous application, Get Page (getpage.pl).

Here we go on to develop potential uses for the returned source code...

The key operation here is to parse the HTML so that the Perl program can access each element. Once the HTML has been represented as a data structure that the program can interpret, almost anything can be done with it.

Note! As this is still a demo it will not have the sophisticated code required to read invalid HTML. Making a program read correctly written valid code is easy, but if the code has errors then the program will need to anticipate the errors. There is only one way to code correctly but an almost infinite number of possible errors! The program could be made to trap some errors, but this requires additional and at times complex code which is well beyond the scope of this example application.

So for now we assume valid (x)HTML only, otherwise GIGO rules apply!

The basic program function is discussed below; the full source is in analpage.pl, with the shared subroutines in dwd.pm.

Program Function

Perl has some very sophisticated pattern-matching routines, and using these the required tags can be picked out. This is where HTML validity issues creep in: if the code contains errors, these pattern matches will probably fail.

It should also be noted that valid HTML code can be written in a variety of 'styles': whitespace usage can vary considerably, attributes may come in odd orders, and so on. Provided the source code matches the required HTML specification, any application can reasonably be expected to read it correctly.

This application is really intended for XHTML 1.0. Earlier HTML specs allowed mixed-case elements and attributes, bareword attributes, unquoted values, different closing tag rules and more. Nevertheless the application should be able to make sense of these specs and certainly manage to extract the main values.

The application relies upon a single subroutine, parsehtml(), which will be reused and so has been added to dwd.pm. This takes an array reference and a string containing the HTML, parses the HTML and pushes the elements into the array.

All HTML code takes the following form:
[text<tag>text<tag>text<tag>text<tag>text]
where 'text' can be an empty string, i.e. where two elements or tags are adjacent. Note also that there is an implicit 'text' element at the start and end of the snippet.

By reading the HTML in this way we can ensure that the resulting array has some special properties. The first value, array[0], is the implied text before the first tag; array[1] is the first tag itself (for a valid page this should be the DOCTYPE line). From there onwards the HTML elements occupy the odd-numbered positions in the array and the 'text' between them occupies the even-numbered positions. Every text position is assumed to exist, with "" as its implicit value.
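The layout described above can be illustrated with a hand-built array. This is only a sketch: the array contents are made up for the example, and the helper names tags_of() and text_of() are hypothetical and do not come from dwd.pm.

```perl
use strict;
use warnings;

# Hand-built example of the array layout (not real parser output).
my @page = (
    '',                 # [0] implied text before the first tag
    '<!DOCTYPE html>',  # [1] first tag (a short stand-in for a DOCTYPE line)
    "\n",               # [2] text
    '<p>',              # [3] tag
    'Hello',            # [4] text
    '</p>',             # [5] tag
    '',                 # [6] implied text after the last tag
);

# Hypothetical helpers: pick out the odd (tag) or even (text) positions.
sub tags_of {
    my ($aref) = @_;
    return map { $aref->[$_] } grep { $_ % 2 } 0 .. $#$aref;
}

sub text_of {
    my ($aref) = @_;
    return map { $aref->[$_] } grep { !($_ % 2) } 0 .. $#$aref;
}

print join('|', tags_of(\@page)), "\n";
```

Because the positions alternate strictly, no type flag is needed on each entry: the index alone says whether it is a tag or text.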

The pattern match is based entirely on locating the < and > delimiters of each element, and on two assumptions: that elements never contain angle brackets, and that angle brackets never appear in the text. Well-written markup escapes literal angle brackets in text as &lt; and &gt;; if the code does not, the interpretation is likely to be wrong. If this were part of a validation application then the mismatch could itself be detected and serve to warn of an error in the code.
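The mismatch detection mentioned above might look something like the following. This is a sketch only: stray_brackets() is a hypothetical helper, not part of dwd.pm, and it assumes the alternating text/tag array layout already described.

```perl
use strict;
use warnings;

# Hypothetical check: return the indices of slots that break the
# "no angle brackets inside tags or text" assumption.
sub stray_brackets {
    my ($aref) = @_;
    my @bad;
    for my $i (0 .. $#$aref) {
        my $item = $aref->[$i];
        next if $item =~ /\A<!--/;    # whole comments may contain anything
        if ($i % 2) {
            # Tag slot: a '<' anywhere after the first character
            # suggests an unescaped '<' earlier in the text.
            push @bad, $i if $item =~ /.</s;
        }
        else {
            # Text slot: any '>' here was not part of a tag.
            push @bad, $i if $item =~ />/;
        }
    }
    return @bad;
}

my @bad = stray_brackets(['', '<p>', 'a > b', '</p>', '']);
print "suspect slots: @bad\n";
```

A real validator would want to report line numbers as well, but even this crude test flags the most common case of an unescaped bracket in body text.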

The pattern match is as follows: /(<[^>]+>)([^<]*)/
This looks for a pair of angle brackets, with any number of 'not right angle brackets' between them, followed by any number of 'not left angle brackets'. When such a pair is located it is extracted and stored in an array.
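As a minimal sketch of how this pattern could drive the parse, the loop below applies it repeatedly with the /g modifier. The name split_html() is a hypothetical stand-in; the real parsehtml() in dwd.pm may be organised differently, and comment handling is left out here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parsehtml(): fills @$out with
# text, tag, text, tag, ... text.
sub split_html {
    my ($out, $html) = @_;
    @$out = ();
    # Everything before the first '<' is the implied leading text ("" if none).
    my ($lead) = $html =~ /\A([^<]*)/;
    push @$out, $lead;
    # Each match yields one tag and the text that follows it.
    while ($html =~ /(<[^>]+>)([^<]*)/g) {
        push @$out, $1, $2;
    }
    return scalar @$out;
}

my @parts;
split_html(\@parts, '<p>Hello <b>world</b></p>');
print join('|', map { "[$_]" } @parts), "\n";
```

Since [^<]* is happy to match nothing, adjacent tags still produce an empty text entry between them, which is what keeps tags on the odd positions.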

HTML comments pose a minor problem: they are the only elements whose delimiters are not each enclosed in a pair of angle brackets. The start tag, <!--, has no closing bracket and the end tag, -->, has no opening one, with the comment text in between.

Rather than try some very complex pattern matching the solution is to make use of Perl's very fast pattern-matching and substitution and temporarily 'repair' the tags, <!-- becomes <!--> and --> becomes <-->. As each of these tags is written to the array it is changed back to its correct value.
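The temporary repair could be sketched as two substitutions plus a small restore step. Both subroutine names are hypothetical and the code in dwd.pm may differ. Note one detail that is an observation about this sketch rather than something stated above: the order of the substitutions matters.

```perl
use strict;
use warnings;

# Turn comment delimiters into bracket-balanced pseudo-tags.
sub repair_comments {
    my ($html) = @_;
    # Replace the end marker first: the repaired start marker '<!-->'
    # itself contains '-->' and would otherwise be rewritten twice.
    $html =~ s/-->/<-->/g;
    $html =~ s/<!--/<!-->/g;
    return $html;
}

# Undo the repair on a single tag as it is stored in the array.
sub restore_tag {
    my ($tag) = @_;
    return '<!--' if $tag eq '<!-->';
    return '-->'  if $tag eq '<-->';
    return $tag;
}

my $fixed = repair_comments('<p>a</p><!-- note <b>x</b> --><p>b</p>');
print "$fixed\n";
```

After the repair every delimiter looks like an ordinary tag to the main pattern match, so no special case is needed inside the parsing loop itself.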

Once the array has been built a second pass over it locates the commented text and enclosed HTML tags and reassembles the entire comment into one tag, and re-writes the array. Now the array contains just 'live' HTML, text and comments as single HTML elements which can now be parsed or ignored as appropriate.

text<!--text<tag>text<!--text<tag>text<tag>text-->text<tag>text-->text

If comments are nested (as above) then the comment will be read as ending at the first occurrence of the end tag -->. This is the correct interpretation; if HTML comments are nested in this way then they are just plain wrong. GIGO again!
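The second pass might be sketched as follows, assuming the array already holds the restored <!-- and --> markers as separate entries. merge_comments() is a hypothetical name; the actual implementation in dwd.pm is not shown here and may differ.

```perl
use strict;
use warnings;

# Hypothetical second pass: fold each '<!--' ... '-->' run into one element.
sub merge_comments {
    my ($aref) = @_;
    my @out;
    my $i = 0;
    while ($i <= $#$aref) {
        if ($aref->[$i] eq '<!--') {
            # Find the first end marker; nested start markers are swallowed.
            my $j = $i;
            $j++ while $j < $#$aref && $aref->[$j] ne '-->';
            push @out, join('', @{$aref}[$i .. $j]);
            $i = $j + 1;
        }
        else {
            push @out, $aref->[$i++];
        }
    }
    @$aref = @out;
    return;
}

my @parts = ('t', '<!--', 'a', '<b>', 'c', '<!--', 'd', '-->', 'e', '-->', 'f');
merge_comments(\@parts);
print join('|', @parts), "\n";
```

Both markers sit on odd positions, so an odd number of entries is always replaced by one and the text/tag alternation survives. With nested comments the leftover --> simply stays in the array as a stray tag. In this sketch an unterminated comment swallows everything to the end of the array.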

Once the program has completed it displays the array elements. A second subroutine in the dwd.pm module, parseattrib(), takes each HTML element in turn and breaks it down into its individual attributes. These are written to a referenced hash with the attribute name as the hash key and its value as the hash value. The HTML element type is also written to the hash with the key 'element'.
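A rough sketch of such an attribute splitter is given below. split_attributes() is a hypothetical stand-in for parseattrib(), and several choices are assumptions made for the example: names are lower-cased, bare attributes get an empty string as their value, and declarations or comments are skipped.

```perl
use strict;
use warnings;

# Hypothetical stand-in for parseattrib(): fills %$out from one tag string.
sub split_attributes {
    my ($out, $tag) = @_;
    %$out = ();
    # Element name: first name after '<' and an optional '/'.
    my ($name) = $tag =~ m{\A</?\s*([A-Za-z][\w:-]*)};
    return 0 unless defined $name;    # DOCTYPE, comments, etc.
    $out->{element} = lc $name;

    # Strip the element name and the closing '>' or '/>'.
    (my $rest = $tag) =~ s{\A</?\s*[A-Za-z][\w:-]*}{};
    $rest =~ s{\s*/?>\z}{};

    # Attributes: name="value", name='value', name=value, or a bare name.
    while ($rest =~ /([A-Za-z_:][\w:.-]*)(?:\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?/g) {
        my $value = defined $2 ? $2
                  : defined $3 ? $3
                  : defined $4 ? $4
                  :              '';
        $out->{lc $1} = $value;
    }
    return 1;
}

my %attr;
split_attributes(\%attr, '<a href="index.html" class=nav>');
print "$_ => $attr{$_}\n" for sort keys %attr;
```

Accepting single-quoted and unquoted values costs very little and lets the same routine cope with the looser styles of older HTML.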

Line feeds are replaced with a literal \n so they can be seen in the final output; however, this is strictly for display and forms no part of the core processing.
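For display, a one-line substitution on a copy of the value is enough. show_linefeeds() is a hypothetical helper name used only for this sketch.

```perl
use strict;
use warnings;

# Make line feeds visible for display only; the stored value is untouched.
sub show_linefeeds {
    my ($text) = @_;
    (my $shown = $text) =~ s/\n/\\n/g;
    return $shown;
}

print show_linefeeds("line one\nline two\n"), "\n";
```

Working on a copy keeps the array contents exactly as parsed, so later processing still sees the real line feeds.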

This very simple data structure and the two subroutines to create it are an ideal first step towards any form of HTML code validation program, code tidier, tree-generator, etc, or any other application that needs to extract the HTML elements and attributes. We have now stored all of the HTML in such a way that we can easily extract all of the values we require using these subroutines.

This will be developed further in the next application examples where we will focus on a specific task and build a website spider with the ultimate goal of generating an automated site map, a useful feature of any large website, with many other uses too!
