thesitemapper

Documentation


thesitemapper creates an HTML site map, which displays page titles, descriptions and urls and can be used as an index page for a web site, and an XML site map, which can be submitted to search engines to help them identify new and updated pages.

The application can be set up to automatically crawl a number of web sites, create an XML site map and an HTML site map for each one, and then ftp them to the required location on the appropriate web server. An in-built scheduler enables the crawler to start at predefined times for each site.

General principles

Each site is set up independently, and you can set up as many sites as you wish. Each site can be marked as active; when the crawler completes for an active site, an HTML site map and an XML site map are generated automatically. You may also regenerate the site maps manually at any time without re-crawling.

Clicking on the 'Crawl All sites ...' button will crawl every site that has been marked as active. A newly created site is always marked as active.

Clicking on the 'Start crawl' button associated with a particular site will only crawl that site.

Once crawling has completed, the results can be automatically ftp'd to the destination server and the search engines pinged to notify them of the new XML map.

There are a number of formats to choose from for the html page displays. These include single-column, 2-column, multi-page A to Z and various combinations of those. You may also create your own template web page to match the look of your site - the results can then be automatically inserted into that template.

Web settings page


web settings

  • Create XML Site Map - tick to automatically create an XML Site Map when the crawler is run.
  • Create HTML Site Map - tick to automatically create an HTML Site Map when the crawler is run.
  • Use robots.txt file - tick to make the crawler obey the robots.txt file at the root of your web site.
  • When a new site is created, all three boxes above are ticked by default.
  • Site name - any appropriate name.
  • Root Site URL - identifies the domain of the site and is usually something like http://www.mysite.com
  • Root Page URL - identifies the page from where the crawling will start and will be something like http://www.mysite.com/default.asp or http://www.mysite.com/pages/default.htm. The crawler will start at this page and crawl all folders below.
  • Exclude directories - list of the folders that you want to exclude - this will exclude all child folders as well. Enter one per line in the form /foldername/ to exclude all files from the foldername folder.
  • Advanced settings button - enter the urls of files which you wish to exclude from the results. These urls will not appear in the HTML or XML displays, but they will still be crawled, as will any sub-folders linked from those files.
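
For example, to keep an images folder and a private admin area out of the crawl (both folder names here are hypothetical), the Exclude directories box would contain one entry per line:

```
/images/
/admin/
```

Both entries also exclude all of their child folders.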

Google Analytics report

When you tick the 'Check Google Analytics' box on the Web Settings page, the page report identifies whether Google Analytics is installed on each page.

This is designed to help you configure Google Analytics, whether you are using the older urchin.js code or have recently upgraded to the newer ga.js tracking code. The diagnostic identifies which pages on your web site have the GA tracking code properly installed, making it easy to isolate the pages with tracking problems, fix them, and manage your Google Analytics installation effectively.
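
For reference, the two generations of Google Analytics tracking code look like the snippets below (UA-XXXXXXX-X is a placeholder for your own account id). The older urchin.js form:

```html
<script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
</script>
<script type="text/javascript">
_uacct = "UA-XXXXXXX-X";
urchinTracker();
</script>
```

and the newer ga.js form:

```html
<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-XXXXXXX-X");
pageTracker._trackPageview();
</script>
```

Pages carrying either form should be reported as having Google Analytics installed.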

Crawl settings page


crawl settings

This page sets up various crawling parameters.

  • Excluded file extensions - A list of file extensions that you want to exclude from the crawl, e.g. "inc txt". Separate entries with a space, and note that you do NOT enter the dots.
  • Crawling interval (millisecs) - A time interval between pages to reduce the load on the server.
  • Map output folder - By default the map files are created in the same folder as thesitemapper.exe. Select a different destination folder here if required.
  • HTML Site Map filename - Enter a suitable filename for the HTML site map. By default the filename is HTMLSiteMap_ followed by the record id of the site followed by .html
  • XML Site Map filename - Enter a suitable filename for the XML site map. By default the filename is XMLSiteMap_ followed by the record id of the site followed by .xml
  • Case sensitive - Tick to make the crawl case sensitive which is usually required on Linux type servers where file names are case sensitive.

XML site map page


xml site map settings

For a complete description of the meaning of these settings, refer to http://www.sitemaps.org/protocol.php

You may set the Change Frequency, File last modified and Priority.
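
As described in the sitemaps.org protocol, those three settings appear in the generated file as the changefreq, lastmod and priority elements of each url entry. A minimal example (the url and date are illustrative only):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.mysite.com/default.asp</loc>
    <lastmod>2009-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```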

HTML Site map page


html site map settings

  • First choose the elements that you want to be displayed. The title will always be displayed, with the option of choosing the page description, url and page extract.
  • Page description is the contents of a meta tag in the head of the web page. By default this is the description meta tag but you can change this to any meta tag name that you wish.
  • The title is the title of the web page. For those pages which do not have a title, and for Word and pdf documents, this will be the url of the page.
  • The URL is simply the full url of the page.
  • The page extract is a portion of the page text, up to 255 characters long.
  • For documents of type .doc, .xls, .pdf and .rtf the page title is always the url of the page and the description is the initial text of the document.

Layout:

  • Default Vertical List - Can have categories or no categories
  • 2-column vertical table - ALWAYS displays categories
  • Multi-page A-Z - produces multiple pages, one for each letter.

Fonts, Colors, etc button:

You may select the formatting of each page element by clicking on the Fonts, Colors etc button. You may either enter a CSS style name - which will need to exist in a style sheet for it to render correctly - or you may select fixed fonts, sizes and colors for the elements.

Other formatting button:

This button displays a set of options which may be used to alter other formatting definitions such as table cell padding, table cell spacing and so on.

When you create a new site, the format settings for fonts, colors etc are pre-defined to give a standard-looking display.

Folder Alias button:

When you crawl the web site, the folder names are extracted and stored with each url. The folder names may then be displayed on the html site map to categorize the entries. However, the folder names are not always appropriate for display, and the Folder Alias button allows you to enter a different name to appear on the html site map.

HTML Templates:

If you wish to use your own web page layout in the form of an HTML page, enter the following at the point in the template file where you want the html site map to be displayed:

<!-- THESITEMAPPER -->

Then enter the file name into the "Template file" text box. When the html site map is created, it will be inserted at that point in the template file.
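
A skeleton template might look like this (the header and footer comments stand in for your own site's markup):

```html
<html>
<head>
<title>Site Map</title>
</head>
<body>
<!-- your site's header and navigation -->
<!-- THESITEMAPPER -->
<!-- your site's footer -->
</body>
</html>
```

The generated site map replaces the THESITEMAPPER marker; everything else in the template is left untouched.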

FTP Settings page


ftp settings

When a site is crawled, you can set the application to automatically ftp the results to your web server on completion.

First enter your FTP settings: FTP Host, Username and Password. The XML site map will be ftp'd to the root of the web site.

Automatically FTP site maps when created - Tick this box to automatically ftp all the site maps when they are created.

Notify (ping) search URLs on completion - Tick this box so that the search sites are automatically pinged when the site maps are created and after they have been ftp'd to your site.

Set up Ping URLs page

More and more search engines are using the XML site map method, and this form enables you to add new urls yourself - just add the root url to the list.

Enter the remote path for the HTML site map on the server - this is the folder that you want the site map to be ftp'd to.

FTP XML Site Map and FTP HTML Site Map - These allow you to manually ftp the generated site maps to your web server - useful when you want to test the ftp system.

Ping Search URLs - This allows you to manually ping the search engines.
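
As an illustration of the form such a ping takes, Google's ping endpoint historically accepted a single GET request naming the site map (the map filename below is hypothetical, and the site map url should be url-encoded):

```
http://www.google.com/ping?sitemap=http%3A%2F%2Fwww.mysite.com%2FXMLSiteMap_1.xml
```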

Scheduler page


scheduler

Ticking the 'Enable for this site' box enables the scheduler. Choose the days on which it should run and the time it should start.

You may also use the Windows Scheduler to start the crawl, using the command line as described in the 'Command Line use' section below.

Results page

The results page is a simple display of the XML site map and also provides validation of it.

Excluding text from the crawler

If you wish to exclude text from the crawler, such as menus, footers or other non-relevant information, enclose it in the following comments:

<!-- exclude_start -->
   text to be excluded
<!-- exclude_end -->
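
The effect of these markers can be sketched as follows (this is purely an illustration of the exclusion rule in Python, not the application's actual code):

```python
import re

# Everything between an exclude_start and the next exclude_end comment
# is stripped before the page text is used.
EXCLUDE = re.compile(
    r"<!--\s*exclude_start\s*-->.*?<!--\s*exclude_end\s*-->",
    re.DOTALL)

def strip_excluded(html):
    """Remove every region wrapped in exclude_start/exclude_end comments."""
    return EXCLUDE.sub("", html)

page = ("<p>Welcome</p>"
        "<!-- exclude_start --><ul><li>Home</li><li>About</li></ul>"
        "<!-- exclude_end -->"
        "<p>Main content</p>")
print(strip_excluded(page))  # the menu list is removed
```

Only the text between the two comments is affected; the rest of the page is crawled as normal.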

Command Line use

You may run the application from the Windows command line using:

thesitemapper.exe

To start the crawl, use:

thesitemapper.exe crawl

which will cause all sites to be crawled and all indexes to be created.

Putting this into the Windows Scheduler will allow you to run the application at defined times without using the in-built scheduler system.
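
For example, a scheduled task that crawls every night at 2am could be created with the schtasks command (the installation path and task name below are hypothetical, and the exact option syntax varies slightly between Windows versions):

```
schtasks /create /tn "thesitemapper crawl" /tr "\"C:\Program Files\thesitemapper\thesitemapper.exe\" crawl" /sc daily /st 02:00
```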

Find out more:

The full registered version costs US$30.

To go to the purchase page, click here.

A trial version is available by clicking here.

Enquiries: If you have any questions about the product, go to the contacts page by clicking here.