Web site crawler and link checker (free)

In a previous post I provided a utility called LinkChecker, a web site crawler and link checker. The idea behind LinkChecker is that you can include it in your continuous integration scripts and check your web site either regularly or after every deployment. Unlike a simple ping check, this one will fail if you’ve broken any links within your site or have SEO issues. It will also fail just once for each site change and then pass again the next time you run it. This means that in a continuous integration system like TeamCity you can get an email or other alert each time your site (or perhaps your competitor’s site) changes.

As promised in that post, a new version is now available. There are many improvements under the covers, but one obvious new feature is the ability to dump all the text content of a site into a text file. Simply append -dump filename.txt to the command line and you’ll get a complete text dump of any site. The dump includes page titles and all visible text on the page (it excludes embedded script and CSS automatically). It also excludes any element with an ID or CLASS that includes one of the words “footer”, “header”, “sidebar” or “feedback”, so you don’t get lots of duplicate header and footer information in the dump. I plan to make this more extensible in future so that other words can be added to the ignore list.
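To make the filtering rule concrete, here is a minimal Python sketch of the same idea. It is not LinkCheck's actual code (LinkCheck is a .NET tool). The names `TextDumper` and `dump_text` are made up for illustration, and the sketch assumes reasonably well-formed markup in which every non-void element is closed.

```python
from html.parser import HTMLParser

# Ignore words taken from the post; matching is a case-insensitive substring test.
IGNORE_WORDS = ("footer", "header", "sidebar", "feedback")
# Embedded script and CSS are never visible text.
SKIP_TAGS = {"script", "style"}
# Void elements have no closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextDumper(HTMLParser):
    """Hypothetical helper: collects text outside ignored elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self.skip_depth = 0  # greater than 0 while inside an ignored element

    def _ignored(self, tag, attrs):
        if tag in SKIP_TAGS:
            return True
        for name, value in attrs:
            if name in ("id", "class") and value:
                if any(word in value.lower() for word in IGNORE_WORDS):
                    return True
        return False

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # Once inside an ignored element, count every nested element too,
        # so the matching end tags bring the depth back to zero.
        if self.skip_depth or self._ignored(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.chunks.append(text)


def dump_text(html):
    """Return the text of one page (title included), one chunk per line."""
    parser = TextDumper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Calling `dump_text(page_html)` on a page returns its title and body text while dropping anything inside, say, `<div id="page-header">` or `<div class="sidebar">`. Making the ignore list configurable would just mean passing the words in rather than hard-coding them.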

One technique you can use with this new ‘dump’ option is to dump a copy of your site after each deployment and then check it into source control. Now if there’s ever any need to go back and see when a particular word or paragraph was changed on your site, you have a complete record. You could, for example, use this to maintain a text copy of your WordPress blog, or perhaps to keep an eye on someone else’s blog or Facebook page to see when they added or removed a particular story.
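As a rough sketch of that workflow, the Python script below runs the checker with the dump option and commits the result to a Git repository. Only the `-dump filename.txt` option comes from this post. The executable name, the idea that the site URL is passed as the first argument, the choice of Git and the function names are all assumptions for illustration, so check the original article for the real command line.

```python
import subprocess
from pathlib import Path


def build_dump_command(exe, site_url, dump_file):
    """Build the command line for a dump run.

    "-dump <file>" is from the post; passing the URL as the first
    argument is an assumption, not documented behaviour.
    """
    return [exe, site_url, "-dump", str(dump_file)]


def snapshot_site(exe, site_url, repo_dir, dump_name="site-dump.txt"):
    """Dump the site into an existing Git working copy and commit any change.

    Returns True if a new commit was made, False if the dump was unchanged.
    """
    repo = Path(repo_dir)
    dump_file = repo / dump_name

    # The checker's exit code is not checked here because its meaning
    # is not described in the post.
    subprocess.run(build_dump_command(exe, site_url, dump_file))

    subprocess.run(["git", "add", dump_name], cwd=repo, check=True)

    # "git diff --cached --quiet" exits with 0 when nothing is staged.
    staged = subprocess.run(["git", "diff", "--cached", "--quiet"], cwd=repo)
    if staged.returncode == 0:
        return False

    subprocess.run(
        ["git", "commit", "-m", f"Site text dump for {site_url}"],
        cwd=repo,
        check=True,
    )
    return True
```

Run from a deployment script, a call such as `snapshot_site("LinkCheck.exe", "http://example.com", "C:/site-history")` (all three values are placeholders) would leave one commit per content change, so the source control history shows when any given word or paragraph changed.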

Download the new version here: LinkCheck (requires Windows XP or later with .NET 4 installed; unzip and run)

Please consult the original article for more information.

LinkCheck is free, doesn’t make any call backs and doesn’t use any personal data. Use it at your own risk. If you like it, please link to this blog from your own blog or post a link on Twitter. Thanks!

Fri Jan 14 2011 07:16:20 GMT-0800 (Pacific Standard Time)
