Continuous Link and SEO Testing – Announcing LinkCheck2
Sep 21st
First there was Continuous Integration, then there was Continuous Deployment, now there’s Continuous Testing.
Testing can (and should) be integrated throughout your web site development process: automated unit testing on developers' machines, automated unit testing during continuous integration builds, and further automated testing after your continuous deployment process has pushed the site to a server.
Sadly, once deployed, most sites get only a cursory test through a service like Montastic that pings one or more URLs on your site to check that it is still alive.
BUT, how do you know if your site is still working from a user's perspective or from an SEO perspective? Serious bugs can creep in from seemingly small changes that aren't in code but in the markup of a site, and these are often not caught by any of the aforementioned tests. For example, a designer editing the HTML markup for your site could accidentally break the sign-up link on the main entry page, or the page you had carefully crafted to be SEO-optimized around a specific set of keywords could accidentally lose one of those words, lose rank in search engines, and see its traffic go down. Would you even know if this had happened?
Based on a small test I ran on some local startup web sites, the answer appears to be 'no'. These sites often had broken links and poorly crafted titles (from an SEO perspective). Of course they could have used any of the many SEO services that will check a site for broken links or poorly crafted titles and descriptions (e.g. seomoz.com), but that's often a manual process and there's no way to hook such tests into your existing continuous integration process.
It would be nice if you could include a 'Continuous Link and SEO test' on your Continuous Integration server. This test could be triggered after each deployment, and it could also run as a scheduled task, say every hour, to check that your web site is up and that all public pages are behaving correctly from a links and SEO perspective. It would also be nice to get a quick report after each deployment confirming what actually changed on the site: pages added, pages removed, links added, links removed.
This is what my latest utility 'LinkCheck2' does. It's a Windows command-line application that produces a report, and it will set an error code if it finds anything amiss. You can run it from the command line for a one-off report or call it from your continuous integration server. The error code can be used by most CI servers to send you an alert. If you are using the changes feature you'll get an alert when something changes, and on the next run it will automatically clear.
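For illustration, a build step that fails when LinkChecker.exe reports a problem might look like the minimal batch sketch below; the URL argument is an assumed placeholder (the actual command-line options aren't covered in this post), but the ERRORLEVEL handling is standard for any console tool:

REM Hypothetical build step: the URL argument is an assumed placeholder
LinkChecker.exe http://www.example.com/
IF ERRORLEVEL 1 (
    ECHO Link or SEO check failed - failing the build
    EXIT /B 1
)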
LinkCheck2 also includes the ability to define a ‘link contract’ on your site. This is a meta tag you add to a page to say ‘this page must link to these other pages’. LinkCheck2 will verify that this contract has been met and that none of your critical site links have been dropped by accident when someone was editing the markup.
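The exact meta tag format isn't shown here, but conceptually a link contract might be declared something like the snippet below; the tag name and content syntax are purely illustrative assumptions, not the actual LinkCheck2 syntax:

<!-- Illustrative only: declares that this page must always link to /signup and /pricing -->
<meta name="link-contract" content="/signup, /pricing" />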
At the moment LinkCheck2 checks all links and performs a small number of SEO tests (mostly around the length of titles). If there is interest in this tool I may expand the SEO capabilities; please send me your feedback and requests.
Use of LinkChecker.exe is subject to a license agreement; in a nutshell, commercial use is permitted, redistribution is not. Please contact me for details.
How to stop IIS7 from handling 404 errors so you can handle them in ASP.NET
Apr 17th
IIS7 has lots of places you could look to make this change: you might start by checking whether it's an advanced option on your application pool (it isn't), then try the web site itself and the option .NET Error Pages. That has to be it, surely! So you try every option there: Mode=On, Mode=Off, Mode=Remote Only. Nothing works, so you consult the help for those items, only to learn that "Mode" is to "Select a mode for the error pages: On, Off, or Remote Only." You can see now why help writers at Microsoft are so well paid: who would have guessed that Mode=Remote Only sets the Mode to Remote Only!
Now you are really frustrated, but luckily you landed on my blog post, where you learn that the true path to 404 happiness is a simple change to your web.config:
<system.webServer>
  <httpErrors errorMode="Detailed" />
</system.webServer>
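With IIS no longer intercepting the response, the ASP.NET side of the handling is commonly wired up with a customErrors section; this is a general-purpose sketch rather than part of the original fix, and the redirect target is a placeholder:

<system.web>
  <customErrors mode="On">
    <!-- ~/NotFound.aspx is a placeholder page; substitute your own 404 handler -->
    <error statusCode="404" redirect="~/NotFound.aspx" />
  </customErrors>
</system.web>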
A simple web crawler in C# using HtmlAgilityPack
Mar 10th
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Net;
using HtmlAgilityPack;

namespace LinkChecker.WebSpider
{
    /// <summary>
    /// A result encapsulating the Url and the HtmlDocument
    /// </summary>
    public abstract class WebPage
    {
        public Uri Url { get; set; }

        /// <summary>
        /// Get every WebPage.Internal on a web site (or part of a web site), visiting each internal link just once,
        /// plus every external page (or other Url) linked from the web site as a WebPage.External
        /// </summary>
        /// <remarks>
        /// Use .OfType&lt;WebPage.Internal&gt;() to get just the internal pages if that's what you want
        /// </remarks>
        public static IEnumerable<WebPage> GetAllPagesUnder(Uri urlRoot)
        {
            var queue = new Queue<Uri>();
            var allSiteUrls = new HashSet<Uri>();

            queue.Enqueue(urlRoot);
            allSiteUrls.Add(urlRoot);

            while (queue.Count > 0)
            {
                Uri url = queue.Dequeue();
                HttpWebRequest oReq = (HttpWebRequest)WebRequest.Create(url);
                oReq.UserAgent = @"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5";

                using (HttpWebResponse resp = (HttpWebResponse)oReq.GetResponse())
                {
                    // Non-HTML responses (images, PDFs, ...) are simply skipped
                    if (!resp.ContentType.StartsWith("text/html", StringComparison.InvariantCultureIgnoreCase))
                        continue;

                    WebPage result;
                    HtmlDocument doc = new HtmlDocument();
                    try
                    {
                        var resultStream = resp.GetResponseStream();
                        doc.Load(resultStream);                                 // The HtmlAgilityPack
                        result = new Internal() { Url = url, HtmlDocument = doc };
                    }
                    catch (System.Net.WebException ex)
                    {
                        result = new WebPage.Error() { Url = url, Exception = ex };
                    }
                    catch (Exception ex)
                    {
                        ex.Data.Add("Url", url);                                // Annotate the exception with the Url
                        throw;
                    }

                    // Hand off the page (an Internal page on success, an Error otherwise)
                    yield return result;

                    // And now queue up all the links on this page
                    // (SelectNodes returns null when there are no matches)
                    var linkNodes = doc.DocumentNode.SelectNodes(@"//a[@href]");
                    if (linkNodes == null) continue;
                    foreach (HtmlNode link in linkNodes)
                    {
                        HtmlAttribute att = link.Attributes["href"];
                        if (att == null) continue;
                        string href = att.Value;
                        if (href.StartsWith("javascript", StringComparison.InvariantCultureIgnoreCase))
                            continue;                                           // ignore javascript on buttons using a tags

                        Uri urlNext = new Uri(href, UriKind.RelativeOrAbsolute);

                        // Make it absolute if it's relative
                        if (!urlNext.IsAbsoluteUri)
                        {
                            urlNext = new Uri(urlRoot, urlNext);
                        }

                        if (!allSiteUrls.Contains(urlNext))
                        {
                            allSiteUrls.Add(urlNext);                           // keep track of every page we've handed off
                            if (urlRoot.IsBaseOf(urlNext))
                            {
                                queue.Enqueue(urlNext);
                            }
                            else
                            {
                                yield return new WebPage.External() { Url = urlNext };
                            }
                        }
                    }
                }
            }
        }

        ///// <summary>
        ///// In the future might provide all the images too??
        ///// </summary>
        //public class Image : WebPage
        //{
        //}

        /// <summary>
        /// Error loading page
        /// </summary>
        public class Error : WebPage
        {
            public int HttpResult { get; set; }
            public Exception Exception { get; set; }
        }

        /// <summary>
        /// External page - not followed
        /// </summary>
        /// <remarks>
        /// No body - go load it yourself
        /// </remarks>
        public class External : WebPage
        {
        }

        /// <summary>
        /// Internal page
        /// </summary>
        public class Internal : WebPage
        {
            /// <summary>
            /// For internal pages we load the document for you
            /// </summary>
            public virtual HtmlDocument HtmlDocument { get; internal set; }
        }
    }
}
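As a quick usage sketch (not part of the original post), you could enumerate a site from a console app and report external links and any pages that failed to load; the root URL below is a placeholder:

using System;
using LinkChecker.WebSpider;

class Program
{
    static void Main()
    {
        // Placeholder root URL - substitute the site you want to crawl
        Uri root = new Uri("http://www.example.com/");

        foreach (WebPage page in WebPage.GetAllPagesUnder(root))
        {
            if (page is WebPage.Error)
            {
                var error = (WebPage.Error)page;
                Console.WriteLine("ERROR  {0}: {1}", error.Url, error.Exception.Message);
            }
            else if (page is WebPage.External)
            {
                Console.WriteLine("EXTERN {0}", page.Url);   // linked to, but not crawled
            }
            else if (page is WebPage.Internal)
            {
                // Internal pages arrive with their HtmlDocument already loaded,
                // so simple SEO-style checks (e.g. title length) are easy to add
                var internalPage = (WebPage.Internal)page;
                var titleNode = internalPage.HtmlDocument.DocumentNode.SelectSingleNode("//title");
                int titleLength = titleNode == null ? 0 : titleNode.InnerText.Trim().Length;
                Console.WriteLine("OK     {0} (title length {1})", internalPage.Url, titleLength);
            }
        }
    }
}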
Shaving seconds off page load times
May 19th